
DSP Architecture Interview Questions

DSP architecture questions come up at Texas Instruments, Qualcomm, and Samsung Semiconductor for firmware and DSP algorithm engineer roles; IT services companies rarely ask them. They appear in the second or third technical round, especially when the JD mentions TMS320, SHARC, or Hexagon DSP architecture, and often follow algorithm implementation questions to check whether candidates understand how hardware executes their code.


Interview questions & answers

Q1. What is the Harvard architecture and why do DSP processors use it instead of von Neumann?

Harvard architecture has separate memory buses for program instructions and data, allowing simultaneous instruction fetch and data access in the same clock cycle; von Neumann shares one bus and must alternate between instruction and data access. On the TMS320C6748, the CPU can fetch a 256-bit instruction packet from program memory while simultaneously reading two 32-bit operands from data memory in the same cycle, doubling effective throughput compared to a von Neumann design. DSPs exploit this parallelism for MAC-intensive algorithms — a direct-form FIR filter on the C6000 achieves one output sample per cycle because coefficient and data fetches happen in parallel with computation.
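The bandwidth argument above can be sketched with a toy cycle-count model — the per-tap access counts and the one-transaction-per-bus-per-cycle assumption are illustrative, not taken from any datasheet:

```python
# Toy cycle-count model comparing von Neumann vs Harvard memory traffic
# for an N-tap FIR inner loop. Assumes each tap needs 1 instruction
# fetch + 2 data reads, and each bus completes one access per cycle.

def fir_cycles(n_taps: int, harvard: bool) -> int:
    ifetch, dreads = 1, 2          # accesses per tap (illustrative)
    if harvard:
        # separate instruction and data buses run in parallel:
        # cycles per tap = whichever bus carries more traffic
        per_tap = max(ifetch, dreads)
    else:
        # single shared bus: all accesses serialize
        per_tap = ifetch + dreads
    return n_taps * per_tap

print(fir_cycles(64, harvard=False))  # 192
print(fir_cycles(64, harvard=True))   # 128
```

Even this crude model shows the parallel buses cutting the memory-bound cycle count by the ratio of total traffic to the busier bus.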

Follow-up: What is a modified Harvard architecture and how does it relax the strict Harvard constraint?

Q2. What is a MAC unit in a DSP and why is it the most important functional unit?

A MAC (Multiply-Accumulate) unit computes P = A×B + C in a single clock cycle without rounding or overflow between the multiply and accumulate, which is the fundamental operation in FIR convolution, IIR recursion, matrix multiplication, and FFT butterfly computation. The TMS320C5545 has a 17×17-bit MAC that produces a 34-bit result accumulated into a 40-bit accumulator, providing 6 guard bits to prevent overflow during 64-sample dot products without normalization. Without a dedicated MAC unit, a general-purpose MCU like Cortex-M0 requires separate multiply and add instructions plus an explicit accumulator register, taking 2–4 cycles instead of 1.
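A minimal sketch of the guarded accumulation, using plain Python integers in place of hardware registers (`mac_q15` is a made-up helper, not a TI intrinsic):

```python
# Q15 MAC model with a 40-bit-style guarded accumulator: the product of
# two Q15 operands is a Q30 integer, and the guard bits above bit 31
# absorb the growth of a long dot product.

ACC_MIN, ACC_MAX = -(1 << 39), (1 << 39) - 1   # 40-bit signed range

def mac_q15(acc: int, a: int, b: int) -> int:
    """acc += a*b for Q15 operands; sum stays in Q30 until the final shift."""
    acc += a * b
    assert ACC_MIN <= acc <= ACC_MAX, "40-bit accumulator overflow"
    return acc

# 200 worst-case products (each 2**30) still fit in 40 bits unscaled
acc = 0
for _ in range(200):
    acc = mac_q15(acc, -32768, -32768)
result_q30 = acc          # would be shifted/rounded back to Q15 at the end
```

The single `assert` stands in for the overflow that guard bits prevent; on real silicon the accumulation simply proceeds with no per-tap check.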

Follow-up: What is a SIMD MAC and how does it increase DSP throughput for filter computation?

Q3. What is VLIW (Very Long Instruction Word) architecture and how does TI C6000 use it?

VLIW processors pack multiple independent operations into one wide instruction word and execute them in parallel on multiple functional units each clock cycle, with the compiler (not hardware) responsible for finding and scheduling parallelism. The TI C6000 DSP fetches a 256-bit packet of up to eight 32-bit instructions that can execute simultaneously on eight functional units — two each of the .L (ALU), .S (ALU/branch), .M (multiplier), and .D (load/store) units. The C6000 compiler's loop optimizer can fill these 8 slots for a software-pipelined FIR loop, achieving 8 MACs per cycle at 1 GHz versus 1 MAC per cycle on a scalar processor — which is why TI's C674x achieves 3.6 GFLOPS single-precision.

Follow-up: What is the difference between VLIW and superscalar out-of-order execution?

Q4. What is a DSP pipeline and what is a pipeline hazard?

A DSP pipeline overlaps multiple instruction stages — fetch, decode, execute, write-back — so that while one instruction executes, the next is being decoded and the one after that is being fetched, increasing throughput to one instruction per cycle in steady state. A pipeline hazard is a condition that prevents the pipeline from advancing — a data hazard occurs when an instruction needs a result not yet produced by a preceding instruction. The C6000 pipeline is exposed (not interlocked): the hardware does not stall on its own, so the compiler or assembly programmer must cover each instruction's delay slots — up to 4 after a load — with NOPs or independent work. Filling those slots with useful instructions instead of NOPs is critical in inner loops, where wasted delay slots can double the execution time.
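A back-of-the-envelope model of what those delay slots cost — the latency and iteration counts are illustrative, not measured:

```python
# Toy schedule-length model for an exposed pipeline (C6000-style):
# the consumer of a loaded value must sit `latency` cycles after the
# load, filled either with NOPs or with independent instructions.

def cycles_with_nops(n_loads: int, load_latency: int) -> int:
    # naive code: load, NOPs for every delay slot, then the 1-cycle use
    return n_loads * (1 + load_latency + 1)

def cycles_scheduled(n_loads: int, load_latency: int) -> int:
    # ideal: independent work from other iterations fills every slot,
    # so only the load and the use consume issue cycles
    return n_loads * 2

print(cycles_with_nops(100, 4))   # 600
print(cycles_scheduled(100, 4))   # 200
```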

Follow-up: What is the difference between a stall and a branch penalty in a pipeline?

Q5. What is the role of the circular addressing mode in DSP architecture?

Circular addressing automatically wraps an address pointer back to the start of a buffer when it reaches the end, implementing a ring buffer in hardware without software modulo operations. On the TMS320C5545, a 256-word circular buffer for an FIR delay line is configured by loading the buffer size into a BK (block-size) register and enabling circular modification on an auxiliary register such as AR3; the post-increment after each access then wraps automatically in hardware. Without circular addressing, each FIR tap access requires a software modulo — index = (index + 1) % N — which costs 2–4 extra instructions per access; eliminating this overhead is what lets DSP FIR loops run at one MAC per cycle.
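A sketch of the ring buffer the hardware implements — here the modulo that circular-addressing hardware performs for free is written out explicitly (`DelayLine` is an illustrative class, not a device API):

```python
# FIR delay-line ring buffer: the `% len(self.buf)` on every pointer
# update is exactly the work a circular addressing unit does in hardware.

class DelayLine:
    def __init__(self, n: int):
        self.buf = [0.0] * n
        self.idx = 0                 # write pointer (hardware: auxiliary register)

    def push(self, x: float) -> None:
        self.buf[self.idx] = x
        self.idx = (self.idx + 1) % len(self.buf)   # software modulo wrap

    def tap(self, k: int) -> float:
        """k-th most recent sample; k = 0 is the newest."""
        return self.buf[(self.idx - 1 - k) % len(self.buf)]

d = DelayLine(4)
for v in [1.0, 2.0, 3.0, 4.0, 5.0]:   # the fifth push wraps over the first
    d.push(v)
print(d.tap(0), d.tap(3))   # 5.0 2.0
```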

Follow-up: What is bit-reversed addressing and which DSP operation uses it?

Q6. What is bit-reversed addressing mode and why is it needed in FFT computation?

Bit-reversed addressing generates memory addresses by reversing the binary bits of a sequential index, producing the scrambled data order that the in-place radix-2 butterfly computation requires without an explicit software permutation pass. The Cooley-Tukey radix-2 decimation-in-time FFT's input must be in bit-reversed order for the in-place butterfly outputs to emerge in natural order — for N=8, index 3 (011b) maps to bit-reversed 6 (110b). The TMS320C55x and C6000 support bit-reversed address modification in hardware, allowing the FFT data reordering loop to run at one access per cycle instead of requiring a software permutation table lookup per sample.
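The software fallback (the follow-up question) is a straightforward index-reversal loop; this sketch reproduces the N=8 example above:

```python
# Bit-reversed index generation for an N-point radix-2 FFT — the
# software equivalent of hardware bit-reversed address modification.

def bit_reverse(i: int, bits: int) -> int:
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the low bit of i into r
        i >>= 1
    return r

def bit_reverse_permute(x):
    n = len(x)                   # n must be a power of two
    bits = n.bit_length() - 1
    return [x[bit_reverse(i, bits)] for i in range(n)]

print(bit_reverse(3, 3))                     # 6  (011b -> 110b)
print(bit_reverse_permute(list(range(8))))   # [0, 4, 2, 6, 1, 5, 3, 7]
```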

Follow-up: What is the alternative if hardware bit-reversal is not available — how is it done in software?

Q7. What is the memory hierarchy of a typical DSP chip and how does it affect algorithm performance?

A DSP memory hierarchy typically has level 1 SRAM (L1D and L1P caches or tightly coupled memory) accessible in 1 cycle, L2 in roughly 4–6 cycles, and external DDR memory in 30–100 cycles. On the TI C6748, L1D and L1P are 32 KB each (single-cycle on a hit), L2 is 256 KB (around 6 cycles), and an external SDRAM access can cost tens of cycles; an FIR filter whose coefficients and data fit in L1D runs 10–20x faster than the same filter streaming from SDRAM. Profiling DSP algorithm bottlenecks starts with measuring cache miss rates using the on-chip event counters — SDRAM-bound code is the most common cause of missed real-time deadlines in audio and radar processing.
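The impact of hit rates can be condensed into the standard average-memory-access-time formula; the latencies below echo the figures quoted above and the hit rates are purely illustrative:

```python
# Average memory access time (AMAT) for a two-level cache in front of
# SDRAM: AMAT = L1 + miss1 * (L2 + miss2 * DRAM).

def amat(l1_hit: float, l2_hit: float,
         l1_cyc: int = 1, l2_cyc: int = 6, dram_cyc: int = 70) -> float:
    miss1 = 1.0 - l1_hit
    miss2 = 1.0 - l2_hit
    return l1_cyc + miss1 * (l2_cyc + miss2 * dram_cyc)

print(amat(0.98, 0.95))   # working set mostly in L1: ~1.2 cycles/access
print(amat(0.50, 0.50))   # thrashing: ~21.5 cycles/access
```

The order-of-magnitude gap between the two cases is why fitting the working set in L1D dominates every other optimization.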

Follow-up: What is the difference between a unified L2 cache and a scratchpad SRAM in DSP memory architecture?

Q8. What is software pipelining in a DSP compiler and why is it critical for loop performance?

Software pipelining is a compiler transformation that overlaps iterations of a loop — starting the memory loads for iteration N+2 while computing iteration N+1's multiply while storing iteration N's result — achieving full functional unit utilization in steady state. On the TI C6000, the software-pipelined FIR loop achieves 8 useful operations per cycle (four 16×16 MACs per cycle using DOTP2 instructions), versus 3 cycles per iteration for a naive translation. Without software pipelining, the C6000's 8-wide VLIW machine would be mostly idle in a MAC loop because all 8 operations for one iteration have sequential data dependencies.
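A cycle-count sketch of why overlapping iterations pays off — a hypothetical loop body of three single-cycle dependent stages (load, multiply, store), not a real C6000 schedule:

```python
# Software pipelining cycle model: N iterations of an S-stage body.
# Sequentially, each iteration finishes before the next starts; with a
# pipelined schedule, a new iteration launches every II cycles and only
# the prologue/epilogue pay the full body depth once.

def sequential_cycles(n: int, stages: int) -> int:
    return n * stages

def pipelined_cycles(n: int, stages: int, ii: int = 1) -> int:
    return (n - 1) * ii + stages

print(sequential_cycles(1000, 3))   # 3000
print(pipelined_cycles(1000, 3))    # 1002
```

With II=1 the loop approaches one result per cycle regardless of body depth, which is exactly the steady state the C6000 compiler aims for.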

Follow-up: What is the initiation interval (II) in software pipelining?

Q9. What is the difference between fixed-point and floating-point DSP and when do you choose each?

Fixed-point DSPs represent numbers as integers with an implicit binary point and require the programmer to manage scaling and overflow; floating-point DSPs (or DSPs with FPU) use IEEE 754 representation and handle scaling automatically but use more power and silicon area per operation. The TI C5535 (fixed-point) runs an IIR biquad in Q15 at 50 mW while achieving 100 MIPS, whereas the C674x (floating-point) achieves 3.6 GFLOPS but consumes 1.5 W — fixed-point is chosen for battery-powered hearing aids and IoT sensors, floating-point for radar and software-defined radio where dynamic range requirements make Q15 scaling impractical. Q15 offers about 96 dB of dynamic range, while IEEE 754 single precision provides roughly 144 dB of mantissa precision on top of an exponent range spanning far more than that, making floating-point the practical choice for wide-dynamic-range applications.
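The Q15 conversion asked about in the follow-up can be sketched in a few lines — rounding plus saturation, since +1.0 is not representable in Q15:

```python
# Float <-> Q15 conversion: Q15 stores x in [-1, 1) as an int16
# equal to round(x * 2**15), saturated to the representable range.

def float_to_q15(x: float) -> int:
    v = int(round(x * 32768.0))
    return max(-32768, min(32767, v))   # saturate at the int16 limits

def q15_to_float(q: int) -> float:
    return q / 32768.0

print(float_to_q15(0.5))      # 16384
print(float_to_q15(1.0))      # 32767 (saturated: +1.0 is not representable)
print(q15_to_float(-32768))   # -1.0
```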

Follow-up: What is Q-format notation and how do you convert a floating-point coefficient to Q15?

Q10. What is DMA's role in DSP architecture and how does double-buffering maximize CPU utilization?

DMA in DSP systems transfers audio or ADC sample blocks from peripherals to memory while the CPU simultaneously processes the previous block, eliminating the I/O wait that would otherwise stall the processor. On a TMS320C5535 audio DSP running at 100 MHz with 48 kHz stereo audio at 256-sample frames, DMA fills one ping-pong buffer every 5.3 ms; the CPU processes the completed buffer during this window, achieving 100% CPU utilization for signal processing rather than spending any cycles waiting for I/O. Without double-buffering, the CPU must wait idle for 30–50% of the time while DMA fills the buffer before processing can begin.
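The ping-pong scheme can be simulated in a few lines — here a list copy stands in for the DMA transfer, a doubling gain for the CPU's processing, and `run_frames` is an illustrative harness, not a driver API:

```python
# Ping-pong (double) buffering sketch: the "DMA" fills one buffer while
# the "CPU" processes the other; roles swap at each frame boundary.

FRAME = 4
buffers = [[0] * FRAME, [0] * FRAME]      # ping and pong

def run_frames(samples):
    out, fill = [], 0                      # fill = buffer DMA writes this frame
    frames = [samples[i:i + FRAME] for i in range(0, len(samples), FRAME)]
    for k, frame in enumerate(frames):
        buffers[fill][:] = frame           # DMA transfer for frame k
        if k > 0:                          # CPU processes frame k-1 in parallel
            out.extend(2 * s for s in buffers[1 - fill])
        fill = 1 - fill                    # swap on the completion interrupt
    out.extend(2 * s for s in buffers[1 - fill])   # drain the final frame
    return out

print(run_frames(list(range(8))))   # [0, 2, 4, 6, 8, 10, 12, 14]
```

In real firmware the two roles run concurrently — the DMA in hardware, the processing in the completion ISR or a task it signals — rather than interleaved in one loop.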

Follow-up: What is the interrupt overhead in a DMA double-buffer scheme and how do you minimize it?

Q11. What is the significance of the 40-bit accumulator in TI C55x DSPs?

The 40-bit accumulator on TI C55x provides 8 guard bits above the 32-bit product of two 16-bit operands, allowing 256 (2⁸) accumulations of maximum-value 16-bit products without overflow before requiring a normalization step. An FIR filter with 200 Q15 taps computing a dot product of two 200-element Q15 vectors can accumulate all 200 products in 40 bits before the final right-shift and round, whereas a 32-bit accumulator can overflow after as few as two full-scale Q15 products (each up to 2³⁰) and would need intermediate scaling or saturation inside the loop. This guard bit capacity is why the C55x can implement long FIR filters without overflow checking in the inner loop, which is a key differentiator from using a general-purpose 32-bit Cortex-M processor for the same task.
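The arithmetic behind the guard-bit claim can be checked directly (`fits_signed` is an illustrative helper):

```python
# Guard-bit demo: 200 worst-case Q15 products fit in a 40-bit signed
# accumulator, while a 32-bit accumulator overflows at the second one.

def fits_signed(value: int, bits: int) -> bool:
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

worst = (-32768) * (-32768)        # 2**30, the largest Q15 x Q15 product
total = 200 * worst

print(fits_signed(total, 40))      # True: 8 guard bits cover up to 256 products
print(fits_signed(2 * worst, 32))  # False: 2**31 exceeds the int32 maximum
```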

Follow-up: How do you convert a Q15 accumulator result back to Q15 after accumulation?

Q12. What is a FFT hardware accelerator and how does it differ from a software FFT on a DSP?

An FFT hardware accelerator is a dedicated silicon block that computes the FFT in parallel using hardwired butterfly units, twiddle factor ROMs, and autonomous memory access, executing a 1024-point FFT in tens of microseconds with minimal CPU involvement. The TMS320C5545, for example, pairs the C55x core with a tightly coupled FFT accelerator (HWAFFT) for power-of-2 sizes up to 1024 points; Qualcomm's Hexagon DSP takes a different route, using its wide-SIMD HVX vector extensions to speed up a software FFT several-fold rather than replacing it with fixed-function hardware. Software FFTs provide flexibility (arbitrary N, arbitrary precision) while hardware accelerators are fixed to power-of-2 sizes but deliver orders-of-magnitude better energy efficiency — critical in always-on audio keyword detection applications.

Follow-up: What is the energy cost comparison between an FFT on a DSP core versus an FFT hardware accelerator?

Q13. What is the role of the program cache (L1P) versus data cache (L1D) in C6000 DSP?

L1P (Level 1 Program cache) caches instruction packets fetched from L2 or external memory, ensuring the 8-wide VLIW fetch unit never stalls on instruction supply; L1D (Level 1 Data cache) caches data reads and writes, preventing the load/store units from stalling on data access. On the C6748, L1P is 32 KB direct-mapped and L1D is 32 KB 2-way set-associative — for a 1 KB FIR kernel that fits in L1P, instruction fetch is always 1-cycle, but if the coefficient table is 8 KB (larger than L1D), cache thrashing causes 6-cycle L2 misses every few iterations. DSP algorithm optimization always involves ensuring the hot loop's instruction footprint fits in L1P and the working data set fits in L1D.

Follow-up: What is cache thrashing and how do you detect it using profiling tools?

Q14. What is the difference between a DSP processor and an FPGA for signal processing applications?

A DSP processor executes sequential software instructions on fixed programmable hardware, offering flexibility and ease of algorithm update but limited parallelism (8–32 operations per cycle); an FPGA implements the algorithm in reconfigurable parallel hardware, achieving thousands of simultaneous operations but with higher development time and power for small signal paths. For a 64-channel beamforming radar, an FPGA processing 64 parallel FFTs simultaneously at 500 MHz outperforms any DSP by 50–100x in throughput, but implementing a new beamforming algorithm requires RTL redesign and re-synthesis rather than a firmware update. DSPs are chosen for algorithms that change frequently; FPGAs for fixed, throughput-critical processing pipelines.

Follow-up: What is an SoC that combines a DSP and an FPGA, and name one example?

Q15. What is the EDMA (Enhanced DMA) on TI C6000 and how is it programmed?

EDMA on TI C6000 is a 3-dimensional DMA controller that transfers rectangular blocks of memory with configurable source/destination strides, allowing column extraction from a matrix, scatter-gather, and linked parameter sets without CPU involvement between frames. An EDMA channel for ping-pong audio buffering is configured with a PaRAM set specifying source=McASP FIFO address, destination=buffer A, BCNT=128 samples, and linking to a second PaRAM set pointing to buffer B — the EDMA automatically alternates between A and B and raises a completion interrupt for each. Correctly configuring EDMA source/destination strides (SRCBIDX, DSTBIDX) to skip rows in a 2D array is the most complex aspect and requires drawing the memory layout explicitly before writing the configuration.
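The stride arithmetic is easier to see in a sketch than in register values — this models a 2D EDMA-style transfer in plain Python (`strided_copy` is an illustrative function, not a driver call), extracting a matrix column exactly as the SRCBIDX configuration would:

```python
# 2D strided transfer sketch: copy BCNT arrays of ACNT contiguous
# elements, advancing the source pointer by BIDX between arrays --
# the indexing an EDMA PaRAM set (ACNT/BCNT/SRCBIDX) describes.

def strided_copy(src, src_off, src_bidx, acnt, bcnt):
    out = []
    for b in range(bcnt):
        base = src_off + b * src_bidx   # pointer after b stride steps
        out.extend(src[base:base + acnt])
    return out

# 3x4 row-major matrix flattened; column 1 = ACNT=1, BCNT=3, BIDX=4
mat = [0, 1,  2,  3,
       4, 5,  6,  7,
       8, 9, 10, 11]
print(strided_copy(mat, src_off=1, src_bidx=4, acnt=1, bcnt=3))   # [1, 5, 9]
```

Drawing the flattened layout first, as the answer advises, makes the choice of `src_off` and `src_bidx` mechanical.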

Follow-up: What is the difference between AB-synchronized and A-synchronized EDMA transfers?

Common misconceptions

Misconception: Harvard architecture means the DSP has two separate processors.

Correct: Harvard architecture means one processor has two separate memory buses — one for instructions and one for data — allowing simultaneous access to both in the same clock cycle.

Misconception: A DSP with VLIW automatically runs faster because it executes more instructions simultaneously.

Correct: VLIW throughput depends on the compiler's ability to find parallel independent operations; poorly written code with sequential dependencies may still execute only one useful operation per cycle even on an 8-wide VLIW machine.

Misconception: Floating-point DSPs are always better than fixed-point for signal processing.

Correct: Fixed-point DSPs consume significantly less power and are preferred for battery-powered devices like hearing aids and IoT sensors where dynamic range requirements can be met with Q15 or Q31 arithmetic.

Misconception: Software pipelining and hardware pipelining are the same thing.

Correct: Hardware pipelining overlaps stages within a single instruction execution; software pipelining overlaps entire iterations of a loop, scheduling instructions from multiple iterations to fill all functional unit slots simultaneously.

Quick one-liners

What does MAC stand for in DSP architecture? Multiply-Accumulate — a single-cycle operation computing P = A×B + C critical for FIR, FFT, and matrix algorithms.
What is VLIW? Very Long Instruction Word — a processor architecture where the compiler packs multiple independent operations into one wide instruction for parallel execution.
What is circular addressing used for in DSP? Implementing ring buffers for FIR delay lines without software modulo operations, reducing loop overhead to one instruction per tap.
What is bit-reversed addressing used for? Reordering FFT input or output data in the bit-reversed permutation required by Cooley-Tukey in-place butterfly computation.
What is the accumulator word length on TI C55x? 40 bits — providing 8 guard bits above the 32-bit product to prevent overflow during long dot product accumulations.
What is the Harvard architecture advantage over von Neumann? Simultaneous instruction fetch and data access in the same cycle, doubling effective memory bandwidth for compute-intensive loops.
What is software pipelining in a DSP compiler? A loop transformation that overlaps instructions from multiple iterations to keep all functional units busy every cycle.
What is the difference between L1D cache and scratchpad SRAM? L1D cache is hardware-managed and transparent to software; scratchpad SRAM is software-managed and explicitly addressed by the programmer for guaranteed latency.
What is the initiation interval (II) in software pipelining? The number of cycles between the start of successive loop iterations in the pipelined schedule — II=1 means one new iteration starts every cycle.
What processor family does TI use for audio and measurement DSP? The TMS320C5000 series (C5535, C5545) for low-power fixed-point audio DSP applications.
