How it works
DSP processors use a Harvard architecture: separate program and data memory buses let the core fetch an instruction and two data operands in the same cycle. The MAC unit multiplies two 16-bit operands, producing a 32-bit product, and adds it to a 40-bit accumulator in one clock cycle; the 8 extra guard bits give headroom for at least 256 worst-case accumulations before the accumulator can overflow. The TMS320C54x has a 6-stage pipeline: prefetch, fetch, decode, access, read, execute (the older TMS320C5x uses 4 stages: fetch, decode, read, execute); a taken branch flushes the instructions already in the early stages, costing a few cycles (commonly quoted as a 3-cycle penalty) unless a delayed branch is used. Zero-overhead hardware looping (RPT for a single instruction and RPTB for a block on the C54x; RPTK is the mnemonic on the earlier TMS320C2x) repeats code without a branch penalty, giving a sustained throughput of 1 MAC/cycle for FIR filters. VLIW DSPs such as the TMS320C6x issue up to 8 instructions per cycle from a 256-bit fetch packet of eight 32-bit instructions.
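The role of the guard bits can be illustrated with a small behavioural model in Python. This is a sketch only, not TMS320 code: the helper names `wrap_signed` and `mac_sum` are hypothetical, and overflow is modelled as plain two's-complement wraparound (a real device may saturate instead).

```python
def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement register of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # Reinterpret the top bit as the sign bit.
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac_sum(samples, coeffs, acc_bits):
    """Accumulate 16x16-bit products into an accumulator acc_bits wide."""
    acc = 0
    for x, h in zip(samples, coeffs):
        product = wrap_signed(x * h, 32)            # 16x16 -> 32-bit product
        acc = wrap_signed(acc + product, acc_bits)  # one MAC step
    return acc


# Largest possible product: (-32768) * (-32768) = 2**30 on every tap.
samples = [-32768] * 256
coeffs = [-32768] * 256

exact = sum(x * h for x, h in zip(samples, coeffs))  # 256 * 2**30 = 2**38
acc40 = mac_sum(samples, coeffs, 40)  # 2**38 fits below the 2**39 limit
acc32 = mac_sum(samples, coeffs, 32)  # wraps: the running sum is lost
```

With these inputs the 40-bit model returns the exact sum, while the 32-bit model wraps around, which is the failure the guard bits are there to avoid.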
Key points to remember
The single-cycle MAC is the defining feature of a DSP architecture: a 256-tap FIR filter on a C54x clocked at 8 MHz needs about 256 cycles (256 / 8 MHz = 32 μs) per output sample, which is well within the 125 μs sampling interval at 8 kHz. The TMS320C6x can execute up to 8 operations per cycle using VLIW, a peak of 8× the instruction throughput of a scalar DSP at the same clock frequency; the sustained gain depends on how well the code can be scheduled across the functional units. On-chip dual-access RAM, used for example to hold a 16-word circular buffer, supports zero-overhead coefficient addressing for filters. The accumulator guard bits (8 extra bits above 32 in TMS320C54x accumulators) absorb intermediate overflow without scaling the data, a feature that general-purpose processors typically lack. Circular addressing for delay-line buffers handles pointer wraparound in hardware, saving at least 1 instruction per MAC.
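The timing figures and the circular delay line can be checked with a short Python sketch. It assumes an idealised 1 MAC/cycle with no loop setup overhead; the names `fir_time_us` and `CircularFIR` are hypothetical, and the `%` operation stands in for what the hardware address generator does at no cycle cost.

```python
def fir_time_us(taps, clock_hz, macs_per_cycle=1):
    """Idealised time for one FIR output sample, in microseconds."""
    cycles = taps / macs_per_cycle
    return cycles * 1e6 / clock_hz


class CircularFIR:
    """FIR filter whose delay line is a circular buffer (modulo addressing)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0] * len(self.coeffs)
        self.newest = 0  # index of the most recent sample

    def step(self, sample):
        n = len(self.coeffs)
        # Overwrite the oldest sample; the index wraps, the data never moves.
        self.newest = (self.newest + 1) % n
        self.delay[self.newest] = sample
        acc = 0
        for k in range(n):
            # coeffs[k] multiplies the sample from k steps ago.
            acc += self.coeffs[k] * self.delay[(self.newest - k) % n]
        return acc


compute_us = fir_time_us(256, 8e6)  # 256 cycles at 8 MHz
budget_us = 1e6 / 8000              # sampling interval at 8 kHz

fir = CircularFIR([1, 2, 3])
impulse_response = [fir.step(x) for x in [1, 0, 0, 0]]
```

Feeding a unit impulse returns the coefficients in order followed by zero, which confirms that the wrapped index reads the delay line in the right sequence.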
Exam tip
Examiners frequently ask why Harvard architecture gives higher throughput than Von Neumann architecture for DSP algorithms. State that separate buses allow simultaneous instruction fetch and dual data reads, enabling one MAC per clock, and support this with a reference to the FIR filter inner loop.
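A rough bus-cycle count for the FIR inner loop can back up this argument. The sketch below is a deliberately simplified model, assuming every memory access takes one cycle and ignoring caches, pipelining and repeat buffers; both function names are hypothetical.

```python
import math


def cycles_per_tap_shared(instr_fetches=1, data_reads=2):
    """Von Neumann model: one shared bus, so every access is serialised."""
    return instr_fetches + data_reads


def cycles_per_tap_harvard(instr_fetches=1, data_reads=2,
                           program_buses=1, data_buses=2):
    """Harvard model: program and data accesses proceed in parallel."""
    return max(math.ceil(instr_fetches / program_buses),
               math.ceil(data_reads / data_buses))


taps = 256
shared_cycles = taps * cycles_per_tap_shared()    # 3 accesses per tap
harvard_cycles = taps * cycles_per_tap_harvard()  # 1 cycle per tap
```

Under these assumptions each tap needs one instruction fetch plus a sample read and a coefficient read, so the shared-bus machine spends three cycles per tap where the Harvard machine spends one.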