DSP Architecture Short Notes

On a TMS320C54x DSP running an FIR filter in a 13 kbps GSM speech codec, a single multiply-accumulate (MAC) instruction performs one multiply, one accumulate, and two data-memory reads in a single clock cycle. A general-purpose processor without a dedicated MAC datapath and dual data buses typically needs several cycles for the same work. This parallelism is what sets DSPs apart from general-purpose microprocessors, and the architectural features that enable it are a regular subject of university exam questions.

ECE, EI

How it works

DSP processors use a (modified) Harvard architecture: separate program and data memory buses allow an instruction fetch and two data operand reads in the same cycle. The MAC unit multiplies two 16-bit operands into a 32-bit product and adds it to a 40-bit accumulator in one clock cycle; the 8 extra guard bits give headroom against overflow during long sums. The TMS320C54x has a 6-stage pipeline (prefetch, fetch, decode, access, read, execute); a taken branch discards the instructions already fetched behind it, costing several cycles (4 cycles for a non-delayed branch on the C54x). Zero-overhead hardware looping (the RPT single-instruction repeat and RPTB block repeat on the C54x; RPTK is the older TMS320C2x mnemonic) repeats the loop body without a branch penalty, giving sustained throughput of 1 MAC/cycle for FIR filters. VLIW DSPs such as the TMS320C6x issue up to 8 instructions per cycle from a 256-bit fetch packet (eight 32-bit instructions).
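The MAC arithmetic described above can be modelled in a few lines of Python. This is only a rough sketch, not TI code: the names `to_int16`, `mac` and `fir_output` are made up for illustration, and clamping at the 40-bit limits is an assumption (on real hardware, overflow behaviour depends on the processor's mode settings). It shows why a 32-bit accumulator would overflow on 256 worst-case products while a 40-bit one does not.

```python
# Illustrative model of a 16x16 -> 40-bit multiply-accumulate (not TI code).

ACC_BITS = 40                          # 32 product bits + 8 guard bits
ACC_MAX = (1 << (ACC_BITS - 1)) - 1    # largest signed 40-bit value
ACC_MIN = -(1 << (ACC_BITS - 1))       # smallest signed 40-bit value


def to_int16(value):
    """Wrap an integer into the signed 16-bit range, like a 16-bit memory word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mac(acc, x, h):
    """One MAC step: acc + x*h, clamped to the 40-bit range (clamping is an assumption)."""
    acc += to_int16(x) * to_int16(h)
    return max(ACC_MIN, min(ACC_MAX, acc))


def fir_output(samples, coeffs):
    """One FIR output sample: one MAC per tap (one cycle each on the hardware)."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc = mac(acc, x, h)
    return acc


# 256 worst-case products of (-32768) * (-32768) = 2**30 each.
worst = fir_output([-32768] * 256, [-32768] * 256)
fits_in_32_bits = worst <= (1 << 31) - 1
fits_in_40_bits = worst <= ACC_MAX
```

Here `worst` comes to 2**38, which is beyond a signed 32-bit register but inside the 40-bit accumulator.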

Key points to remember

The single-cycle MAC is the defining feature of a DSP architecture: for a 256-tap FIR filter on a C54x clocked at 8 MHz, the MAC loop takes about 256 cycles (32 μs), well within the 125 μs sampling interval at 8 kHz. The TMS320C6x can execute up to 8 operations per cycle using VLIW, a peak of 8× the throughput of a scalar DSP at the same clock frequency; the sustained gain depends on how well the code keeps all eight functional units busy. On-chip dual-access RAM (DARAM) supports two data accesses per cycle, so a sample and a coefficient can be read together. The accumulator guard bits (8 extra bits above 32 in the 40-bit C54x accumulators) allow up to 2^8 = 256 full-scale products to be summed without intermediate overflow or scaling, headroom that ordinary 32-bit integer registers in general-purpose processors lack. Circular addressing for delay-line buffers (buffer length set in the BK register on the C54x) makes the pointer wrap in hardware, removing the software wraparound check that would otherwise cost at least one extra instruction per MAC.
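A small Python sketch may help fix the circular-buffer idea and the timing budget. It is a software model under simple assumptions, not DSP code: `CircularDelayLine`, `fir_step` and `mac_loop_time_us` are hypothetical names, the modulo operation stands in for what the address generator does in hardware, and the timing function assumes exactly one MAC per cycle with no loop setup overhead.

```python
class CircularDelayLine:
    """Fixed-length delay line whose index wraps, as circular addressing does in hardware."""

    def __init__(self, length):
        self.buf = [0] * length
        self.head = 0  # index of the newest sample

    def push(self, sample):
        # Step the pointer back by one and wrap; the oldest sample is overwritten.
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = sample

    def tap(self, k):
        """Return the sample delayed by k steps (k = 0 is the newest)."""
        return self.buf[(self.head + k) % len(self.buf)]


def fir_step(delay, coeffs, sample):
    """Push one input sample and return one output: one MAC per tap."""
    delay.push(sample)
    acc = 0
    for k, h in enumerate(coeffs):
        acc += h * delay.tap(k)
    return acc


def mac_loop_time_us(taps, clock_hz):
    """Time for the MAC loop alone, assuming exactly 1 MAC per cycle."""
    return taps * 1_000_000 / clock_hz


# Impulse response check: feeding 1, 0, 0, 0 should read the coefficients back out.
delay = CircularDelayLine(3)
impulse_response = [fir_step(delay, [1, 2, 3], x) for x in (1, 0, 0, 0)]

# Timing figures from the notes: 256 taps, 8 MHz clock, 8 kHz sampling.
loop_us = mac_loop_time_us(256, 8_000_000)
period_us = 1_000_000 / 8_000
```

With these numbers the MAC loop uses 32 μs of the 125 μs sample period, roughly a quarter of the available time. Without hardware wraparound, the modulo step would have to be done in software on every tap.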

Exam tip

A common exam question asks why Harvard architecture gives higher throughput than von Neumann architecture for DSP algorithms. State that separate program and data buses allow an instruction fetch and two data reads in the same cycle, enabling one MAC per clock, and support this with a reference to the FIR filter inner loop.
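The bus argument can be backed with a quick count of memory accesses per FIR tap. The Python sketch below is a deliberately simplified model, not a description of any real processor: it assumes every bus carries one access per cycle and ignores caches, pipelining and repeat buffers. The name `cycles_per_tap` and the access count are illustrative assumptions.

```python
def cycles_per_tap(accesses, buses):
    """Cycles needed if each bus carries one access per cycle (ceiling division)."""
    return -(-accesses // buses)


# Per FIR tap: 1 instruction fetch + 2 data reads (sample and coefficient).
ACCESSES_PER_TAP = 3
TAPS = 256

# Single shared bus: the three accesses must queue up one after another.
von_neumann_cycles = TAPS * cycles_per_tap(ACCESSES_PER_TAP, buses=1)

# One program bus plus two data buses: all three accesses happen together.
harvard_cycles = TAPS * cycles_per_tap(ACCESSES_PER_TAP, buses=3)
```

Under this model the shared-bus machine needs 768 cycles for a 256-tap filter and the three-bus machine needs 256, which is the "one MAC per clock" figure to quote in the answer.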