implementation of dsp ic

VSP Lecture4 - Fast Algorithms (cwliu@twins.ee.nctu.edu.tw)

Implementation of DSP IC

Lecture 4 Fast Algorithms for Digital Signal Processing

Algorithm Strength Reduction
• Strength reduction leads to a reduction in hardware complexity by exploiting substructure sharing. This gives less silicon area or power consumption in a VLSI ASIC, or a shorter iteration period in a programmable DSP implementation
• Strength reduction enables design of parallel FIR filters with a less-than-linear increase in hardware

Algorithm Strength Reduction
• Motivation
  – The number of strong operations, such as multiplications, is reduced, possibly at the expense of an increase in the number of weaker operations, such as additions.
• Reduce computation complexity
• Example: Complex multiplication
  – $(a+jb)(c+jd) = e+jf$, with $a,b,c,d,e,f \in \mathbb{R}$
  – The direct implementation requires 4 multiplications and 2 additions
  – However, the number of multiplications can be reduced to 3 at the expense of 3 extra additions by using the identities

$$ac - bd = a(c-d) + d(a-b)$$
$$ad + bc = b(c+d) + d(a-b)$$

  – 3 multiplications, 5 additions
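The identities above can be checked with a minimal Python sketch. The function name `complex_mul_3` is hypothetical and not part of the lecture; the shared term d(a-b) is computed once, which is where the saving comes from.

```python
def complex_mul_3(a, b, c, d):
    """Return (e, f) with e + jf = (a + jb)(c + jd), using 3 real multiplications."""
    shared = d * (a - b)        # 1 multiplication, 1 addition (reused twice)
    e = a * (c - d) + shared    # real part: ac - bd
    f = b * (c + d) + shared    # imaginary part: ad + bc
    return e, f

# (1 + 2j)(3 + 4j) = -5 + 10j
e, f = complex_mul_3(1, 2, 3, 4)
```

The count is 3 multiplications and 5 additions (a-b, c-d, c+d, and the two final sums), as stated on the slide.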

Complex Multiplication

Reducing the number of strong operations lowers the switched capacitance, but it lengthens the critical path

Speed?, Area?, Power? ….


Review of Discrete Fourier Transform

4 Forms of Fourier Analysis

• Continuous-Time and Continuous-Frequency
  – Signal: continuous, aperiodic
• Continuous-Time and Discrete-Frequency
  – Fourier series of periodic continuous signals
  – Time domain: periodic, continuous; frequency domain: discrete, aperiodic
• Discrete-Time and Continuous-Frequency
  – Fourier transform of aperiodic discrete signals
  – Time domain: discrete, aperiodic; frequency domain: continuous, periodic
• “Sampled” frequency

Discrete Fourier Transform

• DFT is identical to samples of Fourier transforms
• In DSP applications, we are able to store only a finite number of samples
• We are able to compute the spectrum only at specific discrete values of the frequency $\omega$

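Discrete Fourier Transform
• Discrete Fourier transform (DFT) pairs, with $W_N = e^{-j2\pi/N}$

$$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{kn},\quad k = 0, 1, \ldots, N-1$$
$$x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,W_N^{-kn},\quad n = 0, 1, \ldots, N-1$$

• DFT/IDFT can be implemented by using the same hardware
• It requires N² complex multiplications and N(N-1) complex additions
  – Each output sample takes N complex multiplications and N-1 complex additions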
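As a rough illustration of the N² cost, the DFT pair can be evaluated directly from its definition. This is only a sketch using the standard-library `cmath` module; the names `dft` and `idft` are chosen here for illustration.

```python
import cmath

def dft(x):
    """Direct DFT: N outputs, each a sum of N products (N^2 multiplications)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)            # W_N
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

def idft(X):
    """Direct IDFT: same structure with W_N^{-kn} and a 1/N scale factor."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-k * n) for k in range(N)) / N for n in range(N)]

x = [1, 2, 3, 4]
X = dft(x)          # X[0] is the sum of the samples, 10
x_back = idft(X)    # recovers x up to rounding error
```

The forward and inverse transforms differ only in the sign of the exponent and the 1/N scaling, which is why the same hardware can serve both.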

More About DFT
• Properties of Discrete Fourier Transform
• Linear Convolution and Discrete Fourier Transform

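Periodic Sequence
• Consider a periodic sequence $\tilde{x}[n]$ of period N
• The sequence can be represented by a Fourier series

$$\tilde{x}[n] = \frac{1}{N}\sum_{k} \tilde{X}[k]\,e^{j(2\pi/N)kn}$$

• The Fourier series for any discrete-time signal with period N requires only N harmonically related complex exponentials, because

$$e_{k+lN}[n] = e^{j(2\pi/N)(k+lN)n} = e^{j(2\pi/N)kn} = e_k[n]$$

so that

$$\tilde{x}[n] = \frac{1}{N}\sum_{k=0}^{N-1} \tilde{X}[k]\,e^{j(2\pi/N)kn}$$

• To obtain the coefficients, interchange the order of summation and apply the orthogonality property of the complex exponentials
• The coefficients are also periodic with period N

DFS Representation of a Periodic Sequence

• Synthesis equation: $\tilde{x}[n] = \frac{1}{N}\sum_{k=0}^{N-1} \tilde{X}[k]\,W_N^{-kn}$
• Analysis equation: $\tilde{X}[k] = \sum_{n=0}^{N-1} \tilde{x}[n]\,W_N^{kn}$
• $\tilde{X}[k]$ and $\tilde{x}[n]$ are periodic sequences of period N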

Physical Significance

One period

$\tilde{X}[k]$ vs. $X(e^{j\omega})$

Example

Sampling the Fourier Transform

unit circle

The sampling sequence is periodic with period N

Suppose $X(e^{j\omega})$ exists

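Aliasing Problem 1
• $x[n]$ is an infinite-length sequence, so the periodic sequence $\tilde{x}[n]$ is a time-aliased version of it

Aliasing Problem 2
• If $x[n]$ is a finite-length sequence, $0 \le n \le M-1$
• Consider the case $N < M$: $\tilde{x}[n] \ne x[n]$ over one period, because neighboring copies of $x[n]$ overlap

Concluding Remarks
• The case $N \ge M$: one period of $\tilde{x}[n]$ equals $x[n]$, so there is no time aliasing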

Circular Shift of a Sequence

[Figure: circular shift by 2 samples, shown as a rotation of the cylinder]

[Figure: circular shift by 13 samples with N = 15 (window $R_{15}[n]$), shown as a rotation of the cylinder]

Review of Convolution

• Given two sequences:
  – Data sequence $x_i$, 0 ≤ i ≤ N-1, of length N
  – Filter sequence $h_i$, 0 ≤ i ≤ L-1, of length L
• Linear convolution

$$y_i = x_i * h_i = \sum_{k} h_k\,x_{i-k},\quad i = 0, 1, \ldots, L+N-2$$

  – h (L-point sequence) and x (N-point sequence) give s ((L+N-1)-point sequence)
  – NL multiplications
• Direct computation, for example 2-by-2 convolution

$$s_0 = h_0 x_0,\quad s_1 = h_0 x_1 + h_1 x_0,\quad s_2 = h_1 x_1$$

  – requires 4 multiplications and 1 addition
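The NL multiplication count can be seen in a direct implementation. The sketch below is illustrative only; `linear_convolution` is a hypothetical helper that also returns the number of multiplications it performed.

```python
def linear_convolution(x, h):
    """Direct linear convolution of an N-point x with an L-point h.

    Returns the (L+N-1)-point result and the number of multiplications used.
    """
    N, L = len(x), len(h)
    y = [0] * (L + N - 1)
    mults = 0
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += hk * xi      # every (x_i, h_k) pair is multiplied once
            mults += 1
    return y, mults

# 2-by-2 case from the slide: s0 = h0*x0, s1 = h0*x1 + h1*x0, s2 = h1*x1
s, mults = linear_convolution([1, 2], [3, 4])
```

For the 2-by-2 case this gives a 3-point result with 4 multiplications, matching the slide.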

Linear Convolution

Linear Shift

Linear Shift vs Circular Shift

Conventional shift (linear shift)

Circular Shift Example

Periodic/Circular Convolution

Circular Shift

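Circular Convolution Definition
• Suppose two finite-duration sequences x1[n] and x2[n] of length N. Their N-point circular convolution is

$$x_3[n] = \sum_{m=0}^{N-1} x_1[m]\,x_2[((n-m))_N],\quad 0 \le n \le N-1$$

• x3[n] is also a finite-duration sequence of length N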

Computation for Circular Convolution

1. Periodically extend the two sequences with period N (large enough)

2. Compute the periodic convolution of the two periodic sequences

3. Extract one period of the result, the samples in [0, N-1]
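The three steps above can be sketched in a few lines of Python. Here the periodic extension is not stored explicitly: the modulo index plays that role. The function name `circular_convolution` is an assumption made for this example.

```python
def circular_convolution(x1, x2):
    """N-point circular convolution of two length-N sequences."""
    N = len(x1)
    assert len(x2) == N, "both sequences must have length N"
    # Steps 1 and 2: (n - m) % N reads x2 as if it were periodic with period N.
    # Step 3: only the N samples n = 0..N-1 of the periodic result are kept.
    return [sum(x1[m] * x2[(n - m) % N] for m in range(N)) for n in range(N)]

# Convolving with a delayed unit sample circularly shifts the sequence by one.
shifted = circular_convolution([1, 2, 3, 4], [0, 1, 0, 0])
```

The example shows the link to the circular shift: the sample that leaves at the end of the block re-enters at the start.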

Example

Step 1

Step 2

Step 3

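Circular Convolution Property
• Usually, we use the following notation to represent the circular convolution of length N

$$x_3[n] = x_1[n] \circledast_N x_2[n]$$

• Circular convolution property: circular convolution in time corresponds to multiplication of the N-point DFTs

$$x_1[n] \circledast_N x_2[n] \;\longleftrightarrow\; X_1[k]\,X_2[k]$$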

Circular Convolution Implementation

• Direct implementation
  – h (N-point sequence) and x (N-point sequence) give s (N-point sequence)
  – 4×4 cyclic convolution: 16 multiplications, 12 additions

Circular Convolution

~ O(N²)

Using Circular Convolution to Implement Linear Convolution

• Consider two sequences x1[n] of length L and x2[n] of length P, respectively

• The linear convolution $x_3[n] = x_1[n] * x_2[n]$ is a sequence of length L+P-1

• Choose N such that N ≥ L+P-1; then the N-point circular convolution of the zero-padded sequences equals the linear convolution

• The same concept is related to the Winograd algorithm

[Figure: Linear Convolution compared with Circular Convolution with N = L+P-1]

Time aliasing in the circular convolution of two finite-length sequences can be avoided if N ≥ L+P-1
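A small numerical check of the condition N ≥ L+P-1, written as an illustrative Python sketch. The helper `circular_convolution_n` is hypothetical; it zero-pads both inputs to length N and assumes N is at least as long as each input.

```python
def circular_convolution_n(x1, x2, N):
    """N-point circular convolution after zero-padding both inputs to length N."""
    a = list(x1) + [0] * (N - len(x1))
    b = list(x2) + [0] * (N - len(x2))
    return [sum(a[m] * b[(n - m) % N] for m in range(N)) for n in range(N)]

x1 = [1, 2, 3]   # L = 3
x2 = [1, 1]      # P = 2

no_alias = circular_convolution_n(x1, x2, 4)   # N = L+P-1: equals the linear convolution
aliased = circular_convolution_n(x1, x2, 3)    # N < L+P-1: the tail wraps onto sample 0
```

With N = 4 the result is the linear convolution [1, 3, 5, 3]. With N = 3 the last sample (3) wraps around and adds to the first, giving [4, 3, 5].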

Concluding Remarks
• The convolution of two finite-length sequences can be interpreted as a circular convolution of large enough length
• Circular convolution can be implemented by DFT/FFT
• However, in real applications…
  – For an FIR system, the input sequence is of indefinite duration
  – To store the entire input signal requires ?
    • A large delay in processing
    • An indefinite memory
  – Block convolution

Block Convolution
• Step 1: Segment the sequence into sections of length L
• Step 2: Convolve each section with the finite-length impulse response of length P by using a DFT/FFT of length N=L+P-1
• Step 3: Fit the filtered sections together in an appropriate way
  • Overlap-add method
  • Overlap-save method

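Overlap-Add Method

[Figure: overlap-add computation of y from x and h]

• Step 1: Zero padding of each input section $x_r[n]$ and of $h[n]$
• Step 2: $y_r[n] = x_r[n] * h[n] = x_r[n] \circledast_N h[n]$, with L+P-1 length
• Step 3: Time shift of each $y_r[n]$, then add the overlapping parts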
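A minimal sketch of the overlap-add idea in Python. To keep it short and dependency-free, each section is convolved directly rather than with an N-point FFT, which is a substitution made for this example only; the function names are hypothetical.

```python
def linear_convolution(x, h):
    """Direct linear convolution (stands in for the FFT-based step here)."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

def overlap_add(x, h, L):
    """Filter x with h by processing sections of length L and adding the overlaps."""
    P = len(h)
    y = [0] * (len(x) + P - 1)
    for start in range(0, len(x), L):
        section = x[start:start + L]            # x_r[n], at most L samples
        y_r = linear_convolution(section, h)    # at most L+P-1 samples
        for n, value in enumerate(y_r):
            y[start + n] += value               # time shift by rL, then add the overlap
    return y

x = list(range(1, 11))
h = [1, -1, 2]
y_blocks = overlap_add(x, h, 4)
y_direct = linear_convolution(x, h)   # same result as the block-wise computation
```

Each section output is P-1 samples longer than the section itself, and those P-1 samples are what overlaps with the next section.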

Fast Convolution with the FFT
• Given two sequences x1 and x2 of length N1 and N2 respectively
  – Direct implementation requires N1N2 complex multiplications
• Consider using FFT to convolve two sequences:
  – Pick N, a power of 2, such that N ≥ N1+N2-1
  – Zero-pad x1 and x2 to length N
  – Compute N-point FFTs of zero-padded x1 and x2 to obtain X1 and X2
  – Multiply X1 and X2
  – Apply the IFFT to obtain the convolution sum of x1 and x2
  – Computation complexity: 2(N/2)log2N + N + (N/2)log2N

Example
• A sequence x[n] of length 1024
• FIR filter h[n] of length 34
• Direct computation: 34 × 1024 = 34816
• Using radix-2 FFT: 35840 (N=2048)
• Using overlap-add radix-2 FFT:
  – x[n] is segmented into a set of contiguous blocks of equal length 95
  – Apply radix-2 FFT of length 128
  – Each segment requires 1472 multiplications
  – This algorithm requires total 16192 multiplications
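The numbers in this example follow from the complexity expression 2(N/2)log2N + N + (N/2)log2N. The short Python sketch below reproduces them; the helper name `fft_conv_mults` is an assumption for illustration, and the block count assumes the last, shorter block is processed like the others.

```python
import math

def fft_conv_mults(N):
    """Multiplications for one FFT-based convolution of length N (a power of 2):
    two forward FFTs, N pointwise products, and one inverse FFT."""
    log2N = int(math.log2(N))
    return 2 * (N // 2) * log2N + N + (N // 2) * log2N

direct = 34 * 1024                      # 34816
single_fft = fft_conv_mults(2048)       # 2048 >= 1024 + 34 - 1, gives 35840

per_block = fft_conv_mults(128)         # 128 = 95 + 34 - 1, gives 1472
blocks = math.ceil(1024 / 95)           # 11 blocks of length 95
overlap_add_total = blocks * per_block  # 16192
```

A single long FFT is slightly worse than direct computation here, while block processing with short FFTs cuts the count by more than half.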


Decimation in Time

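Splitting the time index into even samples (n = 2ℓ) and odd samples (n = 2ℓ+1), with $W_N = e^{-j2\pi/N}$:

$$X[k] = \sum_{\ell=0}^{N/2-1} x[2\ell]\,W_{N/2}^{k\ell} + W_N^{k}\sum_{\ell=0}^{N/2-1} x[2\ell+1]\,W_{N/2}^{k\ell}$$

• Each sum is an N/2-point DFT; $W_N^{k}$ is the twiddle factor
• N + 2(N/2)² complex multiplications vs. N² complex multiplications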

Flow Graph of the DIT FFT

8-point DIT DFT

Remarks
• It requires v=log2N stages. Each stage has N/2 butterfly operations (radix-2 DIT FFT), each of which requires 2 complex multiplications and 2 complex additions
• Each stage has N complex multiplications and N complex additions
• The number of complex multiplications (as well as additions) is equal to N log2N
• By the symmetry property, we have (butterfly operation)

$$W_N^{r+N/2} = W_N^{N/2}\,W_N^{r} = e^{-j\pi}\,W_N^{r} = -W_N^{r}$$

• The butterfly is thereby reduced from 2 complex multiplications and 2 complex additions to 1 complex multiplication and 2 complex additions
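The reduced butterfly can be sketched as a recursive radix-2 DIT FFT in Python. This is an illustrative version rather than an in-place hardware-style implementation; the name `fft_dit` is hypothetical and the input length is assumed to be a power of 2.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of 2)."""
    N = len(x)
    if N == 1:
        return list(x)
    G = fft_dit(x[0::2])    # N/2-point DFT of the even-indexed samples
    H = fft_dit(x[1::2])    # N/2-point DFT of the odd-indexed samples
    X = [0] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * H[k]   # the single complex multiplication
        X[k] = G[k] + t             # uses W_N^k
        X[k + N // 2] = G[k] - t    # uses W_N^(k+N/2) = -W_N^k
    return X

X = fft_dit([1, 2, 3, 4])   # approximately [10, -2+2j, -2, -2-2j]
```

Each butterfly computes one product and reuses it with opposite signs, which is the symmetry property stated above.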

8-point FFT

Normal order, Bit-reversed order

In-Place Computation

[Figure: 8-point in-place FFT flow graph. The register array holds X0[000] … X0[111] at the input, X1[000] … X1[111] after Stage 1, X2[000] … X2[111] after Stage 2, and X3[000] … X3[111] after Stage 3.]

The same register array can be used in each stage

8-point FFT

Normal order, Bit-reversed order

Normal-Order Sorting vs. Bit-Reversed Sorting

[Figure: normal order compared with bit-reversed order]
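Bit-reversed ordering can be illustrated with a short Python sketch. The function name `bit_reverse` is hypothetical; it reverses the `bits`-bit binary representation of an index.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit of index into result
        index >>= 1
    return result

# Bit-reversed order for an 8-point FFT (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
```

For 8 points the normal order 0, 1, 2, …, 7 maps to 0, 4, 2, 6, 1, 5, 3, 7.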

DFT vs. Radix-2 FFT
• DFT: N² complex multiplications and N(N-1) complex additions
• Recall that each butterfly operation requires one complex multiplication and two complex additions
• FFT: (N/2) log2N complex multiplications and N log2N complex additions
• In-place computation: the input and the output nodes for each butterfly operation are horizontally adjacent, so only one storage array is required

Decimation in Frequency (DIF)
• Recall that the DFT is

$$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk},\quad 0 \le k \le N-1$$

• The DIT FFT algorithm is based on the decomposition of the DFT computations by forming small subsequences in the time domain index “n”: n=2ℓ or n=2ℓ+1
• One can consider dividing the output sequence X[k], in the frequency domain, into smaller subsequences: k=2r or k=2r+1

Substitution of variables

DIF FFT Algorithm (1)

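$$X[2r] = \sum_{n=0}^{N/2-1}\big(x[n] + x[n+N/2]\big)\,W_{N/2}^{rn}$$

is just an N/2-point DFT. Similarly,

$$X[2r+1] = \sum_{n=0}^{N/2-1}\big(x[n] - x[n+N/2]\big)\,W_N^{n}\,W_{N/2}^{rn}$$

DIF FFT Algorithm (2)

v=log2N stages, each stage has N/2 butterfly operations.

(N/2)log2N complex multiplications in total; N complex additions per stage (N log2N in total)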
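For comparison with the DIT structure, a recursive radix-2 DIF FFT can be sketched as follows. This is only an illustration under the assumption that the length is a power of 2; the name `fft_dif` is hypothetical, and the outputs are interleaved back into normal order instead of being left bit-reversed.

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT (len(x) must be a power of 2)."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # DIF butterfly: add and subtract first, then apply the twiddle factor W_N^n
    a = [x[n] + x[n + half] for n in range(half)]
    b = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N) for n in range(half)]
    X = [0] * N
    X[0::2] = fft_dif(a)   # even-indexed outputs X[2r]
    X[1::2] = fft_dif(b)   # odd-indexed outputs X[2r+1]
    return X

X = fft_dif([1, 2, 3, 4])   # approximately [10, -2+2j, -2, -2-2j]
```

The twiddle multiplication comes after the subtraction here, whereas in the DIT butterfly it comes before the add/subtract pair.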

Remarks
• The basic butterfly operations for DIT FFT and DIF FFT are a transposed-form pair
• The I/O values of DIT FFT and DIF FFT are the same
• Applying the transpose transform to each DIT FFT algorithm, one obtains the DIF FFT algorithm

[Figure: DIF BF unit, DIT BF unit]


Implementation Issues
• Radix-2, Radix-4, Radix-8, Split-Radix, Radix-2², …
• I/O indexing
• In-place computation
  – Bit-reversed sorting is necessary
  – Efficient use of memory
  – Random (not sequential) access of memory. An address generator unit is required.
  – Good for cascade form: FFT followed by IFFT (or vice versa)
    • E.g. fast convolution algorithm
• Twiddle factors
  – Look-up table
  – CORDIC rotator

FIR Filters

[Figure: FIR filtering in the time domain and in the transform domain]

Example: Linear Phase FIR

Linear phase FIR filter: approximately constant frequency-response magnitude and linear phase (constant group delay) in the pass-band

• Direct form: N multipliers, N-1 adders
• Exploiting coefficient symmetry: (N+1)/2 multipliers and N-1 adders if N is odd; N/2 multipliers and N-1 adders if N is even
• Exploiting substructure sharing reduces area
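The multiplier saving for a symmetric impulse response can be sketched in Python. This example assumes h[k] = h[N-1-k]; the function name `linear_phase_fir_output` and the `window` argument (the N most recent input samples) are hypothetical.

```python
def linear_phase_fir_output(h, window):
    """One output sample of a symmetric FIR filter, with a multiplication count.

    Assumes h[k] == h[N-1-k]; window holds the N input samples aligned with h.
    """
    N = len(h)
    acc = 0
    mults = 0
    for k in range(N // 2):
        # Pre-add the two samples that share a coefficient, then multiply once
        acc += h[k] * (window[k] + window[N - 1 - k])
        mults += 1
    if N % 2 == 1:
        acc += h[N // 2] * window[N // 2]   # unpaired center tap for odd N
        mults += 1
    return acc, mults

y_odd, m_odd = linear_phase_fir_output([1, 2, 3, 2, 1], [1, 1, 1, 1, 1])
y_even, m_even = linear_phase_fir_output([1, 2, 2, 1], [1, 2, 3, 4])
```

The odd case uses (N+1)/2 = 3 multiplications and the even case N/2 = 2, while the total number of additions stays at N-1.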

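An Efficient Decomposition
• Example: 2-fold decomposition

$$H(z) = h[0] + h[1]z^{-1} + h[2]z^{-2} + h[3]z^{-3} + h[4]z^{-4} + h[5]z^{-5} + h[6]z^{-6}$$
$$= \big(h[0] + h[2]z^{-2} + h[4]z^{-4} + h[6]z^{-6}\big) + z^{-1}\big(h[1] + h[3]z^{-2} + h[5]z^{-4}\big)$$
$$= H_0(z^2) + z^{-1}H_1(z^2)$$

• Example: 3-fold decomposition

$$H(z) = h[0] + h[1]z^{-1} + h[2]z^{-2} + h[3]z^{-3} + h[4]z^{-4} + h[5]z^{-5} + h[6]z^{-6}$$
$$= \big(h[0] + h[3]z^{-3} + h[6]z^{-6}\big) + z^{-1}\big(h[1] + h[4]z^{-3}\big) + z^{-2}\big(h[2] + h[5]z^{-3}\big)$$
$$= H_0(z^3) + z^{-1}H_1(z^3) + z^{-2}H_2(z^3)$$

• General case (N-fold decomposition)

$$H(z) = \sum_{k} h[k]\,z^{-k} = \sum_{l=0}^{N-1} z^{-l}\,H_l(z^N),\quad \text{where } H_l(z) = \sum_{k} h[Nk+l]\,z^{-k}$$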
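In code, the decomposition amounts to taking every M-th coefficient starting at offset l. The sketch below is illustrative; the name `polyphase` and the parameter `M` (the number of folds) are chosen for this example.

```python
def polyphase(h, M):
    """Split coefficients h into M sub-filters: H_l holds h[l], h[l+M], h[l+2M], ..."""
    return [h[l::M] for l in range(M)]

# Use the coefficient index as its value so the grouping is easy to read
h = [0, 1, 2, 3, 4, 5, 6]
two_fold = polyphase(h, 2)     # [[0, 2, 4, 6], [1, 3, 5]]
three_fold = polyphase(h, 3)   # [[0, 3, 6], [1, 4], [2, 5]]
```

The groupings match the 2-fold and 3-fold examples for the 7-tap filter above.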

Traditional Parallel Architecture
• 2-fold parallel architecture with N/2-tap sub-filters
  – 4(N/2) multiplications
  – 4(N/2-1)+2 additions

Traditional Parallel FIR

An L-parallel FIR filter built from sub-filters of length N/L requires
1. L²(N/L) multiplications, i.e. LN multiplications
2. L²(N/L-1) + L(L-1) additions, i.e. L(N-1) additions

~ LN multiply-add operations

Fast FIR Algorithm (FFA)
• First apply L-fold polyphase decomposition to H(z)
  – There are L sub-filters of length N/L
• Then apply the Winograd algorithm
  – The product of 2 polynomials of degree L-1 can be implemented by using 2L-1 product terms
  – Each product term is equivalent to a filtering operation in the block formulation
  – Consequently, it can be realized using approximately 2L-1 FIR filters of length N/L
  – It requires (2L-1)(N/L) = 2N - N/L multiplications (1.5N for L = 2)

FIR Using Polyphase Decomposition


2-Parallel Fast FIR Filter

2-Parallel FFA
• It requires 3 distinct sub-filters of length N/2 and 4 pre/post-processing additions
• In total, it requires 1.5N multiplications and 3(N/2-1)+4 = 1.5N+1 additions
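A minimal Python sketch of the 2-parallel FFA follows, assuming the usual formulation in which the odd output uses (H0+H1)(X0+X1) - H0X0 - H1X1 instead of two separate sub-filters. The function names are hypothetical, and both input lengths are assumed to be even.

```python
def conv(a, b):
    """Direct linear convolution, used here as the sub-filter operation."""
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            y[i + k] += ai * bk
    return y

def ffa_2_parallel(x, h):
    """2-parallel fast FIR: 3 sub-filters of length N/2 (len(x), len(h) even)."""
    x0, x1 = x[0::2], x[1::2]          # even and odd input samples
    h0, h1 = h[0::2], h[1::2]          # polyphase components H0, H1
    u = conv(h0, x0)                   # sub-filter 1: H0 X0
    v = conv(h1, x1)                   # sub-filter 2: H1 X1
    w = conv([p + q for p, q in zip(h0, h1)],
             [p + q for p, q in zip(x0, x1)])          # sub-filter 3: (H0+H1)(X0+X1)
    y0 = [p + q for p, q in zip(u + [0], [0] + v)]     # even outputs: H0X0 + delayed H1X1
    y1 = [c - p - q for p, q, c in zip(u, v, w)]       # odd outputs: H0X1 + H1X0
    y = [0] * (2 * len(y0))
    y[0::2] = y0
    y[1::2] = y1 + [0]
    return y[:len(x) + len(h) - 1]

x = [1, 2, 3, 4, 5, 6]
h = [1, -2, 3, 4]
y_ffa = ffa_2_parallel(x, h)
y_ref = conv(x, h)    # direct result for comparison
```

The three `conv` calls on half-length sequences correspond to the 3 sub-filters of length N/2; the remaining work is pre-additions and post-additions.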
