etd_0728106_120055

7/27/2019 etd_0728106_120055


AN EFFICIENT FFT/IFFT COMPILER FOR

DIFFERENT APPLICATIONS

Department of Electrical Engineering

National Cheng Kung University

Tainan, Taiwan, R.O.C.

Thesis for Master of Science

July, 2006


An Efficient FFT/IFFT Compiler for

Different Applications

by

Sheng-Hsien Huang

A thesis submitted to the graduate division in partial

fulfillment of the requirement for the degree of Master of Science

at

National Cheng Kung University

Tainan, Taiwan, Republic of China

July 2006

Approved by


An Efficient FFT/IFFT Compiler for

Different Applications

Sheng-Hsien Huang¹  Ming-Der Shieh²

Department of Electrical Engineering

National Cheng Kung University

Tainan, Taiwan, Republic of China

ABSTRACT

With the emergence of Internet services and the spread of communication applications, Internet and wireless communication have become a part of our lives. Wireless networks can reach places that traditional networks cannot, which makes their applications more widespread. To date, different wireless LAN standards have emerged for different applications, and each standard has its own characteristics and application range.

Among existing digital communication systems, the Orthogonal Frequency Division Multiplexing (OFDM) technique is widely used for signal modulation. In contrast to the traditional single-carrier modulation technique, OFDM uses multi-carrier modulation, which has been adopted in many practical wireless systems such as 802.11x, DAB and DVB. In hardware, OFDM can be realized by employing the (inverse) fast Fourier transform (FFT/IFFT). Different standards or operation modes imply different requirements or specifications for the associated FFT/IFFT; therefore, different design methodologies should be applied, and designing a unified FFT/IFFT architecture is a challenge. In this thesis, we investigate how to develop an efficient FFT/IFFT compiler for different applications. With this compiler, a dedicated FFT/IFFT module can be easily prototyped for fast system verification, and the compiler itself can serve as a basis for more advanced research in the future.

¹ The Author

² The Advisor


TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
    1.1 FFT Overview
    1.2 Motivation
    1.3 Organization of this Thesis
Chapter 2 FFT Algorithm
    2.1 General Algorithm
        2.1.1 Decimation-In-Time (DIT) FFT Algorithms
        2.1.2 Decimation-In-Frequency (DIF) FFT Algorithms
    2.2 High-Radix Algorithm
        2.2.1 Radix-4 DIF FFT Algorithm
        2.2.2 Radix-8 DIF FFT Algorithm
    2.3 Split-Radix DIF FFT Algorithms
        2.3.1 Radix-2/4 DIF FFT Algorithm
        2.3.2 Radix-2/8 DIF FFT Algorithm
    2.4 Complexity Analysis
Chapter 3 FFT/IFFT Architecture
    3.1 Pipelined FFT Architecture
        3.1.1 Radix-2 Single-path Delay Feedback (R2SDF)
        3.1.2 Radix-2 Multi-path Delay Commutator (R2MDC)
    3.2 Memory-Based FFT Architecture
        3.2.1 Ping-Pong Mode of the Memory Management Strategy
        3.2.2 In-Place Mode of the Memory Management Strategy
    3.3 A Unified FFT/IFFT Architecture
Chapter 4 FFT/IFFT Compiler Design
    4.1 Implementation Strategy
        4.1.1 Parametrizable Architecture
        4.1.2 Parametrizable Memory Access
    4.2 Building Block
    4.3 FFT/IFFT Compiler Flow
    4.4 Specification
        4.4.1 Specification of the 128-Point R2³SDF
        4.4.2 Specification of the 128-Point R2³MDC
        4.4.3 Specification of the 128-Point Memory-Based FFT Architecture
        4.4.4 Synthesis Result
        4.4.5 Analysis of Suitable Applications
Chapter 5 Verification and Performance
    5.1 Cost Function and Derivation
    5.2 C Simulation Model
    5.3 Verification Plan
    5.4 Performance Evaluation
Chapter 6 Conclusions and Future Work
    6.1 Conclusions
    6.2 Future Work
References

LIST OF TABLES

Table 1.1: FFT/IFFT size for OFDM-based communication system
Table 2.1: Complexity analysis of twiddle factor for radix-4 DIF FFT algorithm
Table 2.2: Complexity analysis of twiddle factor for radix-8 DIF FFT algorithm
Table 2.3: Complex multiplications required for radix-2, radix-4 and radix-2/4 FFT algorithms
Table 2.4: Complex multiplications required for radix-8 and radix-2/8 FFT
Table 3.1: Comparison of single butterfly and fully spread architecture
Table 3.2: Analysis of different pipelined architectures using different radix
Table 3.3: Comparison of hardware utilization
Table 3.4: The pairs of butterfly inputs at each time stage
Table 3.5: The relation of data accesses at different time stage of 8-point radix-2 FFT
Table 4.1: Analyze the number of complex multipliers in SDF with different radix
Table 4.2: Analyze the number of complex multipliers in MDC with different radix
Table 4.3: ROM content in each stage for an 8-point FFT
Table 4.4: Comparison between constant multiplier and complex multiplier of T I and DW
Table 4.5: Parameters information of the FFT/IFFT Compiler
Table 4.6: I/O ports of R2³SDF
Table 4.7: I/O ports of R2³MDC
Table 4.8: I/O ports of memory-based FFT architecture
Table 4.9: Gate count of the R2³SDF at 128-point FFT/IFFT
Table 4.10: Power consumption of the R2³SDF at 128-point FFT (mW)
Table 4.11: Gate count of the R2³MDC at 128-point FFT/IFFT
Table 4.12: Power consumption of the R2³MDC at 128-point FFT (mW)
Table 4.13: Power consumption of the R2³SDF at 128-point FFT (mW)
Table 4.14: FFT/IFFT size for OFDM-based communication system
Table 4.15: FFT/IFFT size and throughput rate for OFDM-based communication system
Table 5.1: ASIC synthesis result at clock frequency of 132 MHz

LIST OF FIGURES

Fig. 1.1 The twiddle factor $W_N^{nk}$ of FFT in the unit circle
Fig. 1.2 OFDM transceiver block diagram
Fig. 2.1 Signal flow graph of the decimation-in-time decomposition of an N-point DFT (N = 8) computation into two (N/2)-point DFT computations
Fig. 2.2 Signal flow graph of the decimation-in-time decomposition of two (N/2)-point DFT (N = 8) computation into four (N/4)-point DFT computations
Fig. 2.3 Signal flow graph of a 2-point DFT
Fig. 2.4 Signal flow graph of an 8-point DIT FFT
Fig. 2.5 Signal flow graph of radix-2 butterfly computation in Fig. 2.4
Fig. 2.6 Signal flow graph of simplified butterfly computation with only one complex multiplication
Fig. 2.7 Signal flow graph of 8-point DFT using the butterfly computation of Fig. 2.6
Fig. 2.8 Signal flow graph of DIF decomposition of an N-point DFT computation into two (N/2)-point DFT computations (N = 8)
Fig. 2.9 Signal flow graph of decimation-in-frequency decomposition of an 8-point DFT computation into four 2-point DFT computations
Fig. 2.10 Signal flow graph of a typical 2-point DFT at the last stage decomposition
Fig. 2.11 Signal flow graph of complete DIF decomposition of an 8-point DFT computation
Fig. 2.12 [first line of caption lost in extraction] computation
Fig. 2.13 Radix-4 butterfly
Fig. 2.14 16-point radix-4 DIF FFT
Fig. 2.15 Radix-8 butterfly
Fig. 2.16 Signal flow graph of 16-point radix-8 DIF FFT
Fig. 2.17 Radix-2/4 butterfly
Fig. 2.18 A sketch map of 16-point radix-2/4 DIF FFT
Fig. 2.19 Signal flow graph of 16-point radix-2/4 DIF FFT
Fig. 2.20 Radix-2/8 butterfly
Fig. 2.21 A sketch map of 16-point radix-2/8 DIF FFT
Fig. 2.22 Signal flow graph of 16-point radix-2/8 DIF FFT
Fig. 2.23 Signal flow graph of 32-point radix-4 DIF FFT based on radix-2 decomposition in the first stage and two radix-4 stages in the next stage
Fig. 2.24 Signal flow graph of 32-point radix-4 DIF FFT based on two radix-4 stages and one radix-2 stage
Fig. 3.1 Two extreme methods of implementing the FFT algorithm
Fig. 3.2 Basic framework of pipelined FFT
Fig. 3.3 R2SDF N=16 (Radix-2 Single-path Delay Feedback)
Fig. 3.4 Unfolded delay elements of R2SDF
Fig. 3.5 Two modes of the radix-2 butterfly module
Fig. 3.6 R2SDF (N=8)
Fig. 3.7 R2SDF (N=8) data stream flow
Fig. 3.8 (a) Relation between delay elements and butterfly operation modes in each stage, (b) control of twiddle factor in each stage
Fig. 3.9 R2MDC N=16 (Radix-2 Multi-path Delay Commutator)
Fig. 3.10 Unfolded delay elements of R2MDC
Fig. 3.11 Radix-2 butterfly module
Fig. 3.12 Two modes of the radix-2 commutator module
Fig. 3.13 R2MDC (N=8)
Fig. 3.14 R2MDC (N=8) data stream flow
Fig. 3.15 (a) Relation between delay elements and commutator modes in each stage, (b) control of twiddle factor in each stage
Fig. 3.16 Total DE numbers of R2MDC
Fig. 3.17 Operation flow of single PE
Fig. 3.18 Single PE FFT processor diagram
Fig. 3.19 Ping-pong mode architecture
Fig. 3.20 Partial data processing flow of 8-point FFT
Fig. 3.21 The conflict graph and memory partition; (a) the colored conflict graph based on the radix-2 butterfly unit, (b) the 2-bank memory arrangement
Fig. 4.1 Relationship between the PE number and different radix algorithm
Fig. 4.2 R2³SDF PE in radix-8 butterfly
Fig. 4.3 Different radix SDF architectures for performing 16-point DFT
Fig. 4.4 R2MDC N=128
Fig. 4.5 R2²MDC N=128
Fig. 4.6 R2³MDC N=128
Fig. 4.7 Signal data flow of 8-point DIF FFT
Fig. 4.8 Relation between butterfly counts and ROM addresses in each time stage
Fig. 4.9 Architecture of the address generator
Fig. 4.10 Radix-2 butterfly module with mode selection
Fig. 4.11 A simplified complex multiplication with $W_8^1$
Fig. 4.12 Real multiplication without multipliers
Fig. 4.13 Complex multiplier architecture using unsigned multipliers
Fig. 4.14 Complex multiplier architecture using signed multipliers
Fig. 4.15 Provided design model
Fig. 4.16 SDF of 128-point and R2³SDF block diagram
Fig. 4.17 128-point R2³SDF timing diagram
Fig. 4.18 128-point R2³SDF block diagram with pipelined register and control circuit
Fig. 4.19 128-point R2³SDF timing diagram with pipeline
Fig. 4.20 128-point R2³SDF timing diagram with I/O information
Fig. 4.21 Block diagram of 128-point R2³MDC
Fig. 4.22 Timing diagram of 128-point R2³MDC
Fig. 4.23 Block diagram of 128-point R2³MDC with pipelined and control circuit
Fig. 4.24 Timing diagram of 128-point R2³MDC with pipelined register
Fig. 4.25 128-point R2³MDC timing diagram with I/O information
Fig. 4.26 Block diagram of 128-point memory-based architecture
Fig. 4.27 Timing diagram of a 128-point memory-based architecture
Fig. 4.28 Execution flow of the provided pipelined FFT architecture
Fig. 4.29 Execution flow of the provided memory-based FFT architecture
Fig. 5.1 Choosing an architecture based on the specified throughput rate
Fig. 5.2 Radix-8 butterfly
Fig. 5.3 Verification plan
Fig. 5.4 SNR curve in the 128-point FFT

Chapter 1

Introduction

The Discrete Fourier Transform (DFT) plays a key role in digital signal processing, in areas such as radar processing, spectral analysis, frequency-domain filtering, and polyphase transformations, and is an important component in many practical applications of discrete-time systems. The possibility of greatly reduced computation was generally overlooked until 1965, when Cooley and Tukey published an algorithm [1] for computing the DFT that is applicable when N is a composite number. Their paper touched off a flurry of activity in the application of the DFT to signal processing and resulted in the discovery of a number of highly efficient computational algorithms. Collectively, the entire set of such algorithms has come to be known as the Fast Fourier Transform, or the FFT.

1.1 FFT Overview

The DFT of a finite-length sequence of length N is

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1.1}$$

where $W_N^{nk} = e^{-j(2\pi/N)nk}$. The inverse discrete Fourier transform is given by

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{1.2}$$

Computing all N values of the N-point DFT directly therefore requires a total of $N^2$ complex multiplications and $N(N-1)$ complex additions.
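As an illustration of this operation count, equations (1.1) and (1.2) can be evaluated literally. The following is a minimal Python sketch, not code from the thesis; the names `direct_dft` and `direct_idft` are hypothetical, and the multiplication counter simply counts every product in the double loop, including trivial ones by 1.

```python
import cmath

def direct_dft(x):
    """Evaluate equation (1.1) term by term; return (X, complex multiplication count)."""
    N = len(x)
    mults = 0
    X = []
    for k in range(N):
        acc = 0j
        for n in range(N):
            w = cmath.exp(-2j * cmath.pi * n * k / N)  # twiddle factor W_N^{nk}
            acc += x[n] * w
            mults += 1  # one complex multiplication per (n, k) pair
        X.append(acc)
    return X, mults

def direct_idft(X):
    """Evaluate equation (1.2): opposite exponent sign and a 1/N scale factor."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
        for n in range(N)
    ]

x = [1, 2, 3, 4, 0, 0, 0, 0]
X, mults = direct_dft(x)
print(mults)            # N^2 = 64 for N = 8
print(direct_idft(X))   # recovers x up to rounding error
```

The quadratic growth of `mults` with N is what the FFT algorithms of Chapter 2 are designed to avoid.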

Most approaches to improving the efficiency of the computation of the DFT exploit the symmetry, periodicity, compressibility and expansibility properties of $W_N^{nk}$, listed below.

1. $W_N^{-nk} = (W_N^{nk})^* = W_N^{n(N-k)}$ (symmetry)

2. $W_N^{nk} = W_N^{(n+N)k} = W_N^{n(k+N)}$ (periodicity)

3. $W_N^{nk} = W_{mN}^{mnk}$, $W_N^{nk} = W_{N/m}^{nk/m}$ (compressibility and expansibility)

The values of the twiddle factor $W_N^{nk}$ can be read conveniently from Fig. 1.1.

[Figure: twiddle factor $W_N^{nk} = e^{-j(2\pi/N)nk}$ of the FFT plotted on the unit circle]

Fig. 1.1 The twiddle factor $W_N^{nk}$ of FFT in the unit circle.

By using Fig. 1.1, we can find the symmetry of the twiddle factor as

$$W_N^{nk} = -W_N^{nk+N/2},$$

$$W_N^{nk+N/4} = -W_N^{nk+3N/4} = -j\,W_N^{nk},$$

$$W_8^1 = -W_8^5 = \frac{\sqrt{2}}{2}(1-j), \quad \text{and}$$

$$W_8^3 = -W_8^7 = -\frac{\sqrt{2}}{2}(1+j).$$
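These unit-circle symmetries are easy to confirm numerically. The sketch below is illustrative only: the helper names `twiddle`, `close` and `check_symmetries` are assumptions, and N is assumed to be a multiple of 4 so that N/4 and 3N/4 are integer exponents.

```python
import cmath
import math

def twiddle(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def close(a, b, tol=1e-12):
    """Compare two complex numbers up to floating-point rounding."""
    return abs(a - b) < tol

def check_symmetries(N):
    """Check the half-turn and quarter-turn symmetries for every exponent e = nk mod N."""
    for e in range(N):
        w = twiddle(N, e)
        if not close(w, -twiddle(N, e + N // 2)):            # W^e = -W^{e+N/2}
            return False
        if not close(twiddle(N, e + N // 4), -1j * w):       # W^{e+N/4} = -j W^e
            return False
        if not close(twiddle(N, e + N // 4), -twiddle(N, e + 3 * N // 4)):
            return False
    return True

r = math.sqrt(2) / 2
print(check_symmetries(8), check_symmetries(128))
print(close(twiddle(8, 1), r * (1 - 1j)), close(twiddle(8, 3), -r * (1 + 1j)))
```

In hardware terms, these relations mean that only a fraction of the unit circle needs to be stored in a twiddle-factor ROM; the remaining values follow by negation or by swapping real and imaginary parts.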

[Figure: OFDM transceiver block diagram. Recoverable block labels: serial data input, signal generator, receive filter, serial data output]

Fig. 1.2 OFDM transceiver block diagram

Different wireless communication standards impose different specifications on their target applications. Moreover, even a single digital communication system may have different operation modes. Table 1.1 lists the FFT/IFFT sizes for several existing communication systems. The range of sizes in this table shows that designing a unified FFT/IFFT architecture is a challenge. In this thesis, we investigate how to develop an efficient FFT/IFFT compiler for different applications. With this compiler, a dedicated FFT/IFFT module can be easily prototyped for fast system verification, and the compiler itself can serve as a basis for more advanced research in the future.

Table 1.1: FFT/IFFT size for OFDM-based communication system

FFT/IFFT size    Communication systems
8192             DVB-T, VDSL
4096             DVB-H, VDSL
2048             DVB-T, DAB, VDSL
1024             DAB, VDSL
512

1.3 Organization of this Thesis

This thesis is organized as follows:

• Chapter 1 introduces the FFT and the motivation.

• Chapter 2 reviews the general, high-radix, and split-radix FFT algorithms, and discusses their different objectives.

• Chapter 3 discusses different implementations of the FFT algorithm.

• Chapter 4 shows the methodology of the FFT/IFFT compiler design.

• Chapter 5 describes the cost function, verification plan and performance evaluation.

• Chapter 6 presents concluding remarks.

Chapter 2

FFT Algorithm

The Cooley-Tukey FFT algorithm is very popular because it reduces the computational complexity from $O(N^2)$ to $O(N \log_2 N)$, and its regularity makes it suitable for VLSI implementation. To further reduce the computational complexity, high-radix and split-radix versions have been proposed. In general, all of these algorithms recursively decompose a length-$N$ ($N = 2^v$) FFT into an odd half and an even half, and reduce the number of complex multiplications by using the symmetry properties of the FFT kernel. The high-radix FFT algorithms such as radix-4 and radix-8 [2] substantially reduce the number of arithmetic operations and data transfers compared to the general FFT algorithm [3]. The split-radix FFT algorithms such as radix-2/4 [4] and radix-2/8 [5] are the best in terms of multiplicative complexity for an N-point FFT when the multiplications by $\pm 1$ and $\pm j$ are skipped, but they are inherently irregular because radix-2 stages are used for the even-half components and radix-4 or radix-8 stages are used for the odd-half components, which results in an L-shaped butterfly unit. Due to this irregularity of the butterfly unit, it is hard to design regular and modular pipelined hardware for the split-radix algorithm.

2.1 General Algorithm

The DFT of a finite-length sequence of length N is

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{2.1}$$

where $W_N^{nk} = e^{-j(2\pi/N)nk}$. The inverse discrete Fourier transform is given by

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{2.2}$$

In equations (2.1) and (2.2), both X[k] and x[n] may be complex. The expressions on the right-hand sides of those equations differ only in the sign of the exponent of $W_N$ and in the scale factor 1/N.

In computing the DFT, dramatic efficiency results from decomposing the computation into successively smaller DFT computations. We employ the symmetry, periodicity, compressibility and expansibility of the complex exponential $W_N^{nk} = e^{-j(2\pi/N)nk}$. Algorithms in which the decomposition is based on decomposing the input sequence x[n] into successively smaller subsequences are called decimation-in-time FFT algorithms. Alternatively, we can divide the output sequence X[k] into smaller and smaller subsequences in the same manner; FFT algorithms based on this procedure are commonly called decimation-in-frequency FFT algorithms.

2.1.1 Decimation-In-Time (DIT) FFT Algorithms

The principle of the DIT FFT algorithm is most conveniently illustrated by considering the special case of N an integer power of 2, such as $N = 2^v$. Since N is an even integer, we can consider computing X[k] by separating x[n] into two (N/2)-point sequences consisting of the even-numbered points in x[n] and the odd-numbered points in x[n]. With X[k] given by equation (2.1) and separating x[n] into its even- and odd-numbered points, we obtain

$$X[k] = \sum_{n\ \mathrm{even}} x[n]\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x[n]\, W_N^{nk}, \tag{2.3}$$

or, with the substitution of variables n = 2r for the even part and n = 2r + 1 for the odd part,

$$\begin{aligned} X[k] &= \sum_{r=0}^{(N/2)-1} x[2r]\, W_N^{2rk} + \sum_{r=0}^{(N/2)-1} x[2r+1]\, W_N^{(2r+1)k} \\ &= \sum_{r=0}^{(N/2)-1} x[2r]\, (W_N^2)^{rk} + W_N^k \sum_{r=0}^{(N/2)-1} x[2r+1]\, (W_N^2)^{rk}. \end{aligned} \tag{2.4}$$

By the compressibility property, $W_N^2 = W_{N/2}$, since

$$W_N^2 = e^{-2j(2\pi/N)} = e^{-j2\pi/(N/2)} = W_{N/2}. \tag{2.5}$$

Consequently, equation (2.4) can be rewritten as

$$X[k] = \sum_{r=0}^{(N/2)-1} x[2r]\, W_{N/2}^{rk} + W_N^k \sum_{r=0}^{(N/2)-1} x[2r+1]\, W_{N/2}^{rk}, \quad k = 0, 1, \ldots, N-1. \tag{2.6}$$

Each of the sums in equation (2.6) is recognized as an (N/2)-point DFT, the first sum being the (N/2)-point DFT of the even-numbered points of the original sequence and the second being the (N/2)-point DFT of the odd-numbered points of the original sequence; only the DFT of the odd-numbered points is multiplied by $W_N^k$. Fig. 2.1 depicts this computation for N = 8.

In the same way, each (N/2)-point DFT can be decomposed into even and odd parts, giving two (N/4)-point DFTs; only the odd part of each (N/4)-point DFT is multiplied by $W_{N/2}^k = W_N^{2k}$, using the fact that $W_{N/2} = W_N^2$. Inserting this decomposition into the signal flow graph of Fig. 2.1, we obtain the signal flow graph of Fig. 2.2.

[Figure: signal flow graph with inputs in the order x[0], x[2], x[4], x[6], x[1], x[3], x[5], x[7]]

Fig. 2.1 Signal flow graph of the decimation-in-time decomposition of an N-point DFT (N = 8) computation into two (N/2)-point DFT computations.

Fig. 2.2 Signal flow graph of the decimation-in-time decomposition of two (N/2)-point DFT (N = 8) computation into four (N/4)-point DFT computations.

For the 8-point DFT that we have been using as an illustration, the computation has been reduced to a computation of 2-point DFTs. The 2-point DFT of the sequence consisting of x[0] and x[4] is depicted in Fig. 2.3. With the computation of Fig. 2.3 inserted in the signal flow graph of Fig. 2.2, we obtain the complete flow graph for computation of the 8-point DFT, as shown in Fig. 2.4.
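The recursive even/odd split of equation (2.6) can be sketched in a few lines of Python. This is an illustrative software model under stated assumptions, not the hardware design of the thesis: the function names `fft_dit` and `direct_dft` are hypothetical, the input length is assumed to be a power of 2, and the index `k % half` relies on the fact that an (N/2)-point DFT is periodic in k with period N/2.

```python
import cmath

def direct_dft(x):
    """Reference: literal evaluation of equation (2.1)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]

def fft_dit(x):
    """Radix-2 decimation-in-time FFT following equation (2.6)."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    if N % 2:
        raise ValueError("length must be a power of 2")
    G = fft_dit(x[0::2])   # (N/2)-point DFT of the even-numbered points
    H = fft_dit(x[1::2])   # (N/2)-point DFT of the odd-numbered points
    half = N // 2
    # X[k] = G[k] + W_N^k * H[k], with G and H periodic in k with period N/2
    return [
        G[k % half] + cmath.exp(-2j * cmath.pi * k / N) * H[k % half]
        for k in range(N)
    ]

x = [1, 2, 3, 4, 5, 6, 7, 8]
print(max(abs(a - b) for a, b in zip(fft_dit(x), direct_dft(x))))  # rounding-level error
```

For N = 8 the recursion bottoms out after three levels, matching the three stages of the flow graph in Fig. 2.4.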

[Figure: 2-point DFT flow graph with branch gains $W_N^0 = 1$ and $W_2^1 = W_N^{N/2} = -1$]

Fig. 2.3 Signal flow graph of a 2-point DFT.

For the more general case, we would proceed by decomposing the (N/4)-point transforms into (N/8)-point transforms, and continue until we were left with only 2-point transforms. This requires $v = \log_2 N$ stages of computation. If $N = 2^v$, the decomposition can be carried out at most $v = \log_2 N$ times, so that after carrying it out as many times as possible, the number of complex multiplications and additions is equal to $Nv = N \log_2 N$. This is the substantial computational saving that we previously indicated was possible.
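To make the saving concrete, the two counts can be tabulated for a few transform sizes. The sketch below only does the arithmetic; the helper name `operation_counts` is hypothetical, and the sizes are taken from Table 1.1 plus the 8-point and 128-point cases used as examples in this thesis.

```python
def operation_counts(N):
    """Return (N^2, N*log2(N)): direct DFT vs fully decomposed radix-2 FFT."""
    v = N.bit_length() - 1          # v = log2(N) when N is a power of 2
    if N < 2 or (1 << v) != N:
        raise ValueError("N must be a power of 2")
    return N * N, N * v

for N in (8, 128, 1024, 8192):
    direct, fft = operation_counts(N)
    print(f"N={N:5d}  direct={direct:9d}  fft={fft:7d}  ratio={direct / fft:.1f}")
```

For N = 8 this gives 64 against 24, and the ratio $N / \log_2 N$ grows quickly with the transform size.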

[Figure: radix-2 butterfly with inputs A and B, branch gains $W_N^p$ and $W_N^{p+N/2}$, and outputs $A' = A + B\,W_N^p$ and $B' = A + B\,W_N^{p+N/2}$]

Fig. 2.5 Signal flow graph of radix-2 butterfly computation in Fig. 2.4

Computation in the signal flow graph of Fig. 2.4 can be reduced further by using the properties of the coefficients $W_N^p$. We first note that, in proceeding from one stage to the next in Fig. 2.4, the basic computation is in the form of Fig. 2.5; this elementary computation is called a radix-2 butterfly. Since

$$W_N^{N/2} = e^{-j(2\pi/N)N/2} = e^{-j\pi} = -1, \tag{2.7}$$

the factor $W_N^{p+N/2}$ can be written as

$$W_N^{p+N/2} = W_N^{N/2}\, W_N^p = -W_N^p. \tag{2.8}$$

With this observation, the butterfly computation of Fig. 2.5 can be simplified to the form shown in Fig. 2.6, which requires only one complex multiplication instead of two. Using the basic signal flow graph of Fig. 2.6 as a replacement for butterflies of the form of Fig. 2.5, we obtain the signal flow graph of Fig. 2.7 from Fig. 2.4. In particular, the number of complex multiplications has been reduced by a factor of 2 over the number in Fig. 2.4.

Fig. 2.6 Signal flow graph of simplified butterfly computation with only one complex multiplication.

Fig. 2.7 Signal flow graph of 8-point DFT using the butterfly computation of Fig. 2.6
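The equivalence of the two butterfly forms follows directly from equation (2.8) and can be checked numerically. The following is a minimal sketch; the function names and the exponent symbol `p` are illustrative choices, not identifiers from the thesis.

```python
import cmath

def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def butterfly_two_mults(a, b, p, N):
    """Butterfly of Fig. 2.5: two complex multiplications."""
    return a + b * W(N, p), a + b * W(N, p + N // 2)

def butterfly_one_mult(a, b, p, N):
    """Butterfly of Fig. 2.6: W_N^{p+N/2} = -W_N^p, so one product is shared."""
    t = b * W(N, p)          # the only complex multiplication
    return a + t, a - t

a, b, N = 1 + 2j, 3 - 1j, 8
for p in range(N // 2):
    u = butterfly_two_mults(a, b, p, N)
    v = butterfly_one_mult(a, b, p, N)
    print(p, abs(u[0] - v[0]) < 1e-12, abs(u[1] - v[1]) < 1e-12)
```

Sharing the product `t` between the sum and the difference is what halves the multiplier count per butterfly.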

2.1.2 Decimation-In-Frequency (DIF) FFT Algorithms

Alternatively, we can partition the output sequence X[k] in the frequency domain into smaller and smaller subsequences in the same manner. FFT algorithms based on this process are commonly called decimation-in-frequency (DIF) FFT algorithms.

To develop these FFT algorithms, let us again restrict the discussion to N a power of 2 and consider computing separately the even-numbered frequency samples and the odd-numbered frequency samples. Since

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{2.9}$$

the even-numbered frequency samples are

$$X[2r] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(2r)}, \quad r = 0, 1, \ldots, (N/2)-1, \tag{2.10}$$

which can be written as

$$X[2r] = \sum_{n=0}^{(N/2)-1} x[n]\, W_N^{2nr} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{2nr}. \tag{2.11}$$

With a substitution of variables in the second summation in equation (2.11), we obtain

$$X[2r] = \sum_{n=0}^{(N/2)-1} x[n]\, W_N^{2nr} + \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\, W_N^{2r[n+(N/2)]}. \tag{2.12}$$

Because of the periodicity of $W_N^{2rn}$,

$$W_N^{2r[n+(N/2)]} = W_N^{2rn}\, W_N^{rN} = W_N^{2rn}, \tag{2.13}$$

and since $W_N^2 = W_{N/2}$, equation (2.12) can be expressed as

$$X[2r] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1. \tag{2.14}$$

Equation (2.14) is the (N/2)-point DFT of the (N/2)-point sequence obtained by adding the first half and the last half of the input sequence. Adding the two halves of the input sequence represents time aliasing, consistent with the fact that in computing only the even-numbered frequency samples, we are under-sampling the Fourier transform of x[n].

We con now consider obtaining the odd-numbered frequency points, given by

$$X[2r+1] = \sum_{n=0}^{N-1} x[n] W_N^{n(2r+1)}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.15)$$

As before, we can describe this as

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} x[n] W_N^{n(2r+1)} + \sum_{n=N/2}^{N-1} x[n] W_N^{n(2r+1)}. \qquad (2.16)$$

An alternative form for the second summation in equation (2.16) is


$$\sum_{n=N/2}^{N-1} x[n] W_N^{n(2r+1)} = \sum_{n=0}^{(N/2)-1} x[n+(N/2)] W_N^{[n+(N/2)](2r+1)}$$

$$= W_N^{(N/2)(2r+1)} \sum_{n=0}^{(N/2)-1} x[n+(N/2)] W_N^{n(2r+1)} \qquad (2.17)$$

$$= -\sum_{n=0}^{(N/2)-1} x[n+(N/2)] W_N^{n(2r+1)},$$

where we have employed the fact that $W_N^{(N/2)2r} = 1$ and $W_N^{N/2} = -1$. Substituting equation (2.17) into equation (2.16) and combining the two summations, we obtain

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \big(x[n] - x[n+(N/2)]\big) W_N^{n(2r+1)}, \qquad (2.18)$$

or, since $W_N^{2} = W_{N/2}$,

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \big(x[n] - x[n+(N/2)]\big) W_N^{n} W_{N/2}^{nr}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.19)$$

Equation (2.19) is the (N/2)-point DFT of the sequence obtained by subtracting the second half of the input sequence from the first half and multiplying the resulting sequence by $W_N^{n}$. On the basis of equations (2.14) and (2.19), with g[n] = x[n] + x[n+N/2] and h[n] = x[n] - x[n+N/2], the DFT can be computed by first forming the sequences g[n] and h[n], then computing $h[n]W_N^{n}$, and finally computing the (N/2)-point DFTs of these two sequences to obtain the even-numbered output points and the odd-numbered output points, respectively. The procedure suggested by equations (2.14) and (2.19) is illustrated for the case of an 8-point DFT in Fig. 2.8.
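The recursion described by equations (2.14) and (2.19) can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design of this thesis; the function names `dif_fft` and `direct_dft` are chosen here for the example, and the input length is assumed to be a power of 2.

```python
import cmath


def direct_dft(x):
    """Direct N^2 evaluation of equation (2.9), used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # g[n] = x[n] + x[n + N/2] feeds the even-numbered outputs, eq. (2.14)
    g = [x[i] + x[i + half] for i in range(half)]
    # h[n] * W_N^n = (x[n] - x[n + N/2]) * W_N^n feeds the odd outputs, eq. (2.19)
    h = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
         for i in range(half)]
    out = [0j] * n
    out[0::2] = dif_fft(g)  # X[2r]
    out[1::2] = dif_fft(h)  # X[2r + 1]
    return out
```

The sketch interleaves the two half-size results so that the output is in normal order; the signal flow graphs in this chapter instead leave the output in bit-reversed order, which avoids the reordering in hardware.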


Fig. 2.8 Signal flow graph of DIF decomposition of an N-point DFT computation into two (N/2)-point DFT computations (N = 8).

The (N/2)-point DFTs can in turn be computed by computing the even-numbered and odd-numbered output points for those DFTs separately. As in the case of the procedure leading to equations (2.14) and (2.19), this is accomplished by combining the first half and the last half of the input points for each of the (N/2)-point DFTs and then computing (N/4)-point DFTs. The signal flow graph resulting from taking this step for the 8-point example is shown in Fig. 2.9.

Fig. 2.9 Signal flow graph of decimation-in-frequency decomposition of an 8-point DFT computation into four 2-point DFT computations.


For the 8-point example, the computation has now been reduced to the computation of 2-point DFTs, which are implemented by adding and subtracting the input points, as discussed previously. Thus, the 2-point DFTs in Fig. 2.9 can be replaced by the computation shown in Fig. 2.10, so the computation of the 8-point DFT and 16-point DFT can be accomplished by the algorithm depicted in Fig. 2.11 and Fig. 2.12.

Fig. 2.10 Signal flow graph of a typical 2-point DFT at the last stage of the decomposition.

By counting the arithmetic operations in Fig. 2.11 and generalizing to $N = 2^v$, we observe that the computation of Fig. 2.11 requires $(N/2)\log_2 N$ complex multiplications and $N\log_2 N$ complex additions. Thus, the total number of computations is the same as for the decimation-in-time algorithms.


the output DFT in bit-reversed order. The signal flow graph previously shown in Fig. 2.7 begins with the input sequence in bit-reversed order and provides the output DFT in normal order.

2.2 High-Radix Algorithm

To further reduce the computational complexity, high-radix FFT algorithms such as radix-4 and radix-8 can be used. Compared with the radix-2 FFT algorithm, they not only reduce the number of arithmetic operations and data transfers, but also preserve the regular structure that makes pipelined hardware convenient to implement. Here, we consider the input sequence in normal order, based on the decimation-in-frequency (DIF) FFT algorithm.

2.2.1 Radix-4 DIF FFT Algorithm

To develop the radix-4 DIF FFT algorithm, let us again restrict the discussion to N a power of 4, i.e., $N = 4^v$, and consider computing separately the even-numbered frequency samples and the odd-numbered frequency samples from the radix-2 DIF FFT algorithm. Since

$$X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.21)$$

the even-numbered frequency samples are

$$X[2r] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big) W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.22)$$

and the odd-numbered frequency samples are

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \big(x[n] - x[n+(N/2)]\big) W_N^{n} W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.23)$$


Equations (2.22) and (2.23) are each separated further into even-numbered and odd-numbered frequency samples, which gives the radix-4 decomposition. Let r = 2s select the even part and r = 2s + 1 select the odd part. First, substituting r = 2s into equation (2.22) gives

$$X[4s] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big) W_{N/2}^{2sn}, \quad s = 0, 1, \ldots, (N/4)-1. \qquad (2.24)$$

Using the fact that $W_{N/2}^{2sn} = W_{N/4}^{sn}$ in equation (2.24), we obtain

$$X[4s] = \sum_{n=0}^{(N/4)-1} \big(x[n] + x[n+(N/2)]\big) W_{N/4}^{sn} + \sum_{n=N/4}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big) W_{N/4}^{sn}$$

$$= \sum_{n=0}^{(N/4)-1} \big(x[n] + x[n+(N/2)]\big) W_{N/4}^{sn} + \sum_{n=0}^{(N/4)-1} \big(x[n+(N/4)] + x[n+(3N/4)]\big) W_{N/4}^{s[n+(N/4)]}. \qquad (2.25)$$

Finally, because of the periodicity of $W_{N/4}^{sn}$,

$$W_{N/4}^{s[n+(N/4)]} = W_{N/4}^{sn} W_{N/4}^{s(N/4)} = W_{N/4}^{sn}, \qquad (2.26)$$

equation (2.25) can be expressed as

$$X[4s] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] + x[n+(N/2)]\big) + \big(x[n+(N/4)] + x[n+(3N/4)]\big)\Big] W_{N/4}^{sn}. \qquad (2.27)$$

Equation (2.27) is the (N/4)-point DFT of the (N/4)-point sequence obtained by adding the four quarters of the N-point input sequence. Adding the four parts of the input sequence represents time aliasing, consistent with the fact that in computing only the even part of the even-numbered frequency samples, we are under-sampling the Fourier transform of x[n].


2. Substituting r = 2s into (2.23) gives the even part of equation (2.23).

3. Substituting r = 2s + 1 into (2.23) gives the odd part of equation (2.23).

Then, we can obtain the other three equations as below:

$$X[4s+2] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] + x[n+(N/2)]\big) - \big(x[n+(N/4)] + x[n+(3N/4)]\big)\Big] W_N^{2n} W_{N/4}^{sn}, \qquad (2.28)$$

$$X[4s+1] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] - x[n+(N/2)]\big) - j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big] W_N^{n} W_{N/4}^{sn}, \qquad (2.29)$$

$$X[4s+3] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] - x[n+(N/2)]\big) + j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big] W_N^{3n} W_{N/4}^{sn}. \qquad (2.30)$$

Fig. 2.13 Radix-4 butterfly

We can observe from Fig. 2.13 that the twiddle factors, such as $W_N^{n}$, $W_N^{2n}$ and $W_N^{3n}$, are extracted only in stage 2, and stage 1 is only multiplied by the -j term. The "-j" complex computation in Fig. 2.13 only interchanges the real part and the imaginary part of the complex multiplicand and negates the real part of the complex multiplicand. Radix-4 decomposition reduces the number of multiplications from $N^2$ for direct implementation of the DFT to $N(\log_4 N - 1)$, which is also less than for radix-2 decomposition.
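As a cross-check of equations (2.27) to (2.30), the radix-4 butterfly can be modelled in Python and compared against a direct DFT. This is a sketch under the assumption that the input length is a power of 4; the names `radix4_dif_stage`, `radix4_dif_fft` and `direct_dft` are illustrative and do not come from the thesis.

```python
import cmath


def direct_dft(x):
    """Direct evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def radix4_dif_stage(x):
    """One radix-4 DIF stage: returns the four N/4-point sequences whose
    DFTs are X[4s], X[4s+1], X[4s+2] and X[4s+3]."""
    n = len(x)
    q = n // 4
    y0, y1, y2, y3 = [], [], [], []
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        s02, d02 = a + c, a - c  # x[n] +/- x[n + N/2]
        s13, d13 = b + d, b - d  # x[n + N/4] +/- x[n + 3N/4]
        w1 = cmath.exp(-2j * cmath.pi * i / n)  # W_N^n
        y0.append(s02 + s13)                    # eq. (2.27), no twiddle
        y2.append((s02 - s13) * w1 ** 2)        # eq. (2.28), W_N^{2n}
        y1.append((d02 - 1j * d13) * w1)        # eq. (2.29), W_N^{n}
        y3.append((d02 + 1j * d13) * w1 ** 3)   # eq. (2.30), W_N^{3n}
    return y0, y1, y2, y3


def radix4_dif_fft(x):
    """Recursive radix-4 DIF FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    out = [0j] * n
    for r, seq in enumerate(radix4_dif_stage(x)):
        out[r::4] = radix4_dif_fft(seq)  # X[4s + r]
    return out
```

In the sketch the multiplication by `1j` is written as an ordinary complex product for brevity; in hardware it reduces to swapping the real and imaginary parts with one sign change, which is why it is not counted as a multiplication.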


Fig. 2.14 16-point radix-4 DIF FFT

2.2.2 Radix-8 DIF FFT Algorithm

algorithm. Since

$$X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$


Fig. 2.18 A sketch map of 16-point radix-2/4 DIF FFT.


Fig. 2.19 Signal flow graph of 16-point radix-2/4 DIF FFT

The radix-2/4 decomposition extracts more -j terms than the radix-4 decomposition, but we can observe from Fig. 2.19 that -j and $W_N$ terms are mixed in the complex multiplications at time stage 2. In a pipelined hardware design, a stage that mixes -j and $W_N$ terms still needs a full complex multiplier to complete the computation, hence the radix-2/4 decomposition does not gain any reduction in a general pipelined hardware design.

2.3.2 Radix-2/8 DIF FFT Algorithm

To develop the radix-2/8 DIF FFT algorithm, let us again restrict the discussion to N a power of 2, i.e., $N = 2^v$, and consider computing separately the odd-numbered frequency samples from the radix-2/4 DIF FFT algorithm. Since


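$$X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.47)$$

the even-numbered frequency samples of the radix-2/4 decomposition are

$$X[2r] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big) W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.48)$$

and the odd-numbered frequency samples of the radix-2/4 decomposition are

$$X[4s+1] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] - x[n+(N/2)]\big) - j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big] W_N^{n} W_{N/4}^{sn}, \qquad (2.49)$$

$$X[4s+3] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] - x[n+(N/2)]\big) + j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big] W_N^{3n} W_{N/4}^{sn}. \qquad (2.50)$$

Next, we further separate equations (2.49) and (2.50) into even and odd parts. Substituting s = 2k and s = 2k + 1 into (2.49) and (2.50), we obtain

$$X[8k+1] = \sum_{n=0}^{(N/8)-1} \Big\{\Big[\big(x[n] - x[n+(N/2)]\big) - j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big] + W_8^{1}\Big[\big(x[n+(N/8)] - x[n+(5N/8)]\big) - j\big(x[n+(3N/8)] - x[n+(7N/8)]\big)\Big]\Big\} W_N^{n} W_{N/8}^{kn}. \qquad (2.51)$$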

(x[n]x[n+(N/2)D _

X[8k+5]:


Equations (2.48), (2.51), (2.52), (2.53) and (2.54) give the signal flow graph of the 8-point DFT shown in Fig. 2.20, which is called the radix-2/8 butterfly. Substituting Fig. 2.20 into a 16-point DFT gives the sketch of Fig. 2.21, which is drawn in detail in Fig. 2.22.


Fig. 2.21 A sketch map of 16-point radix-2/8 DIF FFT


Fig. 2.22 Signal flow graph of 16-point radix-2/8 DIF FFT

The radix-2/8 decomposition extracts more constant multiplications $W_8^{1}$ and $W_8^{3}$ than the radix-8 decomposition, but we can observe from Fig. 2.22 that the -j, $W_8^{1}$, $W_8^{3}$ and $W_N$ terms of the complex multiplication are mixed in the same time stage 2. In a general pipelined hardware design, a stage that mixes -j, $W_8^{1}$, $W_8^{3}$ and $W_N$ terms uses one complex multiplier to compute all of them, so the radix-2/8 decomposition does not gain any reduction in the pipelined hardware design.


2.4 Complexity Analysis

So far we have analyzed how the general algorithm, the high-radix algorithms and the split-radix algorithms reduce the number of complex multiplications. We can conclude that a higher-radix decomposition reduces the number of complex multipliers more. Consider an N-point DFT where N is not a power of 4, so that the radix-4 decomposition must use an extra radix-2 decomposition to complete the N-point DFT. Depending on whether the extra radix-2 decomposition is placed in the first time stage or the last time stage of the signal flow graph, the resulting reduction differs. For example, 32 is not a power of 4, and the FFT needs $\log_2 32 = 5$ time stages to perform a 32-point DFT. If we use the radix-4 algorithm to decompose the 32-point DFT, one time stage remains. Adding an extra radix-2 decomposition in the first or last time stage of the signal flow graph covers the remaining stage. In Fig. 2.23, the radix-2 decomposition applied to the first stage decomposes the 32-point DFT into two 16-point DFTs. Next, radix-4 decomposition completes the two 16-point DFTs. We can observe the twiddle factors

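$$W_N^{n} \Rightarrow W_{32}^{0}, W_{32}^{1}, W_{32}^{2}, W_{32}^{3}, W_{32}^{4}, W_{32}^{5}, W_{32}^{6}, W_{32}^{7}, W_{32}^{8}, W_{32}^{9}, W_{32}^{10}, W_{32}^{11}, W_{32}^{12}, W_{32}^{13}, W_{32}^{14}, W_{32}^{15}, \quad n = 0, 1, \ldots, 15. \qquad (2.55)$$

$$W_{N/2}^{2n} \Rightarrow W_{32}^{0}, W_{32}^{4}, W_{32}^{8}, W_{32}^{12}, \quad n = 0, 1, 2, 3. \qquad (2.56)$$

$$W_{N/2}^{n} \Rightarrow W_{32}^{0}, W_{32}^{2}, W_{32}^{4}, W_{32}^{6}, \quad n = 0, 1, 2, 3. \qquad (2.57)$$

$$W_{N/2}^{3n} \Rightarrow W_{32}^{0}, W_{32}^{6}, W_{32}^{12}, W_{32}^{18}, \quad n = 0, 1, 2, 3. \qquad (2.58)$$

where $W_{32}^{4} = W_8^{1}$, $W_{32}^{12} = W_8^{3}$, $W_{32}^{20} = W_8^{5}$, and $W_{32}^{28} = W_8^{7}$ are constant multiplications, $W_{32}^{0} = 1$, $W_{32}^{8} = W_4^{1}$, $W_{32}^{16} = W_2^{1}$, and $W_{32}^{24} = W_4^{3}$ are non-complex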


multiplications, and the others are complex multiplications. In total there are 20 complex multiplications and 10 constant multiplications.

Fig. 2.23 Signal flow graph of 32-point radix-4 DIF FFT based on radix-2 decomposition in the first stage and two radix-4 stages in the next stages

Fig. 2.24 Signal flow graph of 32-point radix-4 DIF FFT based on two radix-4 stages and one radix-2 stage.


Table 2.1: Complexity analysis of twiddle factor for radix-4 DIF FFT algorithm

DFT # | radix-2 at first stage (if needed): const mul #*, comp mul #* | radix-2 at last stage (if needed): const mul #*, comp mul #*

*const mul #: number of constant multiplications

*comp mul #: number of complex multiplications

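If radix-4 decomposition is used to decompose the 32-point DFT first, one time stage remains at the end. Therefore, radix-2 decomposition is applied to the last time stage of the 32-point DFT, as shown in Fig. 2.24. We can observe the twiddle factors

$$W_N^{2n} \Rightarrow W_{32}^{0}, W_{32}^{2}, W_{32}^{4}, W_{32}^{6}, W_{32}^{8}, W_{32}^{10}, W_{32}^{12}, W_{32}^{14}, \quad n = 0, 1, \ldots, 7. \qquad (2.59)$$

$$W_N^{n} \Rightarrow W_{32}^{0}, W_{32}^{1}, W_{32}^{2}, W_{32}^{3}, W_{32}^{4}, W_{32}^{5}, W_{32}^{6}, W_{32}^{7}, \quad n = 0, 1, \ldots, 7. \qquad (2.60)$$

$$W_N^{3n} \Rightarrow W_{32}^{0}, W_{32}^{3}, W_{32}^{6}, W_{32}^{9}, W_{32}^{12}, W_{32}^{15}, W_{32}^{18}, W_{32}^{21}, \quad n = 0, 1, \ldots, 7. \qquad (2.61)$$

$$W_{N/4}^{2n} \Rightarrow W_{32}^{0}, W_{32}^{8}, \quad n = 0, 1. \qquad (2.62)$$

$$W_{N/4}^{n} \Rightarrow W_{32}^{0}, W_{32}^{4}, \quad n = 0, 1. \qquad (2.63)$$

$$W_{N/4}^{3n} \Rightarrow W_{32}^{0}, W_{32}^{12}, \quad n = 0, 1. \qquad (2.64)$$

where $W_{32}^{4} = W_8^{1}$, $W_{32}^{12} = W_8^{3}$, $W_{32}^{20} = W_8^{5}$, and $W_{32}^{28} = W_8^{7}$ are constant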


multiplications, and the others are complex multiplications. In total there are 16 complex multiplications and 12 constant multiplications. Therefore, using radix-4 decomposition first and radix-2 decomposition at the last stage reduces the number of complex multiplications further when N is not a power of 4. Table 2.1 shows how the reduction of complex multiplications differs between the two ways of inserting the radix-2 stage. We can conclude that applying the high-radix decomposition first gives the best performance.
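The two totals quoted for the 32-point example can be reproduced with a small Python count. The sketch assumes the classification used in the text: powers of $W_{32}$ equal to 1, -j, -1 or j are trivial (non-complex), odd multiples of $W_{32}^{4} = W_8^{1}$ are constant multiplications, and everything else is a complex multiplication. The stage lists and the names `classify`, `count`, `RADIX2_FIRST` and `RADIX4_FIRST` are assumptions made for this illustration.

```python
from collections import Counter


def classify(exponent, n=32):
    """Classify the twiddle factor W_n^exponent."""
    exponent %= n
    if exponent % (n // 4) == 0:
        return "trivial"   # 1, -j, -1, j
    if exponent % (n // 8) == 0:
        return "constant"  # W_8^1, W_8^3, W_8^5, W_8^7
    return "complex"


def count(stages):
    """stages: list of (number of identical groups, exponents of W_32)."""
    totals = Counter()
    for groups, exponents in stages:
        for e in exponents:
            totals[classify(e)] += groups
    return totals


# Radix-2 first: W_32^n for n = 0..15, then two 16-point radix-4 stages
# whose twiddles W_16^{mn} equal W_32^{2mn}, m = 1, 2, 3 and n = 0..3.
RADIX2_FIRST = [
    (1, list(range(16))),
    (2, [2 * m * n for m in (1, 2, 3) for n in range(4)]),
]

# Radix-4 first: W_32^{mn} for n = 0..7, then four 8-point radix-4 stages
# whose twiddles W_8^{mn} equal W_32^{4mn}, n = 0, 1.
RADIX4_FIRST = [
    (1, [m * n for m in (1, 2, 3) for n in range(8)]),
    (4, [4 * m * n for m in (1, 2, 3) for n in range(2)]),
]
```

Calling `count(RADIX2_FIRST)` and `count(RADIX4_FIRST)` gives 20 complex and 10 constant multiplications for the first ordering and 16 complex and 12 constant multiplications for the second, matching the figures stated above.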

Table 2.2: Complexity analysis of twiddle factor for radix-8 DIF FFT algorithm

DFT # | radix-2 or 4 at 1st stage (if needed) | radix-2 or 4 at last stage (if needed)

(a) (b)

When N is not a power of 8, there are two situations: the factor of N left over after the radix-8 stages is either 4 or 2. When the leftover factor is 4, a radix-4 decomposition is applied at the last stage to complete the computation. Similarly, when the leftover factor is 2, a radix-2 decomposition is employed at the last stage. The resulting reduction of complex multipliers for different DFT sizes is summarized in Table 2.2(a). Let us now think about the implementation of the pipelined hardware. If the constant and the complex multiplications are at the same time stage


of the SFG, the complex multiplier can compute both the constant and the complex multiplications, so in that case we count the constant multiplications as complex multiplications. Taking this implementation issue into account, the reduction of complex multipliers for different DFT sizes is summarized in Table 2.2(b).

Table 2.3 shows the number of complex multiplications under the different radix decompositions. The radix-4 FFT algorithm extracts -j terms to remove part of the complex multiplications of the radix-2 FFT algorithm. The radix-2/4 FFT algorithm extracts more -j terms than the radix-4 FFT algorithm, but its irregular structure makes it difficult to implement in a pipelined hardware design.

Table 2.3: Complex multiplications required for radix-2, radix-4 and radix-2/4 FFT algorithms

Complex Multiplication #

Table 2.4 shows the number of complex multiplications with the radix-8 and radix-2/8 decompositions. The radix-8 FFT algorithm extracts -j, $W_8^{1}$, and $W_8^{3}$ terms


to reduce the complex multiplications of the radix-4 FFT algorithm. The radix-2/8 FFT algorithm extracts more constant multiplications, $W_8^{1}$ and $W_8^{3}$, than the radix-8 FFT algorithm, but its irregular structure makes it difficult to implement in pipelined hardware.

Table 2.4: Complex multiplications required for radix-8 and radix-2/8 FFT.

DFT # | Radix-8 DIF FFT Algorithm: const mul #*, comp mul #* | Radix-2/8 DIF FFT Algorithm: const mul #*, comp mul #*

*const mul #: number of constant multiplications

*comp mul #: number of complex multiplications


Chapter 3

FFT/IFFT Architecture

In this chapter, we discuss two methods of implementing the FFT algorithm: reusing a single butterfly, and fully spread, as shown in Fig. 3.1. Table 3.1 compares their speed, area and control complexity. The single-butterfly implementation employs a single process element (PE for short). Using a single PE to implement the FFT algorithm is called memory-based FFT architecture. The input, intermediate and output data are stored in memory, so the bottleneck is memory access time. The fully spread implementation is generally called pipelined FFT architecture. It offers real-time, non-stop operation and the smallest memory requirement. The number of PEs needed is proportional to $\log_r N$, where r is the radix of the butterfly and N is the size of the DFT.

(a) Single Process Element (b) Fully Spread

Fig. 3.1 Two extreme methods of implementing the FFT algorithm.


Table 3.1: Comparison of single buttery and fully spread architecture

3.1 Pipelined FFT Architecture

The architecture design for pipelined FFT processors has been the subject of intensive research since as early as the 1970s, when real-time processing was demanded in applications such as radar signal processing [8], well before VLSI technology had advanced to the level of system integration. It is characterized by real-time, non-stop processing as the data sequence passes through the processor. In addition, the pipelined structure is highly regular, so it can be easily scaled and parameterized when a Hardware Description Language (HDL) is used in the design. The basic framework of the pipelined FFT architecture is shown in Fig. 3.2.


Fig. 3.2 Basic framework of pipelined FFT.

The delay element can be implemented with a single path or with multiple paths. The butterfly element can be implemented in radix-2 or radix-4. The optimal memory requirement is N - 1, where N, the size of the DFT, is a power of 2. Furthermore, different assumptions about the input and output sequence order lead to different pipelined FFT architectures, as do single or multiple input and output sequences. Several architectures have been proposed over the last


3 decades. Here different approaches are put into functional blocks with unified terminology, where the additive butterflies have been separated from the multipliers to show the hardware requirement distinctively.

We assume the input sequence to be in normal order and the output to be in reversed (radix-2 or radix-4) order, which is permissible in applications such as DFT-based communication systems. The single-path pipelined architecture that uses the radix-2 DIF FFT algorithm is called Radix-2 Single-path Delay Feedback (R2SDF) [9]. The multi-path pipelined architecture that uses the radix-2 DIF FFT algorithm is called Radix-2 Multi-path Delay Commutator (R2MDC) [8]. These two are the common pipelined architectures. Other proposed pipelined architectures extend these two basic architectures with different radix FFT algorithms, as given in Table 3.2. The R2^2SDF has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The R2^3SDF


has the same multiplicative complexity as the radix-8 algorithm, but also retains the butterfly structure of the radix-2 algorithm. We can find that implementing SDF with higher-radix algorithms reduces the number of complex multipliers, but this is not the case for MDC. If R2^2MDC or R2^3MDC is used to implement the pipelined FFT architecture, can the multiplicative complexity be reduced? We discuss this in the next chapter. Table 3.3 compares the hardware utilization ratio of the different radix algorithms and architectures. It determines which architectures perform well.

Table 3.3: Comparison of hardware utilization

In Table 3.3, we can find that architectures with a higher-radix algorithm have a higher multiplier utilization ratio. The SDF architectures reach a 100% utilization ratio of FIFOs, higher than the MDC architectures. The R2^3SDF or R2^3MDC of the


radix-2 butterfly base structure has the highest hardware utilization ratio among the architectures.

Other approaches with Multiple Input Multiple Output (MIMO) proposed in [11, 16-17] make different assumptions about the application system. The mixed SDF and MDC architecture proposed in [18] is a very unusual approach. Although we have introduced many kinds of pipelined architectures under different assumptions, they all extend the two basic SDF and MDC pipelined architectures. Next, we discuss the two basic pipelined architectures: the radix-2 butterfly based single-path delay feedback and the radix-2 butterfly based multi-path delay commutator.

3.1.1 Radix-2 Single-path Delay Feedback (R2SDF)

Fig. 3.3 R2SDF N=16 (Radix-2 Single-path Delay Feedback)

The following notation is used: N denotes the size of the FFT, and $n = \log_2 N$ denotes the number of stages of FFT processing and the number of PEs in the pipelined architecture. When N is 16, the R2SDF needs 4 PEs, as shown in Fig. 3.3. The R2SDF consists of radix-2 butterfly modules (BF2), delay elements (DE) and complex multipliers. The delay elements consist of shift registers operating as first-in first-out (FIFO) buffers, and the number in each block is its delay (shift) length. The number of delay elements in each stage is the key to controlling the butterfly input and the next stage output. The input ordering of the data and the sequence of delay


element operations guarantee proper pairing of all samples at each stage, so a valid FFT can be performed by arranging the twiddle factors accordingly. The unfolded delay elements of the R2SDF are shown in Fig. 3.4. The radix-2 butterfly module has two modes: operation mode and commutator mode, as shown in Fig. 3.5. The operation mode computes the radix-2 butterfly operation and commutates the pairs of butterfly results. The commutator mode only commutates pairs of inputs to pairs of outputs.

Fig. 3.4 Unfolded delay elements of R2SDF

Operation (O)    Commutator (C)

Fig. 3.5 Two modes of the radix-2 butterfly module

When performing an FFT of size N, the first stage of processing combines pairs of samples whose indices are N/2 apart (samples are indexed from 0 to N - 1). The second stage combines pairs whose indices are N/4 apart, and so on. The number of butterfly operations is N/2 at every time stage. The regular relation between the start of the butterfly operation and the delay element in every time stage is what yields a valid FFT. The R2SDF is conveniently explained with an 8-point DIF FFT. The R2SDF needs 3


PEs to implement the 8-point DFT, as shown in Fig. 3.6. In a radix-2 N-point FFT, the twiddle factor of the penultimate stage is always the constant multiplication $W_N^{N/4} = W_8^{2} = W_4^{1} = -j$. The butterfly outputs of the last stage are multiplied by $W_N^{0} = 1$, so the last PE does not employ a complex multiplier.
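The pairing rule of the R2SDF stages (indices N/2 apart in the first stage, N/4 apart in the second, and so on) can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that the FIFO length of a stage equals the pairing distance of that stage, which is consistent with the first PE queuing T0 to T3 for N = 8.

```python
def pair_distance(n, stage):
    """Distance between the two samples of a butterfly at a 0-based stage.
    Assumed here to equal the FIFO (delay element) length of that stage."""
    return n >> (stage + 1)


def butterfly_pairs(n, stage):
    """Index pairs combined at the given stage of an n-point radix-2 DIF FFT.
    Indices refer to positions in the data stream entering that stage."""
    dist = pair_distance(n, stage)
    return [(i, i + dist) for i in range(n) if (i // dist) % 2 == 0]


def total_delay(n):
    """Sum of the FIFO lengths over all log2(n) stages."""
    stages = n.bit_length() - 1
    return sum(pair_distance(n, s) for s in range(stages))
```

For N = 8 the first stage pairs (0, 4), (1, 5), (2, 6), (3, 7), each stage performs N/2 butterflies, and the FIFO lengths 4 + 2 + 1 add up to N - 1, the memory requirement quoted for the pipelined architecture.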


Fig. 3.6 R2SDF (N=8)

The following notation is used: the first input symbol has the input sequence T0~T7, and the next input symbol is denoted T0'~T7'. The superscript 1 in $0^1$~$7^1$ denotes the 1st time stage output. The superscript 2 in $0^2$~$7^2$ denotes the 2nd stage output. The delay element of the first PE queues the input samples T0 to T3; then T0 and T4, the indices of the input pair for the first butterfly operation of the first stage, are used to compute $0^1$ and $4^1$, the output pair of the first butterfly operation of the first stage. $0^1$ passes to

Fig. 3.7 R2SDF (N=8) data stream flow

We can use Fig. 3.8 to observe the control mode of the butterfly module. The squares shown in Fig. 3.8(a) denote that the butterfly module enters commutator mode. The input pairs of samples for the butterfly in each stage are shown in Table 3.4.

Fig. 4.5 R2^2MDC, N = 128

Fig. 4.6 R2^3MDC, N = 128

Table 4.2: Number of complex multipliers in MDC with different radices.

8192

4 5 6 7" 8 9 10 11

2957832 328648 36l512.8

328648197188 8 262918 4 26291824 328648 1 39437? 6

256 512 1024 2048 4096 8192

2 4 4 4 6 6 6 8

''106551.21'?"2280.81T"2280.8 192691.6 2584212 2584212 1 2T8832 344561.6

4.1.2 Parametrizable Memory Access

Because the memory-based FFT architecture uses a single PE to perform the FFT operation, the address width of the memory and ROM changes with the size of the FFT, but the regularity of the memory and ROM address access does not change, so we exploit this property to realize a parametrizable design that handles variable DFT sizes.


To perform an N-point FFT, the ROM must store N/2 words, so the address width of the ROM is $\log_2(N/2)$ bits. Taking the 8-point DFT as an example, the squares of Fig. 4.7 show the value of the twiddle factor in each stage. The twiddle factors required in every time stage, $W_N^{0}$, $W_N^{1}$, $W_N^{2}$ and $W_N^{3}$, are already stored in the ROM. We let the ROM address double with each additional time stage, which is suitable for hardware implementation because a left shift can be used instead of a multiplication. The regularity of the ROM address access of the 8-point FFT is shown in Table 4.3. If we extend the size of the DFT to 64, the relation between the butterfly count and the ROM address in each time stage is shown in Fig. 4.8.
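The address-doubling rule can be sketched as a left shift followed by truncation to the ROM address width. This is a minimal model that assumes the butterfly counter runs from 0 to N/2 - 1 in every time stage and that the ROM holds $W_N^{0}$ to $W_N^{N/2-1}$ at addresses 0 to N/2 - 1; the function name `rom_address` is hypothetical and the exact contents of Table 4.3 are not reproduced here.

```python
def rom_address(butterfly_count, stage, n):
    """Twiddle ROM address for a radix-2 DIF FFT of size n.

    The counter value is shifted left by the 0-based stage number, which
    doubles the address step per stage, and the result is truncated to
    log2(n/2) bits.
    """
    mask = n // 2 - 1  # n/2 ROM words, so log2(n/2) address bits
    return (butterfly_count << stage) & mask


def stage_addresses(stage, n):
    """ROM addresses visited by the n/2 butterflies of one time stage."""
    return [rom_address(c, stage, n) for c in range(n // 2)]
```

For N = 8 this yields the exponents 0, 1, 2, 3 in the first stage, 0, 2, 0, 2 in the second stage and only 0 in the last stage, which is the twiddle pattern of a standard 8-point DIF signal flow graph.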

Table 4.3: ROM content in each stage for an 8-point FFT.

Fig. 4.7 Signal data flow of 8-point DIF FFT

Fig. 4.8 Relation between butterfly counts and ROM addresses in each time stage

In parametrizable memory access, we use in-place mode to perform FFTs of variable size, which has been discussed in section 3.2.2. Fig. 4.9 depicts the architecture of the conflict-free address generator for the radix-r FFT butterfly processor, assuming that the memory has been partitioned into r banks. In the figure, the r barrel shifters associated with the stage counter emulate the right-rotation property of the butterfly unit at different stages. The butterfly counter is designed for completing all butterfly task assignments at the current stage. Finally, the address switching implements equation (4.1) such that the output of each barrel shifter can be mapped to the correct memory bank.

$$\text{Data\_count} = [d_{n-1}, d_{n-2}, \ldots, d_2, d_1, d_0]_r, \quad n = \log_r N,$$

$$\text{Bank\_index} = (d_{n-1} + d_{n-2} + \cdots + d_2 + d_1 + d_0) \bmod r, \qquad (4.1)$$

$$\text{Data\_index} = [d_{n-1}, d_{n-2}, \ldots, d_2, d_1]_r.$$

Fig. 4.9 Architecture of the address generator
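Equation (4.1) can be checked with a small Python model. The sketch below is an illustration only: `bank_and_index` and `butterfly_operands` are hypothetical names, and the operands of one radix-r butterfly are assumed to be the r data counts that differ in exactly one base-r digit.

```python
def digits(value, radix, n_digits):
    """Base-radix digits of value, least significant digit first."""
    out = []
    for _ in range(n_digits):
        out.append(value % radix)
        value //= radix
    return out


def bank_and_index(data_count, radix, n_digits):
    """Bank index and in-bank address according to equation (4.1)."""
    bank = sum(digits(data_count, radix, n_digits)) % radix
    index = data_count // radix  # digits d_{n-1} ... d_1, dropping d_0
    return bank, index


def butterfly_operands(base, stage_digit, radix):
    """The radix data counts that share all digits with base except the
    digit at position stage_digit, which takes every value 0..radix-1."""
    weight = radix ** stage_digit
    cleared = base - ((base // weight) % radix) * weight
    return [cleared + v * weight for v in range(radix)]
```

Because the operands of one butterfly differ in a single digit, their digit sums are distinct modulo r, so they always fall into r different banks and can be accessed in the same cycle.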


4.2 Building Block

We apply three kinds of IP cores, R2^3SDF, R2^3MDC and memory-based, whose common building blocks are radix-2 butterfly modules and complex multipliers. Our building blocks follow the RTL coding guidelines of SIP.

The Radix-2 butterfly module (BF2)

The BF2 includes one complex adder at the top, one complex subtractor at the bottom, and a mode input for selecting FFT or IFFT, as shown in Fig. 4.10. When the mode is IFFT, the divide-by-2 path is passed to the output ports; when the mode is FFT, the other path is passed. Assume the input pair of the BF2 is (a + jb) at the top and (c + jd) at the bottom. After the computation of the radix-2 butterfly, the output pair is

$$\text{top output: } (a + jb) + (c + jd) = (a + c) + j(b + d),$$

$$\text{bottom output: } (a + jb) - (c + jd) = (a - c) + j(b - d). \qquad (4.2)$$

mode: 0 = FFT, 1 = IFFT

Fig. 4.10 Radix-2 butterfly module with mode selection.
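A behavioural sketch of the BF2 with its mode input is given below in Python. It is a floating-point model with an assumed function name `bf2`, not the RTL of the building block, and it assumes that the divide-by-2 path is applied to both outputs in IFFT mode.

```python
def bf2(top, bottom, ifft=False):
    """Radix-2 butterfly of equation (4.2) with FFT/IFFT mode selection.

    In IFFT mode both results pass through the divide-by-2 path, so that
    log2(N) cascaded stages give the overall 1/N scaling of the inverse
    transform.
    """
    total = top + bottom       # complex adder at the top
    difference = top - bottom  # complex subtractor at the bottom
    if ifft:
        total, difference = total / 2, difference / 2
    return total, difference
```

For example, `bf2(1 + 2j, 3 - 1j)` returns `(4 + 1j, -2 + 3j)`, and the same inputs in IFFT mode return half of each value.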

Implementation of the complex multiplier contains a constant multiplier and a

complex multiplier.

Constant multiplier

$$(a + jb) \times W_8^{1} = \frac{\sqrt{2}}{2}\big[(a + b) + j(b - a)\big], \qquad (a + jb) \times W_8^{3} = \frac{\sqrt{2}}{2}\big[(b - a) - j(a + b)\big]$$

Complex multiplier

Assume the input pair of the complex multiplier is $(X_1 + jY_1)$ and $(X_2 + jY_2)$, where $X_1$, $Y_1$, $X_2$, and $Y_2$ use 2's complement representation. The complex multiplication is arranged into a real part and an imaginary part as below:

$$\text{real} = X_1 X_2 - Y_1 Y_2, \qquad \text{imag} = X_1 Y_2 + X_2 Y_1. \qquad (4.4)$$

There are four real multiplications and two real additions in equation (4.4). Because the Verilog hardware description language (HDL) can't support signed multiplication, the complex multiplier is built from unsigned multipliers with 2's complement correction, as shown in Fig. 4.13.

Fig. 4.13 Complex multiplier architecture using unsigned multipliers.

Using DW02_Mult provided by Synopsys, signed multipliers can be applied to implement the complex multiplications, and the 2's complement correction of Fig. 4.13 can be omitted, giving the architecture redrawn in Fig. 4.14.
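The idea of building a signed multiplication from an unsigned multiplier can be sketched in Python on fixed-width bit patterns. This is one common scheme (take magnitudes, multiply unsigned, negate the product when the sign bits differ); whether it matches the exact correction logic of Fig. 4.13 is an assumption, and all function names are hypothetical.

```python
def to_bits(value, width):
    """Two's complement bit pattern of a signed integer."""
    return value & ((1 << width) - 1)


def from_bits(bits, width):
    """Signed integer represented by a two's complement bit pattern."""
    return bits - (1 << width) if bits >> (width - 1) else bits


def signed_mul_via_unsigned(a_bits, b_bits, width):
    """Signed product (2*width bits) using only an unsigned multiply."""
    mask = (1 << width) - 1
    sign_a = a_bits >> (width - 1)
    sign_b = b_bits >> (width - 1)
    mag_a = (-a_bits) & mask if sign_a else a_bits  # 2's complement negate
    mag_b = (-b_bits) & mask if sign_b else b_bits
    product = mag_a * mag_b                         # unsigned multiplier
    if sign_a ^ sign_b:                             # XOR of the sign bits
        product = (-product) & ((1 << (2 * width)) - 1)
    return product


def complex_mul(x1, y1, x2, y2, width):
    """Equation (4.4): four real multiplications and two real additions."""
    def mul(p, q):
        bits = signed_mul_via_unsigned(to_bits(p, width), to_bits(q, width), width)
        return from_bits(bits, 2 * width)
    return mul(x1, x2) - mul(y1, y2), mul(x1, y2) + mul(x2, y1)
```

For instance, `complex_mul(3, -4, -2, 5, 8)` returns `(14, 23)`, in agreement with (3 - 4j)(-2 + 5j) = 14 + 23j.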


Fig. 4.14 Complex multiplier architecture using signed multipliers.

If we use the UMC .18 process, a multiplier word length of 20 bits (10 bits for the real part and 10 bits for the imaginary part), and a clock rate of 60 MHz, the statistics are as shown in Table 4.4. A complex multiplier built from unsigned multipliers is called technology independent (TI), whereas one built from signed multipliers is called Design Ware (DW). The cost of the constant multiplier is clearly lower than that of the complex multiplier, so replacing complex multipliers by constant multipliers is indeed beneficial. When the user can utilize DW, the provided design benefits from a lower gate count and higher speed.

Table 4.4: Comparison between constant multiplier and complex multiplier of TI and DW

UMC .18, word length 20 bit and clock rate 260MHz

Total cell area (um^2): 32864.8, 29465.2

Total dynamic power (mW): 4.3034, 3.6627, 1.3652


4.3 FFT/IFFT Compiler Flow

In the FFT/IFFT compiler flow, users obtain the circuit they want by choosing parameters through the user interface. Table 4.5 lists and describes all parameters of our FFT/IFFT compiler. After the parameters are chosen, the FFT/IFFT compiler generates the design models automatically, as shown in Fig. 4.15.

Untimed functional model

Provides a C simulation model that verifies the operation results of the FFT/IFFT Verilog RTL code. The C simulation model generates golden patterns for the test bench from a common input pattern.

Verilog RTL model

Provides synthesizable Verilog code of the FFT/IFFT that users can integrate into their own designs.

Test model

Provides the test bench and the test pattern files to simulate and test the circuit, and uses the golden patterns to verify the design by automatic comparison.

Script model

Generates the synthesis script file and the testing script file with which users synthesize the circuit and perform test insertion.

Bus functional model

Provides the AMBA AHB interface for testing compatibility with the application system.


Table 4.5: Parameter information of the FFT/IFFT compiler

Parameter                     Description
Size of FFT/IFFT              Range: 64, 128, ..., 8192
Vendor-specific directives    Chooses between vendor-specific directives (Design Ware) and technology independent.
                              0: technology independent
                              1: vendor-specific directives
ClockRate                     FFT/IFFT system clock rate
DataWidth                     Data width of each real part and imaginary part of the complex data
Sub-pipe
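The parameters of Table 4.5 could be captured as a small record with range checks before any model is generated. The following Python sketch is illustrative only: the class name, field names, and error messages are hypothetical and are not taken from the compiler itself. Only the allowed FFT sizes and the 0/1 encoding of the vendor-specific option come from the table.

```python
from dataclasses import dataclass

# FFT/IFFT sizes listed in Table 4.5: 64, 128, ..., 8192.
VALID_SIZES = tuple(2 ** k for k in range(6, 14))


@dataclass(frozen=True)
class CompilerParams:
    size: int               # number of FFT/IFFT points
    vendor_directives: int  # 0: technology independent, 1: vendor-specific (Design Ware)
    clock_rate_mhz: float   # system clock rate
    data_width: int         # bits for each of the real and imaginary parts

    def validate(self):
        """Raise ValueError if a parameter is outside the documented range."""
        if self.size not in VALID_SIZES:
            raise ValueError("size must be a power of two from 64 to 8192")
        if self.vendor_directives not in (0, 1):
            raise ValueError("vendor_directives must be 0 or 1")
        if self.clock_rate_mhz <= 0:
            raise ValueError("clock rate must be positive")
        if self.data_width <= 0:
            raise ValueError("data width must be positive")
        return self
```

A configuration matching the running example of this chapter would be `CompilerParams(128, 1, 60.0, 10).validate()`.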


4.4 Specification

We use the 128-point FFT as an example to show the block diagram, I/O definition, timing diagram, and synthesis results of the R23SDF, R23MDC, and memory-based FFT architectures in turn, and then analyze their suitable applications.

4.4.1 Specification of the 128-Point R23SDF

The provided R23SDF architecture is shown in Fig. 4.16. In the figure, the twiddle factors in the smallest rectangles at the penultimate stage of the SFG are

W,3,W52",W,;",W,;",W5",W,3",W54",W56",n =0,1,...,N/64-1, (4.5)

based on the relation between the input sequence and the butterfly module in each stage discussed earlier, as depicted in Fig. 4.17.

Fig. 4.16 SDF of 128-point and R23SDF block diagram

Fig. 4.19 128-point R23SDF timing diagram with pipeline. Total cycles: N + (N - 1) + 7 pipeline-register stages = 262 (cycles 0 to 261).

In the 128-point FFT, the R23SDF starts exporting the output sequence after (128 - 1 + log2 N) = 134 cycles.

Fig. 4.20 128-point R23SDF timing diagram with I/O information


Table 4.6: I/O ports of R23SDF.

R23SDF Pipeline FFT

Port           Description
start          High level means that the circuit is receiving input symbols from the upper system.
trans_input    Input port receiving input symbols from the upper system.
mode           Low level means executing the FFT operation. High level means executing the IFFT operation.
clk            Clock signal.
ready          High level means that the output port has valid data ready.
trans_output   Output port of the computation result.

4.4.2 Specification of the 128-Point R23MDC

The block diagram of the R23MDC implementation is shown in Fig. 4.21 and its timing diagram in Fig. 4.22. The figures show the regularity of the control circuit in every time stage. As with the R23SDF, the output latency of the R23MDC is extended when pipeline registers are inserted. The pipeline registers in each stage and the control circuit of the pipelined architecture are depicted in Fig. 4.23, and the corresponding timing diagram in Fig. 4.24. The timing diagram of the input/output (I/O) ports is shown in Fig. 4.25, and the I/O signals are described in Table 4.7.

Fig. 4.21 Block diagram of 128-point R23MDC

Fig. 4.24 Timing diagram of 128-point R23MDC with pipelined registers. Total cycles: N - 1 + N/2 + log2 N = 198 (cycles 0 to 197).

In the 128-point DFT, the R23MDC starts exporting the output sequence after 134 cycles, that is, (N - 1 + log2 N) cycles. Because the output port of the R23MDC is multi-path, the result data need only 64 cycles, that is, N/2 cycles, to be exported completely.
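The cycle counts quoted for the two pipelined architectures follow directly from N. The Python sketch below recomputes them under the assumption, taken from the 128-point timing diagrams, that one pipeline register is inserted per stage, so that the number of register stages equals log2 N. The function names are hypothetical, and generalizing beyond N = 128 rests on that assumption.

```python
def log2_int(n):
    """Exact log2 of a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def first_output_cycle(n):
    """Cycles before the first output appears (R23SDF and R23MDC): N - 1 + log2 N."""
    return n - 1 + log2_int(n)


def r23sdf_total_cycles(n):
    """R23SDF: N output cycles follow the latency, N + (N - 1) + log2 N in total."""
    return n + (n - 1) + log2_int(n)


def r23mdc_output_cycles(n):
    """R23MDC exports two results per cycle, so the output takes N / 2 cycles."""
    return n // 2


def r23mdc_total_cycles(n):
    """R23MDC: N - 1 + N / 2 + log2 N in total."""
    return n - 1 + n // 2 + log2_int(n)
```

For N = 128 these give a latency of 134 cycles for both architectures, 262 total cycles for the R23SDF, and 64 output cycles and 198 total cycles for the R23MDC, matching the values stated in the text and timing diagrams.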

Fig. 4.25 128-point R23MDC timing diagram with I/O information


Table 4.7: I/O ports of R23MDC

R23MDC Pipeline FFT

Port            Description
start           High level means that the circuit is receiving input symbols from the upper system.
trans_input     Input port receiving input symbols from the upper system.
mode            Low level means executing the FFT operation. High level means executing the IFFT operation.
rst_p           Asynchronous reset signal, positive edge triggered.
clk             Clock signal.
ready           High level means that the output port has valid data ready.
trans_output1   Output port 1 of the computation result.
trans_output2   Output port 2 of the computation result.

4.4.3 Specification of the Memory-Based FFT Architecture

✓ Block diagram

[Figure: block diagram of the memory-based FFT architecture, showing the process element, MEMIn, ROM, MEMOut, and the signals clk, rst_p, mode, and trans_output.]

✓ Timing diagram

The reset signal rst_p must first be set high to initialize the memory-based architecture. Then, the two dual-port memories begin receiving the input data when the primary input start is pulled high. This signal is not pulled down until all 128 sets of data have been input. When all 448 butterflies, that is, (128/2 x log2 128), complete their operations, the output signal trans_output starts outputting the computed results; simultaneously, the other output signal ready is set high to tell the outer circuits that the current output data are valid. Finally, the ready signal is pulled down when all 128 sets of data have been output.

Fig. 4.27 Timing diagram of a 128-point memory-based architecture.

✓ I/O Definition

Table 4.8: I/O ports of memory-based FFT architecture.

Memory-based FFT architecture

Port           Description
start          High level means that the circuit is receiving input symbols from the upper system.
rst_p          Asynchronous reset signal, positive edge triggered.
clk            Clock signal.
ready          High level means that the output port has valid data ready.
trans_output   Output port of the computation result.
WEN1
WEN2
OEN

4.4.4 Synthesis Result

Technology file: Artisan UMC 0.18 µm 1P6M cell library

Word length: 20 bits (10 bits for the real part and 10 bits for the imaginary part)

✓ R23SDF

Table 4.9 lists the gate counts of the different approaches. The original design without any additional option, such as inserting extra pipeline registers or using the Design Ware complex multiplier, has the lowest gate count when the clock rate is lower than 60 MHz. When the clock rate is higher than 70 MHz, option 3 of sub-pipelined insertion has the lowest gate count and remains workable up to 130 MHz. In addition, with option 3 of sub-pipelined insertion, as the clock rate increases the area is smaller than the original


Table 4.9: Gate count of the R23SDF

Table 4.10: Power consumption of the R23SDF at 128-point FFT (mW).

16.3537 16.5821 17.4383 17.6724 17.6523

23.3062 23.3919 24.5416 24.7647 24.7128

30.580630.4169828.180131.935431.8581

39.4032 39.3024 39.0054

46.2579

✓ R23MDC

Following the above analysis, we try a subset of the cases to quickly check whether the R23MDC has the same properties as the R23SDF. According to Table 4.11, the R23MDC shows the same characteristic as the R23SDF when option 3 of sub-pipelined insertion is chosen. However, not only is its gate count larger than that of the R23SDF, but its power consumption, shown in Table 4.12, is also higher. Nevertheless, the R23MDC is not without advantages: some applications can use the unused N/2 cycles for other work, such as converting the bit-reversed output sequence into normal order.

Table 4.11: Gate count of the R23MDC at 128-point FFT/IFFT.

Table 4.12: Power consumption of the R23MDC at 128-point FFT (mW).

33.5921 36.5175 36.2135

58.5101

46.8579


✓ Memory-based architecture

We try to find the fastest clock rate for this architecture. The fastest clock rate is 80 MHz when using the technology-independent complex multiplier. It increases to 100 MHz when using the Design Ware complex multiplier, which also decreases the gate count.

Table 4.13: Power consumption of the R23SDF at 128-point FFT (mW).

using technology independent (@ 80 MHz)

using Design Ware (@ 100 MHz)

4.4.5 Analysis of Suitable Applications

In an N-point FFT implemented with a radix-r PE, where r is a power of 2, the number of butterflies per stage is N / r and the number of stages is log2N. Under this circumstance, we describe the execution flow of the provided pipelined architecture and memory-based architecture in Figs. 4.28 and 4.29, respectively. For our pipelined architectures, the clock rate and the throughput rate are the same, because they operate in real time without stalling. In contrast, the throughput rate of the memory-based architecture differs from the clock rate, because it must first load all N inputs, then execute the butterflies, and finally output the N results. In this situation, the relation between the throughput rate and the clock rate can be represented as follows.

\[
\text{Throughput rate} = \frac{N}{2N + \frac{N}{r}\log_r N} \times \text{Clock Rate} = \frac{\text{Clock Rate}}{2 + \frac{\log_r N}{r}}. \tag{4.7}
\]

Fig. 4.29 Execution flow of the provided memory-based FFT architecture

According to the synthesis results given before, the maximum operating frequency of our memory-based architecture is 100 MHz. Assuming that the FFT size is 64 points and the operating frequency is set to 100 MHz, the throughput rate will be equal to

\[
\frac{\text{Clock Rate}}{2 + \frac{\log_r N}{r}} = \frac{100\ \text{MHz}}{2 + \frac{\log_2 64}{2}} = 20\ \text{Mbps}. \tag{4.8}
\]

In the same way, if the size of FFT is 8192-point, the throughput rate will become

\[
\frac{\text{Clock Rate}}{2 + \frac{\log_r N}{r}} = \frac{100\ \text{MHz}}{2 + \frac{\log_2 8192}{2}} = 11.76\ \text{Mbps}. \tag{4.9}
\]
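The two worked values can be checked numerically with a short Python sketch of the throughput relation. The function names are hypothetical, and the unit label follows the text. The model simply assumes N input cycles, N output cycles, and one cycle per butterfly, which is how the relation above is built.

```python
def log_int(n, r):
    """Exact log base r of n, for n a power of r."""
    stages = 0
    while n > 1:
        if n % r:
            raise ValueError("n must be a power of r")
        n //= r
        stages += 1
    return stages


def butterfly_count(n, r=2):
    """Total butterflies: N / r per stage times log_r N stages."""
    return (n // r) * log_int(n, r)


def memory_based_throughput(clock_rate_mhz, n, r=2):
    """Throughput of the memory-based architecture: clock / (2 + log_r(N) / r)."""
    return clock_rate_mhz / (2 + log_int(n, r) / r)
```

With a 100 MHz clock, `memory_based_throughput(100, 64)` gives 20 and `memory_based_throughput(100, 8192)` gives about 11.76, while `butterfly_count(128)` reproduces the 448 butterflies of the 128-point example.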

As seen from the specifications of the associated OFDM-based communication systems given in Tables 4.14 and 4.15, most applications except UWB could be realized using our developed architectures. However, a larger FFT size requires a longer word length for higher precision. In that case, the proposed architectures may not be able to operate at the highest frequency, 100 MHz. Determining the maximum operating frequency for each FFT size requires more experiments, which we have not yet completed.

Table 4.14: FFT/IFFT size for OFDM-based communication system

DVB-T ~ DAB ~ VDSL

system

64 . 20

2 x 256 2.22

2.22*22x256x2,n=0,...,4 23 1

256x2,n=0,...,3 8.26

8192/2048 896/224 9.14/9.14

128 0.24242 528

Chapter 5

Verification and Performance

In this chapter, we discuss the possibility of finding a cost function, which would raise the capability of the FFT/IFFT compiler. We then construct a C simulation model for verifying the proposed design. Finally, a verification plan and a comparison with other works are given.

5.1 Cost Function and Derivation

A good cost function is built from the statistics of the power consumption and area of all proposed architectures, calculated for the given parameters, which include the FFT size, the clock rate, and the throughput rate. After analyzing the statistics, the FFT/IFFT compiler indicates which architecture is the first choice under those parameters.

According to the analysis results of our research, the FFT/IFFT compiler automatically chooses, among the proposed architectures, the one with the lowest gate count under the given throughput rate and clock rate. As discussed previously, the pipelined architecture and the memory-based architecture call for different considerations under a throughput requirement. We can therefore arrange for the FFT/IFFT compiler to choose the provided architecture with the lowest gate count according to the throughput rate. However, when the FFT size or the word length changes, the critical path of the proposed architecture changes, and so do the ranges in Fig. 5.1.
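The selection rule described above, lowest gate count among the architectures that meet the required throughput, can be sketched as follows. This is a minimal illustration rather than the compiler's actual cost function. The throughput limits loosely follow figures quoted in this thesis (20 Mbps for the memory-based design, 60 and 130 MHz for the R23SDF variants), whereas the gate counts are hypothetical placeholders chosen only to make the example run.

```python
def choose_architecture(candidates, required_throughput):
    """Return the name of the feasible candidate with the lowest gate count,
    or None when no candidate reaches the required throughput."""
    feasible = [c for c in candidates if c["max_throughput"] >= required_throughput]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["gate_count"])["name"]


# Gate counts below are hypothetical placeholders, not synthesis results.
CANDIDATES = [
    {"name": "memory-based", "max_throughput": 20, "gate_count": 30_000},
    {"name": "R23SDF", "max_throughput": 60, "gate_count": 50_000},
    {"name": "R23SDF sub-pipe option 3", "max_throughput": 130, "gate_count": 55_000},
]
```

Calling `choose_architecture(CANDIDATES, 50)` selects the plain R23SDF under these placeholder numbers. In a real flow the table would have to be regenerated for each FFT size and word length, since the critical path changes with both.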

[Figure: throughput-rate axis in Mbps with breakpoints at 20, 60, 70, and 140, partitioned among the memory-based architecture, the R23SDF without sub-pipelining, the R23SDF with sub-pipe option 1, and the R23SDF with sub-pipe option 3, using Design Ware.]

Fig. 5.1 Choosing an architecture based on the specified throughput rate.

5.2 C Simulation Model

Constructing a C simulation model provides intermediate simulation values, which are useful for debugging while the chip is being implemented. Another purpose of the C simulation model is to generate golden patterns for verifying the proposed design. Next, we discuss how to construct the C simulation model from the FFT algorithm.

Based on the discussion of the algorithms in the previous chapters, the regularity of the FFT algorithm is already known: the twiddle factors vary in each time stage depending on the radix of the FFT algorithm. We use the radix-8 algorithm to explain how to construct a C simulation model. All operations in the model must be carried out in fixed-point arithmetic so that the simulation results match the hardware results. We illustrate the construction flow of the C simulation model with the example shown in Fig. 5.2.

Step 1: perform all butterfly operations of one stage first, then save the results in place.

Step 2: multiply the butterfly outputs by the twiddle factors, and export a file if the output of every stage needs to be traced.

Step 3: return to Step 1 until all time stages have been processed, then export the result file, which is the golden pattern.
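The three steps above can be sketched with a small stage-by-stage model. The sketch below is in Python rather than C, uses a radix-2 decimation-in-frequency FFT instead of the radix-8 algorithm of the thesis, and works in floating point instead of fixed point, so it only illustrates the loop structure (butterflies in place, then twiddle multiplication, then an optional per-stage trace). All names are hypothetical.

```python
import cmath


def fft_dif_inplace(x, trace=None):
    """Radix-2 DIF FFT computed stage by stage; output is in bit-reversed order."""
    n = len(x)
    data = list(x)
    span = n // 2
    while span >= 1:
        # Step 1: all butterflies of this stage, results stored in place.
        for start in range(0, n, 2 * span):
            for k in range(span):
                a, b = data[start + k], data[start + k + span]
                data[start + k] = a + b
                data[start + k + span] = a - b
        # Step 2: twiddle multiplication on the lower butterfly outputs.
        for start in range(0, n, 2 * span):
            for k in range(span):
                data[start + k + span] *= cmath.exp(-2j * cmath.pi * k / (2 * span))
        if trace is not None:
            trace.append(list(data))  # per-stage values for debugging
        # Step 3: continue with the next stage until all stages are done.
        span //= 2
    return data


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversed_to_normal(y):
    """Reorder a bit-reversed output sequence into normal order."""
    bits = len(y).bit_length() - 1
    return [y[bit_reverse(i, bits)] for i in range(len(y))]
```

The list collected in `trace` plays the role of the per-stage output files, and the final reordered sequence plays the role of the golden pattern. A bit-accurate model would additionally quantize every addition and multiplication to the hardware word length.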

[Fig. 5.2: radix-8 example of the C simulation flow, showing Step 1 and Step 2.]

from bit-reversed order to normal order before injecting them into the FFT. Based on the above verification plan, we can ensure that our design is correct.

Fig. 5.3 Verification plan

5.4 Performance Evaluation

With the parametrizable control, different simulation results can be acquired by changing the input data width. As mentioned in the last section, the verification plan is completed by 1) injecting the test patterns from the pattern generator into the IFFT, and 2) applying the computed results to the FFT. The signal-to-quantization-noise ratio (SNR) can therefore be obtained by comparing the input sequence of the IFFT with the resulting output sequence of the FFT. In our design, we assume that the input, the output, and the twiddle factors have the same data width. For simplicity, "data width" hereafter denotes the data width of these signals. Fig. 5.4 depicts the relation between the SNR and the data width for the 128-point FFT. The SNR is higher than 30 dB when the data width is larger than 11x2 bits, and higher than 40 dB when it is larger than 15x2 bits.
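The SNR measurement itself is a ratio of signal power to error power between the reference sequence (the IFFT input) and the measured sequence (the FFT output). The Python sketch below shows that computation. The `quantize` helper is only an illustrative stand-in for fixed-point rounding and does not model the actual IFFT and FFT datapaths, so the values it produces are not comparable with Fig. 5.4. All names are hypothetical.

```python
import math


def snr_db(reference, measured):
    """SNR in dB: signal power of the reference over the power of the error."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


def quantize(value, bits):
    """Round a value in [-1, 1) to a signed fixed-point grid with `bits` bits."""
    scale = 1 << (bits - 1)
    return round(value * scale) / scale


# Illustrative reference sequence; in the verification plan this would be
# the pattern injected into the IFFT.
reference = [0.5 * math.cos(0.3 * k) for k in range(64)]
snr_8 = snr_db(reference, [quantize(v, 8) for v in reference])
snr_12 = snr_db(reference, [quantize(v, 12) for v in reference])
```

In this toy setting a wider data width yields a higher SNR, which is the same qualitative trend as the measured curve.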

Fig. 5.4 SNR curve in the 128-point FFT

Table 5.1 lists the synthesis results of our proposed R23SDF architecture and of another work [25]. Assuming that the data width is 20 bits, both the area and the power consumption of our proposed architecture are lower than those in [25].

Table 5.1: ASIC synthesis result at clock frequency of 132MHz.

proposed

Chapter 6

Conclusions and Future Work

6.1 Conclusions

We present an efficient FFT/IFFT compiler which consists of three IP cores: the R23SDF, R23MDC, and memory-based FFT architectures. The inputs to our generator are a set of user-defined parameters. According to the input constraints provided by the user, our generator takes into account the trade-off between hardware overhead and speed requirement, and outputs suitable RTL code for the user's reference. Based on our development, not only can a dedicated FFT/IFFT module be easily prototyped for fast system verification, but the resulting compiler can also be used as a basis for more advanced research in the
