etd_0728106_120055


  • 7/27/2019 etd_0728106_120055

    1/113

    AN EFFICIENT FFT/IFFT COMPILER FOR

    DIFFERENT APPLICATIONS

    Department of Electrical Engineering

    National Cheng Kung University

    Tainan, Taiwan, R.O.C.

    Thesis for Master of Science

    July, 2006


    An Efficient FFT/IFFT Compiler for

    Different Applications

    by

    Sheng-Hsien Huang

    A thesis submitted to the graduate division in partial

    fulfillment of the requirement for the degree of Master of Science

    at

    National Cheng Kung University

    Tainan, Taiwan, Republic of China

    July 2006

    Approved by:


    An Efficient FFT/IFFT Compiler for

    Different Applications

    Sheng-Hsien Huang¹  Ming-Der Shieh²

    Department of Electrical Engineering

    National Cheng Kung University

    Tainan, Taiwan, Republic of China

    ABSTRACT

    With the emergence of Internet services and the spread of communication
    applications, Internet and wireless communication have become a part of our lives.
    Wireless networks can reach places that traditional networks cannot, which makes
    their applications more widespread. To date, different wireless LAN standards have
    emerged for different applications, and each standard has its own characteristics
    and application range.

    Among existing digital communication systems, the Orthogonal Frequency
    Division Multiplexing (OFDM) technique is widely used for signal modulation. In
    contrast to the traditional single-carrier modulation technique, OFDM adopts
    multi-carrier modulation, which has been adopted in many practical wireless systems
    such as 802.11x, DAB, and DVB. In hardware, OFDM can be realized by employing the
    (inverse) fast Fourier transform (FFT/IFFT). In fact, different standards or
    operation modes imply different requirements or specifications for the associated
    FFT/IFFT; therefore, different design methodologies should be applied. Designing a
    unified FFT/IFFT architecture is consequently a challenge. In this thesis, we
    investigate how to develop an efficient FFT/IFFT compiler for different
    applications. Based on our development, not only can a dedicated FFT/IFFT module be
    easily prototyped for fast system verification, but the resulting compiler can also
    be used as a basis for more advanced research in the future.

    ¹ The Author

    ² The Advisor


    TABLE OF CONTENTS

    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    Chapter 1 Introduction
        1.1 FFT Overview
        1.2 Motivation
        1.3 Organization of this Thesis
    Chapter 2 FFT Algorithm
        2.1 General Algorithm
            2.1.1 Decimation-In-Time (DIT) FFT Algorithms
            2.1.2 Decimation-In-Frequency (DIF) FFT Algorithms
        2.2 High-Radix Algorithm
            2.2.1 Radix-4 DIF FFT Algorithm
            2.2.2 Radix-8 DIF FFT Algorithm
        2.3 Split-Radix DIF FFT Algorithms
            2.3.1 Radix-2/4 DIF FFT Algorithm
            2.3.2 Radix-2/8 DIF FFT Algorithm
        2.4 Complexity Analysis
    Chapter 3 FFT/IFFT Architecture
        3.1 Pipelined FFT Architecture
            3.1.1 Radix-2 Single-path Delay Feedback (R2SDF)
            3.1.2 Radix-2 Multi-path Delay Commutator (R2MDC)
        3.2 Memory-Based FFT Architecture
            3.2.1 Ping-Pong Mode of the Memory Management Strategy
            3.2.2 In-Place Mode of the Memory Management Strategy
        3.3 A Unified FFT/IFFT Architecture
    Chapter 4 FFT/IFFT Compiler Design
        4.1 Implementation Strategy
            4.1.1 Parametrizable Architecture
            4.1.2 Parametrizable Memory Access
        4.2 Building Block
        4.3 FFT/IFFT Compiler Flow
        4.4 Specification
            4.4.1 Specification of the 128-Point R2³SDF
            4.4.2 Specification of the 128-Point R2³MDC
            4.4.3 Specification of the 128-Point Memory-Based FFT Architecture
            4.4.4 Synthesis Result
            4.4.5 Analysis of Suitable Applications
    Chapter 5 Verification and Performance
        5.1 Cost Function and Derivation
        5.2 C Simulation Model
        5.3 Verification Plan
        5.4 Performance Evaluation
    Chapter 6 Conclusions and Future Work
        6.1 Conclusions
        6.2 Future Work
    References

    LIST OF TABLES

    Table 1.1: FFT/IFFT size for OFDM-based communication system
    Table 2.1: Complexity analysis of twiddle factor for radix-4 DIF FFT algorithm
    Table 2.2: Complexity analysis of twiddle factor for radix-8 DIF FFT algorithm
    Table 2.3: Complex multiplications required for radix-2, radix-4 and radix-2/4 FFT algorithms
    Table 2.4: Complex multiplications required for radix-8 and radix-2/8 FFT
    Table 3.1: Comparison of single butterfly and fully spread architecture
    Table 3.2: Analysis of different pipelined architectures using different radix
    Table 3.3: Comparison of hardware utilization
    Table 3.4: The pairs of butterfly inputs at each time stage
    Table 3.5: The relation of data accesses at different time stage of 8-point radix-2 FFT
    Table 4.1: Analyze the number of complex multipliers in SDF with different radix
    Table 4.2: Analyze the number of complex multipliers in MDC with different radix
    Table 4.3: ROM content in each stage for an 8-point FFT
    Table 4.4: Comparison between constant multiplier and complex multiplier of TI and DW
    Table 4.5: Parameters information of the FFT/IFFT Compiler
    Table 4.6: I/O ports of R2³SDF
    Table 4.7: I/O ports of R2³MDC
    Table 4.8: I/O ports of memory-based FFT architecture
    Table 4.9: Gate count of the R2³SDF at 128-point FFT/IFFT
    Table 4.10: Power consumption of the R2³SDF at 128-point FFT (mW)
    Table 4.11: Gate count of the R2³MDC at 128-point FFT/IFFT
    Table 4.12: Power consumption of the R2³MDC at 128-point FFT (mW)
    Table 4.13: Power consumption of the R2³SDF at 128-point FFT (mW)
    Table 4.14: FFT/IFFT size for OFDM-based communication system
    Table 4.15: FFT/IFFT size and throughput rate for OFDM-based communication system
    Table 5.1: ASIC synthesis result at clock frequency of 132 MHz

    LIST OF FIGURES

    Fig. 1.1 The twiddle factor $W_N^{nk}$ of FFT in the unit circle
    Fig. 1.2 OFDM transceiver block diagram
    Fig. 2.1 Signal flow graph of the decimation-in-time decomposition of an N-point DFT (N = 8) computation into two (N/2)-point DFT computations
    Fig. 2.2 Signal flow graph of the decimation-in-time decomposition of two (N/2)-point DFT (N = 8) computation into four (N/4)-point DFT computations
    Fig. 2.3 Signal flow graph of a 2-point DFT
    Fig. 2.4 Signal flow graph of an 8-point DIT FFT
    Fig. 2.5 Signal flow graph of radix-2 butterfly computation in Fig. 2.4
    Fig. 2.6 Signal flow graph of simplified butterfly computation with only one complex multiplication
    Fig. 2.7 Signal flow graph of 8-point DFT using the butterfly computation of Fig. 2.6
    Fig. 2.8 Signal flow graph of DIF decomposition of an N-point DFT computation into two (N/2)-point DFT computations (N = 8)
    Fig. 2.9 Signal flow graph of decimation-in-frequency decomposition of an 8-point DFT computation into four 2-point DFT computations
    Fig. 2.10 Signal flow graph of a typical 2-point DFT at the last stage decomposition
    Fig. 2.11 Signal flow graph of complete DIF decomposition of an 8-point DFT computation
    Fig. 2.12 Signal flow graph of complete DIF decomposition of a 16-point DFT computation
    Fig. 2.13 Radix-4 butterfly
    Fig. 2.14 16-point radix-4 DIF FFT
    Fig. 2.15 Radix-8 butterfly
    Fig. 2.16 Signal flow graph of 16-point radix-8 DIF FFT
    Fig. 2.17 Radix-2/4 butterfly
    Fig. 2.18 A sketch map of 16-point radix-2/4 DIF FFT
    Fig. 2.19 Signal flow graph of 16-point radix-2/4 DIF FFT
    Fig. 2.20 Radix-2/8 butterfly
    Fig. 2.21 A sketch map of 16-point radix-2/8 DIF FFT
    Fig. 2.22 Signal flow graph of 16-point radix-2/8 DIF FFT
    Fig. 2.23 Signal flow graph of 32-point radix-4 DIF FFT based on radix-2 decomposition in the first stage and two radix-4 stages in the next stage
    Fig. 2.24 Signal flow graph of 32-point radix-4 DIF FFT based on two radix-4 stages and one radix-2 stage
    Fig. 3.1 Two extreme methods of implementing the FFT algorithm
    Fig. 3.2 Basic framework of pipelined FFT
    Fig. 3.3 R2SDF N=16 (Radix-2 Single-path Delay Feedback)
    Fig. 3.4 Unfolded delay elements of R2SDF
    Fig. 3.5 Two modes of the radix-2 butterfly module
    Fig. 3.6 R2SDF (N=8)
    Fig. 3.7 R2SDF (N=8) data stream flow
    Fig. 3.8 (a) Relation between delay elements and butterfly operation modes in each stage, (b) control of twiddle factor in each stage
    Fig. 3.9 R2MDC N=16 (Radix-2 Multi-path Delay Commutator)
    Fig. 3.10 Unfolded delay elements of R2MDC
    Fig. 3.11 Radix-2 butterfly module
    Fig. 3.12 Two modes of the radix-2 commutator module
    Fig. 3.13 R2MDC (N=8)
    Fig. 3.14 R2MDC (N=8) data stream flow
    Fig. 3.15 (a) Relation between delay elements and commutator modes in each stage, (b) control of twiddle factor in each stage
    Fig. 3.16 Total DE numbers of R2MDC
    Fig. 3.17 Operation flow of single PE
    Fig. 3.18 Single PE FFT processor diagram
    Fig. 3.19 Ping-pong mode architecture
    Fig. 3.20 Partial data processing flow of 8-point FFT
    Fig. 3.21 The conflict graph and memory partition: (a) the colored conflict graph based on the radix-2 butterfly unit, (b) the 2-bank memory arrangement
    Fig. 4.1 Relationship between the PE number and different radix algorithm
    Fig. 4.2 R2³SDF PE in radix-8 butterfly
    Fig. 4.3 Different radix SDF architectures for performing 16-point DFT
    Fig. 4.4 R2MDC N=128
    Fig. 4.5 R2²MDC N=128
    Fig. 4.6 R2³MDC N=128
    Fig. 4.7 Signal data flow of 8-point DIF FFT
    Fig. 4.8 Relation between butterfly counts and ROM addresses in each time stage
    Fig. 4.9 Architecture of the address generator
    Fig. 4.10 Radix-2 butterfly module with mode selection
    Fig. 4.11 A simplified complex multiplication with $W_8^1$
    Fig. 4.12 Real multiplication without multipliers
    Fig. 4.13 Complex multiplier architecture using unsigned multipliers
    Fig. 4.14 Complex multiplier architecture using signed multipliers
    Fig. 4.15 Provided design model
    Fig. 4.16 SDF of 128-point and R2³SDF block diagram
    Fig. 4.17 128-point R2³SDF timing diagram
    Fig. 4.18 128-point R2³SDF block diagram with pipelined register and control circuit
    Fig. 4.19 128-point R2³SDF timing diagram with pipeline
    Fig. 4.20 128-point R2³SDF timing diagram with I/O information
    Fig. 4.21 Block diagram of 128-point R2³MDC
    Fig. 4.22 Timing diagram of 128-point R2³MDC
    Fig. 4.23 Block diagram of 128-point R2³MDC with pipelined and control circuit
    Fig. 4.24 Timing diagram of 128-point R2³MDC with pipelined register
    Fig. 4.25 128-point R2³MDC timing diagram with I/O information
    Fig. 4.26 Block diagram of 128-point memory-based architecture
    Fig. 4.27 Timing diagram of a 128-point memory-based architecture
    Fig. 4.28 Execution flow of the provided pipelined FFT architecture
    Fig. 4.29 Execution flow of the provided memory-based FFT architecture
    Fig. 5.1 Choosing an architecture based on the specified throughput rate
    Fig. 5.2 Radix-8 butterfly
    Fig. 5.3 Verification plan
    Fig. 5.4 SNR curve in the 128-point FFT

    Chapter 1

    Introduction

    The Discrete Fourier Transform (DFT) plays a key role in digital signal
    processing in areas such as radar processing, spectral analysis, frequency-domain
    filtering, and polyphase transformations. The DFT is an important component in many
    practical applications of discrete-time systems. The possibility of greatly reduced
    computation was generally overlooked until about 1965, when Cooley and Tukey
    published an algorithm [1] for the computation of the DFT that is applicable
    when N is a composite number. The publication of their paper touched off a flurry of
    activity in the application of the DFT to signal processing and resulted in the
    discovery of a number of highly efficient computational algorithms. Collectively, the
    entire set of such algorithms has come to be known as the Fast Fourier Transform, or
    the FFT.

    1.1 FFT Overview

    The DFT of a finite-length sequence of length N is

    $$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1.1}$$

    where $W_N^{nk} = e^{-j(2\pi/N)nk}$. The inverse discrete Fourier transform is given by

    $$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{1.2}$$

    To compute all N values of the N-point DFT therefore requires a total of $N^2$ complex
    multiplications and N(N-1) complex additions.
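As a reference point for the fast algorithms discussed later, the direct evaluation of equations (1.1) and (1.2) can be sketched as follows. This is only an illustrative sketch: the function names `dft`, `idft` and `direct_dft_cost` are our own assumptions and are not part of the thesis or of its C simulation model.

```python
import cmath

def dft(x):
    """Direct N-point DFT, equation (1.1): X[k] = sum_n x[n] * W_N^(nk)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Direct inverse DFT, equation (1.2), including the 1/N scale factor."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

def direct_dft_cost(N):
    """(complex multiplications, complex additions) of the direct method."""
    return N * N, N * (N - 1)

if __name__ == "__main__":
    x = [1, 2, 3, 4]
    X = dft(x)
    print(X)                    # spectrum of the 4-point example
    print(idft(X))              # recovers x up to rounding error
    print(direct_dft_cost(8))   # operation counts for N = 8
```

Each output sample is a sum of N products, which is where the $N^2$ multiplications and N(N-1) additions come from.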

    Most approaches to improving the efficiency of the computation of the DFT
    employ the symmetry, periodicity, compressibility and expansibility properties of
    $W_N^{nk}$ as below.

    1. $W_N^{-nk} = (W_N^{nk})^{*} = W_N^{(N-n)k}$ (symmetry)

    2. $W_N^{nk} = W_N^{(n+N)k} = W_N^{n(k+N)}$ (periodicity)

    3. $W_N^{nk} = W_{mN}^{mnk} = W_{N/m}^{nk/m}$ (compressibility and expansibility)

    We can conveniently observe the value of the twiddle factor $W_N^{nk}$ from Fig. 1.1.

    Fig. 1.1 The twiddle factor $W_N^{nk} = e^{-j(2\pi/N)nk}$ of FFT in the unit circle.

    By using Fig. 1.1, we can find the symmetry of the twiddle factor as

    $$W_N^{nk} = -W_N^{nk+N/2},$$

    $$W_N^{nk+N/4} = -W_N^{nk+3N/4} = -j\, W_N^{nk},$$

    $$W_8^{1} = -W_8^{5} = \frac{\sqrt{2}}{2}(1-j) \quad \text{and}$$

    $$W_8^{3} = -W_8^{7} = -\frac{\sqrt{2}}{2}(1+j).$$
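The twiddle-factor properties can be checked numerically. The sketch below assumes the convention $W_N^{e} = e^{-j(2\pi/N)e}$ and tests conjugate symmetry, periodicity, compressibility/expansibility, and the half-turn and quarter-turn relations read off the unit circle. The helper names `W` and `close` are hypothetical, chosen only for this illustration.

```python
import cmath

def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def close(a, b, tol=1e-12):
    """True when two complex numbers agree within a small tolerance."""
    return abs(a - b) < tol

N = 16
checks = []
for e in range(N):
    checks.append(close(W(N, -e), W(N, e).conjugate()))            # conjugate symmetry
    checks.append(close(W(N, e), W(N, e + N)))                     # periodicity
    checks.append(close(W(N, e), W(2 * N, 2 * e)))                 # compressibility/expansibility
    checks.append(close(W(N, e), -W(N, e + N // 2)))               # half turn: sign flip
    checks.append(close(W(N, e + N // 4), -1j * W(N, e)))          # quarter turn: factor -j
    checks.append(close(W(N, e + N // 4), -W(N, e + 3 * N // 4)))  # opposite points
all_hold = all(checks)

r = 2 ** 0.5 / 2
w8_ok = (close(W(8, 1), complex(r, -r)) and close(W(8, 5), complex(-r, r))
         and close(W(8, 3), complex(-r, -r)) and close(W(8, 7), complex(r, r)))

if __name__ == "__main__":
    print(all_hold, w8_ok)
```

The quarter-turn relation is the reason multiplications by $W_N^{N/4}$ reduce to swapping real and imaginary parts with a sign change, which costs no multiplier in hardware.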


    Fig. 1.2 OFDM transceiver block diagram

    Different wireless communication standards impose different specifications on
    their target applications. Moreover, even a single digital communication system may
    have several operation modes. Table 1.1 lists the FFT/IFFT sizes for several existing
    communication systems. As the table shows, designing a unified FFT/IFFT architecture
    is a challenge. In this thesis, we investigate how to develop an efficient FFT/IFFT
    compiler for different applications. Based on our development, not only can a
    dedicated FFT/IFFT module be easily prototyped for fast system verification, but the
    resulting compiler can also be used as a basis for more advanced research in the
    future.

    Table 1.1: FFT/IFFT size for OFDM-based communication system

    FFT/IFFT size    Communication systems
    8192             DVB-T, VDSL
    4096             DVB-H, VDSL
    2048             DVB-T, DAB, VDSL
    1024             DAB, VDSL
    512

    1.3 Organization of this Thesis

    This thesis is organized as follows:

    • Chapter 1 introduces the FFT and the motivation.

    • Chapter 2 reviews the general, high-radix, and split-radix FFT algorithms,
      and discusses their different objectives.

    • Chapter 3 discusses different implementations of the FFT algorithm.

    • Chapter 4 shows the methodology of the FFT/IFFT compiler design.

    • Chapter 5 describes the cost function, verification plan and performance
      evaluation.

    • Chapter 6 presents concluding remarks.

    Chapter 2

    FFT Algorithm

    The Cooley-Tukey FFT algorithm is very popular because it reduces the
    computational complexity from $O(N^2)$ to $O(N \log_2 N)$, and the regularity of the
    algorithm makes it suitable for VLSI implementation. To further reduce the
    computational complexity, high-radix and split-radix versions have been proposed. In
    general, all of these algorithms recursively decompose a length-N ($N = 2^v$) FFT into
    an odd half and an even half, and reduce the number of complex multiplications by
    exploiting the symmetry properties of the FFT kernel. High-radix FFT algorithms such
    as radix-4 and radix-8 [2] substantially reduce the number of arithmetic operations
    and data transfers compared to the general FFT algorithm [3]. The split-radix FFT
    algorithms such as radix-2/4 [4] and radix-2/8 [5] are the best in terms of
    multiplicative complexity for an N-point FFT when the multiplications by ±1 and ±j
    are skipped, but they are inherently irregular: radix-2 stages are used for the
    even-half components, and radix-4 or radix-8 stages are used for the odd-half
    components, which results in an L-shaped butterfly unit. Because of this
    irregularity, it is hard to design regular and modular pipelined hardware for the
    split-radix algorithms.

    2.1 General Algorithm

    The DFT of a finite-length sequence of length N is

    $$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{2.1}$$

    where $W_N^{nk} = e^{-j(2\pi/N)nk}$. The inverse discrete Fourier transform is given by

    $$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{2.2}$$

    In equations (2.1) and (2.2), both X[k] and x[n] may be complex. The expressions
    on the right-hand sides of those equations differ only in the sign of the exponent of
    $W_N^{nk}$ and in the scale factor 1/N.

    In computing the DFT, dramatic efficiency results from decomposing the
    computation into successively smaller DFT computations. We employ the symmetry,
    periodicity, compressibility and expansibility of the complex exponential
    $W_N^{nk} = e^{-j(2\pi/N)nk}$. Algorithms in which the decomposition is based on
    decomposing the input sequence x[n] into successively smaller subsequences are
    called decimation-in-time FFT algorithms. Alternatively, we can divide the output
    sequence X[k] into smaller and smaller subsequences in the same manner. FFT
    algorithms based on this procedure are commonly called decimation-in-frequency FFT
    algorithms.

    2.1.1 Decimation-In-Time (DIT) FFT Algorithms

    The principle of the DIT FFT algorithm is most conveniently illustrated by
    considering the special case of N an integer power of 2, i.e., $N = 2^v$. Since N is
    an even integer, we can consider computing X[k] by separating x[n] into two
    (N/2)-point sequences consisting of the even-numbered points in x[n] and the
    odd-numbered points in x[n]. With X[k] given by equation (2.1) and separating x[n]
    into its even- and odd-numbered points, we obtain

    $$X[k] = \sum_{n\ \mathrm{even}} x[n]\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x[n]\, W_N^{nk}, \tag{2.3}$$

    or, with the substitution of variables n = 2r for the even part and n = 2r + 1 for
    the odd part,

    $$\begin{aligned}
    X[k] &= \sum_{r=0}^{(N/2)-1} x[2r]\, W_N^{2rk} + \sum_{r=0}^{(N/2)-1} x[2r+1]\, W_N^{(2r+1)k} \\
         &= \sum_{r=0}^{(N/2)-1} x[2r]\, (W_N^{2})^{rk} + W_N^{k} \sum_{r=0}^{(N/2)-1} x[2r+1]\, (W_N^{2})^{rk}.
    \end{aligned} \tag{2.4}$$

    The relation $W_N^{2} = W_{N/2}$ follows from the compressibility property, since

    $$W_N^{2} = e^{-2j(2\pi/N)} = e^{-j2\pi/(N/2)} = W_{N/2}. \tag{2.5}$$

    Consequently, equation (2.4) can be rewritten as

    $$X[k] = \sum_{r=0}^{(N/2)-1} x[2r]\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{(N/2)-1} x[2r+1]\, W_{N/2}^{rk}, \quad k = 0, 1, \ldots, N-1. \tag{2.6}$$

    Each of the sums in equation (2.6) is recognized as an (N/2)-point DFT, the first
    sum being the (N/2)-point DFT of the even-numbered points of the original sequence
    and the second being the (N/2)-point DFT of the odd-numbered points of the original
    sequence. Only the sum over the odd-numbered points is multiplied by the factor
    $W_N^{k}$. Fig. 2.1 depicts this computation for N = 8.

    In the same way, each (N/2)-point DFT can be decomposed into even and odd parts,
    giving two (N/4)-point DFTs. Only the odd part is multiplied by a twiddle factor,
    $W_{N/2}^{k} = W_N^{2k}$, using the fact that $W_{N/2} = W_N^{2}$. Applying this
    decomposition to the signal flow graph of Fig. 2.1, we obtain the signal flow graph
    of Fig. 2.2.
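The even/odd decomposition can be written as a short recursive routine. The sketch below evaluates X[k] as the DFT of the even-numbered points plus $W_N^{k}$ times the DFT of the odd-numbered points, for k = 0, ..., N-1, and relies on the fact that an (N/2)-point DFT is periodic in k with period N/2. The names `fft_dit`, `twiddle`, `G` and `H` are our own assumptions for illustration; this is a software sketch, not the hardware architecture of the thesis.

```python
import cmath

def twiddle(N, e):
    """W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]       # a 1-point DFT is the sample itself
    if N % 2:
        raise ValueError("length must be a power of 2")
    G = fft_dit(x[0::2])             # (N/2)-point DFT of the even-numbered points
    H = fft_dit(x[1::2])             # (N/2)-point DFT of the odd-numbered points
    half = N // 2
    # G and H are periodic with period N/2, so index them modulo N/2.
    return [G[k % half] + twiddle(N, k) * H[k % half] for k in range(N)]

if __name__ == "__main__":
    print(fft_dit([1, 2, 3, 4, 5, 6, 7, 8]))
```

Written this way, every output still uses its own twiddle multiplication, so each pair of outputs costs two complex multiplications.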

    Fig. 2.1 Signal flow graph of the decimation-in-time decomposition of an N-point
    DFT (N = 8) computation into two (N/2)-point DFT computations.

    Fig. 2.2 Signal flow graph of the decimation-in-time decomposition of two
    (N/2)-point DFT (N = 8) computation into four (N/4)-point DFT computations.

    For the 8-point DFT that we have been using as an illustration, the computation
    has been reduced to a computation of 2-point DFTs. The 2-point DFT of the sequence
    consisting of x[0] and x[4] is depicted in Fig. 2.3. With the computation of Fig. 2.3
    inserted in the signal flow graph of Fig. 2.2, we obtain the complete flow graph for
    computation of the 8-point DFT, as shown in Fig. 2.4.

    Fig. 2.3 Signal flow graph of a 2-point DFT, with branch gains $W_N^{0} = 1$ and
    $W_2 = W_N^{N/2} = -1$.

    For the more general case, we proceed by decomposing the (N/4)-point transforms
    further in the same way until we are left with only 2-point transforms. This
    requires $v = \log_2 N$ stages of computation. If $N = 2^v$, the decomposition can be
    carried out at most $v = \log_2 N$ times, so that after carrying it out as many times
    as possible, the number of complex multiplications and additions is equal to
    $Nv = N \log_2 N$. This is the substantial computational saving that we indicated
    earlier.
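To give a feeling for the size of this saving, the following sketch compares the $N^2$ multiplications of the direct DFT with the $Nv = N \log_2 N$ operations of the decomposed computation for the transform sizes 512 to 8192. The function names are hypothetical and the counts are the simple closed-form expressions only; they ignore trivial multiplications by ±1 and ±j.

```python
def stage_count(N):
    """Return v = log2(N) for N = 2**v; reject other lengths."""
    v = N.bit_length() - 1
    if N < 2 or (1 << v) != N:
        raise ValueError("N must be a power of 2")
    return v

def op_counts(N):
    """Closed-form operation counts for the direct DFT and the radix-2 FFT."""
    v = stage_count(N)
    return {"stages": v, "direct": N * N, "fft": N * v}

if __name__ == "__main__":
    for N in (512, 1024, 2048, 4096, 8192):
        c = op_counts(N)
        # Columns: size, stages, direct count, FFT count, approximate ratio.
        print(N, c["stages"], c["direct"], c["fft"], c["direct"] // c["fft"])
```

The ratio between the two counts is N / log2 N, so the advantage of the FFT grows with the transform size.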

    Fig. 2.5 Signal flow graph of radix-2 butterfly computation in Fig. 2.4, with inputs
    A and B and outputs $A' = A + B\,W_N^{p}$ and $B' = A + B\,W_N^{p+N/2}$.

    Computation in the signal flow graph of Fig. 2.4 can be reduced further by using the properties of the coefficients $W_N^{r}$. We first note that, in proceeding from one stage to the next in Fig. 2.4, the basic computation has the form of Fig. 2.5. This elementary computation is called a radix-2 butterfly. Since

    $$W_N^{N/2} = e^{-j(2\pi/N)N/2} = e^{-j\pi} = -1,$$

    the factor $W_N^{p+N/2}$ can be written as

    $$W_N^{p+N/2} = W_N^{N/2}\,W_N^{p} = -W_N^{p}. \qquad (2.8)$$

    With this observation, the butterfly computation of Fig. 2.5 can be simplified to the form shown in Fig. 2.6, which requires only one complex multiplication instead of two. Using the basic signal flow graph of Fig. 2.6 as a replacement for butterflies of the form of Fig. 2.5, we obtain the signal flow graph of Fig. 2.7 from Fig. 2.4. In particular, the number of complex multiplications has been reduced by a factor of 2 over the number in Fig. 2.4.
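The saving from equation (2.8) can be checked numerically. The following is a minimal sketch, not the thesis implementation; the function names `twiddle`, `butterfly_two_mults` and `butterfly_one_mult` are hypothetical, and the convention $W_N = e^{-j2\pi/N}$ is assumed.

```python
import cmath


def twiddle(N, k):
    """Return W_N^k = exp(-j*2*pi*k/N) (assumed sign convention)."""
    return cmath.exp(-2j * cmath.pi * k / N)


def butterfly_two_mults(a, b, N, p):
    """Butterfly in the form of Fig. 2.5: two complex multiplications."""
    return a + b * twiddle(N, p), a + b * twiddle(N, p + N // 2)


def butterfly_one_mult(a, b, N, p):
    """Simplified butterfly in the form of Fig. 2.6: one complex multiplication.

    Uses W_N^(p+N/2) = -W_N^p, so the product b*W_N^p is computed once
    and then added to and subtracted from a.
    """
    t = b * twiddle(N, p)
    return a + t, a - t
```

Both functions return the same pair of outputs for any inputs, which is why the simplified form can replace every butterfly in the flow graph.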

    Fig. 2.6 Signal flow graph of the simplified butterfly computation with only one complex multiplication.

    Fig. 2.7 Signal flow graph of the 8-point DFT using the butterfly computation of Fig. 2.6.

    2.1.2 Decimation-In-Frequency (DIF) FFT Algorithms

    We can consider partitioning the frequency-domain output sequence X[k] into smaller and smaller subsequences in the same manner. FFT algorithms based on this process are commonly called decimation-in-frequency (DIF) FFT algorithms.

    To develop these FFT algorithms, let us again restrict the discussion to N a power of 2 and consider computing separately the even-numbered frequency samples and the odd-numbered frequency samples. Since

    $$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.9)$$

    the even-numbered frequency samples are

    $$X[2r] = \sum_{n=0}^{N-1} x[n]\,W_N^{2nr}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.10)$$

    which can be written as

    $$X[2r] = \sum_{n=0}^{(N/2)-1} x[n]\,W_N^{2nr} + \sum_{n=N/2}^{N-1} x[n]\,W_N^{2nr}. \qquad (2.11)$$

    With a substitution of variables in the second summation in equation (2.11), we obtain

    $$X[2r] = \sum_{n=0}^{(N/2)-1} x[n]\,W_N^{2nr} + \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\,W_N^{2r[n+(N/2)]}. \qquad (2.12)$$

    Because of the periodicity of $W_N^{2rn}$,

    $$W_N^{2r[n+(N/2)]} = W_N^{2rn}\,W_N^{rN} = W_N^{2rn}. \qquad (2.13)$$

    Using (2.13), and since $W_N^{2} = W_{N/2}$, equation (2.12) can be expressed as

    $$X[2r] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.14)$$

    Equation (2.14) is the (N/2)-point DFT of the (N/2)-point sequence obtained by adding the first half and the last half of the input sequence. Adding the two halves of the input sequence represents time aliasing, consistent with the fact that in computing only the even-numbered frequency samples, we are under-sampling the Fourier transform of x[n].

    We can now consider obtaining the odd-numbered frequency points, given by

    $$X[2r+1] = \sum_{n=0}^{N-1} x[n]\,W_N^{n(2r+1)}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.15)$$

    As before, this can be written as

    $$X[2r+1] = \sum_{n=0}^{(N/2)-1} x[n]\,W_N^{n(2r+1)} + \sum_{n=N/2}^{N-1} x[n]\,W_N^{n(2r+1)}. \qquad (2.16)$$

    An alternative form for the second summation in equation (2.16) is

    $$\sum_{n=N/2}^{N-1} x[n]\,W_N^{n(2r+1)} = \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\,W_N^{[n+(N/2)](2r+1)} = W_N^{(N/2)(2r+1)} \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\,W_N^{n(2r+1)} = -\sum_{n=0}^{(N/2)-1} x[n+(N/2)]\,W_N^{n(2r+1)}, \qquad (2.17)$$

    where we have employed the fact that $W_N^{(N/2)2r} = 1$ and $W_N^{N/2} = -1$. Substituting equation (2.17) into equation (2.16) and combining the two summations, we obtain

    $$X[2r+1] = \sum_{n=0}^{(N/2)-1} \big(x[n] - x[n+(N/2)]\big)\,W_N^{n(2r+1)}, \qquad (2.18)$$

    or, since $W_N^{2} = W_{N/2}$,

    $$X[2r+1] = \sum_{n=0}^{(N/2)-1} \big(x[n] - x[n+(N/2)]\big)\,W_N^{n}\,W_{N/2}^{nr}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.19)$$

    Equation (2.19) is the (N/2)-point DFT of the sequence obtained by subtracting the second half of the input sequence from the first half and multiplying the resulting sequence by $W_N^{n}$. On the basis of equations (2.14) and (2.19), with g[n] = x[n] + x[n+N/2] and h[n] = x[n] - x[n+N/2], the DFT can be computed by first forming the sequences g[n] and h[n], then computing $h[n]W_N^{n}$, and finally computing the (N/2)-point DFTs of these two sequences to obtain the even-numbered output points and the odd-numbered output points, respectively. The procedure suggested by equations (2.14) and (2.19) is illustrated for the case of an 8-point DFT in Fig. 2.8.
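The procedure of equations (2.14) and (2.19) can be sketched in a few lines. This is an illustrative check only; the helper names `dft`, `dif_stage` and `dft_via_one_dif_stage` are hypothetical, and a direct $O(N^2)$ DFT is used for the half-size transforms to keep the sketch short.

```python
import cmath


def dft(x):
    """Direct DFT, X[k] = sum_n x[n] * W_N^(n*k), used here as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def dif_stage(x):
    """Form g[n] = x[n] + x[n+N/2] and h[n]*W_N^n with h[n] = x[n] - x[n+N/2]."""
    N = len(x)
    half = N // 2
    g = [x[n] + x[n + half] for n in range(half)]
    h = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    return g, h


def dft_via_one_dif_stage(x):
    """Even outputs come from the DFT of g, odd outputs from the DFT of h*W_N^n."""
    g, h = dif_stage(x)
    X = [0j] * len(x)
    X[0::2] = dft(g)   # X[2r],   equation (2.14)
    X[1::2] = dft(h)   # X[2r+1], equation (2.19)
    return X
```

Applying `dif_stage` recursively to the two half-length sequences, instead of calling `dft`, gives the full DIF algorithm.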

    Fig. 2.8 Signal flow graph of the DIF decomposition of an N-point DFT computation into two (N/2)-point DFT computations (N = 8).

    Consequently, the (N/2)-point DFTs can be computed by computing the even-numbered and odd-numbered output points for those DFTs separately. As in the case of the procedure leading to equations (2.14) and (2.19), this is accomplished by combining the first half and the last half of the input points for each of the (N/2)-point DFTs and then computing (N/4)-point DFTs. The signal flow graph resulting from taking this step for the 8-point example is shown in Fig. 2.9.

    Fig. 2.9 Signal flow graph of the decimation-in-frequency decomposition of an 8-point DFT computation into four 2-point DFT computations.

    For the 8-point example, the computation has now been reduced to the computation of 2-point DFTs, which are implemented by adding and subtracting the input points, as discussed previously. Thus, the 2-point DFTs in Fig. 2.9 can be replaced by the computation shown in Fig. 2.10, so the computation of the 8-point DFT and the 16-point DFT can be accomplished by the algorithms depicted in Fig. 2.11 and Fig. 2.12.

    Fig. 2.10 Signal flow graph of a typical 2-point DFT at the last stage of decomposition.

    Fig. 2.11 Signal flow graph of the complete DIF decomposition of an 8-point DFT computation.

    By counting the arithmetic operations in Fig. 2.11 and generalizing to $N = 2^v$, we observe that the computation of Fig. 2.11 requires $(N/2)\log_2 N$ complex multiplications and $N\log_2 N$ complex additions. Thus, the total number of computations is the same as for the decimation-in-time algorithms.


    the output DFT in bit-reversed order. The signal flow graph previously shown in Fig. 2.7 begins with the input sequence in bit-reversed order and provides the output DFT in normal order.
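The bit-reversed output ordering of the DIF flow graph can be illustrated as follows. This is a minimal in-place sketch under the assumption that N is a power of 2 and the input is in normal order; `bit_reverse`, `dft_direct` and `fft_dif_inplace` are hypothetical names.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dft_direct(x):
    """Direct O(N^2) DFT used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def fft_dif_inplace(x):
    """Radix-2 DIF FFT: normal-order input, bit-reversed-order output."""
    a = list(x)
    N = len(a)
    span = N // 2                      # distance between butterfly inputs
    while span >= 1:
        for start in range(0, N, 2 * span):
            for n in range(span):
                w = cmath.exp(-2j * cmath.pi * n / (2 * span))
                u, v = a[start + n], a[start + n + span]
                a[start + n] = u + v
                a[start + n + span] = (u - v) * w
        span //= 2
    return a                           # a[p] holds X[bit_reverse(p, log2 N)]
```

Reading position p of the result as X[bit_reverse(p, log2 N)] recovers the DFT in normal order.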

    2.2 High-Radix Algorithm

    High-radix FFT algorithms such as radix-4 and radix-8 reduce the computational complexity further. Compared with the radix-2 FFT algorithm, they require fewer arithmetic operations and data transfers, and they preserve the regular structure that makes pipelined hardware convenient to implement. Here, we consider an input sequence in normal order, based on the decimation-in-frequency (DIF) FFT algorithm.

    2.2.1 Radix-4 DIF FFT Algorithm

    To develop the radix-4 DIF FFT algorithm, let us again restrict the discussion to N a power of 4, i.e., $N = 4^v$, and consider computing separately the even-numbered frequency samples and the odd-numbered frequency samples from the radix-2 DIF FFT algorithm. Since

    $$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.21)$$

    the even-numbered frequency samples are

    $$X[2r] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.22)$$

    and the odd-numbered frequency samples are

    $$X[2r+1] = \sum_{n=0}^{(N/2)-1} \big(x[n] - x[n+(N/2)]\big)\,W_N^{n}\,W_{N/2}^{nr}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.23)$$

    Equations (2.22) and (2.23) are each separated further into even-numbered and odd-numbered frequency samples, which gives the radix-4 decomposition. Let r = 2s select the even part and r = 2s + 1 select the odd part. First, substituting r = 2s into equation (2.22) gives

    $$X[4s] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/2}^{2sn}, \quad s = 0, 1, \ldots, (N/4)-1. \qquad (2.24)$$

    Using the fact that $W_{N/2}^{2sn} = W_{N/4}^{sn}$ in equation (2.24), we obtain

    $$X[4s] = \sum_{n=0}^{(N/4)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/4}^{sn} + \sum_{n=N/4}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/4}^{sn}$$

    $$\phantom{X[4s]} = \sum_{n=0}^{(N/4)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/4}^{sn} + \sum_{n=0}^{(N/4)-1} \big(x[n+(N/4)] + x[n+(3N/4)]\big)\,W_{N/4}^{s[n+(N/4)]}. \qquad (2.25)$$

    Because of the periodicity of $W_{N/4}^{sn}$,

    $$W_{N/4}^{s[n+(N/4)]} = W_{N/4}^{sn}\,W_{N/4}^{s(N/4)} = W_{N/4}^{sn}, \qquad (2.26)$$

    equation (2.25) can be expressed as

    $$X[4s] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] + x[n+(N/2)]\big) + \big(x[n+(N/4)] + x[n+(3N/4)]\big)\Big]\,W_{N/4}^{sn}. \qquad (2.27)$$

    Equation (2.27) is the (N/4)-point DFT of the (N/4)-point sequence obtained by adding the four parts of the input sequence. Adding the four parts of the input sequence represents time aliasing, consistent with the fact that in computing only the even part of the even-numbered frequency samples, we are under-sampling the Fourier transform of x[n].


    2. Substituting r = 2s into (2.23) gives the even part of equation (2.23).

    3. Substituting r = 2s + 1 into (2.23) gives the odd part of equation (2.23).

    Then, we obtain the other three equations as follows:

    $$X[4s+2] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] + x[n+(N/2)]\big) - \big(x[n+(N/4)] + x[n+(3N/4)]\big)\Big]\,W_N^{2n}\,W_{N/4}^{sn}, \qquad (2.28)$$

    $$X[4s+1] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] - x[n+(N/2)]\big) - j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big]\,W_N^{n}\,W_{N/4}^{sn}, \qquad (2.29)$$

    $$X[4s+3] = \sum_{n=0}^{(N/4)-1} \Big[\big(x[n] - x[n+(N/2)]\big) + j\big(x[n+(N/4)] - x[n+(3N/4)]\big)\Big]\,W_N^{3n}\,W_{N/4}^{sn}. \qquad (2.30)$$

    Fig. 2.13 Radix-4 butterfly

    We can observe from Fig. 2.13 that the twiddle factors, such as $W_N^{n}$, $W_N^{2n}$ and $W_N^{3n}$, are extracted only in stage 2, while stage 1 involves only multiplication by the -j term. The "-j" multiplication in Fig. 2.13 only interchanges the real part and the imaginary part of the complex multiplicand and negates its real part, so it needs no multiplier.

    Radix-4 decomposition for direct implementation of the DFT reduces the number of multiplications from $N^2$ to $N(\log_4 N - 1)$, which is also fewer than for radix-2 decomposition.
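One radix-4 DIF stage following equations (2.27) to (2.30) can be sketched as below. This is an illustrative check rather than the thesis hardware; the names `W`, `dft`, `mul_by_minus_j` and `radix4_dif_stage` are hypothetical, and the -j multiplication is written as a swap with one sign change to mirror the multiplier-free form described above.

```python
import cmath


def W(N, k):
    """Twiddle factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)


def dft(x):
    """Direct DFT used as the reference and for the quarter-size transforms."""
    N = len(x)
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]


def mul_by_minus_j(z):
    """(a + jb) * (-j) = b - ja: swap the parts and negate the old real part."""
    return complex(z.imag, -z.real)


def radix4_dif_stage(x):
    """Return four N/4-point sequences y[m] whose DFTs give X[4s + m]."""
    N = len(x)
    q = N // 4
    y = [[], [], [], []]
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        s0, s1 = a + c, b + d          # sums used by (2.27) and (2.28)
        d0, d1 = a - c, b - d          # differences used by (2.29) and (2.30)
        y[0].append(s0 + s1)                                   # (2.27)
        y[2].append((s0 - s1) * W(N, 2 * n))                   # (2.28)
        y[1].append((d0 + mul_by_minus_j(d1)) * W(N, n))       # (2.29)
        y[3].append((d0 - mul_by_minus_j(d1)) * W(N, 3 * n))   # (2.30)
    return y
```

The only general complex multiplications are the three twiddle products per group of four inputs; everything inside the butterfly is additions, subtractions and the swap for -j.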

    Fig. 2.14 16-point radix-4 DIF FFT

    2.2.2 Radix-8 DIF FFT Algorithm

    algorithm. Since

    $$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$

    Fig. 2.18 A sketch map of 16-point radix-2/4 DIF FFT.

    Fig. 2.19 Signal flow graph of 16-point radix-2/4 DIF FFT

    The radix-2/4 decomposition extracts more -j terms than the radix-4 decomposition, but we can observe from Fig. 2.19 that -j and $W_N$ terms are mixed in time stage 2. In a pipelined hardware design, a stage that mixes -j and $W_N$ terms still needs a complex multiplier to perform all of its multiplications, hence the radix-2/4 decomposition gains no reduction in a general pipelined hardware design.

    2.3.2 Radix-2/8 DIF FFT Algorithm

    To develop the radix-2/8 DIF FFT algorithm, let us again restrict the discussion to N a power of 2, i.e., $N = 2^v$, and consider computing separately the odd-numbered frequency samples from the radix-2/4 DIF FFT algorithm. Since

    $$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.47)$$

    the even-numbered frequency samples of the radix-2/4 decomposition are

    $$X[2r] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big)\,W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.48)$$

    and the odd-numbered frequency samples of radix-2/4 decomposition are

    N4)"1[(x[n]x[n+(N/2)]) JM,1X[4s+1]: Z j(x[n+(N/4)]x[n+(3N/4)])WWW"' (Z49)n=0

    M 371

    j(x[n+(N/4)]x[n--(3N/4)])jW"W'(Z50)X[4s+3]= 2n=0(N4)1[(x[n]x[n+(N/2)])--Next, we continuously separateequation (2.49) and (2.50) into even and odd part. Let

    s = 2k and s = 2k + 1 substitute into (2.49) and (2.50), we can obtain

    x[n+(N/2)])j+X[gk+1]: avflj(x[n+(N/4)]x[n+(3N/4)])

    ,,=o1[(x[n+(N/8)]x[n+(5N/8)])JW8 .](x[n+(3N/8)] x[n+(7N/8)])W,;,W,;. (2.51)

    (x[n]x[n+(N/2)D _

    X[8k+5]:


    Equations (2.48), (2.51), (2.52), (2.53) and (2.54) together give the signal flow graph of an 8-point DFT shown in Fig. 2.20, which is called the radix-2/8 butterfly. Substituting Fig. 2.20 into a 16-point DFT gives the sketch in Fig. 2.21, which is drawn in full in Fig. 2.22.

    Fig. 2.21 A sketch map of 16-point radix-2/8 DIF FFT

    Fig. 2.22 Signal flow graph of 16-point radix-2/8 DIF FFT

    The radix-2/8 decomposition extracts more constant multiplications by $W_8^{1}$ and $W_8^{3}$ than the radix-8 decomposition, but we can observe from Fig. 2.22 that the -j, $W_8^{1}$, $W_8^{3}$ and $W_N$ terms are mixed in the same time stage 2. In a general pipelined hardware design, a time stage that mixes -j, $W_8^{1}$, $W_8^{3}$ and $W_N$ terms uses one complex multiplier to perform all of these multiplications, so the radix-2/8 decomposition gains no reduction in the pipelined hardware design.


    2.4 Complexity Analysis

    So far we have compared the reduction in complex multiplications achieved by the general (radix-2) algorithm, the high-radix algorithms and the split-radix algorithms. We can conclude that a higher-radix decomposition removes more complex multipliers. When N is not a power of 4, the radix-4 decomposition needs an extra radix-2 decomposition to complete the N-point DFT. Placing the extra radix-2 decomposition in the first time stage or in the last time stage of the signal flow graph leads to a different reduction. For example, 32 is not a power of 4, and the FFT needs $\log_2 32 = 5$ time stages to perform a 32-point DFT. If we use the radix-4 algorithm to decompose the 32-point DFT, one time stage remains, and an extra radix-2 decomposition in the first or last time stage of the signal flow graph covers it. In Fig. 2.23, the radix-2 decomposition applied at the first stage decomposes the 32-point DFT into two 16-point DFTs, each of which is then completed by radix-4 decomposition. We can observe the twiddle factors

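    $$W_N^{n} \Rightarrow W_{32}^{0}, W_{32}^{1}, W_{32}^{2}, \ldots, W_{32}^{15}, \quad n = 0, 1, \ldots, 15. \qquad (2.55)$$

    $$W_{16}^{2n}: \; W_{32}^{0}, W_{32}^{4}, W_{32}^{8}, W_{32}^{12}, \quad n = 0, 1, 2, 3. \qquad (2.56)$$

    $$W_{16}^{n}: \; W_{32}^{0}, W_{32}^{2}, W_{32}^{4}, W_{32}^{6}, \quad n = 0, 1, 2, 3. \qquad (2.57)$$

    $$W_{16}^{3n}: \; W_{32}^{0}, W_{32}^{6}, W_{32}^{12}, W_{32}^{18}, \quad n = 0, 1, 2, 3. \qquad (2.58)$$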

    Here $W_{32}^{4} = W_8^{1}$, $W_{32}^{12} = W_8^{3}$, $W_{32}^{20} = W_8^{5}$ and $W_{32}^{28} = W_8^{7}$ are constant multiplications; $W_{32}^{0} = 1$, $W_{32}^{8} = W_4^{1}$, $W_{32}^{16} = W_2^{1}$ and $W_{32}^{24} = W_4^{3}$ require no complex multiplication; and the others are complex multiplications. The total number of complex multiplications is 20 and the total number of constant multiplications is 10.

    Fig. 2.23 Signal flow graph of 32-point radix-4 DIF FFT based on radix-2 decomposition in the first stage and two radix-4 stages in the following stages.

    Fig. 2.24 Signal flow graph of 32-point radix-4 DIF FFT based on two radix-4 stages and one radix-2 stage.

    Table 2.1: Complexity analysis of twiddle factor for radix-4 DIF FFT algorithm

    Columns: DFT # | radix-2 at first stage (if needed): const mul #*, comp mul #* | radix-2 at last stage (if needed): const mul #, comp mul #

    * const mul #: number of constant multiplications

    * comp mul #: number of complex multiplications

    If radix-4 decomposition is used to decompose the 32-point DFT first, one time stage remains at the end. Therefore, radix-2 decomposition is applied to the last time stage of the 32-point DFT, as shown in Fig. 2.24. We can observe the twiddle factors

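    $$W_{32}^{2n}: \; W_{32}^{0}, W_{32}^{2}, W_{32}^{4}, \ldots, W_{32}^{14}, \quad n = 0, 1, \ldots, 7. \qquad (2.59)$$

    $$W_{32}^{n}: \; W_{32}^{0}, W_{32}^{1}, W_{32}^{2}, W_{32}^{3}, W_{32}^{4}, W_{32}^{5}, W_{32}^{6}, W_{32}^{7}, \quad n = 0, 1, \ldots, 7. \qquad (2.60)$$

    $$W_{32}^{3n}: \; W_{32}^{0}, W_{32}^{3}, W_{32}^{6}, W_{32}^{9}, W_{32}^{12}, W_{32}^{15}, W_{32}^{18}, W_{32}^{21}, \quad n = 0, 1, \ldots, 7. \qquad (2.61)$$

    $$W_{8}^{2n}: \; W_{32}^{0}, W_{32}^{8}, \quad n = 0, 1. \qquad (2.62)$$

    $$W_{8}^{n}: \; W_{32}^{0}, W_{32}^{4}, \quad n = 0, 1. \qquad (2.63)$$

    $$W_{8}^{3n}: \; W_{32}^{0}, W_{32}^{12}, \quad n = 0, 1. \qquad (2.64)$$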

    Here $W_{32}^{4} = W_8^{1}$, $W_{32}^{12} = W_8^{3}$, $W_{32}^{20} = W_8^{5}$ and $W_{32}^{28} = W_8^{7}$ are constant multiplications, and the others are complex multiplications. The total number of complex multiplications is 16 and the total number of constant multiplications is 12. Therefore, using radix-4 decomposition first and radix-2 decomposition at the last stage removes more complex multiplications when N is not a power of 4. Table 2.1 shows the different reductions in complex multiplications for the two ways of inserting the radix-2 stage. We can conclude that using the high-radix decomposition first gives the best performance.
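The two totals quoted for the 32-point example can be reproduced with a small tally. The sketch below is one possible way to count, under these assumptions: multiplications by ±1 and ±j are treated as free, odd multiples of $W_8$ count as constant multiplications, and every other twiddle counts as a complex multiplication. The names `classify` and `count_twiddles` are hypothetical.

```python
def classify(exponent, N):
    """Classify W_N^exponent as 'trivial', 'constant' or 'complex'."""
    e = exponent % N
    if (4 * e) % N == 0:       # multiples of N/4: +1, -j, -1, +j
        return "trivial"
    if (8 * e) % N == 0:       # odd multiples of N/8: W_8^1, W_8^3, W_8^5, W_8^7
        return "constant"
    return "complex"


def count_twiddles(N, radices):
    """Tally twiddle factors of a mixed-radix DIF FFT with the given stage radices.

    A radix-r stage acting on sub-DFTs of length `size` applies the twiddles
    W_size^(q*n) for q = 0..r-1 and n = 0..size/r-1, in each of N/size sub-DFTs.
    """
    counts = {"trivial": 0, "constant": 0, "complex": 0}
    size = N
    for r in radices:
        blocks = N // size
        for q in range(r):
            for n in range(size // r):
                # W_size^(q*n) expressed as a power of W_N
                counts[classify(q * n * (N // size), N)] += blocks
        size //= r
    assert size == 1, "product of radices must equal N"
    return counts


radix2_first = count_twiddles(32, [2, 4, 4])   # arrangement of Fig. 2.23
radix2_last = count_twiddles(32, [4, 4, 2])    # arrangement of Fig. 2.24
```

With these counting rules, `radix2_first` gives 10 constant and 20 complex multiplications, and `radix2_last` gives 12 constant and 16 complex multiplications, matching the totals stated in the text.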

    Table 2.2: Complexity analysis of twiddle factor for radix-8 DIF FFT algorithm

    Columns: DFT # | radix-2 or 4 at first stage (if needed) | radix-2 or 4 at last stage (if needed)

    (a) (b)

    When N is not a power of 8, there are two situations: the factor of N left over after the radix-8 stages is either 4 or 2. When the leftover factor is 4, a radix-4 decomposition is applied at the last stage to complete the computation. Similarly, when the leftover factor is 2, a radix-2 decomposition is applied at the last stage. The resulting reduction in complex multipliers for different DFT sizes is summarized in Table 2.2(a). Now consider the implementation of the pipelined hardware. If constant and complex multiplications occur in the same time stage of the SFG, the complex multiplier can compute both, so in that case we count the constant multiplications as complex multiplications. Taking this implementation issue into account, the reduction in complex multipliers for different DFT sizes is summarized in Table 2.2(b).

    Table 2.3 shows the number of complex multiplications under the different radix decompositions. The radix-4 FFT algorithm extracts -j terms to remove part of the complex multiplications of the radix-2 FFT algorithm. The radix-2/4 FFT algorithm extracts more -j terms than the radix-4 FFT algorithm and so removes more of the complex multiplications of the radix-2 FFT algorithm, but its irregular structure makes a pipelined hardware design difficult to implement.

    Table 2.3: Complex multiplications required for radix-2, radix-4 and radix-2/4 FFT

    algorithms

    Complex Multiplication #

    Table 2.4 shows the number of complex multiplications with the radix-8 and radix-2/8 decompositions. The radix-8 FFT algorithm extracts the -j, $W_8^{1}$ and $W_8^{3}$ terms to reduce the complex multiplications of the radix-4 FFT algorithm. The radix-2/8 FFT algorithm extracts more constant multiplications, $W_8^{1}$ and $W_8^{3}$, than the radix-8 FFT algorithm, but its irregular structure makes pipelined hardware difficult to implement.

    Table 2.4: Complex multiplications required for radix-8 and radix-2/8 FFT.

    Columns: DFT # | Radix-8 DIF FFT algorithm: const mul #*, comp mul #* | Radix-2/8 DIF FFT algorithm: const mul #, comp mul #

    * const mul #: number of constant multiplications

    * comp mul #: number of complex multiplications

    Chapter 3

    FFT/IFFT Architecture

    In this chapter, we discuss two methods of implementing the FFT algorithm: reusing a single butterfly and fully spreading the butterflies, as shown in Fig. 3.1. Table 3.1 compares their speed, area and control complexity. The single-butterfly implementation employs a single process element (PE for short). Using a single PE to implement the FFT algorithm is called the memory-based FFT architecture. The input, intermediate and output data are stored in memory, so the bottleneck is the memory access time. The fully spread implementation is generally called the pipelined FFT architecture. It offers real-time, non-stop operation and the lowest memory requirement. The number of PEs needed is proportional to $\log_r N$, where r is the radix of the butterfly and N is the size of the DFT.

    (a) Single Process Element (b) Fully Spread

    Fig. 3.1 Two extreme methods of implementing the FFT algorithm.

    Table 3.1: Comparison of single butterfly and fully spread architecture

    3.1 Pipelined FFT Architecture

    The architecture design of pipelined FFT processors has been the subject of intensive research since as early as the 1970s, when real-time processing was demanded in applications such as radar signal processing [8], well before VLSI technology had advanced to the level of system integration. The pipelined architecture is characterized by real-time, non-stop processing as the data sequence passes through the processor. In addition, the pipelined structure is highly regular, so it can be easily scaled and parameterized when a Hardware Description Language (HDL) is used in the design. The basic framework of the pipelined FFT architecture is shown in Fig. 3.2.

    Fig. 3.2 Basic framework of pipelined FFT.

    The delay element can be implemented with a single path or multiple paths. The butterfly element can be implemented with radix-2 or radix-4. The optimal memory requirement is N - 1, where N is a power of 2. Furthermore, different assumptions about the input and output sequence order lead to different pipelined FFT architectures, as do single or multiple input and output sequences. Several architectures have been proposed over the last three decades. Here the different approaches are put into functional blocks with unified terminology, where the additive butterflies have been separated from the multipliers to show the hardware requirements distinctly.

    We assume that the input sequence is in normal order and that the output is allowed to be in digit-reversed (radix-2 or radix-4) order, which is permissible in applications such as DFT-based communication systems. The single-path pipelined architecture that uses the radix-2 DIF FFT algorithm is called Radix-2 Single-path Delay Feedback (R2SDF) [9]. The multiple-path pipelined architecture that uses the radix-2 DIF FFT algorithm is called Radix-2 Multi-path Delay Commutator (R2MDC) [8]. These two are the common pipelined architectures. Other proposed pipelined architectures use different radix FFT algorithms to extend these two basic architectures, as given in Table 3.2. The R2²SDF has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm. The R2³SDF has the same multiplicative complexity as the radix-8 algorithm and also retains the butterfly structure of the radix-2 algorithm. We can find that implementing SDF with higher-radix algorithms reduces the number of complex multipliers, but implementing MDC in the same way does not. If R2²MDC or R2³MDC is used to implement the pipelined FFT architecture, can the multiplicative complexity be reduced? We discuss this in the next chapter. Table 3.3 compares the hardware utilization ratio of the different radix algorithms and architectures, which indicates which architectures perform well.

    Table 3.3: Comparison of hardware utilization

    In Table 3.3, we can see that architectures with higher-radix algorithms have a higher multiplier utilization ratio. The SDF architectures have a higher FIFO utilization ratio, 100%, than the MDC architectures. The R2³SDF and R2³MDC, which are based on the radix-2 butterfly structure, have the highest hardware utilization ratio among these architectures.

    Other Multiple Input Multiple Output (MIMO) approaches proposed in [11, 16-17] are based on different assumptions about the application system. The mixed SDF and MDC architecture proposed in [18] is a very unusual approach. Although we have introduced many kinds of pipelined architectures under different assumptions, they all extend the two basic SDF and MDC pipelined architectures. Next, we discuss these two basic pipelined architectures: the radix-2 butterfly based single-path delay feedback and the radix-2 butterfly based multi-path delay commutator.

    3.1.1 Radix-2 Single-path Delay Feedback (R2SDF)

    Fig. 3.3 R2SDF N=16 (Radix-2 Single-path Delay Feedback)

    The following notation is used: N denotes the size of the FFT, and $n = \log_2 N$ denotes both the number of FFT processing stages and the number of PEs in the pipelined architecture. When N is 16, the R2SDF needs 4 PEs, as shown in Fig. 3.3. The R2SDF consists of radix-2 butterfly modules (BF2), delay elements (DE) and complex multipliers. The delay elements are shift registers operating as first-in first-out (FIFO) buffers, and the number in each block is its delay, i.e., the number of shifts. The number of delay elements at each stage is the key to controlling the butterfly input and the output to the next stage. The input ordering of the data and the sequence of delay element operations guarantee proper pairing of all samples at each stage, so a valid FFT can be performed by arranging the twiddle factors accordingly. The unfolded delay elements of the R2SDF are shown in Fig. 3.4. The radix-2 butterfly module has two modes, operation mode and commutator mode, as shown in Fig. 3.5. Operation mode computes the radix-2 butterfly and commutates the pairs of butterfly results. Commutator mode only commutates pairs of inputs to pairs of outputs.

    Fig. 3.4 Unfolded delay elements of R2SDF

    Operation (O)    Commutator (C)

    Fig. 3.5 Two modes of the radix-2 butterfly module

    When performing an FFT of size N, the first stage of processing combines pairs of samples whose indices are N/2 apart (samples are indexed from 0 to N - 1). The second stage combines pairs whose indices are N/4 apart, and so on. The number of butterfly operations is N/2 at every time stage. The regular relation between the start of the butterfly operation and the delay element length in every time stage is what makes the FFT valid. The 8-point DIF FFT is a convenient example for explaining the R2SDF. The R2SDF needs 3 PEs to implement the 8-point DFT, as shown in Fig. 3.6. In a radix-2 N-point FFT, the twiddle factor of the penultimate stage is always the constant multiplication $W_N^{N/4} = W_8^{2} = W_4^{1} = -j$. The butterfly outputs of the last stage are multiplied by $W_N^{0} = 1$, so the last PE does not need a complex multiplier.
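The behaviour of the delay-feedback stages can be modelled in software. The following is a behavioural sketch only, not cycle-accurate hardware: each stage is a FIFO of length L that alternates between commutator mode and operation mode every L samples, the multiplier sits between stages, and the pipeline latency is hidden by discarding the first L outputs of each stage (a modelling shortcut). The names `sdf_stage`, `r2sdf`, `bit_reverse` and `dft` are hypothetical.

```python
import cmath
from collections import deque


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dft(x):
    """Direct DFT used as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def sdf_stage(stream, L):
    """One BF2 with a length-L feedback FIFO, processing one sample per step."""
    fifo = deque([0j] * L)
    out = []
    for t, x in enumerate(stream):
        head = fifo.popleft()
        if (t // L) % 2 == 0:
            # Commutator mode: store the input, pass the FIFO content on.
            fifo.append(x)
            out.append(head)
        else:
            # Operation mode: the sum goes out, the difference is fed back.
            out.append(head + x)
            fifo.append(head - x)
    return out


def r2sdf(x):
    """Stream one N-point symbol through log2(N) SDF stages (N a power of 2)."""
    stream = list(x)
    L = len(x) // 2
    while L >= 1:
        # Pad with L zeros to flush the FIFO, then drop the L-sample latency.
        stream = sdf_stage(stream + [0j] * L, L)[L:]
        # Multiplier between PEs: only the difference outputs need W_(2L)^m.
        stream = [v * cmath.exp(-2j * cmath.pi * (i % L) / (2 * L))
                  if (i // L) % 2 == 1 else v
                  for i, v in enumerate(stream)]
        L //= 2
    return stream   # position p holds X[bit_reverse(p, log2 N)]
```

In this model the last stage has L = 1, so its twiddle is always 1, consistent with the last PE needing no complex multiplier.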

    Fig. 3.6 R2SDF (N=8)

    The following notation is used: the first input symbol has the input sequence T0~T7, and the next input symbol is denoted T0^1~T7^1. The superscript 1 in 0^1~7^1 denotes the output of the 1st time stage. The superscript 2 in 0^2~7^2 denotes the output of the 2nd stage. The delay element of the first PE queues the input samples T0 to T3; then T0 and T4, the input pair for the first butterfly of the first stage, are used to compute 0^1 and 4^1, the output pair of the first butterfly of the first stage. 0^1 passes to

    Fig. 3.7 R2SDF (N=8) data stream flow

    We can use Fig. 3.8 to observe the control mode of the butterfly module. The squares shown in Fig. 3.8(a) denote that the butterfly module enters commutator mode. The input pairs of samples for the butterfly in each stage are shown in Table 3.4.

    Fig. 4.5 R2²MDC (N=128)

    Fig. 4.6 R2³MDC (N=128)

    Table 4.2: Analyze the number of complex multipliers in MDC with different radix.

    8192

    4 5 6 7" 8 9 10 11

    2957832 328648 36l512.8

    328648197188 8 262918 4 26291824 328648 1 39437? 6

    256 512 1024 2048 4096 8192

    2 4 4 4 6 6 6 8

    ''106551.21'?"2280.81T"2280.8 192691.6 2584212 2584212 1 2T8832 344561.6

    4.1.2 Parametrizable Memory Access

    Because the memory-based FFT architecture uses a single PE to perform the FFT operation, the address width of the memory and ROM changes with the size of the FFT, but the regularity of the memory and ROM address access does not. We exploit this property to realize a parametrizable design that operates on variable DFT sizes.


    To perform an N-point FFT, the ROM must store N/2 words, so the address width of the ROM is $\log_2(N/2)$ bits. Taking the 8-point DFT as an example, the squares in Fig. 4.7 show the value of the twiddle factor in each stage. The twiddle factors required in every time stage, $W_N^{0}$, $W_N^{1}$, $W_N^{2}$ and $W_N^{3}$, are already stored in the ROM. The ROM address step doubles with each additional time stage, which suits hardware implementation because a left shift can be used instead of a multiplication. The regularity of the ROM address access for the 8-point FFT is shown in Table 4.3. If we extend the size of the DFT to 64, the relation between the butterfly count and the ROM address in each time stage is as shown in Fig. 4.8.
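The left-shift addressing described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis RTL: the ROM is assumed to hold $W_N^{k}$ for k = 0 to N/2 - 1, the butterfly counter is assumed to enumerate the N/2 butterflies of a radix-2 DIF stage in natural order, and `twiddle_rom` and `rom_address` are hypothetical names.

```python
import cmath


def twiddle_rom(N):
    """ROM contents: W_N^k for k = 0 .. N/2 - 1 (N/2 words)."""
    return [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]


def rom_address(butterfly_count, stage, N):
    """ROM address of the twiddle used by a butterfly in a given time stage.

    The counter is shifted left by the stage number and truncated to the
    log2(N/2)-bit address width, so no multiplier is needed.
    """
    mask = N // 2 - 1
    return (butterfly_count << stage) & mask


# Address pattern of the 8-point example, one list per time stage.
pattern = [[rom_address(b, s, 8) for b in range(4)] for s in range(3)]
# pattern == [[0, 1, 2, 3], [0, 2, 0, 2], [0, 0, 0, 0]]
```

The same two functions work unchanged for any power-of-2 size, since only the mask width depends on N.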

    Table 4.3: ROM content in each stage for an 8-point FFT.

    Fig. 4.7 Signal data flow of 8-point DIF FFT


    Fig. 4.8 Relation between butterfly counts and ROM addresses in each time stage (butterfly counter [4:0]; time stages 0 to 4)

    In parametrizable memory access, we use the in-place mode to perform FFTs of variable size, as discussed in Section 3.2.2. Fig. 4.9 depicts the architecture of the conflict-free address generator for the radix-r FFT butterfly processor, assuming that the memory has been partitioned into r banks. In the figure, the r barrel shifters associated with the stage counter emulate the right-rotation property of the butterfly unit at different stages. The butterfly counter is designed to complete all butterfly task assignments at the current stage. Finally, the address switching implements equation (4.1) so that the output of each barrel shifter is mapped to the correct memory bank.

    Data_count = [d_{n-1}, d_{n-2}, ..., d_2, d_1, d_0]_r,  n = log_r N

    Bank_index = (d_{n-1} + d_{n-2} + ... + d_2 + d_1 + d_0) mod r    (4.1)

    Data_index = [d_{n-1}, d_{n-2}, ..., d_2, d_1]_r

    Fig. 4.9 Architecture of the address generator
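The bank mapping of equation (4.1) can be tried out with a short Python sketch. It assumes that Data_count is the base-r representation of the data address with d_0 as the least significant digit; the function names are hypothetical. The point of the sketch is that the r operands of one radix-r butterfly differ in exactly one digit, so their digit sums modulo r are all different and each operand lands in its own bank.

```python
def digits(value, radix, n):
    """Return the n base-`radix` digits of value, least significant first."""
    out = []
    for _ in range(n):
        out.append(value % radix)
        value //= radix
    return out


def bank_and_index(data_count, radix, n):
    """Map a data address to (Bank_index, Data_index) as in equation (4.1)."""
    d = digits(data_count, radix, n)
    bank = sum(d) % radix
    index = 0
    for digit in reversed(d[1:]):        # d_{n-1} ... d_1, d_0 is dropped
        index = index * radix + digit
    return bank, index


if __name__ == "__main__":
    # Radix-2, N = 8: addresses 0 and 4 form one butterfly in the first stage.
    print(bank_and_index(0, 2, 3), bank_and_index(4, 2, 3))
```

Dropping d_0 from the in-bank address loses no information, because d_0 can be recovered from the bank index and the remaining digits.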


    4.2 Building Block

    We apply three kinds of IP cores: R23SDF, R23MDC, and memory-based. Their common building blocks are radix-2 butterfly modules and complex multipliers. Our building blocks follow the RTL coding guidelines of SIP.

    The Radix-2 butterfly module (BF2)

    The BF2 includes one complex adder at the top, one complex subtractor at the bottom, and a mode input that selects FFT or IFFT, as shown in Fig. 4.10. When the mode is IFFT, the divide-by-2 path is passed to the output ports; when the mode is FFT, the other path is passed. Assume that the input pair of the BF2 is (a + jb) at the top and (c + jd) at the bottom. After the computation of the radix-2 butterfly, the output pair is

    top output: (a + jb) + (c + jd) = (a + c) + j(b + d),

    bottom output: (a + jb) - (c + jd) = (a - c) + j(b - d).    (4.2)

    Fig. 4.10 Radix-2 butterfly module with mode selection (mode 0: FFT, mode 1: IFFT).
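A behavioural sketch of the BF2 in Python may help to check equation (4.2) by hand. It is not the RTL of the thesis: the function name `bf2` is hypothetical, complex values are modelled as (real, imag) integer pairs, and the divide-by-2 in IFFT mode is assumed to be an arithmetic right shift, which rounds toward minus infinity.

```python
def bf2(top, bottom, ifft=False):
    """Radix-2 butterfly on (real, imag) integer pairs.

    FFT mode passes the sum and difference unchanged; IFFT mode passes the
    divide-by-2 path (assumed here to be an arithmetic right shift).
    """
    a, b = top
    c, d = bottom
    sum_out = (a + c, b + d)         # (a + jb) + (c + jd)
    diff_out = (a - c, b - d)        # (a + jb) - (c + jd)
    if ifft:
        sum_out = (sum_out[0] >> 1, sum_out[1] >> 1)
        diff_out = (diff_out[0] >> 1, diff_out[1] >> 1)
    return sum_out, diff_out


if __name__ == "__main__":
    print(bf2((3, 4), (1, 2)))              # FFT mode
    print(bf2((3, 4), (1, 2), ifft=True))   # IFFT mode
```

Scaling by 2 in each of the log2 N stages gives the overall 1/N factor of the inverse transform, which is presumably why the halving sits inside the butterfly.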

    The multiplier implementation comprises a constant multiplier and a complex multiplier.

    Constant multiplier


    (a+jb)xW;: (a +jb)xW;

    x/EA/Ex/5

    ""7'J7"7(a+jb)>


    Complex multiplier

    Assume that the inputs of the complex multiplier are (X1 + jY1) and (X2 + jY2), where X1, Y1, X2, and Y2 use 2's complement representation. The complex multiplication is arranged into a real part and an imaginary part as follows:

    real = X1X2 - Y1Y2,
    imag = X1Y2 + X2Y1.    (4.4)

    Equation (4.4) contains four real multiplications and two real additions. Because the Verilog hardware description language (HDL) cannot support signed multiplication, unsigned multipliers with 2's complement correction are used, as shown in Fig. 4.13.

    Fig. 4.13 Complex multiplier architecture using unsigned multipliers.
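The idea of building a signed product from an unsigned multiplier can be illustrated with a small Python model. It assumes a sign-magnitude style correction: the operands are converted to magnitudes, multiplied as unsigned numbers, and the product is 2's complemented when the XOR of the two sign bits is 1. This is one plausible reading of Fig. 4.13, not a transcription of it, and all function names are hypothetical.

```python
def to_bits(value, width):
    """Encode a signed integer as a width-bit 2's complement pattern."""
    return value & ((1 << width) - 1)


def from_bits(pattern, width):
    """Decode a width-bit 2's complement pattern into a signed integer."""
    sign = 1 << (width - 1)
    return (pattern ^ sign) - sign


def signed_mul_via_unsigned(x_bits, y_bits, width):
    """Signed multiply built from an unsigned multiplier plus correction."""
    mask = (1 << width) - 1
    sign_x = x_bits >> (width - 1)
    sign_y = y_bits >> (width - 1)
    mag_x = (-x_bits) & mask if sign_x else x_bits
    mag_y = (-y_bits) & mask if sign_y else y_bits
    product = mag_x * mag_y                        # unsigned multiplier
    if sign_x ^ sign_y:                            # signs differ: negate
        product = (-product) & ((1 << (2 * width)) - 1)
    return product                                 # 2*width-bit pattern


def complex_mul(x1, y1, x2, y2, width):
    """(x1 + j*y1) * (x2 + j*y2) following equation (4.4)."""
    def mul(p, q):
        bits = signed_mul_via_unsigned(to_bits(p, width), to_bits(q, width), width)
        return from_bits(bits, 2 * width)
    return mul(x1, x2) - mul(y1, y2), mul(x1, y2) + mul(x2, y1)


if __name__ == "__main__":
    print(complex_mul(3, 4, 1, -2, 10))   # (3 + 4j) * (1 - 2j)
```

The model uses four multiplications and two additions, matching the operation count stated for equation (4.4).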

    With DW02_Mult provided by Synopsys, signed multipliers can be used to implement the complex multiplication, so the 2's complement correction in Fig. 4.13 is omitted. The resulting architecture is depicted in Fig. 4.14.


    Fig. 4.14 Complex multiplier architecture using signed multipliers.

    With the UMC 0.18 µm process, a multiplier word length of 20 bits (10 bits for the real part and 10 bits for the imaginary part), and a clock rate of 60 MHz, the synthesis statistics are as shown in Table 4.4. A complex multiplier built from unsigned multipliers is called technology independent (TI); one built from signed multipliers is called Design Ware (DW). The cost of the constant multiplier is clearly lower than that of the complex multiplier, so replacing a complex multiplier with a constant multiplier where possible is worthwhile. When the user can use DW, the generated design has a lower gate count and a higher speed.

    Table 4.4: Comparison between constant multiplier and complex multiplier of TI and DW (UMC 0.18 µm, word length 20 bits, clock rate 60 MHz)

    Total cell area (µm²):       32864.8   29465.2
    Total dynamic power (mW):    4.3034    3.6627    1.3652


    4.3 FFT/IFFT Compiler Flow

    In the FFT/IFFT compiler flow, users obtain the desired circuit by choosing parameters through the user interface. Table 4.5 lists and describes all parameters of our FFT/IFFT compiler. After the parameters are chosen, the FFT/IFFT compiler automatically generates the design models shown in Fig. 4.15.

    Untimed functional model

    Provides a C simulation model that verifies the operation results of the FFT/IFFT Verilog RTL code. From a common input pattern, the C simulation model generates the golden pattern for the test bench.

    Verilog RTL model

    Provides synthesizable Verilog code of the FFT/IFFT so that users can integrate it into their own designs.

    Test model

    Provides the test bench and test pattern files for simulating and testing the circuit, and uses golden patterns to verify the design through automatic comparison.

    Script model

    Generates the synthesis script file and the testing script file so that users can synthesize the circuit and perform test insertion.

    Bus functional model

    Provides the AMBA AHB interface for testing compatibility with the application system.


    Table 4.5: Parameter information of the FFT/IFFT Compiler

    Parameter                     Description
    Size of FFT/IFFT              Range: 64, 128, ..., 8192
    Vendor-specific directives    Choose vendor-specific directives (Design Ware) or technology independent
                                  0: technology independent
                                  1: vendor-specific directives
    Clock Rate                    FFT/IFFT system clock rate
    Data Width                    Data width of the real and imaginary parts of each complex input

    [Figure: FFT/IFFT compiler parameters: sub-pipe, vendor-specific directive, architecture, clock rate, throughput rate]


    4.4 Specification

    We use the 128-point FFT as an example to show the block diagram, I/O definition, timing diagram, and synthesis results of the R23SDF, R23MDC, and memory-based FFT architectures, and then analyze their suitable applications.

    4.4.1 Specification of the 128-Point R23SDF

    The provided R23SDF architecture is shown in Fig. 4.16. In the figure, the twiddle factors in the smallest rectangles at the penultimate stage of the SFG are

    W,3,W52",W,;",W,;",W5",W,3",W54",W56",n =0,1,...,N/64-1, (4.5)

    based on the relation between the input sequence and the butterfly module in each stage discussed earlier, as depicted in Fig. 4.17.

    Fig. 4.16 SFG of the 128-point FFT and R23SDF block diagram


    Fig. 4.19 128-point R23SDF timing diagram with pipeline (total cycles: N + (N - 1) + 7 stage pipeline registers = 262, numbered 0 to 261).

    In the 128-point FFT, the R23SDF starts exporting the output sequence after (128 - 1 + log2 N) = 134 cycles.

    Fig. 4.20 128-point R23SDF timing diagram with I/O information (signals: clk, start, trans_input, ready, trans_output)


    Table 4.6: I/O ports of R23SDF.

    R23SDF Pipeline FFT

    Port            Description
    start           High level means the circuit is receiving input symbols from the upper system.
    trans_input     Input port receiving input symbols from the upper system.
    mode            Low level: executing the FFT operation. High level: executing the IFFT operation.
    clk             Clock signal
    ready           High level means the output port has valid data.
    trans_output    Output port of computation result

    4.4.2 Specification of the 128-Point R23MDC

    The block diagram of the R23MDC implementation is shown in Fig. 4.21 and its timing diagram in Fig. 4.22. The figures show the regularity of the control circuit in every time stage. As with the R23SDF, the output latency of the R23MDC increases when pipelined registers are inserted. The pipelined registers in each stage and the control circuit of the pipelined architecture are depicted in Fig. 4.23, and the corresponding timing diagram in Fig. 4.24. The timing diagram of the input/output (I/O) ports is shown in Fig. 4.25, and the I/O signals are described in Table 4.7.

    Fig. 4.21 Block diagram of the 128-point R23MDC


    Fig. 4.24 Timing diagram of the 128-point R23MDC with pipelined registers (total cycles: N - 1 + N/2 + log2 N = 198, numbered 0 to 197)

    In the 128-point DFT, the R23MDC starts exporting the output sequence after (N - 1 + log2 N) = 134 cycles. Because the output port of the R23MDC is multi-path, exporting the result data completely takes N/2 = 64 cycles.
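The cycle counts quoted for the two pipelined architectures follow from simple formulas, which the Python sketch below evaluates for N = 128. It assumes one pipeline register per radix-2 stage (log2 N registers in total), as the figures suggest; the function names are hypothetical.

```python
def stage_count(n):
    """Number of radix-2 stages, log2(N), for a power-of-two N."""
    return n.bit_length() - 1


def first_output_cycle(n):
    """Cycles before the first output: N - 1 delay elements plus one
    pipeline register per stage (same expression for R23SDF and R23MDC)."""
    return (n - 1) + stage_count(n)


def sdf_total_cycles(n):
    """R23SDF: N input cycles, N - 1 delay cycles, log2(N) pipeline cycles."""
    return n + (n - 1) + stage_count(n)


def mdc_total_cycles(n):
    """R23MDC: N - 1 + log2(N) cycles of latency, then N/2 output cycles."""
    return (n - 1) + n // 2 + stage_count(n)


if __name__ == "__main__":
    n = 128
    print(first_output_cycle(n), sdf_total_cycles(n), mdc_total_cycles(n))
```

For N = 128 the sketch gives 134, 262, and 198 cycles, matching the numbers in the text and in the timing diagrams.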

    Fig. 4.25 128-point R23MDC timing diagram with I/O information (outputs trans_output1 and trans_output2 are exported over 64 cycles)


    Table 4.7: I/O ports of R23MDC

    R23MDC Pipeline FFT

    Port             Description
    start            High level means the circuit is receiving input symbols from the upper system.
    trans_input      Input port receiving input symbols from the upper system.
    mode             Low level: executing the FFT operation. High level: executing the IFFT operation.
    rst_p            Asynchronous reset signal, positive edge triggered.
    clk              Clock signal
    ready            High level means the output port has valid data.
    trans_output1    Output port 1 of computation result
    trans_output2    Output port 2 of computation result

    4.4.3 Specification of the Memory-Based FFT Architecture

    ✓ Block diagram

    [Block diagram figure: process element, MEMIn, ROM, MEMOut; signals clk, rst_p, mode, trans_output]


    ✓ Timing diagram

    The reset signal rst_p must first be set high to initialize the memory-based architecture. Then, the two dual-port memories begin receiving the input data once the primary input start is driven high. This signal remains high until all 128 data have been input and is then pulled low. When all 448 (128/2 × log2 128) butterflies have completed their operations, the output trans_output starts delivering the computed results; at the same time, the output signal ready is set high to tell the surrounding circuits that the current output data are valid. Finally, ready is pulled low after all 128 data have been output.

    Fig. 4.27 Timing diagram of a 128-point memory-based architecture.

    ✓ I/O Definition

    Table 4.8: I/O ports of memory-based FFT architecture.

    Memory-based FFT architecture

    Port            Description
    start           High level means the circuit is receiving input symbols from the upper system.
    rst_p           Asynchronous reset signal, positive edge triggered.
    clk             Clock signal
    ready           High level means that the output port has valid data.
    trans_output    Output port of computation result
    WEN1
    WEN2
    OEN

    4.4.4 Synthesis Result

    Technology file: Artisan UMC 0.18 µm 1P6M cell library

    Word length: 20 bits (10 bits for the real part and 10 bits for the imaginary part)

    ✓ R23SDF

    Table 4.9 lists the gate counts of the different approaches. The original design without any optional feature, such as extra pipeline stages or the Design Ware complex multiplier, has the lowest gate count when the clock rate is below 60 MHz. When the clock rate exceeds 70 MHz, option 3 of sub-pipelined insertion has the lowest gate count and remains workable up to 130 MHz. In addition, as the clock rate increases, the area with option 3 of sub-pipelined insertion is smaller than the original


    Table 4.9: Gate count of the R23SDF

    Table 4.10: Power consumption of the R23SDF at 128-point FFT (mW).

    16.3537 16.5821 17.4383 17.6724 17.6523

    23.3062 23.3919 24.5416 24.7647 24.7128

    30.580630.4169828.180131.935431.8581

    39.4032 39.3024 39.0054

    46.2579

    ✓ R23MDC

    Following the above analysis, we tried a subset of cases to check quickly whether the R23MDC has the same property as the R23SDF. According to Table 4.11, the R23MDC shows the same characteristic as the R23SDF when option 3 of sub-pipelined insertion is chosen. However, its gate count is larger than that of the R23SDF, and its power consumption, shown in Table 4.12, is also higher. Nevertheless, the R23MDC is not without advantage: some applications can use the idle N/2 cycles for other tasks, such as converting the bit-reversed output sequence to normal order.

    Table 4.11: Gate count of the R23MDC at 128-point FFT/IFFT.

    Table 4.12: Power consumption of the R23MDC at 128-point FFT (mW).

    33.5921 36.5175 36.2135

    58.5101

    46.8579


    ✓ Memory-based architecture

    We try to find the fastest clock rate for this architecture. The fastest clock rate is 80 MHz when the technology-independent complex multiplier is used. It increases to 100 MHz with the Design Ware complex multiplier, which also reduces the gate count.

    Table 4.13: Power consumption of the R23SDF at 128-point FFT (mW).

    using technology independent (@ 80 MHz)

    using Design Ware (@ 100 MHz)

    4.4.5 Analysis of Suitable Applications

    In an N-point FFT implemented with a radix-r PE, where r is a power of 2, the number of butterflies per stage is N/r and the number of stages is log_r N. Under this condition, the execution flows of the provided pipelined architecture and memory-based architecture are described in Figs. 4.28 and 4.29, respectively. For the provided pipelined architecture, the throughput rate equals the clock rate because the architecture operates in real time without stalling. In contrast, the throughput rate of the memory-based architecture differs from the clock rate because input, computation, and output take place in separate phases. In this case, the relation between the throughput rate and the clock rate is

    Throughput rate = N / (2N + (N/r) log_r N) × Clock Rate = Clock Rate / (2 + (log_r N)/r).    (4.7)


    Fig. 4.29 Execution flow of the provided memory-based FFT architecture

    According to the synthesis results given above, the maximum operating frequency of our memory-based architecture is 100 MHz. Assuming a 64-point FFT and an operating frequency of 100 MHz, the throughput rate is

    Clock Rate / (2 + (log_r N)/r) = 100 MHz / (2 + (log2 64)/2) = 20 Mbps.    (4.8)

    In the same way, for an 8192-point FFT, the throughput rate becomes

    Clock Rate / (2 + (log_r N)/r) = 100 MHz / (2 + (log2 8192)/2) = 11.76 Mbps.    (4.9)
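The throughput relation of the memory-based architecture is easy to evaluate for other FFT sizes. The Python sketch below implements Clock Rate / (2 + (log_r N)/r) with the radix as a parameter; the function names are hypothetical, and the default radix of 2 is an assumption taken from the two worked examples.

```python
def stage_count(n, radix):
    """Number of stages log_r(N), computed with integer arithmetic."""
    stages = 0
    while n > 1:
        n //= radix
        stages += 1
    return stages


def memory_based_throughput(clock_rate_mhz, n, radix=2):
    """Throughput rate of the memory-based architecture.

    Per symbol the architecture spends 2N cycles on input and output plus
    (N / r) * log_r(N) cycles on butterflies, so the clock rate is divided
    by 2 + log_r(N) / r.
    """
    return clock_rate_mhz / (2 + stage_count(n, radix) / radix)


if __name__ == "__main__":
    for size in (64, 128, 1024, 8192):
        print(size, round(memory_based_throughput(100, size), 2))
```

At 100 MHz the sketch reproduces the two values in the text: 20 for the 64-point FFT and about 11.76 for the 8192-point FFT.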

    2

    As seen from the specifications of the associated OFDM-based communication systems given in Tables 4.14 and 4.15, most applications except UWB can be realized with our architectures. However, an FFT with a very large number of points requires a longer word length for sufficient precision. In that case, the proposed architectures may not be able to operate at the highest frequency, 100 MHz. Determining the maximum operating frequency for each FFT size requires further experiments, which we have not yet completed.

    Table 4.14: FFT/IFFT size for OFDM-based communication system

    DVB-T ~ DAB ~ VDSL

    system

    64 . 20

    2 x 256 2.22

    2.22*22x256x2,n=0,...,4 23 1

    256x2,n=0,...,3 8.26

    8192/2048 896/224 9.14/9.14

    128 0.24242 528


    Chapter 5

    Verification and Performance

    In this chapter, we discuss the possibility of finding a cost function, which would enhance the capability of the FFT/IFFT compiler, and describe how to construct a C simulation model for verifying the proposed design. Finally, a verification plan and a comparison with other works are given.

    5.1 Cost Function and Derivation

    A good cost function is built from the statistics of the power consumption and area of all proposed architectures, evaluated for given parameters such as the FFT size, the clock rate, and the throughput rate. After analyzing these statistics, the FFT/IFFT compiler can indicate which architecture is the first choice under the given parameters.

    According to our analysis results, the FFT/IFFT compiler automatically chooses, among the proposed architectures, the one with the lowest gate count under the specified throughput rate and clock rate. As discussed previously, the pipelined architecture and the memory-based architecture are subject to different considerations for a required throughput rate. We can therefore arrange for the FFT/IFFT compiler to choose the provided architecture with the lowest gate count according to the throughput rate. However, when the FFT size or the word length changes, the critical path of the proposed architectures changes, and so do the ranges in Fig. 5.1.


    Fig. 5.1 Choosing an architecture based on the specified throughput rate (using Design Ware; throughput-rate axis marked at 20, 60, 70, and 140 Mbps; regions labelled memory-based, R23SDF without sub-pipe, R23SDF with sub-pipe option 3, and R23SDF with sub-pipe option 1).

    5.2 C Simulation Model

    Constructing a C simulation model yields intermediate simulation values, which are useful for debugging during chip implementation. Another purpose of the C simulation model is to generate golden patterns for verifying the proposed design. Next, we discuss how to construct the C simulation model from the FFT algorithm.

    From the discussion of the algorithms in the previous chapters, the regularity of the FFT algorithm is known: the twiddle factors vary in each time stage according to the radix of the FFT algorithm. We use the radix-8 algorithm to explain how to construct a C simulation model. All operations must be carried out in fixed-point arithmetic so that the simulation result matches the hardware result. We illustrate the construction flow of the C simulation model using the example shown in Fig. 5.2.

    Step 1: Perform all butterflies of one stage first, then save the results in place.

    Step 2: Multiply the butterfly outputs by the twiddle factors, and export a file if the output of every stage needs to be traced.

    Step 3: Return to Step 1 until all time stages have been processed, then export the result file, which is the golden pattern.
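The three steps can be sketched as a small fixed-point model. The sketch below is written in Python rather than C and uses a radix-2 DIF FFT instead of the radix-8 algorithm of the thesis, purely to keep it short; the twiddle width `FRAC_BITS`, the truncating right shift after each multiplication, and all function names are assumptions, not details taken from the source.

```python
import cmath

FRAC_BITS = 14  # assumed number of fractional bits of the twiddle factors


def twiddle(k, n):
    """W_n^k quantized to FRAC_BITS fractional bits, as (real, imag) ints."""
    w = cmath.exp(-2j * cmath.pi * k / n)
    scale = 1 << FRAC_BITS
    return round(w.real * scale), round(w.imag * scale)


def fixed_point_fft(samples):
    """In-place radix-2 DIF FFT on (real, imag) integer pairs.

    Returns the final data (in bit-reversed order) and every stage output.
    """
    n = len(samples)
    data = list(samples)
    stage_outputs = []
    half = n // 2
    while half >= 1:
        # Step 1: all butterflies of this stage, results stored in place.
        for start in range(0, n, 2 * half):
            for k in range(half):
                i, j = start + k, start + k + half
                (ar, ai), (br, bi) = data[i], data[j]
                data[i] = (ar + br, ai + bi)
                data[j] = (ar - br, ai - bi)
        # Step 2: twiddle multiplication on the lower butterfly outputs.
        for start in range(0, n, 2 * half):
            for k in range(half):
                j = start + k + half
                wr, wi = twiddle(k * (n // (2 * half)), n)
                dr, di = data[j]
                data[j] = ((dr * wr - di * wi) >> FRAC_BITS,
                           (dr * wi + di * wr) >> FRAC_BITS)
        stage_outputs.append(list(data))   # per-stage trace for debugging
        # Step 3: continue with the next time stage.
        half //= 2
    return data, stage_outputs


if __name__ == "__main__":
    impulse = [(1000, 0)] + [(0, 0)] * 7
    result, trace = fixed_point_fft(impulse)
    print(result)
```

The list returned as `stage_outputs` plays the role of the optional per-stage files in Step 2, and the final data corresponds to the golden pattern of Step 3.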

    Fig. 5.2 Radix-8 example of the C simulation flow (Step 1, Step 2)


    from bit-reversed order to normal order before injecting them into the FFT. Based on the above verification plan, we can ensure that our design is correct.
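The reordering from bit-reversed order to normal order mentioned above is a simple index permutation. A minimal Python sketch, with hypothetical function names and assuming a power-of-two sequence length:

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def to_normal_order(sequence):
    """Reorder a bit-reversed sequence (power-of-two length) to normal order."""
    n = len(sequence)
    bits = n.bit_length() - 1
    return [sequence[bit_reverse(i, bits)] for i in range(n)]


if __name__ == "__main__":
    print(to_normal_order(list(range(8))))
```

Because bit reversal is its own inverse, the same function also converts a normal-order sequence into bit-reversed order.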

    Fig. 5.3 Verification plan (blocks: golden pattern, FFT RTL code)

    5.4 Performance Evaluation

    With the parametrizable control, different simulation results can be obtained by changing the input data width. As mentioned in the last section, the verification plan is completed by 1) injecting the test patterns from the pattern generator into the IFFT, and 2) applying the computed results to the FFT. The signal-to-quantization-noise ratio (SNR) can therefore be obtained by analyzing the input sequence of the IFFT and the resulting output sequence of the FFT. In our design, we assume that the input, the output, and the twiddle factors have the same data width. For simplicity, "data width" hereafter denotes the data width of these signals. Fig. 5.4 depicts the relation between the SNR and the data width for the 128-point FFT. The SNR is higher than 30 dB when the data width is larger than 11×2 bits, and higher than 40 dB when it is larger than 15×2 bits.
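One common way to compute such an SNR figure is to treat the IFFT input as the signal and its difference from the FFT output as the noise. The Python sketch below follows that definition; the thesis does not spell out its exact formula, so both the definition and the function name `sqnr_db` are assumptions.

```python
import math


def sqnr_db(reference, measured):
    """Signal-to-quantization-noise ratio in dB.

    reference: complex samples fed into the IFFT.
    measured:  complex samples recovered at the FFT output.
    """
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    if noise == 0:
        return math.inf                  # no quantization error at all
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    sent = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
    received = [s + 0.01 for s in sent]  # small constant error as a stand-in
    print(round(sqnr_db(sent, received), 2))
```

Sweeping the data width in a fixed-point model and plotting this value would give a curve of the kind shown in Fig. 5.4.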

    Fig. 5.4 SNR curve in the 128-point FFT (horizontal axis: data width)

    Table 5.1 lists the synthesis results of our proposed R23SDF architecture and another work [25]. For a data width of 20 bits, both the area and the power consumption of our proposed architecture are lower than those in [25].

    Table 5.1: ASIC synthesis result at clock frequency of 132MHz.

    proposed


    Chapter 6

    Conclusions and Future Work

    6.1 Conclusions

    We present an efficient FFT/IFFT compiler that consists of three IP cores: the R23SDF, R23MDC, and memory-based FFT architectures. The inputs to our generator are a set of user-defined parameters. According to the input constraints provided by the user, our generator takes into account the trade-off between hardware overhead and speed requirement and outputs suitable RTL code for the user's reference. Based on our development, not only can a dedicated FFT/IFFT module be easily prototyped for fast system verification, but the resulting compiler can also be used as a basis for more advanced research in the