etd_0728106_120055
TRANSCRIPT
-
7/27/2019 etd_0728106_120055
1/113
AN EFFICIENT FFT/IFFT COMPILER FOR
DIFFERENT APPLICATIONS
Department of Electrical Engineering
National Cheng Kung University
Tainan, Taiwan, R.O.C.
Thesis for Master of Science
July, 2006
An Efficient FFT/IFFT Compiler for
Different Applications
by
Sheng-Hsien Huang
A thesis submitted to the graduate division in partial
fulfillment of the requirement for the degree of Master of Science
at
National Cheng Kung University
Tainan, Taiwan, Republic of China
July 2006
Approved by 1
An Efficient FFT/IFFT Compiler for
Different Applications
Sheng-Hsien Huang¹, Ming-Der Shieh²
Department of Electrical Engineering
National Cheng Kung University
Tainan, Taiwan, Republic of China
ABSTRACT
With the emergence of Internet services and the spread of communication
applications, the Internet and wireless communication have become part of everyday
life. Wireless networks can reach places that traditional wired networks cannot, which
makes their applications more widespread. To date, different wireless LAN standards
have emerged for different applications, and each standard has its own characteristics
and application range.
In existing digital communication systems, the Orthogonal Frequency Division
Multiplexing (OFDM) technique has been widely used for signal modulation. In
contrast to the traditional single-carrier modulation technique, OFDM uses
multi-carrier modulation, which has been adopted in many practical wireless systems
such as 802.11x, DAB, and DVB. In hardware, OFDM can be realized with the (inverse)
fast Fourier transform (FFT/IFFT). Different standards or operation modes imply
different requirements or specifications for the associated FFT/IFFT; therefore,
different design methodologies should be applied, and designing a unified FFT/IFFT
architecture is a challenge. In this thesis, we investigate how to develop an efficient
FFT/IFFT compiler for different applications. With this compiler, a dedicated
FFT/IFFT module can be easily prototyped for fast system verification, and the
compiler can also serve as a basis for more advanced research in the future.
1 The Author
2 The Advisor
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
1.1 FFT Overview
1.2 Motivation
1.3 Organization of this Thesis
Chapter 2 FFT Algorithm
2.1 General Algorithm
2.1.1 Decimation-In-Time (DIT) FFT Algorithms
2.1.2 Decimation-In-Frequency (DIF) FFT Algorithms
2.2 High-Radix Algorithm
2.2.1 Radix-4 DIF FFT Algorithm
2.2.2 Radix-8 DIF FFT Algorithm
2.3 Split-Radix DIF FFT Algorithms
2.3.1 Radix-2/4 DIF FFT Algorithm
2.3.2 Radix-2/8 DIF FFT Algorithm
2.4 Complexity Analysis
Chapter 3 FFT/IFFT Architecture
3.1 Pipelined FFT Architecture
3.1.1 Radix-2 Single-path Delay Feedback (R2SDF)
3.1.2 Radix-2 Multi-path Delay Commutator (R2MDC)
3.2 Memory-Based FFT Architecture
3.2.1 Ping-Pong Mode of the Memory Management Strategy
3.2.2 In-Place Mode of the Memory Management Strategy
3.3 A Unified FFT/IFFT Architecture
Chapter 4 FFT/IFFT Compiler Design
4.1 Implementation Strategy
4.1.1 Parametrizable Architecture
4.1.2 Parametrizable Memory Access
4.2 Building Block
4.3 FFT/IFFT Compiler Flow
4.4 Specification
4.4.1 Specification of the 128-Point R2^3SDF
4.4.2 Specification of the 128-Point R2^3MDC
4.4.3 Specification of the 128-Point Memory-Based FFT Architecture
4.4.4 Synthesis Result
4.4.5 Analysis of Suitable Applications
Chapter 5 Verification and Performance
5.1 Cost Function and Derivation
5.2 C Simulation Model
5.3 Verification Plan
5.4 Performance Evaluation
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References
LIST OF TABLES
Table 1.1: FFT/IFFT size for OFDM-based communication system
Table 2.1: Complexity analysis of twiddle factor for radix-4 DIF FFT algorithm
Table 2.2: Complexity analysis of twiddle factor for radix-8 DIF FFT algorithm
Table 2.3: Complex multiplications required for radix-2, radix-4 and radix-2/4 FFT algorithms
Table 2.4: Complex multiplications required for radix-8 and radix-2/8 FFT
Table 3.1: Comparison of single butterfly and fully spread architecture
Table 3.2: Analysis of different pipelined architectures using different radix
Table 3.3: Comparison of hardware utilization
Table 3.4: The pairs of butterfly inputs at each time stage
Table 3.5: The relation of data accesses at different time stages of 8-point radix-2 FFT
Table 4.1: Analysis of the number of complex multipliers in SDF with different radix
Table 4.2: Analysis of the number of complex multipliers in MDC with different radix
Table 4.3: ROM content in each stage for an 8-point FFT
Table 4.4: Comparison between constant multiplier and complex multiplier of T I and DW
Table 4.5: Parameter information of the FFT/IFFT Compiler
Table 4.6: I/O ports of R2^3SDF
Table 4.7: I/O ports of R2^3MDC
Table 4.8: I/O ports of memory-based FFT architecture
Table 4.9: Gate count of the R2^3SDF at 128-point FFT/IFFT
Table 4.10: Power consumption of the R2^3SDF at 128-point FFT (mW)
Table 4.11: Gate count of the R2^3MDC at 128-point FFT/IFFT
Table 4.12: Power consumption of the R2^3MDC at 128-point FFT (mW)
Table 4.13: Power consumption of the R2^3SDF at 128-point FFT (mW)
Table 4.14: FFT/IFFT size for OFDM-based communication system
Table 4.15: FFT/IFFT size and throughput rate for OFDM-based communication system
Table 5.1: ASIC synthesis result at clock frequency of 132 MHz
LIST OF FIGURES
Fig. 1.1 The twiddle factor $W_N^{nk}$ of FFT in the unit circle
Fig. 1.2 OFDM transceiver block diagram
Fig. 2.1 Signal flow graph of the decimation-in-time decomposition of an N-point DFT (N = 8) computation into two (N/2)-point DFT computations
Fig. 2.2 Signal flow graph of the decimation-in-time decomposition of two (N/2)-point DFT (N = 8) computations into four (N/4)-point DFT computations
Fig. 2.3 Signal flow graph of a 2-point DFT
Fig. 2.4 Signal flow graph of an 8-point DIT FFT
Fig. 2.5 Signal flow graph of radix-2 butterfly computation in Fig. 2.4
Fig. 2.6 Signal flow graph of simplified butterfly computation with only one complex multiplication
Fig. 2.7 Signal flow graph of 8-point DFT using the butterfly computation of Fig. 2.6
Fig. 2.8 Signal flow graph of DIF decomposition of an N-point DFT computation into two (N/2)-point DFT computations (N = 8)
Fig. 2.9 Signal flow graph of decimation-in-frequency decomposition of an 8-point DFT computation into four 2-point DFT computations
Fig. 2.10 Signal flow graph of a typical 2-point DFT at the last stage decomposition
Fig. 2.11 Signal flow graph of complete DIF decomposition of an 8-point DFT computation
Fig. 2.12 Signal flow graph of complete DIF decomposition of a 16-point DFT computation
Fig. 2.13 Radix-4 butterfly
Fig. 2.14 16-point radix-4 DIF FFT
Fig. 2.15 Radix-8 butterfly
Fig. 2.16 Signal flow graph of 16-point radix-8 DIF FFT
Fig. 2.17 Radix-2/4 butterfly
Fig. 2.18 A sketch map of 16-point radix-2/4 DIF FFT
Fig. 2.19 Signal flow graph of 16-point radix-2/4 DIF FFT
Fig. 2.20 Radix-2/8 butterfly
Fig. 2.21 A sketch map of 16-point radix-2/8 DIF FFT
Fig. 2.22 Signal flow graph of 16-point radix-2/8 DIF FFT
Fig. 2.23 Signal flow graph of 32-point radix-4 DIF FFT based on radix-2 decomposition in the first stage and two radix-4 stages in the next stage
Fig. 2.24 Signal flow graph of 32-point radix-4 DIF FFT based on two radix-4 stages and one radix-2 stage
Fig. 3.1 Two extreme methods of implementing the FFT algorithm
Fig. 3.2 Basic framework of pipelined FFT
Fig. 3.3 R2SDF N=16 (Radix-2 Single-path Delay Feedback)
Fig. 3.4 Unfolded delay elements of R2SDF
Fig. 3.5 Two modes of the radix-2 butterfly module
Fig. 3.6 R2SDF (N=8)
Fig. 3.7 R2SDF (N=8) data stream flow
Fig. 3.8 (a) Relation between delay elements and butterfly operation modes in each stage, (b) control of twiddle factor in each stage
Fig. 3.9 R2MDC N=16 (Radix-2 Multi-path Delay Commutator)
Fig. 3.10 Unfolded delay elements of R2MDC
Fig. 3.11 Radix-2 butterfly module
Fig. 3.12 Two modes of the radix-2 commutator module
Fig. 3.13 R2MDC (N=8)
Fig. 3.14 R2MDC (N=8) data stream flow
Fig. 3.15 (a) Relation between delay elements and commutator modes in each stage, (b) control of twiddle factor in each stage
Fig. 3.16 Total DE numbers of R2MDC
Fig. 3.17 Operation flow of single PE
Fig. 3.18 Single PE FFT processor diagram
Fig. 3.19 Ping-pong mode architecture
Fig. 3.20 Partial data processing flow of 8-point FFT
Fig. 3.21 The conflict graph and memory partition; (a) the colored conflict graph based on the radix-2 butterfly unit, (b) the 2-bank memory arrangement
Fig. 4.1 Relationship between the PE number and different radix algorithm
Fig. 4.2 R2^3SDF PE in radix-8 butterfly
Fig. 4.3 Different radix SDF architectures for performing 16-point DFT
Fig. 4.4 R2MDC N=128
Fig. 4.5 R2^2MDC N=128
Fig. 4.6 R2^3MDC N=128
Fig. 4.7 Signal data flow of 8-point DIF FFT
Fig. 4.8 Relation between butterfly counts and ROM addresses in each time stage
Fig. 4.9 Architecture of the address generator
Fig. 4.10 Radix-2 butterfly module with mode selection
Fig. 4.11 A simplified complex multiplication with $W_8^1$
Fig. 4.12 Real multiplication without multipliers
Fig. 4.13 Complex multiplier architecture using unsigned multipliers
Fig. 4.14 Complex multiplier architecture using signed multipliers
Fig. 4.15 Provided design model
Fig. 4.16 SDF of 128-point and R2^3SDF block diagram
Fig. 4.17 128-point R2^3SDF timing diagram
Fig. 4.18 128-point R2^3SDF block diagram with pipelined register and control circuit
Fig. 4.19 128-point R2^3SDF timing diagram with pipeline
Fig. 4.20 128-point R2^3SDF timing diagram with I/O information
Fig. 4.21 Block diagram of 128-point R2^3MDC
Fig. 4.22 Timing diagram of 128-point R2^3MDC
Fig. 4.23 Block diagram of 128-point R2^3MDC with pipelined and control circuit
Fig. 4.24 Timing diagram of 128-point R2^3MDC with pipelined register
Fig. 4.25 128-point R2^3MDC timing diagram with I/O information
Fig. 4.26 Block diagram of 128-point memory-based architecture
Fig. 4.27 Timing diagram of a 128-point memory-based architecture
Fig. 4.28 Execution flow of the provided pipelined FFT architecture
Fig. 4.29 Execution flow of the provided memory-based FFT architecture
Fig. 5.1 Choosing an architecture based on the specified throughput rate
Fig. 5.2 Radix-8 butterfly
Fig. 5.3 Verification plan
Fig. 5.4 SNR curve in the 128-point FFT
Chapter 1
Introduction
The Discrete Fourier Transform (DFT) plays a key role in digital signal
processing in areas such as radar processing, spectral analysis, frequency-domain
filtering, and polyphase transformations, and it is an important component in many
practical applications of discrete-time systems. The possibility of greatly reduced
computation was generally overlooked until 1965, when Cooley and Tukey published
an algorithm [1] for computing the DFT that is applicable when N is a composite
number. Their paper touched off a flurry of activity in the application of the DFT to
signal processing and resulted in the discovery of a number of highly efficient
computational algorithms. Collectively, the entire set of such algorithms has come to
be known as the Fast Fourier Transform, or the FFT.
1.1 FFT Overview
The DFT of a finite-length sequence of length N is
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1.1}$$
where $W_N^{nk} = e^{-j(2\pi/N)nk}$. The inverse discrete Fourier transform is given by
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{1.2}$$
Computing all N values of the N-point DFT directly therefore requires a total of
$N^2$ complex multiplications and $N(N-1)$ complex additions.
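As an illustration of equations (1.1) and (1.2), the following is a minimal Python sketch of the direct evaluation. It is not taken from the thesis; the names `twiddle`, `dft`, and `idft` and the test sequence are assumptions chosen for this example.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    """Direct evaluation of equation (1.1): N*N complex multiplications."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

def idft(X):
    """Direct evaluation of equation (1.2), including the 1/N scale factor."""
    N = len(X)
    return [sum(X[k] * twiddle(N, -n * k) for k in range(N)) / N for n in range(N)]

# Hypothetical 8-point test sequence.
x = [1, 2, 3, 4, 0, -1, -2, -3]
X = dft(x)
x_back = idft(X)

# Round-trip error between the original sequence and idft(dft(x)).
max_err = max(abs(a - b) for a, b in zip(x, x_back))
```

Each of the N outputs is a sum of N products, which is where the $N^2$ multiplication count comes from.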
Most approaches to improving the efficiency of the computation of the DFT
employ the symmetry, periodicity, compressibility and expansibility properties of
$W_N^{nk}$, as listed below.
1. $W_N^{-nk} = (W_N^{nk})^* = W_N^{(N-n)k}$ (symmetry)
2. $W_N^{nk} = W_N^{(n+N)k} = W_N^{n(k+N)}$ (periodicity)
3. $W_N^{nk} = W_{mN}^{mnk} = W_{N/m}^{nk/m}$ (compressibility and expansibility)
The value of the twiddle factor $W_N^{nk}$ can be read conveniently from Fig. 1.1.
Fig. 1.1 The twiddle factor $W_N^{nk} = e^{-j(2\pi/N)nk}$ of FFT in the unit circle.
From Fig. 1.1, we can find the symmetry of the twiddle factor as
$$W_N^{nk} = -W_N^{nk+N/2},$$
$$W_N^{nk+N/4} = -W_N^{nk+3N/4} = -j\,W_N^{nk},$$
$$W_8^1 = -W_8^5 = \frac{\sqrt{2}}{2}(1 - j), \quad \text{and}$$
$$W_8^3 = -W_8^7 = -\frac{\sqrt{2}}{2}(1 + j).$$
7/27/2019 etd_0728106_120055
18/113
-
7/27/2019 etd_0728106_120055
19/113
Fig. 1.2 OFDM transceiver block diagram
Different wireless communication standards impose different specifications on
their target applications. Moreover, even a single digital communication system may
have several operation modes. Table 1.1 lists the FFT/IFFT sizes for several existing
communication systems. As the table shows, designing a unified FFT/IFFT
architecture is a challenge. In this thesis, we investigate how to develop an efficient
FFT/IFFT compiler for different applications. With this compiler, a dedicated
FFT/IFFT module can be easily prototyped for fast system verification, and the
compiler can also serve as a basis for more advanced research in the future.
Table 1.1: FFT/IFFT size for OFDM-based communication system
FFT/IFFT size    Communication system
8192             DVB-T, VDSL
4096             DVB-H, VDSL
2048             DVB-T, DAB, VDSL
1024             DAB, VDSL
512
1.3 Organization of this Thesis
This thesis is organized as follows:
• Chapter 1 introduces the FFT and the motivation.
• Chapter 2 reviews the general, high-radix, and split-radix FFT algorithms,
and discusses their different objectives.
• Chapter 3 discusses different implementations of the FFT algorithm.
• Chapter 4 shows the methodology of the FFT/IFFT compiler design.
• Chapter 5 describes the cost function, verification plan and performance
evaluation.
• Chapter 6 presents concluding remarks.
Chapter 2
FFT Algorithm
The Cooley-Tukey FFT algorithm is very popular because it reduces the
computational complexity from $O(N^2)$ to $O(N \log_2 N)$, and the regularity of the
algorithm makes it suitable for VLSI implementation. To further reduce the
computational complexity, high-radix and split-radix versions have been proposed. In
general, all of these algorithms recursively decompose a length-N ($N = 2^v$) FFT into
an odd half and an even half, and effectively reduce the number of complex
multiplications by using symmetric properties of the FFT kernel. The high-radix FFT
algorithms such as radix-4 and radix-8 [2] substantially reduce the number of
arithmetic operations and data transfers compared with the general FFT algorithm [3].
The split-radix FFT algorithms such as radix-2/4 [4] and radix-2/8 [5] are the best in
terms of multiplicative complexity for an N-point FFT when the multiplications by
±1 and ±j are skipped, but they are inherently irregular because radix-2 stages are
used for the even-half components and radix-4 or radix-8 stages are used for the
odd-half components, which results in an L-shaped butterfly unit. Due to this
irregularity of the butterfly unit, it is hard to design regular and modular pipelined
hardware for the split-radix algorithm.
2.1 General Algorithm
The DFT of a finite-length sequence of length N is
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{2.1}$$
where $W_N^{nk} = e^{-j(2\pi/N)nk}$. The inverse discrete Fourier transform is given by
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{2.2}$$
In equations (2.1) and (2.2), both X[k] and x[n] may be complex. The expressions
on the right-hand sides of those equations differ only in the sign of the exponent of
$W_N$ and in the scale factor 1/N.
In computing the DFT, dramatic efficiency results from decomposing the
computation into successively smaller DFT computations. We employ the symmetry,
periodicity, compressibility and expansibility of the complex exponential
$W_N^{nk} = e^{-j(2\pi/N)nk}$. Algorithms in which the decomposition is based on
decomposing the input sequence x[n] into successively smaller subsequences are
called decimation-in-time FFT algorithms. Alternatively, we can divide the output
sequence X[k] into smaller and smaller subsequences in the same manner. FFT
algorithms based on this procedure are commonly called decimation-in-frequency FFT
algorithms.
2.1.1 Decimation-In-Time (DIT) FFT Algorithms
The principle of the DIT FFT algorithm is most conveniently illustrated by
considering the special case of N an integer power of 2, such as $N = 2^v$. Since N is
an even integer, we can consider computing X[k] by separating x[n] into two
(N/2)-point sequences consisting of the even-numbered points in x[n] and the
odd-numbered points in x[n]. With X[k] given by equation (2.1) and separating x[n]
into its even- and odd-numbered points, we obtain
$$X[k] = \sum_{n\ \mathrm{even}} x[n]\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x[n]\, W_N^{nk}, \tag{2.3}$$
or, with the substitution of variables n = 2r for the even part and n = 2r + 1 for the
odd part,
$$X[k] = \sum_{r=0}^{(N/2)-1} x[2r]\, W_N^{2rk} + \sum_{r=0}^{(N/2)-1} x[2r+1]\, W_N^{(2r+1)k}
= \sum_{r=0}^{(N/2)-1} x[2r]\, (W_N^2)^{rk} + W_N^k \sum_{r=0}^{(N/2)-1} x[2r+1]\, (W_N^2)^{rk}. \tag{2.4}$$
The identity $W_N^2 = W_{N/2}$ follows from the compressibility property, since
$$W_N^2 = e^{-2j(2\pi/N)} = e^{-j2\pi/(N/2)} = W_{N/2}. \tag{2.5}$$
Consequently, equation (2.4) can be rewritten as
$$X[k] = \sum_{r=0}^{(N/2)-1} x[2r]\, W_{N/2}^{rk} + W_N^k \sum_{r=0}^{(N/2)-1} x[2r+1]\, W_{N/2}^{rk}, \quad k = 0, 1, \ldots, N-1. \tag{2.6}$$
Each of the sums in equation (2.6) is recognized as an (N/2)-point DFT, the first sum
being the (N/2)-point DFT of the even-numbered points of the original sequence and
the second being the (N/2)-point DFT of the odd-numbered points of the original
sequence; only the DFT of the odd-numbered points is multiplied by $W_N^k$. Fig. 2.1
depicts this computation for N = 8.
In the same way, each (N/2)-point DFT can be decomposed into even and odd
parts, giving two (N/4)-point DFTs; only the odd part is multiplied by
$W_{N/2}^k = W_N^{2k}$, using the fact that $W_{N/2} = W_N^2$. Inserting this
decomposition into the signal flow graph of Fig. 2.1, we obtain the signal flow graph
of Fig. 2.2.
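The even/odd decomposition of equation (2.6), applied recursively, can be sketched in Python as follows. This is a minimal illustration under the assumption that N is a power of 2; the function names `dft` and `fft_dit` and the test sequence are hypothetical, and the code makes no claim about how the thesis hardware is organized.

```python
import cmath

def dft(x):
    """Direct N-point DFT, used here only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT following equation (2.6).

    Assumes len(x) is a power of 2.
    """
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    G = fft_dit(x[0::2])  # (N/2)-point DFT of the even-numbered points
    H = fft_dit(x[1::2])  # (N/2)-point DFT of the odd-numbered points
    # G and H are periodic in k with period N/2, so index them modulo N/2.
    return [G[k % half] + cmath.exp(-2j * cmath.pi * k / N) * H[k % half]
            for k in range(N)]

# Hypothetical 8-point test sequence.
x = [1, 2, 3, 4, 0, -1, -2, -3]
err = max(abs(a - b) for a, b in zip(fft_dit(x), dft(x)))
```

Written this way, each level still performs N multiplications by $W_N^k$; the butterfly simplification discussed later in this section halves that number.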
Fig. 2.1 Signal flow graph of the decimation-in-time decomposition of an N-point
DFT (N = 8) computation into two (N/2)-point DFT computations.
Fig. 2.2 Signal flow graph of the decimation-in-time decomposition of two
(N/2)-point DFT (N = 8) computations into four (N/4)-point DFT computations.
For the 8-point DFT that we have been using as an illustration, the computation
has been reduced to a computation of 2-point DFTs. The 2-point DFT of the sequence
consisting of x[0] and x[4] is depicted in Fig. 2.3. With the computation of Fig. 2.3
inserted in the signal flow graph of Fig. 2.2, we obtain the complete flow graph for
computation of the 8-point DFT, as shown in Fig. 2.4.
Fig. 2.3 Signal flow graph of a 2-point DFT, with branch gains $W_N^0 = 1$ and
$W_N^{N/2} = -1$.
For the more general case, the decomposition of the (N/4)-point transforms
continues in the same way until we are left with only 2-point transforms. This
requires $v = \log_2 N$ stages of computation. If $N = 2^v$, the decomposition can be
done at most $v = \log_2 N$ times, so that after carrying it out as many times as
possible, the number of complex multiplications and additions is equal to
$Nv = N \log_2 N$. This is the substantial computational saving that was indicated
previously.
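To give a feel for the saving, the short Python sketch below tabulates the $N^2$ count of the direct DFT against the $N \log_2 N$ count of the fully decomposed computation. The function names are assumptions made for this example, and the sizes are simply sample powers of 2.

```python
def log2_exact(N):
    """Return v such that N == 2**v; reject sizes that are not powers of 2."""
    if N < 1 or N & (N - 1):
        raise ValueError("N must be a power of 2")
    return N.bit_length() - 1

def direct_dft_mults(N):
    """Complex multiplications for the direct DFT: N^2."""
    return N * N

def radix2_ops(N):
    """Operation count after full radix-2 decomposition: N*v = N*log2(N)."""
    return N * log2_exact(N)

# Rows of (N, direct count, decomposed count) for a few sample sizes.
table = [(N, direct_dft_mults(N), radix2_ops(N)) for N in (8, 64, 1024, 8192)]
for N, direct, fast in table:
    print(f"N={N:5d}  N^2={direct:9d}  N*log2(N)={fast:7d}")
```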
Fig. 2.5 Signal flow graph of radix-2 butterfly computation in Fig. 2.4, with outputs
$A' = A + B\,W_N^{p}$ and $B' = A + B\,W_N^{p+N/2}$.
Computation in the signal flow graph of Fig. 2.4 can be reduced further by using
the property of the coefficients $W_N^p$. We first note that, in proceeding from one
stage to the next in Fig. 2.4, the basic computation is in the form of Fig. 2.5; this
elementary computation is called a radix-2 butterfly. Since
$$W_N^{N/2} = e^{-j(2\pi/N)N/2} = e^{-j\pi} = -1, \tag{2.7}$$
the factor $W_N^{p+N/2}$ can be written as
$$W_N^{p+N/2} = W_N^{N/2}\, W_N^{p} = -W_N^{p}. \tag{2.8}$$
With this observation, the butterfly computation of Fig. 2.5 can be simplified to the
form shown in Fig. 2.6, which requires only one complex multiplication instead of
two. Using the basic signal flow graph of Fig. 2.6 as a replacement for butterflies of
the form of Fig. 2.5, we obtain the signal flow graph of Fig. 2.7 from Fig. 2.4. In
particular, the number of complex multiplications has been reduced by a factor of 2
over the number in Fig. 2.4.
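The effect of equation (2.8) on a single butterfly can be sketched as follows. The function names, the operands A and B, and the values of N and p are hypothetical; the sketch only shows that the two-multiplication and one-multiplication forms give the same outputs.

```python
import cmath

def W(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def butterfly_two_mults(A, B, N, p):
    """Radix-2 butterfly with two complex multiplications."""
    return A + B * W(N, p), A + B * W(N, p + N // 2)

def butterfly_one_mult(A, B, N, p):
    """Simplified butterfly using W_N^(p+N/2) = -W_N^p: one multiplication."""
    t = B * W(N, p)
    return A + t, A - t

# Hypothetical operands and twiddle exponent.
A, B, N, p = 1 + 2j, 3 - 1j, 8, 1
out_two = butterfly_two_mults(A, B, N, p)
out_one = butterfly_one_mult(A, B, N, p)
diff = max(abs(u - v) for u, v in zip(out_two, out_one))
```

Reusing the product `t` for both outputs is the whole saving: one multiplication, one addition, and one subtraction per butterfly.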
Fig. 2.6 Signal flow graph of simplified butterfly computation with only one complex
multiplication.
Fig. 2.7 Signal flow graph of 8-point DFT using the butterfly computation of Fig. 2.6
2.1.2 Decimation-In-Frequency (DIF) FFT Algorithms
We can consider partitioning the output sequence X[k] in the frequency domain into
smaller and smaller subsequences in the same manner. FFT algorithms based on this
process are commonly called decimation-in-frequency (DIF) FFT algorithms.
To develop these FFT algorithms, let us again restrict the discussion to N a power
of 2 and consider computing separately the even-numbered frequency samples and the
odd-numbered frequency samples. Since
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{2.9}$$
the even-numbered frequency samples are
$$X[2r] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(2r)}, \quad r = 0, 1, \ldots, (N/2)-1, \tag{2.10}$$
which can be described as
$$X[2r] = \sum_{n=0}^{(N/2)-1} x[n]\, W_N^{2nr} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{2nr}. \tag{2.11}$$
With a substitution of variables in the second summation in equation (2.11), we obtain
$$X[2r] = \sum_{n=0}^{(N/2)-1} x[n]\, W_N^{2nr} + \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\, W_N^{2r[n+(N/2)]}. \tag{2.12}$$
Because of the periodicity of $W_N^{2rn}$,
$$W_N^{2r[n+(N/2)]} = W_N^{2rn}\, W_N^{rN} = W_N^{2rn}, \tag{2.13}$$
and since $W_N^2 = W_{N/2}$, equation (2.12) can be expressed as
$$X[2r] = \sum_{n=0}^{(N/2)-1} \big(x[n] + x[n+(N/2)]\big)\, W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1. \tag{2.14}$$
Equation (2.14) is the (N/2)-point DFT of the (N/2)-point sequence obtained by
adding the first half and the last half of the input sequence. Adding the two halves of
the input sequence represents time aliasing, consistent with the fact that in computing
only the even-numbered frequency samples, we are under-sampling the Fourier
transform of x[n].
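Equation (2.14) can be verified numerically by comparing the even-numbered samples of a direct N-point DFT with the (N/2)-point DFT of the folded sequence. The sketch below is illustrative only; `dft`, `g`, and the test sequence are assumed names and values.

```python
import cmath

def dft(x):
    """Direct DFT of a sequence of any length."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

# Hypothetical 8-point test sequence.
x = [1, 2, 3, 4, 0, -1, -2, -3]
N = len(x)

# Fold the input: add the first half and the last half (time aliasing).
g = [x[n] + x[n + N // 2] for n in range(N // 2)]

even_direct = dft(x)[0::2]  # X[0], X[2], X[4], X[6] from the full DFT
even_via_g = dft(g)         # (N/2)-point DFT of the folded sequence

even_err = max(abs(a - b) for a, b in zip(even_direct, even_via_g))
```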
We can now consider obtaining the odd-numbered frequency points, given by
$$X[2r+1] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(2r+1)}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.15)$$
As before, we can rewrite equation (2.15) as
$$X[2r+1] = \sum_{n=0}^{(N/2)-1} x[n]\, W_N^{n(2r+1)} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{n(2r+1)}. \qquad (2.16)$$
An alternative form for the second summation in equation (2.16) is
$$\begin{aligned}
\sum_{n=N/2}^{N-1} x[n]\, W_N^{n(2r+1)} &= \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\, W_N^{[n+(N/2)](2r+1)} \\
&= W_N^{(N/2)(2r+1)} \sum_{n=0}^{(N/2)-1} x[n+(N/2)]\, W_N^{n(2r+1)} \\
&= -\sum_{n=0}^{(N/2)-1} x[n+(N/2)]\, W_N^{n(2r+1)},
\end{aligned} \qquad (2.17)$$
where we have employed the fact that $W_N^{(N/2)2r} = 1$ and $W_N^{(N/2)} = -1$. Substituting
equation (2.17) into equation (2.16) and combining the two summations, we obtain
$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] - x[n+(N/2)]\bigr)\, W_N^{n(2r+1)}, \qquad (2.18)$$
or, since $W_N^{2} = W_{N/2}$,
$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] - x[n+(N/2)]\bigr)\, W_N^{n}\, W_{N/2}^{nr}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.19)$$
Equation (2.19) is the (N/2)-point DFT of the sequence obtained by subtracting the
second half of the input sequence from the first half and multiplying the resulting
sequence by $W_N^{n}$. On the basis of equations (2.14) and (2.19), with g[n] =
x[n]+x[n+N/2] and h[n] = x[n]-x[n+N/2], the DFT can be computed by first forming
the sequences g[n] and h[n], then computing $h[n]W_N^{n}$, and finally computing the
(N/2)-point DFTs of these two sequences to obtain the even-numbered output points and
the odd-numbered output points, respectively. The procedure suggested by equations
(2.14) and (2.19) is illustrated for the case of an 8-point DFT in Fig. 2.8.
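The procedure of equations (2.14) and (2.19) can be sketched in software as a recursion on g[n] and h[n]. The following is a minimal illustration rather than the hardware algorithm of this thesis; the function names `dft` and `fft_dif` are our own, and the direct DFT is included only as a reference for checking.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of equation (2.9), used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def fft_dif(x):
    """Radix-2 DIF FFT; len(x) is assumed to be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # g[n] = x[n] + x[n + N/2] feeds the even-numbered outputs, eq. (2.14).
    g = [x[n] + x[n + half] for n in range(half)]
    # h[n] * W_N^n = (x[n] - x[n + N/2]) * W_N^n feeds the odd outputs, eq. (2.19).
    h = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    X = [0j] * N
    X[0::2] = fft_dif(g)   # X[2r]
    X[1::2] = fft_dif(h)   # X[2r+1]
    return X
```

Interleaving the two half-size results restores normal output order; the signal flow graphs in this chapter instead leave the output in bit-reversed order, which avoids this reordering inside the recursion.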
Fig. 2.8 Signal flow graph of DIF decomposition of an N-point DFT computation into
two (N/2)-point DFT computations (N = 8).
Consequently, the (N/2)-point DFTs can be computed by computing the
even-numbered and odd-numbered output points for those DFTs separately. As in the case
of the procedure leading to equations (2.14) and (2.19), this is accomplished by
combining the first half and the last half of the input points for each of the
(N/2)-point DFTs and then computing (N/4)-point DFTs. The signal flow graph resulting
from taking this step for the 8-point example is shown in Fig. 2.9.
Fig. 2.9 Signal flow graph of decimation-in-frequency decomposition of an 8-point
DFT computation into four 2-point DFT computations.
For the 8-point example, the computation has now been reduced to the
computation of 2-point DFTs, which are implemented by adding and subtracting the
input points, as discussed previously. Thus, the 2-point DFTs in Fig. 2.9 can be
replaced by the computation shown in Fig. 2.10, so the computation of the 8-point
DFT and 16-point DFT can be accomplished by the algorithm depicted in Fig. 2.11
and Fig. 2.12.
Fig. 2.10 Signal flow graph of a typical 2-point DFT at the last stage of decomposition.
Fig. 2.11 Signal flow graph of complete DIF decomposition of an 8-point DFT
computation
By counting the arithmetic operations in Fig. 2.11 and generalizing to N = 2^v, we
observe that the computation of Fig. 2.11 requires (N/2)log2N complex multiplications
and Nlog2N complex additions. Thus, the total number of computations is the same
as for the decimation-in-time algorithms.
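As a quick arithmetic check of these counts (the sizes N = 8 and N = 16 are chosen here only because they match the figures; the counts include the trivial multiplications by $W_N^0$):

```latex
\begin{aligned}
N = 8:  &\quad \tfrac{N}{2}\log_2 N = 4 \cdot 3 = 12 \text{ multiplications}, \quad N\log_2 N = 8 \cdot 3 = 24 \text{ additions},\\
N = 16: &\quad \tfrac{N}{2}\log_2 N = 8 \cdot 4 = 32 \text{ multiplications}, \quad N\log_2 N = 16 \cdot 4 = 64 \text{ additions}.
\end{aligned}
```

For comparison, direct evaluation of equation (2.9) needs $N^2 = 64$ complex multiplications for N = 8.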
the output DFT in bit-reversed order. The signal flow graph previously shown in Fig.
2.7 begins with the input sequence in bit-reversed order and provides the output DFT
in normal order.
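Bit-reversed order can be illustrated with a short sketch. The helper names `bit_reverse` and `bit_reversed_order` are ours and are not taken from the thesis; N is assumed to be a power of 2.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return rev


def bit_reversed_order(N):
    """Index sequence in which an N-point radix-2 FFT delivers or expects data."""
    bits = N.bit_length() - 1  # log2(N) when N is a power of 2
    return [bit_reverse(i, bits) for i in range(N)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```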
2.2 High-Radix Algorithm
To further reduce the computational complexity, high-radix FFT algorithms such as
radix-4 and radix-8 can be used. Compared with the radix-2 FFT algorithm, they not
only reduce the number of arithmetic operations and data transfers, but also preserve
the regular structure that makes pipelined hardware convenient to implement.
Here, we consider an input sequence in normal order, based on the
decimation-in-frequency (DIF) FFT algorithm.
2.2.1 Radix-4 DIF FFT Algorithm
To develop the radix-4 DIF FFT algorithm, let us again restrict the discussion to N a
power of 4, i.e., N = 4^v, and start from the even-numbered frequency samples and the
odd-numbered frequency samples of the radix-2 DIF FFT algorithm. Since
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.21)$$
the even-numbered frequency samples are
$$X[2r] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.22)$$
and the odd-numbered frequency samples are
$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] - x[n+(N/2)]\bigr)\, W_N^{n}\, W_{N/2}^{nr}, \quad r = 0, 1, \ldots, (N/2)-1. \qquad (2.23)$$
Equations (2.22) and (2.23) are each separated again into even-numbered and
odd-numbered frequency samples, which yields the radix-4 decomposition. Let
r = 2s select the even part and r = 2s + 1 select the odd part. First, substituting r = 2s
into equation (2.22) gives
$$X[4s] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/2}^{2sn}, \quad s = 0, 1, \ldots, (N/4)-1. \qquad (2.24)$$
Using the fact that $W_{N/2}^{2sn} = W_{N/4}^{sn}$ in equation (2.24), we obtain
$$\begin{aligned}
X[4s] &= \sum_{n=0}^{(N/4)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/4}^{sn} + \sum_{n=N/4}^{(N/2)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/4}^{sn} \\
&= \sum_{n=0}^{(N/4)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/4}^{sn} + \sum_{n=0}^{(N/4)-1} \bigl(x[n+(N/4)] + x[n+(3N/4)]\bigr)\, W_{N/4}^{s[n+(N/4)]}.
\end{aligned} \qquad (2.25)$$
Finally, because of the periodicity of $W_{N/4}^{sn}$,
$$W_{N/4}^{s[n+(N/4)]} = W_{N/4}^{sn}\, W_{N/4}^{s(N/4)} = W_{N/4}^{sn}, \qquad (2.26)$$
equation (2.25) can be expressed as
$$X[4s] = \sum_{n=0}^{(N/4)-1} \Bigl[\bigl(x[n] + x[n+(N/2)]\bigr) + \bigl(x[n+(N/4)] + x[n+(3N/4)]\bigr)\Bigr]\, W_{N/4}^{sn}. \qquad (2.27)$$
Equation (2.27) is the (N/4)-point DFT of the (N/4)-point sequence obtained by adding
the four quarters of the input sequence. Adding the four quarters of the input sequence
represents time aliasing, consistent with the fact that in computing only the even part
of the even-numbered frequency samples, we are under-sampling the Fourier transform of
x[n].
2. Substituting r = 2s into (2.23) gives the even part of equation (2.23).
3. Substituting r = 2s + 1 into (2.23) gives the odd part of equation (2.23).
Then, we obtain the other three equations as below:
$$X[4s+2] = \sum_{n=0}^{(N/4)-1} \Bigl[\bigl(x[n] + x[n+(N/2)]\bigr) - \bigl(x[n+(N/4)] + x[n+(3N/4)]\bigr)\Bigr]\, W_N^{2n}\, W_{N/4}^{sn}, \qquad (2.28)$$
$$X[4s+1] = \sum_{n=0}^{(N/4)-1} \Bigl[\bigl(x[n] - x[n+(N/2)]\bigr) - j\bigl(x[n+(N/4)] - x[n+(3N/4)]\bigr)\Bigr]\, W_N^{n}\, W_{N/4}^{sn}, \qquad (2.29)$$
$$X[4s+3] = \sum_{n=0}^{(N/4)-1} \Bigl[\bigl(x[n] - x[n+(N/2)]\bigr) + j\bigl(x[n+(N/4)] - x[n+(3N/4)]\bigr)\Bigr]\, W_N^{3n}\, W_{N/4}^{sn}. \qquad (2.30)$$
Fig. 2.13 Radix-4 butterfly
We can observe from Fig. 2.13 that the twiddle factors, such as $W_N^{n}$, $W_N^{2n}$ and
$W_N^{3n}$, are extracted only in stage 2, and stage 1 involves only multiplication by the
-j term. The -j multiplication in Fig. 2.13 only interchanges the real part and the
imaginary part of the complex multiplicand and negates the original real part.
Radix-4 decomposition for direct implementation of DFT reduces the number of
multiplications from N^2 to N(log4N-1), which is also less than radix-2 decomposition.
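One radix-4 DIF decomposition step, as given by equations (2.27) to (2.30), can be checked numerically with the sketch below. It is an illustration under our own naming (`dft`, `radix4_dif_step`), not the thesis implementation; N is assumed to be a multiple of 4, and each returned sequence still has to pass through an (N/4)-point DFT.

```python
import cmath


def dft(x):
    """Direct DFT, used as the (N/4)-point transform and as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def radix4_dif_step(x):
    """Return the four (N/4)-point sequences whose DFTs give
    X[4s], X[4s+1], X[4s+2] and X[4s+3]."""
    N = len(x)
    q = N // 4
    y0, y1, y2, y3 = [], [], [], []
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        s02, d02 = a + c, a - c      # stage 1: additions only
        s13, d13 = b + d, b - d
        w1 = cmath.exp(-2j * cmath.pi * n / N)   # W_N^n
        y0.append(s02 + s13)                     # eq. (2.27), no twiddle
        y1.append((d02 - 1j * d13) * w1)         # eq. (2.29), W_N^n
        y2.append((s02 - s13) * w1 ** 2)         # eq. (2.28), W_N^{2n}
        y3.append((d02 + 1j * d13) * w1 ** 3)    # eq. (2.30), W_N^{3n}
    return y0, y1, y2, y3
```

The multiplications by 1j inside the butterfly correspond to the -j and +j terms; in hardware they reduce to swapping the real and imaginary parts with one sign change.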
Fig. 2.14 16-point radix-4 DIF FFT
2.2.2 Radix-8 DIF FFT Algorithm
algorithm. Since
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$
Fig. 2.18 A sketch map of 16-point radix-2/4 DIF FFT.
Fig. 2.19 Signal flow graph of 16-point radix-2/4 DIF FFT
The radix-2/4 decomposition extracts more -j terms than the radix-4
decomposition, but we can observe from Fig. 2.19 that -j terms and W_N terms are
mixed in the complex multiplications at time stage 2. In a pipelined hardware design,
a time stage that mixes -j and W_N terms still needs a complex multiplier to carry
out its multiplications; hence the radix-2/4 decomposition does not gain any
multiplier reduction in a general pipelined hardware design.
2.3.2 Radix-2/8 DIF FFT Algorithm
To develop the radix-2/8 DIF FFT algorithm, let us again restrict the discussion to N
a power of 2, i.e., N = 2^v, and consider separately the odd-numbered
frequency samples of the radix-2/4 DIF FFT algorithm. Since
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (2.47)$$
the even-numbered frequency samples of radix-2/4 decomposition are
$$X[2r] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] + x[n+(N/2)]\bigr)\, W_{N/2}^{rn}, \quad r = 0, 1, \ldots, (N/2)-1, \qquad (2.48)$$
and the odd-numbered frequency samples of radix-2/4 decomposition are
$$X[4s+1] = \sum_{n=0}^{(N/4)-1} \Bigl[\bigl(x[n] - x[n+(N/2)]\bigr) - j\bigl(x[n+(N/4)] - x[n+(3N/4)]\bigr)\Bigr]\, W_N^{n}\, W_{N/4}^{sn}, \qquad (2.49)$$
$$X[4s+3] = \sum_{n=0}^{(N/4)-1} \Bigl[\bigl(x[n] - x[n+(N/2)]\bigr) + j\bigl(x[n+(N/4)] - x[n+(3N/4)]\bigr)\Bigr]\, W_N^{3n}\, W_{N/4}^{sn}. \qquad (2.50)$$
Next, we further separate equations (2.49) and (2.50) into even and odd parts.
Substituting s = 2k and s = 2k + 1 into (2.49) and (2.50), we obtain
$$X[8k+1] = \sum_{n=0}^{(N/8)-1} \Bigl\{\Bigl[\bigl(x[n] - x[n+(N/2)]\bigr) - j\bigl(x[n+(N/4)] - x[n+(3N/4)]\bigr)\Bigr] + \Bigl[\bigl(x[n+(N/8)] - x[n+(5N/8)]\bigr) - j\bigl(x[n+(3N/8)] - x[n+(7N/8)]\bigr)\Bigr] W_8^{1}\Bigr\}\, W_N^{n}\, W_{N/8}^{kn}, \qquad (2.51)$$
X[8k+5] = ...
Equations (2.48), (2.51), (2.52), (2.53) and (2.54) can be depicted as the signal flow
graph of an 8-point DFT in Fig. 2.20, which is called the radix-2/8 butterfly.
Substituting Fig. 2.20 into a 16-point DFT gives the sketch in Fig. 2.21, which is
drawn in detail in Fig. 2.22.
Fig. 2.21 A sketch map of 16-point radix-2/8 DIF FFT
Fig. 2.22 Signal flow graph of 16-point radix-2/8 DIF FFT
The radix-2/8 decomposition extracts more of the constant multiplications $W_8^1$ and
$W_8^3$ than the radix-8 decomposition, but we can observe from Fig. 2.22 that -j,
$W_8^1$, $W_8^3$ and $W_N$ terms are mixed in the complex multiplications at the same time
stage 2. In a general pipelined hardware design, a time stage that mixes -j, $W_8^1$,
$W_8^3$ and $W_N$ terms needs a complex multiplier to carry out all of its
multiplications, so the radix-2/8 decomposition does not gain any multiplier
reduction in the pipelined hardware design.
2.4 Complexity Analysis
So far we have analyzed how the general, high-radix and split-radix algorithms
reduce the number of complex multiplications. We can conclude that a higher-radix
decomposition removes more complex multipliers. When N is not a power of 4, the
radix-4 decomposition must be supplemented by an extra radix-2 decomposition to
complete the N-point DFT. Whether the extra radix-2 decomposition is placed in the
first or the last time stage of the signal flow graph changes the resulting reduction.
For example, 32 is not a power of 4, and the FFT needs log2 32 = 5 time stages to
perform a 32-point DFT. If we use the radix-4 algorithm to decompose the 32-point
DFT, one time stage remains. Adding an extra radix-2 decomposition in the first or
last time stage of the signal flow graph covers the remaining stage. In Fig. 2.23, the
radix-2 decomposition applied in the first stage decomposes the 32-point DFT into two
16-point DFTs. Next, radix-4 decomposition completes the two 16-point DFTs. We can
observe the twiddle factors
$$W_N^{n} \Rightarrow W_{32}^{0}, W_{32}^{1}, W_{32}^{2}, W_{32}^{3}, W_{32}^{4}, W_{32}^{5}, W_{32}^{6}, W_{32}^{7}, W_{32}^{8}, W_{32}^{9}, W_{32}^{10}, W_{32}^{11}, W_{32}^{12}, W_{32}^{13}, W_{32}^{14}, W_{32}^{15}, \quad n = 0, 1, \ldots, 15. \qquad (2.55)$$
$$W_{N/2}^{2n} \Rightarrow W_{32}^{0}, W_{32}^{4}, W_{32}^{8}, W_{32}^{12}, \quad n = 0, 1, 2, 3. \qquad (2.56)$$
$$W_{N/2}^{n} \Rightarrow W_{32}^{0}, W_{32}^{2}, W_{32}^{4}, W_{32}^{6}, \quad n = 0, 1, 2, 3. \qquad (2.57)$$
$$W_{N/2}^{3n} \Rightarrow W_{32}^{0}, W_{32}^{6}, W_{32}^{12}, W_{32}^{18}, \quad n = 0, 1, 2, 3. \qquad (2.58)$$
where $W_{32}^{4} = W_8^{1}$, $W_{32}^{12} = W_8^{3}$, $W_{32}^{20} = W_8^{5}$ and $W_{32}^{28} = W_8^{7}$ are constant
multiplications, $W_{32}^{0} = W^{0}$, $W_{32}^{8} = W_4^{1}$, $W_{32}^{16} = W_2^{1}$ and $W_{32}^{24} = W_4^{3}$ are
non-complex multiplications, and the others are complex multiplications. The total
number of complex multiplications is 20 and the total number of constant
multiplications is 10.
Fig. 2.23 Signal flow graph of 32-point radix-4 DIF FFT based on radix-2
decomposition in the first stage and two radix-4 stages in the next stages
Fig. 2.24 Signal flow graph of 32-point radix-4 DIF FFT based on two radix-4 stages
and one radix-2 stage.
Table 2.1: Complexity analysis of twiddle factors for the radix-4 DIF FFT algorithm
DFT#    radix-2 at first stage (if needed)    radix-2 at last stage (if needed)
        const mul#*    comp mul#*             const mul#    comp mul#
*const mul#: number of constant multiplications
*comp mul#: number of complex multiplications
If radix-4 decomposition is applied to the 32-point DFT first, one time stage
remains at the end. Therefore, radix-2 decomposition handles the last time stage
of the 32-point DFT, as shown in Fig. 2.24. We can observe the twiddle
factors
$$W_N^{2n} \Rightarrow W_{32}^{0}, W_{32}^{2}, W_{32}^{4}, W_{32}^{6}, W_{32}^{8}, W_{32}^{10}, W_{32}^{12}, W_{32}^{14}, \quad n = 0, 1, \ldots, 7. \qquad (2.59)$$
$$W_N^{n} \Rightarrow W_{32}^{0}, W_{32}^{1}, W_{32}^{2}, W_{32}^{3}, W_{32}^{4}, W_{32}^{5}, W_{32}^{6}, W_{32}^{7}, \quad n = 0, 1, \ldots, 7. \qquad (2.60)$$
$$W_N^{3n} \Rightarrow W_{32}^{0}, W_{32}^{3}, W_{32}^{6}, W_{32}^{9}, W_{32}^{12}, W_{32}^{15}, W_{32}^{18}, W_{32}^{21}, \quad n = 0, 1, \ldots, 7. \qquad (2.61)$$
$$W_{N/4}^{2n} \Rightarrow W_{32}^{0}, W_{32}^{8}, \quad n = 0, 1. \qquad (2.62)$$
$$W_{N/4}^{n} \Rightarrow W_{32}^{0}, W_{32}^{4}, \quad n = 0, 1. \qquad (2.63)$$
$$W_{N/4}^{3n} \Rightarrow W_{32}^{0}, W_{32}^{12}, \quad n = 0, 1. \qquad (2.64)$$
where $W_{32}^{4} = W_8^{1}$, $W_{32}^{12} = W_8^{3}$, $W_{32}^{20} = W_8^{5}$ and $W_{32}^{28} = W_8^{7}$ are constant
multiplications, and the others are complex multiplications. The total number of
complex multiplications is 16 and the total number of constant multiplications is 12.
Therefore, using radix-4 decomposition first and radix-2 decomposition at the last
stage removes more complex multiplications when N is not a power of 4. Table 2.1
shows the difference in complex-multiplication reduction between the two ways of
inserting the radix-2 stage. We can conclude that applying the high-radix
decomposition first gives the best performance.
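The two counts for the 32-point example can be reproduced with a short counting sketch. The classification rule and the function names `classify` and `count_twiddles` are our own assumptions made for illustration: a twiddle factor $W_N^e$ is taken as trivial when e is a multiple of N/4 (values of 1, -1, j or -j), as constant when e is an odd multiple of N/8 (the $W_8$ terms), and as complex otherwise. Each stage of radix r on a block of size M contributes the factors $W_M^{bn}$ for b = 1, ..., r-1 and n = 0, ..., M/r - 1, once per block.

```python
def classify(exponent, N):
    """Classify W_N^exponent as 'trivial', 'constant' or 'complex' (N >= 8)."""
    e = exponent % N
    if e % (N // 4) == 0:
        return "trivial"      # 1, -j, -1, j
    if e % (N // 8) == 0:
        return "constant"     # W_8^1, W_8^3, W_8^5, W_8^7
    return "complex"


def count_twiddles(N, radices):
    """Count twiddle multiplications of a DIF flow graph whose stages
    use the given radices in order (their product must equal N)."""
    counts = {"trivial": 0, "constant": 0, "complex": 0}
    block = N
    for r in radices:
        blocks = N // block   # number of sub-DFTs at this stage
        scale = N // block    # W_block^e equals W_N^(e * scale)
        for n in range(block // r):
            for branch in range(1, r):
                counts[classify(branch * n * scale, N)] += blocks
        block //= r
    return counts


print(count_twiddles(32, [2, 4, 4]))  # radix-2 first: 20 complex, 10 constant
print(count_twiddles(32, [4, 4, 2]))  # radix-2 last:  16 complex, 12 constant
```

Under these assumptions the sketch agrees with the totals stated above for Fig. 2.23 and Fig. 2.24.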
Table 2.2: Complexity analysis of twiddle factors for the radix-8 DIF FFT algorithm
radix-2 or 4 at first stage (if needed)    radix-2 or 4 at last stage (if needed)
(a) (b)
When N is not a power of 8, there are two situations: the factor of N left over
after the radix-8 stages is either 4 or 2. When the remaining factor is 4, a radix-4
decomposition is applied at the last stage to complete the computation.
Similarly, when the remaining factor is 2, a radix-2 decomposition is employed at
the last stage. The resulting reduction of complex multipliers for different DFT
sizes is summarized in Table 2.2(a). Now consider the implementation of the
pipelined hardware. If constant and complex multiplications occur at the same time
stage of the SFG, the complex multiplier can compute both, so we count the constant
multiplications as complex multiplications whenever both occur at the same time
stage. With this implementation issue taken into account, the reduction
of complex multipliers for different DFT sizes is summarized in Table 2.2(b).
Table 2.3 shows the number of complex multiplications under the different
radix decompositions. The radix-4 FFT algorithm extracts -j terms to remove part of
the complex multiplications of the radix-2 FFT algorithm. The radix-2/4 FFT algorithm
extracts more -j terms than the radix-4 FFT algorithm and so removes more of the
complex multiplications of the radix-2 FFT algorithm, but its irregular structure
makes a pipelined hardware design difficult to implement.
Table 2.3: Complex multiplications required for radix-2, radix-4 and radix-2/4 FFT
algorithms
Complex Multiplication #
Table 2.4 shows the number of complex multiplications with the radix-8 and
radix-2/8 decompositions. The radix-8 FFT algorithm extracts -j, $W_8^1$ and $W_8^3$ terms
to reduce the complex multiplications of the radix-4 FFT algorithm. The radix-2/8 FFT
algorithm extracts still more constant multiplications, $W_8^1$ and $W_8^3$, than the
radix-8 FFT algorithm, but its irregular structure makes a pipelined hardware design
difficult to implement.
Table 2.4: Complex multiplications required for radix-8 and radix-2/8 FFT.
DFT#    Radix-8 DIF FFT Algorithm        Radix-2/8 DIF FFT Algorithm
        const mul#*    comp mul#*        const mul#    comp mul#
*const mul#: number of constant multiplications
*comp mul#: number of complex multiplications
Chapter 3
FFT/IFFT Architecture
In this chapter, we discuss two methods of implementing the FFT
algorithm: reusing a single butterfly and fully spread, as shown in Fig. 3.1. Table 3.1
compares their speed, area and control complexity.
The single-butterfly implementation employs a single process element (PE).
Using a single PE to implement the FFT algorithm is called the memory-based FFT
architecture. The input, intermediate and output data are stored in memory, so the
bottleneck is the memory access time. The fully spread implementation is generally
called the pipelined FFT architecture. It offers real-time, non-stopping operation and
the least memory requirement. The number of PEs needed is proportional to log_r N,
where r is the radix of the butterfly and N is the size of the DFT.
(a) Single Process Element (b) Fully Spread
Fig. 3.1 Two extreme methods of implementing the FFT algorithm.
Table 3.1: Comparison of single butterfly and fully spread architecture
3.1 Pipelined FFT Architecture
The architecture design of pipelined FFT processors has been the subject of
intensive research since as early as the 1970s, when real-time processing was
demanded in applications such as radar signal processing [8], well before VLSI
technology had advanced to the level of system integration. It is characterized by
real-time, non-stopping processing as the data sequence passes through the processor.
In addition, the pipelined structure is highly regular, so it can be easily scaled and
parameterized when a Hardware Description Language (HDL) is used in the design. The
basic framework of the pipelined FFT architecture is shown in Fig. 3.2.
Fig. 3.2 Basic framework of pipelined FFT.
The delay element can be implemented with a single path or multiple paths. The
butterfly element can be implemented as radix-2 or radix-4. The optimal memory
requirement is N - 1, where N is a power of 2. Furthermore, different
assumptions about the input and output sequence order lead to different pipelined FFT
architectures, as do single or multiple input and output sequences. Several
architectures have been proposed over the last 3 decades. Here the different
approaches are put into functional blocks with unified terminology, where the
additive butterflies have been separated from the multipliers to show the hardware
requirement distinctly.
We assume the input sequence to be in normal order and the output in reversed
(radix-2 or radix-4) order, which is permissible in applications such as DFT-based
communication systems. The single-path pipelined architecture that uses the radix-2
DIF FFT algorithm is called Radix-2 Single-path Delay Feedback (R2SDF) [9]. The
multiple-path pipelined architecture that uses the radix-2 DIF FFT algorithm is called
Radix-2 Multi-path Delay Commutator (R2MDC) [8]. These two are the common pipelined
architectures. Other proposed pipelined architectures extend these two basic
architectures with different radix FFT algorithms, as given in Table 3.2. The R2²SDF
has the same multiplicative complexity as the radix-4 algorithm but retains the
butterfly structure of the radix-2 algorithm. The R2³SDF has the same multiplicative
complexity as the radix-8 algorithm and also retains the butterfly structure of the
radix-2 algorithm. We find that implementing SDF with higher-radix algorithms removes
more complex multipliers, whereas implementing MDC with them does not. If R2²MDC or
R2³MDC is used to implement the pipelined FFT architecture, can the multiplicative
complexity be reduced? We discuss this in the next chapter. Table 3.3 compares the
hardware utilization ratio of the different radix algorithms and architectures, which
indicates which architectures perform well.
Table 3.3: Comparison of hardware utilization
In Table 3.3, we find that architectures with a higher-radix algorithm have a
higher multiplier utilization ratio. The SDF architectures reach 100% utilization
of the FIFOs, which is higher than the MDC architectures. The R2³SDF and R2³MDC,
both built on the radix-2 butterfly structure, have the highest hardware utilization
ratio among the architectures.
Other approaches with Multiple Input Multiple Output (MIMO), proposed in [11,
16-17], rest on different assumptions about the application system. The mixed SDF and
MDC architecture proposed in [18] is a very unusual approach. Although we have
introduced many kinds of pipelined architectures under different assumptions, they
all extend the two basic SDF and MDC pipelined architectures. Next, we discuss these
two basic pipelined architectures: the radix-2 butterfly based single-path delay
feedback and the radix-2 butterfly based multi-path delay commutator.
3.1.1 Radix-2 Single-path Delay Feedback (R2SDF)
Fig. 3.3 R2SDF N=16 (Radix-2 Single-path Delay Feedback)
The following notation is used: N denotes the size of the FFT, and n = log2N
denotes both the number of stages of FFT processing and the number of PEs in the
pipelined architecture. When N is 16, the R2SDF needs 4 PEs, as shown in Fig. 3.3.
The R2SDF consists of the radix-2 butterfly modules (BF2), the delay elements (DE)
and the complex multipliers. The delay elements are shift registers operating as
first-in first-out (FIFO) buffers, and the number in each block is its delay (shift)
length. The number of delay elements in each stage is the key to controlling the
butterfly input and the output to the next stage. The input ordering of the data and
the sequence of delay element operations guarantee proper pairing of all samples at
each stage, so a valid FFT can be performed by arranging the twiddle factors
accordingly. The unfolded delay elements of the R2SDF are shown in Fig. 3.4. The
radix-2 butterfly module has two modes: operation mode and commutator mode, as shown
in Fig. 3.5. The operation mode computes the radix-2 butterfly operation and
commutates the pairs of butterfly results. The commutator mode only commutates pairs
of inputs to pairs of outputs.
Fig. 3.4 Unfolded delay elements of R2SDF
Operation (O)
Commutator (C)
Fig. 3.5 Two modes of the radix-2 butterfly module
When performing an FFT of size N, the first stage of processing combines pairs of
samples whose indices are N/2 apart (samples are indexed from 0 to N - 1). The
second stage combines pairs whose indices are N/4 apart, and so on. The number of
butterfly operations is N/2 at every time stage. The regular relation between the
start of butterfly operation and the delay element in every time stage is what yields
a valid FFT. An 8-point DIF FFT conveniently illustrates the R2SDF. The R2SDF needs 3
PEs to implement the 8-point DFT, as shown in Fig. 3.6. In a radix-2 N-point FFT, the
twiddle factor of the penultimate stage is always the constant multiplication
$W_N^{N/4} = W_8^2 = W_4^1 = -j$. The butterfly outputs of the last stage are
multiplied by $W_N^0 = 1$, so the last PE does not employ a complex multiplier.
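The interplay of commutator mode, operation mode and the feedback FIFO can be illustrated with a behavioral sketch. It is not cycle-accurate with respect to the thesis design and its names (`sdf_stage`, `r2sdf_fft`) are ours. Two simplifications are assumed: each stage is simulated on its own stream, with its latency dropped, rather than on a shared clock, and the twiddle factor is applied to the difference before it enters the FIFO, which is numerically equivalent to multiplying at the stage output.

```python
import cmath
from collections import deque


def sdf_stage(stream, depth):
    """One single-path delay-feedback stage with a FIFO of `depth` words."""
    fifo = deque([0j] * depth)
    out = []
    # `depth` extra flush cycles let the last differences leave the FIFO.
    for t, sample in enumerate(list(stream) + [0j] * depth):
        head = fifo.popleft()
        if (t // depth) % 2 == 0:
            # Commutator mode: pass the FIFO output on, queue the input.
            out.append(head)
            fifo.append(sample)
        else:
            # Operation mode: butterfly on samples that are `depth` apart.
            n = t % depth
            w = cmath.exp(-2j * cmath.pi * n / (2 * depth))
            out.append(head + sample)          # sum leaves immediately
            fifo.append((head - sample) * w)   # difference is fed back
    return out[depth:]  # drop the first `depth` cycles of latency


def r2sdf_fft(x):
    """Chain log2(N) stages with FIFO depths N/2, N/4, ..., 1.
    The result holds X[k] in bit-reversed order."""
    stream = list(x)
    depth = len(x) // 2
    while depth >= 1:
        stream = sdf_stage(stream, depth)
        depth //= 2
    return stream
```

The FIFO depths N/2, N/4, ..., 1 add up to N - 1 words, and in the last stage the only twiddle factor is $W_2^0 = 1$.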
Fig. 3.6 R2SDF (N=8)
The following notation is used: the first input symbol has the input sequence T0~T7,
and the next input symbol is denoted T0'~T7'. The superscript 1 in 0¹~7¹ denotes the
1st time stage output, and the superscript 2 in 0²~7² denotes the 2nd stage output.
The delay element of the first PE queues the input samples T0 to T3; then T0 and T4,
the indices of the input pair for the first butterfly of the first stage, are used to
compute 0¹ and 4¹, the output pair of the first butterfly of the first stage. 0¹
passes to
7/27/2019 etd_0728106_120055
60/113
T5T4T3
T6T5T4
T71T61T51
4.v.a+.%
.7 R2SDF (N=8) data stream owFig. 3
.8 to observe the control mode of buttery module. The squaresWe can use Fig. 3
.8(a) denote that the buttery module enters commutator mode. Thein Fig. 3shown
in each stage were shown in Table 3.4.inputpairs of samples for buttery
Fig. 4.5 R2²MDC (N=128)
Fig. 4.6 R2³MDC (N=128)
Table 4.2: Number of complex multipliers in MDC with different radices.
8192
4 5 6 7" 8 9 10 11
2957832 328648 36l512.8
328648197188 8 262918 4 26291824 328648 1 39437? 6
256 512 1024 2048 4096 8192
2 4 4 4 6 6 6 8
''106551.21'?"2280.81T"2280.8 192691.6 2584212 2584212 1 2T8832 344561.6
4.1.2 Parametrizable Memory Access
Because the memory-based FFT architecture uses a single PE to perform the FFT
operation, the address widths of the memory and ROM change with the size of the FFT,
but the regularity of the memory and ROM address access does not. We exploit this
property to realize a parametrizable design that handles a variable DFT size.
To perform an N-point FFT, the ROM must store N/2 words, so the address
width of the ROM is log2(N/2) bits. Taking the 8-point DFT as an example, the squares
in Fig. 4.7 show the value of the twiddle factor in each stage. The twiddle factors
required in every time stage, $W_N^0$, $W_N^1$, $W_N^2$ and $W_N^3$, are already stored in the
ROM. The ROM address width is log2(N/2) bits, and the ROM address step doubles with
each additional time stage, which suits hardware implementation because a left shift
can replace the multiplication. The regularity of the ROM address access of the
8-point FFT is shown in Table 4.3. If we extend the size of the DFT to 64, the
relation between the butterfly count and the ROM address in each time stage is shown
in Fig. 4.8.
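The doubling rule can be sketched as a small address function. This is our own illustrative formulation, not the address logic of Fig. 4.8: we assume the ROM holds $W_N^0$ to $W_N^{N/2-1}$ at addresses 0 to N/2 - 1, that time stages are numbered from 0, and that the butterfly counter enumerates the N/2 butterflies of a DIF stage in natural order. The name `rom_address` is hypothetical.

```python
def rom_address(N, stage, butterfly):
    """ROM address of the twiddle factor for one butterfly of a radix-2 DIF stage.

    N         -- FFT size (power of 2)
    stage     -- time stage, 0 for the first stage
    butterfly -- butterfly count within the stage, 0 .. N/2 - 1
    """
    span = N >> (stage + 1)              # butterflies per sub-DFT at this stage
    return (butterfly % span) << stage   # left shift replaces multiplication


for stage in range(3):
    print(stage, [rom_address(8, stage, b) for b in range(4)])
# 0 [0, 1, 2, 3]
# 1 [0, 2, 0, 2]
# 2 [0, 0, 0, 0]
```

Changing N only changes the counter and address widths, while the shift-by-stage rule stays the same, which is the regularity the parametrizable design relies on.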
Table 4.3: ROM content in each stage for an 8-point FFT.
Fig. 4.7 Signal data flow of 8-point DIF FFT
Fig. 4.8 Relation between butterfly counts and ROM addresses in each time stage
In parametrizable memory access, we use in-place mode to perform the variable-size
FFT, as discussed in Section 3.2.2. Fig. 4.9 depicts the architecture
of the conflict-free address generator for the radix-r FFT butterfly processor,
assuming that the memory has been partitioned into r banks. In the figure, the r
barrel shifters associated with the stage counter emulate the right-rotation property
of the butterfly unit at different stages. The butterfly counter is designed to step
through all butterfly task assignments at the current stage. Finally, the address
switching implements equation (4.1) so that the output of each barrel
shifter is mapped to the correct memory bank.
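$$\begin{aligned}
\text{Data\_count} &= [d_{n-1}, d_{n-2}, \ldots, d_2, d_1, d_0]_r, \quad n = \log_r N \\
\text{Bank\_index} &= (d_{n-1} + d_{n-2} + \cdots + d_2 + d_1 + d_0) \bmod r \\
\text{Data\_index} &= [d_{n-1}, d_{n-2}, \ldots, d_2, d_1]_r
\end{aligned} \qquad (4.1)$$
Fig. 4.9 Architecture of the address generator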
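The bank mapping of equation (4.1) can be tried out with the following sketch. The helper names `to_digits` and `bank_and_index` are ours, and the code only models the digit arithmetic, not the barrel shifters or the address switching of Fig. 4.9.

```python
def to_digits(value, r, n):
    """Base-r digits of `value`, least significant digit d0 first."""
    digits = []
    for _ in range(n):
        digits.append(value % r)
        value //= r
    return digits


def bank_and_index(data_count, r, n):
    """Map a data address to (bank index, address inside the bank)."""
    d = to_digits(data_count, r, n)
    bank = sum(d) % r                 # Bank_index of equation (4.1)
    index = 0
    for digit in reversed(d[1:]):     # Data_index = [d_{n-1}, ..., d_1]_r
        index = index * r + digit
    return bank, index
```

The r operands of one radix-r butterfly differ in exactly one digit, so their digit sums differ modulo r and they fall into r different banks; this is what allows them to be accessed in the same cycle without conflict.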
4.2 Building Block
We apply three kinds of IP cores, R2³SDF, R2³MDC and memory-based, whose
common building blocks are the radix-2 butterfly modules and the complex multipliers.
Our building blocks follow the RTL coding guidelines of SIP.
The Radix-2 butterfly module (BF2)
The BF2 includes one complex adder at the top, one complex subtractor at the bottom
and a mode input for selecting FFT or IFFT, as shown in Fig. 4.10. When the mode is
IFFT, the divide-by-2 path is passed to the output ports; when the mode is FFT, the
other path is passed. Assume the input pair of the BF2 is (a + jb) at the top and
(c + jd) at the bottom. After the computation of the radix-2 butterfly, the output
pair is
$$\begin{aligned}
\text{top output}: &\ (a + jb) + (c + jd) = (a + c) + j(b + d), \\
\text{bottom output}: &\ (a + jb) - (c + jd) = (a - c) + j(b - d).
\end{aligned} \qquad (4.2)$$
mode: 0 = FFT, 1 = IFFT
Fig. 4.10 Radix-2 butterfly module with mode selection.
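A behavioral sketch of the BF2 with mode selection is given below. The function name `bf2` and the boolean parameter are our own; the hardware works on fixed-point words, whereas this sketch uses floating-point complex numbers.

```python
def bf2(top, bottom, ifft_mode=False):
    """Radix-2 butterfly of equation (4.2) with FFT/IFFT mode selection."""
    total = top + bottom
    difference = top - bottom
    if ifft_mode:
        # IFFT mode selects the divide-by-2 path.
        return total / 2, difference / 2
    return total, difference
```

Dividing by 2 in each of the log2N stages scales the overall result by 1/N, which is the normalization of the inverse DFT.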
The complex multiplier block is implemented with a constant multiplier and a
complex multiplier.
Constant multiplier
$$(a + jb) \times W_8^{1} = \frac{\sqrt{2}}{2}\bigl[(a + b) + j(b - a)\bigr], \qquad (a + jb) \times W_8^{3} = \frac{\sqrt{2}}{2}\bigl[(b - a) - j(a + b)\bigr]. \qquad (4.3)$$
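The constant multiplications by $W_8^1$ and $W_8^3$ can be checked numerically with the sketch below. The function names are ours, and the sketch only shows the arithmetic identity: two real additions followed by two real scalings by √2/2 replace the four real multiplications of a general complex multiplier. How the scaling by √2/2 is realized in the actual hardware is not modeled here.

```python
import math

C = math.sqrt(2) / 2  # |Re| = |Im| of W_8^1 and W_8^3


def mul_w8_1(z):
    """(a + jb) * W_8^1 with W_8^1 = (sqrt(2)/2) * (1 - j)."""
    a, b = z.real, z.imag
    return complex(C * (a + b), C * (b - a))


def mul_w8_3(z):
    """(a + jb) * W_8^3 with W_8^3 = (sqrt(2)/2) * (-1 - j)."""
    a, b = z.real, z.imag
    return complex(C * (b - a), -C * (a + b))
```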
Complex multiplier
Assume the input pair of the complex multiplier is (X1 + jY1) and (X2 + jY2), where
X1, Y1, X2 and Y2 use 2's complement representation. The complex multiplication is
arranged into a real part and an imaginary part as below:
$$\begin{aligned}
\text{real} &= X_1X_2 - Y_1Y_2, \\
\text{imag} &= X_1Y_2 + X_2Y_1.
\end{aligned} \qquad (4.4)$$
There are four real multiplications and two real additions in equation (4.4). Because
the Verilog hardware description language (HDL) cannot support signed multiplication,
Fig. 4.13 Complex multiplier architecture using unsigned multipliers.
With DW02_Mult provided by Synopsys, signed multipliers can be used to
implement the complex multiplication, so the 2's complement correction in Fig. 4.13
is omitted; the resulting architecture is depicted in Fig. 4.14.
Fig. 4.14 Complex multiplier architecture using signed multipliers.
For the UMC 0.18 µm process, a multiplier word length of 20 bits (10 bits for the
real part and 10 bits for the imaginary part) and a clock rate of 60 MHz, the
statistics are shown in Table 4.4. A complex multiplier built from unsigned
multipliers is called technology independent (TI), whereas one built from signed
multipliers is called DesignWare (DW). The cost of the constant multiplier is clearly
lower than that of the complex multiplier, so replacing a complex multiplier by a
constant multiplier is indeed beneficial. When the user can utilize DW, the provided
design achieves a lower gate count and a higher speed.
Table 4.4: Comparison between constant multiplier and complex multiplier of TI and
DW
UMC 0.18 µm, word length 20 bits and clock rate 260MHz
Total cell area (µm²): 32864.8, 29465.2
Total dynamic power (mW): 4.3034, 3.6627, 1.3652
4.3 FFT/IFFT Compiler Flow
In the FFT/IFFT compiler flow, the user obtains the desired circuit by choosing
parameters through the user interface. Table 4.5 lists and describes all parameters of
our FFT/IFFT compiler. After the parameters are chosen, the FFT/IFFT compiler
generates the design models automatically, as shown in Fig. 4.15.
Untimed functional model
Provides a C simulation model that verifies the operation results of the FFT/IFFT
Verilog RTL code. The C simulation model generates golden patterns for the test bench
from a common input pattern.
Verilog RTL model
Provides synthesizable Verilog code of the FFT/IFFT, which helps the user integrate
it into their own design.
Test model
Provides the test bench and test pattern files to simulate and test the circuit,
and uses the golden patterns to verify the design by automatic comparison.
Script model
Generates the synthesis script file and the testing script file so that the user can
synthesize the circuit and perform test insertion.
Bus functional model
Provides the AMBA AHB interface for testing compatibility with the application system.
Table 4.5: Parameter information of the FFT/IFFT compiler
Size of FFT/IFFT: range 64, 128, ..., 8192
Vendor-specific directives: choose vendor-specific directives (DesignWare) or
technology independent
    0: technology independent
    1: vendor-specific directives
Clock rate: FFT/IFFT system clock rate
Data width: data width of each real and imaginary part of the complex data
sub-pipe
Vender-sdirective
architectclockratethroughputrate...EoD-
4.4 Specification
We use the 128-point FFT as an example to show the block diagram, I/O definition,
timing diagram, and synthesized results of the R23SDF, R23MDC and memory-based
FFT architectures, and then analyze their suitable applications.
4.4.1 Specification of the 128-Point R23SDF
The provided R23SDF architecture is shown in Fig. 4.16. In the figure, the twiddle factors in the smallest rectangular forms at the penultimate stage of the SFG are
W,3,W52",W,;",W,;",W5",W,3",W54",W56",n =0,1,...,N/64-1, (4.5)
based on the relation between the input sequence and the butterfly module in each
stage, which we have discussed before, as depicted in Fig. 4.17.
Fig. 4.16 SFG of 128-point FFT and R23SDF block diagram
[Timing diagram: outputs of stages 1 to 7 with pipeline hold cycles. Total cycles: N + (N - 1) + 7 stage pipeline registers = 262 (0 to 261).]
Fig. 4.19 128-point R23SDF timing diagram with pipeline.
In the 128-point FFT, the R23SDF starts exporting the output sequence after
134, (N - 1 + log2 N), cycles.
[Timing diagram: signals clk, start, trans_input, ready and trans_output; trans_output begins with samples 0, 64, 32 after 127 + 7 cycles.]
Fig. 4.20 128-point R23SDF timing diagram with I/O information
Table 4.6: I/O ports of R23SDF.
R23SDF pipeline FFT
start           High level means the circuit is receiving input symbols from the
                upper system.
trans_input     Input port receiving input symbols from the upper system.
mode            Low level: execute the FFT operation.
                High level: execute the IFFT operation.
clk             Clock signal
ready           High level means the output port has valid data ready.
trans_output    Output port of computation result
4.4.2 Specification of the 128-Point R23MDC
The block diagram of the R23MDC implementation is shown in Fig. 4.21 and the timing
diagram is shown in Fig. 4.22. The figures show the regularity of the control circuit in
every time stage. Likewise, the output latency of the R23MDC increases when
pipelined registers are inserted. The pipelined registers in each stage and the control
circuit of the pipelined architecture are depicted in Fig. 4.23, and the corresponding timing
diagram in Fig. 4.24. The timing diagram of the input/output (I/O) ports is shown in
Fig. 4.25 and the I/O signals are described in Table 4.7.
Fig. 4.21 Block diagram of 128-point R23MDC
[Timing diagram: stage control signals with pipeline registers. Total cycles: N - 1 + N/2 + log2 N = 198 (0 to 197).]
Fig. 4.24 Timing diagram of 128-point R23MDC with pipelined register
In the 128-point DFT, the R23MDC starts exporting the output sequence after 134,
(N - 1 + log2 N), cycles. Because the output port of the R23MDC is multi-path, the result
data need 64, (N / 2), cycles to be exported completely.
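The cycle counts quoted for the two pipelined architectures can be checked with a short calculation. The sketch below is only an illustration in Python: the function names are hypothetical, and the formulas are the ones stated in the text and in the annotations of Figs. 4.19 and 4.24.

```python
def log2_int(n: int) -> int:
    """Exact log2 for a power-of-two FFT size."""
    return n.bit_length() - 1

def first_output_cycle(n: int) -> int:
    """Cycles before the first output sample: N - 1 + log2(N)."""
    return n - 1 + log2_int(n)

def sdf_total_cycles(n: int) -> int:
    """R23SDF with pipeline registers: N + (N - 1) + log2(N)."""
    return n + (n - 1) + log2_int(n)

def mdc_total_cycles(n: int) -> int:
    """R23MDC with pipeline registers: N - 1 + N/2 + log2(N)."""
    return n - 1 + n // 2 + log2_int(n)

if __name__ == "__main__":
    n = 128
    print(first_output_cycle(n), sdf_total_cycles(n), mdc_total_cycles(n))
```

For N = 128 the three values are 134, 262 and 198, matching the numbers given above. The R23MDC finishes 64 cycles earlier because its two output paths export the results in N/2 cycles.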
[Timing diagram: input samples and the two output ports trans_output1 and trans_output2; the outputs are exported in 64 cycles.]
Fig. 4.25 128-point R23MDC timing diagram with I/O information
Table 4.7: I/O ports of R23MDC
R23MDC pipeline FFT
start            High level means the circuit is receiving input symbols from the
                 upper system.
trans_input      Input port receiving input symbols from the upper system.
mode             Low level: execute the FFT operation.
                 High level: execute the IFFT operation.
rst_p            Asynchronous reset signal, positive edge trigger.
clk              Clock signal
ready            High level means the output port has valid data ready.
trans_output1    Output port 1 of computation result
trans_output2    Output port 2 of computation result
4.4.3 Specification of the Memory-Based FFT Architecture
√ Block diagram
[Block diagram: process element, RAM, ROM, MEM_in and MEM_out blocks; signals clk, rst_p, mode and trans_output.]
√ Timing diagram
The reset signal rst_p must first be set high to trigger the memory-based architecture.
Then, the two dual-port memories begin receiving the input data once the
primary input start is pulled high. This signal stays high until all 128 sets
of data have been input, and is then pulled down. When all 448, (128/2 x log2 128), butterflies complete their
operations, the output signal trans_output starts outputting the computed results;
simultaneously, the other output signal ready must also be set high to tell the outer
circuits that the current output data are valid. Finally, the ready signal is pulled
down when all 128 sets of data have been output.
64*7=896c)c1
Fig. 4.27 Timing diagram of a 128-point memory-based architecture.
√ I/O Definition
Table 4.8: I/O ports of memory-based FFT architecture.
Memory-based FFT architecture
start           High level means the circuit is receiving input symbols from the
                upper system.
rst_p           Asynchronous reset signal, positive edge trigger.
clk             Clock signal
ready           High level means that the output port has valid data ready.
trans_output    Output port of computation result
WEN1
WEN2
OEN
4.4.4 Synthesis Result
Technology file: Artisan UMC 0.18 1P6M cell library
Word length: 20 bits (10 bits for the real part and 10 bits for the imaginary part)
√ R23SDF
Table 4.9 lists the gate count of the different approaches. We find that the
original design without any additional option, such as inserting extra pipeline registers or using
the complex multiplier of Design Ware, has the lowest gate count when the clock rate is lower
than 60 MHz. When the clock rate is higher than 70 MHz, option 3 of sub-pipelined
insertion has the lowest gate count and remains workable up to 130 MHz. In addition, for option
3 of sub-pipelined insertion with increasing clock rate, the area is smaller than the original
Table 4.9: Gate count of the R23SDF
Table 4.10: Power consumption of the R23SDF at 128-point FFT (mW).
16.3537 16.5821 17.4383 17.6724 17.6523
23.3062 23.3919 24.5416 24.7647 24.7128
30.580630.4169828.180131.935431.8581
39.4032 39.3024 39.0054
46.2579
√ R23MDC
Following the above analysis, we try some cases to quickly check whether the R23MDC
has the same property as the R23SDF. According to Table 4.11, the characteristic is the same
as that of the R23SDF when option 3 of sub-pipelined insertion is chosen for the R23MDC.
However, we find that not only is the gate count larger than that of the R23SDF, but the
power consumption, as shown in Table 4.12, is also higher. Nevertheless,
the R23MDC is not without advantage when an application needs the
unused N/2 cycles for other work, such as converting the bit-reversed output sequence
into normal order.
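The reordering mentioned above can be illustrated with a small Python sketch. It is not taken from the thesis: the helper names are hypothetical, and it only assumes that the output at position p of the pipeline holds the frequency sample whose index is the bit reversal of p, as suggested by the 0, 64, 32 output order in Fig. 4.20.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def to_normal_order(bit_reversed_seq):
    """Place the sample found at position p into slot bit_reverse(p)."""
    n = len(bit_reversed_seq)
    bits = n.bit_length() - 1  # log2(N) for a power-of-two length
    ordered = [None] * n
    for position, value in enumerate(bit_reversed_seq):
        ordered[bit_reverse(position, bits)] = value
    return ordered

if __name__ == "__main__":
    # For N = 128 the first outputs carry indices 0, 64, 32, 96.
    print([bit_reverse(p, 7) for p in range(4)])
```

In hardware this permutation needs a buffer of N words, which is why spare cycles after the output burst are useful for it.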
Table 4.11: Gate count of the R23MDC at 128-point FFT/IFFT.
Table 4.12: Power consumption of the R23MDC at 128-point FFT (mW).
33.5921 36.5175 36.2135
58.5101
46.8579
√ Memory-based architecture
We try to find the fastest clock rate for this architecture. The fastest clock rate is
80 MHz when using the technology-independent complex multiplier. The clock rate
increases to 100 MHz when using the complex multiplier of Design Ware, which also
decreases the gate count.
Table 4.13: Power consumption of the R23SDF at 128-point FFT (mW).
using technology independent (@ 80 MHz)
using Design Ware (@ 100 MHz)
4.4.5 Analysis of Suitable Applications
In an N-point FFT implemented with a radix-r PE, where r is a power of 2, the number
of butterflies per stage is N / r and the number of stages is $\log_r N$.
Under such a circumstance, we describe the execution flow of the provided pipelined
architecture and memory-based architecture in Figs. 4.28 and 4.29, respectively. The
clock rate and the throughput rate are the same for our provided pipelined
architecture because it operates in real time and without stopping. In
contrast, the throughput rate differs from the clock rate for the memory-based
architecture, since every symbol must be loaded, processed and unloaded. In this situation, the relation between the
throughput rate and the clock rate can be represented as follows.
$$\text{Throughput rate} = \frac{N}{2N + \frac{N}{r}\log_r N} \times \text{Clock Rate} = \frac{\text{Clock Rate}}{2 + \frac{\log_r N}{r}} \qquad (4.7)$$
Fig. 4.29 Execution flow of the provided memory-based FFT architecture
According to the synthesis results given before, the maximum operating frequency of
our memory-based architecture is 100 MHz. Assuming a 64-point FFT
and an operating frequency of 100 MHz, the throughput rate is
$$\frac{\text{Clock Rate}}{2 + \frac{\log_r N}{r}} = \frac{100\,\text{MHz}}{2 + \frac{\log_2 64}{2}} = 20\,\text{Mbps}. \qquad (4.8)$$
In the same way, for an 8192-point FFT, the throughput rate becomes
$$\frac{\text{Clock Rate}}{2 + \frac{\log_r N}{r}} = \frac{100\,\text{MHz}}{2 + \frac{\log_2 8192}{2}} = 11.76\,\text{Mbps}. \qquad (4.9)$$
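The two numerical results can be reproduced with a few lines of Python. This is only a sketch of Eq. (4.7) under the assumptions already stated in the text (N cycles to load, N cycles to unload, and N/r butterflies in each of the log_r N stages); the function names are hypothetical.

```python
def num_stages(n: int, r: int) -> int:
    """Integer log_r(N), assuming N is a power of r."""
    stages = 0
    while n > 1:
        n //= r
        stages += 1
    return stages

def cycles_per_symbol(n: int, r: int = 2) -> int:
    """2N cycles of input/output plus (N/r) * log_r(N) butterfly cycles."""
    return 2 * n + (n // r) * num_stages(n, r)

def throughput_rate(clock_mhz: float, n: int, r: int = 2) -> float:
    """Eq. (4.7): N samples delivered every cycles_per_symbol(N) clocks."""
    return clock_mhz * n / cycles_per_symbol(n, r)

if __name__ == "__main__":
    for size in (64, 128, 8192):
        print(size, cycles_per_symbol(size), round(throughput_rate(100.0, size), 2))
```

For the 128-point case the butterfly term evaluates to 448, the same count quoted in the timing description of the memory-based architecture.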
As seen from the specifications of the associated OFDM-based communication
systems given in Tables 4.14 and 4.15, most applications except UWB could be
realized using our developed architectures. However, the required word length
increases for higher precision when implementing an FFT with a very large number of
points. In this condition, the proposed architectures may not be able to operate at the
highest frequency, 100 MHz. Determining the maximum operating
frequency for each FFT size requires more experiments, which we have not
yet completed.
Table 4.14: FFT/IFFT size for OFDM-based communication system
DVB-T ~ DAB ~ VDSL
system
64 . 20
2 x 256 2.22
2.22*22x256x2,n=0,...,4 23 1
256x2,n=0,...,3 8.26
8192/2048 896/224 9.14/9.14
128 0.24242 528
Chapter 5
Verification and Performance
In this chapter, we discuss the possibility of finding a cost function. With the cost
function, the capability of the FFT/IFFT compiler can be raised. We also construct
a C simulation model for verifying the proposed design. Finally, a verification plan and
a comparison with other works are given.
5.1 Cost Function and Derivation
A good cost function is built on the statistics of the power consumption and area of all
proposed architectures, calculated with the given parameters, which include the
size of the FFT, the clock rate and the throughput rate. After analyzing the statistics,
the FFT/IFFT compiler indicates which architecture is the first choice under the
given parameters.
According to the analysis results of our research, the FFT/IFFT compiler
automatically chooses, among the proposed architectures, the one with the lowest gate
count under the given throughput rate and clock rate. From the previous discussion, the pipelined
architecture and the memory-based architecture call for different considerations under a
throughput-rate requirement. We can therefore arrange for the FFT/IFFT compiler to
automatically choose the provided architecture with the lowest gate count according to
the throughput rate. However, when the size of the FFT or the word length changes, the
critical path of the proposed architecture and the ranges in Fig. 5.1 will differ.
[Figure labels: Sub-pipe option 1 R23SDF; Sub-pipe option 3 R23SDF; Non-sub-pipe R23SDF; Memory Based. Throughput rate axis (using Design Ware): 20, 60, 70, 140 Mbps.]
Fig. 5.1 Choosing an architecture based on the specified throughput rate.
5.2 C Simulation Model
Constructing a C simulation model makes it possible to obtain intermediate simulation
values, which are useful for debugging while the chip is being implemented. Another
purpose of the C simulation model is to generate golden patterns to verify the proposed
design. Next, we discuss how to construct the C simulation model from the FFT algorithm.
Based on the discussions of the algorithm in the previous chapters, the regularity of
the FFT algorithm is already known: the twiddle factors vary in each
time stage depending on the radix of the FFT algorithm. We use the radix-8 algorithm
to explain how to construct a C simulation model. All operations of the construction
process must be carried out in fixed-point arithmetic so that the simulation
result matches the hardware result. We illustrate the construction flow of the C simulation using the
example shown in Fig. 5.2.
Step 1: Operate all butterflies of one stage first, then save the results in
place.
Step 2: Multiply the butterfly outputs by the twiddle factors and, if every stage's
output needs to be traced, export a file.
Step 3: Return to Step 1 until all time stages have been processed, then export the result file,
which is the golden pattern.
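The three steps can be sketched as a small fixed-point model. The following Python code is an illustration only, not the C model of the thesis: it uses a radix-2 decimation-in-frequency butterfly instead of radix-8 for brevity, the names `quantize` and `fixed_point_fft` are hypothetical, data and twiddle factors are assumed to share `frac_bits` fractional bits, and word-length overflow is not modeled because Python integers are unbounded.

```python
import cmath

def quantize(value: float, frac_bits: int) -> int:
    """Round a real number to an integer with frac_bits fractional bits."""
    return int(round(value * (1 << frac_bits)))

def fixed_point_fft(re, im, frac_bits, trace=None):
    """In-place radix-2 DIF FFT on integer lists; output is bit-reversed."""
    n = len(re)
    half = n // 2
    while half >= 1:
        span = 2 * half
        # Step 1: all butterflies of this stage, results stored in place.
        for start in range(0, n, span):
            for k in range(half):
                i, j = start + k, start + k + half
                ar, ai, br, bi = re[i], im[i], re[j], im[j]
                re[i], im[i] = ar + br, ai + bi
                re[j], im[j] = ar - br, ai - bi
        # Step 2: twiddle multiplication on the lower butterfly outputs.
        for start in range(0, n, span):
            for k in range(half):
                j = start + k + half
                w = cmath.exp(-2j * cmath.pi * k / span)
                wr, wi = quantize(w.real, frac_bits), quantize(w.imag, frac_bits)
                dr, di = re[j], im[j]
                # The right shift truncates (floors) the product.
                re[j] = (dr * wr - di * wi) >> frac_bits
                im[j] = (dr * wi + di * wr) >> frac_bits
        if trace is not None:
            trace.append((list(re), list(im)))  # per-stage values for debugging
        half //= 2  # Step 3: continue with the next stage
    return re, im
```

The optional `trace` list plays the role of the per-stage export file: each entry holds the stage output that could be compared against the corresponding RTL stage, and the final contents of `re` and `im` correspond to the golden pattern.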
Fig. 5.2 (panels: Step 1, Step 2)
from bit-reversed order to normal order before injecting them into the FFT. Based on the
above-mentioned verification plan, we can ensure that our design is correct.
[Figure blocks: golden pattern, FFT (RTL code), golden pattern]
Fig. 5.3 Verification plan
5.4 Performance Evaluation
With the parameterizable control, different simulation results can be acquired by
changing the input data width. As mentioned in the last section, the verification plan is
completed by 1) first injecting the test patterns, which come from the pattern
generator, into the IFFT and 2) then applying the computed results to the FFT. The
signal-to-quantization-noise ratio (SNR) can therefore be obtained by
analyzing the input sequence of the IFFT and the resulting output sequence of the FFT.
In our design, we assume that the input, the output and the twiddle factors have the same
data width. For simplicity of explanation, "data width" subsequently denotes the
data width of the above-mentioned signals. Fig. 5.4 depicts the relation between
SNR and the data width for the 128-point FFT. The SNR is higher than
30 dB when the data width is larger than 11x2 bits, and higher than 40 dB when it is larger
than 15x2 bits.
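The SNR measurement described above amounts to comparing the sequence fed to the IFFT with the sequence returned by the FFT. A minimal Python sketch follows; the function name `sqnr_db` is hypothetical and the definition used (signal energy over error energy, in dB) is the standard one, assumed here because the text does not spell out its formula.

```python
import math

def sqnr_db(reference, result):
    """10*log10(signal energy / error energy) for two complex sequences."""
    signal = sum(abs(x) ** 2 for x in reference)
    noise = sum(abs(x - y) ** 2 for x, y in zip(reference, result))
    if noise == 0:
        return math.inf  # bit-exact round trip
    return 10.0 * math.log10(signal / noise)

if __name__ == "__main__":
    sent = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]
    received = [1.1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]
    print(round(sqnr_db(sent, received), 2))
```

Here `reference` would be the IFFT input and `result` the FFT output after rescaling to the same numeric range; sweeping the data width and recording the value gives a curve of the kind shown in Fig. 5.4.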
Fig. 5.4 SNR curve in the 128-point FFT
Table 5.1 lists the synthesis results of our proposed R23SDF architecture and
another work [25]. Assuming that the data width is 20 bits, both the
area and power consumption of our proposed architecture are less than those in [25].
Table 5.1: ASIC synthesis result at clock frequency of 132MHz.
proposed
Chapter 6
Conclusions and Future Work
6.1 Conclusions
We present an efficient FFT/IFFT compiler which consists of three IP cores: the
R23SDF, R23MDC and memory-based FFT architectures. The inputs to our developed
generator are a set of user-defined parameters. According to the provided input
constraints, our generator takes into account the trade-off
between hardware overhead and speed requirement and outputs suitable RTL code
for the user's reference. Based on our development, not only can a dedicated FFT/IFFT
module be easily prototyped for fast system verification, but the resulting
compiler can also be used as a basis for more advanced research in the