project report1.pdf
TRANSCRIPT
-
DESIGN AND IMPLEMENTATION OF ANASYNCHRONOUS
PROCE
Masters thes
Electronics
Jonas C
Reg nr: LiTH-ISY
Linkping, Ju PIPELINED FFTSSOR
is project at Systems
laeson
-EX-3356-2003ne 13, 2003
-
DESIGN AND IMPLEMENTATION OF ANASYNCHRONOUS
PROCE
Examensarbete utfrtvid Linkpings Te
av
Jonas C
Reg nr: LiTH-ISY
Supervisor: W
Examiner: prof. L
Linkping, Ju PIPELINED FFTSSOR
i Elektroniksystemkniska Hgskola
laeson
-EX-3356-2003
eidong Liars Wanhammar
ne 12, 2003
-
SammanfattningAbstract
NyckelordKeywords
RapporttypReport: category
Licentiatavhandling
C-uppsatsD-uppsatsvrig rapport
SprkLanguage
Svenska/SwedishEngelska/English
ISBN
Serietitel och serienummerTitle of series, numbering
URL fr elektronisk version
TitelTitle
FrfattareAuthor
DatumDateAvdelning, InstitutionDivision, department
Department of Electrical Engineering
ISRNExamensarbete
ISSNX
581 83 LINKPING
http://www.ep.liu.se/exjobb/isy/2003/3356
Design och implementering av en asynkron pipelinad FFT processor
Design and Implementation of an Asynchronous Pipelined FFT Processor
FFT processors are today one of the most important blocks in communication equipment. They areused in everything from broadband to 3G and digital TV to Radio LANs. This master's thesis projectwill deal with pipelined hardware solutions for FFT processors with long FFT transforms, 1k to 8kpoints. These processors could be used for instance in OFDM communication systems.
The final implementation of the FFT processor uses a GALS (Globally Asynchronous Locally Syn-chronous) architecture, that implements the SDF (Single Delay Feedback) radix-22 algorithm.The goal of this report is to outline the knowledge gained during the master's thesis project, to de-scribe a design methodology and to document the different building blocks needed in these kinds ofsystems.
DFT, FFT, Pipelined, Parameterizable, Processor, GALS, Radix-22, SDF
X
2003-06-06
Jonas Claeson
LiTH-ISY-EX-3356-2003
-
FFT processors are today one of the mcation equipment. They are used in evdigital TV to Radio LANs. This mastpipelined hardware solutions for FFTforms, 1k to 8k points. These processOFDM communication systems.
The final implementation of the procechronous Locally Synchronous) archi(Single Delay Feedback) radix-22 alg
The goal of this report is to outline theters thesis project, to describe a desigthe different building blocks needed iAbstractost important blocks in communi-
erything from broadband to 3G anders thesis project will deal with processors with long FFT trans-ors could be used for instance in
ssor uses a GALS (Globally Asyn-tecture, that implements the SDForithm.
knowledge gained during the mas-n methodology and to document
n these kinds of systems.i
-
Abstract
ii
-
AcknowFirst of all I would like to thank my exfor giving me this interesting mastersdirections of my work. I would also liLi for his help with more detailed quehelped me a lot is Kent Palmkvist wittions, and Jonas Carlsson with questiocuits.ledgementsaminer professor Lars Wanhammar thesis project and for the generalke to thank my supervisor Weidongstions. Two other persons that
h VHDL and synthesis related ques-ns concerning asynchronous cir-iii
-
Acknowledgements
iv
-
TeTable: Terminology.
Abbreviation or term ExplanationBFP Block Floating
internally.butterfly Basic buildingCG FFT Constant GeoCOFDM Coded OrthogDFT Discrete Fouri
continuous foua time-domain
DFT Design For Teease and speed
DIF Decimation Inimplementing
DIT Decimation Inmenting a rad
FFT Fast Fourier TDFT.
GALS Globally Asyndecomposingthat communi
HW Hardware.in-place algorithm Output of a bu
came from.LS-system Locally SynchMDC Multipath Del
PEs in a pipelrminology
Point. One way of representing data
block in HW FFT processors.metry FFT.onal Frequency Division Multiplexing.er Transform. The discrete version of therier transform. Transforms a signal from to a frequency-domain.st. Extra HW is added in the design to up the testing. Frequency. One out of two ways of a radix PE. Time. One out of two ways of imple-ix PE.ransform. Quick way of computing a
chronous Locally Synchronous. A way ofa system into several synchronous blockscate with an asynchronous protocol.
tterfly is written back to where the input
ronous system.ay Commutator. Block between radixined architecture.v
-
Terminology
vi
not-in-place algorithm Output of a butterfly is not written back to where theinput came from.
OFDM Orthogonal Frequency Division Multiplexing. OFDMis a broadband multicarrier modulation method used ina lot of comm
PE Processing ElSDC Single-path D
PEs in a pipelSDF Single-path D
in a pipelined SFG Signal Flow G
ical way usingSIC Single InstrucSNR Signal to Nois
FFT context.
Table: Terminology.
Abbreviation or term Explanationunication systems.ement.elay Commutator. Block between radixined architecture.elay Feedback. Block between radix PEsarchitecture.raph. Describes an algorithm in a graph- adders, multipliers, signal wires, etc.tion Computer.e Ratio. Not a good measurement in the
-
Table: Symbols.
Symbol ExplanationN Length of the
FFT.x Input signal toX FFT transform
Table: Operators and functions.
Operator or function Explanationa|b b is divisible bN X modulo N.Notation
input and output sequence of a DFT or
an FFT processor. of the input signal x.
y a, i.e. b/a gives 0 in rest.vii
-
Notation
viii
-
Table oAbstract.........................................
Acknowledgements.......................
Terminology ..................................
Notation.........................................
Table of Contents..........................
1 Introduction..................................1.1 General 11.2 Scope of the Report 11.3 Project Requirements 21.4 Reading Instructions 3
2 Algorithms ....................................2.1 Introduction 52.2 The DFT Algorithm 52.3 FFT Algorithms 62.4 Common Factor Algorithms 62.5 Radix-2 Algorithm 72.6 Radix-r Algorithm 92.7 Split Radix Algorithm 92.8 Mixed Radix Algorithm 92.9 Prime Factor Algorithms 10f Contents.................................. i
.................................. iii
.................................. v
.................................. vii
.................................. ix
.................................. 1
.................................. 5ix
-
Table of Contents
x
2.10 Radix-r Butterflies 10
3 Architectures .................................................................. 133.1 Introduction 133.2 Array Architectures 133.3 Column Architectures 143.4 Pipelined Architectures 14
3.4.1 MDC, SDF and SDC Com3.4.2 Pipeline Architecture Com
3.5 Multipipelined Architectures 13.6 SIC FFT Architectures 173.7 Cached-FFT Architectures 18
4 Numerical Effects.........................4.1 Introduction 194.2 Safe Scaling 19
4.2.1 Radix-2 Safe Scaling 204.2.2 Radix-r Safe Scaling 21
4.3 Quantization 214.3.1 Twos Complement Quan4.3.2 Radix-2 Quantization 224.3.3 Radix-r Quantization 23
5 Implementation Choices..............5.1 Introduction 255.2 Algorithm Choice 255.3 Architecture Choice 26
6 Radix-22 FFTs ..............................6.1 Introduction 276.2 Algorithm 276.3 Architecture 296.4 Numerical Effects 30
7 FFT Design ...................................7.1 Introduction 317.2 Matlab Design 31mutators 15parisons 16
7
.................................. 19
tization 21
.................................. 25
.................................. 27
.................................. 31
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
7.2.1 Problems and Solutions 327.3 Matlab Simulations 327.4 VHDL Design 33
7.4.1 Problem 1 and Solution - Abstraction 1 337.4.2 Problem 2 and Solution - Object Orientation 347.4.3 Problem 3 and Solution - Control Block 35
7.5 Design for Test 357.6 VHDL Simulations 367.7 Synchronous or Asynchronou7.8 Testing 36
7.8.1 Random Testing 367.8.2 Corner Testing 377.8.3 Block Testing 377.8.4 Golden Model Testing 377.8.5 FPGA Testing 37
7.9 Synthesis 397.10 Meetings 40
8 Asynchronous Design...................8.1 Introduction 418.2 Asynchronous Circuits 418.3 GALS 42
8.3.1 Asynchronous Wrappers 8.3.2 Enable generation 43
8.4 Design Automation 448.5 Asynchronous FFT Architectu8.6 Testing 468.7 Synthesis 468.8 Summary of GALS Design 46
9 Future Work .................................9.1 Introduction 499.2 Word Length Optimization 49
9.2.1 General 499.2.2 Gradient Search 509.2.3 Utility Function 50
9.3 VLSI Layout 50xi
s Design 36
.................................. 41
43
re 45
.................................. 49
-
Table of Contents
xii
9.4 VLSI Layout of Asynchronous Parts 519.5 Completely Asynchronous Design 519.6 Design for Test 519.7 Twiddle Factor Memory Reduction 519.8 Commutators Implemented with RAM 529.9 Unscrambler 52
10 Summary.....................................10.1 Conclusions 5310.2 Follow-up of Requirements 5
11 Bibliography ................................................................. 53
3
.................................. 55
-
1 In1.1 General
FFT processors are involved in a wideonly as a very important block in broaalso in areas like radar, medical electrfor Extraterrestrial Intelligence).
Many of these systems are real-time stems has to produce a result within a FFT computations are also high and apose processor is required, to fulfill thFor instance using application specificcessors, or ASICs could be the solutioters thesis project an ASIC FFT procchoice because of its lower power con
1.2 Scope of the Report
The report is concentrated on pipelinetectures and algorithms that are most sors.
The first part of the report gives a revand FFT algorithm and different apprHW. Some terminologies like radix balgorithms, architectures, etc., are inttroduction
range of applications today. Notdband systems, digital TV, etc., butonics and the SETI project (Search
ystems, which means that the sys-specified time. The work load forbetter approach than a general pur-
e requirements at a reasonable cost.processors, algorithm specific pro-n to these problems. In this mas-
essor will be designed. ASIC is thesumption and higher throughput.
d FFT processors, and what archi-suitable for dedicated FFT proces-
iew on the theory behind the DFToaches to implement the FFT inutterflies, pipelining, commutators,roduced in this part.1
-
1 Introduction
2
The second part of the report describes the main goal of this masters the-sis project, i.e. to design and implement a parameterized pipelined FFTprocessor for transform lengths from 1k to 8k samples per frame. Thesetransform lengths and the parameterization reduces the amount of algo-rithms, architectures, and so on, that could be taken into account whendesigning a processor according to these criteria. Some parts of the the-ory are therefor very briefly describedlimited usefulness in the considered a
What trade-offs have to be made? Whshould be used? What types of simulaing performed? These are some of thethe second part.
1.3 Project RequirementsThe requirements for this masters the1. The transform length shall be able
in powers-of-2.2. The input signal shall be a continu3. The input signal shall consist of on4. The word length of the input and o
able. The internal word lengths sha5. Safe scaling shall be used.6. Data shall be represented with two7. The implemented architecture shal
These are the prioritizations that shou Effect, power consumption and thr
latency within reasonable limits. Linput stream arrives continuously.
SNR is a poor quality measuremenment should therefor not be considThough, this does not mean that thdue to quantization noise.compared to others, because of itsrea.
at architecture and algorithmtions should be done? How is test-questions that will be discussed in
sis project are as follows:to vary between 1k and 8k samples
ous data stream.ly one continuous data stream.utput signal shall be parameteriz-ll also be parameterizable.
s complement format.l be pipelined.
ld be taken mostly into account:oughput are superior to die area andatency is hard to affect when the
t in FFT processors, this measure-ered too important in the design.e output can have too low precision
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
These are the first requirements on the FFT processor that is going to bedesigned. Later in the report new restrictions and requirements will beadded to narrow down the area of investigation even further, to concen-trate the work on the type of architecture found to be most adequate.
1.4 Reading Instructions
This list gives a short description of th Chapter 1 Introduction contains th
will the project be all about? What Chapter 2 Algorithms contains a d
algorithms, not only those that wilalso a few other ones.
Chapter 3 Architectures contains aarchitectures that implement the FFalgorithms that not will be considealong with all the more adequate o
Chapter 4 Numerical Effects contaon the quantization errors in FFT a
Chapter 5 Implementation Choicesradix-22 algorithm and the SDF arbeing implemented.
Chapter 6 Radix-22 FFTs containsrithm and architectural description
Chapter 7 FFT Design contains thproject. Some problems that arose are discussed in this chapter.
Chapter 8 Asynchronous Design coasynchronous circuits, with a focusfocus in this chapter, but it also desture of the final implementation of
Chapter 9 Future Work contains suthis project could be.3
e content of each chapter.e introduction of the project. What will the result of the project be?escription of a lot of different FFTl be considered in the project, but
description of a lot of different FFTT algorithms. Also here some
red in the project will be described,nes.ins a basic theoretical introductionlgorithms and architectures. contains an explanation why the
chitecture is chosen to be to one
the derivation of the radix-22 algo-s of its components.e design methodology used in thisduring the project and the solution
ntains a very basic introduction toon GALS. The methodology is thecribes the asynchronous architec-
the asynchronous FFT processor.ggestions of what the next steps in
-
1 Introduction
4
Chapter 10 Summary contains the summary for the whole project.General thoughts and acquired knowledge are discussed.
Chapter 11 Bibliography contains the references referred to, insidesquare brackets, in the text.
-
2 A2.1 Introduction
The algorithms chapter will introducerithm and different approaches to comwill mainly be focused on FFT algoritalgorithms, will also be described bri
2.2 The DFT Algorithm
A DFT is a transform that is defined a
where
is the N-th root of unity. The inverse o
These equations show that the compleDFTs and IDFTs is O(N2), hence the masters thesis will be very costly in a
X k( ) x n( ) W Nnk
n 0=
N 1
,=
W N ej2N-
=
x n( ) 1N---- X k( ) W Nn k
k 0=
N 1
=lgorithms
the DFT definition, the FFT algo-pute FFTs in HW. The discussion
hms useful for long FFTs, but otherefly.
s
(Eq 2.1)
(Eq 2.2)
f the DFT (IDFT) is defined as
(Eq 2.3)
xity of a direct computation oflong transforms considered in thisstraight forward computation. The
k 0 N 1,[ ]
p
---
n 0 N 1,[ ],5
-
2 Algorithms
6
FFT algorithm deals with these complexity problems by exploiting regu-larities in the DFT algorithm.
2.3 FFT Algorithms
An FFT algorithm uses a divide-and-conquer approach to reduce thecomputation complexity for DFT, i.e.lot of different smaller problems thattion of the original problem.
In a communication system that uses need for an IFFT algorithm. Since theof these can be computed using basicreal and imaginary parts of the input,and imaginary data of the output. Thedata, except for the scaling factor in tthis is not a problem, and this will the
2.4 Common Factor Algorithm
Common factor algorithms are one wthe divide-and-conquer approach. Thiway of computing FFTs. N is then div
where the factors are constrained in th
This basically means that they have onFFT can be computed in i number of stational complex algorithms that can tion-in-frequency) and DIT (decimati
N Nii
=
a i a Ni( )"$ one big problem is divided into ain the end are assembled to the solu-
an FFT algorithm there is also aDFT and the IDFT are similar both
ally the same FFT HW, swap thecompute the FFT and swap the realoutput is now the IFFT of the input
he IFFT algorithm, 1/N. Usuallyrefor not be discussed henceforth.
s
ay of dividing the problem, usings method is the most widely usedided into factors according to:
(Eq 2.4)
e following way:
(Eq 2.5)
e factor in common. In this way anteps. There are two equally compu-
be derived from this, DIF (decima-on-in-time).
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
2.5 Radix-2 Algorithm
The radix-2 algorithm is a special case of the common factor algorithmfor N-point DFTs, where N is power-of-2. To derive the radix-2 algo-rithm, the indices n and k in Equation 2.1 are represented by
where
When these representations are used fDFT definition can be rewritten as
The last term in the right side of Equa
Observe that
By using Equation 2.10 on the differelowing relations are found
n 2 a 1 na 1 2
a 2n
a 2+ +=
k 2 a 1 ka 1 2
a 2 ka 2+ +=
ni ki, 0 1,{ } ,
N 2 a= a
X ka 1 k a 2 k0, , ,( ) x(
na 1 0=
1
n1 0=
1
n0 0=
1
=
W N
2 b nb
b 0=
a 1
2 b kb
b 0=
a 1
W N
2 a 1 ka 1 2
b
b 0=
a 1
=
W NN
e
j2 pN
---------
N
=7
(Eq 2.6)
(Eq 2.7)
or substitution in Equation 2.1, the
(Eq 2.8)
tion 2.8 can be expressed as
(Eq 2.9)
(Eq 2.10)
nt factors of Equation 2.9 the fol-
n0+ 2b
nb
b 0=
a 1
=
k0+ 2b k
b
b 0=
a 1
=
i 0a 1=
na 1 n a 2 n0, , , ) W N
2 b nb
b 0=
a 1
2 b kb
b 0=
a 1
nb
W N
2 a 2 ka 2 2
b
nb
b 0=
a 1
W N
k0 2b
nb
b 0=
a 1
1=
-
2 Algorithms
8
(Eq 2.11)
Insert Equation 2.11 in Equation 2.8
This summation can be divided into s
Finally, to obtain the FFT an unscramoutput data in natural order. Unscram
With this algorithm the computationaO(Nlog2(N)) butterfly operations. Theinto log2(N) different steps, which is ain HW.
The SFG for this derivation of the FFfor an 8-point radix-2 DIF FFT:
G0 W N
2 a 1 ka 1 2
b
nb
b 0=
a 1
W N
2 a 1 ka 1 2
b
nb
b 0=
0
W N2 a 1 k
a 1 n0= = =
G1 W N
2 a 2 ka 2 2
b
nb
b 0=
a 1
W N
2 a 2 ka 2 2
b
nb
b 0=
1
= =
Ga 1 W N
k0 2b
nb
b 0=
a 1
=
X ka 1 k a 2 k0, , ,( )
na 1 =
1
n1 0=
1
n0 0=
1
=
x1 k0 n a 2 n a 3 n0, , , ,( ) x(n
a 1 0=
1
=
x2 k0 k1 n a 3 n0, , , ,( ) x1n
a 2 0=
1
=
xa 1 k0 k1 k a 1, , ,( ) x a
n0 0=
1
=
X ka 1 k a 2 k0, , ,( ) x a =(Eq 2.12)
equential summations
(Eq 2.13)
bling stage is added to reorder thebling is done by bit-reversing.
(Eq 2.14)
l complexity is reduced tocomputation has also been dividedn advantage considering pipelining
T algorithm looks like Figure 2.1
x na 1 n a 2 n0, , ,( ) Gi
i 0=
a 1
0
na 1 n a 2 n0, , , ) G a 1
k0 n a 2 n a 3 n0, , , ,( ) G a 2
2 k0 k a 2 n, 0, ,( ) G0
1 k0 k1 k a 1, , ,( )
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
Figure 2.1: SFG for an 8-point radix-2 DIF FFT.
2.6 Radix-r Algorithm
The radix-r algorithm uses the same adecomposition using base-r instead o
The derivation of the radix-r is analogThe proof will therefore be left out hefor the radix-r case is O(Nlogr(N)) buO(logr(N)) butterfly stages.
2.7 Split Radix Algorithm
The split radix algorithm is one way oplications and additions required, [1].irregular structure compared to mixedrithms. Because of the irregular structparameterization, and will therefore n
2.8 Mixed Radix Algorithm
Mixed radix algorithms is a combinatThat is, different stages in the FFT coFor instance, a 16-point long FFT canone stage with radix-8 PEs, followed
W0
W0W0
W0
W0
W2
W2
W2
W1
W3
x(0)x(1)x(2)x(3)x(4)x(5)x(6)x(7)
X(0)X(1)X(2)X(3)
Stage 1 Stage 2
Unscramble
N r a= r a,9
pproach as radix-2, but with thef base-2. N is factorized as
(Eq 2.15)
ous to the derivation of radix-2.re. The computational complexitytterfly operations divided into
f decreasing the number of multi- The main drawback is the more radix and constant radix algo-ure this algorithm is not suitable forot be studied more thoroughly.
ion of different radix-r algorithms.mputation have different radices. be computed in two stages using
by a stage of radix-2 PEs. This adds
W0
W0X(4)X(5)X(6)X(7)
Stage 3
-
2 Algorithms
10
a bit of complexity to the algorithm compared to radix-r, but in return itgives more options in choosing the transform length.
2.9 Prime Factor Algorithms
Prime factor algorithms decompose N into factors that are relative prime,which means that the greatest commoThere are two reasons why prime factered later in the report. Firstly, it restrthat N cannot be power-of-2, which isscale very good, because for large Nsnumbers will also be large, hence wilmentation of the PEs.
2.10 Radix-r Butterflies
The radix-r butterflies are the blocks tin the radix-r algorithm. The followinSFG structure is derived (for the radixstage will be shown, the other proofs obtained is a radix-2 DIF (decimationand Equation 2.13
The last factor in the summation is noable, hence this factor can be lifted ou
x1 k0 n a 2 n0, , ,( ) x n a 1 n a,(n
a 1 0=
1
=
x na 1 n a,(
na 1 0=
1
=n divisor of the factors is equal to 1.or algorithms will not be consid-icts the transform length in a waya requirement. Secondly, it doesnt
the decomposing relative primel result in a very complex imple-
hat perform the basic computationsg reasoning will explain how the-2 case). Only the proof of the firstare analogous. The butterfly-in-frequency). From Equation 2.11
(Eq 2.16)
t depending on the summation vari-t from the summation
2 n0, , ) W Nk0 2
b
nb
b 0=
a 1
2 n0, , ) W Nk0 2
a 1n
a 1 W N
k0 2b
nb
b 0=
a 2
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
(Eq 2.17)
According to the above computations,made with a structure called radix eleradixes can be derived in a similar waand r outputs for a radix-r element. Tfor the radix-2 case.
Figure 2.2: Structure of a radix-2 DIF b
W N
k0 2b
nb
b 0=
a 2
x na 1 n a 2 n0, , ,( ) W N
k0 2a 1
na 1
na 1 0=
1
=
W Nk0 2
a 1n
a 1 e
j2 p2 a
------------
2 a
2----- k0 n a 1 1( )k0 n a 1= =
=
W NP
x 0 na 2 n0, , ,( ) 1( )
k0+(=
Wp
x1
x0 X0
X111
the basic FFT computations can bement. Radix elements for highery. These elements will have r inputshe figure below shows the structure
utterfly.
x 1 na 2 n0, , ,( ))
+
+ Wp-
x1
x0 X0
X1
-
2 Algorithms
12
-
3 Ar3.1 Introduction
This chapter discusses different architAs in chapter 2 Algorithms, mostly thwill be taken into account. Their advacussed.
3.2 Array Architectures
The array architecture can only be usthe extensive use of chip-area. This coelement (PE) for each butterfly in theFFTs longer than 16 points are not imhence it will not be discussed in detai
Figure 3.1: SFG for an 8-point array F
W0
W0
W0
W2
W2
W2
W1
W3
x(0)x(1)x(2)x(3)x(4)x(5)x(6)x(7)
Stage 1 Stage 2chitectures
ectures used for FFT computations.e architectures useful for long FFTsntages and drawbacks will be dis-
ed for very short FFTs, because ofmes from the use of one processingsignal flow graph (SFG). Normallyplemented with this architecture,ls.FT architecture.
W0
W0
W0
W0X(0)X(1)X(2)X(3)X(4)X(5)X(6)X(7)
Stage 3
Unscramble13
-
3 Architectures
14
3.3 Column Architectures
The column architecture uses an approach that requires less area on thechip than the array architecture. It is done by collapsing all the columnsin an array architecture into one column, hence a new frame cannot beprocessed until the processing of the current frame is finished. Hence,this architecture is not suitable for pipobviously smaller, only N/r radix-r elture. The architecture is still not smalfor long FFTs.
An architectural structure of a 4-poinbelow. To get a simple feedback netwstant geometry FFT (CG FFT) is oftemeans that the connection network insame between all stages.
Figure 3.2: Structure of a 4-point radix
3.4 Pipelined Architectures
Pipelined architectures are useful for throughput. The basic principle with pthe rows, instead of the stages like in ture is built up from radix butterfly elbetween. An unscrambling stage is soput side of the processor, if the outpuorder.
The advantage with these architecturethroughput, relatively small area and
x(0)x(2)
x(1)x(4)elining. The area requirement isements, than for the array architec-l enough to be taken into account
t radix-2 DIF FFT can be seenork a type of structure called con-n used as a starting point. This an array architecture would be the
-2 column architecture.
FFTs that require high dataipelined architectures is to collapsecolumn architectures. The architec-ements with commutators inmetimes added on the input or out-
t data is needed to be in natural
s, are for instance, high dataa relatively simple control unit.
X(0)X(2)
X(1)X(4)W
p
Wp
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
These advantages make this solution suitable for the long FFTs consid-ered in this masters thesis project.The basic structure of the pipelined architecture is shown below. Betweeneach stage of radix-r PEs there is a commutator (denoted C in the pic-ture). The last stage is the unscrambling stage (denoted U in the picture).The commutator reorders the output dfeeds to the following stage. The unscral sorted order.
Figure 3.3: General structure of a pipe
3.4.1 MDC, SDF and SDC Commu
There are basically three kinds of commutator (MDC), Single-path Delay FDelay Commutator (SDC). They all gerties, especially when it comes to tot
A commutator is a switch for data betthe pipeline. It stores parts of the FFTto perform the switching properly. Thdifferent, because it also feeds data ba
The figures below show the structure oa denotes the stage number in the pigives the size of that FIFO buffer in cBF4 is short for radix-4 butterfly elem
...x
radix-rPE C15
ata from the previous stage andrambler rearranges the data in natu-
lined FFT architecture.
tators
mutators, Multipath Delay Com-eedback (SDF) and Single-pathive the architecture different prop-al memory requirement.
ween the radix butterfly stages incomputations temporarily in order
e SDF commutator is somewhatckwards, Figure 3.5.
f the commutators. In these figurespeline. The numbers in the boxesomplex samples. C2 is a switch andent.
Xradix-r
PE U
-
3 Architectures
16
Figure 3.4: Multipath Delay Commutator structure.
Figure 3.5: Single-path Delay Feedback
Figure 3.6: Single-path Delay Commut
3.4.2 Pipeline Architecture Compa
There are many different pipelined armemory requirements, different compsummary of the most common pipelinTable 3.1, [2]. The abbreviations of th
2a
2a
C2
3x4a
BF4
6x4a Commutator structure.
ator structure.
risons
chitectures. They have differentlexities, different utilization, etc. Aed architectures are show ine architecture names are composed
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
in the following way, e.g. R2MDC is short for radix-2 multipath delaycommutator FFT architecture.
The R22SDF architecture is interestinties in the table this architecture is eqtectures, with one exception, the numR4SDC architecture. The R22SDF arcgood candidate to investigate, when cimplemented.
3.5 Multipipelined Architectu
Multipipelined architectures are builtlined architectures, but with the distinline can use two or more radix butterfl
These architectures achieve a higher parchitectures, [5]. The improvement inof pipes introduced.
3.6 SIC FFT Architectures
SIC FFT Architectures can be a goodments are not high compared with the[1]. In this architecture all the butterflA radix PE reads data from the memocomputation it writes the data back to
Table 3.1: Pipeline architecture comparison.
Architecture Multiplier # Adder # Memory size ControlR2MDC 2(log4 (N-1)) 4log4N 3N/2 - 2 SimpleR2SDF 2(log4 (N-1)) 4log4NR4SDF log4 (N-1) 8log4NR4MDC 3(log4 (N-1)) 8log4NR4SDC log4 (N-1) 3log4NR22SDF log4 (N-1) 4log4N17
g. When it comes to these proper-ual to or better than the other archi-ber of adders is 25% lower in thehitecture will therefore be a really
hoosing the architecture that will be
res
up in a similar way as normal pipe-ction that some stages in the pipe-y elements.
arallelism than regular pipelinedparallelism is equal to the number
choice when throughput require-throughput of available butterflies,
y elements share the same memory.ry and when it is finished with thethe memory. This results in a lot of
N - 1 Simple
N - 1 Medium
5N/2 - 4 Simple
2N - 2 Complex
N - 1 Simple
-
3 Architectures
18
memory accesses, which could be both hard to implement and be costlyin power consumption.
The architecture can be adapted to the requirements specification moreprecisely by adapting the number of radix PEs. For some specificationsthis architecture reduces radix PEs, which reduces both the die area andthe power consumption.
Figure 3.7: Structure of the SIC FFT a
3.7 Cached-FFT Architecture
Cached-FFT architectures are mainlysumption, [4]. The idea is to use a cacPEs and the main memory to decreasaccesses, which is very energy consum
.
.
.
Memor
radix-PE
radix-PErchitecture.
s
used for reducing the power con-he-memory between the radix-r
e the number of main memorying.
y
r
r
-
4 Numer4.1 Introduction
DSP systems almost always suffer frothe limited internal data word length. two operands always gives a result thoperands. The result have to be truncanal word length, hence quantization oSection 4.3 on page 21.
There is another thing in FFT systemsoverflow. If an overflow occurs in an Foutput. How this is solved will be dis
4.2 Safe Scaling
To prevent operations to overflow andscaling is used. It means that the outpto the first FFT stage are scaled in suctation in the next radix PE will not ovsion by a power-of-2 number, becausearithmetic right shift.
To simplify the discussion about safeused as an example. The safe scaling given without detailed discussion.ical Effects
m quantization effects, because ofFor instance, a multiplication byat is longer in bits than each of theted or rounded to avoid long inter-ccurs. Quantization is explained in
that have to be considered, and it isFT system it will generate a faulty
cussed in Section 4.2 on page 19.
cause errors a method called safeut from each radix PE and the inputh a way, that for certain the compu-erflow. The scaling is often a divi- it easily can be implemented by
scaling, the radix-2 case will befor the radix-r FFT algorithm is19
-
4 Numerical Effects
20
4.2.1 Radix-2 Safe Scaling
In a radix-2 butterfly element, overflow can occur for signals in wire Aand B, see Figure 4.1, after the summations. Overflow can, in reality, notoccur after the twiddle factor multiplication, because the absolute valueof the twiddle factor is always very close to one. Hence, the absolutevalue does not change, only the argum
Fractional twos complement is goingtherefore signals that can be represenabsolute value after each summation cas the input. Hence, to prevent the ouvalue of the input signals have to be sthat the input signals are in this rangeprevious butterfly stage by a factor ofoverflow and it only requires a little exin the FFT implementation.
Figure 4.1: Problem areas in a radix-2
No overflow will now occur in the radexcept for the first butterfly element. Tbe scaled. The real and imaginary inpthe range [-1, 1[, the absolute value coprinciple the input should be scaled bvalue of the first radix PE in the rangecheap to implement in HW and a scalbecause it can be implemented using
The absolute value of the output of thdue to the last safe scaling. To use theues a last stage called final scaling is multiplication by 2, increasing the abrange [-1, 1[.
-
x1
x0A
B
+
+ent.
to be used in this FFT processor,ted is in the range of [-1, 1[. Thean in the worst case be twice as big
tput from overflow the absolutemaller than 0.5. One way to ensureis to divide the output signals in the2. This method will always preventtra HW, hence, this method is used
element.
ix-2 butterflies using this method,he input to this butterfly also has tout to the FFT processor will be inuld therefore be as large as 20.5. In
y a factor of 1/21.5, to get the input [-0.5, 0.5[. This division is noting factor of 1/4 is a better choice,arithmetic right shift.
e FFT processor is smaller than 0.5, whole range of representable val-often added. This stage performs asolute value of the output to the
X0
X1Wp
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
4.2.2 Radix-r Safe Scaling
Radix-r safe scaling is similar to radix-2 safe scaling. The only differenceis that the scaling factor in the radix elements are 1/r instead of 1/2, theprescaling stage is 1/2r instead of 1/4 and the final scaling is a multiplica-tion of r instead of 2.
4.3 QuantizationQuantization occurs after each multipintroduced by quantization are modelmodelling. In this technique stochastiSFG where quantization occurs. Fromtions can be made to estimate the amoput by quantization.
4.3.1 Twos Complement QuantizaThe quantization can be done either bapproaches give different statistical pThe representation of a fractional two
This definition shows that the valuesuted in the [-1, 1[ interval, hence the qon the magnitude of the value. The diues is equal to the truncation error. Hnon-zero expectation value.
Rounding is a better quantization metof the rounding error is zero.
x x0 xi 2i
i 1=
W d 1
+=
0 D t1
2W d
-----------
1
2W d
--------- D r
2--
21
lication in the radix PEs. The errorsled with a technique called noisec noise sources are added to the the new SFG statistical calcula-unt of noise introduced in the out-
tiony rounding or by truncation. Theseroperties on the quantization error.s complement number is given by
(Eq 4.1)
it can represent is uniformly distrib-uantization error does not depend
fference between two adjacent val-ence, the truncation error has a
(Eq 4.2)
hod, because the expectation value
(Eq 4.3)
xi 0 1,{ }
1-----
1W d-------
-
4 Numerical Effects
22
Assuming that the data is uniformly distributed in the range of [-1,1[, theerrors are evenly distributed in the above given intervals. The variance ofa stochastic variable like this is
(Eq 4.4)
4.3.2 Radix-2 QuantizationThe quantization discussion in this sebutterfly elements with rounding and properties, the first is that no quantizasecond that quantization occurs at botcause overflow with safe scaling and tions when using safe scaling.
Figure 4.2: Quantization in a radix-2 D
The quantization, denoted Q, in Figuradding a complex stochastic variablepart and the imaginary part can be seevariables.
The expectation value and variance o
s
2 112------
1
2W d
---------
2=
-
x1
x0
+
+
Wp
n nre j n+=
E nre{ } E nim{ }=
V nre{ } V nim{ }= =ction is only considering DIF radixsafe scaling. The scaling gives twotion occurs in the adders and theh output nodes. The adders neverboth output nodes have multiplica-
IF PE.
e 4.2 can be modelled as an addern to the original signal. The realn as two independent stochastic
(Eq 4.5)
f the complex noise are
X0
X1
Q
Q
1/2
1/2
im
0=
112------
1
2W d
---------
2
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
(Eq 4.6)
The analysis are for a single radix-2 Dthe error analysis in the FFT. Considethe SFG in Figure 2.1. The error in thall stages in the binary tree that is formas root. Since safe scaling is used eacdivided by 2 before it is added to the nfrom the input to the output through ting, the noise variance for an N-point
4.3.3 Radix-r QuantizationThe noise analysis of radix-r DIF quaDIF quantization analysis, [1]. For thein the following way
Equation 4.8 shows that the noise is ltions, even though it has fewer butterfl
E n{ } E nre j n im+{ } 0= =V n{ } E n n{ } E nre
2nim
2+{ } E nre
2{ } E nim2{ }+ V nre{ } V nim{ }+= = = =
V n{ } 16---1
2W d
---------
2 s BF
2= =
s FFT2
s BF2 1 22 2 12 4 12---
2+ + +
=
s FFT2
s BF2 22 1
2i----
i 0=
log2 N( ) 1
s BF2 2= =
s FFT2
s BF2
r2 1 1
r---
1
rlogr N(
-----------------+ + +
=23
IF PE. This result can be used forr the error in only on output node inat node is then the summation overed with that particular output node
h error from a previous node isext stage. By propagating the error
he radix-2 PEs and their safe scal- FFT [1] can be written as
(Eq 4.7)
ntization is similar to the radix-2radix-r FFT the variance would be
(Eq 4.8)
arger for higher radix implementa-y stages.
2log2 N( ) 1 1
2log2 N( )
-------------------
2+
3 1 2log2 N( )
( ) 8 s BF2 1 1
N----
=
) 1--------
s BF2 r
3
r 1----------- 11N----
=
-
4 Numerical Effects
24
-
5 Impl
5.1 Introduction
The first part of the masters thesis reapproaches to the FFT problem. Thischoice of an adequate algorithm, archwhich the FFT processor is going to bchoices will be explained. The chosenone going to be implemented later in
5.2 Algorithm Choice
There are no restrictions on the algoririthm should be able to compute the Fselecting algorithm, the goal of an arcdesign have to be considered.
The radix-2 FFT algorithm has manylow quantization noise level, and it isdifferent FFT lengths.
Radix-r does not seem to be as good ait has higher quantization noise, and table to the different FFT lengths. To brithm to the general power-of-2 FFT lto be used in the pipeline, resulting inementationChoices
port is only a summary of different chapter is going to explain theitecture, and so on, for the area ine used. A lot of trade-offs and architecture in this section is thethe project.
thm choice, except that the algo-FT lengths from 1k to 8k. Whenhitectural and easy understandable
good features. For example, it has also easily parameterizable to the
choice as the radix-2 one, becausehat it is not as easily parameteriz-e able to parameterize this algo-
engths, different radix-r stages have a mixed radix implementation.25
-
5 Implementation Choices
26
Since minimizing the number of multipliers is important, a good choiceof algorithm would be the split radix one. It has a lower number of multi-pliers than all the above ones, but this algorithm results in a complexdesign, which will be harder to parameterize. The control of this type ofprocessor would also be more complex.
The radix-22 algorithm is the most attof as a radix-4 algorithm with radix-2of multipliers, simple control structur
Prime factor algorithms can not be uscan not be calculated using this algor
These are the reasons that the radix-2in the FFT implementation.
5.3 Architecture Choice
The choice of the architecture is easieture. What is left to decide is what kinarchitecture, MDC, SDF or SDC. In tcommutator will be used, because it hthan the other commutators.
The radix-22 architecture behaves a bcalls for two different architectures ladifferent FFT lengths needed. The firslengths, i.e. 1k and 4k FFTs, and the sFFTs. The latter lengths can easily beadding a radix-2 stage at either the inractive algorithm. It can be thoughtbuilding blocks. It has low numbere and architecture.
ed, because the right FFT lengthsithm.
2 FFT algorithm is going to be used
r, it has to be a pipelined architec-d of commutators to be used in the
he implementation the SDF type ofas a smaller memory requirement
it like the radix-4 architecture, thister, to be able to implement all thet architecture is for power-of-4 FFTecond architecture is for 2k and 8k created from the former ones byput or the output of the FFT.
-
6 Rad6.1 Introduction
This chapter will describe the radix-2background to the algorithm and the a
6.2 Algorithm
The derivation of the radix-22 FFT alwith a 3-dimensional index map, [2]. can be expressed as
When the above substitutions are applcan be rewritten as
Where
nN2----n1
N4----n2+=
k k1 2k2 4+ +=
X k1 2k2 4k3+ +( ) xN2----n1 +
n1 0=
1
n2 0=
1
n3 0=
N4---- 1
=
BN2----
k1 N4----n2 n+
n2 0=
1
n3 0=
N4---- 1
=ix-22 FFTs
2 FFTs in detail. The mathematicalrchitecture, will be discussed.
gorithm starts with a substitutionThe index n and k in Equation 2.1
(Eq 6.1)
ied to DFT definition, the definition
(Eq 6.2)
n3+ N
k3 N
N4----n2 n3+
W N
N2----n1
N4----n2 n3+ +
k1 2k2 4k3+ +( )
3
W N
N4----n2 n3+
k1
W N
N4----n2 n3+
2k2 4k3+( )27
-
6 Radix-22 FFTs
28
(Eq 6.3)
is a general radix-2 butterfly.
Now, the two twiddle factors in Equation 6.2 can be rewritten as
Observe that the last twiddle factor inrewritten.
Insert Equation 6.5 and Equation 6.4 summation over n2. The result is a DFFFT length.
The result is that the butterflies have tbutterfly takes the input from two BF
These calculations are for the first radthe BF2I and BF2II butterflies. The Bby the formulas in brackets in Equatioouter computation in the same equatiois derived by applying this procedure
BN2----
k1 N4----n2 n3+
xN4----n2 n3+
1( )k1 x N4----n2 n3N2----+ +
+=
W N
N4----n2 n3+
k1 2k2 4k3+ +( )W N
N n2k3W N
N4----
=
j( )n2 k1 2+(=
W N4n3k3
e
j2 pN
------------ 4n3k3e
4----= =
X k1 2k2 4k3+ +( ) H k1 k,([n3 0=
N4---- 1
=
H k1 k2 n3, ,( ) x n3( ) 1( )k1
x n3N2----+
+ (+=(Eq 6.4)
the above Equation 6.4 can be
(Eq 6.5)
in Equation 6.2, and expand theT definition with four times shorter
(Eq 6.6)
he following structure. The BF2II2I butterflies.
(Eq 6.7)
ix-22 butterfly, or its componentsF2I butterfly is the one representedn 6.7 and the BF2II butterfly is then. The complete radix-22 algorithmrecursively.
n2 k1 2k2+( )W N
n3 k1 2k2+( )W N4n3k3
k2)W Nn3 k1 2k2+( )W N
4n3k3
j2 pN
-------- n3k3W N
4----
n3k3=
2 n3, )W Nn3 k1 2k2+( )]W N
4----
n3k3
j ) k1 2k2+( ) x n3N4----+
1( )k1x n33N4-------+
+
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
6.3 Architecture
The first butterfly, the BF2I, in the radix-22 butterfly has the followingarchitecture.
Figure 6.1: BF2I DIF butterfly architecture.
The second butterfly, the BF2II, has tbelow. The BF2I butterfly is a radix-2fly basically is a radix-2 butterfly but cations.
Figure 6.2: BF2II DIF butterfly archite
-
-
xr(n)
xi(n)
xr(n+N/2)
xi(n+N/2)
+
+
+
+
xr(n)
xi(n)
xr(n+N/2)
xi(n+N/2)29
he architecture seen in the figurebutterfly, whereas the BF2II butter-with trivial twiddle factor multipli-
cture.
1
1
0
0
0
0
1
1Zr(n+N/2)
Zi(n+N/2)
Zr(n)
Zi(n)
s
-
-
1
1
0
0
0
0
1
1Zr(n+N/2)
Zi(n+N/2)
Zr(n)
Zi(n)
s
+
+
+
+
t
+
+-
-
6 Radix-22 FFTs
30
A radix-22 SDF FFT architecture with these radix butterfly elements plusmultipliers is shown in Figure 6.3. This architecture uses the sameamount of non-trivial complex multipliers as the radix-4 architecture, butretains the simple radix-2 architecture. Another advantage is that the con-trol structure is simple for this implementation, only a binary counter.The block in the feedback loop is a FIFO buffer, the number indicates thenumber of complex samples it can sto
Figure 6.3: Architecture of a 64-point r
6.4 Numerical Effects
Numerical effects in the radix-22 algoradix-2 algorithm, because it has the sfewer multipliers. For a description oradix-22 algorithm, see the radix-2 invEffects.
clk 5 4 3
x(n)
W1(n)
X
BF2I BF2II
1632
s st
BF2I
8
sre.
adix-22 SDF FFT.
rithm is exactly the same as in theame butterfly structure, but with
f the numerical effect for theestigations in chapter 4 Numerical
02 1
X(k)
W2(n)
X
BF2I BF2II
12
BF2II
4
ssst t
-
7 F7.1 Introduction
This chapter discusses the design of ation levels, block division, trade-offs,discussed in more detail.
In general, the design in this project sodology in small refining steps. The fiand the final model will be FPGA synprocess will go from the former to thei.e. only small changes will be introduThese smaller design steps will hopefdesign process and smaller amount ofsome point in the design process the MVHDL description, but it is hard to knfor this conversion will be.
The models should be implemented wThis leads to a more reusable and easmentation.
7.2 Matlab Design
The design of the FFT processor begitional model in Matlab. The advantagMatlab is that Matlab offers a high levgood interface for testing. This meansent models can be tested.FT Design
n FFT processor. Different abstrac- simulations, testing, etc. are things
hould be done with top-down meth-rst model will be built in Matlabthesizable VHDL code. The designlatter in several small design steps,ces in the model in each step.
ully lead to a more predictableerrors introduced when refining. Atatlab model has to be converted toow in advance when the best time
ith a good hierarchical architecture.ier understanding of the final imple-
ns with the design of a simple func-e of starting the design process inel programming language and a that in a short time a lot of differ-31
-
7 FFT Design
32
The first Matlab model was designed with a bottom-up methodology at ahigh level of abstraction. Actually a top-down methodology should havebeen used, but it seemed like a better solution to do the first model in thisway, because a good description of the algorithm and architecture wasavailable [2]. Some things in the algorithm were left out, which latercaused problems that delayed the project for around two weeks. The bot-tom-up methodology was only for thefrom that point and onwards a top-do
7.2.1 Problems and Solutions
All the different blocks were easily imtor generating block. The other blockwere described in detail in the paper [almost completely left out. Only the dhint about the functionality of the blomodels to get the FFT to work, the 16i.e. two radix-22 stages. To get it to wproblem, because this was the part whFinally it was solved through more stunderstand them better.
7.3 Matlab Simulations
Matlab simulations are an important plation shows if the models developed the final FFT model were tested throuferent blocks in the processor were te
Most simulations were done to get anthe output. The error depends on the dnumber of steps (depends on the FFTthese simulations were carried out to tion techniques. Different optimizatioSection 9.2 on page 49.
Simulations were also done to compavalidate that they are functionally equ creation of the first model andwn methodology was used.
plemented, except the twiddle-fac-s where easy to understand and2], but the twiddle-factor block waseduction of the algorithm gave ack. After testing a lot of different-point FFT finally worked correct,ork for longer FFTs was a real hardere some descriptions were left out.udying of the FFT formulas, to
art in the design process. The simu-are functionally correct. Not onlygh simulations, but also all the dif-sted separately.
estimation of the size of the error inata-widths in the processor and the length) in the pipeline. A lot oftest different data-width optimiza-n techniques are discussed in
re two models against each other, toivalent.
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
7.4 VHDL Design
The VHDL design started when the model of the parameterizable FFTprocessor were decided to be correct after a lot of simulations. The stepfrom Matlab to VHDL should be as small as possible. The models shouldhave the same blocks and their implementation should be the same. InVHDL it is possible to write functionVHDL model should not differ so mu
The design environment that was useding, Vcom for VHDL compiling and is a tool for graphical representation ois easier to understand with a total texproject with a lot of parameterization
7.4.1 Problem 1 and Solution - Abs
In the Matlab model signals are descrIEEE library have support for complerepresentation and not for the signed sentation. Hence, the first refinement separate the complex signals into twoand the other holding the imaginary vlems, for one reason the abstraction leanother reason some of these models lab.
One problem that the division of the cincreased the number of signals in theincreasing the block complexity. Incrdecreases the ease of understanding it
The first approach to solve this probleelements of a std_logic_vector, to crenals. However, it was impossible twocode wouldnt compile because the arbefore compile-time, and the word lenterized.33
al models so the Matlab and thech.
was Emacs for VHDL text-edit-Vsim for VHDL simulation. Thereff block structures, but the structuret representation, at least in this
.
traction 1
ibed by complex variables. Thex signals, but only for floating pointfractional twos complement repre-between Matlab and VHDL was to signals, one holding the real valuealue. This didnt cause much prob-vel was almost the same, and forwere already implemented in Mat-
omplex signals caused was that it blocks almost by a factor of two,
easing the block complexity.
m was to create an array with twoate an abstraction of complex sig- make a construct in this way. Theray elements have to be constrainedgth couldnt therefore be parame-
-
7 FFT Design
34
The second approach also used arrays. The approach used an array withword length number of std_logic_vector(1 downto 0). This construct iscompilable. The problem with it is that slices couldnt be used, a specialfunction would have to be written to extract the information. The codewould be easy to read, and it might even be synthesizable, but the testingwill be more complicated. The std_logic_vector in this case only storestwo bits, one for the real value and onebench it will therefore not be easy to
The third approach was to create an ahave a size of 2 x the word length. Thdrawbacks. Both dimensions in the deunconstrained, the best way would beand the other as a parameter. The drawsignal two dimensions have to be givefor declaring the real and imaginary dtion of signals in this way only makes
The final approach, the one that was udoned the abstraction of signals. The good solution. An extra layer in the bing the number of signals in each blocable complexity and length.
7.4.2 Problem 2 and Solution - ObjThe abstraction problem described abobject oriented variant of VHDL. Thefunctionality with an extra layer on topreprocessing stage, i.e. digital structferent from VHDL. To synthesis this,into VHDL-code, the rest of the stepssteps.
This solution wasnt used because VHrequirements, and because most peopextra layer. The lack of understandingthe future.for the imaginary value. In the testread the signal values.
rray of std_logic. The array wouldis is a good way but it has someclaration of the array have to beto be able to set one dimension to 2backs is that in the declaration of an, one for the word length, and oneimension (always 2). The declara- the code a bit less readable.
sed in the implementation, aban-reason was that there was anotherlock structure was added, decreas-k. This gave VHDL-files of reason-
ect Orientationove could have been solved with anre are some attempts the create thisp of VHDL. One solution had aures were written in a language dif- the code first had to be compiled are the usual VHDL-synthesis
DL had to be used according to thele dont understand the code of the would limit the use of the code in
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
When writing normal non-parameterized VHDL-descriptions the benefitof object orientation might not be as large as for highly parameterizedsystems like the FFTs considered in this project.
7.4.3 Problem 3 and Solution - Control Block
The control problem arose in the syncdata signals, i.e. that the right data shocontrol signals.
To get the FFT processor to work, shibetween each radix-22 stage and betwinside the radix-22 stage. The result wthat the latency between input and ouis not a big problem, but the extra HWpower consumption.
This problem could be solved in two unit or creating a system consisting omunicating asynchronously (GALS).completely synchronous would increaunit, resulting in a system that is hardewould only have a slightly different cthe original one. The blocks would alother functionally, which could be a gdesign later in the future. These pros aof the FFT processor as a GALS-syst
7.5 Design for Test
Design for test is a way to speed up thThis design method is used to find fabin the printed chip causing logical errment often used in this context is faulcoverage of stuck-at faults. A fault coof the die area of the chip was not corcess.35
hronization of control signals anduld be available in the right state of
mming delays had to be addedeen the two butterfly elementsas that more HW was needed and
tput frames increased. The latency will increase the die size and the
ways, either changing the controlf locally synchronous blocks com-The first choice of keeping the FFTse the complexity of the controlr to understand. The second choice
ontrol structure, but very similar toso be more separated from eachood property when improving thend cons lead to the implementation
em.
e testing of manufactured chips.rication faults in the chips, e.g. dustors, not design errors. A measure-t coverage, which often means theverage of e.g. 95% means that 95%rupted during the fabrication pro-
-
7 FFT Design
36
Design for test have not yet been considered, due to the early stage of theproject.
7.6 VHDL Simulations
Test bench code-skeletons were produced for combinatoric, synchronousand asynchronous parts. These test bespecific test bench for a block. Input awere read and written to test files. TheMatlab where the VHDL simulationssimulations.
To ease the interfacing between MatlaMatlab functions were developed. Thtion of test data and the reading and w
7.7 Synchronous or Asynchro
Both synchronous design and asynchdisadvantages. The reason for choosinthis project can be found in Section 7chronous circuits and the design of thpage 41.
7.8 Testing
Testing is an important part of the proThis section will outline the testing sttions have been describing simulationbut this section looks into this area m
7.8.1 Random Testing
Random testing is the most frequentlyare generated by Matlab, written to thtest benches. This type of testing is quare generated automatically, and it is because input signals often seem to bnches could easily be altered to and output from the test benchesse files could then later be read intocould be compared with the Matlab
b and VHDL simulation a set ofese functions handled the genera-riting of binary data to the test files.
nous Design
ronous design have advantages andg asynchronous design (GALS) in
.4.3 on page 35. More about asyn-ese can be found in Section 8 on
ject and testing is done on all levels.rategy of this project. Previous sec-s, which also is a form for testing,ore thoroughly.
used method. Random sequencese test files and read by the VHDLick to use because test sequences
also adequate in the area of FFTse randomly distributed.
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
7.8.2 Corner Testing
Random testing is good, but some things is hard to detect, e.g. cornercases. Corner cases are input sequences that the designer or tester thinkcan cause errors, e.g. overflows in adders and multipliers. In the FFT areaa corner case could be an input sequence of only maximum and mini-mum input values.
In this project corner testing is mostlyworks as in should, resulting in no ovethe output.
7.8.3 Block Testing
Block testing is the lowest level of tesother sub-block, the sub-blocks are rueach sub-block is verified. When it isrectly it can be assumed that if the blonection of sub-blocks. This method liwill save a lot of debugging time.
7.8.4 Golden Model Testing
Golden model testing is the best way tit should. The Golden model is a modrectly, which new implementations ofcompared against.
In the case of this project the golden mMatlab. The Matlab function can of ccomplete FFT, and not really the sub-Matlab model is thoroughly tested ansub-blocks can be used as golden mo
7.8.5 FPGA Testing
FPGA tests is the final step in the testcomputer software is time consuminglarge test vectors. VHDL code synthetest a lot, probably several orders of m37
done to check that the safe scalingrflows which would cause errors in
ting. Before a block is built out ofn through a block test to ensure thatknown that all sub-blocks work cor-ck errors it is due to the intercon-
mits the area where the error is and
o ensure that a model is working asel that is known to be working cor- the same functionality could be
odel is the built-in FFT function inourse only be used when testing theblocks of the system. Since thed it is assumed to be correct, thedels for the testing of later designs.
ing process. Doing simulations in, and it is therefore difficult to runsized to an FPGA will speed up theagnitude.
-
7 FFT Design
38
The Virtex-II V2MB1000 Development Board was used for the FPGAtests. The choice to use this board is that it can handle large designs. Thedevelopment board has a lot of sockets for interfacing with other compo-nents, i.e. RS232, parallel input, ethernet and on board switches.
To do tests in real-time, test vectors have to be sent and received from theFPGA in full speed. This requires a hFPGA board interfaces, since all test FPGA board. These interfaces wouldfit the time-plan of this thesis project,therefore the real-time requirement is
The full speed test can be done by loaon-board, run the simulation in full-spmemory, and finally reading out the Fthe block schematic of the test bench
Figure 7.1: Vertex-II asynchronous FFT
To test the design with this block strufor this there is not enough time left iinterface have to be written for the meas the code for the wrapper and test cnicate with the test board. To design atime far beyond the limit.
TestController
FFT
Wrapper
Data
Addr
RS232
FPGA chipVertex-II boardigh bandwidth through one of thevectors cannot be stored on the take to much time to implement tohence a simpler test is required and dropped.
ding all test vectors into a memoryeed while writing the output toFT output from the memory. Seein the figure below. test bench.
cture would be possible, but evenn the project to finish the test. Anmory and the RS232 port, as well
ontroller, and a program to commu-ll this would prolong the project
bus
ess bus Memory
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
The RS232 port seems to be the easiest way to interface with the Virtex-II board, so the other solutions using other ports would also extend theproject beyond limits. Hence, another way of testing is required.The final test bench was completely embedded on the FPGA chip. Sincethe FPGA chip does not have a large memory (ROMs and RAMs) capac-ity, the test vectors have to be small. Ia smaller FFT was tested. A 16-pointsmallest FFT processor in this projectthe two different butterflies, prescalintor multipliers.
Figure 7.2: The implemented FPGA tes
The Input generator was implementedand an asynchronous wrapper. The Osimilar way, but with an extra block ththe expected data. A difference betwedata would trigger the test bench intodiode on the FPGA board.
The result of the test showed that it wprocessor to an FPGA.
7.9 Synthesis
The VHDL-code written in this projesynthesis of the code two programs wXilinx Design Manager. The synchrobut the asynchronous ones caused a lopage 46.
FPGA chip
Inputgenerator Req
Ack
Data39
nstead of testing a 1024-point FFT FFT was selected because it is the that includes all components, i.e.g, final scaling and the twiddle fac-
t.
with a ROM memory, a counterutput tester was implemented in aat compared the received data withen the received and the expectedan error mode, which would light a
as possible to synthesize the FFT
ct should be synthesizable. For theere used, LeonardoSpectrum andnous parts were easily synthesized,t of problems, Section 8.7 on
FFTOutputtesterReq
Ack
Data
-
7 FFT Design
40
7.10 Meetings
Meetings were held regularly. Every meeting had a written protocol and aminutes were written directly after the meeting. The minutes were thensent by e-mail to the examiner and the supervisor.
In the beginning the meetings were hedirections of the project, and later theprogress in the work.
The meetings helped a lot in the beginwhat was going to be done. If something minutes could be read to find the aprotocol for the next meeting.ld to define the limitations and meetings mostly described the
ning, because it defined clearlying was forgotten, the correspond-nswer, if not, it was included in the
-
8 Asy
8.1 Introduction
An asynchronous design methodologproject to solve the problem with thetion. Asynchronous design has many son for using it was to test if it could
This chapter will introduce asynchronthe asynchronous design process used
8.2 Asynchronous Circuits
Asynchronous circuits are a big area owill be used in this project, the theorylined. For the interested reader more mcan be found in [3].
Asynchronous circuits have many advlike Performance of an asynchronous s
latency, not the worst case latency Global clock timing problems are The power consumption can be low
despite the fact that they require mnchronousDesign
y was used in this masters thesiscontrol signal and data synchroniza-interesting properties, but the rea-solve the control problem.
ous circuits and it will also describe in this project.
f research and only small parts of it will therefore only be briefly out-aterial on asynchronous circuits
antages over synchronous ones,
ystem depends on the average caseas in synchronous circuits.avoided.
er in asynchronous circuits,ore HW.41
-
8 Asynchronous Design
42
Asynchronous systems are more robust against temperature and sup-ply voltage variations.
Lower electrical interference with other components, due to the lackof clock harmonics in the emission spectra.
No global clock is used in asynchronous circuits, instead some form ofhandshaking is used in the communicples of handshaking protocols are 2-pthis project 4-phase handshaking willblocks already have been implemente
8.3 GALS
GALS, abbreviation of Globally Asynare a small subset of asynchronous sypletely asynchronous, they consist of nication asynchronously. In Figure 8.synchronous system, Req is short for acknowledge. Req and Ack performs push communication channel will be ducer of data initiates the handshakin
Figure 8.1: GALS asynchronous comm
A 4-phase handshaking cycle is perfomeans Req goes high): Req+, Ack+, Rbe valid between Req+ and Ack-, but
GALS combine some of the benefits benefits from asynchronous ones. Sinthey can be designed the same way asGlobally the system is asynchronous,global signals, which is an increasinglarger and faster.
LS-systemReq
Ack
Dataation between systems. Two exam-hase and 4-phase handshaking. In be used, because these interfacingd in VHDL.
chronous Locally Synchronous,stems. These systems are not com-synchronous sub-systems commu-1 the LS-system is a locallyrequest and Ack is short forthe handshaking. In this project aused, which means that the pro-g.unication.
rmed in the following way (Req+eq- and finally Ack-. Data should
is often sampled on the Ack+ edge.
from synchronous circuits with thece the local parts are synchronousbefore using available design tools.which removes timing problems for problem when designs are getting
LS-system
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
8.3.1 Asynchronous Wrappers
An asynchronous wrapper is used for the handshaking between two mod-ules. The wrapper consists of three components, a stretchable clock gen-erator, a demand-port (D-port) and a poll-port (P-port). Connectedaccording to the figure below they implement a 4-phase push-channel,[3].
Figure 8.2: Asynchronous wrapper com
The En signal is a control signal frominput and output of the system. En ontem is ready to receive data, and En oLS-system is ready to send data. The Sdata is available or the receiving blockfor the LS-system.
8.3.2 Enable generation
The enable signal can be generated inthis problem an easy solution is possinents continuously and the same amousample. The system both needs data aThough, the output is delayed compaenable generation has to be delayed ction. The result is that the enable contmostly from a counter.
LS-sy
P-in
Stretcha
Req
Ack
Str
En
Data43
ponents.
the LS-system, that controls thethe input controls when the LS-sys-n the output controls when thetr signal stops the clock if no inputis busy. The lclk signal is the clock
a lot of different ways. Though, forble. Data flows through the compo-nt of processing is needed for eachnd can send data each clock cycle.red to the input, i.e. the outputompared to the input enable genera-rol can be a separate block built up
stem
D-out
ble clock
Req
Ack
Str
Data
En
lclk
-
8 Asynchronous Design
44
8.4 Design Automation
To be able to quickly convert the available synchronous implementationto a GALS one, the design was automated. The wrapper was imple-mented as one component, a VHDL code skeleton was created for oneLS-system with an asynchronous wrapper and data registers, and testbenches for these systems were createchronous component in this project ubelow.
Figure 8.3: General architecture of asy
The skeleton according to this structuLS-system of the considered type.
The only difference between two diffcontrol block, which of course depenwrapped. In the synchronous implemwhereas when the system is divided inblock have to have its own local contrwere similarly implemented as the glo
The skeleton above only works for a sinput is only received from one blockondly, the structure implements a pusThough, it is easy to change it to a pul
Req
Ack
En
Data
Wr
Conbl
Reg LS-sd. The general structure of an asyn-ses the architecture in the figure
nchronous components.
re is easy to modify to wrap any
erent asynchronous blocks are theds on what LS-system is beingentation the control block is global,to several asynchronous block eachol block. The local control blocksbal one.
pecial type of architectures. Firstly, and sent to another block. Sec-h-channel type of communication.l-channel type by reversing the Req
Req
Ack
En
apper
trolock
lclk
Data
ystem
Controlsignals
Controlsignals
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
and Ack signals, and swap the D-port for a P-port and the P-port for aD-port in the wrapper.
8.5 Asynchronous FFT Architecture
The asynchronous FFT is built up by connecting the wrapped LS-sys-tems, which is illustrated in the figure
Figure 8.4: The architecture of the asyn
The number of stages are determinedFigure 8.5 shows the FFT structure wIn the case a 2 times power-of-4 lengtand twiddle factor multiplier stage hathe processor. The reason for choosinside, which also is possible, is that a lradix-22 (1024-point FFT) stages andbetween. This layout would not changstages are added on the input side.
Figure 8.5: The architecture of the asyn
Decomposing the radix-22 blocks intoremoved the timing problem of the in
Figure 8.6: The architecture of the radi
The asynchronous butterfly stages, thand the scaling stages were designed described in Section 8.4 on page 44.
Asyncprescaling
AsynFFT
Data
ReqAck
AsyncTwiddleFactor
Multiplier
Async2
Radix-2AsyRadi
Data
ReqAck
Asyncbf2istage
Abs
Data
ReqAck45
below.chronous FFT, including scaling.
by the FFT transform length.ith a power-of-4 transform length.h is wanted, an extra butterfly stageve to be added on the input side ofg the input side and not the outputayout could be done of the first five the twiddle factor stages ine in the case when extra butterfly
chronous FFT.
two different asynchronous unitsternal control signals in the stage.
x-22 block.
e twiddle factor multiplying stagesaccording to the methodology
Asyncfinal
scaling
c Data
ReqAck
nc2x-2
Async2
Radix-2
Data
ReqAck
syncf2iitage
Data
ReqAck
-
8 Asynchronous Design
46
8.6 Testing
The testing of the asynchronous circuits are similar to the testing of thesynchronous ones. A skeleton test bench were implemented that allowedthe use of the same input and output files as for the testing of the synchro-nous design.
8.7 Synthesis
The synthesis process of the asynchrothe VHDL simulations the asynchronare not synthesizable, but synthesizabavailable. These blocks, since they arlogic to be functionally correct and resynthesis tools to optimize the designthis was the problem when the FPGAFPGA does not offer any means of seinside the chip.
The synthesis tools have a possibilityand the problem with the removed logsuming problem was to find out how removing logic in certain componentsattribute (keep) in the VHDL code an(PRESERVE_SIGNAL) in the tcl-scrdoSpectrum.
8.8 Summary of GALS Design
To design a GALS-system instead of proved to be a quick and efficient waying of data and control signals. All uncould be removed in one week of worasynchronous circuits, which took on
The design principle is easy to use, nothe asynchronous parts, only the use opackage including the wrappers etc., GALS could get started designing thenous parts required a lot of time. Inous blocks were functional, thesele versions of these blocks weree asynchronous, need redundantdundant logic is removed by the. It took a lot of time to find out that simulations didnt work, since theeing what actually is happening
to see the generated componentsic was found. The next time con-
to prevent the synthesis tools from. The solution was to use and also setting an attributeipt used in the synthesis by Leonar-
a completely synchronous system to solve the problem with the tim-necessary buffers in the processork, not including the studying ofly a few days.
t much understanding is needed off them have to be learnt. With a
a designer unexperienced withse systems in a couple of days.
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
The only problem encountered with the GALS design was the synthesis,but when the process to do the synthesis of the asynchronous parts havebeen learned, this design step will not cause much problems.47
-
8 Asynchronous Design
48
-
9 Fu9.1 Introduction
In this chapter the future work of theter suggests where improvements are ing continuations of the project.
9.2 Word Length Optimizatio
Word length optimization is one of thSince the FFT processor is parameterspecific applications should be easy torithm. A few optimization test have bline them, plus describe some areas w
9.2.1 General
In general, optimizing a HW structurepower consumption, die size, and so oweighed against each other? This depWhen the application is know the neeknown. Since the precision is non-negthe precision, an optimized solution csion than this constraint. Hence, the otwo parts: getting good enough precisthe other parameters, as power consumare weighted together to a utility valuters depends on the application).ture Work
project will be discussed. The chap-possible or important, and interest-
n
e most important future works.ized, the different parameters forfind using some optimization algo-
een done, and this section will out-here more work could be done.
is a trade-off between precision,n. How should these parameters beends of course on the application.ded precision of the output isotiable there is a fixed constraint onould therefore not have worse preci-ptimization problem is divided intoion, and maximizing the utility ofption and so on. These parameters
e (the importance between parame-49
-
9 Future Work
50
9.2.2 Gradient Search
The test on gradient search that has been done uses a vector of wordlength parameters. One at a time each parameter is changed one step andthe utility and output error are calculated for each change. This results ina vector telling the importance (the gradient) of each parameter, both inutility and precision. If there are solutconstraint the one with the highest utiand if no solution have a good enoughprecision is chosen for the next iterati
It was hard to test the convergence oflack of a relevant function to calculatOne way to solve this is suggested in
9.2.3 Utility Function
One way on solving the problem to esthesize the butterflies, ROMs, RAMslengths, and measure the power consuMeasurements does not have to be malengths, because the ones not measurekind of interpolation.
The problem is still how to weigh thebut with this method the solution is onpower consumption, and so on have t
The synthesis work for the different wis left for the future.
9.3 VLSI Layout
The work in this project ends in FPGAgood in many ways. They have a shorouts does not have to be done and no tion. Though, they cannot compete wpower consumption, and when it comFPGAs can not compete with the prizions with better precision than thelity is chosen for the next iteration, precision, the solution with beston.
this algorithm, mostly due to thee the utility of a set of parameters.the next section.
timate the utility function is to syn-and multipliers with different wordmption, the die area, and so on.de for all possible different wordd could be estimated with some
se parameters against each other,e step closer, since at least die size,
o be estimated.
ord lengths have not been done. It
synthesizable code. FPGAs aret development cycle, because lay-layouts have to be sent for fabrica-ith VLSI layouts in speed andes to large series of components thee.
-
Design and Implementation of an Asynchronous Pipelined FFT Processor
Mostly due to the better performance, in speed and power consumption,that a VLSI design gives, there could be an interest in doing a VLSI lay-out of this FFT processor. It could of course also be interesting to see ifthe processor, considering the GALS design, is easy to layout into aworking chip.
9.4 VLSI Layout of Asynchro
The GALS design of this project is inthese circuits turned out to be simple.learn more about the design flow of thdown to VLSI layout. Hopefully the das the high level design. The reason thgood test example is that it is a GALSreal-world application.
9.5 Completely Asynchronous
The design of the FFT processor is sono reason that the LS-systems have totems could also be broken down into be of the GALS type, or even comple
9.6 Design for Test
DFT, or design for test, is increasing size of designs. Adding design for tesbe interesting later on in the project bdone.
9.7 Twiddle Factor Memory R
A twiddle factor memory reduction isaccomplished is reducing the ROM mtion of the ROM memories will decrewill probably increase the power consROM have to be changed in some wators.51
nous Parts
teresting. The high level design of It could therefor be interesting toese kinds of circuits all the wayesign on that level will be as easyat this FFT processor could be a design, that could be used in a
Design
far of the GALS type, but there is be kept synchronous. These sys-smaller systems, which also couldtely asynchronous.
in importance due to the increasingt methodology to the project couldefore a VLSI layout is going to be
eduction
possible. The reduction that can beemories by a factor of 8. The reduc-ase the die size of the design, butumption because the factors in the
y to match the previous twiddle fac-
-
9 Future Work
52
9.8 Commutators Implemented with RAM
The current implementation of the commutators uses a FIFO bufferimplementation. To use a RAM memory to implement the FIFO bufferswill reduce the power consumption.
9.9 Unscrambler
An unscrambler could be a useful blorequired to get the output data in natubler is needed. In the current solutionorder.ck to implement. It is not alwaysral order, but when it is an unscram- the output arrives in bit-reversed
-
1010.1 Conclusions
The goal with this masters thesis prodesign of FFT processors, and designThe processor has been designed anddesign steps have been used successfu
In the end of the project the design tuleading to more interesting problemsbeginning. The design of the GALS Fsimple design methodology for this a
Finally the next section will go througthe project, to check that they are fulfi
10.2 Follow-up of Requiremen
The Follow-up of Requirements wistated in chapter 1 Introduction.1. The transform length shall be able
in powers-of-2.This requirement is fulfilled. The rments, the implemented architectupower-of-2 number, not only the o
2. The input signal shall be a continu Summary
ject was to learn more about theing a parameterized FFT processor. a design methodology with smalllly.
rned into an asynchronous design,and solutions not expected from theFT taught a lot in this area, and a
rea has also been developed.
h the requirement stated early inlled.
ts
ll go through the requirements
to vary between 1k and 8k samples
esult is even better than the require-re handles any frame length of anes between 1k and 8k.ous data stream.53
-
10 Summary
54
This requirement is fulfilled. The implemented architecture handles acontinuous data stream in natural order, and the output is received inbit-reversed order.
3. The input signal shall consist of only one continuous data stream.This requirement is fulfilled. The implemented architecture handlesone and only one data stream.
4. The word length of the input and oable. The internal word lengths shaThis requirement is fulfilled. The ipletely parameterized. Word lengthtipliers, as well as input and outpu
5. Safe scaling shall be used.This requirement is fulfilled. The iscaling, using a scaling factor of 2
6. Data shall be represented with twoThis requirement is fulfilled. Data architecture uses fractional twos c
7. The implemented architecture shalThis requirement is fulfilled. The ilining. Each butterfly and each comline. More pipelining stages can ofutput signal shall be parameteriz-ll also be parameterizable.
mplemented architecture is com-s in butterfly stages and in all mul-
t, can be specified.
mplemented architecture uses safe.s complement format.representation in the implementedomplement representation.l be pipelined.mplemented architecture uses pipe-plex multiplier is a step in the pipe- course be inserted, if needed.
-
11 Bi[1] Torbjrn Widhe. Efficient Implem
Elements. Thesis No. 619, DeparLinkping University, Sweden. 2
[2] Sousheng He, and Mats TorkelssoFFT Processor. Department of AUniversity, SWEDEN.
[3] Jens Muttersbach, et. al. GloballyLocally-Synchronous ArchitecturOn-Chip Systems. Integrated SysInstitute of Technology, Zrich, S
[4] Bevan M. Baas. An Approach to LProcessor design. Dissertation DeStanford, USA. 1999.
[5] Shigenori Shimizu. Multi-Pipelinetransactions of the IEICE, vol. E bliographyentation of FFT Processingtment of Electrical Engineering,002.
n. A New Approach to Pipelinepplied Electronics, Lund
-Asynchronouses to Simplify the Design oftems Laboratory, Swiss Federalwitzerland. 1999.
ow-Power, High-Performance FFTpartment of Electrical Engineering,
FFT Architecture. The70, no. 6 June 1987.55
-
11 Bibliography
56
-
P svenska
Detta dokument hlls tillgngligt p Internet eller dess framtida ersttare under en lngre tid frn publiceringsdatum under frutsttning att inga extra-ordinra omstndigheter uppstr.
Tillgng till dokumentet innebr tillstnd fr var och en att lsa, ladda ner,skriva ut enstaka kopior fr enskilt bruk och att anvnda det ofrndrat fr ick-ekommersiell forskning och fr undervisning. verfring av upphovsrtten viden senare tidpunkt kan inte upphva detta tillstnd. All annan anvndning avdokumentet krver upphovsmannens medgivande. Fr att garantera ktheten,skerheten och tillgngligheten finns det lsningar av teknisk och administrativart.Upphovsmannens ideella rtt innefattar rtt att bli nmnd som upphovsman i denomfattning som god sed krver vid anvndning av dokumentet p ovan beskrivnastt samt skydd mot att dokumentet ndras eller presenteras i sdan form eller isdant sammanhang som r krnkande fr upphovsmannens litterra eller konst-nrliga anseende eller egenart.
Fr ytterligare information om Linkping University Electronic Press se fr-lagets hemsida http://www.ep.liu.se/
In English
The publishers will keep this document online on the Internet - or its possiblereplacement - for a considerable time from the date of publication barring excep-tional circumstances.
The online availability of the document implies a permanent permission foranyone to read, to download, to print out single copies for your own use and touse it unchanged for any non-commercial research and educational purpose. Sub-sequent transfers of copyright cannot revoke this permission. All other uses ofthe document are conditional on the consent of the copyright owner. The pub-lisher has taken technical and administrative measures to assure authenticity,security and accessibility.
According to intellectual property law the author has the right to be men-tioned when his/her work is accessed as described above and to be protectedagainst infringement.
For additional information about the Linkping University Electronic Pressand its procedures for publication and for assurance of document integrity, pleaserefer to its WWW home page: http://www.ep.liu.se/
Jonas Claeson
AbstractAcknowledgementsTerminologyNotationTable of Contents1 Introduction1.1 General1.2 Scope of the Report1.3 Project Requirements1.4 Reading Instructions
2 Algorithms2.1 Introduction2.2 The DFT Algorithm2.3 FFT Algorithms2.4 Common Factor Algorithms2.5 Radix-2 Algorithm2.6 Radix-r Algorithm2.7 Split Radix Algorithm2.8 Mixed Radix Algorithm2.9 Prime Factor Algorithms2.10 Radix-r Butterflies
3 Architectures3.1 Introduction3.2 Array Architectures3.3 Column Architectures3.4 Pipelined Architectures3.4.1 MDC, SDF and SDC Commutators3.4.2 Pipeline Architecture Comparisons
3.5 Multipipelined Architectures3.6 SIC FFT Architectures3.7 Cached-FFT Archite