project report1.pdf

DESIGN AND IMPLEMENTATION OF ANASYNCHRONOUS

PROCE

Masters thes

Electronics

Jonas C

Reg nr: LiTH-ISY

Linkping, Ju PIPELINED FFTSSOR

is project at Systems

laeson

-EX-3356-2003ne 13, 2003

DESIGN AND IMPLEMENTATION OF ANASYNCHRONOUS

PROCE

Examensarbete utfrtvid Linkpings Te

av

Jonas C

Reg nr: LiTH-ISY

Supervisor: W

Examiner: prof. L

Linkping, Ju PIPELINED FFTSSOR

i Elektroniksystemkniska Hgskola

laeson

-EX-3356-2003

eidong Liars Wanhammar

ne 12, 2003

SammanfattningAbstract

NyckelordKeywords

RapporttypReport: category

Licentiatavhandling

C-uppsatsD-uppsatsvrig rapport

SprkLanguage

Svenska/SwedishEngelska/English

ISBN

Serietitel och serienummerTitle of series, numbering

URL fr elektronisk version

TitelTitle

FrfattareAuthor

DatumDateAvdelning, InstitutionDivision, department

Department of Electrical Engineering

ISRNExamensarbete

ISSNX

581 83 LINKPING

http://www.ep.liu.se/exjobb/isy/2003/3356

Design och implementering av en asynkron pipelinad FFT processor

Design and Implementation of an Asynchronous Pipelined FFT Processor

FFT processors are today one of the most important blocks in communication equipment. They areused in everything from broadband to 3G and digital TV to Radio LANs. This master's thesis projectwill deal with pipelined hardware solutions for FFT processors with long FFT transforms, 1k to 8kpoints. These processors could be used for instance in OFDM communication systems.

The final implementation of the FFT processor uses a GALS (Globally Asynchronous Locally Syn-chronous) architecture, that implements the SDF (Single Delay Feedback) radix-22 algorithm.The goal of this report is to outline the knowledge gained during the master's thesis project, to de-scribe a design methodology and to document the different building blocks needed in these kinds ofsystems.

DFT, FFT, Pipelined, Parameterizable, Processor, GALS, Radix-22, SDF

X

2003-06-06

Jonas Claeson

LiTH-ISY-EX-3356-2003

FFT processors are today one of the mcation equipment. They are used in evdigital TV to Radio LANs. This mastpipelined hardware solutions for FFTforms, 1k to 8k points. These processOFDM communication systems.

The final implementation of the procechronous Locally Synchronous) archi(Single Delay Feedback) radix-22 alg

The goal of this report is to outline theters thesis project, to describe a desigthe different building blocks needed iAbstractost important blocks in communi-

erything from broadband to 3G anders thesis project will deal with processors with long FFT trans-ors could be used for instance in

ssor uses a GALS (Globally Asyn-tecture, that implements the SDForithm.

knowledge gained during the mas-n methodology and to document

n these kinds of systems.i

Abstract

ii

AcknowFirst of all I would like to thank my exfor giving me this interesting mastersdirections of my work. I would also liLi for his help with more detailed quehelped me a lot is Kent Palmkvist wittions, and Jonas Carlsson with questiocuits.ledgementsaminer professor Lars Wanhammar thesis project and for the generalke to thank my supervisor Weidongstions. Two other persons that

h VHDL and synthesis related ques-ns concerning asynchronous cir-iii

Acknowledgements

iv

TeTable: Terminology.

Abbreviation or term ExplanationBFP Block Floating

internally.butterfly Basic buildingCG FFT Constant GeoCOFDM Coded OrthogDFT Discrete Fouri

continuous foua time-domain

DFT Design For Teease and speed

DIF Decimation Inimplementing

DIT Decimation Inmenting a rad

FFT Fast Fourier TDFT.

GALS Globally Asyndecomposingthat communi

HW Hardware.in-place algorithm Output of a bu

came from.LS-system Locally SynchMDC Multipath Del

PEs in a pipelrminology

Point. One way of representing data

block in HW FFT processors.metry FFT.onal Frequency Division Multiplexing.er Transform. The discrete version of therier transform. Transforms a signal from to a frequency-domain.st. Extra HW is added in the design to up the testing. Frequency. One out of two ways of a radix PE. Time. One out of two ways of imple-ix PE.ransform. Quick way of computing a

chronous Locally Synchronous. A way ofa system into several synchronous blockscate with an asynchronous protocol.

tterfly is written back to where the input

ronous system.ay Commutator. Block between radixined architecture.v

Terminology

vi

not-in-place algorithm Output of a butterfly is not written back to where theinput came from.

OFDM Orthogonal Frequency Division Multiplexing. OFDMis a broadband multicarrier modulation method used ina lot of comm

PE Processing ElSDC Single-path D

PEs in a pipelSDF Single-path D

in a pipelined SFG Signal Flow G

ical way usingSIC Single InstrucSNR Signal to Nois

FFT context.

Table: Terminology.

Abbreviation or term Explanationunication systems.ement.elay Commutator. Block between radixined architecture.elay Feedback. Block between radix PEsarchitecture.raph. Describes an algorithm in a graph- adders, multipliers, signal wires, etc.tion Computer.e Ratio. Not a good measurement in the

Table: Symbols.

Symbol ExplanationN Length of the

FFT.x Input signal toX FFT transform

Table: Operators and functions.

Operator or function Explanationa|b b is divisible bN X modulo N.Notation

input and output sequence of a DFT or

an FFT processor. of the input signal x.

y a, i.e. b/a gives 0 in rest.vii

Notation

viii

Table oAbstract.........................................

Acknowledgements.......................

Terminology ..................................

Notation.........................................

Table of Contents..........................

1 Introduction..................................1.1 General 11.2 Scope of the Report 11.3 Project Requirements 21.4 Reading Instructions 3

2 Algorithms ....................................2.1 Introduction 52.2 The DFT Algorithm 52.3 FFT Algorithms 62.4 Common Factor Algorithms 62.5 Radix-2 Algorithm 72.6 Radix-r Algorithm 92.7 Split Radix Algorithm 92.8 Mixed Radix Algorithm 92.9 Prime Factor Algorithms 10f Contents.................................. i

.................................. iii

.................................. v

.................................. vii

.................................. ix

.................................. 1

.................................. 5ix

Table of Contents

x

2.10 Radix-r Butterflies 10

3 Architectures .................................................................. 133.1 Introduction 133.2 Array Architectures 133.3 Column Architectures 143.4 Pipelined Architectures 14

3.4.1 MDC, SDF and SDC Com3.4.2 Pipeline Architecture Com

3.5 Multipipelined Architectures 13.6 SIC FFT Architectures 173.7 Cached-FFT Architectures 18

4 Numerical Effects.........................4.1 Introduction 194.2 Safe Scaling 19

4.2.1 Radix-2 Safe Scaling 204.2.2 Radix-r Safe Scaling 21

4.3 Quantization 214.3.1 Twos Complement Quan4.3.2 Radix-2 Quantization 224.3.3 Radix-r Quantization 23

5 Implementation Choices..............5.1 Introduction 255.2 Algorithm Choice 255.3 Architecture Choice 26

6 Radix-22 FFTs ..............................6.1 Introduction 276.2 Algorithm 276.3 Architecture 296.4 Numerical Effects 30

7 FFT Design ...................................7.1 Introduction 317.2 Matlab Design 31mutators 15parisons 16

7

.................................. 19

tization 21

.................................. 25

.................................. 27

.................................. 31


7.2.1 Problems and Solutions 327.3 Matlab Simulations 327.4 VHDL Design 33

7.4.1 Problem 1 and Solution - Abstraction 1 337.4.2 Problem 2 and Solution - Object Orientation 347.4.3 Problem 3 and Solution - Control Block 35

7.5 Design for Test 357.6 VHDL Simulations 367.7 Synchronous or Asynchronou7.8 Testing 36

7.8.1 Random Testing 367.8.2 Corner Testing 377.8.3 Block Testing 377.8.4 Golden Model Testing 377.8.5 FPGA Testing 37

7.9 Synthesis 397.10 Meetings 40

8 Asynchronous Design...................8.1 Introduction 418.2 Asynchronous Circuits 418.3 GALS 42

8.3.1 Asynchronous Wrappers 8.3.2 Enable generation 43

8.4 Design Automation 448.5 Asynchronous FFT Architectu8.6 Testing 468.7 Synthesis 468.8 Summary of GALS Design 46

9 Future Work .................................9.1 Introduction 499.2 Word Length Optimization 49

9.2.1 General 499.2.2 Gradient Search 509.2.3 Utility Function 50

9.3 VLSI Layout 50xi

s Design 36

.................................. 41

43

re 45

.................................. 49

Table of Contents

xii

9.4 VLSI Layout of Asynchronous Parts 519.5 Completely Asynchronous Design 519.6 Design for Test 519.7 Twiddle Factor Memory Reduction 519.8 Commutators Implemented with RAM 529.9 Unscrambler 52

10 Summary.....................................10.1 Conclusions 5310.2 Follow-up of Requirements 5

11 Bibliography ................................................................. 53

3

.................................. 55

1 In1.1 General

FFT processors are involved in a wideonly as a very important block in broaalso in areas like radar, medical electrfor Extraterrestrial Intelligence).

Many of these systems are real-time stems has to produce a result within a FFT computations are also high and apose processor is required, to fulfill thFor instance using application specificcessors, or ASICs could be the solutioters thesis project an ASIC FFT procchoice because of its lower power con

1.2 Scope of the Report

The report is concentrated on pipelinetectures and algorithms that are most sors.

The first part of the report gives a revand FFT algorithm and different apprHW. Some terminologies like radix balgorithms, architectures, etc., are inttroduction

range of applications today. Notdband systems, digital TV, etc., butonics and the SETI project (Search

ystems, which means that the sys-specified time. The work load forbetter approach than a general pur-

e requirements at a reasonable cost.processors, algorithm specific pro-n to these problems. In this mas-

essor will be designed. ASIC is thesumption and higher throughput.

d FFT processors, and what archi-suitable for dedicated FFT proces-

iew on the theory behind the DFToaches to implement the FFT inutterflies, pipelining, commutators,roduced in this part.1

1 Introduction

2

The second part of the report describes the main goal of this masters the-sis project, i.e. to design and implement a parameterized pipelined FFTprocessor for transform lengths from 1k to 8k samples per frame. Thesetransform lengths and the parameterization reduces the amount of algo-rithms, architectures, and so on, that could be taken into account whendesigning a processor according to these criteria. Some parts of the the-ory are therefor very briefly describedlimited usefulness in the considered a

What trade-offs have to be made? Whshould be used? What types of simulaing performed? These are some of thethe second part.

1.3 Project RequirementsThe requirements for this masters the1. The transform length shall be able

in powers-of-2.2. The input signal shall be a continu3. The input signal shall consist of on4. The word length of the input and o

able. The internal word lengths sha5. Safe scaling shall be used.6. Data shall be represented with two7. The implemented architecture shal

These are the prioritizations that shou Effect, power consumption and thr

latency within reasonable limits. Linput stream arrives continuously.

SNR is a poor quality measuremenment should therefor not be considThough, this does not mean that thdue to quantization noise.compared to others, because of itsrea.

at architecture and algorithmtions should be done? How is test-questions that will be discussed in

sis project are as follows:to vary between 1k and 8k samples

ous data stream.ly one continuous data stream.utput signal shall be parameteriz-ll also be parameterizable.

s complement format.l be pipelined.

ld be taken mostly into account:oughput are superior to die area andatency is hard to affect when the

t in FFT processors, this measure-ered too important in the design.e output can have too low precision


These are the first requirements on the FFT processor that is going to bedesigned. Later in the report new restrictions and requirements will beadded to narrow down the area of investigation even further, to concen-trate the work on the type of architecture found to be most adequate.

1.4 Reading Instructions

This list gives a short description of th Chapter 1 Introduction contains th

will the project be all about? What Chapter 2 Algorithms contains a d

algorithms, not only those that wilalso a few other ones.

Chapter 3 Architectures contains aarchitectures that implement the FFalgorithms that not will be considealong with all the more adequate o

Chapter 4 Numerical Effects contaon the quantization errors in FFT a

Chapter 5 Implementation Choicesradix-22 algorithm and the SDF arbeing implemented.

Chapter 6 Radix-22 FFTs containsrithm and architectural description

Chapter 7 FFT Design contains thproject. Some problems that arose are discussed in this chapter.

Chapter 8 Asynchronous Design coasynchronous circuits, with a focusfocus in this chapter, but it also desture of the final implementation of

Chapter 9 Future Work contains suthis project could be.3

e content of each chapter.e introduction of the project. What will the result of the project be?escription of a lot of different FFTl be considered in the project, but

description of a lot of different FFTT algorithms. Also here some

red in the project will be described,nes.ins a basic theoretical introductionlgorithms and architectures. contains an explanation why the

chitecture is chosen to be to one

the derivation of the radix-22 algo-s of its components.e design methodology used in thisduring the project and the solution

ntains a very basic introduction toon GALS. The methodology is thecribes the asynchronous architec-

the asynchronous FFT processor.ggestions of what the next steps in

1 Introduction

4

Chapter 10 Summary contains the summary for the whole project.General thoughts and acquired knowledge are discussed.

Chapter 11 Bibliography contains the references referred to, insidesquare brackets, in the text.

2 A2.1 Introduction

The algorithms chapter will introducerithm and different approaches to comwill mainly be focused on FFT algoritalgorithms, will also be described bri

2.2 The DFT Algorithm

A DFT is a transform that is defined a

where

is the N-th root of unity. The inverse o

These equations show that the compleDFTs and IDFTs is O(N2), hence the masters thesis will be very costly in a

X k( ) x n( ) W Nnk

n 0=

N 1

,=

W N ej2N-

=

x n( ) 1N---- X k( ) W Nn k

k 0=

N 1

=lgorithms

the DFT definition, the FFT algo-pute FFTs in HW. The discussion

hms useful for long FFTs, but otherefly.

s

(Eq 2.1)

(Eq 2.2)

f the DFT (IDFT) is defined as

(Eq 2.3)

xity of a direct computation oflong transforms considered in thisstraight forward computation. The

k 0 N 1,[ ]

p

---

n 0 N 1,[ ],5

2 Algorithms

6

FFT algorithm deals with these complexity problems by exploiting regu-larities in the DFT algorithm.

2.3 FFT Algorithms

An FFT algorithm uses a divide-and-conquer approach to reduce thecomputation complexity for DFT, i.e.lot of different smaller problems thattion of the original problem.

In a communication system that uses need for an IFFT algorithm. Since theof these can be computed using basicreal and imaginary parts of the input,and imaginary data of the output. Thedata, except for the scaling factor in tthis is not a problem, and this will the

2.4 Common Factor Algorithm

Common factor algorithms are one wthe divide-and-conquer approach. Thiway of computing FFTs. N is then div

where the factors are constrained in th

This basically means that they have onFFT can be computed in i number of stational complex algorithms that can tion-in-frequency) and DIT (decimati

N Nii

=

a i a Ni( )"$ one big problem is divided into ain the end are assembled to the solu-

an FFT algorithm there is also aDFT and the IDFT are similar both

ally the same FFT HW, swap thecompute the FFT and swap the realoutput is now the IFFT of the input

he IFFT algorithm, 1/N. Usuallyrefor not be discussed henceforth.

s

ay of dividing the problem, usings method is the most widely usedided into factors according to:

(Eq 2.4)

e following way:

(Eq 2.5)

e factor in common. In this way anteps. There are two equally compu-

be derived from this, DIF (decima-on-in-time).


2.5 Radix-2 Algorithm

The radix-2 algorithm is a special case of the common factor algorithmfor N-point DFTs, where N is power-of-2. To derive the radix-2 algo-rithm, the indices n and k in Equation 2.1 are represented by

where

When these representations are used fDFT definition can be rewritten as

The last term in the right side of Equa

Observe that

By using Equation 2.10 on the differelowing relations are found

n 2 a 1 na 1 2

a 2n

a 2+ +=

k 2 a 1 ka 1 2

a 2 ka 2+ +=

ni ki, 0 1,{ } ,

N 2 a= a

X ka 1 k a 2 k0, , ,( ) x(

na 1 0=

1

n1 0=

1

n0 0=

1

=

W N

2 b nb

b 0=

a 1

2 b kb

b 0=

a 1

W N

2 a 1 ka 1 2

b

b 0=

a 1

=

W NN

e

j2 pN

---------

N

=7

(Eq 2.6)

(Eq 2.7)

or substitution in Equation 2.1, the

(Eq 2.8)

tion 2.8 can be expressed as

(Eq 2.9)

(Eq 2.10)

nt factors of Equation 2.9 the fol-

n0+ 2b

nb

b 0=

a 1

=

k0+ 2b k

b

b 0=

a 1

=

i 0a 1=

na 1 n a 2 n0, , , ) W N

2 b nb

b 0=

a 1

2 b kb

b 0=

a 1

nb

W N

2 a 2 ka 2 2

b

nb

b 0=

a 1

W N

k0 2b

nb

b 0=

a 1

1=

2 Algorithms

8

(Eq 2.11)

Insert Equation 2.11 in Equation 2.8

This summation can be divided into s

Finally, to obtain the FFT an unscramoutput data in natural order. Unscram

With this algorithm the computationaO(Nlog2(N)) butterfly operations. Theinto log2(N) different steps, which is ain HW.

The SFG for this derivation of the FFfor an 8-point radix-2 DIF FFT:

G0 W N

2 a 1 ka 1 2

b

nb

b 0=

a 1

W N

2 a 1 ka 1 2

b

nb

b 0=

0

W N2 a 1 k

a 1 n0= = =

G1 W N

2 a 2 ka 2 2

b

nb

b 0=

a 1

W N

2 a 2 ka 2 2

b

nb

b 0=

1

= =

Ga 1 W N

k0 2b

nb

b 0=

a 1

=

X ka 1 k a 2 k0, , ,( )

na 1 =

1

n1 0=

1

n0 0=

1

=

x1 k0 n a 2 n a 3 n0, , , ,( ) x(n

a 1 0=

1

=

x2 k0 k1 n a 3 n0, , , ,( ) x1n

a 2 0=

1

=

xa 1 k0 k1 k a 1, , ,( ) x a

n0 0=

1

=

X ka 1 k a 2 k0, , ,( ) x a =(Eq 2.12)

equential summations

(Eq 2.13)

bling stage is added to reorder thebling is done by bit-reversing.

(Eq 2.14)

l complexity is reduced tocomputation has also been dividedn advantage considering pipelining

T algorithm looks like Figure 2.1

x na 1 n a 2 n0, , ,( ) Gi

i 0=

a 1

0

na 1 n a 2 n0, , , ) G a 1

k0 n a 2 n a 3 n0, , , ,( ) G a 2

2 k0 k a 2 n, 0, ,( ) G0

1 k0 k1 k a 1, , ,( )


Figure 2.1: SFG for an 8-point radix-2 DIF FFT.

2.6 Radix-r Algorithm

The radix-r algorithm uses the same adecomposition using base-r instead o

The derivation of the radix-r is analogThe proof will therefore be left out hefor the radix-r case is O(Nlogr(N)) buO(logr(N)) butterfly stages.

2.7 Split Radix Algorithm

The split radix algorithm is one way oplications and additions required, [1].irregular structure compared to mixedrithms. Because of the irregular structparameterization, and will therefore n

2.8 Mixed Radix Algorithm

Mixed radix algorithms is a combinatThat is, different stages in the FFT coFor instance, a 16-point long FFT canone stage with radix-8 PEs, followed

W0

W0W0

W0

W0

W2

W2

W2

W1

W3

x(0)x(1)x(2)x(3)x(4)x(5)x(6)x(7)

X(0)X(1)X(2)X(3)

Stage 1 Stage 2

Unscramble

N r a= r a,9

pproach as radix-2, but with thef base-2. N is factorized as

(Eq 2.15)

ous to the derivation of radix-2.re. The computational complexitytterfly operations divided into

f decreasing the number of multi- The main drawback is the more radix and constant radix algo-ure this algorithm is not suitable forot be studied more thoroughly.

ion of different radix-r algorithms.mputation have different radices. be computed in two stages using

by a stage of radix-2 PEs. This adds

W0

W0X(4)X(5)X(6)X(7)

Stage 3

2 Algorithms

10

a bit of complexity to the algorithm compared to radix-r, but in return itgives more options in choosing the transform length.

2.9 Prime Factor Algorithms

Prime factor algorithms decompose N into factors that are relative prime,which means that the greatest commoThere are two reasons why prime factered later in the report. Firstly, it restrthat N cannot be power-of-2, which isscale very good, because for large Nsnumbers will also be large, hence wilmentation of the PEs.

2.10 Radix-r Butterflies

The radix-r butterflies are the blocks tin the radix-r algorithm. The followinSFG structure is derived (for the radixstage will be shown, the other proofs obtained is a radix-2 DIF (decimationand Equation 2.13

The last factor in the summation is noable, hence this factor can be lifted ou

x1 k0 n a 2 n0, , ,( ) x n a 1 n a,(n

a 1 0=

1

=

x na 1 n a,(

na 1 0=

1

=n divisor of the factors is equal to 1.or algorithms will not be consid-icts the transform length in a waya requirement. Secondly, it doesnt

the decomposing relative primel result in a very complex imple-

hat perform the basic computationsg reasoning will explain how the-2 case). Only the proof of the firstare analogous. The butterfly-in-frequency). From Equation 2.11

(Eq 2.16)

t depending on the summation vari-t from the summation

2 n0, , ) W Nk0 2

b

nb

b 0=

a 1

2 n0, , ) W Nk0 2

a 1n

a 1 W N

k0 2b

nb

b 0=

a 2


(Eq 2.17)

According to the above computations,made with a structure called radix eleradixes can be derived in a similar waand r outputs for a radix-r element. Tfor the radix-2 case.

Figure 2.2: Structure of a radix-2 DIF b

W N

k0 2b

nb

b 0=

a 2

x na 1 n a 2 n0, , ,( ) W N

k0 2a 1

na 1

na 1 0=

1

=

W Nk0 2

a 1n

a 1 e

j2 p2 a

------------

2 a

2----- k0 n a 1 1( )k0 n a 1= =

=

W NP

x 0 na 2 n0, , ,( ) 1( )

k0+(=

Wp

x1

x0 X0

X111

the basic FFT computations can bement. Radix elements for highery. These elements will have r inputshe figure below shows the structure

utterfly.

x 1 na 2 n0, , ,( ))

+

+ Wp-

x1

x0 X0

X1

2 Algorithms

12

3 Ar3.1 Introduction

This chapter discusses different architAs in chapter 2 Algorithms, mostly thwill be taken into account. Their advacussed.

3.2 Array Architectures

The array architecture can only be usthe extensive use of chip-area. This coelement (PE) for each butterfly in theFFTs longer than 16 points are not imhence it will not be discussed in detai

Figure 3.1: SFG for an 8-point array F

W0

W0

W0

W2

W2

W2

W1

W3

x(0)x(1)x(2)x(3)x(4)x(5)x(6)x(7)

Stage 1 Stage 2chitectures

ectures used for FFT computations.e architectures useful for long FFTsntages and drawbacks will be dis-

ed for very short FFTs, because ofmes from the use of one processingsignal flow graph (SFG). Normallyplemented with this architecture,ls.FT architecture.

W0

W0

W0

W0X(0)X(1)X(2)X(3)X(4)X(5)X(6)X(7)

Stage 3

Unscramble13

3 Architectures

14

3.3 Column Architectures

The column architecture uses an approach that requires less area on thechip than the array architecture. It is done by collapsing all the columnsin an array architecture into one column, hence a new frame cannot beprocessed until the processing of the current frame is finished. Hence,this architecture is not suitable for pipobviously smaller, only N/r radix-r elture. The architecture is still not smalfor long FFTs.

An architectural structure of a 4-poinbelow. To get a simple feedback netwstant geometry FFT (CG FFT) is oftemeans that the connection network insame between all stages.

Figure 3.2: Structure of a 4-point radix

3.4 Pipelined Architectures

Pipelined architectures are useful for throughput. The basic principle with pthe rows, instead of the stages like in ture is built up from radix butterfly elbetween. An unscrambling stage is soput side of the processor, if the outpuorder.

The advantage with these architecturethroughput, relatively small area and

x(0)x(2)

x(1)x(4)elining. The area requirement isements, than for the array architec-l enough to be taken into account

t radix-2 DIF FFT can be seenork a type of structure called con-n used as a starting point. This an array architecture would be the

-2 column architecture.

FFTs that require high dataipelined architectures is to collapsecolumn architectures. The architec-ements with commutators inmetimes added on the input or out-

t data is needed to be in natural

s, are for instance, high dataa relatively simple control unit.

X(0)X(2)

X(1)X(4)W

p

Wp


These advantages make this solution suitable for the long FFTs consid-ered in this masters thesis project.The basic structure of the pipelined architecture is shown below. Betweeneach stage of radix-r PEs there is a commutator (denoted C in the pic-ture). The last stage is the unscrambling stage (denoted U in the picture).The commutator reorders the output dfeeds to the following stage. The unscral sorted order.

Figure 3.3: General structure of a pipe

3.4.1 MDC, SDF and SDC Commu

There are basically three kinds of commutator (MDC), Single-path Delay FDelay Commutator (SDC). They all gerties, especially when it comes to tot

A commutator is a switch for data betthe pipeline. It stores parts of the FFTto perform the switching properly. Thdifferent, because it also feeds data ba

The figures below show the structure oa denotes the stage number in the pigives the size of that FIFO buffer in cBF4 is short for radix-4 butterfly elem

...x

radix-rPE C15

ata from the previous stage andrambler rearranges the data in natu-

lined FFT architecture.

tators

mutators, Multipath Delay Com-eedback (SDF) and Single-pathive the architecture different prop-al memory requirement.

ween the radix butterfly stages incomputations temporarily in order

e SDF commutator is somewhatckwards, Figure 3.5.

f the commutators. In these figurespeline. The numbers in the boxesomplex samples. C2 is a switch andent.

Xradix-r

PE U

3 Architectures

16

Figure 3.4: Multipath Delay Commutator structure.

Figure 3.5: Single-path Delay Feedback

Figure 3.6: Single-path Delay Commut

3.4.2 Pipeline Architecture Compa

There are many different pipelined armemory requirements, different compsummary of the most common pipelinTable 3.1, [2]. The abbreviations of th

2a

2a

C2

3x4a

BF4

6x4a Commutator structure.

ator structure.

risons

chitectures. They have differentlexities, different utilization, etc. Aed architectures are show ine architecture names are composed


in the following way, e.g. R2MDC is short for radix-2 multipath delaycommutator FFT architecture.

The R22SDF architecture is interestinties in the table this architecture is eqtectures, with one exception, the numR4SDC architecture. The R22SDF arcgood candidate to investigate, when cimplemented.

3.5 Multipipelined Architectu

Multipipelined architectures are builtlined architectures, but with the distinline can use two or more radix butterfl

These architectures achieve a higher parchitectures, [5]. The improvement inof pipes introduced.

3.6 SIC FFT Architectures

SIC FFT Architectures can be a goodments are not high compared with the[1]. In this architecture all the butterflA radix PE reads data from the memocomputation it writes the data back to

Table 3.1: Pipeline architecture comparison.

Architecture Multiplier # Adder # Memory size ControlR2MDC 2(log4 (N-1)) 4log4N 3N/2 - 2 SimpleR2SDF 2(log4 (N-1)) 4log4NR4SDF log4 (N-1) 8log4NR4MDC 3(log4 (N-1)) 8log4NR4SDC log4 (N-1) 3log4NR22SDF log4 (N-1) 4log4N17

g. When it comes to these proper-ual to or better than the other archi-ber of adders is 25% lower in thehitecture will therefore be a really

hoosing the architecture that will be

res

up in a similar way as normal pipe-ction that some stages in the pipe-y elements.

arallelism than regular pipelinedparallelism is equal to the number

choice when throughput require-throughput of available butterflies,

y elements share the same memory.ry and when it is finished with thethe memory. This results in a lot of

N - 1 Simple

N - 1 Medium

5N/2 - 4 Simple

2N - 2 Complex

N - 1 Simple

3 Architectures

18

memory accesses, which could be both hard to implement and be costlyin power consumption.

The architecture can be adapted to the requirements specification moreprecisely by adapting the number of radix PEs. For some specificationsthis architecture reduces radix PEs, which reduces both the die area andthe power consumption.

Figure 3.7: Structure of the SIC FFT a

3.7 Cached-FFT Architecture

Cached-FFT architectures are mainlysumption, [4]. The idea is to use a cacPEs and the main memory to decreasaccesses, which is very energy consum

.

.

.

Memor

radix-PE

radix-PErchitecture.

s

used for reducing the power con-he-memory between the radix-r

e the number of main memorying.

y

r

r

4 Numer4.1 Introduction

DSP systems almost always suffer frothe limited internal data word length. two operands always gives a result thoperands. The result have to be truncanal word length, hence quantization oSection 4.3 on page 21.

There is another thing in FFT systemsoverflow. If an overflow occurs in an Foutput. How this is solved will be dis

4.2 Safe Scaling

To prevent operations to overflow andscaling is used. It means that the outpto the first FFT stage are scaled in suctation in the next radix PE will not ovsion by a power-of-2 number, becausearithmetic right shift.

To simplify the discussion about safeused as an example. The safe scaling given without detailed discussion.ical Effects

m quantization effects, because ofFor instance, a multiplication byat is longer in bits than each of theted or rounded to avoid long inter-ccurs. Quantization is explained in

that have to be considered, and it isFT system it will generate a faulty

cussed in Section 4.2 on page 19.

cause errors a method called safeut from each radix PE and the inputh a way, that for certain the compu-erflow. The scaling is often a divi- it easily can be implemented by

scaling, the radix-2 case will befor the radix-r FFT algorithm is19

4 Numerical Effects

20

4.2.1 Radix-2 Safe Scaling

In a radix-2 butterfly element, overflow can occur for signals in wire Aand B, see Figure 4.1, after the summations. Overflow can, in reality, notoccur after the twiddle factor multiplication, because the absolute valueof the twiddle factor is always very close to one. Hence, the absolutevalue does not change, only the argum

Fractional twos complement is goingtherefore signals that can be represenabsolute value after each summation cas the input. Hence, to prevent the ouvalue of the input signals have to be sthat the input signals are in this rangeprevious butterfly stage by a factor ofoverflow and it only requires a little exin the FFT implementation.

Figure 4.1: Problem areas in a radix-2

No overflow will now occur in the radexcept for the first butterfly element. Tbe scaled. The real and imaginary inpthe range [-1, 1[, the absolute value coprinciple the input should be scaled bvalue of the first radix PE in the rangecheap to implement in HW and a scalbecause it can be implemented using

The absolute value of the output of thdue to the last safe scaling. To use theues a last stage called final scaling is multiplication by 2, increasing the abrange [-1, 1[.

-

x1

x0A

B

+

+ent.

to be used in this FFT processor,ted is in the range of [-1, 1[. Thean in the worst case be twice as big

tput from overflow the absolutemaller than 0.5. One way to ensureis to divide the output signals in the2. This method will always preventtra HW, hence, this method is used

element.

ix-2 butterflies using this method,he input to this butterfly also has tout to the FFT processor will be inuld therefore be as large as 20.5. In

y a factor of 1/21.5, to get the input [-0.5, 0.5[. This division is noting factor of 1/4 is a better choice,arithmetic right shift.

e FFT processor is smaller than 0.5, whole range of representable val-often added. This stage performs asolute value of the output to the

X0

X1Wp


4.2.2 Radix-r Safe Scaling

Radix-r safe scaling is similar to radix-2 safe scaling. The only differenceis that the scaling factor in the radix elements are 1/r instead of 1/2, theprescaling stage is 1/2r instead of 1/4 and the final scaling is a multiplica-tion of r instead of 2.

4.3 QuantizationQuantization occurs after each multipintroduced by quantization are modelmodelling. In this technique stochastiSFG where quantization occurs. Fromtions can be made to estimate the amoput by quantization.

4.3.1 Twos Complement QuantizaThe quantization can be done either bapproaches give different statistical pThe representation of a fractional two

This definition shows that the valuesuted in the [-1, 1[ interval, hence the qon the magnitude of the value. The diues is equal to the truncation error. Hnon-zero expectation value.

Rounding is a better quantization metof the rounding error is zero.

x x0 xi 2i

i 1=

W d 1

+=

0 D t1

2W d

-----------

1

2W d

--------- D r

2--

21

lication in the radix PEs. The errorsled with a technique called noisec noise sources are added to the the new SFG statistical calcula-unt of noise introduced in the out-

tiony rounding or by truncation. Theseroperties on the quantization error.s complement number is given by

(Eq 4.1)

it can represent is uniformly distrib-uantization error does not depend

fference between two adjacent val-ence, the truncation error has a

(Eq 4.2)

hod, because the expectation value

(Eq 4.3)

xi 0 1,{ }

1-----

1W d-------

4 Numerical Effects

22

Assuming that the data is uniformly distributed in the range of [-1,1[, theerrors are evenly distributed in the above given intervals. The variance ofa stochastic variable like this is

(Eq 4.4)

4.3.2 Radix-2 QuantizationThe quantization discussion in this sebutterfly elements with rounding and properties, the first is that no quantizasecond that quantization occurs at botcause overflow with safe scaling and tions when using safe scaling.

Figure 4.2: Quantization in a radix-2 D

The quantization, denoted Q, in Figuradding a complex stochastic variablepart and the imaginary part can be seevariables.

The expectation value and variance o

s

2 112------

1

2W d

---------

2=

-

x1

x0

+

+

Wp

n nre j n+=

E nre{ } E nim{ }=

V nre{ } V nim{ }= =ction is only considering DIF radixsafe scaling. The scaling gives twotion occurs in the adders and theh output nodes. The adders neverboth output nodes have multiplica-

IF PE.

e 4.2 can be modelled as an addern to the original signal. The realn as two independent stochastic

(Eq 4.5)

f the complex noise are

X0

X1

Q

Q

1/2

1/2

im

0=

112------

1

2W d

---------

2


(Eq 4.6)

The analysis are for a single radix-2 Dthe error analysis in the FFT. Considethe SFG in Figure 2.1. The error in thall stages in the binary tree that is formas root. Since safe scaling is used eacdivided by 2 before it is added to the nfrom the input to the output through ting, the noise variance for an N-point

4.3.3 Radix-r QuantizationThe noise analysis of radix-r DIF quaDIF quantization analysis, [1]. For thein the following way

Equation 4.8 shows that the noise is ltions, even though it has fewer butterfl

E n{ } E nre j n im+{ } 0= =V n{ } E n n{ } E nre

2nim

2+{ } E nre

2{ } E nim2{ }+ V nre{ } V nim{ }+= = = =

V n{ } 16---1

2W d

---------

2 s BF

2= =

s FFT2

s BF2 1 22 2 12 4 12---

2+ + +

=

s FFT2

s BF2 22 1

2i----

i 0=

log2 N( ) 1

s BF2 2= =

s FFT2

s BF2

r2 1 1

r---

1

rlogr N(

-----------------+ + +

=23

IF PE. This result can be used forr the error in only on output node inat node is then the summation overed with that particular output node

h error from a previous node isext stage. By propagating the error

he radix-2 PEs and their safe scal- FFT [1] can be written as

(Eq 4.7)

ntization is similar to the radix-2radix-r FFT the variance would be

(Eq 4.8)

arger for higher radix implementa-y stages.

2log2 N( ) 1 1

2log2 N( )

-------------------

2+

3 1 2log2 N( )

( ) 8 s BF2 1 1

N----

=

) 1--------

s BF2 r

3

r 1----------- 11N----

=

4 Numerical Effects

24

5 Impl

5.1 Introduction

The first part of the masters thesis reapproaches to the FFT problem. Thischoice of an adequate algorithm, archwhich the FFT processor is going to bchoices will be explained. The chosenone going to be implemented later in

5.2 Algorithm Choice

There are no restrictions on the algoririthm should be able to compute the Fselecting algorithm, the goal of an arcdesign have to be considered.

The radix-2 FFT algorithm has manylow quantization noise level, and it isdifferent FFT lengths.

Radix-r does not seem to be as good ait has higher quantization noise, and table to the different FFT lengths. To brithm to the general power-of-2 FFT lto be used in the pipeline, resulting inementationChoices

port is only a summary of different chapter is going to explain theitecture, and so on, for the area ine used. A lot of trade-offs and architecture in this section is thethe project.

thm choice, except that the algo-FT lengths from 1k to 8k. Whenhitectural and easy understandable

good features. For example, it has also easily parameterizable to the

choice as the radix-2 one, becausehat it is not as easily parameteriz-e able to parameterize this algo-

engths, different radix-r stages have a mixed radix implementation.25

5 Implementation Choices

26

Since minimizing the number of multipliers is important, a good choiceof algorithm would be the split radix one. It has a lower number of multi-pliers than all the above ones, but this algorithm results in a complexdesign, which will be harder to parameterize. The control of this type ofprocessor would also be more complex.

The radix-22 algorithm is the most attof as a radix-4 algorithm with radix-2of multipliers, simple control structur

Prime factor algorithms can not be uscan not be calculated using this algor

These are the reasons that the radix-2in the FFT implementation.

5.3 Architecture Choice

The choice of the architecture is easieture. What is left to decide is what kinarchitecture, MDC, SDF or SDC. In tcommutator will be used, because it hthan the other commutators.

The radix-22 architecture behaves a bcalls for two different architectures ladifferent FFT lengths needed. The firslengths, i.e. 1k and 4k FFTs, and the sFFTs. The latter lengths can easily beadding a radix-2 stage at either the inractive algorithm. It can be thoughtbuilding blocks. It has low numbere and architecture.

ed, because the right FFT lengthsithm.

2 FFT algorithm is going to be used

r, it has to be a pipelined architec-d of commutators to be used in the

he implementation the SDF type ofas a smaller memory requirement

it like the radix-4 architecture, thister, to be able to implement all thet architecture is for power-of-4 FFTecond architecture is for 2k and 8k created from the former ones byput or the output of the FFT.

6 Rad6.1 Introduction

This chapter will describe the radix-2background to the algorithm and the a

6.2 Algorithm

The derivation of the radix-22 FFT alwith a 3-dimensional index map, [2]. can be expressed as

When the above substitutions are applcan be rewritten as

Where

nN2----n1

N4----n2+=

k k1 2k2 4+ +=

X k1 2k2 4k3+ +( ) xN2----n1 +

n1 0=

1

n2 0=

1

n3 0=

N4---- 1

=

BN2----

k1 N4----n2 n+

n2 0=

1

n3 0=

N4---- 1

=ix-22 FFTs

2 FFTs in detail. The mathematicalrchitecture, will be discussed.

gorithm starts with a substitutionThe index n and k in Equation 2.1

(Eq 6.1)

ied to DFT definition, the definition

(Eq 6.2)

n3+ N

k3 N

N4----n2 n3+

W N

N2----n1

N4----n2 n3+ +

k1 2k2 4k3+ +( )

3

W N

N4----n2 n3+

k1

W N

N4----n2 n3+

2k2 4k3+( )27

6 Radix-22 FFTs

28

(Eq 6.3)

is a general radix-2 butterfly.

Now, the two twiddle factors in Equation 6.2 can be rewritten as

Observe that the last twiddle factor inrewritten.

Insert Equation 6.5 and Equation 6.4 summation over n2. The result is a DFFFT length.

The result is that the butterflies have tbutterfly takes the input from two BF

These calculations are for the first radthe BF2I and BF2II butterflies. The Bby the formulas in brackets in Equatioouter computation in the same equatiois derived by applying this procedure

BN2----

k1 N4----n2 n3+

xN4----n2 n3+

1( )k1 x N4----n2 n3N2----+ +

+=

W N

N4----n2 n3+

k1 2k2 4k3+ +( )W N

N n2k3W N

N4----

=

j( )n2 k1 2+(=

W N4n3k3

e

j2 pN

------------ 4n3k3e

4----= =

X k1 2k2 4k3+ +( ) H k1 k,([n3 0=

N4---- 1

=

H k1 k2 n3, ,( ) x n3( ) 1( )k1

x n3N2----+

+ (+=(Eq 6.4)

the above Equation 6.4 can be

(Eq 6.5)

in Equation 6.2, and expand theT definition with four times shorter

(Eq 6.6)

he following structure. The BF2II2I butterflies.

(Eq 6.7)

ix-22 butterfly, or its componentsF2I butterfly is the one representedn 6.7 and the BF2II butterfly is then. The complete radix-22 algorithmrecursively.

n2 k1 2k2+( )W N

n3 k1 2k2+( )W N4n3k3

k2)W Nn3 k1 2k2+( )W N

4n3k3

j2 pN

-------- n3k3W N

4----

n3k3=

2 n3, )W Nn3 k1 2k2+( )]W N

4----

n3k3

j ) k1 2k2+( ) x n3N4----+

1( )k1x n33N4-------+

+


6.3 Architecture

The first butterfly, the BF2I, in the radix-22 butterfly has the followingarchitecture.

Figure 6.1: BF2I DIF butterfly architecture.

The second butterfly, the BF2II, has tbelow. The BF2I butterfly is a radix-2fly basically is a radix-2 butterfly but cations.

Figure 6.2: BF2II DIF butterfly archite

-

-

xr(n)

xi(n)

xr(n+N/2)

xi(n+N/2)

+

+

+

+

xr(n)

xi(n)

xr(n+N/2)

xi(n+N/2)29

he architecture seen in the figurebutterfly, whereas the BF2II butter-with trivial twiddle factor multipli-

cture.

1

1

0

0

0

0

1

1Zr(n+N/2)

Zi(n+N/2)

Zr(n)

Zi(n)

s

-

-

1

1

0

0

0

0

1

1Zr(n+N/2)

Zi(n+N/2)

Zr(n)

Zi(n)

s

+

+

+

+

t

+

+-

6 Radix-22 FFTs

30

A radix-22 SDF FFT architecture with these radix butterfly elements plusmultipliers is shown in Figure 6.3. This architecture uses the sameamount of non-trivial complex multipliers as the radix-4 architecture, butretains the simple radix-2 architecture. Another advantage is that the con-trol structure is simple for this implementation, only a binary counter.The block in the feedback loop is a FIFO buffer, the number indicates thenumber of complex samples it can sto

Figure 6.3: Architecture of a 64-point r

6.4 Numerical Effects

Numerical effects in the radix-22 algoradix-2 algorithm, because it has the sfewer multipliers. For a description oradix-22 algorithm, see the radix-2 invEffects.

clk 5 4 3

x(n)

W1(n)

X

BF2I BF2II

1632

s st

BF2I

8

sre.

adix-22 SDF FFT.

rithm is exactly the same as in theame butterfly structure, but with

f the numerical effect for theestigations in chapter 4 Numerical

02 1

X(k)

W2(n)

X

BF2I BF2II

12

BF2II

4

ssst t

7 F7.1 Introduction

This chapter discusses the design of ation levels, block division, trade-offs,discussed in more detail.

In general, the design in this project sodology in small refining steps. The fiand the final model will be FPGA synprocess will go from the former to thei.e. only small changes will be introduThese smaller design steps will hopefdesign process and smaller amount ofsome point in the design process the MVHDL description, but it is hard to knfor this conversion will be.

The models should be implemented wThis leads to a more reusable and easmentation.

7.2 Matlab Design

The design of the FFT processor begitional model in Matlab. The advantagMatlab is that Matlab offers a high levgood interface for testing. This meansent models can be tested.FT Design

n FFT processor. Different abstrac- simulations, testing, etc. are things

hould be done with top-down meth-rst model will be built in Matlabthesizable VHDL code. The designlatter in several small design steps,ces in the model in each step.

ully lead to a more predictableerrors introduced when refining. Atatlab model has to be converted toow in advance when the best time

ith a good hierarchical architecture.ier understanding of the final imple-

ns with the design of a simple func-e of starting the design process inel programming language and a that in a short time a lot of differ-31

7 FFT Design

32

The first Matlab model was designed with a bottom-up methodology at ahigh level of abstraction. Actually a top-down methodology should havebeen used, but it seemed like a better solution to do the first model in thisway, because a good description of the algorithm and architecture wasavailable [2]. Some things in the algorithm were left out, which latercaused problems that delayed the project for around two weeks. The bot-tom-up methodology was only for thefrom that point and onwards a top-do

7.2.1 Problems and Solutions

All the different blocks were easily imtor generating block. The other blockwere described in detail in the paper [almost completely left out. Only the dhint about the functionality of the blomodels to get the FFT to work, the 16i.e. two radix-22 stages. To get it to wproblem, because this was the part whFinally it was solved through more stunderstand them better.

7.3 Matlab Simulations

Matlab simulations are an important plation shows if the models developed the final FFT model were tested throuferent blocks in the processor were te

Most simulations were done to get anthe output. The error depends on the dnumber of steps (depends on the FFTthese simulations were carried out to tion techniques. Different optimizatioSection 9.2 on page 49.

Simulations were also done to compavalidate that they are functionally equ creation of the first model andwn methodology was used.

plemented, except the twiddle-fac-s where easy to understand and2], but the twiddle-factor block waseduction of the algorithm gave ack. After testing a lot of different-point FFT finally worked correct,ork for longer FFTs was a real hardere some descriptions were left out.udying of the FFT formulas, to

art in the design process. The simu-are functionally correct. Not onlygh simulations, but also all the dif-sted separately.

estimation of the size of the error inata-widths in the processor and the length) in the pipeline. A lot oftest different data-width optimiza-n techniques are discussed in

re two models against each other, toivalent.


7.4 VHDL Design

The VHDL design started when the model of the parameterizable FFTprocessor were decided to be correct after a lot of simulations. The stepfrom Matlab to VHDL should be as small as possible. The models shouldhave the same blocks and their implementation should be the same. InVHDL it is possible to write functionVHDL model should not differ so mu

The design environment that was useding, Vcom for VHDL compiling and is a tool for graphical representation ois easier to understand with a total texproject with a lot of parameterization

7.4.1 Problem 1 and Solution - Abs

In the Matlab model signals are descrIEEE library have support for complerepresentation and not for the signed sentation. Hence, the first refinement separate the complex signals into twoand the other holding the imaginary vlems, for one reason the abstraction leanother reason some of these models lab.

One problem that the division of the cincreased the number of signals in theincreasing the block complexity. Incrdecreases the ease of understanding it

The first approach to solve this probleelements of a std_logic_vector, to crenals. However, it was impossible twocode wouldnt compile because the arbefore compile-time, and the word lenterized.33

al models so the Matlab and thech.

was Emacs for VHDL text-edit-Vsim for VHDL simulation. Thereff block structures, but the structuret representation, at least in this

.

traction 1

ibed by complex variables. Thex signals, but only for floating pointfractional twos complement repre-between Matlab and VHDL was to signals, one holding the real valuealue. This didnt cause much prob-vel was almost the same, and forwere already implemented in Mat-

omplex signals caused was that it blocks almost by a factor of two,

easing the block complexity.

m was to create an array with twoate an abstraction of complex sig- make a construct in this way. Theray elements have to be constrainedgth couldnt therefore be parame-

7 FFT Design

34

The second approach also used arrays. The approach used an array withword length number of std_logic_vector(1 downto 0). This construct iscompilable. The problem with it is that slices couldnt be used, a specialfunction would have to be written to extract the information. The codewould be easy to read, and it might even be synthesizable, but the testingwill be more complicated. The std_logic_vector in this case only storestwo bits, one for the real value and onebench it will therefore not be easy to

The third approach was to create an ahave a size of 2 x the word length. Thdrawbacks. Both dimensions in the deunconstrained, the best way would beand the other as a parameter. The drawsignal two dimensions have to be givefor declaring the real and imaginary dtion of signals in this way only makes

The final approach, the one that was udoned the abstraction of signals. The good solution. An extra layer in the bing the number of signals in each blocable complexity and length.

7.4.2 Problem 2 and Solution - ObjThe abstraction problem described abobject oriented variant of VHDL. Thefunctionality with an extra layer on topreprocessing stage, i.e. digital structferent from VHDL. To synthesis this,into VHDL-code, the rest of the stepssteps.

This solution wasnt used because VHrequirements, and because most peopextra layer. The lack of understandingthe future.for the imaginary value. In the testread the signal values.

rray of std_logic. The array wouldis is a good way but it has someclaration of the array have to beto be able to set one dimension to 2backs is that in the declaration of an, one for the word length, and oneimension (always 2). The declara- the code a bit less readable.

sed in the implementation, aban-reason was that there was anotherlock structure was added, decreas-k. This gave VHDL-files of reason-

ect Orientationove could have been solved with anre are some attempts the create thisp of VHDL. One solution had aures were written in a language dif- the code first had to be compiled are the usual VHDL-synthesis

DL had to be used according to thele dont understand the code of the would limit the use of the code in


When writing normal non-parameterized VHDL-descriptions the benefitof object orientation might not be as large as for highly parameterizedsystems like the FFTs considered in this project.

7.4.3 Problem 3 and Solution - Control Block

The control problem arose in the syncdata signals, i.e. that the right data shocontrol signals.

To get the FFT processor to work, shibetween each radix-22 stage and betwinside the radix-22 stage. The result wthat the latency between input and ouis not a big problem, but the extra HWpower consumption.

This problem could be solved in two unit or creating a system consisting omunicating asynchronously (GALS).completely synchronous would increaunit, resulting in a system that is hardewould only have a slightly different cthe original one. The blocks would alother functionally, which could be a gdesign later in the future. These pros aof the FFT processor as a GALS-syst

7.5 Design for Test

Design for test is a way to speed up thThis design method is used to find fabin the printed chip causing logical errment often used in this context is faulcoverage of stuck-at faults. A fault coof the die area of the chip was not corcess.35

hronization of control signals anduld be available in the right state of

mming delays had to be addedeen the two butterfly elementsas that more HW was needed and

tput frames increased. The latency will increase the die size and the

ways, either changing the controlf locally synchronous blocks com-The first choice of keeping the FFTse the complexity of the controlr to understand. The second choice

ontrol structure, but very similar toso be more separated from eachood property when improving thend cons lead to the implementation

em.

e testing of manufactured chips.rication faults in the chips, e.g. dustors, not design errors. A measure-t coverage, which often means theverage of e.g. 95% means that 95%rupted during the fabrication pro-

7 FFT Design

36

Design for test have not yet been considered, due to the early stage of theproject.

7.6 VHDL Simulations

Test bench code-skeletons were produced for combinatoric, synchronousand asynchronous parts. These test bespecific test bench for a block. Input awere read and written to test files. TheMatlab where the VHDL simulationssimulations.

To ease the interfacing between MatlaMatlab functions were developed. Thtion of test data and the reading and w

7.7 Synchronous or Asynchro

Both synchronous design and asynchdisadvantages. The reason for choosinthis project can be found in Section 7chronous circuits and the design of thpage 41.

7.8 Testing

Testing is an important part of the proThis section will outline the testing sttions have been describing simulationbut this section looks into this area m

7.8.1 Random Testing

Random testing is the most frequentlyare generated by Matlab, written to thtest benches. This type of testing is quare generated automatically, and it is because input signals often seem to bnches could easily be altered to and output from the test benchesse files could then later be read intocould be compared with the Matlab

b and VHDL simulation a set ofese functions handled the genera-riting of binary data to the test files.

nous Design

ronous design have advantages andg asynchronous design (GALS) in

.4.3 on page 35. More about asyn-ese can be found in Section 8 on

ject and testing is done on all levels.rategy of this project. Previous sec-s, which also is a form for testing,ore thoroughly.

used method. Random sequencese test files and read by the VHDLick to use because test sequences

also adequate in the area of FFTse randomly distributed.


7.8.2 Corner Testing

Random testing is good, but some things is hard to detect, e.g. cornercases. Corner cases are input sequences that the designer or tester thinkcan cause errors, e.g. overflows in adders and multipliers. In the FFT areaa corner case could be an input sequence of only maximum and mini-mum input values.

In this project corner testing is mostlyworks as in should, resulting in no ovethe output.

7.8.3 Block Testing

Block testing is the lowest level of tesother sub-block, the sub-blocks are rueach sub-block is verified. When it isrectly it can be assumed that if the blonection of sub-blocks. This method liwill save a lot of debugging time.

7.8.4 Golden Model Testing

Golden model testing is the best way tit should. The Golden model is a modrectly, which new implementations ofcompared against.

In the case of this project the golden mMatlab. The Matlab function can of ccomplete FFT, and not really the sub-Matlab model is thoroughly tested ansub-blocks can be used as golden mo

7.8.5 FPGA Testing

FPGA tests is the final step in the testcomputer software is time consuminglarge test vectors. VHDL code synthetest a lot, probably several orders of m37

done to check that the safe scalingrflows which would cause errors in

ting. Before a block is built out ofn through a block test to ensure thatknown that all sub-blocks work cor-ck errors it is due to the intercon-

mits the area where the error is and

o ensure that a model is working asel that is known to be working cor- the same functionality could be

odel is the built-in FFT function inourse only be used when testing theblocks of the system. Since thed it is assumed to be correct, thedels for the testing of later designs.

ing process. Doing simulations in, and it is therefore difficult to runsized to an FPGA will speed up theagnitude.

7 FFT Design

38

The Virtex-II V2MB1000 Development Board was used for the FPGAtests. The choice to use this board is that it can handle large designs. Thedevelopment board has a lot of sockets for interfacing with other compo-nents, i.e. RS232, parallel input, ethernet and on board switches.

To do tests in real-time, test vectors have to be sent and received from theFPGA in full speed. This requires a hFPGA board interfaces, since all test FPGA board. These interfaces wouldfit the time-plan of this thesis project,therefore the real-time requirement is

The full speed test can be done by loaon-board, run the simulation in full-spmemory, and finally reading out the Fthe block schematic of the test bench

Figure 7.1: Vertex-II asynchronous FFT

To test the design with this block strufor this there is not enough time left iinterface have to be written for the meas the code for the wrapper and test cnicate with the test board. To design atime far beyond the limit.

TestController

FFT

Wrapper

Data

Addr

RS232

FPGA chipVertex-II boardigh bandwidth through one of thevectors cannot be stored on the take to much time to implement tohence a simpler test is required and dropped.

ding all test vectors into a memoryeed while writing the output toFT output from the memory. Seein the figure below. test bench.

cture would be possible, but evenn the project to finish the test. Anmory and the RS232 port, as well

ontroller, and a program to commu-ll this would prolong the project

bus

ess bus Memory


The RS232 port seems to be the easiest way to interface with the Virtex-II board, so the other solutions using other ports would also extend theproject beyond limits. Hence, another way of testing is required.The final test bench was completely embedded on the FPGA chip. Sincethe FPGA chip does not have a large memory (ROMs and RAMs) capac-ity, the test vectors have to be small. Ia smaller FFT was tested. A 16-pointsmallest FFT processor in this projectthe two different butterflies, prescalintor multipliers.

Figure 7.2: The implemented FPGA tes

The Input generator was implementedand an asynchronous wrapper. The Osimilar way, but with an extra block ththe expected data. A difference betwedata would trigger the test bench intodiode on the FPGA board.

The result of the test showed that it wprocessor to an FPGA.

7.9 Synthesis

The VHDL-code written in this projesynthesis of the code two programs wXilinx Design Manager. The synchrobut the asynchronous ones caused a lopage 46.

FPGA chip

Inputgenerator Req

Ack

Data39

nstead of testing a 1024-point FFT FFT was selected because it is the that includes all components, i.e.g, final scaling and the twiddle fac-

t.

with a ROM memory, a counterutput tester was implemented in aat compared the received data withen the received and the expectedan error mode, which would light a

as possible to synthesize the FFT

ct should be synthesizable. For theere used, LeonardoSpectrum andnous parts were easily synthesized,t of problems, Section 8.7 on

FFTOutputtesterReq

Ack

Data

7 FFT Design

40

7.10 Meetings

Meetings were held regularly. Every meeting had a written protocol and aminutes were written directly after the meeting. The minutes were thensent by e-mail to the examiner and the supervisor.

In the beginning the meetings were hedirections of the project, and later theprogress in the work.

The meetings helped a lot in the beginwhat was going to be done. If something minutes could be read to find the aprotocol for the next meeting.ld to define the limitations and meetings mostly described the

ning, because it defined clearlying was forgotten, the correspond-nswer, if not, it was included in the

8 Asy

8.1 Introduction

An asynchronous design methodologproject to solve the problem with thetion. Asynchronous design has many son for using it was to test if it could

This chapter will introduce asynchronthe asynchronous design process used

8.2 Asynchronous Circuits

Asynchronous circuits are a big area owill be used in this project, the theorylined. For the interested reader more mcan be found in [3].

Asynchronous circuits have many advlike Performance of an asynchronous s

latency, not the worst case latency Global clock timing problems are The power consumption can be low

despite the fact that they require mnchronousDesign

y was used in this masters thesiscontrol signal and data synchroniza-interesting properties, but the rea-solve the control problem.

ous circuits and it will also describe in this project.

f research and only small parts of it will therefore only be briefly out-aterial on asynchronous circuits

antages over synchronous ones,

ystem depends on the average caseas in synchronous circuits.avoided.

er in asynchronous circuits,ore HW.41

8 Asynchronous Design

42

Asynchronous systems are more robust against temperature and sup-ply voltage variations.

Lower electrical interference with other components, due to the lackof clock harmonics in the emission spectra.

No global clock is used in asynchronous circuits, instead some form ofhandshaking is used in the communicples of handshaking protocols are 2-pthis project 4-phase handshaking willblocks already have been implemente

8.3 GALS

GALS, abbreviation of Globally Asynare a small subset of asynchronous sypletely asynchronous, they consist of nication asynchronously. In Figure 8.synchronous system, Req is short for acknowledge. Req and Ack performs push communication channel will be ducer of data initiates the handshakin

Figure 8.1: GALS asynchronous comm

A 4-phase handshaking cycle is perfomeans Req goes high): Req+, Ack+, Rbe valid between Req+ and Ack-, but

GALS combine some of the benefits benefits from asynchronous ones. Sinthey can be designed the same way asGlobally the system is asynchronous,global signals, which is an increasinglarger and faster.

LS-systemReq

Ack

Dataation between systems. Two exam-hase and 4-phase handshaking. In be used, because these interfacingd in VHDL.

chronous Locally Synchronous,stems. These systems are not com-synchronous sub-systems commu-1 the LS-system is a locallyrequest and Ack is short forthe handshaking. In this project aused, which means that the pro-g.unication.

rmed in the following way (Req+eq- and finally Ack-. Data should

is often sampled on the Ack+ edge.

from synchronous circuits with thece the local parts are synchronousbefore using available design tools.which removes timing problems for problem when designs are getting

LS-system


8.3.1 Asynchronous Wrappers

An asynchronous wrapper is used for the handshaking between two mod-ules. The wrapper consists of three components, a stretchable clock gen-erator, a demand-port (D-port) and a poll-port (P-port). Connectedaccording to the figure below they implement a 4-phase push-channel,[3].

Figure 8.2: Asynchronous wrapper com

The En signal is a control signal frominput and output of the system. En ontem is ready to receive data, and En oLS-system is ready to send data. The Sdata is available or the receiving blockfor the LS-system.

8.3.2 Enable generation

The enable signal can be generated inthis problem an easy solution is possinents continuously and the same amousample. The system both needs data aThough, the output is delayed compaenable generation has to be delayed ction. The result is that the enable contmostly from a counter.

LS-sy

P-in

Stretcha

Req

Ack

Str

En

Data43

ponents.

the LS-system, that controls thethe input controls when the LS-sys-n the output controls when thetr signal stops the clock if no inputis busy. The lclk signal is the clock

a lot of different ways. Though, forble. Data flows through the compo-nt of processing is needed for eachnd can send data each clock cycle.red to the input, i.e. the outputompared to the input enable genera-rol can be a separate block built up

stem

D-out

ble clock

Req

Ack

Str

Data

En

lclk


44

8.4 Design Automation

To be able to quickly convert the available synchronous implementationto a GALS one, the design was automated. The wrapper was imple-mented as one component, a VHDL code skeleton was created for oneLS-system with an asynchronous wrapper and data registers, and testbenches for these systems were createchronous component in this project ubelow.

Figure 8.3: General architecture of asy

The skeleton according to this structuLS-system of the considered type.

The only difference between two diffcontrol block, which of course depenwrapped. In the synchronous implemwhereas when the system is divided inblock have to have its own local contrwere similarly implemented as the glo

The skeleton above only works for a sinput is only received from one blockondly, the structure implements a pusThough, it is easy to change it to a pul

Req

Ack

En

Data

Wr

Conbl

Reg LS-sd. The general structure of an asyn-ses the architecture in the figure

nchronous components.

re is easy to modify to wrap any

erent asynchronous blocks are theds on what LS-system is beingentation the control block is global,to several asynchronous block eachol block. The local control blocksbal one.

pecial type of architectures. Firstly, and sent to another block. Sec-h-channel type of communication.l-channel type by reversing the Req

Req

Ack

En

apper

trolock

lclk

Data

ystem

Controlsignals

Controlsignals


and Ack signals, and swap the D-port for a P-port and the P-port for aD-port in the wrapper.

8.5 Asynchronous FFT Architecture

The asynchronous FFT is built up by connecting the wrapped LS-sys-tems, which is illustrated in the figure

Figure 8.4: The architecture of the asyn

The number of stages are determinedFigure 8.5 shows the FFT structure wIn the case a 2 times power-of-4 lengtand twiddle factor multiplier stage hathe processor. The reason for choosinside, which also is possible, is that a lradix-22 (1024-point FFT) stages andbetween. This layout would not changstages are added on the input side.

Figure 8.5: The architecture of the asyn

Decomposing the radix-22 blocks intoremoved the timing problem of the in

Figure 8.6: The architecture of the radi

The asynchronous butterfly stages, thand the scaling stages were designed described in Section 8.4 on page 44.

Asyncprescaling

AsynFFT

Data

ReqAck

AsyncTwiddleFactor

Multiplier

Async2

Radix-2AsyRadi

Data

ReqAck

Asyncbf2istage

Abs

Data

ReqAck45

below.chronous FFT, including scaling.

by the FFT transform length.ith a power-of-4 transform length.h is wanted, an extra butterfly stageve to be added on the input side ofg the input side and not the outputayout could be done of the first five the twiddle factor stages ine in the case when extra butterfly

chronous FFT.

two different asynchronous unitsternal control signals in the stage.

x-22 block.

e twiddle factor multiplying stagesaccording to the methodology

Asyncfinal

scaling

c Data

ReqAck

nc2x-2

Async2

Radix-2

Data

ReqAck

syncf2iitage

Data

ReqAck


46

8.6 Testing

The testing of the asynchronous circuits are similar to the testing of thesynchronous ones. A skeleton test bench were implemented that allowedthe use of the same input and output files as for the testing of the synchro-nous design.

8.7 Synthesis

The synthesis process of the asynchrothe VHDL simulations the asynchronare not synthesizable, but synthesizabavailable. These blocks, since they arlogic to be functionally correct and resynthesis tools to optimize the designthis was the problem when the FPGAFPGA does not offer any means of seinside the chip.

The synthesis tools have a possibilityand the problem with the removed logsuming problem was to find out how removing logic in certain componentsattribute (keep) in the VHDL code an(PRESERVE_SIGNAL) in the tcl-scrdoSpectrum.

8.8 Summary of GALS Design

To design a GALS-system instead of proved to be a quick and efficient waying of data and control signals. All uncould be removed in one week of worasynchronous circuits, which took on

The design principle is easy to use, nothe asynchronous parts, only the use opackage including the wrappers etc., GALS could get started designing thenous parts required a lot of time. Inous blocks were functional, thesele versions of these blocks weree asynchronous, need redundantdundant logic is removed by the. It took a lot of time to find out that simulations didnt work, since theeing what actually is happening

to see the generated componentsic was found. The next time con-

to prevent the synthesis tools from. The solution was to use and also setting an attributeipt used in the synthesis by Leonar-

a completely synchronous system to solve the problem with the tim-necessary buffers in the processork, not including the studying ofly a few days.

t much understanding is needed off them have to be learnt. With a

a designer unexperienced withse systems in a couple of days.


The only problem encountered with the GALS design was the synthesis,but when the process to do the synthesis of the asynchronous parts havebeen learned, this design step will not cause much problems.47


48

9 Fu9.1 Introduction

In this chapter the future work of theter suggests where improvements are ing continuations of the project.

9.2 Word Length Optimizatio

Word length optimization is one of thSince the FFT processor is parameterspecific applications should be easy torithm. A few optimization test have bline them, plus describe some areas w

9.2.1 General

In general, optimizing a HW structurepower consumption, die size, and so oweighed against each other? This depWhen the application is know the neeknown. Since the precision is non-negthe precision, an optimized solution csion than this constraint. Hence, the otwo parts: getting good enough precisthe other parameters, as power consumare weighted together to a utility valuters depends on the application).ture Work

project will be discussed. The chap-possible or important, and interest-

n

e most important future works.ized, the different parameters forfind using some optimization algo-

een done, and this section will out-here more work could be done.

is a trade-off between precision,n. How should these parameters beends of course on the application.ded precision of the output isotiable there is a fixed constraint onould therefore not have worse preci-ptimization problem is divided intoion, and maximizing the utility ofption and so on. These parameters

e (the importance between parame-49

9 Future Work

50

9.2.2 Gradient Search

The test on gradient search that has been done uses a vector of wordlength parameters. One at a time each parameter is changed one step andthe utility and output error are calculated for each change. This results ina vector telling the importance (the gradient) of each parameter, both inutility and precision. If there are solutconstraint the one with the highest utiand if no solution have a good enoughprecision is chosen for the next iterati

It was hard to test the convergence oflack of a relevant function to calculatOne way to solve this is suggested in

9.2.3 Utility Function

One way on solving the problem to esthesize the butterflies, ROMs, RAMslengths, and measure the power consuMeasurements does not have to be malengths, because the ones not measurekind of interpolation.

The problem is still how to weigh thebut with this method the solution is onpower consumption, and so on have t

The synthesis work for the different wis left for the future.

9.3 VLSI Layout

The work in this project ends in FPGAgood in many ways. They have a shorouts does not have to be done and no tion. Though, they cannot compete wpower consumption, and when it comFPGAs can not compete with the prizions with better precision than thelity is chosen for the next iteration, precision, the solution with beston.

this algorithm, mostly due to thee the utility of a set of parameters.the next section.

timate the utility function is to syn-and multipliers with different wordmption, the die area, and so on.de for all possible different wordd could be estimated with some

se parameters against each other,e step closer, since at least die size,

o be estimated.

ord lengths have not been done. It

synthesizable code. FPGAs aret development cycle, because lay-layouts have to be sent for fabrica-ith VLSI layouts in speed andes to large series of components thee.


Mostly due to the better performance, in speed and power consumption,that a VLSI design gives, there could be an interest in doing a VLSI lay-out of this FFT processor. It could of course also be interesting to see ifthe processor, considering the GALS design, is easy to layout into aworking chip.

9.4 VLSI Layout of Asynchro

The GALS design of this project is inthese circuits turned out to be simple.learn more about the design flow of thdown to VLSI layout. Hopefully the das the high level design. The reason thgood test example is that it is a GALSreal-world application.

9.5 Completely Asynchronous

The design of the FFT processor is sono reason that the LS-systems have totems could also be broken down into be of the GALS type, or even comple

9.6 Design for Test

DFT, or design for test, is increasing size of designs. Adding design for tesbe interesting later on in the project bdone.

9.7 Twiddle Factor Memory R

A twiddle factor memory reduction isaccomplished is reducing the ROM mtion of the ROM memories will decrewill probably increase the power consROM have to be changed in some wators.51

nous Parts

teresting. The high level design of It could therefor be interesting toese kinds of circuits all the wayesign on that level will be as easyat this FFT processor could be a design, that could be used in a

Design

far of the GALS type, but there is be kept synchronous. These sys-smaller systems, which also couldtely asynchronous.

in importance due to the increasingt methodology to the project couldefore a VLSI layout is going to be

eduction

possible. The reduction that can beemories by a factor of 8. The reduc-ase the die size of the design, butumption because the factors in the

y to match the previous twiddle fac-

9 Future Work

52

9.8 Commutators Implemented with RAM

The current implementation of the commutators uses a FIFO bufferimplementation. To use a RAM memory to implement the FIFO bufferswill reduce the power consumption.

9.9 Unscrambler

An unscrambler could be a useful blorequired to get the output data in natubler is needed. In the current solutionorder.ck to implement. It is not alwaysral order, but when it is an unscram- the output arrives in bit-reversed

1010.1 Conclusions

The goal with this masters thesis prodesign of FFT processors, and designThe processor has been designed anddesign steps have been used successfu

In the end of the project the design tuleading to more interesting problemsbeginning. The design of the GALS Fsimple design methodology for this a

Finally the next section will go througthe project, to check that they are fulfi

10.2 Follow-up of Requiremen

The Follow-up of Requirements wistated in chapter 1 Introduction.1. The transform length shall be able

in powers-of-2.This requirement is fulfilled. The rments, the implemented architectupower-of-2 number, not only the o

2. The input signal shall be a continu Summary

ject was to learn more about theing a parameterized FFT processor. a design methodology with smalllly.

rned into an asynchronous design,and solutions not expected from theFT taught a lot in this area, and a

rea has also been developed.

h the requirement stated early inlled.

ts

ll go through the requirements

to vary between 1k and 8k samples

esult is even better than the require-re handles any frame length of anes between 1k and 8k.ous data stream.53

10 Summary

54

This requirement is fulfilled. The implemented architecture handles acontinuous data stream in natural order, and the output is received inbit-reversed order.

3. The input signal shall consist of only one continuous data stream.This requirement is fulfilled. The implemented architecture handlesone and only one data stream.

4. The word length of the input and oable. The internal word lengths shaThis requirement is fulfilled. The ipletely parameterized. Word lengthtipliers, as well as input and outpu

5. Safe scaling shall be used.This requirement is fulfilled. The iscaling, using a scaling factor of 2

6. Data shall be represented with twoThis requirement is fulfilled. Data architecture uses fractional twos c

7. The implemented architecture shalThis requirement is fulfilled. The ilining. Each butterfly and each comline. More pipelining stages can ofutput signal shall be parameteriz-ll also be parameterizable.

mplemented architecture is com-s in butterfly stages and in all mul-

t, can be specified.

mplemented architecture uses safe.s complement format.representation in the implementedomplement representation.l be pipelined.mplemented architecture uses pipe-plex multiplier is a step in the pipe- course be inserted, if needed.

11 Bi[1] Torbjrn Widhe. Efficient Implem

Elements. Thesis No. 619, DeparLinkping University, Sweden. 2

[2] Sousheng He, and Mats TorkelssoFFT Processor. Department of AUniversity, SWEDEN.

[3] Jens Muttersbach, et. al. GloballyLocally-Synchronous ArchitecturOn-Chip Systems. Integrated SysInstitute of Technology, Zrich, S

[4] Bevan M. Baas. An Approach to LProcessor design. Dissertation DeStanford, USA. 1999.

[5] Shigenori Shimizu. Multi-Pipelinetransactions of the IEICE, vol. E bliographyentation of FFT Processingtment of Electrical Engineering,002.

n. A New Approach to Pipelinepplied Electronics, Lund

-Asynchronouses to Simplify the Design oftems Laboratory, Swiss Federalwitzerland. 1999.

ow-Power, High-Performance FFTpartment of Electrical Engineering,

FFT Architecture. The70, no. 6 June 1987.55

11 Bibliography

56

P svenska

Detta dokument hlls tillgngligt p Internet eller dess framtida ersttare under en lngre tid frn publiceringsdatum under frutsttning att inga extra-ordinra omstndigheter uppstr.

Tillgng till dokumentet innebr tillstnd fr var och en att lsa, ladda ner,skriva ut enstaka kopior fr enskilt bruk och att anvnda det ofrndrat fr ick-ekommersiell forskning och fr undervisning. verfring av upphovsrtten viden senare tidpunkt kan inte upphva detta tillstnd. All annan anvndning avdokumentet krver upphovsmannens medgivande. Fr att garantera ktheten,skerheten och tillgngligheten finns det lsningar av teknisk och administrativart.Upphovsmannens ideella rtt innefattar rtt att bli nmnd som upphovsman i denomfattning som god sed krver vid anvndning av dokumentet p ovan beskrivnastt samt skydd mot att dokumentet ndras eller presenteras i sdan form eller isdant sammanhang som r krnkande fr upphovsmannens litterra eller konst-nrliga anseende eller egenart.

Fr ytterligare information om Linkping University Electronic Press se fr-lagets hemsida http://www.ep.liu.se/

In English

The publishers will keep this document online on the Internet - or its possiblereplacement - for a considerable time from the date of publication barring excep-tional circumstances.

The online availability of the document implies a permanent permission foranyone to read, to download, to print out single copies for your own use and touse it unchanged for any non-commercial research and educational purpose. Sub-sequent transfers of copyright cannot revoke this permission. All other uses ofthe document are conditional on the consent of the copyright owner. The pub-lisher has taken technical and administrative measures to assure authenticity,security and accessibility.

According to intellectual property law the author has the right to be men-tioned when his/her work is accessed as described above and to be protectedagainst infringement.

For additional information about the Linkping University Electronic Pressand its procedures for publication and for assurance of document integrity, pleaserefer to its WWW home page: http://www.ep.liu.se/

Jonas Claeson

AbstractAcknowledgementsTerminologyNotationTable of Contents1 Introduction1.1 General1.2 Scope of the Report1.3 Project Requirements1.4 Reading Instructions

2 Algorithms2.1 Introduction2.2 The DFT Algorithm2.3 FFT Algorithms2.4 Common Factor Algorithms2.5 Radix-2 Algorithm2.6 Radix-r Algorithm2.7 Split Radix Algorithm2.8 Mixed Radix Algorithm2.9 Prime Factor Algorithms2.10 Radix-r Butterflies

3 Architectures3.1 Introduction3.2 Array Architectures3.3 Column Architectures3.4 Pipelined Architectures3.4.1 MDC, SDF and SDC Commutators3.4.2 Pipeline Architecture Comparisons

3.5 Multipipelined Architectures3.6 SIC FFT Architectures3.7 Cached-FFT Archite

project report1.pdf

Documents

theters thesis project

long fft transforms

final implementation

ju pipelined fftssoris

g anders thesis project

long fft transors

design methodology

kinds of systems