automatic generation of customized discrete fourier transform ips grace nordin, peter a. milder,...
Post on 22-Dec-2015
Automatic Generation of Customized Discrete Fourier Transform IPs
Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel
Carnegie Mellon University
This project is supported in part by NSF awards ITR/NGS-0325687 and SYS-0310941 and a DARPA DESA program.
www.spiral.net
Slide 2
The Paradox of Reusable IPs
Boon to productivity
  zero effort required
  zero knowledge required
  zero chance to introduce new bugs
Why repeat what has already been done?
Bane to optimality
  finding the right functionality with the right interface
  design tradeoff: performance, area, power, accuracy, ...
Are you getting what you really wanted?
Solution: parameterized automatic IP generators
  zero effort, knowledge or bugs
  allows application-specific customization
  facilitates design exploration
Slide 3
Our Work: Discrete Fourier Transform IPs
Discrete Fourier Transform (DFT)
  important building block in DSP applications
  numerous design “cores” available
Current IP libraries support:
  various sizes, number formats, data orderings
  only a small number of microarchitecture choices (Xilinx LogiCore DFT gives 3 choices)
We generate IPs with custom design tradeoffs
  degree of parallelism in microarchitecture (min to max)
  resource preference (e.g. BRAM vs. slices in FPGAs)
Extensible to other common linear DSP transforms
Slide 4
Outline
Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization
Generator User Interface
Experimental Results
Conclusions
Slide 5
Transforms as Formulas [www.spiral.net]
Transform computation is represented as matrix-vector multiplication
Matrix-vector multiplication is O(n^2) operations
“Fast” algorithms factor the transform into a sequence of structured sparse matrices
O(n log n) operations
DFT:
FFT:
Datapath easily formed from factorized formulas
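For reference, the DFT and FFT labels correspond to the following standard textbook forms. The notation here (stride permutation L, twiddle diagonal T) is an assumption borrowed from common usage; the slide's exact symbols may differ.

```latex
% DFT as a dense matrix-vector product: O(n^2) operations
y = \mathrm{DFT}_n \, x, \qquad
\mathrm{DFT}_n = \left[ \omega_n^{k\ell} \right]_{0 \le k,\ell < n}, \qquad
\omega_n = e^{-2\pi i / n}

% Cooley-Tukey FFT: one factorization step into structured sparse matrices
\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n) \; T^{mn}_n \; (I_m \otimes \mathrm{DFT}_n) \; L^{mn}_m
```

Here L^{mn}_m is a permutation that reads its input at stride m, T^{mn}_n is a diagonal matrix of twiddle factors \omega_{mn}^{ij}, and applying the step recursively to the smaller DFTs gives the O(n log n) operation count.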
Slide 6
Formula to Datapath
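Given y = Mx, where M is:
  a product A·B: apply B, then A
  a permutation: permute
  a tensor product I_k ⊗ A: apply A, k times in parallel
  a diagonal: scale
[Figure: datapath fragment for each case: block B feeding block A, wiring, stacked copies of A, constant multipliers]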
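The mapping from formula constructs to datapath pieces can be sketched in software, with each construct as a function on a vector. This is a minimal illustration, not the generator's code: the helper names (`compose`, `permute`, `parallel`, `scale`, `stride_perm`, `twiddles`, `cooley_tukey`) are hypothetical, and the Cooley-Tukey step is used only as a familiar example of a factorized formula.

```python
import cmath

def dft(n):
    """Size-n DFT computed directly from the definition (O(n^2))."""
    def apply(x):
        return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
                for k in range(n)]
    return apply

def compose(*stages):
    """Product A * B * ...: the rightmost factor is applied first."""
    def apply(x):
        for stage in reversed(stages):
            x = stage(x)
        return x
    return apply

def permute(perm):
    """Permutation: output position r takes input position perm[r] (wiring only)."""
    return lambda x: [x[p] for p in perm]

def parallel(k, block, m):
    """I_k (x) A: k copies of the size-m block A, each on its own chunk of the input."""
    return lambda x: [y for c in range(k) for y in block(x[c * m:(c + 1) * m])]

def scale(d):
    """Diagonal: each element is multiplied by its own constant."""
    return lambda x: [di * xi for di, xi in zip(d, x)]

def stride_perm(m, n):
    """Stride permutation on m*n points: output i*n + j takes input j*m + i."""
    return [j * m + i for i in range(m) for j in range(n)]

def twiddles(m, n):
    """Twiddle constants w^(i*j), with w the primitive (m*n)-th root of unity."""
    return [cmath.exp(-2j * cmath.pi * i * j / (m * n))
            for i in range(m) for j in range(n)]

def cooley_tukey(m, n):
    """DFT_{mn} = (DFT_m (x) I_n) * T * (I_m (x) DFT_n) * L, read right to left.
    DFT_m (x) I_n is expressed as a parallel block between two stride permutations."""
    return compose(
        permute(stride_perm(m, n)), parallel(n, dft(m), m), permute(stride_perm(n, m)),
        scale(twiddles(m, n)),
        parallel(m, dft(n), n),
        permute(stride_perm(m, n)),
    )

if __name__ == "__main__":
    x = [complex(i, -i) for i in range(8)]
    err = max(abs(a - b) for a, b in zip(cooley_tukey(2, 4)(x), dft(8)(x)))
    print("max difference from direct DFT:", err)
```

Every stage in `cooley_tukey` is one of the four constructs, so the same walk over the formula that evaluates it in software could instead emit a hardware block per stage.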
Slide 7
Outline
Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization
Generator User Interface
Experimental Results
Conclusions
Slide 8
Simple regular structure embodied in formula
Example: Pease DFT
[Formula figure, annotated: k stages (stage 1, stage 2, stage 3), each made of a diagonal, a permutation, and butterflies in parallel]
Slide 9
Pease DFT Example: DFT8
[Figure: DFT8 datapath with multipliers and butterflies arranged in three identical columns: stage 1, stage 2, stage 3]
(formula is applied from right to left)
(datapath is built left to right)
Repeating column structure: hardware reuse without performance penalty
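The repeating column structure can be seen in a software sketch of a constant-geometry radix-2 FFT. This is one common formulation (bit-reversed input, natural-order output) and not necessarily the exact factorization shown on the slide; the function names are hypothetical. Every stage reads pairs (2i, 2i+1) and writes to (i, i + n/2), so only the twiddle constants change from stage to stage.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def pease_stage(a, s, k):
    """One datapath column: diagonal, n/2 butterflies, fixed output wiring.
    The wiring is identical for every stage s; only the constants w differ."""
    n = len(a)
    half = n // 2
    shift = k - 1 - s
    out = [0j] * n
    for i in range(half):  # the n/2 butterflies are independent of each other
        w = cmath.exp(-2j * cmath.pi * ((i >> shift) << shift) / n)
        u = a[2 * i]
        v = w * a[2 * i + 1]   # diagonal: scale the second input of the pair
        out[i] = u + v         # butterfly sum goes to the top half
        out[i + half] = u - v  # butterfly difference goes to the bottom half
    return out

def pease_fft(x):
    """Constant-geometry FFT of a power-of-two-length sequence."""
    n = len(x)
    k = n.bit_length() - 1
    assert n >= 1 and n == 1 << k, "length must be a power of two"
    a = [x[bit_reverse(i, k)] for i in range(n)]  # initial bit-reversal permutation
    for s in range(k):  # the same column is applied k times
        a = pease_stage(a, s, k)
    return a
```

Because `pease_stage` has the same data movement for every `s`, a hardware version can build the column once and iterate over it, which is the reuse the slide refers to.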
Slide 10
Horizontal folding
[Figure: the repeated column folded into a single stage with an input bypass and a register, processing p values at a time]
our baseline design
degree of freedom: vertical parallelism, parameter p
Slide 11
Vertical (V-)folding according to p
[Figure: cost versus latency for designs V-folded according to p]
Fine-grained control over cost/latency tradeoff
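A back-of-envelope model shows why p gives this kind of control. It assumes a radix-2 algorithm with (n/2)·log2(n) butterflies and p butterfly blocks that each complete one butterfly per cycle. Pipeline depth, memory access, and I/O are ignored, so this is not the generator's actual cost model, and the function names are hypothetical.

```python
def butterfly_count(n):
    """A radix-2 FFT of size n = 2^k performs (n/2) * k butterflies."""
    k = n.bit_length() - 1
    assert n >= 2 and n == 1 << k, "n must be a power of two"
    return (n // 2) * k

def folded_cycles(n, p):
    """Idealized cycles to push all butterflies through p parallel blocks."""
    assert 1 <= p <= n // 2, "p cannot exceed the butterflies in one stage"
    return -(-butterfly_count(n) // p)  # ceiling division

if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16, 32):
        print(f"p = {p:2d}: {p:2d} butterfly blocks, {folded_cycles(1024, p)} cycles")
```

In this idealized model, each doubling of p doubles the number of butterfly blocks and halves the cycle count, which is the cost/latency tradeoff the parameter exposes.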
Slide 12
Outline
Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization
Generator User Interface
Experimental Results
Conclusions
Slide 13
User Interface
http://www.spiral.net/hardware/dftgen.html
[Screenshot of the web generator, annotated: common DFT options; customization options]
Slide 14
Outline
Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization
Generator User Interface
Experimental Results
Conclusions
Slide 15
We compare Xilinx’s fixed design against our variable generated designs
Evaluation
  We compare against Xilinx LogiCore DFT Ver. 3.1: radix-4, burst I/O interface

                             Xilinx                           SPIRAL
  datapath                   fixed, one radix-4 basic block   variable, p radix-2 basic blocks
  cost-performance tradeoff  fixed                            user-controlled, varies with p

Comparison
  DFT n = {64, 1024, 2048}; width = 16; bit-reversed output
  Xilinx ISE ver. 6.1, Xilinx Virtex2-Pro XC2VP100-6
Slide 16
DFT1024 relative to Xilinx
Performance and resources scale with p
[Figure: three plots versus p = 1, 2, 4, 8, 16, 32: relative slices (logic), relative BRAMs (storage), speedup (performance). Series: Min Slice, Min BRAM, Balanced, Xilinx]
Normalization: 1.0 = 1955 slices; 1.0 = 7 BRAMs; 1.0 = 1 transform / 5.6 µsec
Slide 17
Resource usage preferences
[Figure: three plots versus p = 1, 2, 4, 8, 16, 32: relative slices (logic), relative BRAMs (storage), speedup (performance). Series: Min Slice, Min BRAM, Balanced, Xilinx]
Normalization: 1.0 = 1955 slices; 1.0 = 7 BRAMs; 1.0 = 1 transform / 5.6 µsec
Slide 18
Resource usage preferences
[Figure: three plots versus p = 1, 2, 4, 8, 16, 32: relative slices (logic), relative BRAMs (storage), speedup (performance). Series: Min Slice, Min BRAM, Balanced, Xilinx]
Can control tradeoff between slices and BRAMs
  exchange BRAM for slices
  very little change in performance
Normalization: 1.0 = 1955 slices; 1.0 = 7 BRAMs; 1.0 = 1 transform / 5.6 µsec
Slide 19
DFT64 and DFT2048
Trends hold for sizes 64, 2048
[Figure: for each size (64 and 2048), three plots versus p = 1, 2, 4, 8, 16, 32: relative slices, relative BRAMs, speedup. Series: Min Slice, Min BRAM, Balanced, Xilinx]
Normalization (2048): 1.0 = 2140 slices; 1.0 = 7 BRAMs; 1.0 = 1 transform / 24.578 µsec
Normalization (64): 1.0 = 1743 slices; 1.0 = 8 BRAMs; 1.0 = 1 transform / 0.648 µsec
Slide 20
Related Work
Kumhom, Johnson, Nagvajara, ASIC/SOC 2000
  universal FFT processor microarchitecture based on processing elements interconnected by on-chip reconfigurable network
  microarchitecture is scalable in the number of elements
  supports both Cooley-Tukey and Pease
Choi, Scrofano, Prasanna, Jang, FPGA’2003
  mapped radix-4 Cooley-Tukey algorithm onto log2(n)/2 DFT4 primitives
  scalable datapath between 1 element and 4 elements at a time
  show energy and performance improvements from scaling
Slide 21
Conclusions
Parameterized DFT IP generator
  matrix formula-driven synthesis
  performance/cost tradeoff: fine-grained control over resources vs. latency
  resource usage preference: can balance tradeoff between slices and BRAM
Key results
  efficient: the Xilinx design point can be matched
  customizable: design tradeoffs directly controllable
  easy to use: simple yet powerful web interface
Slide 22
Web Generator
This work is part of the SPIRAL project, which aims to push the limits of automation in software and hardware development for DSP algorithms. For more information visit: www.spiral.net
http://www.spiral.net/hardware/dftgen.html