Automatic Generation of Customized Discrete Fourier Transform IPs. Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel, Carnegie Mellon University. This project is supported in part by NSF awards ITR/NGS-0325687 and SYS-0310941 and a DARPA DESA program. www.spiral.net

Post on 22-Dec-2015


Page 1: Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University

Automatic Generation of Customized Discrete Fourier Transform IPs

Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel

Carnegie Mellon University

This project is supported in part by NSF awards ITR/NGS-0325687 and SYS-0310941 and a DARPA DESA program

www.spiral.net


Slide 2

The Paradox of Reusable IPs

Boon to productivity
  zero effort required
  zero knowledge required
  zero chance to introduce new bugs

Why repeat what has already been done?

Bane to optimality
  finding the right functionality with the right interface
  design tradeoffs: performance, area, power, accuracy, ...

Are you getting what you really wanted?

Solution: parameterized automatic IP generators
  zero effort, knowledge or bugs
  allows application-specific customization
  facilitates design exploration


Slide 3

Our Work: Discrete Fourier Transform IPs

Discrete Fourier Transform (DFT)
  important building block in DSP applications
  numerous design “cores” available

Current IP libraries support:
  various sizes, number formats, data orderings
  only a small number of microarchitecture choices (Xilinx LogiCore DFT gives 3 choices)

We generate IPs with custom design tradeoffs
  degree of parallelism in microarchitecture (min to max)
  resource preference (e.g. BRAM vs. slices in FPGAs)

Extensible to other common linear DSP transforms


Slide 4

Outline

Introduction
Formula-Driven Design Generation (current section)
Microarchitecture Parameterization
Generator User Interface
Experimental Results
Conclusions


Slide 5

Transforms as Formulas [www.spiral.net]

Transform computation is represented as matrix-vector multiplication

Matrix-vector multiplication is O(n^2) operations

“Fast” algorithms factor the transform into a sequence of structured sparse matrices

O(n log n) operations

DFT: [formula not preserved in this transcript]

FFT: [formula not preserved in this transcript]

Datapath easily formed from factorized formulas
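As a rough illustration of the contrast on this slide, the following Python sketch computes the same transform two ways: as a dense matrix-vector product (about n^2 multiply-adds) and through a recursive radix-2 Cooley-Tukey factorization (O(n log n)). The function names and the sign convention w_n = exp(-2*pi*i/n) are assumptions made for this example; they are not taken from the slides.

```python
import cmath

def dft_matrix_vector(x):
    """Direct DFT: y[k] = sum_l x[l] * w^(k*l), about n^2 multiply-adds."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[l] * w ** (k * l) for l in range(n)) for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # n/2-point transform of even-indexed inputs
    odd = fft_radix2(x[1::2])    # n/2-point transform of odd-indexed inputs
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def max_error(a, b):
    """Largest element-wise difference between two sequences."""
    return max(abs(u - v) for u, v in zip(a, b))

x = [complex(i % 3, -i) for i in range(8)]
err = max_error(dft_matrix_vector(x), fft_radix2(x))
```

Both functions return the spectrum in natural order, so `err` should be at the level of floating-point rounding.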


Slide 6

Formula to Datapath

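Given a formula, each construct maps to a datapath element:
  a product of two matrices: apply the right factor, then the left factor
  a permutation matrix: permute
  a tensor product with an identity matrix: apply the block several times in parallel
  a diagonal matrix: scale

[Figure: example datapaths built from blocks A and B, with scaling by ×4, ×2, ×7, ×8]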
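The mapping from formula constructs to datapath elements can be sketched in Python by treating each construct as a function on a list of samples. This is only an illustration under stated assumptions: the helper names (`compose`, `stride_perm`, `parallel`, `diagonal`) are hypothetical, and the DFT_4 factorization used is the standard radix-2 Cooley-Tukey rule, which may differ in detail from the formula shown on the slide.

```python
import cmath

def compose(*ops):
    """Matrix product A B C: apply the rightmost operator first."""
    def run(x):
        for op in reversed(ops):
            x = op(x)
        return x
    return run

def stride_perm(n, s):
    """Permutation: gather the n inputs at stride s."""
    return lambda x: [x[i] for j in range(s) for i in range(j, n, s)]

def parallel(k, block, m):
    """Tensor product I_k (x) A: k copies of an m-point block side by side."""
    return lambda x: [v for b in range(k) for v in block(x[b * m:(b + 1) * m])]

def diagonal(d):
    """Diagonal matrix: scale element i by d[i]."""
    return lambda x: [c * v for c, v in zip(d, x)]

def dft2(x):
    """2-point butterfly."""
    return [x[0] + x[1], x[0] - x[1]]

w4 = cmath.exp(-2j * cmath.pi / 4)

# DFT_4 = (DFT_2 (x) I_2) T (I_2 (x) DFT_2) L, where (DFT_2 (x) I_2)
# is rewritten as L (I_2 (x) DFT_2) L with L = stride_perm(4, 2).
dft4 = compose(
    stride_perm(4, 2), parallel(2, dft2, 2), stride_perm(4, 2),  # DFT_2 (x) I_2
    diagonal([1, 1, 1, w4]),                                     # twiddle scaling
    parallel(2, dft2, 2),                                        # I_2 (x) DFT_2
    stride_perm(4, 2),                                           # input permutation
)

def dft_direct(x):
    """Reference DFT by definition."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

x = [1 + 2j, -1j, 3, 0.5 - 1j]
err = max(abs(a - b) for a, b in zip(dft4(x), dft_direct(x)))
```

Reading `compose` from its last argument to its first gives the order in which data flows through the hardware, which is why a formula applied right to left yields a datapath built left to right.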


Slide 7

Outline

Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization (current section)
Generator User Interface
Experimental Results
Conclusions


Slide 8

Simple regular structure embodied in formula

Example: Pease DFT

[Formula not preserved in this transcript. Annotations on the formula: diagonal, permutation, butterfly, parallel, k stages (stage 1, stage 2, stage 3).]


Slide 9

Pease DFT Example: DFT8

[Figure: DFT8 datapath drawn as three columns (stage 1, stage 2, stage 3), each containing multipliers (x)]

(formula is applied from right to left)

(datapath is built left to right)

Repeating column structure: hardware reuse without performance penalty
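The repeating column structure can be illustrated with a constant-geometry radix-2 FFT, in which every stage reads and writes data with the same wiring and only the twiddle factors change. The sketch below is written in the spirit of the Pease algorithm; because the slide's formula was not preserved, the exact placement of the permutation and diagonal here is an assumption and may not match the authors' factorization. The output comes out in bit-reversed order.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def constant_geometry_fft(x):
    """Radix-2 FFT whose stages all share one wiring pattern.
    len(x) must be a power of two; output is in bit-reversed order."""
    n = len(x)
    stages = n.bit_length() - 1
    for s in range(stages):
        y = [0j] * n
        for q in range(n // 2):
            a, b = x[q], x[q + n // 2]        # same input pairing in every stage
            w = cmath.exp(-2j * cmath.pi * ((q >> s) << s) / n)
            y[2 * q] = a + b                  # same output positions in every stage
            y[2 * q + 1] = (a - b) * w        # only the twiddle depends on s
        x = y
    return x

def dft_direct(x):
    """Reference DFT by definition, natural output order."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

x = [complex(i, 2 - i) for i in range(8)]
spectrum = constant_geometry_fft(x)
reference = dft_direct(x)
err = max(abs(spectrum[i] - reference[bit_reverse(i, 3)]) for i in range(8))
```

Since the inner loop body is identical for every value of `s` apart from the twiddle, a single hardware column can serve all stages.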


Slide 10

Horizontal folding

[Figure: a single column of multipliers (x) reused for all stages, with an input bypass and a register; p marks the vertical parallelism]

Our baseline design

Degree of freedom: vertical parallelism, parameter p


Slide 11

Vertical (V-)folding according to p

[Figure: cost versus latency]

Fine-grained control over cost/latency tradeoff
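A simplified software model can make the two folding dimensions concrete. In the sketch below, horizontal folding corresponds to reusing one stage for all log2(n) passes, and vertical folding corresponds to processing only p butterflies per cycle within a pass. The cycle count is an idealized assumption for illustration: it ignores pipeline depth, memory access, and permutation cost, so it is not a statement about the generated hardware's actual latency.

```python
import cmath

def folded_fft(x, p):
    """Constant-geometry radix-2 FFT with p butterflies per modeled cycle.
    Returns (spectrum in bit-reversed order, idealized cycle count)."""
    n = len(x)
    stages = n.bit_length() - 1
    if (n // 2) % p != 0:
        raise ValueError("p must divide n/2")
    cycles = 0
    for s in range(stages):                    # horizontal folding: one stage, reused
        y = [0j] * n
        for start in range(0, n // 2, p):      # vertical folding: p butterflies at a time
            for q in range(start, start + p):
                a, b = x[q], x[q + n // 2]
                w = cmath.exp(-2j * cmath.pi * ((q >> s) << s) / n)
                y[2 * q] = a + b
                y[2 * q + 1] = (a - b) * w
            cycles += 1
        x = y
    return x, cycles

x = [complex(i, 1 - i) for i in range(8)]
results = {p: folded_fft(x, p) for p in (1, 2, 4)}
cycle_counts = {p: results[p][1] for p in results}
# cycle_counts == {1: 12, 2: 6, 4: 3}
```

Under this model, doubling p doubles the number of butterflies (cost) and halves the cycle count (latency), while the computed spectrum stays the same.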


Slide 12

Outline

Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization
Generator User Interface (current section)
Experimental Results
Conclusions


Slide 13

User Interface

http://www.spiral.net/hardware/dftgen.html

[Screenshot of the web interface, with callouts: common DFT options; customization options]
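The individual form fields are not listed in this transcript. Drawing only on the parameters named elsewhere in the talk (transform size n, data width, output ordering, parallelism p, and a slice/BRAM preference), a request to such a generator could be modeled as below. The class name, field names, and validation rules are all hypothetical and do not describe the actual web form.

```python
from dataclasses import dataclass

# Hypothetical labels for the three resource preferences shown in the results.
RESOURCE_PREFERENCES = ("min_slice", "min_bram", "balanced")

@dataclass(frozen=True)
class DftGeneratorRequest:
    """Illustrative parameter set; not the real generator's interface."""
    n: int                               # transform size
    width: int                           # data width in bits
    p: int                               # degree of vertical parallelism
    resource_preference: str = "balanced"
    bit_reversed_output: bool = True

    def __post_init__(self):
        # Assumed constraints: sizes and parallelism are powers of two.
        if self.n < 2 or self.n & (self.n - 1):
            raise ValueError("n must be a power of two")
        if self.p < 1 or self.p & (self.p - 1) or self.p > self.n // 2:
            raise ValueError("p must be a power of two between 1 and n/2")
        if self.width < 1:
            raise ValueError("width must be positive")
        if self.resource_preference not in RESOURCE_PREFERENCES:
            raise ValueError("unknown resource preference")

request = DftGeneratorRequest(n=1024, width=16, p=4, resource_preference="min_bram")
```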


Slide 14

Outline

Introduction
Formula-Driven Design Generation
Microarchitecture Parameterization
Generator User Interface
Experimental Results (current section)
Conclusions


Slide 15

Evaluation

We compare against Xilinx LogiCore DFT Ver. 3.1 (radix-4, burst I/O interface)

                            Xilinx                           SPIRAL
datapath                    fixed, one radix-4 basic block   variable, p radix-2 basic blocks
cost-performance tradeoff   fixed                            user-controlled, varies with p

We compare Xilinx’s fixed design against our variable generated designs

Comparison
  DFT n = {64, 1024, 2048}; width = 16; bit-reversed output
  Xilinx ISE ver. 6.1, Xilinx Virtex2-Pro XC2VP100-6


Slide 16

DFT1024 relative to Xilinx

[Figure: three plots versus p = 1, 2, 4, 8, 16, 32: relative slices (logic), relative BRAMs (storage), and speedup (performance). Series: Min Slice, Min BRAM, Balanced, and the Xilinx reference.]

Baseline (1.0): 1955 slices, 7 BRAMs, 1 transform / 5.6 µsec

Performance and resources scale with p
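The plots report every design relative to the Xilinx core, so a value of 1.0 means "same as Xilinx". A minimal sketch of that normalization follows, using only the DFT1024 baseline figures stated on the slide (1955 slices, 7 BRAMs, 5.6 µsec per transform). The class and function names are hypothetical, and no measured values for the generated designs are assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DesignPoint:
    slices: int               # logic
    brams: int                # storage
    transform_time_us: float  # time per transform, in microseconds

# Baseline for DFT1024 as stated on the slide.
XILINX_DFT1024 = DesignPoint(slices=1955, brams=7, transform_time_us=5.6)

def relative_to(design, baseline):
    """Normalize a design so that the baseline is 1.0 on every axis.
    Speedup is the ratio of throughputs, i.e. baseline time / design time."""
    return {
        "relative_slices": design.slices / baseline.slices,
        "relative_brams": design.brams / baseline.brams,
        "speedup": baseline.transform_time_us / design.transform_time_us,
    }

# Normalizing the baseline against itself gives 1.0 on all three axes.
reference = relative_to(XILINX_DFT1024, XILINX_DFT1024)
```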


Slide 17

Resource usage preferences

[Figure: three plots versus p = 1, 2, 4, 8, 16, 32: relative slices (logic), relative BRAMs (storage), and speedup (performance). Series: Min Slice, Min BRAM, Balanced, and the Xilinx reference.]

Baseline (1.0): 1955 slices, 7 BRAMs, 1 transform / 5.6 µsec


Slide 18

Resource usage preferences

[Figure: three plots versus p = 1, 2, 4, 8, 16, 32: relative slices (logic), relative BRAMs (storage), and speedup (performance). Series: Min Slice, Min BRAM, Balanced, and the Xilinx reference. Annotation: exchange BRAM for slices, very little change in performance.]

Baseline (1.0): 1955 slices, 7 BRAMs, 1 transform / 5.6 µsec

Can control tradeoff between slices and BRAMs


Slide 19

DFT64 and DFT2048

[Figure: for each of the two sizes, three plots versus p = 1, 2, 4, 8, 16, 32: relative slices, relative BRAMs, and speedup. Series: Min Slice, Min BRAM, Balanced, and the Xilinx reference.]

2048: baseline (1.0) = 2140 slices, 7 BRAMs, 1 transform / 24.578 µsec

64: baseline (1.0) = 1743 slices, 8 BRAMs, 1 transform / 0.648 µsec

Trends hold for sizes 64, 2048


Slide 20

Related Work

Kumhom, Johnson, Nagvajara, ASIC/SOC 2000
  universal FFT processor microarchitecture based on processing elements interconnected by on-chip reconfigurable network
  microarchitecture is scalable in the number of elements
  supports both Cooley-Tukey and Pease

Choi, Scrofano, Prasanna, Jang, FPGA’2003
  mapped radix-4 Cooley-Tukey algorithm onto log2(n)/2 DFT4 primitives
  scalable datapath between 1 element and 4 elements at a time
  show energy and performance improvements from scaling


Slide 21

Conclusions

Parameterized DFT IP generator
  matrix formula-driven synthesis
  performance/cost tradeoff: fine-grained control over resources vs. latency
  resource usage preference: can balance tradeoff between slices and BRAM

Key results
  efficient: the Xilinx design point can be matched
  customizable: design tradeoffs directly controllable
  easy to use: simple yet powerful web interface


Slide 22

Web Generator

This work is part of the SPIRAL project, which aims to push the limits of automation in software and hardware development for DSP algorithms. For more information visit: www.spiral.net

http://www.spiral.net/hardware/dftgen.html