Field-Programmable Technology: Today’s and Tomorrow’s

Wayne Luk, Imperial College London. TWEPP-07, Prague, 4 September 2007.


Page 1: Field-Programmable Technology: Today’s and Tomorrow’s

Field-Programmable Technology: Today’s and Tomorrow’s

Wayne Luk, Imperial College London

TWEPP-07, Prague

4 September 2007

Imperial College London

April 2005

Page 2: Field-Programmable Technology: Today’s and Tomorrow’s

Outline: technology = devices + design

1. overview: motivation and vision
2. field-programmable devices: today
   - Xilinx Virtex-4, Virtex-5; Stretch S5
3. field-programmable design: today
   - enhance optimality and re-use
4. field-programmable devices: tomorrow
   - hybrid FPGA, die stacking
5. field-programmable design: tomorrow
   - guided synthesis, representation, upgradability
6. summary

Thanks to colleagues, students and collaborators from Imperial College, University of British Columbia, Chinese University of Hong Kong, University of Massachusetts Amherst, UK Engineering and Physical Sciences Research Council, Stretch, Xilinx

Page 3: Field-Programmable Technology: Today’s and Tomorrow’s

1. Motivation: good - Moore’s Law

• rising
  – capacity
  – speed
• falling
  – power/MHz
  – price

Source: Xilinx

Page 4: Field-Programmable Technology: Today’s and Tomorrow’s

Not so good: design productivity gap

[Figure: logic transistors per chip (K) and productivity (transistors per staff-month) on logarithmic axes, 1981 to 2009. Complexity growth rate (Moore’s Law): 58%/yr compound. Productivity growth rate: 21%/yr compound.]

• challenge: reduce the design productivity gap

Source: SEMATECH
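The two compound rates on the chart imply how quickly the gap widens. The following is a small sketch of that arithmetic; it assumes both rates stay constant over time, and the function names are illustrative rather than taken from the talk.

```python
from math import log

COMPLEXITY_GROWTH = 0.58    # transistors per chip, per year (from the chart)
PRODUCTIVITY_GROWTH = 0.21  # transistors per staff-month, per year (from the chart)


def gap_factor(years):
    """Factor by which complexity outgrows productivity after `years` years."""
    return ((1 + COMPLEXITY_GROWTH) / (1 + PRODUCTIVITY_GROWTH)) ** years


def gap_doubling_time():
    """Years needed for the gap to double, assuming constant rates."""
    return log(2) / log(gap_factor(1))


yearly = gap_factor(1)
doubling = gap_doubling_time()
decade = gap_factor(10)
print(f"gap grows {yearly:.2f}x per year, doubles every {doubling:.1f} years")
```

Under these assumptions the staff effort needed to fill a chip grows by roughly 30% a year, or about 14 times over a decade.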

Page 5: Field-Programmable Technology: Today’s and Tomorrow’s

Our vision: unified synthesis + analysis

[Figure: design flow. Application description: C/Matlab/Premiere/Java/Progol. Steps: manual/automatic partition, compile, analysis, validate. The compiler produces machine code for the fixed processor, configuration information for the custom processor, and a run-time interface. Result: a custom computing system (CPUs + DSPs, FPGAs + sensors) with a system-specific architecture and a system-specific programming interface.]

Page 6: Field-Programmable Technology: Today’s and Tomorrow’s

2a. Devices today: Virtex-4 FPGA

[Figure labels:]
• 500MHz Flexible Soft Logic Architecture
• 500MHz Programmable DSP Execution Units
• 0.6-11.1Gbps Serial Transceivers
• 450MHz PowerPC™ Processors with Auxiliary Processor Unit
• 1Gbps Differential I/O
• 500MHz multi-port Distributed RAM
• 500MHz DCM Digital Clock Management

• fine-grain fabric + special function units
• challenge: use resources effectively

Source: Xilinx

Page 7: Field-Programmable Technology: Today’s and Tomorrow’s

2b. Virtex-5 FPGA

Source: Xilinx

• capacity
  – rises
• logic speed
  – rises
• I/O speed
  – rises
• all good news?

Page 8: Field-Programmable Technology: Today’s and Tomorrow’s

Growing gap: amount of gates vs I/O

Source: Xilinx

• I/O off-chip
  – serial
• I/O on-chip
  – scalable
  – flexible
  – easy to use
• interconnect
  – heterogeneous
  – customisable

Page 9: Field-Programmable Technology: Today’s and Tomorrow’s

2c. Software configurable engine: S5

Wide Register File (WRF)
• 32 Wide Registers (WR)
• 128-bit Wide Load/Store Unit
• 128-bit Load/Store
• Auto Increment/Decrement
• Immediate, Indirect, Circular
• Variable-byte Load/Store
• Variable-bit Load/Store

ISEF
• Instruction Specialization Fabric
• Compute Intensive
• Arbitrary Bit-width Operations
• 3 Inputs and 2 Outputs
• Pipelined, Bypassed, Interlocked
• Random Logic Support
• Internal State Registers

RISC Processor
• Tensilica – Xtensa V
• 32 KB I & D Cache
• On-Chip Memory, MMU
• 24 Channels of DMA, FPU

[Figure: S5 engine block diagram. Blocks: control, ALU, FPU, 32-bit RF, 128-bit WRF, ISEF (Instruction-Set Extension Fabric), data RAM 32KB, SRAM 256KB, D-cache 32KB, I-cache 32KB, MMU.]

Source: Stretch

Page 10: Field-Programmable Technology: Today’s and Tomorrow’s

3. Design today: overview

• structural or register-transfer level (RTL)
  – e.g. VHDL, Verilog
  – low-level, little automation, small designs
• behavioural, system-level descriptions
  – e.g. SystemC (public-domain: systemc.org)
  – MARTES: +UML for real-time embedded systems
• general-purpose software languages
  – e.g. C, Java; with hardware support: Handel-C
  – high-level, large automation, large designs
• special-purpose descriptions
  – e.g. System Generator (signal processing)
  – high-level, domain-specific optimisations

Page 11: Field-Programmable Technology: Today’s and Tomorrow’s

3a. Enhance optimality and re-use

• design optimality: quality
  – select algorithm and devices: meet requirements
  – mapping: regular to systolic, rest to processor
  – I/O: dictates on-chip parallelism, buffering schemes
  – control speed/area/power: pipelining, layout plan
  – partitioning: coarse vs fine grain logic and memory
• design re-use: productivity
  – separate aspects specific to application/technology
  – library of customisable components with trade-offs
  – compose and customise to meet requirements
  – uniform interface to memory and I/O: hide details
  – pre-verified parts: ease system verification

Page 12: Field-Programmable Technology: Today’s and Tomorrow’s

Example: systolic summation tree

• n-input adder
  – tree of (n-1) 2-input adders
• each adder
  – has k stages
  – each stage has s-bit adder
• figure shows
  – k = 3
  – s = 3
• high k, low s
  – less cycle time
  – more area
  – less power
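The structure described above can be modelled in software. The sketch below is an illustration under stated assumptions, not the hardware from the talk: `staged_add` splits one addition into k stages of s bits with a carry passed between stages, and `tree_sum` reduces n inputs with a balanced tree and counts the adders it uses. Both function names are hypothetical, and operands are assumed to be non-negative (k*s)-bit words.

```python
def staged_add(a, b, k, s):
    """Add two (k*s)-bit words using k stages, each an s-bit adder with carry-in."""
    mask = (1 << s) - 1
    carry, result = 0, 0
    for stage in range(k):
        shift = stage * s
        chunk = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (chunk & mask) << shift
        carry = chunk >> s
    return result  # the final carry is dropped, so the sum wraps modulo 2**(k*s)


def tree_sum(values, k, s):
    """Sum `values` with a balanced adder tree; returns (total, adders, levels)."""
    level, adders, levels = list(values), 0, 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(staged_add(level[i], level[i + 1], k, s))
            adders += 1
        if len(level) % 2:          # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], adders, levels


total, adders, levels = tree_sum([1, 2, 3, 4, 5, 6, 7, 8], k=3, s=3)
print(total, adders, levels)
```

With k = 3 and s = 3 as in the figure, each adder handles 9-bit words; eight inputs need seven adders arranged in three levels. In a pipelined implementation each level would add k cycles of latency, which is the price paid for the shorter cycle time.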

Page 13: Field-Programmable Technology: Today’s and Tomorrow’s

Finance application: value-at-risk

• sampling from multivariate Gaussian distribution
• DSP units: matrix multiplication
• systolic tree: accumulate result
• RC2000 board
  – Virtex-4 xc4vsx55
  – 400MHz

[Figure: block diagram. Blocks: Gaussian RNG, 64 DSPs, 48 DSPs, Delta Gamma accumulator (adder tree), control, RC2000 interface, and FIFOs crossing the IO clock domain.]

33 times faster than 2.2GHz quad Opteron (including all IO overheads, PC-FPGA communications, and using AMD optimised BLAS for software)
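One standard way to sample from a multivariate Gaussian is to multiply a vector of independent standard normal values by a factor A of the covariance matrix, with A Aᵀ = Σ, which is where a matrix multiplication enters. The sketch below uses a Cholesky factor and a delta-gamma approximation of the portfolio change; the slides do not say which factorisation or formula the FPGA design uses, so both are assumptions, and all names and values here are illustrative.

```python
import random


def cholesky(sigma):
    """Lower-triangular A with A * A^T == sigma (sigma symmetric positive definite)."""
    n = len(sigma)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(a[i][m] * a[j][m] for m in range(j))
            if i == j:
                a[i][j] = (sigma[i][i] - acc) ** 0.5
            else:
                a[i][j] = (sigma[i][j] - acc) / a[j][j]
    return a


def sample(a, rng):
    """One correlated sample x = A z, with z independent standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in a]
    return [sum(a[i][j] * z[j] for j in range(i + 1)) for i in range(len(a))]


def delta_gamma(x, delta, gamma):
    """Delta-gamma approximation: delta . x + 0.5 * x^T gamma x."""
    n = len(x)
    linear = sum(d * xi for d, xi in zip(delta, x))
    quadratic = sum(x[i] * gamma[i][j] * x[j] for i in range(n) for j in range(n))
    return linear + 0.5 * quadratic


SIGMA = [[4.0, 2.0], [2.0, 3.0]]            # hypothetical covariance matrix
A = cholesky(SIGMA)
X = sample(A, random.Random(0))
print(delta_gamma(X, [1.0, 1.0], [[2.0, 0.0], [0.0, 2.0]]))
```

Value-at-risk would then be estimated from a quantile of many such values; the sums inside `delta_gamma` are the kind of accumulation a summation tree can perform.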

Page 14: Field-Programmable Technology: Today’s and Tomorrow’s

3b. Stretch program development

• profile code
  – identify hotspots
• special instruction
  – implement ‘C’ functions in single instructions
  – bit-width optim.
• software compiler
  – generate instruction
  – schedule instruction
• multiple data (WR)
  – perform operations in parallel
• efficient data movement
  – intrinsic load store operations
  – 20+ DMA channels

[Figure: development flow. Application (C/C++) and instruction definition enter the compiler, which produces compiled machine code and, through automatic instruction generation, new instructions that tailor the ISEF to the application.]

Source: Stretch

Page 15: Field-Programmable Technology: Today’s and Tomorrow’s

4. Devices tomorrow: more diverse

[Figure: the Virtex-4 feature diagram (500MHz flexible soft logic architecture, 500MHz programmable DSP execution units, 0.6-11.1Gbps serial transceivers, 450MHz PowerPC™ processors with Auxiliary Processor Unit, 1Gbps differential I/O, 500MHz multi-port distributed RAM, 500MHz DCM digital clock management), annotated: replaced by other functional units, e.g. floating-point units.]

Source: Xilinx

Page 16: Field-Programmable Technology: Today’s and Tomorrow’s

4a. Hybrid FPGA: architecture

• most digital circuits
  – datapath: regular, word-based logic
  – control: irregular, bit-based logic
• hybrid FPGA
  – customised coarse-grained block: domain-specific requirements
  – fine-grained blocks: existing FPGA architecture
  – good match to computing applications for given domain

Page 17: Field-Programmable Technology: Today’s and Tomorrow’s

Coarse-grained fabric library

[Figure: coarse-grained block containing D units, U0 to U{D-1}: wordblocks (wb, bits 0 to N-1), floating-point multipliers (fpmul) and floating-point adder/subtractors (fpadd), each with control and status ports. The units are connected by M input buses, R output buses and F feedback registers through an output mux and a feedback mux, with a control signal input and a status flag output.]

D=9, M=4, R=3, F=3, 2 add, 2 mul: best density over benchmarks

Page 18: Field-Programmable Technology: Today’s and Tomorrow’s

Evaluation

• 6 benchmark circuits
  – digital signal processing kernels: e.g. bfly (for FFT)
  – linear algebra: e.g. matrix multiplication
  – complete application: e.g. bgm (financial model)
• circuits: partitioned to control + datapath
  – control: vendor tools to fine-grained units
  – datapath: manually map to coarse-grained units
• comparison
  – directly synthesized to Xilinx Virtex-II devices

Page 19: Field-Programmable Technology: Today’s and Tomorrow’s

Results

          Floating-point hybrid FPGA    XC2V3000-6                    Ratio (XC2V3000-6 / hybrid)
Circuit   Area (slices)  Delay (ns)     Area (slices)  Delay (ns)     Area (times)  Delay (times)
bfly      565            9.02           13733          24.57          24.3          2.72
dscg      661            10.11          9614           22.78          14.5          2.25
fir4      371            9.06           11290          23.68          30.4          2.61
mm3       642            8.90           8889           23.4           13.8          2.63
ode       545            9.74           8238           21.93          15.1          2.25
bgm       1810           10.00          30207          24.34          16.7          2.43

Geometric mean                                                        18.3          2.48
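The ratio columns and the geometric means can be recomputed from the area and delay figures. The short check below uses only the numbers in the results; the variable names are illustrative.

```python
from math import exp, log

# circuit: (hybrid area, hybrid delay ns, XC2V3000-6 area, XC2V3000-6 delay ns)
RESULTS = {
    "bfly": (565, 9.02, 13733, 24.57),
    "dscg": (661, 10.11, 9614, 22.78),
    "fir4": (371, 9.06, 11290, 23.68),
    "mm3": (642, 8.90, 8889, 23.4),
    "ode": (545, 9.74, 8238, 21.93),
    "bgm": (1810, 10.00, 30207, 24.34),
}


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


AREA_RATIO = {name: r[2] / r[0] for name, r in RESULTS.items()}
DELAY_RATIO = {name: r[3] / r[1] for name, r in RESULTS.items()}
GEO_AREA = geometric_mean(AREA_RATIO.values())
GEO_DELAY = geometric_mean(DELAY_RATIO.values())

for name in RESULTS:
    print(f"{name}: area {AREA_RATIO[name]:.1f}x, delay {DELAY_RATIO[name]:.2f}x")
print(f"geometric mean: area {GEO_AREA:.1f}x, delay {GEO_DELAY:.2f}x")
```

A geometric mean suits ratios like these because it treats a 2x gain and a 0.5x loss symmetrically, which an arithmetic mean would not.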

Page 20: Field-Programmable Technology: Today’s and Tomorrow’s


4b. On-chip memory bandwidth

• storage hierarchy: registers, LUT RAM, block RAM

• processor cache: address lack of I/O bandwidth

Source: Xilinx

Page 21: Field-Programmable Technology: Today’s and Tomorrow’s


High density die stacking

Source: Xilinx

Page 22: Field-Programmable Technology: Today’s and Tomorrow’s

Programmable circuit board

Source: Xilinx

• die stacking: 3D interchip connections
• customisable system-in-package: productivity gap?

Page 23: Field-Programmable Technology: Today’s and Tomorrow’s

5a. Design tomorrow: guided synthesis

• guided transformation of design descriptions
  – automate tedious and error-prone steps
  – applicable to various levels of abstraction
• focus: two timing models
  – strict timing model: cycle-accurate - efficiency
  – flexible timing model: behavioural - productivity
• combine cycle-accurate and behavioural models
  – rapid development with high quality
  – design maintainability and portability
• based on high-level language
  – library developer: provide optimised building blocks
  – application developer: customise building blocks

Page 24: Field-Programmable Technology: Today’s and Tomorrow’s

Timing models: strict vs flexible

  { delta = b*b - ((a*c) << 2);
    if (delta > 0) num_sol = 2;
    else if (delta == 0) num_sol = 1;
    else num_sol = 0; }

• strict timing (Handel-C): cycle-accurate model
  – total-ordering
  – resource-bound
• flexible timing: behavioural model
  – partial-ordering
  – abstract operations
• scheduling: flexible timing to strict timing
• unscheduling: strict timing to flexible timing

[Figure: the same code under both models. Strict timing: a datapath from b, a, c through two multipliers, a shift and a subtractor to delta, then comparisons with 0 and multiplexers (MUX) selecting num_sol from 2, 1, 0; the assignment to delta is marked "executed at cycle 1" and the assignment to num_sol "executed at cycle 2". Flexible timing: a data-flow graph of the operations *, *, << 2, -, >, == with true/false branches assigning num_sol.]
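The partial ordering of the behavioural model can be made concrete with a dependency graph. The sketch below is an illustration only: the operation names are invented for this example, and the level computed for each operation is simply the earliest point at which it could run if resources were unlimited.

```python
# Each operation maps to the operations whose results it needs (illustrative names).
DEPENDS_ON = {
    "mul_bb": [],                  # b * b
    "mul_ac": [],                  # a * c
    "shift": ["mul_ac"],           # (a * c) << 2
    "sub": ["mul_bb", "shift"],    # delta
    "compare": ["sub"],            # delta > 0, delta == 0
    "select": ["compare"],         # num_sol
}


def asap_levels(deps):
    """Earliest level of each operation: one more than its latest predecessor."""
    levels = {}

    def level(op):
        if op not in levels:
            levels[op] = 1 + max((level(p) for p in deps[op]), default=0)
        return levels[op]

    for op in deps:
        level(op)
    return levels


def num_solutions(a, b, c):
    """Reference behaviour of the code on the slide, for integer inputs."""
    delta = b * b - ((a * c) << 2)
    if delta > 0:
        return 2
    if delta == 0:
        return 1
    return 0


LEVELS = asap_levels(DEPENDS_ON)
print(LEVELS)
```

The two multiplications share level 1 because neither depends on the other; a strict-timing description has to pick one total order and a concrete cycle for each, which is the step scheduling automates.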

Page 25: Field-Programmable Technology: Today’s and Tomorrow’s

Rapid design: automated scheduling

• support combination of manual and automatic scheduling

Input after unscheduling (flexible timing):

  { delta = b*b - ((a*c) << 2);
    if (delta > 0) num_sol = 2;
    else if (delta == 0) num_sol = 1;
    else num_sol = 0; }

+ scheduling constraints

Output after synthesis (strict timing):

  par {
    // ================== [stage 1]
    pipe_mult[0].in(b,b);
    pipe_mult[1].in(a,c);
    // ================== [stage 8]
    tmp0 = pipe_mult[0].q;
    tmp1 = pipe_mult[1].q << 2;
    // ================== [stage 9]
    tmp2 = tmp0 - tmp1;
    // ================== [stage 10]
    if (tmp2 > 0) num_sol = 2;
    else if (tmp2 == 0) num_sol = 1;
    else num_sol = 0;
    delta = tmp2;
  }

[Figure: left, the flexible-timing data-flow graph of the input. Right, the synthesised pipeline: b, a, c feed pmult[0] and pmult[1] (stages 1-7); tmp0 and tmp1 = << 2 (stage 8); tmp2 by subtraction (stage 9); comparison with 0 and multiplexers (MUX) selecting num_sol and delta (stage 10).]

Page 26: Field-Programmable Technology: Today’s and Tomorrow’s

Maintainability: retarget design

Original design (strict timing), using two pipelined multipliers:

  par {
    // ================== [stage 1]
    pipe_mult[0].in(b,b);
    pipe_mult[1].in(a,c);
    // ================== [stage 8]
    tmp0 = pipe_mult[0].q;
    tmp1 = pipe_mult[1].q << 2;
    // ================== [stage 9]
    tmp2 = tmp0 - tmp1;
    // ================== [stage 10]
    if (tmp2 > 0) num_sol = 2;
    else if (tmp2 == 0) num_sol = 1;
    else num_sol = 0;
    delta = tmp2;
  }

Unscheduling (flexible timing) + scheduling constraints, then synthesis (strict timing), gives the retargeted design, which uses pipe_mult[0] for both products:

  par {
    { // ================= [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[0].in(a,c);
    }
    ....
  }

[Figure: left, the original pipeline with pmult[0] and pmult[1] (stages 1-7, 8, 9, 10). Middle, the flexible-timing data-flow graph. Right, the retargeted pipeline with pmult[0] used in cycle 1 and cycle 2 (stages 1-4, 5, 6).]
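Turning a partial order plus constraints into stage numbers can be sketched with a greedy list scheduler. This is an assumption-laden illustration, not the algorithm of the tool described in the talk: multipliers are modelled as fully pipelined units that each accept one new input per stage, every other operation takes one stage, and the operation names are invented.

```python
# (name, kind, dependencies), listed in dependency order; names are illustrative.
OPS = [
    ("mul_bb", "mul", []),
    ("mul_ac", "mul", []),
    ("read_bb", "alu", ["mul_bb"]),      # tmp0 = pipe_mult[0].q
    ("shift_ac", "alu", ["mul_ac"]),     # tmp1 = pipe_mult[1].q << 2
    ("sub", "alu", ["read_bb", "shift_ac"]),
    ("select", "alu", ["sub"]),
]


def schedule(ops, mul_units, mul_latency):
    """Greedy list scheduling; returns the start stage of every operation."""
    latency = {"mul": mul_latency, "alu": 1}
    start, kinds, issued = {}, {}, {}    # issued: stage -> multiplications started
    for name, kind, deps in ops:
        # Earliest stage at which all inputs are available.
        stage = max((start[d] + latency[kinds[d]] for d in deps), default=1)
        if kind == "mul":
            # Resource constraint: at most `mul_units` new inputs per stage.
            while issued.get(stage, 0) >= mul_units:
                stage += 1
            issued[stage] = issued.get(stage, 0) + 1
        start[name], kinds[name] = stage, kind
    return start


TWO_MULTIPLIERS = schedule(OPS, mul_units=2, mul_latency=7)
ONE_MULTIPLIER = schedule(OPS, mul_units=1, mul_latency=7)
print(TWO_MULTIPLIERS)
print(ONE_MULTIPLIER)
```

With two multipliers and a latency of seven stages, the sketch reproduces stages 1, 8, 9 and 10 of the original design. With a single multiplier the second product enters one stage later and everything after it shifts accordingly; the retargeted design on the slide uses a different multiplier pipeline, so its stage numbers are not reproduced here.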

Page 27: Field-Programmable Technology: Today’s and Tomorrow’s


Implementation

Page 28: Field-Programmable Technology: Today’s and Tomorrow’s

Automatically generated results

• ffd: free-form deformation; dct: discrete cosine transform

[Figure: two charts of the automatically generated results, labelled "with respect to smallest" and "with respect to software".]

Page 29: Field-Programmable Technology: Today’s and Tomorrow’s

5b. Data representation optimisation

[Figure: data-flow graph with inputs In1 to In5, adder, subtractor and multiplier nodes, internal nodes X and Y, and outputs Out1 and Out2.]

• In1..In5
  – known width
• Out1..Out2
  – width determines accuracy
  – defined by user
• find representation
  – minimise width of nodes, e.g. X, Y
• trade-off in speed, area, power, error

Page 30: Field-Programmable Technology: Today’s and Tomorrow’s

BitSize bit-width analysis system

[Figure: tool flow.
Frontend: input design descriptions (Xilinx System Generator, C/C++, ASC code, Handel-C) and user-specified design constraints.
Design database: annotated DFG.
Analysis: range analysis (interval analysis) and precision analysis (automatic differentiation), followed by bit-width determination and design selection for the floating-point design and for the fixed-point design.
Backend: output design descriptions (Xilinx System Generator, VHDL, A Stream Compiler (ASC) code, Handel-C).]

Fixed-point

- range: integer

- precision: fraction

Floating-point

- range: exponent

- precision: mantissa
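Range analysis by interval arithmetic can be sketched in a few lines. The example below is a minimal illustration and not the BitSize implementation: it propagates input ranges through two operations and derives the integer width each signal needs. The `Interval` class, the input ranges and the small graph are all assumptions made for this example, and precision analysis for the fraction bits is not covered.

```python
class Interval:
    """Closed range [lo, hi] of the values a signal can take."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))


def integer_bits(iv):
    """Smallest two's-complement width that holds every value in iv."""
    bits = 1
    while not (-(1 << (bits - 1)) <= iv.lo and iv.hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


in1 = Interval(0, 15)      # hypothetical input ranges
in2 = Interval(-8, 7)
x = in1 + in2              # internal node X
y = x * in2                # internal node Y
print(integer_bits(x), integer_bits(y))
```

Interval arithmetic is safe but can be pessimistic: `in2` feeds both operands of `y`, and the analysis treats the two uses as independent, so the computed range may be wider than the values that can actually occur.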

Page 31: Field-Programmable Technology: Today’s and Tomorrow’s

FIR filter and DFT: area vs error

[Figure: area (Virtex2 slices, 0 to 3000) against error percentage (0 to 25) for the DFT and the 4-tap FIR filter.]

1% more error: 65% less area

Page 32: Field-Programmable Technology: Today’s and Tomorrow’s

FIR filter and DFT: speed vs error

[Figure: speed (MHz, 20 to 90) against error percentage (0 to 25) for the DFT and the 4-tap FIR filter.]

2% more error: 20% higher speed
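The error side of this trade-off can be explored in software before any hardware is built. The sketch below quantises a 4-tap FIR filter to a given number of fraction bits and reports the worst output error as a percentage. The coefficients, the test signal and the error metric are assumptions chosen for this example; area and speed are not modelled.

```python
from math import sin

COEFFS = [0.1, 0.3, 0.4, 0.2]    # hypothetical 4-tap filter
# Unit impulse followed by a sine wave; every sample lies in [-1, 1].
SIGNAL = [1.0, 0.0, 0.0, 0.0] + [sin(0.37 * i) for i in range(60)]


def quantise(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def fir(coeffs, signal):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * signal[n - k]."""
    return [
        sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(signal))
    ]


def error_percentage(frac_bits):
    """Worst output error as a percentage of the largest exact output."""
    exact = fir(COEFFS, SIGNAL)
    approx = fir([quantise(c, frac_bits) for c in COEFFS],
                 [quantise(x, frac_bits) for x in SIGNAL])
    worst = max(abs(a - e) for a, e in zip(approx, exact))
    return 100.0 * worst / max(abs(e) for e in exact)


for bits in (4, 8, 12):
    print(bits, "fraction bits:", round(error_percentage(bits), 3), "% error")
```

Fewer fraction bits mean narrower multipliers and adders, which is where area and speed gains of the kind plotted above would come from.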

Page 33: Field-Programmable Technology: Today’s and Tomorrow’s

5c. Upgradable design

[Figure: number of new users against time (months, 10 to 30) for an upgradable design and a non-upgradable design, marking the initial release, upgrade 1 and upgrade 2.]

Source: Xilinx

• upgradability: minimise time-to-market, maximise time-in-market
• add new functions, fix bugs
• very rapid upgrade?

Page 34: Field-Programmable Technology: Today’s and Tomorrow’s

Dynamic upgrade: turbo coder

• error correction code: add redundancy
• need: fast, low-power, adapt to noise level
• recursive systematic convolutional (RSC) encoder, decoder, interleaver

[Figure: turbo encoder (two RSC encoders and an interleaver) sending data from In over a noisy communication channel to a turbo decoder (two decoders, interleavers, a de-interleaver and a decision block) producing Out.]

Source: Liang, Tessier, Goeckel

Page 35: Field-Programmable Technology: Today’s and Tomorrow’s

Self-tuning: run-time reconfiguration

• adapt: less channel noise, so lower power
• larger Nmax: better correction, more area/power
• sample channel noise every 250K bits
• find Signal to Noise Ratio (SNR), select Nmax
• if Nmax ≠ current Nmax, configure new bitstream
• configuration overhead: about 30ms, 54.8mW

[Figure: controller. The SNR indexes a LUT that returns Nmax; Nmax is compared (=?) with the current Nmax, and a mismatch triggers configuration of the FPGA.]

Source: Liang, Tessier, Goeckel
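The control loop above can be sketched in software. Everything numeric in this example except the 30ms overhead is hypothetical: the slides give neither the SNR boundaries nor the Nmax values stored in the LUT, and the function names are invented. The sketch assumes the decoder starts configured for the noisiest case.

```python
from bisect import bisect_right

# Hypothetical lookup table: SNR boundaries in dB and the Nmax chosen in each band.
# Noisier channels (lower SNR) get a larger Nmax.
SNR_BOUNDS_DB = [1.0, 2.0]
NMAX_FOR_BAND = [8, 4, 2]
CONFIG_OVERHEAD_MS = 30   # "about 30ms" on the slide


def select_nmax(snr_db):
    """LUT lookup: map a measured SNR to the Nmax for its band."""
    return NMAX_FOR_BAND[bisect_right(SNR_BOUNDS_DB, snr_db)]


def run_controller(snr_samples, current_nmax=NMAX_FOR_BAND[0]):
    """Process one SNR sample per 250K-bit block; returns (reconfigurations, Nmax history)."""
    reconfigs, history = 0, []
    for snr_db in snr_samples:
        wanted = select_nmax(snr_db)
        if wanted != current_nmax:     # the "=?" comparison in the figure
            current_nmax = wanted      # stands in for loading a new bitstream
            reconfigs += 1
        history.append(current_nmax)
    return reconfigs, history


COUNT, HISTORY = run_controller([0.5, 0.7, 1.5, 1.6, 2.5, 0.2])
OVERHEAD_MS = COUNT * CONFIG_OVERHEAD_MS
print(COUNT, HISTORY, OVERHEAD_MS)
```

Because a bitstream is loaded only when the selected Nmax changes, a channel that stays within one band incurs no configuration overhead at all.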

Page 36: Field-Programmable Technology: Today’s and Tomorrow’s

Performance of dynamic upgrade

code                             (65,57)      (31,27)      (15,13)
static    speed (Kbps)           173.4        301.2        487.9
          power (mW)             447.7        205.8        134.3
dynamic   required reconfigs     8369/10000   6306/10000   6925/10000
          speed (Kbps)           359.1        429.4        598.6
          power (mW)             216.2        131.7        111.6
power saving                     52%          36%          18%

• up to 0.5x power, 2x speed over static decoder
• 100 times faster than processor decoder

Source: Liang, Tessier, Goeckel
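The headline figures can be checked against the speed and power values reported for the static and dynamic decoders. The sketch below uses only those values; the names are illustrative.

```python
CODES = ["(65,57)", "(31,27)", "(15,13)"]
STATIC_SPEED_KBPS = [173.4, 301.2, 487.9]
STATIC_POWER_MW = [447.7, 205.8, 134.3]
DYNAMIC_SPEED_KBPS = [359.1, 429.4, 598.6]
DYNAMIC_POWER_MW = [216.2, 131.7, 111.6]

# Speed of the dynamic decoder relative to the static one, per code.
SPEEDUP = [d / s for d, s in zip(DYNAMIC_SPEED_KBPS, STATIC_SPEED_KBPS)]
# Fraction of static power that the dynamic decoder saves, per code.
SAVING = [1 - d / s for d, s in zip(DYNAMIC_POWER_MW, STATIC_POWER_MW)]

for code, speedup, saving in zip(CODES, SPEEDUP, SAVING):
    print(f"{code}: {speedup:.2f}x speed, {100 * saving:.0f}% power saving")
```

For the (65,57) code this gives about 2.07x the speed at about 0.48x the power, matching "up to 0.5x power, 2x speed". The recomputed savings are about 52%, 36% and 17%; the last is slightly below the 18% reported for the (15,13) code.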

Page 37: Field-Programmable Technology: Today’s and Tomorrow’s

Other directions

• domain-specific design automation
  – languages + tools: for particle physics systems?
• multi-core, sensor network co-design
  – multiple hardware/software: FPGA + CPU + sensors
• extending processor and compiler capabilities
  – static and dynamic optimizations, self-tuning
• power-aware, radiation-aware design
  – transforms e.g. pipelining, damage monitoring
• rapid and informative design validation
  – simulation + FPGA prototype + formal verification

Page 38: Field-Programmable Technology: Today’s and Tomorrow’s

6. Summary

• good: Moore’s Law, bad: productivity gap
• vision: unified design synthesis and analysis
• devices and design today
  – growing gap: amount of I/O and amount of logic
  – enhance optimality and re-use: I/O driven
• devices tomorrow
  – hybrid FPGA: multi-granularity fabric
  – 3D FPGA: customisable system-in-package
• design tomorrow
  – guided synthesis: optimised and portable design
  – data representation optimisation
  – upgradable and self-tuned design