how to teach a billion transistor chip a new...

How to teach a billion transistor chip a new trickDr. ir. Bart Kienhuis

University Leiden, LIACS

Computer Systems Group

http://www.liacs.nl

ICT-Kenniscongres 10 April 2006

2

Outline� Focus is programming ICs for

Stream based Applications� FPGAs

� Stream Based Applications� Multi-media

� Imaging

� Bio-informatics

� Classical DSP

� Increasing demand for high-performance, embedded compute performance

� We know how to build billion transistor FPGAs, but cannot program efficiently

FPGABillion

Transistors

Applications

(C / Matlab / Java)

pro

gra

mm

ing?

3

Stream Based Applications

� Imaging� HDTV 1080x720 @ 100 Hz

� 77 Million samples per seconds

� Suppose 100 operations per sample

� 8 Billion Operations per second

� Classical DSP� LOFAR One antenna

� Sampling rate 200*2 = 400 Million Samples per second

� 100 Operations per second (First stage)

� 40 Billion Operations per second

� There are 10.000 Antennas expected = 40*10^13 operations per second

� Next beamforming is performed, statistics obtained, but on decimated signal

Operation

10 – 10.00 Operations per Sample

10 – 1000 Million Samples per second

Stream

Sample

Real-Time

4

Compute Requirements

Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate Trend: ferocious appetite for more

embedded compute power

0

500

1000

1500

2000

2500

2000 2001 2002 2003 2004 2005 2006

Bil

lio

n M

AC

/s

HDTV

MPEG4

Video

over IP

3G

Wireless/

WCDMA

Future

Broadband

Standards

Voice

over IP General Purpose DSP/CPU

Market Requirements

Increasing

Gap

2.5G

?

5

Intrinsic Computational Efficiency (ICE)

E. Roza, System-on-chip: what are the limits?, IEE Electronics

Communication Engineering Journal, vol13, No6, Dec 2001, pp 249-255.

Computational efficiency

[MOPS/W]

i386SX

i486DXP5

68040

microsparc

Supersparc

601604

Ultrasparc P6

604e21164a

Turbosparc 604e

21364

7400

106

105

104

103

102

101

100

2 1 0.5 0.25 0.13 0.07Feature size [µm]

Microprocessors

Intrinsic Computational

Efficiency of Silicon

Pla

yin

g fie

ld

7

Xilinx Virtex II Pro

� Heterogeneous Architecture� Multi CPUs

� Distributed Memory Blocks

� Programmable Logic (IP cores)

� Commercially Available Platform� Virtex-4 is released and available

� Virtex-5 is on the drawing board

� FPGAs are becoming more important in Embedded System Design� Flexible since they can easily be

re-programmed

� More functionality at lower cost

� replacing a larger part of the ASICs market

Reconfigurable logic

Heterogeneous Architecture

PowerPCs

8

4

16 w

ord

s x

1 b

it me

mo

ry

[A,B,C,D]

F

� A 4-input lookup table

(LUT) can implement any

function of 4 inputs.

� For example, a 1-bit

adder needs 2 LUTs:

A⊕B⊕Ci

A.B.Ci

AB

Ci

Co

S

Building an FPGA: Logic First

01011

01010

10010

10001

FAddress

(ABCD)

9

4

FF

CE RST

M

16 w

ord

s x

1 b

it me

mo

ry

Carry M

M

M

Din

WECin

Cout

� Make LUT RAM

a user resource.

� Fast carry ripple

to neighbor.

Arithmetic, Distributed RAM

10

4

4

4

4

4

4

4

4

40

� Group logic cells to

reduce overhead.

� Add H, V routing

channels with

switchboxes.

� Add input, output

MUXing between

logic and routing.

Add Interconnect

11

4

4

4

4

4

4

4

4

40

4

4

4

4

4

4

4

4

40

4

4

4

4

4

4

4

4

40

4

4

4

4

4

4

4

4

40

Build an Array

12

FPGA Technology

� Virtex-4 has already over a billion transistors!

� FPGA can take most advantage of Moore’s law� More an more logic

available on the same die

� CPUs in programmable logic � Hardcore =

� Softcore =

13

1/91 1/92 1/93 1/94 1/95 1/96 1/97 1/98 1/99 1/00 1/01 1/02 1/03 1/04

1

10

100

1000

Capacity

Speed

Price Virtex-II Pro

Spartan-3

Virtex-II Pro

Memory

Bandwidth

Year

FPGA, the last 10 years

Xilinx Research, Kees Vissers

14

FPGA Programming� FPGA is flexible in Hardware

and Software…� How to teach a billion

transistor chip a new trick

� Solution Industry doesn’t work

� Programming means:� Write “C” programs for the

Micro processors

� Write “VHDL” programs for the IP cores and interconnection network

� “C” programs and “VHDL” programs need to work together� High performance

� Error proof

� Nightmare to debug

� Programming is currently done manually with lots of engineers

“C”

“C”

“C”

“C”

VHDL

Virtex-II Pro FPGA

15

Productivity

Lines of Code/Man Month

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

1981

1985

1989

1993

1997

2001

2005

2009

Logic transistors per chip

(K)

10

100

1,000

10,000

100,000

1,000,000

10,000,000

Logic Tr./Chip

58% / Yr. compoundcomplexity growth rate

Loc/MM

8-10% / Yr. compoundproductivity growth rate

Productivitygap

Software Productivity

We know how to build Billion transistor chips, but do not know how to program them efficiently

16

Research at LIACS (COMPAAN)

Kahn Process Network

Matlab/C/Java

Compaan

FPGA Laura /Espam

for j = 1:1:N,

[x(j)] = S1( ); endfor i = 1:1:K,

[y(i)] = S2( ); endfor j = 1:1:N,

for i = 1:1:K,[y(i), x(j)] = func(y(i), x(j) );

endendfor i = 1:1:K,

[Out(i)] = Sink( y( I ) ); end

F1 F2

S2 F3 F4

Sink

S1

How ICT Research at LIACS tackles the Productivity Gap(Leiden University, STW PROGRESS)

Parameterized Nested Loop Programs

17

Toolbox� Toolbox to change the

number of processes

� Unrolling (More Parallelism)

� Merging (Less Parallelism)

� Better use of Resources

� Re-timing

� C-Slowing

� Exploration

� Track Technology

changes PartitioningLessParallelism

More Parallelism

P1 P2

P3

Application

18

Results: Motion JPEG (1)

for k = 1:1:NumFrames,

for j = 1:1:VNumBlocks,

for i = 1:1:HNumBlocks,

[ Block ] = VideoInMain();[ Block ] = DCT( Block );

[ Block ] = Q( Block );

[ Packets ] = VLE( Block );

[ ] = VideoOut( Packets );end

end

end

Generated Process Network

DATE04, ‘System Design using Kahn Process Networks: The Compaan/Laura Approach’

Block( j , i )

19

Motion JPEG (2)

� From Matlab to FPGA implementation� Automatically derived a multi processor solution

� Free choice between Microprocessor (MicroBlaze) or Hardware (IPCore)

� Automatically generated� “C” program for all Microprocessors

� VHDL program for interconnections and IP Core integration

� Correct by Construction

� Seamless integration into FPGA Development environment (EDK)

ND_2DCT

400

ND_1

VideoIn

ND_2

DCT

ND_3

Q

ND_4

VLE

ND_5

VideoOut

10837 135036 68972 54210 2795

FIFO

Difference: 337

- Manual: 3 months

- Compaan: 3 weeks

20

Results: QR Case studyfor k = 1:1:21,

for j = 1:1:7, [r(j,j), rr(j,j), a, b, d(k)] =

bcell( r(j,j), rr(j,j),x(k,j), d(k));

for i = j+1:1:7,

[r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a, b );

end

end

end

0

200

400

600

800

1000

1200

1400

1600

1800

2000

QR QR

Skew

QR

Skew

+2st

QR

Skew

+4st

QR

Skew

+6st

QR

Skew

+7st

QR

Skew

+10st

MF

LO

PS x

28

QR decomposition

Beam former/Radar

FPL04, ‘Increasing pipelined IP core utilization in Process Networks using Exploration’

- Manual: 1 year- Compaan: 3 days

Exploration

- 60MFlops out-of-the-box

- 2 GFlops, with Compaan/Laura

21

Multidisciplinary Work

Mathematics

ElectricalEngineering

Mathematical Modeling (Polyhedral Mathematics)

Efficient Solvers (Parametric Integer Programming)

Processor DesignFPGA Technology

Electronics Design Automation

Models of ComputationsCompiler Technology

Software Engineering

Compaan Project: 6 years, 6 PhDs, 2 Postdocs

ComputerScience

22

Conclusions� Stream Based Applications require Heterogeneous

Multiprocessor architecture� Stream of 10 – 1000 million samples per second

� 10 – 100 Giga operations per second

� Latest FPGAs are Heterogeneous Multiprocessor architectures� Over a Billion Transistor and still counting

� Problem is that we know how to build FPGAs, but programming them is very hard

� Programming/Compiler Technology is a strategic ICT component to reduce Productivity Gap

� Showed the Compaan Project at LIACS, which provides a solution for stream based applications.

23

Thanks to

� Prof. Ed Deprettere

� Dr. Todor Stefanov

� Dr. Sven Verdoolaege

� Alexandru Turjan

� Claudiu Zissulescu

� Hristo Nikolov

� Sjoerd Meijer

� Jerome Lemaitre

� Bin Jiang

� Wei Zhong

For more information http://www.liacs.nl/~kienhuis

how to teach a billion transistor chip a new...

Documents