reconfigurable computing

© 2006, [email protected], http://hartenstein.de. Reconfigurable Computing. Reiner Hartenstein. Computing Meeting EU, ESU, Brussels, May 18, 2006



Page 1: Reconfigurable Computing

© 2006, [email protected]

http://hartenstein.de

Reconfigurable Computing

Reiner Hartenstein

Computing Meeting EU, ESU, Brussels, May 18, 2006

Page 2: Reconfigurable Computing

The Pervasiveness of RC

[Figure: # of hits by Google for "FPGA and …." search phrases, one bar per application area. Values shown (area labels not recoverable): 113,000; 127,000; 158,000; 162,000; 171,000; 194,000; 272,000; 398,000; 647,000; 915,000; 1,490,000; 1,620,000]

• ECE-savvy scene (mainstream for many years)
• Math/SW-savvy scene (more recently: 2-3 years)
• and many more areas

Page 3: Reconfigurable Computing


The dominance of Configware

Most compute power is coming from Configware

More MIPS migrated to Configware than running as Software

Page 4: Reconfigurable Computing


Reconfigurable Supercomputing (VHPC) going commercial

Cray XD1

Silicon Graphics RASC

… and other vendors

Page 5: Reconfigurable Computing


>> Outline <<

•Reconfigurable Computing Paradox

•The Supercomputing Paradox

•We are using the wrong model

•Coarse-grained Reconfigurable Devices

•Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

Page 6: Reconfigurable Computing

The Reconfigurable Computing Paradox

• poor FPGA technology: area-inefficient, slow, power-hungry, expensive
• poor tools: tools and languages unacceptable to most users; even most hardware experts (86%**) hate their tools
• poor education: RC education is extremely poor, if offered at all; ignored by CS curricula; CS is taught as if for a 50-year-old mainframe …

**) DeHon '98

Page 7: Reconfigurable Computing

FPGA integration density

the effective integration density of plain FPGAs is behind Moore's law by more than 4 orders of magnitude

However, brilliant results everywhere: what paradox?

Page 8: Reconfigurable Computing

speed-up factors published

[Figure: relative performance (log scale, 10^0 to 10^9) versus year (1980 to 2010). Curves: microprocessor (8080 to Pentium 4), memory, Moore (x1.25 / yr) and FPGA (x2 / yr); further growth-rate labels: 50%/yr and 7%/yr. The pre-FPGA era is marked; speed-up bands are labeled >1 OoM, >2 OoM, >3 OoM and <4 OoM (orders of magnitude). Source: http://xputers.informatik.uni-kl.de/faq-pages/fqa.html]

Speed-up factors plotted in the figure:

• DSP and wireless: Reed-Solomon decoding 2400; Viterbi decoding 400; FFT 100; MAC 1000
• Image processing, pattern matching, multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457
• Bioinformatics: Smith-Waterman pattern matching 288; BLAST 52; protein identification 40; molecular dynamics simulation 88
• Astrophysics: GRAPE 20
• crypto: 1000
• Los Alamos traffic simulation: 47
• MoM Xputer architecture (TU-KL): grid-based DRC on DPLA (no FPGA) 2000; grid-based DRC ("fair comparison") 15000; 2-D FIR filter 39.4; Lee routing 160

Page 9: Reconfigurable Computing

platform FPGAs: better area efficiency

[Figure: DSP platform FPGA, courtesy Xilinx Corp.; labeled blocks:]
• 500 MHz flexible soft logic architecture, 200K logic cells
• 500 MHz programmable DSP execution units
• 0.6-11.1 Gbps serial transceivers
• 500 MHz PowerPC™ processors (680 DMIPS) with Auxiliary Processor Unit
• 1 Gbps differential I/O
• 500 MHz multi-port distributed 10 Mb SRAM
• 500 MHz DCM digital clock management

DeHon's 1st Law (1996) was for plain FPGAs

Page 10: Reconfigurable Computing

pre-FPGA era: why the DPLA* was so good

1. Large arrays of canonical Boolean expressions: the classical PLA layout is highly area-efficient, close to Moore's law. In the mid-1980s only very tiny FPGAs were available at first: 1 DPLA replaced 256 of them.

2. ASM (Auto-Sequencing Memory), a generalization of the DMA**, with a GAG (Generic Address Generator**) to avoid address computation overhead. Speed-up factor of 20 by reducing memory cycles, which is the key issue.

*) fabricated 1984 by the E.I.S. multi-university project
**) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]

Note (Reiner Hartenstein): ASM means that no instruction streams are needed for address computation; it is a generalization of DMA. See M. Herz et al.: ICECS 2003, Dubrovnik.
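The address-generation idea behind the ASM can be sketched in a few lines of Python. This is only an illustrative model, not the GAG design from the cited survey: the function name `gag_scan` and its parameters (base address, strides, row and column counts) are assumptions. Once the parameters are set, the whole address sequence follows from counting alone, so no instruction stream has to compute addresses while the data flows.

```python
from typing import Iterator


def gag_scan(base: int, row_stride: int, rows: int, cols: int,
             col_stride: int = 1) -> Iterator[int]:
    """Yield the addresses of a 2-D block scan (hypothetical GAG model).

    The parameters play the role of configuration registers: they are
    set once, then the address sequence is produced by counting alone.
    """
    for r in range(rows):
        row_base = base + r * row_stride
        for c in range(cols):
            yield row_base + c * col_stride


# Scan a 2 x 3 window in a memory with 8 words per row, starting at address 16.
addresses = list(gag_scan(base=16, row_stride=8, rows=2, cols=3))
print(addresses)
```

Changing the scan pattern means changing the parameters, not writing a new loop of address arithmetic instructions.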
Page 11: Reconfigurable Computing

taxonomy of algorithms, better tools and better education

[Figure: the chart of published speed-up factors (relative performance versus year, 1980 to 2010, with the same application data points and growth curves) shown again, with two added annotations: "even higher speed-up?" and "consolidation?"]

Page 12: Reconfigurable Computing

New dimensions of low power

Application migration [from supercomputer] results not only in massive speed-ups: electricity bills reduced by an order of magnitude, or even more, come for free …

… up to millions of dollars per year (also a matter of national energy policy)

[Image labels: Google, Amsterdam, NY]

"Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack" [Herb Riley, R. Associates]
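The quoted saving can be turned into an average power figure with simple arithmetic. The sketch below uses only the two numbers from the quote ($10,000 per year at 7¢ / kWh) plus one assumption that is not in the slides: round-the-clock operation, i.e. 8760 hours per year. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: continuous operation, no leap day


def saved_power_kw(dollars_per_year: float, dollars_per_kwh: float) -> float:
    """Average power (kW) whose yearly energy cost equals the given saving."""
    kwh_per_year = dollars_per_year / dollars_per_kwh
    return kwh_per_year / HOURS_PER_YEAR


rack_kw = saved_power_kw(10_000, 0.07)
watts_per_processor = 1000 * rack_kw / 64  # the quote refers to a 64-processor rack
print(f"{rack_kw:.1f} kW per rack, {watts_per_processor:.0f} W per processor")
```

Under these assumptions the quote corresponds to a saving on the order of 16 kW per rack, or roughly a quarter of a kilowatt per processor.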

Page 13: Reconfigurable Computing


>> Outline <<

•Reconfigurable Computing Paradox

•The Supercomputing Paradox

•We are using the wrong model

•Coarse-grained Reconfigurable Devices

•Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

Page 14: Reconfigurable Computing

The Supercomputing Paradox

promising technology:
• growing listed Teraflops
• increasing number of processors running in parallel
• decreasing cost of COTS processors

Note (Reiner Hartenstein): programmer productivity shrinks with a growing number of processors.
Page 15: Reconfigurable Computing

HPC by classic supercomputing methodology

poor results:
• Extreme shortage of affordable capacity
• Lack of scalability: progress only by innovation
• More parallelism absorbs programmer productivity
• Program ready: hardware obsolete (the "law of More")
• Not for high performance embedded computing

Page 16: Reconfigurable Computing


>> Outline <<

•Reconfigurable Computing Paradox

•The Supercomputing Paradox

•We are using the wrong model

•Coarse-grained Reconfigurable Devices

•Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

Page 17: Reconfigurable Computing

Why traditional supercomputing / HPC failed

• instruction-stream-based: memory-cycle-hungry
• the wrong way the data are moved around
• because of the wrong multi-core interconnect architecture

[Figure: CPU, captioned "extremely unbalanced" (stolen from Bob Colwell)]

Page 18: Reconfigurable Computing

moving data around inside the Earth Simulator

Crossbar weight: 220 t, 3000 km of thick cable

Page 19: Reconfigurable Computing

discarding the wrong road map

with a paradigm shift the same performance is feasible on a single 19” rack

Page 20: Reconfigurable Computing

Bringing together data and processor

by Software: moving data to the processor (moving the grand piano)

Page 21: Reconfigurable Computing

Key issues in very High Performance Computing (vHPC)

• reducing memory cycles is the key issue
• this needs a paradigm shift away from the dominance of instruction streams

Page 22: Reconfigurable Computing

Here is the common model

[Diagram: the CPU runs software code (instruction-stream-based); the reconfigurable accelerator runs configware code and, like the hardwired accelerator, is data-stream-based; the two sides are labeled "symbiotic".]

• it's not von Neumann: the vN monopoly in our curricula is severely harmful
• Von Neumann: the tail is wagging the dog
• we need dual paradigm education
• very high performance & electricity bill issues
• legacy issues

Page 23: Reconfigurable Computing

The wrong basic mind set

our IT expert labor force lacks the right basic mind set

we need a dual paradigm approach

this is a severe educational challenge

Page 24: Reconfigurable Computing

For high school and undergraduate education

we need a simple archetype common model instead of a wide variety of sophisticated architectures

this is a severe educational challenge

Page 25: Reconfigurable Computing


>> Outline <<

•Reconfigurable Computing Paradox

•The Supercomputing Paradox

•We are using the wrong model

•Coarse-grained Reconfigurable Devices

•Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

Page 26: Reconfigurable Computing

integration density

the effective integration density of plain FPGAs is behind Moore's law by more than 4 orders of magnitude

the effective integration density of rDPAs* may come close to Moore's law

*) reconfigurable DataPath Arrays (coarse-grained reconfigurability)

Page 27: Reconfigurable Computing

Coarse grain is about computing, not logic

[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; no CPU]

Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing; port location marker; backbus connect

rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide

Page 28: Reconfigurable Computing

SW to coarse-grained CW migration example

[Figure: array of rDPUs, 4 per row, with the example mapped onto it: an adder (+) producing S]

Page 29: Reconfigurable Computing

Compare it to software solution on CPU

S = R + (if C then A else B endif);   C = 1

simple conservative CPU example:

step                          memory cycles   nanoseconds
if C then read A
  read instruction                  1             100
  instruction decoding
  read operand*                     1             100
  operate & reg. transfers
if not C then read B
  read instruction                  1             100
  instruction decoding
add & store
  read instruction                  1             100
  instruction decoding
  operate & reg. transfers
  store result                      1             100
total                               5             500

*) if no intermediate storage in register file

[Figure: rDPU array with the datapath for S (adder "+"); clock 200 MHz]

Page 30: Reconfigurable Computing

hypothetical branching example to illustrate software-to-configware migration

S = R + (if C then A else B endif);   C = 1

simple conservative CPU example (same steps as on the previous slide): 5 memory cycles of 100 nanoseconds each (3 instruction reads, 1 operand read*, 1 result store), total 500 nanoseconds

*) if no intermediate storage in register file

[Figure: configware datapath with inputs A, B, R and C, an adder (+) and output S; labels: "=1", clock 200 MHz (5 nanosec)]

no memory cycles: speed-up factor = 100
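The speed-up factor of 100 follows directly from the numbers on the slide, as the short model below shows. It is a back-of-the-envelope sketch, not a simulator: the constant and function names are assumptions, the CPU time counts only the 5 memory cycles of 100 ns each (as the slide's table does), and the configware version is assumed to deliver one result per 200 MHz clock.

```python
MEMORY_CYCLE_NS = 100   # duration of one memory cycle in the CPU table
CLOCK_MHZ = 200         # clock of the configware datapath


def datapath(a: int, b: int, r: int, c: bool) -> int:
    """S = R + (if C then A else B): one pass through the datapath."""
    return r + (a if c else b)


def cpu_time_ns(memory_cycles: int = 5) -> int:
    """Time of the CPU version; only memory cycles are counted."""
    return memory_cycles * MEMORY_CYCLE_NS


def configware_time_ns(clock_mhz: int = CLOCK_MHZ) -> float:
    """One clock period in nanoseconds (assumes one result per clock)."""
    return 1000 / clock_mhz


speed_up = cpu_time_ns() / configware_time_ns()
print(datapath(a=3, b=7, r=10, c=True), speed_up)
```

The decision on C does not cost an extra instruction fetch here: both A and B are wired into the datapath, and C merely selects which one reaches the adder.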

Page 31: Reconfigurable Computing

Why the speed-up? What's the difference?

moving the locality of operation into the route of the data stream by P&R (placement & routing), instead of moving data by instruction streams

Page 32: Reconfigurable Computing

Bringing together data and processor

by Configware: move the stool

Place the location of execution into the data pipe

Page 33: Reconfigurable Computing

Data-stream-based

• instead of instruction-triggered, execution should be transport-triggered
• transport should be done within compiled pipelines, not by move engines*

*) which are instruction-stream-based !
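As a loose software analogy of transport-triggered execution inside a compiled pipeline, Python generators can be chained so that each stage fires when a data item arrives from its predecessor. This is only an analogy under stated assumptions: the stage names `scale` and `offset` and the helper `build_pipe` are invented for illustration, and on a CPU the generators are of course still executed by an instruction stream.

```python
from typing import Iterable, Iterator


def scale(stream: Iterable[int], factor: int) -> Iterator[int]:
    """Pipeline stage: fires once for every item that arrives."""
    for item in stream:
        yield item * factor


def offset(stream: Iterable[int], delta: int) -> Iterator[int]:
    """Next stage, wired directly to the output of the previous one."""
    for item in stream:
        yield item + delta


def build_pipe(source: Iterable[int]) -> Iterator[int]:
    """'Configuration': connect the stages once, before any data flows."""
    return offset(scale(source, factor=2), delta=1)


print(list(build_pipe([1, 2, 3])))
```

The point of the analogy is the split into two phases: the pipe is wired up once, and afterwards the arrival of data alone drives the computation.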

Page 34: Reconfigurable Computing

For high school and undergraduate education

we should send CTOs and professors back to school

this is a severe educational challenge

Page 35: Reconfigurable Computing

The wrong model

[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; no CPU. Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing; port location marker; backbus connect. rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide]

upon this schematic … a question by a Japanese corporate vVIP

Page 36: Reconfigurable Computing

The wrong mind set ....

"but you can't implement decisions!" (question by a Japanese corporate vVIP: [RAW'99])

[Figure: configware datapath with inputs A, B, R and C, an adder (+) and output S; labels: "=1", clock 200 MHz (5 nanosec)]

not knowing this solution: a symptom of the hardware / software chasm and the configware / software chasm

We need Reconfigurable Computing Education

Page 37: Reconfigurable Computing


>> Outline <<

• Reconfigurable Computing Paradox

• The Supercomputing Paradox

• We are using the wrong model

• Coarse-grained Reconfigurable Devices

• Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

Page 38: Reconfigurable Computing

some Goals

• Universal HPC co-architecture for:
▪ embedded vHPC (nomadic, automotive, ...)
▪ desktop vHPC (scientific computing ...)

• Application co-development environment for:
▪ hardware non-experts, ....
▪ acceptability by software-type users, ...

• Meet product lifetime >> embedded system life:
▪ FPGA emulation logistics from development down to maintenance and repair stations
▪ examples: automotive, aerospace, industrial, ..

Page 39: Reconfigurable Computing

Architecture: a potential Pentium successor

The Desk-top Supercomputer!

• Discard most caches
• have 64* cores, 0.5 - 1 GHz
• with clever interconnect for:
▪ concurrent processes and
▪ multithreading,
▪ and for a Kung-Kress pipe network

*) CPU mode / DPU mode capability

[Side labels next to the list: CPU mode, DPU mode]

Page 40: Reconfigurable Computing

"Super Pentium" configuration example

twin paradigm machine

[Figure: array of cores, most of them configured as rDPUs and several as CPUs]

Page 41: Reconfigurable Computing

World TV & game console & multi media center

e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz

[Figure: block diagram with an rDPA at the center; block labels: Games, Music, Videos, SMeXPP, Camera, Baseband-Processor, Radio-Interface, Audio-Interface, SD/MMC Cards, LCD DISPLAY. Source: http://pactcorp.com]

• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes

Page 42: Reconfigurable Computing

feasible under 500 MHz means low electricity cost and allows very high integration density

Page 43: Reconfigurable Computing


pipeline

apropos compiled pipeline …

Page 44: Reconfigurable Computing

Dual Paradigm Application Development Support

[Diagram: a high level language source feeds the software/configware co-compiler, which emits software code for the CPU (instruction-stream-based) and configware code for the reconfigurable accelerator (data-stream-based, shown next to the hardwired accelerator).]

placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect

Page 45: Reconfigurable Computing

Software / Configware Co-Compilation

Juergen Becker's CoDe-X, 1996

[Diagram: C language source feeds the Partitioner, which feeds the SW compiler (targeting the CPU) and the CW compiler (targeting an array of rDPUs). The CW compiler performs Placement & Routing ("Move the Locality of Operation"). Resource Parameters: supporting different platforms.]

Page 46: Reconfigurable Computing

Software / Configware very high level Synthesis

[Diagram: a math formula feeds a term-rewriting-based vhl synthesis system [Arvind, or Mauricio Ayala], which emits software code for the CPU (instruction-stream-based) and configware code for the reconfigurable accelerator (data-stream-based, shown next to the hardwired accelerator).]

Page 47: Reconfigurable Computing


>> Conclusions <<

•Reconfigurable Computing Paradox

•The Supercomputing Paradox

•We are using the wrong model

•Coarse-grained Reconfigurable Devices

•Super Pentium for Desktop Supercomputer

•Conclusions

http://www.uni-kl.de

Page 48: Reconfigurable Computing

Objectives

for every area which needs:
• cheap, compact vHPC
• flexibility (for accelerators)
• rapid prototyping, field-patching, emulation
• avoiding specific silicon

Page 49: Reconfigurable Computing

Conclusion (1)

Reconfigurable Computing opens many spectacular new horizons:

• Cheap vHPC without needing specific silicon, no mask ....
• Cheap embedded vHPC
• Cheap desktop supercomputer (a new market)
• Replacing expensive hardwired accelerators
• Fast and cheap prototyping
• Flexibility for systems with unstable multiple standards by dynamic reconfigurability
• Supporting fault tolerance, self-repair and self-organization
• Emulation logistics for very long term spare part provision and part type count reduction (automotive, aerospace …)
• Massive reduction of the electricity bill: locally and nationally

Page 50: Reconfigurable Computing

Conclusion (2)

Needed:
• Universal vHPC co-architecture demonstrator
• The compilation tool problem to be solved
• Language selection problem to be solved
• Education backlog problems to be solved

For widely spreading its use successfully:
• select killer applications for demo
• Use this to develop a very good high school and undergraduate lab course
• A motivator: preparing for the top 500 contest

Page 51: Reconfigurable Computing


thank you

Page 52: Reconfigurable Computing


END

Page 53: Reconfigurable Computing


backup

Page 54: Reconfigurable Computing

Compilation: Software vs. Configware

Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code

Configware Engineering: source "program" → configware compiler, consisting of a mapper (placement & routing) that emits configware code and a data scheduler that emits flowware code

Page 55: Reconfigurable Computing

Nick Tredennick's Paradigm Shifts explain the differences

Software Engineering (CPU):
• resources: fixed
• algorithm: variable (software)
• 1 programming source needed

Configware Engineering:
• resources: variable (configware)
• algorithm: variable (flowware)
• 2 programming sources needed

Page 56: Reconfigurable Computing

Co-Compilation

[Diagram: Software / Configware Co-Compiler. The source (C, FORTRAN, MATLAB) enters an automatic SW / CW partitioner (simulated annealing). One branch goes to the software compiler, which emits software code. The other goes to the configware compiler, whose mapper emits configware code and whose data scheduler emits flowware code; the label "simulated annealing" appears again in this branch.]

Page 57: Reconfigurable Computing

Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]

[Diagram: Software / Flowware Co-Compiler. The source enters an automatic SW / CW partitioner. One branch goes to the software compiler, which emits software code; the other goes to the flowware compiler (data scheduler), which emits flowware code.]

Page 58: Reconfigurable Computing

The first archetype machine model

Software Industry's Secret of Success: a simple basic Machine Paradigm ("von Neumann")

• mainframe, CPU
• compile or assemble: procedural personalization
• personalization: RAM-based
• instruction-stream-based mind set

Page 59: Reconfigurable Computing

The 2nd archetype machine model

Configware Industry's Secret of Success: a simple basic Machine Paradigm ("Kress-Kung")

• reconfigurable accelerator
• compile: structural personalization
• personalization: RAM-based
• data-stream-based mind set

Page 60: Reconfigurable Computing


Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer

Page 61: Reconfigurable Computing

Data streams

[Figure: a DPA (pipe network) with an input data stream entering and output data streams leaving; each stream is drawn as data items ("x", with "-" and "|" as fill) over time and port #. The DPA is surrounded by ASM blocks.]

• "data streams" define: ... which data item at which time at which port
• H. T. Kung paradigm (systolic array)
• implemented by distributed memory
• ASM: Auto-Sequencing Memory (data counter, GAG, RAM)
• 50 & more on-chip ASM are feasible

Page 62: Reconfigurable Computing

The Generalization of the Systolic Array

• systolic array: only for applications with regular data dependencies
• remedy? [R. Kress]: discard algebraic synthesis methods; use optimization algorithms, e. g.: simulated annealing
• Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures are possible
• reconfigurability makes sense
• Kress-Kung paradigm: super systolic array
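How an optimization algorithm can replace algebraic synthesis is easiest to see on a toy placement problem. The sketch below is a generic simulated-annealing placer, not R. Kress's mapper: the cost function (total Manhattan wire length, so that nearest-neighbor connections are preferred), the cooling schedule, and all names are assumptions chosen for illustration. For brevity, a move only swaps two already placed operators.

```python
import math
import random
from typing import Dict, List, Tuple

Pos = Tuple[int, int]


def wire_length(place: Dict[str, Pos], nets: List[Tuple[str, str]]) -> int:
    """Total Manhattan distance over all operator-to-operator connections."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)


def anneal(ops: List[str], nets: List[Tuple[str, str]], rows: int, cols: int,
           steps: int = 2000, seed: int = 0) -> Dict[str, Pos]:
    """Place operators on a rows x cols array by simulated annealing."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    place = dict(zip(ops, cells))          # random initial placement
    best, best_cost = dict(place), wire_length(place, nets)
    cost, temp = best_cost, 2.0
    for _ in range(steps):
        a, b = rng.sample(ops, 2)          # propose swapping two operators
        place[a], place[b] = place[b], place[a]
        new_cost = wire_length(place, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                # accept the move
            if cost < best_cost:
                best, best_cost = dict(place), cost
        else:
            place[a], place[b] = place[b], place[a]   # reject: undo the swap
        temp = max(0.01, temp * 0.995)     # geometric cooling schedule
    return best


# A four-operator pipe (in -> mul -> add -> out) on a 2 x 2 array.
ops = ["in", "mul", "add", "out"]
nets = [("in", "mul"), ("mul", "add"), ("add", "out")]
placement = anneal(ops, nets, rows=2, cols=2)
print(placement, wire_length(placement, nets))
```

Because the method needs only a cost function, nothing in it requires the pipe to be linear or uniform, which is the point made on the slide.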

Page 63: Reconfigurable Computing

ASM: Auto-Sequencing Memory (Kress-Kung machine paradigm)

drastically reducing memory cycles

• Data Counter instead of Program Counter
• Generalization of the DMA
• ASM: data counter, GAG, RAM
• GAG & enabling technology: multiple publications 1989 … ; survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
• Storage scheme optimization methodology, etc.*

*) IMEC, Leuven & TU-KL

Note (Reiner Hartenstein): ASM means that no instruction streams are needed for address computation; it is a generalization of DMA. See M. Herz et al.: ICECS 2003, Dubrovnik.
Page 64: Reconfigurable Computing

fine-grained RC: DeHon's 1st Law [1996: Ph. D, MIT]

[Figure: density in transistors / microchip (log scale, 10^0 to 10^12) versus year (1980 to 2010). Curves: the Gordon Moore curve (microprocessor), FPGA physical, FPGA logical and FPGA routed; the gap between the Moore curve and FPGA routed is labeled ">> 10 000".]

immense area inefficiency; overhead: reconfigurability overhead, wiring overhead, routing congestion

Page 65: Reconfigurable Computing

coarse-grained RC: Hartenstein's amendment of DeHon's 1st Law [1996: ISIS, Austin, TX]

[Figure: transistors / microchip (log scale, 10^0 to 10^12) versus year (1980 to 2010). Curves: the Gordon Moore curve, rDPA physical, rDPA logical (e.g. KressArray family) and, for comparison, FPGA routed (">> 10 000" below the Moore curve).]

area efficiency very close to Moore's law

Page 66: Reconfigurable Computing

More compute power by Configware than Software

• 75% of all (micro)processors are embedded (4 : 1)
• 25% of embedded µProc. accelerated by FPGA(s) (1 : 4) (a very cautious estimation**)
• -> 1 : 1 -> every 2nd µProc accelerated by FPGA(s)
• average acceleration factor > 2 -> rMIPS* : MIPS > 2 (difference probably an order of magnitude)

Conclusion: most compute power from Configware

*) rMIPS: MIPS replaced by FPGA compute power
**) Dataquest interaction pending

Page 67: Reconfigurable Computing

Conclusion (3)

• Self-Repair and Self-Organization methodology
• Embedded r-emulation logistics methodology
• Universal vHPC co-architecture demonstrator

For widely spreading its use successfully:
• select a killer application for demo

Page 68: Reconfigurable Computing

Dual Paradigm Application Development Support

[Diagram: a high level language source, or MATLAB (or another example) via an adapter, feeds the software/configware co-compiler, which emits software code for the CPU (instruction-stream-based) and configware code for the reconfigurable accelerator (data-stream-based, shown next to the hardwired accelerator).]