
http://hartenstein.de

Reconfigurable Computing

Reiner Hartenstein

Computing Meeting EU, ESU, Brussels, May 18, 2006

reiner@hartenstein.de

The Pervasiveness of RC

[Figure: # of hits by Google for “FPGA and ….” queries, in two charts. First chart: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Second chart: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000.]

ECE-savvy scene (mainstream for many years)

Math/SW-savvy scene (more recently: 2-3 years)

and many more areas

The dominance of Configware

Most compute power is coming from Configware

More MIPS migrated to Configware than running as Software

Reconfigurable Supercomputing (VHPC) going commercial

Cray XD1

silicon graphics RASC

… and other vendors

>> Outline <<

• Reconfigurable Computing Paradox

• The Supercomputing Paradox

• We are using the wrong model

• Coarse-grained Reconfigurable Devices

• Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

The Reconfigurable Computing Paradox

poor FPGA technology: area-inefficient, slow, power-hungry, expensive

poor tools: tools and languages unacceptable by most users

even most hardware experts (86%**) hate their tools

**) DeHon ‘98

poor education: RC education extremely poor, if at all

ignored by CS curricula

CS taught like for a 50 year old mainframe …

FPGA integration density

the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude

However, brilliant results everywhere

what paradox ?

speed-up factors published

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

[Figure: published speed-up factors by application, plotted over 1980 to 2010 on an axis marked 100 and 10 000. Trend-line labels: X 2/yr; Pentium 4; 50%/yr; Microprocessor; Memory; x1.25 / yr (Moore). Annotations: pre-FPGA era (DPLA, MoM Xputer architecture); >1 OoM, >2 OoM, >3 OoM, <4 OoM.]

Application areas: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics; crypto

• Los Alamos traffic simulation

• real-time face detection: 6000

• video-rate stereo vision: 900

• pattern recognition: 730

• SPIHT wavelet-based image compression: 457

• Smith-Waterman pattern matching

• BLAST: 52

• protein identification

• molecular dynamics simulation

• Reed-Solomon Decoding: 2400

• Viterbi Decoding

• FFT

• 1000 MA

• Grid-based DRC (no FPGA: DPLA on MoM by TU-KL): 2000

• Grid-based DRC („fair comparison“): 15000

• 2-D FIR filter [TU-KL]

• Lee Routing (by TU-KL)

• GRAPE: 20

DSP platform FPGA [courtesy Xilinx Corp.]

• 500 MHz Flexible Soft Logic Architecture

• 200K Logic Cells

• 500 MHz Programmable DSP Execution Units

• 0.6-11.1 Gbps Serial Transceivers

• 500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit

• 1 Gbps Differential I/O

• 500 MHz multi-port Distributed 10 Mb SRAM

• 500 MHz DCM Digital Clock Management

platform FPGAs: better area efficiency

DeHon‘s 1st Law (1996) was for plain FPGAs

pre-FPGA era: Why DPLA* was so good

Large arrays of canonical Boolean expressions

classical PLA layout highly area-efficient: close to Moore’s law

Mid-1980s: at first only very tiny FPGAs were available: 1 DPLA replaced 256 of them

*) fabricated 1984 by E.I.S. multi university project

ASM: Auto-Sequencing Memory

GAG: Generic Address Generator**, a generalization of the DMA**, to avoid address computation overhead

reducing memory cycles, which is the key issue

**) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]

Speed-up factor of 20 by

taxonomy of algorithms, better tools and better education

[Figure: the chart of published speed-up factors (1980 to 2010) shown earlier, repeated.]

New dimensions of low power

Application migration [from supercomputer] resulting not only in massive speed-ups: electricity bills reduced by an order of magnitude and even more you may get for free (also a matter of national energy policy)

…. up to millions of dollars per year (Google, Amsterdam)

„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
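The quoted saving can be turned into an implied energy figure with simple arithmetic. The sketch below only uses the two numbers from the quote ($10,000 per year, 7¢ / kWh); the assumption of round-the-clock operation and the function name `implied_saving` are illustrative and not taken from the slides.

```python
# Hypothetical back-of-the-envelope check of the quoted electricity saving.
HOURS_PER_YEAR = 24 * 365  # assumption: the rack runs around the clock


def implied_saving(dollars_per_year, dollars_per_kwh):
    """Return (kWh saved per year, average kW saved) implied by a bill saving."""
    kwh_per_year = dollars_per_year / dollars_per_kwh
    average_kw = kwh_per_year / HOURS_PER_YEAR
    return kwh_per_year, average_kw


if __name__ == "__main__":
    kwh, kw = implied_saving(10_000, 0.07)
    print(round(kwh))      # 142857
    print(round(kw, 1))    # 16.3
```

Under these assumptions the quote corresponds to roughly 16 kW less average power draw per 64-processor rack.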

>> Outline <<

The Supercomputing Paradox

promising technology:

• Growing listed Teraflops

• Increasing number of processors running in parallel

• COTS processors: decreasing cost

HPC by classic supercomputing methodology

poor results:

• Extreme shortage of affordable capacity

• Lack of scalability: progress only by innovation

• More parallelism absorbs programmer productivity

• Program ready: hardware obsolete

• The law of More

• Not for high performance embedded computing

>> Outline <<

Why traditional supercomputing / HPC failed

instruction-stream-based: memory-cycle-hungry

the wrong way, how the data are moved around

because of the wrong multi-core interconnect architecture

…y unbalanced

stolen from Bob Colwell

moving data around inside the Earth Simulator

Crossbar weight: 220 t, 3000 km of thick cable

discarding the wrong road map

with a paradigm shift the same performance is feasible on a single 19” rack

Bringing together data and processor

Moving data to the processor:

moving the grand piano by Software

Key issues in very High Performance Computing (vHPC)

this needs a paradigm shift

reducing memory cycles is the key

away from the dominance of instruction streams

Here is the common model

instruction-stream-based: software code

data-stream-based: accelerator, reconfigurable (configware code) or hardwired

it’s not von Neumann

the vN monopoly in our curricula is severely harmful

Von Neumann: the tail is wagging the dog

we need dual paradigm education

very high performance & electricity bill issues

legacy issues

symbiotic

The wrong basic mind set

we need a dual paradigm approach

this is a severe educational challenge

our IT expert labor force lacks the right basic mind set

For high school and undergraduate education

we need an archetype simple common model

instead of a wide variety of sophisticated architectures

>> Outline <<

integration density

the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude

the effective integration density of rDPAs* may come close to Moore’s law

*) reconfigurable DataPath Arrays (coarse-grained reconfigurability)

Coarse grain is about computing, not logic

[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; no CPU. Legend: rDPU not used; rDPU used for routing only (rout thru only); operator and routing; port location marker; backbus connect.]

rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide

coarse-grained SW 2 CW migration example

S = R + (if C then A else B endif);

hypothetical branching example to illustrate software-to-configware migration

Compare it to software solution on CPU

simple conservative CPU example, C = 1

step                          memory cycles   nanoseconds
if C then read A
  read instruction                  1             100
  instruction decoding
  read operand*                     1             100
  operate & reg. transfers
if not C then read B
add & store
  operate & reg. transfers
  store result                      1             100
total                               5             500

*) if no intermediate storage in register file

clock: 200 MHz (5 nanosec)
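The comparison in the CPU example above can be sketched in a few lines. The constants are the slide's own figures (5 memory cycles of 100 ns, a 5 ns clock); the assumption that the configware datapath delivers one result per clock once the pipe is filled, and all function names, are illustrative and not part of the original slides.

```python
# Hypothetical sketch of the example S = R + (if C then A else B endif)

MEMORY_CYCLE_NS = 100    # from the slide: one memory cycle takes 100 ns
CPU_MEMORY_CYCLES = 5    # from the slide: total of 5 memory cycles
CLOCK_NS = 5             # from the slide: 200 MHz clock, i.e. 5 ns


def cpu_time_ns():
    """Instruction-stream version: time is dominated by memory cycles."""
    return CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS


def datapath(r, c, a, b):
    """Data-stream version: the decision is a multiplexer feeding an adder."""
    selected = a if c else b   # multiplexer; no instruction is fetched
    return r + selected        # adder


def stream(rs, cs, as_, bs):
    """Assumption: one result leaves the pipe per clock once it is filled."""
    return [datapath(r, c, a, b) for r, c, a, b in zip(rs, cs, as_, bs)]


if __name__ == "__main__":
    print(cpu_time_ns())                                 # 500
    print(cpu_time_ns() / CLOCK_NS)                      # 100.0
    print(stream([1, 2], [1, 0], [10, 20], [30, 40]))    # [11, 42]
```

Under these assumptions the ratio between the 500 ns software version and one 5 ns clock is a factor of 100; the real gain depends on how the data streams are fed.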

Why the speed-up? What‘s the difference?

moving the locality of operation into the route of the data stream by P&R

instead of moving data by instruction streams

Bringing together data and processor

Move the stool by Configware

Place the location of execution into the data pipe

Data-stream-based: execution should be transport-triggered instead of instruction-triggered

transport should be done within compiled pipelines, not by move engines*

*) which are instruction-stream-based !

For high school and undergraduate education

we should send CTOs and professors back to school

The wrong model

[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide; no CPU]

The wrong mind set ....

„but you can‘t implement decisions!“

(question by a Japanese Corporate vVIP upon this schematic: [RAW’99])

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

We need Reconfigurable Computing Education

>> Outline <<

• Reconfigurable Computing Paradox

• The Supercomputing Paradox

• We are using the wrong model

• Coarse-grained Reconfigurable Devices

• Super Pentium for Desktop Supercomputer

some Goals

Universal HPC co-architecture for:

▪ embedded vHPC (nomadic, automotive, ...)

▪ desktop vHPC (scientific computing ...)

Application co-development environment for:

▪ Hardware non-experts, ....

▪ Acceptability by software-type users, ...

Meet product lifetime >> embedded syst. life:

FPGA emulation logistics from development down to maintenance and repair stations

examples: automotive, aerospace, industrial, ..

Architecture: A potential Pentium successor

The Desk-top Supercomputer!

Discard most caches

have 64* cores, 0.5 - 1 GHz

with clever interconnect for:

▪ concurrent processes,

▪ multithreading,

▪ and for Kung-Kress pipe network

*) CPU mode / DPU mode capability

“Super Pentium” configuration example

twin paradigm machine

[Figure: CPUs combined with an array of rDPUs]

e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz

[Figure: SMeXPP with rDPA, connected to Games, Music, Videos, Camera, Baseband Processor, Radio Interface, Audio Interface, SD/MMC Cards and LCD display]

• Variable resolutions and refresh rates

• Variable scan mode characteristics

• Noise Reduction and Artifact Removal

• High performance requirements

• Variable file encoding formats

• Variable content security formats

• Variable Displays

• Luminance processing

• Detail enhancement

• Color processing

• Sharpness Enhancement

• Shadow Enhancement

• Differentiation

• Programmable de-interlacing heuristics

• Frame rate detection and conversion

• Motion detection & estimation & compensation

• Different standards (MPEG2/4, H.264)

• A single device handles all modes

World TV & game console & multi media center

http://pactcorp.com

feasible under 500 MHz means low electricity cost and allows very high integration density

pipeline

apropos compiled pipeline …

Dual Paradigm Application Development Support

high level language

software/configware co-compiler

software code (instruction-stream-based)

configware code (data-stream-based)

placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect
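A placement step that prefers nearest-neighbor connections can be sketched with simulated annealing, the optimization method named elsewhere in these slides. This is a minimal illustration, not the co-compiler's actual algorithm: the Manhattan wire-length cost, the linear cooling schedule, and the names `wire_length` and `anneal` are assumptions.

```python
import math
import random


def wire_length(placement, nets):
    """Sum of Manhattan distances over all connected operator pairs."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


def anneal(ops, nets, rows, cols, steps=5000, t0=5.0, seed=0):
    """Place operators on a rows x cols grid of rDPUs by simulated annealing.

    Requires len(ops) <= rows * cols. Returns (best placement, best cost).
    """
    rng = random.Random(seed)
    cells = [(x, y) for x in range(cols) for y in range(rows)]
    rng.shuffle(cells)
    # Each grid cell holds one operator name, or None for an unused rDPU.
    grid = dict(zip(cells, list(ops) + [None] * (len(cells) - len(ops))))
    placement = {op: cell for cell, op in grid.items() if op is not None}
    cost = wire_length(placement, nets)
    best_cost, best = cost, dict(placement)

    def swap(c1, c2):
        grid[c1], grid[c2] = grid[c2], grid[c1]
        for c in (c1, c2):
            if grid[c] is not None:
                placement[grid[c]] = c

    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling
        c1, c2 = rng.sample(cells, 2)
        swap(c1, c2)
        new_cost = wire_length(placement, nets)
        accept = new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp)
        if accept:
            cost = new_cost
            if cost < best_cost:
                best_cost, best = cost, dict(placement)
        else:
            swap(c1, c2)                           # undo the rejected move
    return best, best_cost


if __name__ == "__main__":
    ops = ["in", "mul", "add", "mux", "out"]
    nets = [("in", "mul"), ("mul", "add"), ("add", "mux"), ("mux", "out")]
    placement, cost = anneal(ops, nets, rows=4, cols=4)
    print(cost, placement)
```

A cost equal to the number of nets would mean that every connection is a nearest-neighbor connection; the sketch does not guarantee that this optimum is reached.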

Software / Configware Co-Compilation

Juergen Becker’s CoDe-X, 1996

C language source

Partitioner

SW compiler (CPU)

CW compiler: Placement & Routing (Move the Locality of Operation)

Resource Parameters: supporting different platforms

Software / Configware very high level Synthesis

Math formula .... [Arvind, or, Mauricio Ayala]

term-rewriting-based vhl synthesis system

software code (instruction-stream-based)

>> Conclusions <<

• Conclusions

Objectives

for every area which needs:

• flexibility (for accelerators)

• avoiding specific silicon

• rapid prototyping, field-patching, emulation

• cheap, compact vHPC

Conclusion (1)

Reconfigurable Computing opens many spectacular new horizons:

Cheap vHPC without needing specific silicon, no mask ....

Massive reduction of the electricity bill: locally and nationally

Cheap embedded vHPC

Cheap desktop supercomputer (a new market)

Fast and cheap prototyping

Replacing expensive hardwired accelerators

Supporting fault tolerance, self-repair and self-organization

Flexibility for systems with unstable multiple standards by dynamic reconfigurability

Emulation logistics for very long term spare part provision and part type count reduction (automotive, aerospace …)

Universal vHPC co-architecture demonstrator

Conclusion (2)

Needed:

The compilation tool problem to be solved

Language selection problem to be solved

Education backlog problems to be solved

Use this to develop a very good high school and undergraduate lab course

A motivator: preparing for the top 500 contest

For widely spreading its use successfully:

select killer applications for demo

thank you

backup

Compilation: Software vs. Configware

Software Engineering:

source program (C, FORTRAN, MATLAB) -> software compiler -> software code

1 programming source needed

algorithm: variable

resources: fixed

Configware Engineering:

source „program“ -> configware compiler (mapper: placement & routing; scheduler) -> configware code, flowware code

2 programming sources needed

configware resources: variable

flowware algorithm: variable

Nick Tredennick’s Paradigm Shifts explain the differences

Co-Compilation

Software / Configware Co-Compiler

source: C, FORTRAN, MATLAB

automatic SW / CW partitioner (simulated annealing)

software compiler -> software code

configware compiler (mapper with simulated annealing, scheduler) -> configware code, flowware code

Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]

Software / Flowware Co-Compiler

source

automatic SW / CW partitioner

software compiler -> software code

flowware compiler (scheduler) -> flowware code

The first archetype machine model

Software Industry’s Secret of Success: simple basic Machine Paradigm

“von Neumann”: instruction-stream-based mind set

mainframe

compile or assemble: procedural personalization

personalization: RAM-based

The 2nd archetype machine model

Configware Industry’s Secret of Success: simple basic Machine Paradigm

“Kress-Kung”: data-stream-based mind set

compile: structural personalization

personalization: RAM-based

Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer

Data streams

H. T. Kung paradigm (systolic array, pipe network)

„data streams“ define: ... which data item at which time at which port

[Figure: input data streams and output data streams, each drawn as port # over time]

implemented by distributed memory
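The definition of a data stream as "which data item at which time at which port" can be written down as a mapping from (time, port) to item. The sketch below is illustrative only: the one-step skew per port, typical for systolic arrays, is an assumption, and `skewed_schedule` and `at_time` are hypothetical names.

```python
def skewed_schedule(streams):
    """Return {(time, port): item}; the stream at port p starts p steps late."""
    schedule = {}
    for port, items in enumerate(streams):
        for index, item in enumerate(items):
            schedule[(index + port, port)] = item
    return schedule


def at_time(schedule, t, ports):
    """Items present at each port at time t (None where a port is idle)."""
    return [schedule.get((t, p)) for p in range(ports)]


if __name__ == "__main__":
    sched = skewed_schedule([["a0", "a1"], ["b0", "b1"]])
    print(at_time(sched, 0, 2))   # ['a0', None]
    print(at_time(sched, 1, 2))   # ['a1', 'b0']
    print(at_time(sched, 2, 2))   # [None, 'b1']
```

In this view a scheduler's output (flowware) is such a table, which the distributed memories then play out without any instruction stream.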

ASM: Auto-Sequencing Memory (RAM with GAG as data counter)

50 & more on-chip ASM are feasible

The Generalization of the Systolic Array

systolic array: only for applications with regular data dependencies

remedy? discard algebraic synthesis methods

[R. Kress]: use optimization algorithms, e. g.: simulated annealing

Achievement: also non-linear and non-uniform pipes, and even more wild pipe structures possible

reconfigurability makes sense

Kress-Kung paradigm: super systolic array

Data Counter instead of Program Counter (Kress-Kung machine paradigm)

Generalization of the DMA, drastically reducing memory cycles

ASM: Auto-Sequencing Memory (RAM with GAG as data counter)

GAG & enabling technology: multiple publications 1989 …

Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]

Storage Scheme optimization methodology, etc.*

*) IMEC, Leuven & TU-KL
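The idea of a data counter that sequences memory accesses without fetching instructions can be illustrated as follows. This is a sketch under stated assumptions, not the GAG design from the cited publications: the row-by-row 2-D scan pattern, the parameter names, and the names `gag_scan` and `AutoSequencingMemory` are hypothetical.

```python
def gag_scan(base, width, x_range, y_range, x_step=1, y_step=1):
    """Yield linear addresses for a row-by-row 2-D scan window."""
    for y in range(y_range[0], y_range[1], y_step):
        for x in range(x_range[0], x_range[1], x_step):
            yield base + y * width + x


class AutoSequencingMemory:
    """RAM plus address generator: iterating it produces a data stream."""

    def __init__(self, ram, addresses):
        self.ram = ram
        self.addresses = iter(addresses)   # plays the role of the data counter

    def __iter__(self):
        return self

    def __next__(self):
        # One memory access per data item; no instruction is fetched.
        return self.ram[next(self.addresses)]


if __name__ == "__main__":
    ram = list(range(100, 116))            # a 4 x 4 array stored row by row
    asm = AutoSequencingMemory(ram, gag_scan(0, 4, (1, 3), (1, 3)))
    print(list(asm))                       # [105, 106, 109, 110]
```

The address pattern is fixed once, before the run, so each data item costs a single memory cycle instead of the additional cycles an instruction stream would spend on address computation.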

fine-grained RC: DeHon‘s 1st Law [1996: Ph. D, MIT]

[Figure: density in transistors / microchip, 1980 to 2010, axis marked 100 and 10^12: Gordon Moore curve (microprocessor) compared with FPGA physical, FPGA logical and FPGA routed; gap marked >> 10 000]

overhead: reconfigurability overhead, wiring overhead, routing congestion

immense area inefficiency

coarse-grained RC: Hartenstein‘s amendment of DeHon‘s 1st Law [1996: ISIS, Austin, TX]

[Figure: transistors / microchip, 1980 to 2010: Gordon Moore curve compared with rDPA physical and rDPA logical (KressArray family) and with FPGA routed; gap of FPGA routed marked >> 10 000]

area efficiency very close to Moore‘s law

More compute power by Configware than Software

Conclusion: most compute power from Configware

75% of all (micro)processors are embedded 4 : 1

average acceleration factor > 2 -> rMIPS* : MIPS > 2

*) rMIPS: MIPS replaced by FPGA compute power

25% embedded µProc. accelerated by FPGA(s)

(a very cautious estimation**)

**) Dataquest interaction pending

-> 1 : 1-> Every 2nd µProc accelerated by FPGA(s)

(difference probably an order of magnitude)

Conclusion (3)

Self-Repair and Self-Organization methodology

Embedded r-emulation logistics methodology

Universal vHPC co-architecture demonstrator

select a killer application for demo

For widely spreading its use successfully:

Dual Paradigm Application Development Support

other example:

high level language: MATLAB

adapter

software/configware co-compiler

software code (instruction-stream-based)
