CAPES / DFG Project: Universidade do Brasilia, Universitaet Kaiserslautern, Universitaet Karlsruhe
Reiner Hartenstein*
University of Kaiserslautern
November 14, 2003, Brasilia, Brazil
Present and Future of Reconfigurable Systems
*) IEEE fellow
© 2003, reiner@hartenstein.de http://hartenstein.de
University of Kaiserslautern
Xputer Lab
Literature (also downloads)
http://hartenstein.de
Also click "recent talks" on this page, which also links to available Ph.D. theses:
Becker, Herz, Kress, Nageldinger,
Reconfigurable Computing: a second programming domain
Migration of programming to the structural domain
The structural domain has become RAM-based
The opportunity to introduce the structural domain to programmers ...
... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm
IT ages
[Timeline figure, 1957 to 2007: mainframe age, computer age (PC age), morphware age (data streams, flowware); marker "here?" on the time axis]
von Neumann does not support morphware
>> outline <<
• Fine grain reconfigurable
• Placement and routing
• Coarse grain reconfigurable
• Flowware
• Datastream-based Computing
• The Anti Machine Paradigm
• Final Remarks
http://www.uni-kl.de
Fine grain
• Fine-grain morphware platforms
already mainstream: reconfigurable logic
just logic design on a strange platform?
speed-up of up to 3 orders of magnitude
NRE and mask cost
[Figure: mask set cost in millions of $ (scale 1 to 4) versus feature size (0.8, 0.6, 0.35, 0.25, 0.18, 0.15, 0.13, 0.1, 0.07) and no. of masks (12, 12, 16, 20, 26, 28, 30, >30); sources: eASIC, Dataquest]
Top 4 PLD manufacturers 2000 (total: $3.7 billion): Xilinx 42%, Altera 37%, Lattice 15%, Actel 6%
[Pie chart: PC 25%, communication 22%, consumer 16%, automotive 6%, others 31%]
• [Dataquest] > $7 billion by 2003.
• FPGAs going into every type of application, also SoC
• fastest growing segment of the semiconductor market
rGAs: you don't need specific silicon!
rGA with island architecture (detail)
[Figure labels: switch box, connect box]
• Reconfigurable: switch box
[Figure: switch box with its switch points]
• Reconfigurable: connect box
[Figure: connect box with a connect point; label: part of configuration memory]
• Reconfigurable: connect point (enlarged)
[Figure: connect point next to a reconfigurable logic box (illustration)]
Connection activated
[Figure: reconfigurable logic box (illustration)]
The line feeding the function select of the rLB is not shown
• Routing: connect point activated
• Routing: switch points activated
[Figure: switch box; 3 switch points activated, then the 4th switch point and the 5th switch point]
• Routing: routing continued
• Routing: routing completed for 1 net
[Figure: net routed from A to B]
Placement and routing software has been known for 25 years.
Such network problems can be handled manually or with the help of graph theory.
1979: Silva Lisco (Silicon Valley Research Corp.) offers CALM-P
20 transistors + 20 flipflops
Routing: long distance nets
[Figure: nets A and B]
Passing through: long-distance wiring from rLBs outside this region
A path can be used by only one net at a time ...
Routing congestion
[Figure: nets A and B already routed; terminals C and D]
C and D are not reachable: C cannot be connected with D.
A bridge can be passed only once (bridges of Königsberg)
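The congestion effect on this slide can be reproduced with a small routing sketch. The code below is only an illustration under stated assumptions: it uses a Lee-style breadth-first maze router on a hypothetical 3 x 3 grid of routing cells, which is a textbook technique and not necessarily what any rGA tool does. The function names `route_net` and `route_all` and the terminal coordinates are made up for the example.

```python
from collections import deque

def route_net(grid, start, goal):
    """Lee-style breadth-first search for a shortest free path on a routing grid.

    grid[r][c] is True when that routing cell is already occupied by another net.
    Returns the path as a list of (row, col) cells, or None if the net is unroutable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:          # walk back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not grid[nr][nc] and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

def route_all(rows, cols, nets):
    """Route nets one after another; every routed path blocks its cells for later nets."""
    grid = [[False] * cols for _ in range(rows)]
    result = {}
    for name, (start, goal) in nets.items():
        path = route_net(grid, start, goal)
        result[name] = path
        if path is not None:
            for r, c in path:
                grid[r][c] = True            # a path can be used by only one net
    return result

# Hypothetical terminals: net A-B is routed first and cuts the region in two,
# so C can no longer be connected with D.
routes = route_all(3, 3, {"A-B": ((1, 0), (1, 2)), "C-D": ((0, 1), (2, 1))})
```

Routing the nets in the opposite order would block A-B instead, which is why real routers reorder nets or rip up and re-route.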
Leonhard Euler
Euler's problem of the bridges of Königsberg (1736) is such a network problem:
Find a path that crosses each bridge exactly once ...
... also an optimization: none of the bridges remains unused.
L. Euler: Solutio problematis ad geometriam situs pertinentis; Commentarii Academiae Scientiarum Imperialis Petropolitanae 8 (1736), pp. 128-140
[Figure: the bridges of Königsberg drawn as a graph; nodes: Left Bank, Right Bank, Kneiphof Island, Other Island; edges: the bridges]
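Euler's argument can be checked mechanically. The sketch below encodes the seven bridges as edges of a multigraph, using the node names from the slide, and applies the degree criterion: a connected graph has a walk that uses every edge exactly once only if zero or two nodes have odd degree. The helper names are illustrative, and connectivity is assumed rather than tested.

```python
from collections import Counter

# The seven bridges of Königsberg as edges of a multigraph.
BRIDGES = [
    ("Left Bank", "Kneiphof Island"), ("Left Bank", "Kneiphof Island"),
    ("Right Bank", "Kneiphof Island"), ("Right Bank", "Kneiphof Island"),
    ("Left Bank", "Other Island"), ("Right Bank", "Other Island"),
    ("Kneiphof Island", "Other Island"),
]

def odd_degree_nodes(edges):
    """Return the nodes touched by an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)

def has_euler_walk(edges):
    """Degree criterion only; the graph is assumed to be connected."""
    return len(odd_degree_nodes(edges)) in (0, 2)
```

All four land masses have odd degree (5, 3, 3, 3), so no such walk exists. A routing region has the same character: once every "bridge" is taken, further nets cannot cross.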
Data structures for graphs
[Figure: a directed graph and an undirected graph on nodes 1 to 4, each shown with its adjacency matrix (rows "from", columns "to") and its adjacency list]
J. E. Hopcroft, R. E. Tarjan: Efficient algorithms for graph manipulation; Comm. ACM, 1973
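The two data structures named on this slide can be sketched as follows. The edge list `EDGES` is a hypothetical 4-node example chosen for illustration; it is not the graph drawn on the slide, whose matrices did not survive extraction.

```python
def to_adjacency_matrix(n, edges, directed=True):
    """n x n matrix: entry [i][j] is 1 if there is an edge from node i+1 to node j+1."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src - 1][dst - 1] = 1
        if not directed:
            matrix[dst - 1][src - 1] = 1     # undirected graphs give a symmetric matrix
    return matrix

def to_adjacency_list(n, edges, directed=True):
    """Dictionary: node -> list of successor (or neighbour) nodes."""
    adj = {node: [] for node in range(1, n + 1)}
    for src, dst in edges:
        adj[src].append(dst)
        if not directed:
            adj[dst].append(src)
    return adj

EDGES = [(1, 2), (1, 4), (2, 3), (3, 4)]     # hypothetical example graph

directed_matrix = to_adjacency_matrix(4, EDGES)
undirected_list = to_adjacency_list(4, EDGES, directed=False)
```

The matrix needs n² entries regardless of the number of edges, while the list grows with the edge count, which is why sparse netlists are usually kept as lists.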
Large Scale Routing
ENIAC, completed 1945
Partitioning over racks in the hall
Partitioning over card cages in the rack
Partitioning over boards (cards) in card cages
Partitioning over chips etc. on the card (e.g. SBC)
Partitioning over blocks on the chip (e.g. microprocessor)
PCBs (printed circuit boards) for 40 years
MULTEC at Böblingen has produced printed circuit boards since 1963
planar "wiring"
no. of pins is limited
Integrated Circuit (Chip)
limited number of pins
"wiring" on a planar surface
Hierarchy
rack, card cage, card, chip, macro cell, basic cell, more levels
[Figure residue: node labels Kaiserslautern 1, KL2, KL3, KL4, FTI1, FTI2, JWGU, IMS1, IMS2, IMS3, IMS]
Wiring hierarchy
cables in the rack connect the card cages
card cage wiring connects the cards
card wiring connects the chips
on-chip wiring connects the cells (macro cell, cell)
*) 1930s: telephone switching (no chips; crossbar / two-motion selectors instead of cards); 1940s: first computers (no chips)
An obsolete Application Area
before fabrication?
after fabrication?
Emulators
[Photos: Celaro Pro (Mentor), Quickturn, Dini Group (PCI bus extender)]
Crossbar
n = crossbar chips in a row
full crossbar: no. of crossbar chips = n x n/2; n = 8: 32 chips; n = 100: 5000 chips
partial crossbar: no. of crossbar chips = n; n = 8: 8 chips; n = 100: 100 chips
[Figure residue: labels 324 x 4, n=8, 64, 64, 14, 32]
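The chip counts on this slide follow directly from the two formulas. A minimal sketch that reproduces them, assuming the reading "n x n/2" for the full crossbar and "n" for the partial crossbar (the function names are illustrative):

```python
def full_crossbar_chips(n):
    """Full crossbar: n x n/2 crossbar chips for n chips in a row."""
    return n * n // 2

def partial_crossbar_chips(n):
    """Partial crossbar: the chip count grows only linearly with n."""
    return n

# Chip counts for the two cases listed on the slide.
table = {n: (full_crossbar_chips(n), partial_crossbar_chips(n)) for n in (8, 100)}
```

For n = 100 the full crossbar needs 50 times as many crossbar chips as the partial one, which is the scaling argument of the slide.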
Routing
14 Logic Chips (Lchip) with 128 pins (occasionally for route-through)
32 Crossbar Chips (Xchip) with 72 I/O pins (for route-through only)
each Xchip: 4 pins connected to each Lchip
8 Logic cards per card cage
8 card cages per rack
8 Ychip cards per card cage
Backplane: 8 Zboard cards per rack
[Figure labels: logic card, card cage, rack]
Crossbar?
1913: J. N. Reynolds' crossbar switch
1915: patent granted
1919: Betulander's crossbar switch
1926: first public telephone switching application in Sweden
1964: NASA telemetrics crossbar array
RWC Real World Computing, Japan, 40 TFLOPS
Crossbar weight: 220 tons, 3000 km cable, 5120 processors with 5000 pins each
Routing Congestion: Example
[Figure: 2 x 4 array of rGAs; direct connection impossible; detour connection by route-through]
• Routing: routing-only configuration (2 examples)
[Figure: rLB with the identity function configured]
Graphs, Partitioning, Algorithms
T. Uehara, W. M. van Cleemput: Optimal Layout of CMOS Functional Arrays; IEEE Trans. C-30, pp. 305-312, May 1981
B. Kernighan, S. Lin: An Efficient Heuristic Procedure for Partitioning Graphs; BSTJ 49, 1970
C. Alpert, A. Kahng: Recent Directions in Netlist Partitioning: A Survey; Integration, vol. 19 (1-2), pp. 1-81, 1995
T. Cormen et al.: Introduction to Algorithms; MIT Press / McGraw-Hill, 1991
Why emulators are obsolete
[Figure: system gates per rGA chip (log scale, 100 to 10 000 000) versus year (1984 to 2004); data points: XC 4085XL, XC 40250XV, Virtex, Virtex II, planned; source: Xilinx data]
ASICs lost the battle. rGAs are the winners
Why is the ASIC business declining?
More and more, the prototyping platform of rGA-based systems will be delivered directly to the customer as the product: fully configured.
ASIC emulators have been a transient solution, now with declining commercial significance.
[Figure: number of design starts (0 to 50,000) per year, 2001 to 2004; curve label: rGA-based; source: N. Tredennick, Gilder Technology Report, 2003]
You don't need specific silicon!
Xilinx: full hierarchy on chip (from rack to chip)
• Xilinx Virtex-II Pro FPGA architecture
• FPGA fabric based on the Virtex-II architecture
• PowerPC 405 RISC CPU (PPC405) cores
[Figure labels: on-chip memory controller, PowerPC core, embedded RAM, RocketIO; source: Ivo Bolsens, Xilinx]
Focusing on coarse grain
• Fine-grain morphware platforms: already mainstream; reconfigurable logic is just logic design on a strange platform
• Coarse-grain platforms: Reconfigurable Computing is not that new, but it is shaking the fundamentals of CS curricula
an order of magnitude more MIPS/mW than fine grain
Why coarse grain
[Figure: MOPS / mW (log scale, 0.001 to 1000) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves: hardwired, rDPAs (reconfigurable computing)*, FPGAs (reconfigurable logic), DSP, instruction set processors, standard microprocessor; source: T. Claasen et al.: ISSCC 1999]
[Figure: flexibility versus throughput for hard-wired, von Neumann, FPGAs and coarse grain]
Coarse grain goes far beyond bridging the gap
*) R. Hartenstein: ISIS 1997
rDPA (Reconfigurable Datapath Array)
[Figure: two 4 x 4 arrays of rDPUs; one with the Reconfigurable Interconnect Fabric (RIF) in a separate routing area; one with the RIF laid out over the rDPUs, so that the rDPA is wired by abutment]
CMOS interconnect resources
Foundries offer up to 9 metal layers and up to 3 poly layers
reconfigurable interconnect fabric laid out over the rDPU cell
Commercial rDPAs
XPU family (IP cores): PACT Corp., Munich
[Figure: XPU128]
Mapping algorithms efficiently onto rDPA
SNN filter on KressArray
array size: 10 x 16 = 160 rDPUs
Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing; port location marker; backbus connect
by the way: an example of scalability / relocatability by EDA support
"Structured Configware Design" [R. H.]
• Routing: badly scalable
Hundreds of rGAs or very large rGAs
Routing congestion growing exponentially
Communication Resource Requirements
... often Functional Resources are not the Throughput Bottleneck
In some application areas, e.g. wireless communication, reconfigurable computing arrays need extraordinarily rich and powerful communication resources
The solution: generators for domain-specific RA platforms
KressArray Family generic Fabrics: a few examples
Select the function repertory
Select the Nearest Neighbour (NN) interconnect (an example): select mode, number and width of NN ports
more NN ports: rich routing resources
rDPU modes: route-through only; route-through and function
Examples of 2nd-level interconnect: laid out over the rDPU cell, no separate routing areas!
[Figure residue: labels 16, 32, 8, 24, 4, 2, rDPU]
http://kressarray.de
Super Pipe Networks
The key is mapping, rather than architecture
Table columns: array; applications; pipeline properties (shape, resources); mapping; scheduling (data stream formation)
systolic array: regular data dependencies only; shape: linear only; resources: uniform only; mapping: linear projection or algebraic synthesis
super-systolic RA: no restrictions; mapping: simulated annealing or P&R algorithm; scheduling: (e.g. force-directed) scheduling algorithm
**) KressArray [ASP-DAC-1995]
Morphware machines vs. hardwired machines
A clear terminology helps a lot
platform | program source running on it
hardware | (not programmable)
morphware, fine grain: rGA (FPGA) | configware
morphware, coarse grain: rDPU, rDPA | configware
machine: reconfigurable data stream processor | flowware & configware
machine: hardwired data stream processor | flowware
machine: instruction stream processor (v. N.) | software
Flowware defines: ... which data item at which time at which port
[Figure: a DPA with input data streams and output data streams; each stream is drawn as data items (x, |, -) over the axes "time" and "port #"]
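One way to make "which data item at which time at which port" concrete is to treat flowware as a table keyed by (time, port). The sketch below is an assumption-laden illustration, not a flowware language from the talk: it builds skewed input streams in the style commonly used to feed systolic arrays, where row i of a matrix enters port i delayed by i time steps. The names `skewed_streams` and `items_at` are hypothetical.

```python
def skewed_streams(matrix):
    """Return {(time, port): item}: row i is fed to port i, delayed by i time steps."""
    schedule = {}
    for port, row in enumerate(matrix):
        for k, item in enumerate(row):
            schedule[(port + k, port)] = item
    return schedule

def items_at(schedule, time):
    """Items presented at each port during one time step: {port: item}."""
    return {port: item for (t, port), item in sorted(schedule.items()) if t == time}

# Two input streams of three items each (hypothetical data).
flow = skewed_streams([[1, 2, 3], [4, 5, 6]])
```

Nothing in the table says how an item is computed; it only fixes the arrival time and port, which is the division of labour between flowware and configware described on the neighbouring slides.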
Paradigm Shifts: Nick Tredennick's view
instruction-stream-based computing: resources fixed, algorithms variable; program source: Software (instruction-stream)
data-stream-based reconfigurable computing: resources variable, algorithms variable; both programmable
why 2 program sources? Configware (resources variable) and Flowware (data-stream)
Flowware heading toward mainstream
• Data-stream-based computing is heading for mainstream
– 1997 SCCC (LANL): Streams-C Configurable Computing
– SCORE (UCB): Stream Computations Organized for Reconfigurable Execution
– ASPRC (UCB): Adapting Software Pipelining for Reconfigurable Computing
– 2000 Bee (UCB), ...
– Most stream-based multimedia systems, etc.
– Many other areas ...
Flowware is mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation
Flowware: managing data streams
Software: managing instruction streams
Control-procedural vs. data-procedural
The structural domain is primarily data-stream-based:
Flowware provides a (data-)procedural abstraction of the (data-stream-based) structural domain
Flowware converts "procedural vs. structural" into "control-procedural vs. data-procedural" ...
... a Trojan horse to introduce the structural domain to the procedural mind set of programmers
Configware / Flowware Compilation
[Figure labels: high level source, wrapper, rDPA intermediate, mapper (output: configware), scheduler (output: flowware), reconfigurable DataPath Array, data streams, memory banks (M), data sequencer / address generator, asM distributed memory architecture]
"instruction" fetch before runtime
Extremely high efficiency: flowware-based computing
1. avoiding address computation memory cycle overhead
2. avoiding instruction fetch and interpretation overhead
3. high parallelism, massively multiple deep pipelines
4. much less configuration memory
5. interconnect laid out over the cell: no extra routing areas
6. methodologies readily available
Programming Language Paradigms
language category | Software Languages | Languages for the Anti Machine (flowware languages)
sequencing | both: deterministic procedural sequencing, traceable, checkpointable
operation sequence driven by | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register | program counter | data counter(s)
address computation | massive memory cycle overhead | overhead avoided
instruction fetch | memory cycle overhead | overhead avoided
parallel memory bank access | interleaving only | no restrictions
language features | control flow + data manipulation | data streams only (no data manipulation)
Flowware languages: much more simple, much more powerful, very easy to learn (multiple GAGs)
Machine Paradigms
machine category | Computer (the Machine: "v. Neumann") | The Anti Machine
driven by | instruction streams | data streams (no "dataflow")
engine principles | instruction sequencing | sequencing data streams
state register | single program counter | (multiple) data counter(s)
communication path set-up ("instruction fetch") | at run time | at load time
data path resource | DPU (e.g. single ALU) | DPU or DPA (DPU array) etc.
data path operation | sequential | parallel pipe network etc.
also hardwired implementations*
*) e.g. Bee project, Prof. Brodersen
Computing paradigms and methodologies
1946: machine paradigm (von Neumann)
1980: data streams (Kung, Leiserson)
1989: anti machine paradigm
1990: 1st rDPU* (Rabaey)
1994: anti machine high level programming language
1995: super systolic rDPA (Kress)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
1997+: discipline of distributed memory architecture
1997: 1st configware / software partitioning compiler
*) rDPU = reconfigurable Data Path Unit
The Secret of Success: Co-Compilation
[Figure labels: high level PL source, Analyzer / Profiler, Partitioner, Resource Parameters, SW compiler (output: SW code for the "vN" machine paradigm), CW compiler (output: CW code for the anti machine paradigm)]
supporting different platforms
supporting platform-based design
Machine paradigms
[Figure: von Neumann instruction stream machine: CPU (instruction sequencer + DPU) with memory (M) and I/O, driven by the instruction stream (Software)]
[Figure: data-stream machine (anti machine): DPU or rDPU, or (r)DPA, with asM** memory banks (M) and I/O; data address generator (data sequencer); driven by the data stream (Flowware); reconfigurable versions also use (Configware)]
distributed memory architecture*
*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS, 2002
also see books by Francky Catthoor et al.
Synthesizable distributed memory architecture for a Stream-based Soft Machine
[Figure labels: Compiler, Scheduler, Sequencers (data stream generator), Memory (data memory) built from memory banks, rDPA, "instructions"]
PC replaced by PS (personal supercomputer)
[Timeline figure, 1957 to 2007: mainframe age, computer age (PC age), morphware age (data streams, flowware); µProc (von Neumann) and rDPA (anti machine), joined by a co-compiler]
All methodologies available
free know-how for the personal supercomputer
[Timeline figure, 1957 to 2007: morphware age (data streams, flowware); µProc and rDPA, joined by a co-compiler]
... and all other methodologies available from literature
We have an education problem
... we need a second machine paradigm
The typical programmer has difficulty understanding function evaluation without machine mechanisms ...
Traditional CS: programming is (control-)procedural, instruction-stream-based; sources: software
It's the gap between the procedural and the structural mind set (µprocessor vs. accelerators)
Crossing the Hardware / Software Chasm [Mike Butts]
Ubiquitous Embedded Systems
... and the main focus in system design
embedded software and configware have become the main vehicle for product differentiation ...
(Performance and) Flexibility are key issues
current CS curricula do not qualify our students
Misqualified: jobless CS graduates?
[Figure: growth factor (1 to 2) versus months (0, 10, 12, 18); curves: embedded software [DTI* law] and [Moore's law]; annotation: (1.4/year)]
*) Department of Trade and Industry, London
90% of all code is written for embedded systems
The real labor market: by 2010, 10 times more programmers will write embedded applications than computer software
EDA Industry Revolution every 7 years
EDA industry paradigm switching every 7 years [Keutzer / Newton]
1978: Transistor entry: Applicon, Calma, CV ...
1985: Schematics entry: Daisy, Mentor, Valid ...
1992: Synthesis (HDLs): Cadence, Synopsys ...
1999
[Figure: McKinsey curves]
EDA the main bottleneck
[Figure courtesy of Richard Newton]
math formula? TRS?
Biggest Mistake of EDA
guess it!
The next EDA Industry Revolution
EDA industry paradigm switching every 7 years [Keutzer / Newton]
1978: Transistor entry: Applicon, Calma, CV ...
1985: Schematics entry: Daisy, Mentor, Valid ...
1992: Synthesis (HDLs): Cadence, Synopsys ...
1999: (Co-)Compilation: data-stream-based DPAs
Von Neumann does not support Morphware
higher abstraction level: System-C; math formula: TRS*
*) Term Rewriting Systems
[Figure: McKinsey curves]
Algorithmic cleverness needed
Example: migration from signal processor to rGA; very high throughput on slow, low-power FPGAs is obtained only by algorithmic cleverness:
loop transformations ...
We need an all-embracing taxonomy of algorithms and a survey of algorithm transformations ...
optimization, partitioning, signal processing, (de-)coding algorithms (wireless communication), image processing, sorting, ... and many more areas ...
Algorithmic cleverness needed for CS graduates in embedded systems
the hardware / configware / software partitioning problem: current CS curricula do not qualify our students
software / configware migration: current CS curricula do not qualify our students
extending software engineering into software / flowware engineering: the anti machine paradigm and reconfigurable computing are the curricular enablers
Thank you
- END -
Appendix for discussion
Processor Memory Performance Gap
[Figure: performance (log scale, 1 to 1000) versus year (1980 to 2000); CPU (µProc): 60% / year; DRAM: 7% / year]
Processor-memory performance gap: grows 50% / year
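The 50% figure follows from the two growth rates on the chart. A short worked calculation, assuming both rates compound annually over the 20 years shown:

```latex
\[
\frac{1 + 0.60}{1 + 0.07} = \frac{1.60}{1.07} \approx 1.495
\quad\Rightarrow\quad \text{the gap grows by about } 50\%\ \text{per year}
\]
\[
\left(\frac{1.60}{1.07}\right)^{20} \approx 3 \times 10^{3}
\quad \text{(accumulated from 1980 to 2000)}
\]
```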
Why a dichotomy of machine paradigms?
[Figure borrowed from Bob Colwell: vN is unbalanced; CPU, caches, ..., vN bottleneck]
data stream machine:
• bad news: caches do not help
• good news: no vN bottleneck
• caches not needed
The anti machine has no von Neumann bottleneck
"Pollack's Law" (simplified)
[Figure: growth factor of performance and of area efficiency versus feature size in µm; source: Intel]
Loop Transformation Examples
loop unrolling: "loop 1-16 body endloop" becomes "loop 1-8 body body endloop"
strip mining: "loop 1-16 body endloop" becomes fork, "loop 1-8 body endloop" and "loop 9-16 body endloop", join
host and reconfigurable array: the host runs trigger loops ("loop 1-8 trigger endloop", "loop 1-4 trigger endloop", "loop 1-2 trigger endloop") while the body runs on the reconfigurable array
sequential processes: resource parameter driven Co-Compilation
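The two transformations named on this slide can be sketched in ordinary sequential code. This is a minimal illustration under stated assumptions: `body` is a hypothetical loop body, the strips run one after another here (on a reconfigurable array they could run in parallel between fork and join), and the unrolled version assumes an even iteration count.

```python
def body(i, out):
    """Hypothetical loop body: record the square of the loop index."""
    out.append(i * i)

def original(n=16):
    out = []
    for i in range(1, n + 1):                 # loop 1-16 body endloop
        body(i, out)
    return out

def strip_mined(n=16, strip=8):
    out = []
    for start in range(1, n + 1, strip):      # one strip per iteration: 1-8, then 9-16
        for i in range(start, min(start + strip, n + 1)):
            body(i, out)
    return out

def unrolled_by_two(n=16):
    out = []
    for i in range(1, n + 1, 2):              # loop 1-8 body body endloop; n must be even
        body(i, out)
        body(i + 1, out)
    return out
```

All three variants produce the same result; what changes is how many loop iterations remain and how much work each iteration exposes to the hardware.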
The design crisis
The long turnaround times of ASIC fabrication are becoming increasingly unaffordable
Growing demand: fast patches and upgrades, preferably at the customer's site, to extend the longevity of the product
[Figure: design cost versus year, compared with the product life cycle]
Summary of the Anti Machine Paradigm
• anti language primitives are almost the same (slightly extended)
• anti machine execution potential is dramatically more powerful
• provides drastically more flexibility
• not always replacing von Neumann
Reconfigurable Computing: a second programming domain
Migration of programming to the structural domain
Currently running: the next fundamental revolution after the introduction of the microprocessor
The structural domain has become RAM-based
However, CS curricula ignore this impact of Reconfigurable Computing, a key issue in embedded systems ...
... causing the coming disaster of unqualified CS graduates pushing up the unemployment rate?
All enabling technologies are available
• anti machine and all its architectural resources
• parallel memory IP cores and generators
• anything else needed
• languages & (co-)compilation techniques
• morphware vendors like PACT ...
• literature from the last 30 years
New horizons
• A new RAM-based platform going mainstream
• Configware industry
• New machine paradigm
• New theory needed
• New architectures, without v. N. bottleneck
• New compilation techniques
• More effective parallelism provided
• Rich material is already available in many areas
• Lots of similarities with the classical v. N. world
• But a few asymmetries: a challenge
Evangelist's material + lobby space
Evangelist's material:
• http://hartenstein.de (click "recent talks")
Lobby space:
• http://morphware.net
• http://configware.org
• http://data-streams.org
• http://flowware.net
Trailblazer group:
• you are welcome to improve, rewrite, post links ...
• you are welcome to join the trailblazer group
The genius of von Neumann
• enormous impact of the von Neumann paradigm
• even stronger impact by a dichotomy of paradigms:
• von Neumann of matter
• von Neumann of anti matter
• von Neumann machine vs. anti machine
• does not mean toppling v. N.'s monument
• it multiplies the glory of von Neumann
MPU performance stalled
Moore's law will stall soon for MPUs
Bill Gates' law: relative computation time needed doubles every 2 years
had been compensated by Moore's law
Basics of Binding Time
time of "Instruction Fetch":
run time: microprocessor, parallel computer
loading time: Reconfigurable Computing
compile time
Time to Market
• Morphware brings a new dimension to digital system development and has a strong impact on SoC design.
• Flexibility supports turn-around times of minutes instead of months for real-time in-system debugging, profiling, verification, tuning, field maintenance, and field upgrades
• A New Business Model (in-field debugging and upgrading ...)
• A Fundamental Paradigm Shift in Silicon Application
[Figure: revenue / month versus time / months (1 to 30); ASIC product compared with a reconfigurable product with downloads (Update 1, Update 2); source: Tom Kean]
KressArray principles
• take systolic array principles
• replace classical synthesis by simulated annealing
• yields the super systolic array
• a generalization of the systolic array
• no longer restricted to regular data dependencies
• now reconfigurability makes sense
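To illustrate what "replace classical synthesis by simulated annealing" can mean in practice, here is a generic placement sketch. It is not the KressArray mapper: the cost function (total Manhattan wire length), the swap move, the cooling schedule and all names and parameters are assumptions chosen for a small, self-contained example.

```python
import math
import random

def wire_length(placement, nets):
    """Total Manhattan distance over all two-terminal nets."""
    return sum(abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
               for a, b in nets)

def anneal(ops, nets, rows, cols, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Place operators on a rows x cols array; needs rows * cols >= max(2, len(ops))."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    slots = list(ops) + [None] * (len(cells) - len(ops))    # None marks an unused cell
    rng.shuffle(slots)

    def placed(s):
        return {op: cells[i] for i, op in enumerate(s) if op is not None}

    cost = wire_length(placed(slots), nets)
    initial_cost = cost
    best, best_cost = placed(slots), cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling
        i, j = rng.sample(range(len(slots)), 2)
        slots[i], slots[j] = slots[j], slots[i]                 # move: swap two cells
        new_cost = wire_length(placed(slots), nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                                     # accept the move
            if cost < best_cost:
                best, best_cost = placed(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]             # reject: undo the swap
    return best, best_cost, initial_cost

# Hypothetical data path: six operators connected in a chain.
OPS = ["a", "b", "c", "d", "e", "f"]
NETS = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
best, best_cost, initial_cost = anneal(OPS, NETS, rows=3, cols=3)
```

Because the search accepts any net list, nothing restricts it to the regular data dependencies that linear projection requires, which is the point of the generalization on this slide.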
Significance of Address Generators
• Address generators have the potential to reduce computation time significantly.
• In a grid-based design rule check, a speed-up of more than 2000 has been achieved compared to a VAX-11/750
• Dedicated address generators contributed a factor of 10 by avoiding memory cycles for address computation overhead
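The role of an address generator can be sketched in software, with the caveat that the real benefit comes from doing this in dedicated hardware next to the memory bank. In the hypothetical example below, `scan_addresses` produces the address sequence for a row-major scan of a block, and `data_stream` turns it into a data stream, so the consumer never computes an address itself. Names and parameters are illustrative.

```python
def scan_addresses(base, row_stride, rows, cols):
    """Yield linear memory addresses for a row-major scan of a rows x cols block."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c

def data_stream(memory, addresses):
    """Turn an address sequence into a stream of data items."""
    for addr in addresses:
        yield memory[addr]

memory = list(range(100, 200))                       # hypothetical memory contents
window = list(data_stream(memory, scan_addresses(0, 8, 2, 2)))
```

The scan pattern is fixed before run time by four parameters, which is the sense in which address computation no longer costs memory cycles during execution.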
Acceleration Mechanisms
• parallelism by multi-bank memory architecture
• auxiliary hardware for address calculation
• address calculation before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• improve parallelism by storage scheme transformations
• improve parallelism by memory architecture transformations
• alleviate interconnect overhead (delay, power and area)
Why coarse grain?
[Figure: 1960 to 2010, log scale (1 to 100 000 000): transistors/chip, memory, normalized processor speed (microprocessor / DSP), algorithmic complexity (Shannon's Law) with wireless generations 1G to 4G, battery performance; second axis: computational efficiency in mA/MIP (0.001 to 100), data points StrongARM and SH7752; sources: Proc. ISSCC, ICSPAT, DAC, DSPWorld]