Scalar Operand Networks for Tiled Microprocessors
Michael Taylor
Raw Architecture Project
MIT CSAIL
(now at UCSD)
Until 3 years ago, computer architects used the N-way superscalar to encapsulate the ideal for a parallel processor: nearly "perfect," but not attainable.
Superscalar
"PE"->"PE" communication: Free
Exploitation of parallelism: Implicit (hw scheduler, or compiler as in VLIW)
Clean semantics: Yes
Scalable: No
Power efficient: No
What's great about superscalar microprocessors? It's the networks!
Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy). For lack of a better name, let's call them Scalar Operand Networks (SONs).
Can we combine the benefits of superscalar communication with multicore scalability? In other words, can we build scalable Scalar Operand Networks?
(I agree with Jose: "We need low-latency tightly-coupled ... network interfaces" - Jose Duato, OCIN, Dec 6, 2006)
mul $2,$3,$4
add $6,$5,$2
(The add consumes operand $2 produced by the mul; the superscalar's bypass network forwards it with 0-1 cycles of latency and no occupancy.)
The industry shift toward multicore: attainable, but hardly ideal.
                             Superscalar   Multicore
"PE"->"PE" communication     Free          Expensive
Exploitation of parallelism  Implicit      Explicit
Clean semantics              Yes           No
Scalable                     No            Yes
Power efficient              No            Yes
What we'd like is neither superscalar nor multicore: superscalars have fast networks and great usability; multicore has great scalability and efficiency.
Why communication is expensive on multicore
[Figure: a message from Multiprocessor Node 1 to Node 2 incurs send occupancy and send latency (the send overhead), the transport cost of the network, then receive latency and receive occupancy (the receive overhead).]
Multiprocessor SON Operand Routing
Send side (Multiprocessor Node 1): send occupancy and send latency cover the launch sequence (destination node name, sequence number, value), commit latency, and network injection.
Receive side (Multiprocessor Node 2): receive occupancy and receive latency cover receive-sequence demultiplexing and the attendant branch mispredictions, and injection cost.
Similar overheads exist for shared-memory multiprocessors: store instructions, commit latency, spin locks (+ attendant branch mispredicts).
Defining a figure of merit for scalar operand networks
5-tuple <SO, SL, NHL, RL, RO>:
SO: Send Occupancy
SL: Send Latency
NHL: Network Hop Latency
RL: Receive Latency
RO: Receive Occupancy
Tip: the ordering follows the timing of a message from sender to receiver.
We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks...
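As a gloss (mine, not the talk's), here is a minimal C sketch of how the 5-tuple turns into an end-to-end operand cost. The Power4 and Raw tuple values appear later in the talk; lumping occupancy and latency into a single sum is a simplification.

#include <stdio.h>

/* Figure-of-merit 5-tuple <SO, SL, NHL, RL, RO>, all in cycles. */
typedef struct {
    int so;   /* send occupancy: cycles the sender spends injecting        */
    int sl;   /* send latency: cycles before the operand enters the net    */
    int nhl;  /* network hop latency: cycles per hop traversed             */
    int rl;   /* receive latency: cycles to surface at the receiver        */
    int ro;   /* receive occupancy: cycles the receiver spends extracting  */
} son_tuple;

/* Simplified end-to-end cost of moving one operand across `hops` hops. */
static int operand_cost(son_tuple t, int hops) {
    return t.so + t.sl + t.nhl * hops + t.rl + t.ro;
}

int main(void) {
    son_tuple power4 = {2, 14, 0, 14, 4};  /* on-chip, per the talk */
    son_tuple raw    = {0, 0, 1, 2, 0};    /* Raw's tiled SON       */
    printf("Power4, 0 hops: %d cycles\n", operand_cost(power4, 0));  /* 34 */
    printf("Raw, 3 hops:    %d cycles\n", operand_cost(raw, 3));     /* 5  */
    return 0;
}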
Impact of Occupancy ("o" = SO + RO):
if (o * "surface area" > "volume"), it is not worth it to offload: the overhead is too high (the parallelism is too fine-grained).
Impact of Latency: the lower the latency, the less work is needed to keep the sender busy while waiting for the answer. Otherwise it is not worth it to offload: the sender could have done the work itself faster (not enough parallelism to hide the latency).
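A hypothetical sketch of that break-even test (the linear model and the names are my illustration, not from the talk): offloading pays only when the occupancy charged on the region's communication "surface," plus any latency the sender cannot hide, is smaller than the "volume" of work moved.

/* Illustrative offload break-even test. "surface" = operands crossing
   the boundary between sender and remote PE; "volume" = cycles of work
   inside the offloaded region. */
static int worth_offloading(int surface_operands,  /* operands in + out   */
                            int volume_cycles,     /* work offloaded      */
                            int o,                 /* o = SO + RO         */
                            int unhidden_latency)  /* latency not hidden
                                                      by independent work */
{
    return o * surface_operands + unhidden_latency < volume_cycles;
}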
[Figure: Proc 0 offloads work to Proc 1 and has "nothing to do" while it waits for the result.]
The interesting region:
Power4 (on-chip): <2, 14, 0, 14, 4>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
                             Superscalar   Multicore   Tiled Multicore
PE-PE communication          Free          Expensive   Cheap
Exploitation of parallelism  Implicit      Explicit    Both
Scalable                     No            Yes         Yes (w/ scalable SON)
Power efficient              No            Yes         Yes
Tiled Microprocessors (or “Tiled Multicore”)
Transforming from multicore or superscalar to tiled:
Superscalar + scalability -> Tiled
CMP/multicore + scalable SON -> Tiled
The interesting region:
Power4 (on-chip): <2, 14, 0, 14, 4>
Raw (tiled): <0, 0, 1, 2, 0>
"Famous Brand 2" (tiled): <0, 0, 1, 0, 0>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
Scalability Problems in Wide Issue Microprocessors
[Figure: a 16-issue superscalar - wide-fetch control (16 inst), unified load/store queue, single PC, one register file, and 16 ALUs tied together by a global bypass network.]
Area and Frequency Scalability Problems
[Figure: N ALUs sharing one register file and bypass network.]
For N ALUs, the register file grows as ~N^3 and the bypass network as ~N^2. Ex: Itanium 2.
Without modification, frequency decreases linearly or worse.
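A one-line justification of those exponents (my gloss, using the standard VLSI argument): each of the N ALUs needs ~3 register-file ports (2 read, 1 write), so ports grow ~N; a register cell's area grows ~(ports)^2, and the number of registers in flight also grows ~N, giving RF area ~ N^2 * N = N^3. The bypass network must connect every ALU output to every ALU input, ~ N * N = N^2 wires and comparators.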
Operand Routing is Global
[Figure: an operand produced at one ALU (">>") crosses the global bypass network to reach its consumer ("+") - every result travels the whole chip.]
Idea: Make Operand Routing Local
[Figure: the 16 ALUs, register file, and bypass network reorganized so that operand routing is local rather than global.]
Idea: Exploit Locality
[Figure: a 4x4 grid of ALUs in which communicating instructions are placed near each other.]
Replace the crossbar with a point-to-point, pipelined, routed scalar operand network.
[Figure: the grid of ALUs connected by a point-to-point, pipelined, routed mesh; the ">>" producer now sends its operand a short hop to the "+" consumer.]
Operand Transport Scaling - Bandwidth and Area

              Un-pipelined crossbar bypass   Point-to-point routed mesh
Local BW      ~N^(1/2)                       ~N
Area          ~N^2                           ~N

(For N ALUs and N^(1/2) bisection bandwidth, as in a conventional superscalar. The mesh scales as 2-D VLSI.)
We can route more operands per unit time if we are able to map communicating instructions nearby.
Operand Transport Scaling - Latency
Time for an operand to travel between instructions mapped to different ALUs:

                           Un-pipelined crossbar   Point-to-point routed mesh
Non-local placement        ~N                      ~N^(1/2)
Locality-driven placement  ~N                      ~1

Latency bonus if we map communicating instructions nearby so communication is local.
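A worked instance of those asymptotics (my arithmetic, assuming N = 64 ALUs in an 8x8 mesh): the un-pipelined crossbar's delay grows with its N = 64 inputs regardless of where instructions are placed, while the mesh costs ~sqrt(64) = 8 hops under random (non-local) placement and ~1 hop once communicating instructions are placed on neighboring tiles.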
Distribute the Register File
[Figure: the monolithic register file is distributed - each of the 16 ALUs gets its own small register file. The ALU/RF/bypass fabric is now SCALABLE, but the wide-fetch control (16 inst), unified load/store queue, and single PC remain centralized.]
More Scalability Problems
[Figure: the remaining centralized structures - wide-fetch control (16 inst), unified load/store queue, and program counter.]
Distribute the rest: Raw – a Fully-Tiled Microprocessor
[Figure: the centralized fetch, load/store queue, and PC are distributed too - each of the 16 tiles gets its own PC, instruction cache (I$), data cache (D$), register file, and ALU. Tiles!]
Tiled Microprocessors
- fast inter-tile communication through the SON
- easy to scale (same reasons as multicore)
Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network
Raw Microprocessor
Tiled scalable microprocessor; point-to-point pipelined networks.
16 tiles, 16 issue.
Each 4 mm x 4 mm tile:
- MIPS-style compute processor: single-issue 8-stage pipe, 32b FPU, 32K D cache, I cache
- 4 on-chip networks: two for operands, one for cache misses, one for message passing
Raw Microprocessor Components
[Figure: each tile pairs a compute processor (fetch unit, instruction cache, data cache, functional units, execution core - the intra-tile SON) with a static router (switch processor, crossbar, instruction cache) that forms the inter-tile SON, plus two dynamic routers ("GDN" and "MDN") for the generalized transport networks. Inter-tile network links connect neighboring tiles; the diagram separates a trusted core from an untrusted core.]
Raw Compute Processor Internals
[Figure: the compute pipeline (stage labels from the figure: RF, A, TL, M1, M2, F, P, E, U) tapped into the crossbar; registers r24-r27 are register-mapped network ports.]
Ex: fadd r24, r25, r26 - the operands arrive from, and the result departs to, the networks via the register-mapped ports.
Tile-Tile Communication
add $25,$1,$2        (sender: the result is written to network-mapped register $25)
Route $P->$E         (sender's switch: route from processor to East port)
Route $W->$P         (receiver's switch: route from West port to processor)
sub $20,$1,$25       (receiver: the operand is read directly as register $25)
Compilation
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
...
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles.
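For reference, the whole computation implied by the renamed instruction list above (my reconstruction; the initial values of seed and v0-v3 are assumed):

/* Source reconstructed from the renamed instruction list above. */
float seed, v0, v1, v2, v3;            /* initial values assumed */

void step(void) {
    float tmp0 = (seed * 3.0f + 2.0f) / 2.0f;
    float tmp1 = seed * v1 + 2.0f;     /* reads the old v1 */
    float tmp2 = seed * v2 + 2.0f;     /* reads the old v2 */
    float tmp3 = (seed * 6.0f + 2.0f) / 3.0f;
    v2 = (tmp1 - tmp2) * 5.0f;
    v1 = (tmp1 + tmp2) * 3.0f;
    v0 = tmp0 - v1;
    v3 = tmp3 - v2;
}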
One cycle in the life of a tiled micro
[Figure: a snapshot of the 16-tile array running, all at once: httpd; a 4-way automatically parallelized C program; a 2-thread MPI app; a direct I/O stream into the scalar operand network; memory traffic (mem); and idle tiles (Zzz...).]
An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application...
One Streaming Application on Raw
[Figure: placement of one streaming application across tiles 0-15 - very different traffic patterns than RawCC-style parallelization.]
Auto-Parallelization Approach #2: StreamIt Language + Compiler
[Figure: the stream graph before and after fusion. Original: a splitter fans out to eight FIR filters feeding a joiner, then a second splitter fans out to four Vec Mult -> FIR Filter -> Magnitude -> Detector pipelines feeding a joiner. After fusion: each Vec Mult/FIR Filter/Magnitude/Detector pipeline is fused onto a tile, and the FIR filters are fused pairwise.]
End result: auto-parallelized by MIT StreamIt to 8 tiles.
AsTrO Taxonomy: Classifying SON diversity
Assignment (Static/Dynamic): is instruction assignment to ALUs predetermined?
Transport (Static/Dynamic): are operand routes predetermined?
Ordering (Static/Dynamic): is the execution order of instructions assigned to a node predetermined?

Microprocessor SON diversity using the AsTrO taxonomy:

             Assignment   Transport   Ordering
Raw          Static       Static      Static
RawDyn       Static       Dynamic     Dynamic
Scale        Static       Dynamic     Static
TRIPS        Static       Dynamic     Dynamic
ILDP         Dynamic      Dynamic     Static
WaveScalar   Dynamic      Dynamic     Dynamic
Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network
Raw Chips
October 02
Raw: 16 tiles (16 issue), 180 nm ASIC (IBM SA-27E), ~100 million transistors, 1 million gates.
3-4 years of development, 1.5 years of testing, 200K lines of test code.
Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V - competitive with IBM-implemented PowerPCs in the same process.
18W average power.
Raw motherboard: support chipset implemented in FPGA.
A Scalable Microprocessor in Action [Taylor et al, ISCA '04]
Conclusions
Scalability problems in general-purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON). These SONs can be characterized by a 5-tuple and the AsTrO classification.
The 180 nm 16-issue Raw prototype demonstrates that the approach is feasible; 64+-issue is possible in today's VLSI processes.
Multicore machines could benefit by adding an inter-node SON for cheap communication.