Slide 1
Intel
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$
(ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
Slide 2
Alpha 21364 Network
[Figure: a network of 21364 chips (each including a router), each drawn with Rambus memory (M) and I/O (IO) attached. Chip detail: 21264 core, L2 cache tags, L2 cache data (two blocks), router, and memory controllers MC1 and MC2.]
Slide 3
The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports.]
Distributed Arbitration Algorithm Controls the Crossbar
• 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 output ports: 4 network, 2 memory/cache, 1 I/O
• Router pipeline length = 13/14 cycles
• Virtual cut-through
Slide 4
Problem: Maximize # Matches
Numbers in table cells: destination output port. The slide also marks the older packet at each input port.

Input Port 0    1  2  3
Input Port 1    1  2  3
Input Port 2    1  2  3
Input Port 3    1  2  3
Input Port 4    1  6  3
Input Port 5    0  2  3
Input Port 6    4  2  3
Input Port 7    5  2  3

• Oldest Packet First: one match
• Smarter algorithm (shaded boxes on the slide): 7 matches (perfect)
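The gap between the two bullet results can be checked with a short Python sketch. It is illustrative only and rests on two assumptions that the transcript does not state outright: input port 0 holds packets for outputs 1, 2, and 3 (like ports 1 to 3), and the last entry in each row is the oldest packet. The function names are hypothetical, and the "smarter algorithm" is modeled here as a generic maximum bipartite matching, not as any arbiter from the talk.

```python
# Destination output ports of the packets queued at each input port.
# ASSUMPTIONS: port 0's row is 1, 2, 3 and the last entry is the oldest packet.
QUEUES = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 6, 3],
    [0, 2, 3],
    [4, 2, 3],
    [5, 2, 3],
]
NUM_OUTPUTS = 7


def oldest_first_matches(queues):
    """Each input port requests only the output of its oldest packet."""
    requested = {q[-1] for q in queues if q}
    return len(requested)  # one winner per distinct requested output


def max_matches(queues, num_outputs):
    """Maximum bipartite matching via augmenting paths."""
    owner = [None] * num_outputs  # owner[out] = input port matched to out

    def try_assign(inp, seen):
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its owner if the owner can move.
            if owner[out] is None or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(queues)))


print(oldest_first_matches(QUEUES))       # 1
print(max_matches(QUEUES, NUM_OUTPUTS))   # 7
```

Under these assumptions every input port's oldest packet targets output 3, so oldest-first yields a single match, while a full matching fills all seven output ports.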
Slide 5
Simpler Algorithms Have Fewer Matches
[Figure: # arbitration matches per cycle (y-axis, 0 to 7) vs. % occupied input packet buffers in a 21364 router (x-axis, 0 to 30). Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (SPAA). An arrow labeled "complexity" accompanies the legend.]
Assumes all output ports are free
Slide 6
Complexity may not pay off
[Figure: # arbitration matches per cycle (y-axis, 0 to 7) vs. fraction of output ports occupied (x-axis, 0 to 0.75), at 30% input buffer occupancy. Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (21364). An arrow labeled "complexity" accompanies the legend.]
Slide 7
Key Results
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under very heavy load
Slide 8
Wave Front Arbiter (WFA)
Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch
Implement via “connection” matrix
[Figure: connection matrix with output ports on one axis, input ports 0 to 3 on the other, and numeric labels 1 to 7. The detail shows cell (i,j) with signals Request, N, and W coming in and Grant, S, and E going out.]
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
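The three cell equations can be exercised with a minimal Python sketch. It assumes a fixed top-left starting cell and sweeps the matrix in software; the function name and the row-major sweep are illustrative choices, not the hardware organization described in the talk.

```python
def wavefront_arbitrate(request):
    """request[i][j] is True when input port i wants output port j.

    Returns a grant matrix of the same shape. The N signal (one per column)
    says the output port is still free; the W signal (one per row) says the
    input port is still free.
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols   # N entering the top of each column
    west = [True] * rows    # W entering the left of each row
    grant = [[False] * cols for _ in range(rows)]
    # Row-major order visits every cell after its north and west neighbours,
    # so it produces the same grants as sweeping diagonal wave fronts.
    for i in range(rows):
        for j in range(cols):
            g = request[i][j] and north[j] and west[i]  # Grant = Request & N & W
            grant[i][j] = g
            north[j] = north[j] and not g               # S = N & NOT(Grant)
            west[i] = west[i] and not g                 # E = W & NOT(Grant)
    return grant


requests = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
]
grants = wavefront_arbitrate(requests)
```

In this example input port 0 wins output 0, input port 1 loses its only request, and input port 2 wins output 1. A real arbiter would rotate the starting cell between rounds for fairness, which this sketch omits.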
Slide 9
WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timeline with stages (1), (2), (3) taking 1.5, 1.5, and 1 cycles.]
Slide 10
WFA Limitations
- Higher number of estimated cycles
  – 4 cycles in 0.18 micron
- Harder to pipeline effectively
  – micropipelining waves (2) is difficult because initial cell changes every cycle
  – restarting (1) before (2) completes is complex
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two overlapped arbitrations, each with stages (1), (2), (3) of 1.5, 1.5, and 1 cycles; the offset between them is labeled 3 cycles.]
Slide 11
Parallel Iterative Matching (PIM)
Steps in One Iteration (PIM1)
– Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
– Grant: unmatched output port selects an input port packet randomly
– Accept: unselected input port selects a grant randomly
[Figure: three panels (Nominate, Grant, Accept), each showing input ports 0 and 1 and output ports 0 and 1. Output port 0 is unused in this arbitration round.]
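The three steps of one iteration can be sketched in Python as follows. This is a simplified model under stated assumptions: requests are tracked per input port as a set of wanted output ports rather than per packet, all ports start unmatched, and the function name `pim1` and its parameters are hypothetical.

```python
import random


def pim1(requests, num_outputs, rng=random):
    """One nominate / grant / accept iteration.

    requests[i] is the set of output ports input port i has packets for.
    Returns {input_port: output_port} for the accepted matches.
    """
    # Nominate: every input port nominates to every output port it needs.
    nominations = {out: [] for out in range(num_outputs)}
    for inp, outs in enumerate(requests):
        for out in sorted(outs):
            nominations[out].append(inp)
    # Grant: each output port picks one nominating input port at random.
    grants = {}  # input port -> output ports that granted it
    for out, inputs in nominations.items():
        if inputs:
            grants.setdefault(rng.choice(inputs), []).append(out)
    # Accept: each input port picks one of its grants at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


requests = [{0, 1}, {1}]
matches = pim1(requests, num_outputs=2, rng=random.Random(0))
```

Because an input port can receive several grants but accept only one, an output port that granted a multiply-granted input can end the round unused, which is the situation the slide's figure points out.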
Slide 12
PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timeline with stages (1), (2), (3) taking 1.5, 1.5, and 1 cycles.]
Slide 13
PIM1 Limitations
- Higher number of estimated cycles
  – 4 cycles in 0.18 micron
- Harder to pipeline effectively
  – restarting (1) before (2) completes is complex
  – same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two overlapped arbitrations, each with stages (1), (2), (3) of 1.5, 1.5, and 1 cycles; the offset between them is labeled 3 cycles.]
Slide 14
Simple, Pipelined Arbitration Algorithm (SPAA)
Used in the Alpha 21364 router
Algorithm
– Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
– Grant: each output port selects one nominated packet using a least-recently-selected policy
– Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: panels showing input ports 0 and 1 and output ports 0 and 1; panel labels on the slide: Nominate, Grant, Accept, Reset.]
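A minimal Python sketch of one nominate / grant / reset round follows. It assumes one nomination per input port per round and tracks least-recently-selected state per output port as a dictionary of round numbers; the names `spaa_round`, `last_selected`, and `now` are hypothetical, and the lower-index tie-break is an arbitrary choice for the sketch.

```python
def spaa_round(nominations, last_selected, now):
    """One arbitration round.

    nominations: {input_port: output_port}, one nomination per input port.
    last_selected: {output_port: {input_port: round it was last granted}}.
    Returns (granted, to_renominate).
    """
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)
    granted = {}
    for out, inputs in by_output.items():
        history = last_selected.setdefault(out, {})
        # Grant: pick the nominating input this output selected least recently
        # (never-selected inputs count as round -1; ties go to the lower index).
        winner = min(inputs, key=lambda i: (history.get(i, -1), i))
        history[winner] = now
        granted[winner] = out
    # Reset: losing input ports clear their nomination and try again later.
    to_renominate = sorted(i for i in nominations if i not in granted)
    return granted, to_renominate


state = {}
first = spaa_round({0: 1, 1: 1, 2: 0}, state, now=0)
second = spaa_round({0: 1, 1: 1}, state, now=1)
```

In the first round inputs 0 and 1 collide on output 1 and input 0 wins; in the second round the same collision goes to input 1 because it was selected less recently. No accept step is needed, since each input port has at most one outstanding nomination.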
Slide 15
SPAA’s Simplicity
Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Figure: pipeline timeline with stages (1), (2), (3) taking 1 cycle each.]
Slide 16
SPAA’s Advantages
+ Fewer cycles
  – 3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration
  – because only one packet is nominated to one output port
+ Easier to pipeline
  – restart (1) for free input ports before (2) completes
  – only one packet nominated to one output port
  – small number (16) of in-flight packets
  – avoids any centralized matrix
  – speculative read allows data flits to follow header flits
[Figure: two overlapped arbitrations, each with stages (1), (2), (3) of 1 cycle; the offset between them is labeled 1 cycle.]
Slide 17
Summary: Simpler is Better

                          WFA              PIM1             SPAA (Alpha 21364)
# Matches Per Cycle       High             Medium           Lower
# Cycles (0.18 micron)    4                4                3
Restart Rate              Every 3 cycles   Every 3 cycles   Every cycle
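To see what the latency and restart-rate rows imply together, here is a rough Python sketch. It uses an idealized model that is not from the talk: an arbitration starts at cycle 0 and at every restart interval afterwards, always has a packet to nominate, and counts as done once its full latency has elapsed. The function name is hypothetical.

```python
# (arbitration latency in cycles, restart interval in cycles), from the table.
ARBITERS = {"WFA": (4, 3), "PIM1": (4, 3), "SPAA": (3, 1)}


def arbitrations_completed(latency, restart_interval, window):
    """Count arbitrations started at cycles 0, restart_interval, ... that
    finish within `window` cycles."""
    if window < latency:
        return 0
    return (window - latency) // restart_interval + 1


for name, (latency, interval) in ARBITERS.items():
    print(name, arbitrations_completed(latency, interval, window=12))
# WFA 3
# PIM1 3
# SPAA 10
```

The count says nothing about how many matches each arbitration finds, so it has to be read alongside the "# Matches Per Cycle" row.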
Slide 18
Saturation Behavior
• Reasons: hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load
[Figure: 64-node network, random traffic. Average packet latency in nanoseconds (y-axis, 0 to 300) vs. delivered flits/router/nanosecond (x-axis, 0 to 0.8) for SPAA-base, with the saturation point marked.]
Slide 19
Rotary Rule
21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16
Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism
WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
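One way to picture the SPAA+Rotary change is as a different sort key in an output port's grant step. The Python sketch below is illustrative only: the numbering of network input ports as 0 to 3 is an assumption, the names `NETWORK_INPUTS` and `rotary_grant` are hypothetical, and the anti-starvation mechanism that the slide says the rule relies on is not modeled.

```python
# ASSUMPTION: input ports 0-3 are the network ports; the rest are local.
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_grant(nominating_inputs, last_selected):
    """Pick the winning input port for one output port.

    Network ports beat local ports; within a class, the least-recently
    selected input wins (never-selected inputs count as round -1).
    """
    if not nominating_inputs:
        return None
    return min(
        nominating_inputs,
        key=lambda i: (i not in NETWORK_INPUTS, last_selected.get(i, -1), i),
    )


print(rotary_grant([5, 2], {2: 10}))       # 2
print(rotary_grant([4, 6], {4: 3, 6: 1}))  # 6
```

In the first call the network port wins even though it was selected recently; in the second, with only local ports competing, the least-recently selected one wins.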
Slide 20
Simulation Methodology
Asim modeling infrastructure
– detailed timing model of 21364 network
– selected design points validated against RTL
Traffic Patterns
– 70% three coherence hops, 30% two coherence hops
– random destinations
– other traffic combinations in paper and simulated internally
Slide 21
64 Node Network: Base Case
[Figure: random traffic. Average packet latency in nanoseconds (y-axis, 0 to 300) vs. delivered flits/router/nanosecond (x-axis, 0 to 0.8). Curves: PIM1, WFA-base, SPAA-base; the knee is marked.]
• SPAA outperforms WFA & PIM1: 24% higher throughput at knee
Slide 22
64 Node Network: With Rotary Rule
[Figure: random traffic. Average packet latency in nanoseconds (y-axis, 0 to 300) vs. delivered flits/router/nanosecond (x-axis, 0 to 0.8). Curves: PIM1, WFA-base, WFA-rotary, SPAA-base, SPAA-rotary.]
• Rotary Rule helps both SPAA & WFA
Slide 23
Summary & Conclusions
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under heavy load