A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router

Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$

(ack: Richard Kessler)

Intel*, UPV!, & HP$

Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002

Alpha 21364 Network

[Figure: network of 21364 chips (each including a router); each of the 12 chips shown connects to Rambus memory (M) and I/O (IO). Chip floorplan labels: 21264 core, L2 cache tags, L2 cache data (two blocks), router, MC1, MC2.]

The Alpha 21364 8x7 Router

[Figure: crossbar connecting input ports to output ports]

Distributed Arbitration Algorithm Controls the Crossbar

• 8 Input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 Output ports: 4 network, 2 memory/cache, 1 I/O
• Router Pipeline Length = 13/14 cycles
• Virtual Cut-Through

Problem: Maximize # Matches

Numbers in table cells: destination output port. Figure annotation: older packet at input port.

Input Port 0    1  2  3
Input Port 1    1  2  3
Input Port 2    1  2  3
Input Port 3    1  2  3
Input Port 4    1  6  3
Input Port 5    0  2  3
Input Port 6    4  2  3
Input Port 7    5  2  3

• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
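The gap between the two bullet points can be checked with a minimal Python sketch. It assumes that the last entry in each row is the oldest packet and that Input Port 0 holds packets for output ports 1, 2, and 3; both are readings of the figure, not statements from the slides. The function names are hypothetical, and the augmenting-path matcher stands in for the unspecified "smarter algorithm".

```python
# Destination output port of each buffered packet, one row per input port.
# Assumption: the last entry in each row is the oldest packet.
REQUESTS = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 6, 3],
    [0, 2, 3],
    [4, 2, 3],
    [5, 2, 3],
]


def oldest_first_matches(requests):
    """Each input port offers only its oldest packet; each output takes one."""
    return len({row[-1] for row in requests})


def max_matches(requests):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm)."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        for out in requests[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or move its current owner elsewhere.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(requests)))


print(oldest_first_matches(REQUESTS))  # 1
print(max_matches(REQUESTS))           # 7
```

Under these assumptions every oldest packet targets output port 3, so oldest-first arbitration yields a single match, while a full matching fills all 7 output ports.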

Simpler Algorithms Have Fewer Matches

[Figure: # arbitration matches per cycle (y-axis, 0 to 7) vs. % occupied input packet buffers in a 21364 router (x-axis, 0 to 30). Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (SPAA). An arrow labeled "complexity" accompanies the legend.]

Assumes all output ports are free

Complexity may not pay off

[Figure: # arbitration matches per cycle (y-axis, 0 to 7) vs. fraction of output ports occupied (x-axis, 0 to 0.75), at 30% input buffer occupancy. Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (21364). An arrow labeled "complexity" accompanies the legend.]

Key Results

Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)

SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively

Rotary Rule
+ avoids network saturation under very heavy load

Wave Front Arbiter (WFA)

Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch

Implement via “connection” matrix

[Figure: connection matrix with input ports as rows (input port 0 to input port 3 shown) and output ports as columns, annotated with the numbers 1 to 7. Each cell (i,j) takes Request, N, and W as inputs and produces Grant, S, and E.]

Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
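The three cell equations can be exercised with a small Python sketch. It is an illustration under stated assumptions, not the hardware design: the function name is hypothetical, the matrix is walked in row-major order instead of along diagonal waves, and priority always starts at the top-left cell, whereas the slides note that the initial cell changes every cycle.

```python
def wavefront_arbitrate(request):
    """Apply the WFA cell equations to a request matrix.

    request[i][j] is True when input port i has a packet for output port j.
    Assumption: fixed priority starting at cell (0, 0).
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols  # N input of the current row, one per column
    grant = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        west = True  # W input of the first cell in this row
        for j in range(cols):
            g = bool(request[i][j]) and north[j] and west  # Grant = Request & N & W
            grant[i][j] = g
            north[j] = north[j] and not g  # S = N & NOT(Grant), feeds the cell below
            west = west and not g          # E = W & NOT(Grant), feeds the cell to the right
    return grant


requests = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
]
for row in wavefront_arbitrate(requests):
    print(row)
```

Because the N and W signals are cleared once a cell grants, each row and each column ends up with at most one grant.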

WFA Advantage & Pipeline

+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches

Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)

[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles]

WFA Limitations

- Higher number of estimated cycles
    4 cycles in 0.18 micron
- Harder to pipeline effectively
    micropipelining waves (2) is difficult because initial cell changes every cycle
    restarting (1) before (2) completes is complex
    large in-flight packet table due to large number of nominations (up to 54)
    may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)

[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles; the next arbitration starts 3 cycles after the previous one]

Parallel Iterative Matching (PIM)

Steps in One Iteration (PIM1)
    Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
    Grant: unmatched output port selects an input port packet randomly
    Accept: unselected input port selects a grant randomly

[Figure: input ports 0 and 1 and output ports 0 and 1, shown in three panels: Nominate, Grant, Accept. Output Port 0 unused in this arbitration round.]
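One PIM iteration, as listed in the steps above, might be sketched in Python as follows. The function name `pim1`, the representation of requests as sets of output ports, and the injectable random generator are assumptions made for illustration.

```python
import random


def pim1(requests, num_outputs, rng=random):
    """One iteration of Parallel Iterative Matching.

    requests[i] is the set of output ports for which input port i
    has packets. Returns a dict: input port -> accepted output port.
    """
    # Nominate + Grant: every output port picks one requesting input at random.
    grants = {}  # input port -> output ports that granted to it
    for out in range(num_outputs):
        candidates = [i for i, wanted in enumerate(requests) if out in wanted]
        if candidates:
            grants.setdefault(rng.choice(candidates), []).append(out)
    # Accept: every input port picks one of its grants at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Input 0 wants outputs 0 and 1; input 1 wants only output 1.
print(pim1([{0, 1}, {1}], num_outputs=2, rng=random.Random(0)))
```

When both output ports grant to the same input port, one grant goes unused, which is how an output port can be left idle in an arbitration round.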

PIM1 Advantage & Pipeline

+ High interaction between input and output ports reduces arbitration collisions & improves # of matches

Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)

[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles]

PIM1 Limitations

- Higher number of estimated cycles
    4 cycles in 0.18 micron
- Harder to pipeline effectively
    restarting (1) before (2) completes is complex
    same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
    large in-flight packet table due to large number of nominations (up to 54)
    may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)

[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles; the next arbitration starts 3 cycles after the previous one]

Simple, Pipelined Arbitration Algorithm (SPAA)

used in the Alpha 21364 Router

Algorithm
    Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
    Grant: each output port selects an input port packet based on the least-recently selected one
    Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles

[Figure: input ports 0 and 1 and output ports 0 and 1, shown in panels; panel labels in the original: Nominate, Grant, Accept, Reset.]
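The Nominate, Grant, and Reset steps can be sketched in Python as below. This is a simplified model under stated assumptions: the function name `spaa_cycle`, the dictionary-based bookkeeping of when each output port last selected each input port, and the tie-break by port number are all hypothetical choices, and the three steps run in one call instead of being pipelined.

```python
def spaa_cycle(nominations, last_granted, cycle):
    """One SPAA arbitration round.

    nominations: dict input port -> the single output port it nominates to.
    last_granted: dict (output, input) -> cycle of the most recent grant;
                  updated in place.
    Returns (grants, reset): grants maps output -> winning input, and
    reset lists inputs whose packets must be renominated later.
    """
    # Nominate: group the single nomination of each input by output port.
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    # Grant: each output independently picks its least-recently selected input.
    grants = {}
    for out, inputs in by_output.items():
        winner = min(inputs, key=lambda i: (last_granted.get((out, i), -1), i))
        grants[out] = winner
        last_granted[(out, winner)] = cycle

    # Reset: unselected inputs keep their packets and renominate next time.
    winners = set(grants.values())
    reset = sorted(i for i in nominations if i not in winners)
    return grants, reset


state = {}
print(spaa_cycle({0: 3, 1: 3, 2: 5}, state, cycle=0))  # ({3: 0, 5: 2}, [1])
print(spaa_cycle({0: 3, 1: 3}, state, cycle=1))        # ({3: 1}, [0])
```

Each output port decides using only its own nominations, so no state is shared between output ports.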

SPAA’s Simplicity

Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity

Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)

[Pipeline diagram: stages (1), (2), (3) take 1 cycle each]

SPAA’s Advantages

+ Fewer cycles
    3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration
    because only one packet is nominated to one output port
+ Easier to pipeline
    restart (1) for free input ports before (2) completes
    only one packet nominated to one output port
    small number (16) of in-flight packets
    avoids any centralized matrix
    speculative read allows data flits to follow header flits

[Pipeline diagram: stages (1), (2), (3) take 1 cycle each; the next arbitration starts 1 cycle after the previous one]

Summary: Simpler is Better

                           WFA              PIM1             SPAA (Alpha 21364)
# Matches Per Cycle        High             Medium           Lower
# cycles (0.18 microns)    4                4                3
Restart Rate               Every 3 cycles   Every 3 cycles   Every cycle

Saturation Behavior

• Reasons: Hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load

[Figure: 64 Node Network, Random Traffic. Average packet latency (nanoseconds, 0 to 300) vs. delivered flits/router/nanosecond (0 to 0.8). Curve: SPAA-base, with the saturation point marked.]

Rotary Rule

21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16

Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism

WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
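For SPAA, changing the output port priority could look like the following Python sketch. The port numbering (input ports 0 to 3 as the four network ports) and the function name are assumptions for illustration, and the anti-starvation mechanism that the rule relies on is not modeled here.

```python
# Assumed numbering: the 21364 router has 4 network input ports;
# here they are taken to be ports 0-3, with local ports numbered 4-7.
NETWORK_PORTS = frozenset({0, 1, 2, 3})


def rotary_grant(candidates, last_granted):
    """Pick the winning input port for one output port under the Rotary Rule.

    candidates: input ports nominating a packet to this output port.
    last_granted: dict input port -> cycle when this output last chose it.
    Network ports always beat local ports; within a class the
    least-recently selected port wins.
    """
    return min(
        candidates,
        key=lambda p: (p not in NETWORK_PORTS, last_granted.get(p, -1), p),
    )


print(rotary_grant([5, 2], {2: 10}))       # 2: network port beats local port
print(rotary_grant([4, 6], {4: 3, 6: 1}))  # 6: least recently selected local port
```

Because traffic already in the network wins over newly injected local traffic, local injection is throttled whenever network ports are busy.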

Simulation Methodology

Asim modeling infrastructure
    detailed timing model of 21364 network
    selected design points validated against RTL

Traffic Patterns
    70% three coherence hops, 30% two coherence hops
    random destinations
    other traffic combinations in paper and simulated internally
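A synthetic traffic generator for the 70%/30% mix might be sketched as follows. This is not the Asim model: the function name is hypothetical, and the sketch assumes that a transaction starts at a random requester, visits random intermediate nodes, and that its final hop returns to the requester.

```python
import random


def make_transaction(num_nodes, rng):
    """Return the node sequence of one synthetic coherence transaction.

    Assumption: 70% of transactions take three hops and 30% take two,
    with the last hop returning to the requester. Needs num_nodes >= 3.
    """
    hops = 3 if rng.random() < 0.7 else 2
    requester = rng.randrange(num_nodes)
    path = [requester]
    for _ in range(hops - 1):
        # Random destination, distinct from the previous node and the requester.
        choices = [n for n in range(num_nodes)
                   if n != path[-1] and n != requester]
        path.append(rng.choice(choices))
    path.append(requester)  # final hop goes back to the requester
    return path


rng = random.Random(1)
for _ in range(3):
    print(make_transaction(64, rng))
```

Each returned list has one more node than hops, so a three-hop transaction has four entries.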

64 Node Network: Base Case

[Figure: Random Traffic. Average packet latency (nanoseconds, 0 to 300) on the y-axis; x-axis from 0 to 0.8. Curves: PIM1, WFA-base, SPAA-base, with the knee marked.]

• SPAA outperforms WFA & PIM1: 24% higher throughput at knee

64 Node Network: With Rotary Rule

[Figure: Random Traffic. Average packet latency (nanoseconds, 0 to 300) on the y-axis; x-axis from 0 to 0.8. Curves: PIM1, WFA-base, WFA-rotary, SPAA-base, SPAA-rotary.]

• Rotary Rule helps both SPAA & WFA

Summary & Conclusions

Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)

SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively

Rotary Rule
+ avoids network saturation under heavy load
