Predictive High-Performance Architecture Research Mavens (PHARM), Department of ECE
The NoX Router
Mitchell Hayenga
Mikko Lipasti
The NoX Router, Micro’11
Overview
• New low-latency router technique
  – Don’t arbitrate or speculate! Encode.
• XOR Property: (A^B) ^ B = A
  – Hides arbitration latency
  – Eliminates dead cycles
• The NoX Router
  – Single-cycle/wormhole/mesh implementation
  – Frequency competitive with pure speculative
  – 2.7%-34.4% better ED2 on application traces
  – Up to 9.9% better throughput on synthetic traffic
[Figure: router block diagram showing Control, Input Channel, and Switch Fabric]
Motivation
• Modern On-Chip Networks
  – Bandwidth Plentiful, Latency Critical
  – Control
    • Complex, Speculative, Critical Path
  – Datapath
    • Fast, Simple, Wire-Dominated
• NoX Tradeoff
  – Marginal increase in datapath complexity
  – Hide control latency
[Figure: Virtual Channel Router Pipeline Evolution. Successive pipeline diagrams built from the stages BW, RC/NRC, VA, SA, ST, and LT; the Intel Teraflops Router is shown as an example.]
Switch Arbitration Techniques
• Non-Speculative
  – Arbitration occurs before switch traversal
• Speculative Switch Traversal [Mullins ISCA 2004]
  – Assume contention doesn’t happen
  – Wasted cycle in the event of contention
    • Arbiter decides what gets sent on the next cycle
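[Figure: timing diagram for speculative switch traversal through the switch fabric and control. Signals: clk, port 0, port 1, grant, valid out, data out, over cycles 0-4. Cases: No Contention; Contention (A Wins / B Wins), where the colliding cycle produces no valid output.]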
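The wasted cycle described above can be sketched with a small model of one output port. This is only an illustration under stated assumptions: `speculative_output` and `pick_winner` are hypothetical names, the default arbiter simply grants the lowest port index, and an invalid output cycle is represented by `None`.

```python
def speculative_output(requests, pick_winner=min):
    """Per-cycle output of one output port under speculative switch traversal.

    requests: dict mapping input port -> flit, all arriving in the same cycle.
    pick_winner: hypothetical arbiter policy (lowest port index by default).
    """
    pending = dict(requests)
    trace = []
    if len(pending) > 1:
        # Every input drove the switch at once, so this cycle's output is invalid.
        trace.append(None)
    while pending:
        # The arbiter's grant only takes effect from the next cycle on.
        port = pick_winner(pending)
        trace.append(pending.pop(port))
    return trace


print(speculative_output({0: "A"}))          # single flit, no contention
print(speculative_output({0: "A", 1: "B"}))  # two flits collide
```

In this model, two colliding flits occupy the link for three cycles, one of which carries nothing useful.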
Switch Arbitration Techniques
• Non-Speculative
  – Arbitration occurs before switch traversal
• Speculative Switch Traversal [Mullins ISCA 2004]
  – Assume contention doesn’t happen
  – Wasted cycle in the event of contention
    • Arbiter decides what gets sent on the next cycle
• Encoding
  – Blindly transmit, XOR within switch fabric
  – No contention - data sent unmodified
  – Contention - data sent XOR’d
    • Arbiter decides what was sent
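[Figure: timing diagram for encoded switch traversal through the switch fabric and control. Signals: clk, port 0, port 1, grant, valid out, data out, over cycles 0-4. Cases: No Contention (A sent unmodified); Contention, B Wins (A^B sent, followed by A).]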
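The encoding idea above can be sketched for a single output port as follows. This is a simplified model, not the authors' implementation: flits are plain integers, `nox_output` and `pick_winner` are hypothetical names, the default arbiter grants the lowest port index, and the `coded` flag stands in for whatever sideband signal marks an XOR'd value.

```python
from functools import reduce
from operator import xor


def nox_output(requests, pick_winner=min):
    """Per-cycle (data, coded) pairs driven on one output port.

    requests: dict mapping input port -> integer flit, all arriving together.
    pick_winner: hypothetical arbiter policy (lowest port index by default).
    """
    pending = dict(requests)
    trace = []
    while pending:
        # Every pending input drives the switch; the fabric XORs them together.
        data = reduce(xor, pending.values())
        coded = len(pending) > 1
        trace.append((data, coded))
        # The arbiter decides after the fact which flit this cycle "was";
        # that input stops driving, the others transmit again next cycle.
        del pending[pick_winner(pending)]
    return trace


A, B = 0b1010, 0b0110
print(nox_output({0: A}))        # no contention: A goes out unmodified
print(nox_output({0: A, 1: B}))  # contention: A^B first, then B alone
```

Unlike the speculative case, every cycle in this model carries information: two colliding flits take two cycles.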
Receive Logic
• Works upon simple XOR property.
  – (A^B^C) ^ (B^C) = A
• Simple Decode
  – Always able to decode by XORing two sequential values
  – Maintains previous router’s arbitration order/fairness
[Figure: receive logic with a flit buffer and a "Coded" bit per entry; the arriving sequence A^B^C, B^C, C is decoded into A, B, C.]
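The decode step above can be sketched as a loop that XORs each arriving value with the coded value held from the previous cycle. This is a minimal sketch under assumptions: flits are integers, `nox_decode` is a hypothetical name, and each arrival carries a `coded` flag (modeled on the "Coded" label in the figure) saying whether it is an XOR of several flits.

```python
def nox_decode(received):
    """Recover the original flits from (data, coded) pairs in arrival order."""
    decoded = []
    held = None  # previous coded value still waiting in the flit buffer
    for data, coded in received:
        if held is not None:
            # XOR of two sequential values, e.g. (A^B^C) ^ (B^C) = A.
            decoded.append(held ^ data)
        if coded:
            held = data           # needs the next arrival before it can decode
        else:
            decoded.append(data)  # an uncoded flit is delivered as-is
            held = None
    return decoded


A, B, C = 0x11, 0x22, 0x44
print(nox_decode([(A ^ B ^ C, True), (B ^ C, True), (C, False)]) == [A, B, C])
```

The flits come out in the order the upstream arbiter granted them, which is the sense in which the previous router's arbitration order is preserved.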
Tradeoffs and Scaling
• Arbitration
  – O(log n) delay for most arbiters
• Decode logic
  – Constant with respect to # of ports
• Switch Fabric
  – XOR delay scales slightly worse than a mux/tristate-based solution
  – Maybe not an issue (control latency)
[Figure: router block diagram showing Control, Input Channel, and Switch Fabric]
The NoX Router
• Network of XORs
• Implementation Details
  – 8x8 Mesh, 2mm long 64-bit links
  – Single Cycle (Router+Link)
  – Wormhole
  – Dimension ordered routing
  – Minimally buffered
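As an illustration of dimension-ordered routing on the 8x8 mesh listed above, here is a minimal sketch. The slide does not say which dimension is resolved first; X-then-Y is assumed here, and `xy_route` and `MESH_SIZE` are hypothetical names.

```python
MESH_SIZE = 8  # 8x8 mesh, as listed on the slide


def xy_route(src, dst):
    """Return the (x, y) nodes visited from src to dst, resolving X first, then Y."""
    if not all(0 <= c < MESH_SIZE for c in (*src, *dst)):
        raise ValueError("node coordinates must lie inside the mesh")
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:  # travel along the X dimension until the column matches
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:  # then travel along the Y dimension
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

With a single-cycle router plus link, the hop count `len(path) - 1` is also the zero-load latency in cycles under this model.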
Baseline Designs
• Non-Speculative
  – Serial arbitration & switch logic
  – Long cycle time
  – Efficient link utilization
• Speculative Techniques [Mullins ISCA 2004]
  – Hides arbitration latency
  – Potential for wasted link bandwidth
  – Spec-Fast & Spec-Accurate [Mullins ASP-DAC 2006]
Frequency Analysis
• Overheads present in all designs
  – 248ps SRAM delay
  – 98ps link latency

Architecture      Clock Period   %
Non-Speculative   0.92 ns        -
Spec-Fast         0.69 ns        33.3%
Spec-Accurate     0.72 ns        27.7%
NoX               0.76 ns        21.1%
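The % column is not labeled on the slide. One reading that fits the numbers is the clock-frequency gain over the non-speculative design, i.e. (baseline period / period - 1) x 100. The short check below works under that assumption; the variable names are illustrative, and the computed values agree with the slide to within about 0.1 percentage point.

```python
# Clock periods in nanoseconds, copied from the table above.
periods_ns = {
    "Non-Speculative": 0.92,
    "Spec-Fast": 0.69,
    "Spec-Accurate": 0.72,
    "NoX": 0.76,
}

baseline = periods_ns["Non-Speculative"]

# Assumed meaning of the % column: frequency gain relative to the baseline.
freq_gain_pct = {
    name: (baseline / period - 1.0) * 100.0
    for name, period in periods_ns.items()
}

for name, gain in freq_gain_pct.items():
    freq_ghz = 1.0 / periods_ns[name]
    print(f"{name:16s} {freq_ghz:.2f} GHz  {gain:+.1f}%")
```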
Synthetic Traffic - Latency
[Figure: two latency plots; x-axis: bandwidth (MB/s/node)]
Synthetic Traffic – ED2
[Figure: two ED2 plots; x-axis: bandwidth (MB/s/node)]
Application Traffic - Latency
Application Traffic – ED2
Power @ Fixed Bandwidth
• Traffic Pattern
  – Uniform Random
  – 2GB/s/node injection rate
• Spec-Fast saturated
• Switch/Link glitching in speculative
• Marginal additional decode power
[Figure: power breakdown chart; decode power annotated as negligible]
Area Floorplanning
[Figure: side-by-side floorplans. Standard Router: Ports 0-4 (64x4 SRAM each) and a Crossbar; dimension labels 140 µm, 70 µm, 101.0 µm, 161.2 µm. NoX Router: Ports 0-4 (64x4 SRAM each), an XOR Switch, and Decoding and Masking logic; dimension labels 140 µm, 70 µm, 102.2 µm, 161.2 µm, 28 µm.]
• ~17% More Area
Going Further
• Input Speedup
  – What if we could drive two values from an input buffer in a single cycle
  – Final decode step has 2 values available
    • Last packet sees no additional delay from contention at the previous router
• Multi-hop encoded forwarding
  – Don’t decode @ every hop, decode when packets diverge
  – Allow new collisions with the “head” flit
  – Requires additional sideband info
[Figure: switch fabric and flit buffer; labels A^B, B, A, B]
Conclusion
• New encoding-based low-latency router technique
  – Hides arbitration latency
  – Comparable frequency to speculative switch traversal techniques
  – Eliminates wasted interconnect bandwidth
  – Promising application to multiple router architectures
Thanks – Questions?
Virtual Channels
• Future Work
• Physical Channels vs. Virtual Channels
  – VC Router Benefits
    • Dynamic bandwidth sharing (performance)
  – VC Router Negatives
    • Increased arbitration delay (performance)
    • Increased buffer energy (power)
    • Large unified crossbar (area, power)
• Possible but tradeoffs need to be re-evaluated
  – Structuring of input buffers/decode logic
  – VC credit accounting
Multi-Flit Support
• Current support is conservative
  – Performs similarly to speculative routers if multi-flit packets collide
  – Not all bad though
    • ~70% of packets are single-flit coherence packets
    • Only head-flit collisions matter
    • Requests all single-flit
• Alternatives
  – Fragment multi-flit packets
  – Provide sufficient buffering space