A Cost Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich, Israel Cidon, Avinoam Kolodny, Isask'har (Zigi) Walter and Shmuel Wimer. Technion – Israel Institute of Technology. QNoC Research Group.
![Page 1: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/1.jpg)
A Cost Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich, Israel Cidon, Avinoam Kolodny, Isask'har (Zigi) Walter and Shmuel Wimer
Technion – Israel Institute of Technology
QNoC Research Group
![Page 2: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/2.jpg)
Global traffic information is essential to make the right decision!
![Page 3: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/3.jpg)
Adaptive Routing in NoCs – Local vs. Global Information
2D Mesh NoC
Legend: Low Congestion, Medium Congestion, High Congestion
A packet routed from the upper left to the bottom right corner using local congestion information. The same packet routed using global information.
Figure labels: Source, Destination, I CAN MAKE IT!!!
![Page 4: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/4.jpg)
Route Selection - ATDOR
ATDOR - Adaptive Toggle Dimension Ordered Routing
Keep it simple!
Centralized selection: XY or YX.
Routing tables in sources. One bit per destination.
The option with the less congested bottleneck link is preferred.
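The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration: the `(x, y)` coordinate scheme, the function names `dor_path`, `bottleneck` and `prefer`, and the tie-break in favor of XY are assumptions, not details taken from the slides.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]     # (x, y) mesh coordinates (assumed convention)
Link = Tuple[Node, Node]   # directed link between neighboring routers


def dor_path(src: Node, dst: Node, order: str) -> List[Link]:
    """Return the links of the dimension-ordered route; order is "XY" or "YX"."""
    x, y = src
    links: List[Link] = []
    for axis in order.upper():
        if axis == "X":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                links.append(((x, y), (x + step, y)))
                x += step
        elif axis == "Y":
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                links.append(((x, y), (x, y + step)))
                y += step
        else:
            raise ValueError("order must be 'XY' or 'YX'")
    return links


def bottleneck(path: List[Link], load: Dict[Link, float]) -> float:
    """Load of the most congested link on the path (0 for an empty path)."""
    return max((load.get(link, 0.0) for link in path), default=0.0)


def prefer(src: Node, dst: Node, load: Dict[Link, float]) -> str:
    """Pick the option whose bottleneck link is less loaded (ties go to XY, by assumption)."""
    xy = bottleneck(dor_path(src, dst, "XY"), load)
    yx = bottleneck(dor_path(src, dst, "YX"), load)
    return "XY" if xy <= yx else "YX"
```

For example, with a single loaded link `((0, 0), (1, 0))`, a flow from `(0, 0)` to `(2, 1)` would be steered to YX, since its XY route crosses that link and its YX route does not.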
![Page 5: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/5.jpg)
ATDOR Illustration 1
Five identical flows, 100 MB/s each.
Links modeled as M/M/1 queues. Delay of a single link:
$$D_{LINK} = \frac{Traffic}{Capacity - Traffic}$$
Link capacity is 210 MB/s.
Initial routing - XY
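The single-link delay model can be sketched as below, reading the slide's expression as Traffic / (Capacity - Traffic). That reading, the function name `link_delay`, and the choice to return infinity for a saturated link are assumptions made for illustration; the result is in the model's own units.

```python
import math


def link_delay(traffic: float, capacity: float) -> float:
    """M/M/1-style link delay term: traffic / (capacity - traffic).

    Returns infinity when the offered traffic reaches or exceeds capacity
    (assumed handling of the saturated case).
    """
    if traffic >= capacity:
        return math.inf
    return traffic / (capacity - traffic)
```

With the slide's numbers, a 210 MB/s link carrying two 100 MB/s flows is still below capacity, while a link carrying three such flows is saturated, which is consistent with an infinite average delay under a poor initial routing.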
![Page 6: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/6.jpg)
Centralized Routing – How?
Option 1 – Continuous calculation of optimal routing for the active sessions:
Achievable load balancing
Speed and computation complexity
System complexity
![Page 7: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/7.jpg)
Centralized Routing – How?
Option 2 – Iterative serial selection based on traffic load measurements between XY and YX for all source-destination pairs:
Achievable load balancing
Speed and computation complexity
System complexity
![Page 8: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/8.jpg)
ATDOR illustration 1
Average delay: ∞
Step 1: re-routed flow 1->15
Step 2: re-routed flow 2->8
Average delay: 37 ns
Step 3: re-routed flow 2->15
Average delay: 22 ns
![Page 9: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/9.jpg)
What did we just see?
For each flow we:
1. Calculated the better route.
2. Updated the routing table of the source.
3. Waited for the update to take effect and measured global traffic load.
Performing steps 1-3 for each flow is slow and not scalable.
Steps 2 and 3 are unified for all destinations of a single source:
Achievable load balancing
Speed and computation complexity
Scalability
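A minimal sketch of the per-source batching described on this slide: step 1 runs for every destination of a source, then the table is written once and the traffic is measured once per source rather than once per flow. The names `update_source`, `control_iteration`, `choose_route` and `wait_and_measure` are hypothetical, and the routing table is modeled as a plain dictionary.

```python
from typing import Callable, Dict, List, Tuple

Flow = Tuple[int, int]  # (source, destination)


def update_source(source: int,
                  destinations: List[int],
                  table: Dict[Flow, str],
                  choose_route: Callable[[int, int, str], str]) -> List[int]:
    """Decide routes for all destinations of one source, then apply them together."""
    # Step 1 for every destination, using the table as it is now.
    decisions = {d: choose_route(source, d, table[(source, d)]) for d in destinations}
    changed = [d for d, route in decisions.items() if route != table[(source, d)]]
    # Step 2, performed once for the whole source.
    for d in changed:
        table[(source, d)] = decisions[d]
    return changed


def control_iteration(destinations_of: Dict[int, List[int]],
                      table: Dict[Flow, str],
                      choose_route: Callable[[int, int, str], str],
                      wait_and_measure: Callable[[], None]) -> int:
    """Visit the sources one by one; return the number of re-routed flows."""
    rerouted = 0
    for source, destinations in destinations_of.items():
        rerouted += len(update_source(source, destinations, table, choose_route))
        wait_and_measure()  # step 3, once per source instead of once per flow
    return rerouted
```

In this form the number of measurement waits grows with the number of sources, not with the number of source-destination pairs.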
![Page 10: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/10.jpg)
Back to illustration 1
Average delay: ∞
Step 1: re-routed flow 1->15
Average delay: 22 ns
Step 2: re-routed flows 2->8, 2->15
Step 3: re-routed flow 4->15
Average delay: 22 ns
Step 4: re-routed flow 1->15
Average delay: 22 ns
Step 5: re-routed flows 2->8, 2->15
Average delay: ∞
![Page 11: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/11.jpg)
Problem #1
Changing routing may increase congestion and cause fluctuations.
Solution: Change routing only if the alternative is better by the margin α, 0 < α < 1:
If Current Route = XY:
$$NextRoute = \begin{cases} YX & \text{if } \max[Load_{YX}] \le \alpha \cdot \max[Load_{XY}] \\ XY & \text{if } \max[Load_{YX}] > \alpha \cdot \max[Load_{XY}] \end{cases}$$
Else if Current Route = YX:
$$NextRoute = \begin{cases} XY & \text{if } \max[Load_{XY}] \le \alpha \cdot \max[Load_{YX}] \\ YX & \text{if } \max[Load_{XY}] > \alpha \cdot \max[Load_{YX}] \end{cases}$$
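The margin rule above can be written as a small function. This is a sketch: the name `next_route` and its argument names are illustrative, and the comparison is taken as "less than or equal" for switching, complementing the strict ">" that survives on the slide.

```python
def next_route(current: str, max_load_xy: float, max_load_yx: float, alpha: float) -> str:
    """Toggle between XY and YX only if the alternative's bottleneck load
    is at most alpha times the current route's bottleneck load."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if current == "XY":
        return "YX" if max_load_yx <= alpha * max_load_xy else "XY"
    if current == "YX":
        return "XY" if max_load_xy <= alpha * max_load_yx else "YX"
    raise ValueError("current must be 'XY' or 'YX'")
```

Because α is below 1, two alternatives with equal bottleneck loads never trigger a switch, which is what damps the fluctuations: `next_route("YX", 200, 200, 15/16)` stays on YX.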
![Page 12: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/12.jpg)
ATDOR illustration 2
Average delay: ∞
Step 1: re-routed flows 1->14, 1->15, 1->16
Average delay: ∞
Step 2: re-routed flows 1->14, 1->15, 1->16
Step 3: re-routed flows 1->14, 1->15, 1->16
![Page 13: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/13.jpg)
Problem #2
Coupling among flows sharing the same source.
Solution: Re-routing counters
$C_{I,J}$ count routing changes of flows from source I to destination J ($F_{I,J}$). When $C_{I,J}$ reaches a limit $L_{I,J}$, routing of $F_{I,J}$ is locked.
A possible definition of the limits $L_{I,J}$:
$$L_{I,J} = (I + J) \bmod 3$$
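The counters can be sketched as follows, using the limit (I + J) mod 3, which matches the "changes left" values shown in the illustrations (for example 2, 1 and 0 for flows 1->16, 1->15 and 1->14). The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Tuple


def reroute_limit(i: int, j: int) -> int:
    """One possible limit definition: L = (I + J) mod 3."""
    return (i + j) % 3


class RerouteCounters:
    """Counts routing changes per flow and locks a flow once its limit is reached."""

    def __init__(self, limit_fn: Callable[[int, int], int] = reroute_limit) -> None:
        self._limit_fn = limit_fn
        self._count: DefaultDict[Tuple[int, int], int] = defaultdict(int)

    def changes_left(self, i: int, j: int) -> int:
        return self._limit_fn(i, j) - self._count[(i, j)]

    def is_locked(self, i: int, j: int) -> bool:
        return self.changes_left(i, j) <= 0

    def record_change(self, i: int, j: int) -> bool:
        """Register a route change; return False (and do nothing) if the flow is locked."""
        if self.is_locked(i, j):
            return False
        self._count[(i, j)] += 1
        return True
```

Since flows from the same source get different limits, they stop toggling at different times, which breaks the coupling: flow 1->14 is locked from the start, 1->15 may change once, and 1->16 twice.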
![Page 14: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/14.jpg)
Back to illustration 2
$$L_{I,J} = (I + J) \bmod 3$$
Re-routing changes left: 1->16: 2, 1->15: 1, 1->14: 0
Average delay: ∞
Re-routing changes left: 1->16: 1, 1->15: 0, 1->14: 0
Average delay: 73 ns
Re-routing changes left: 1->16: 0, 1->15: 0, 1->14: 0
Average delay: 22 ns
![Page 15: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/15.jpg)
Bring it all together
$$L_{I,J} = (I + J) \bmod 3$$
Re-routing changes left, in the order shown:
1->15: 1, 2->8: 1, 2->15: 2, 4->15: 1
1->15: 0, 2->8: 1, 2->15: 2, 4->15: 1
1->15: 0, 2->8: 0, 2->15: 1, 4->15: 1
1->15: 0, 2->8: 0, 2->15: 1, 4->15: 0
1->15: 0, 2->8: 0, 2->15: 0, 4->15: 0
Average delay values, in the order shown: ∞, 22 ns, 22 ns, 14 ns
![Page 16: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/16.jpg)
Centralized Adaptive Routing for NoCs - Architecture
Local traffic load measurements inside the routers.
Aggregation of traffic load measurements into Traffic Load Maps.
Routing control.
![Page 17: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/17.jpg)
Load Measurements Aggregation
An illustration of aggregation of load values in a 4X4 2D mesh.
A congestion value is written to each traffic load map every clock cycle.
![Page 18: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/18.jpg)
ATDOR – Route Selection Circuit
Combinatorial pipelined implementation.
Result every ATDOR clock cycle.
Maximally loaded links of the two alternatives are compared. Next route, with 0 < α < 1:
If Current Route = XY:
$$NextRoute = \begin{cases} YX & \text{if } \max[Load_{YX}] \le \alpha \cdot \max[Load_{XY}] \\ XY & \text{if } \max[Load_{YX}] > \alpha \cdot \max[Load_{XY}] \end{cases}$$
Else if Current Route = YX:
$$NextRoute = \begin{cases} XY & \text{if } \max[Load_{XY}] \le \alpha \cdot \max[Load_{YX}] \\ YX & \text{if } \max[Load_{XY}] > \alpha \cdot \max[Load_{YX}] \end{cases}$$
![Page 19: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/19.jpg)
Hardware Requirements
The whole mechanism was implemented on an xc5vlx50t Virtex-5 FPGA.
Estimated area for the 45nm technology node.
Per-router hardware overheads in % for a NoC with typical-size (50 KGates) virtual channel routers.
![Page 20: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/20.jpg)
Average Packet Delay – Uniform Traffic
Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. Uniform traffic pattern.
![Page 21: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/21.jpg)
Average Packet Delay – Transpose Traffic
Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. Transpose traffic pattern.
![Page 22: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/22.jpg)
Average Packet Delay – Hotspot Traffic
Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. 4 Hotspots traffic pattern.
![Page 23: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/23.jpg)
Control Iteration Duration
Number of re-routed flows vs. time. 8X8 2D Mesh, ATDOR clock of 100 MHz.
Plot panels: α = 15/16 and α = 3/4
![Page 24: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/24.jpg)
CMP DNUCA - Architecture
8X8 CMP DNUCA (Dynamic Non Uniform Cache Array) with 8 CPUs and 56 cache banks:
![Page 25: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/25.jpg)
CMP DNUCA – Saturation Throughput
Saturation throughput - Splash 2 and Parsec benchmarks on 8X8 CMP DNUCA with 8 CPUs and 56 cache banks:
![Page 26: A Cost Effective Centralized Adaptive Routing for Networks on Chip](https://reader035.vdocuments.mx/reader035/viewer/2022081511/568161da550346895dd1e42e/html5/thumbnails/26.jpg)
Conclusions
Centralized adaptive routing is feasible for NoCs.
ATDOR: Centralized selection between XY and YX for each source-destination pair.
Hardware overhead: <4% of an 8X8 typical NoC.
Average saturation throughput improvement:

| Workload | Vs. RCA | Vs. O1TURN |
|---|---|---|
| Synthetic Patterns | 12.1% | 19.3% |
| Splash 2 and Parsec Benchmarks | 12.8% | 22.8% |