![Page 1: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/1.jpg)
Profile-Driven Energy Reduction in Network-on-Chips
8383 – 2nd Presentation
Ranya Alawadhi
![Page 2: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/2.jpg)
CSE 8383
Source
Li, F., Chen, G., Kandemir, M., and Kolcu, I. 2007. Profile-driven energy reduction in network-on-chips. SIGPLAN Not. 42, 6 (Jun. 2007), 394-404.
![Page 3: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/3.jpg)
Agenda
Motivation
Contribution
Introduction
The Technique
Results
Conclusion
![Page 4: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/4.jpg)
Motivation
Increasing on-chip power consumption demands power-aware designs
Recent research shows that voltage/frequency scaling on communication links and shutting down idle links can significantly reduce NoC power consumption
These techniques work best when communication links have long idle periods
![Page 5: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/5.jpg)
Contribution
A profile-driven compiler optimization that lengthens the idle periods of communication links in a two-dimensional, on-chip mesh network by maximizing communication link reuse
![Page 6: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/6.jpg)
Introduction
Targeted applications: array/loop-intensive embedded programs
Targeted NoC: two-dimensional mesh used by a single application at a time
Benchmarks: 12 data-intensive embedded applications
Results: leakage energy is reduced by more than 35% on average compared to a pure hardware-based link power management scheme
![Page 7: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/7.jpg)
Architectural model: Network Abstraction
[Figure: network abstraction. A 2D mesh of switches (S), each attached through a network interface (NI) to a node with a CPU and memory. Switch detail: a crossbar with input and output buffers, North, South, East, and West ports, and an interface to the local processing unit.]
![Page 8: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/8.jpg)
Architectural model: Hardware Support for Compiler-Directed Message Routing
The compiler attaches routing information to each message-send operation in the code
The switch design is extended to handle two routing schemes:
Default X-Y routing
Compiler-directed routing
![Page 9: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/9.jpg)
Cont.
Packet header:
Flag: indicates which routing mechanism to use
flag = 0: X-Y routing
Header layout: Flag | Destination | ...
![Page 10: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/10.jpg)
Cont.
flag (1 bit): 1 selects compiler-directed routing
counter (4 bits): number of hops along the path
orientation (2 bits): used together with the routing command sequence
routing command sequence (13 bits): tells each switch which output port to forward the packet to
Header layout: Flag | Counter | Orientation | Routing command sequence | ...

| Orientation | Routing Command | Routing Decision |
|---|---|---|
| 00 | 0 | N |
| 00 | 1 | E |
| 01 | 0 | N |
| 01 | 1 | W |
| 10 | 0 | S |
| 10 | 1 | E |
| 11 | 0 | S |
| 11 | 1 | W |
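
The routing decision table can be read as a lookup keyed by orientation and command bit. The following is a minimal sketch, assuming the command sequence is consumed left to right, one bit per hop, and that unused trailing bits are padding; `decode_header` and `DECISION` are illustrative names, not taken from the paper.

```python
# Routing decision table from the slide: (orientation, command bit) -> output port.
DECISION = {
    ("00", "0"): "N", ("00", "1"): "E",
    ("01", "0"): "N", ("01", "1"): "W",
    ("10", "0"): "S", ("10", "1"): "E",
    ("11", "0"): "S", ("11", "1"): "W",
}


def decode_header(bits: str) -> list[str]:
    """Decode a 20-bit compiler-directed header into a list of output ports.

    Assumed layout: flag (1) | counter (4) | orientation (2) | commands (13).
    """
    bits = bits.replace(" ", "")
    if len(bits) != 20 or bits[0] != "1":
        raise ValueError("expected a 20-bit header with flag = 1")
    hops = int(bits[1:5], 2)      # 4-bit hop counter
    orientation = bits[5:7]       # 2-bit orientation
    commands = bits[7:20]         # 13-bit routing command sequence
    # Assumption: one command bit per hop, read left to right.
    return [DECISION[(orientation, c)] for c in commands[:hops]]


# Hypothetical header: 3 hops, orientation 00, commands 0, 1, 1.
route = decode_header("1 0011 00 0110000000000")
print(route)
```

With orientation 00, a 0 bit selects North and a 1 bit selects East, so the hypothetical header above describes one hop north followed by two hops east.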
![Page 11: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/11.jpg)
Optimizing Link Reuse
[Figure: tool flow. Stages: Profiler, Link Reuse Optimizer, Code Rewriter. Input: Parallel Code. Output: Optimized Parallel Code. Intermediate artifacts: Communication Graph and Link Signature, then Modified Communication Graph and Optimized Link Signature.]
![Page 12: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/12.jpg)
Network State and Link Signature
A parallel program consists of $n$ parallel threads $P_1, P_2, \ldots, P_n$
$P_p$ is scheduled to run on the $p$th mesh node
Communication command (CC): a send operation
$C_p = \{M_{1,p}, M_{2,p}, \ldots, M_{k,p}, \ldots, M_{q,p}\}$
$C_p$: set of CCs in the program code of $P_p$
$M_{k,p}$: $k$th CC in the code of $P_p$
$q$: total number of CCs in the program code of $P_p$
Network state: set of messages under transmission
$S_i = \{M_{k,p} \mid \text{a message sent by } M_{k,p} \text{ is in transmission over the mesh}\}$
$S_0$: the state in which no message is in transmission
![Page 13: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/13.jpg)
Cont.
Link utilization vector (LUV): $\vec{u}_{k,p}$ is a vector whose $j$th element gives the number of packets sent by $M_{k,p}$ and transferred through the $j$th communication link of the mesh
Link signature (LS): $\vec{s}_i = \sum_{M_{k,p} \in S_i} \vec{u}_{k,p}$ represents the link utilization at network state $S_i$
$\Theta(\vec{w})$: a function that returns the set of links used by an LS or LUV $\vec{w}$
![Page 14: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/14.jpg)
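Example
[Figure: 2x2 mesh with nodes P0, P1, P2, P3, the directed links listed in the table below, and messages m1,0, m1,1, m1,2]

| Vector | l0,1 | l1,0 | l2,3 | l3,2 | l0,2 | l2,0 | l1,3 | l3,1 |
|---|---|---|---|---|---|---|---|---|
| $\vec{u}_{1,0}$ | 20 | 0 | 0 | 0 | 0 | 0 | 20 | 0 |
| $\vec{u}_{1,1}$ | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0 |
| $\vec{u}_{1,2}$ | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 |
| $\vec{s}_1$ | 20 | 0 | 20 | 0 | 0 | 0 | 40 | 0 |

$S_1 = \{M_{1,0}, M_{1,1}, M_{1,2}\}$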
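
The example can be reproduced in a few lines. This is a sketch using the vectors from the slide; the function names `link_signature` and `theta` and the string labels for links and CCs are chosen here for illustration.

```python
# Link order as listed on the slide.
LINKS = ["l0,1", "l1,0", "l2,3", "l3,2", "l0,2", "l2,0", "l1,3", "l3,1"]

# Link utilization vectors (LUVs) of the three communication commands.
luv = {
    "M1,0": [20, 0, 0, 0, 0, 0, 20, 0],
    "M1,1": [0, 0, 0, 0, 0, 0, 20, 0],
    "M1,2": [0, 0, 20, 0, 0, 0, 0, 0],
}


def link_signature(state, luv):
    """Element-wise sum of the LUVs of all CCs in a network state."""
    return [sum(luv[cc][j] for cc in state) for j in range(len(LINKS))]


def theta(vector):
    """Set of links with non-zero utilization in an LS or LUV."""
    return {LINKS[j] for j, packets in enumerate(vector) if packets > 0}


s1 = link_signature({"M1,0", "M1,1", "M1,2"}, luv)
print(s1)          # [20, 0, 20, 0, 0, 0, 40, 0]
print(theta(s1))   # the three links l0,1, l2,3 and l1,3 (set order may vary)
```

An empty state yields the all-zero signature, which matches the definition of $S_0$.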
![Page 15: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/15.jpg)
Communication Graph
The network transitions from a state $S_i$ to another state $S_j$ in two situations:
1. A new message is sent by $M_{k,p}$: $S_j = S_i \cup \{M_{k,p}\}$
2. A message sent by $M_{k,p}$ arrives at its destination node: $S_j = S_i - \{M_{k,p}\}$
![Page 16: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/16.jpg)
Cont.
Communication graph (CG):
Captures the communication behavior of a program
Undirected graph
Vertex: network state
Edge $(S_i, S_j)$: transition between $S_i$ and $S_j$
Weight $W_{i,j}$: number of transitions taking place between $S_i$ and $S_j$
Built through profiling
[Figure: example communication graph over states S1 to S5 with edge weights 500, 400, 300, 300, 200, and 100]
![Page 17: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/17.jpg)
Profiling
The profiler keeps track of the current network state $S_i$
The program notifies the profiler each time a node sends a message or a message arrives at its destination
When a notification is received, the profiler computes the new state $S_j$ and increments $W_{i,j}$
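
The profiling loop can be sketched as a small state machine. This is an illustrative sketch, not the authors' profiler: the class name `Profiler`, the callbacks `on_send` and `on_arrive`, and the use of string labels for CCs are assumptions.

```python
from collections import Counter


class Profiler:
    """Tracks the current network state and counts state transitions."""

    def __init__(self):
        self.state = frozenset()   # S0: no message in transmission
        self.weights = Counter()   # undirected edge {Si, Sj} -> W(i, j)

    def _move_to(self, new_state):
        # The graph is undirected, so the edge is keyed by the unordered pair.
        edge = frozenset((self.state, new_state))
        self.weights[edge] += 1
        self.state = new_state

    def on_send(self, cc):
        """A new message is sent by CC `cc`: Sj = Si U {cc}."""
        self._move_to(self.state | {cc})

    def on_arrive(self, cc):
        """A message sent by `cc` reached its destination: Sj = Si - {cc}."""
        self._move_to(self.state - {cc})


profiler = Profiler()
for _ in range(3):
    profiler.on_send("M1,0")
    profiler.on_arrive("M1,0")

s0, s1 = frozenset(), frozenset({"M1,0"})
print(profiler.weights[frozenset((s0, s1))])   # 6
```

Three send/arrive round trips cross the same undirected edge six times, so the weight between $S_0$ and $\{M_{1,0}\}$ is 6.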
![Page 18: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/18.jpg)
Restate the problem
When going from one state to another at runtime, the goal is to reuse the same set of links as much as possible
Each vertex in a CG has a default link signature (obtained using the default X-Y routing)
The compiler’s task is to re-assign link signatures to vertices
![Page 19: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/19.jpg)
Traversing a Communication Graph
Traversing network states to assign them new link signatures:
1. Start with the edge with the largest weight
2. Perform the signature re-assignment on the associated vertices
3. Select the next edge:
Scheme I: the edge with the largest weight among the edges incident on the already selected vertices
Scheme II: the edge with the largest weight among all the remaining edges
4. Perform the signature re-assignment
5. Repeat steps 3 and 4 until all vertices are processed
![Page 20: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/20.jpg)
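Example
[Figure: communication graph over states S1 to S5 with edge weights 500, 400, 300, 300, 200, and 100, shown at each traversal step]

| Step | Scheme I: vertex pair considered | Scheme II: vertex pair considered |
|---|---|---|
| 1 | S1, S2 | S1, S2 |
| 2 | S1, S3 | S3, S4 |
| 3 | S3, S4 | S4, S5 |
| 4 | S4, S5 | |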
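
Both traversal schemes can be sketched with one function. The edge-to-weight assignment below is an assumption: the extracted slide gives the six weights but not which edge carries which, so the graph was chosen to be consistent with the step table. Skipping edges whose endpoints are both already processed is also an assumption of this sketch.

```python
def traverse(edges, scheme):
    """Return the order in which vertex pairs are considered.

    edges: dict mapping (u, v) to a transition count.
    scheme: 1 = heaviest edge incident on processed vertices,
            2 = heaviest edge among all remaining edges.
    """
    vertices = {v for edge in edges for v in edge}
    done, order = set(), []
    remaining = dict(edges)
    while done != vertices and remaining:
        # Assumption: an edge between two processed vertices adds nothing.
        candidates = {e: w for e, w in remaining.items()
                      if not (e[0] in done and e[1] in done)}
        if scheme == 1 and done:
            incident = {e: w for e, w in candidates.items()
                        if e[0] in done or e[1] in done}
            candidates = incident or candidates  # fall back if disconnected
        if not candidates:
            break
        edge = max(candidates, key=candidates.get)
        order.append(edge)
        done.update(edge)
        del remaining[edge]
    return order


# Hypothetical placement of the slide's six weights on edges.
cg = {
    ("S1", "S2"): 500, ("S3", "S4"): 400, ("S1", "S3"): 300,
    ("S4", "S5"): 300, ("S2", "S3"): 200, ("S3", "S5"): 100,
}
print(traverse(cg, scheme=1))
print(traverse(cg, scheme=2))
```

Under these assumed weights, Scheme I visits four vertex pairs and Scheme II three, in the same order as the table.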
![Page 21: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/21.jpg)
Routing Flexibility
Only shortest paths are considered when re-routing messages
Number of unique shortest paths from source $(x_s, y_s)$ to destination $(x_d, y_d)$: $C_{m+n}^{m} = \binom{m+n}{m}$, where $m = |x_d - x_s|$ and $n = |y_d - y_s|$
Alternate link utilization vectors (ALUV): the set $A_{i,p}$ of LUVs of all alternate (shortest) paths available to a message sent by $M_{i,p}$
Re-routing: replacing the current LUV of $M_{i,p}$ with a new LUV selected from $A_{i,p}$
Routing flexibility $= |A_{i,p}|$ (the number of alternate link utilization vectors in the ALUV set)
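
The path count can be checked by enumerating the shortest paths directly. This sketch assumes x grows eastward and y grows northward; the function name `shortest_paths` and the string encoding of paths are illustrative.

```python
from math import comb


def shortest_paths(src, dst):
    """Enumerate all shortest mesh paths from src to dst as strings of hops."""
    (xs, ys), (xd, yd) = src, dst
    m, n = abs(xd - xs), abs(yd - ys)
    step_x = "E" if xd > xs else "W"   # assumed: x grows eastward
    step_y = "N" if yd > ys else "S"   # assumed: y grows northward

    def walk(mx, ny):
        # mx, ny: horizontal and vertical hops still to take.
        if mx == 0 and ny == 0:
            return [""]
        paths = []
        if mx:
            paths += [step_x + rest for rest in walk(mx - 1, ny)]
        if ny:
            paths += [step_y + rest for rest in walk(mx, ny - 1)]
        return paths

    return walk(m, n)


paths = shortest_paths((0, 0), (2, 1))
print(paths)                          # ['EEN', 'ENE', 'NEE']
print(len(paths) == comb(2 + 1, 2))   # True
```

Each path maps to one alternate LUV, so the length of this list is the routing flexibility of the corresponding CC.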
![Page 22: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/22.jpg)
Problem formulation
The routing associated with a CC can be changed only once
Selecting new utilization vectors should not degrade performance relative to the default routing scheme
Alternate re-routings can increase network contention
A performance constraint is therefore imposed on re-routing: do not increase the value of the largest entry in any original link signature
For example:
Default LS: (10, 40, 10, 10, 0, 0, 0, 0)
Undesirable alternative: (10, 50, 0, 10, 0, 0, 0, 0)
Accepted alternative: (40, 20, 10, 0, 0, 0, 0, 0)
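
The constraint reduces to comparing the largest entries of two link signatures. A minimal sketch using the slide's three vectors (the function name `meets_constraint` is chosen here):

```python
def meets_constraint(original_ls, candidate_ls):
    """True if the candidate does not raise the largest entry of the original LS."""
    return max(candidate_ls) <= max(original_ls)


default = [10, 40, 10, 10, 0, 0, 0, 0]
undesirable = [10, 50, 0, 10, 0, 0, 0, 0]
accepted = [40, 20, 10, 0, 0, 0, 0, 0]

print(meets_constraint(default, undesirable))   # False: 50 > 40
print(meets_constraint(default, accepted))      # True: 40 <= 40
```

The accepted alternative moves its peak to a different link but keeps the peak at 40, while also using one link fewer than the default.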
![Page 23: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/23.jpg)
Heuristic
For a vertex pair (network states $S_a$, $S_b$), considering each $M_{i,p}$ in them that has not yet been assigned a new routing:
Calculate the LUV and ALUV of each such $M_{i,p}$
Calculate the LS of $S_a$ and $S_b$
Compute num_links (the total number of links used in $S_a$ and $S_b$)
Sort the CCs in $S_a$ and $S_b$ in ascending order of routing flexibility
Start with the CC that has the lowest routing flexibility and assign a proper route to it
Assign appropriate routes to the remaining CCs, one by one, until all commands in $S_a$ and $S_b$ are processed
Goal: reduce the num_links value as much as possible under the performance constraint
![Page 24: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/24.jpg)
Cont.
Choosing a route for $M_{i,p}$ (a new LUV is selected for $M_{i,p}$ by considering all the re-routing options captured in $A_{i,p}$):
For each alternate re-routing, check whether the performance constraint is satisfied with respect to state $S_a$
If the performance constraint is met, compute the new link signature for the state that $M_{i,p}$ belongs to
Recalculate num_links
Select the re-routing that leads to the minimum num_links value
Once a CC is given a new LUV, it is not considered again when processing the other vertex pairs
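
The per-pair heuristic can be sketched as a greedy loop. This is a simplified illustration, not the authors' implementation: the function `optimize_pair`, its data layout, and the choice to take the constraint limits from the signatures at call time are assumptions, and the two-message scenario is hypothetical (it reuses the link order of the earlier 2x2 example).

```python
def used(ls):
    """Indices of links with non-zero utilization."""
    return {j for j, packets in enumerate(ls) if packets > 0}


def optimize_pair(state_a, state_b, luv, aluv, assigned):
    """Greedily re-route unassigned CCs of two states to minimize num_links.

    luv: {cc: current LUV}, aluv: {cc: list of alternate LUVs}.
    Mutates luv and assigned; returns the final num_links.
    """
    n = len(next(iter(luv.values())))
    states = (state_a, state_b)

    def sig(state):
        return [sum(luv[cc][j] for cc in state) for j in range(n)]

    def num_links():
        return len(used(sig(state_a)) | used(sig(state_b)))

    # Assumption: limits come from the signatures at call time.
    limits = [max(sig(state)) for state in states]
    # Least flexible CCs are routed first.
    pending = sorted((state_a | state_b) - assigned, key=lambda cc: len(aluv[cc]))
    for cc in pending:
        best, best_links = luv[cc], num_links()
        for option in aluv[cc]:
            luv[cc] = option
            ok = all(max(sig(s)) <= lim
                     for s, lim in zip(states, limits) if cc in s)
            if ok and num_links() < best_links:
                best, best_links = option, num_links()
        luv[cc] = best
        assigned.add(cc)   # each CC is re-routed at most once
    return num_links()


# Link order: l0,1 l1,0 l2,3 l3,2 l0,2 l2,0 l1,3 l3,1 (hypothetical scenario).
via_p1 = [20, 0, 0, 0, 0, 0, 20, 0]    # P0 -> P1 -> P3
via_p2 = [0, 0, 20, 0, 20, 0, 0, 0]    # P0 -> P2 -> P3
direct = [0, 0, 20, 0, 0, 0, 0, 0]     # P2 -> P3
luv = {"M1,0": via_p1, "M1,2": direct}
aluv = {"M1,0": [via_p1, via_p2], "M1,2": [direct]}
assigned = set()
links = optimize_pair({"M1,0"}, {"M1,2"}, luv, aluv, assigned)
print(links)   # 2
```

With the default routes the two states use three links in total; re-routing the first message through P2 shares l2,3 with the second message and brings the total down to two without raising any peak utilization.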
![Page 25: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/25.jpg)
Complexity
Computational complexity: $O(N \cdot K \cdot C_{m+n}^{m})$
$N$: number of network states
$K$: number of send operations
$C_{m+n}^{m}$: largest routing flexibility in an $m \times n$ mesh
![Page 26: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/26.jpg)
Example
![Page 27: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/27.jpg)
Cont.
![Page 28: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/28.jpg)
Cont.
![Page 29: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/29.jpg)
Code Rewriter
Responsible for providing a version of the message send operation that incorporates the compiler-determined routing information
Message header for send1,3: 1 0110 11 0001110000000
Message header for send1,7: 1 0100 11 1010000000000
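
The two headers can be regenerated from hop sequences. This is a sketch assuming the 1/4/2/13-bit layout and the orientation table from the packet header slides, with command bits written left to right and zero padding; `encode_header` is an illustrative name, and the paths `SSSWWW` and `WSWS` are what those headers decode to under these assumptions.

```python
# For each orientation: (port selected by command bit 0, port selected by bit 1).
ORIENTATION = {
    "00": ("N", "E"),
    "01": ("N", "W"),
    "10": ("S", "E"),
    "11": ("S", "W"),
}


def encode_header(path: str) -> str:
    """Encode a hop string such as 'SSSWWW' as 'flag counter orientation commands'."""
    if not 0 < len(path) <= 13:
        raise ValueError("path must have between 1 and 13 hops")
    # Pick the first orientation whose two ports cover every hop in the path.
    for orientation, ports in ORIENTATION.items():
        if set(path) <= set(ports):
            break
    else:
        raise ValueError("a shortest path uses at most one vertical "
                         "and one horizontal direction")
    commands = "".join(str(ports.index(hop)) for hop in path).ljust(13, "0")
    return f"1 {len(path):04b} {orientation} {commands}"


print(encode_header("SSSWWW"))   # 1 0110 11 0001110000000
print(encode_header("WSWS"))     # 1 0100 11 1010000000000
```

The counter field holds the hop count (6 and 4), and orientation 11 maps command bit 0 to South and bit 1 to West.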
![Page 30: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/30.jpg)
Handling Deadlocks
Re-routing changes the behavior of the default X-Y routing scheme
An acyclic channel dependency graph is the necessary and sufficient condition for avoiding deadlocks
A deadlock-handling routine that breaks cycles within the channel dependency graph is incorporated:
Reduces the probability of experiencing a deadlock at runtime
Cannot completely eliminate deadlocks, so the dynamic, hardware-supported deadlock avoidance rule employed by the Alpha 21364 network architecture is also used
Handling deadlocks when they occur costs both extra latency and power
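
Checking whether a set of routes yields an acyclic channel dependency graph can be sketched with a depth-first search. This only illustrates the acyclicity test, not the authors' cycle-breaking routine; the route lists, link labels, and function names are hypothetical.

```python
def dependencies(routes):
    """Build a channel dependency graph: consecutive links of a route
    create a dependency from the first link to the second."""
    cdg = {}
    for route in routes:
        for held, wanted in zip(route, route[1:]):
            cdg.setdefault(held, set()).add(wanted)
    return cdg


def has_cycle(cdg):
    """Depth-first search for a cycle in a directed graph {node: successors}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(channel):
        color[channel] = GREY
        for nxt in cdg.get(channel, ()):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True   # reached a channel that is still on the stack
        color[channel] = BLACK
        return False

    return any(color.get(c, WHITE) == WHITE and visit(c) for c in list(cdg))


# Hypothetical routes on a 2x2 mesh that turn in the same rotational direction.
ring = [["l0,1", "l1,3"], ["l1,3", "l3,2"], ["l3,2", "l2,0"], ["l2,0", "l0,1"]]
print(has_cycle(dependencies(ring)))        # True
print(has_cycle(dependencies(ring[:3])))    # False
```

Dropping any one of the four turns breaks the cycle, which is the kind of change a cycle-breaking pass would make to a set of compiler-chosen routes.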
![Page 31: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/31.jpg)
Experiments
1) Simulation Environment and Benchmarks
A flit-level on-chip interconnection network simulator was implemented
Network: 5x5 configuration
Link speed: 1 Gb/sec
Switch input port buffer size: 64 flits
Flit size: 128 bits
Packet size: 16 flits
The communication links can be shut down independently, using a time-out based mechanism
Time-out counter threshold: 1.5 μsec (based on preliminary analysis)
Time to switch a link to the active state: 1 μsec
Energy overhead of switching: 140 μJ
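
The time-out mechanism can be illustrated with a simplified single-link model. This is a sketch under stated assumptions, not the authors' simulator: the link is taken to be shut down exactly when it has been idle for the threshold, each later use counts as one wake-up, and the effect of the wake-up delay on traffic timing is ignored. The busy intervals are made-up values.

```python
TIMEOUT_US = 1.5        # idle time before the link is shut down (slide)
WAKEUP_US = 1.0         # time to switch a link back to the active state (slide)
SWITCH_ENERGY_UJ = 140  # energy overhead of switching (slide)


def link_off_time(busy_intervals, end_time):
    """Return (total time the link is off, number of wake-ups).

    busy_intervals: sorted, non-overlapping (start, end) times in microseconds.
    """
    off, wakeups, last_end = 0.0, 0, 0.0
    for start, end in busy_intervals:
        idle = start - last_end
        if idle > TIMEOUT_US:
            off += idle - TIMEOUT_US   # off only after the time-out expires
            wakeups += 1               # the link must be reactivated
        last_end = end
    tail = end_time - last_end
    if tail > TIMEOUT_US:
        off += tail - TIMEOUT_US       # link stays off until the end
    return off, wakeups


off, wakeups = link_off_time([(0.0, 2.0), (10.0, 12.0)], end_time=20.0)
print(off, wakeups)                   # 13.0 1
print(wakeups * WAKEUP_US)            # added latency in microseconds
print(wakeups * SWITCH_ENERGY_UJ)     # switching energy overhead in microjoules
```

The model shows why longer idle periods help: only the part of an idle period beyond the 1.5 μsec threshold is spent in the off state, and every reactivation adds latency and switching energy.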
![Page 32: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/32.jpg)
Cont.
Experiments were performed with three versions of each benchmark:
1. Default routing
2. Scheme I
3. Scheme II
All versions use the underlying hardware-based link shutdown scheme
![Page 33: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/33.jpg)
Cont.
Code sizes: 63 to 8,612 C lines
Dataset sizes: 68.9KB to 1,866.4KB
Increase in compilation time (including profiling): 89% (3Step-log) to 236% (Lame)
No deadlock was observed
![Page 34: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/34.jpg)
Results

| Benchmark Name | Brief Description | CG Size (Nodes) | CG Size (Edges) | Network Energy | Network Latency |
|---|---|---|---|---|---|
| Morph2 | Morphological operations | 338 | 1081 | 75.5 (64.9%) | 380.4 (8.8%) |
| Disc | Speech/music discriminator | 816 | 2937 | 99.2 (46.3%) | 123.6 (6.9%) |
| Jpeg | Compression for still images | 524 | 1729 | 92.7 (55.8%) | 445.1 (10.3%) |
| Viterbi | A graphical Viterbi decoder | 622 | 2239 | 72.5 (32.9%) | 150.8 (9.8%) |
| Rasta | Speech recognition | 498 | 1424 | 118.1 (50.7%) | 219.5 (6.2%) |
| 3Step-log | Logarithmic search motion est. | 127 | 396 | 15.2 (62.4%) | 107.4 (5.7%) |
| Full-search | Full search motion est. | 136 | 448 | 13.5 (48.0%) | 95.6 (12.3%) |
| Hier | Hierarchical motion est. | 138 | 503 | 20.4 (56.3%) | 151.9 (7.3%) |
| Phods | Parallel hierarchical motion est. | 128 | 440 | 16.7 (66.6%) | 111.3 (10.4%) |
| Epic | Image data compression | 1144 | 4516 | 103.9 (30.7%) | 420.4 (6.1%) |
| Lame | MP3 encoder | 2062 | 7526 | 80.1 (55.0%) | 272.1 (9.0%) |
| FFT | Fast Fourier transform | 416 | 1747 | 87.2 (55.9%) | 253.3 (7.4%) |
![Page 35: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/35.jpg)
Cont.
Link utilization
Percentage reductions in leakage energy consumption
![Page 36: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/36.jpg)
Cont.
Percentage increases in network cycles and overall execution time
![Page 37: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/37.jpg)
Cont.
Sensitivity to the number of nodes (Scheme I). The results with Scheme II are similar
Sensitivity to the input size (Scheme I). The results with Scheme II are similar
![Page 38: Profile-Driven Energy Reduction in Network-on-Chips](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56814c26550346895db92a76/html5/thumbnails/38.jpg)
Conclusion
The proposed approach confines link usage to a small set of links, which lengthens the idle periods of the remaining links
Hardware link power management schemes are more effective when combined with the proposed technique