profile-driven energy reduction in network-on-chips
Profile-Driven Energy Reduction in Network-on-Chips
8383 – 2nd Presentation
Ranya Alawadhi
CSE 8383
Source
Li, F., Chen, G., Kandemir, M., and Kolcu, I. 2007. Profile-driven energy reduction in network-on-chips. SIGPLAN Not. 42, 6 (Jun. 2007), 394-404.
Agenda
Motivation
Contribution
Introduction
The Technique
Results
Conclusion
Motivation
Increasing on-chip power consumption demands power-aware designs
Recent research shows that voltage/frequency scaling on communication links and shutting down idle links can significantly reduce NoC power consumption
These techniques work best when communication links have long idle periods
Contribution
A profile-driven compiler optimization that lengthens the idle periods of communication links in a two-dimensional, on-chip mesh network by maximizing communication link reuse
Introduction
Targeted application: array/loop-intensive embedded programs
Targeted NoC: two-dimensional mesh used by a single application at a time
Data tested: 12 data-intensive embedded applications
Results: reduces leakage energy by more than 35% on average (compared to a pure hardware-based link power management scheme)
Architectural model: Network Abstraction
[Figure: two-dimensional mesh of switches (S); each switch connects to a node consisting of a network interface (NI), a CPU, and memory. Switch detail: a crossbar with input and output buffers, North, South, East, and West ports, and an interface to the local processing unit.]
Architectural model: Hardware Support for Compiler-Directed Message Routing
The compiler attaches routing information to each message-send operation in the code
The switch design was extended to handle two types of routing schemes:
  Default X-Y routing
  Compiler-directed routing
Cont.
Packet header:
  Flag: indicates which routing mechanism to use
  flag = 0: X-Y routing
  Header layout: Flag, Destination
Cont.
Packet header for compiler-directed routing:
  flag (1 bit): 1 selects compiler-directed routing
  counter (4 bits): number of hops along the path
  orientation (2 bits): used along with the routing command sequence
  routing command sequence (13 bits): tells each switch which output port to forward the packet to
  Header layout: Flag, Counter, Orientation, Routing command sequence
Orientation  Routing command  Routing decision
00           0                N
00           1                E
01           0                N
01           1                W
10           0                S
10           1                E
11           0                S
11           1                W
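A minimal sketch of how a switch could interpret the compiler-directed header, using the field widths and the orientation table above. The function name `decode_header` and the sample header are hypothetical, and the sketch assumes that routing commands are consumed left to right, one per hop, which the slides do not state explicitly.

```python
# (orientation, routing command) -> output port, copied from the table above.
ROUTING_TABLE = {
    ("00", "0"): "N", ("00", "1"): "E",
    ("01", "0"): "N", ("01", "1"): "W",
    ("10", "0"): "S", ("10", "1"): "E",
    ("11", "0"): "S", ("11", "1"): "W",
}


def decode_header(header: str) -> list[str]:
    """Decode a 20-bit compiler-directed header into a list of hop directions."""
    bits = header.replace(" ", "")
    if len(bits) != 20 or set(bits) - {"0", "1"}:
        raise ValueError("expected 20 binary digits")
    # Field widths: flag (1), counter (4), orientation (2), command sequence (13).
    flag, counter, orientation, commands = bits[0], bits[1:5], bits[5:7], bits[7:]
    if flag != "1":
        raise ValueError("flag 0 selects default X-Y routing")
    hops = int(counter, 2)
    # Assumption: the first `hops` commands are used, read left to right.
    return [ROUTING_TABLE[(orientation, c)] for c in commands[:hops]]


# Hypothetical header: 3 hops, orientation 10 (0 -> S, 1 -> E), commands 0, 1, 0.
print(decode_header("1 0011 10 0100000000000"))  # ['S', 'E', 'S']
```

Because every shortest path uses at most one vertical and one horizontal direction, a 2-bit orientation plus one command bit per hop is enough to describe it.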
Optimizing Link Reuse
[Figure: tool flow. Parallel Code, Profiler, Communication Graph and Link Signature, Link Reuse Optimizer, Modified Communication Graph and Optimized Link Signature, Code Rewriter, Optimized Parallel Code.]
Network State and Link Signature
A parallel program consists of n parallel threads P1, P2, … , Pn
Pp is scheduled to run on the pth mesh node
Communication command (CC): a send operation
Cp = {M1,p, M2,p, ... , Mk,p, ... , Mq,p}
  Cp: set of CCs in the program code of Pp
  Mk,p: kth CC in the code of Pp
  q: total number of CCs in the program code of Pp
Network state: set of messages under transmission
Si = {Mk,p | a message sent by Mk,p is in transmission over the mesh}
S0 represents the state in which no message is in transmission
Cont.
Link utilization vector (LUV): a vector $u_{k,p}$ whose jth element gives the number of packets sent by Mk,p and transferred through the jth communication link of the mesh
Link signature (LS): represents the link utilization at a network state Si: $s_i = \sum_{M_{k,p} \in S_i} u_{k,p}$
Θ(w): a function that returns the set of links used by w, where w is an LS or an LUV
Example
[Figure: 2x2 mesh with nodes P0, P1, P2, P3, directed links l0,1, l1,0, l2,3, l3,2, l0,2, l2,0, l1,3, l3,1, and messages m1,0, m1,1, m1,2.]
S1 = {M1,0, M1,1, M1,2}
Links:       l0,1 l1,0 l2,3 l3,2 l0,2 l2,0 l1,3 l3,1
$u_{1,0}$ = (20   0    0    0    0    0    20   0)
$u_{1,1}$ = (0    0    0    0    0    0    20   0)
$u_{1,2}$ = (0    0    20   0    0    0    0    0)
$s_1$     = (20   0    20   0    0    0    40   0)
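The example can be checked with a short sketch: the link signature of a state is the element-wise sum of the LUVs of the messages in that state, and Θ picks out the links with a non-zero entry. The function names `link_signature` and `theta` are illustrative, not taken from the paper.

```python
# Link order used by the vectors in the example above.
LINKS = ["l0,1", "l1,0", "l2,3", "l3,2", "l0,2", "l2,0", "l1,3", "l3,1"]

# Link utilization vectors of the three communication commands in S1.
luv = {
    "M1,0": (20, 0, 0, 0, 0, 0, 20, 0),
    "M1,1": (0, 0, 0, 0, 0, 0, 20, 0),
    "M1,2": (0, 0, 20, 0, 0, 0, 0, 0),
}


def link_signature(state, luv):
    """Element-wise sum of the LUVs of all commands in the network state."""
    total = [0] * len(LINKS)
    for command in state:
        for j, packets in enumerate(luv[command]):
            total[j] += packets
    return tuple(total)


def theta(vector):
    """Set of links with a non-zero entry in an LS or LUV."""
    return {LINKS[j] for j, packets in enumerate(vector) if packets > 0}


s1 = link_signature({"M1,0", "M1,1", "M1,2"}, luv)
print(s1)                 # (20, 0, 20, 0, 0, 0, 40, 0)
print(sorted(theta(s1)))  # ['l0,1', 'l1,3', 'l2,3']
```

The empty state S0 yields the all-zero signature, so Θ returns the empty set for it.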
Communication Graph
The network transitions from a state Si to another state Sj in two situations:
1. A new message is sent by Mk,p:
   Sj = Si ∪ {Mk,p}
2. A message sent by Mk,p arrives at its destination node:
   Sj = Si − {Mk,p}
Cont.
Communication graph (CG):
  Captures the communication behavior of a program
  Undirected graph
  Vertex: network state
  Edge (Si, Sj): transition between Si and Sj
  Weight (Wi,j): number of transitions taking place between Si and Sj
  Built through profiling
[Figure: example communication graph over states S1 to S5 with edge weights 300, 500, 400, 200, 100, and 300.]
Profiling
The profiler keeps track of the current network state Si
The program notifies the profiler each time a node sends a message or a message arrives at its destination
When a notification is received, the profiler computes the new state Sj and increases the value of Wi,j
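The bookkeeping described here can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that a communication command has at most one message in flight at a time, so that a send always adds a command to the state and an arrival always removes one.

```python
from collections import Counter


class Profiler:
    """Builds the weighted, undirected communication graph from notifications."""

    def __init__(self):
        self.state = frozenset()   # S0: no message in transmission
        self.weights = Counter()   # {Si, Sj} -> number of observed transitions

    def _transition(self, new_state):
        if new_state == self.state:
            return                 # ignore notifications that change nothing
        # An undirected edge is stored as an unordered pair of states.
        self.weights[frozenset((self.state, new_state))] += 1
        self.state = new_state

    def on_send(self, command):
        """A node sent a message: Sj = Si U {Mk,p}."""
        self._transition(self.state | {command})

    def on_arrival(self, command):
        """A message reached its destination: Sj = Si - {Mk,p}."""
        self._transition(self.state - {command})


profiler = Profiler()
for event, command in [("send", "M1,0"), ("send", "M1,1"), ("arrive", "M1,0"),
                       ("arrive", "M1,1"), ("send", "M1,0"), ("arrive", "M1,0")]:
    (profiler.on_send if event == "send" else profiler.on_arrival)(command)

s0, sa = frozenset(), frozenset({"M1,0"})
print(profiler.weights[frozenset((s0, sa))])  # 3
```

Keying edges on unordered pairs means a transition from Si to Sj and one from Sj to Si increase the same weight, which matches the undirected graph described on the previous slide.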
Restating the Problem
When going from one state to another at runtime, the goal is to reuse the same set of links as much as possible
Each vertex in a CG has a default link signature (obtained using the default X-Y routing)
The compiler’s task is to re-assign link signatures to vertices
Traversing a Communication Graph
Traversing network states to assign them new link signatures:
1. Start with the edge with the largest weight
2. Perform the signature re-assignment on the associated vertices
3. Select the next edge:
   Scheme I: the one with the largest weight among the edges that are incident on the selected vertices
   Scheme II: the one with the largest weight among all the remaining edges
4. Perform the signature re-assignment
5. Repeat 3 and 4 until all vertices are processed
Example
[Figure: communication graph over states S1 to S5 with edge weights 300, 500, 400, 200, 100, and 300.]
Scheme I
Step  Vertex-pair considered
1     S1 S2
2     S1 S3
3     S3 S4
4     S4 S5
Scheme II
Step  Vertex-pair considered
1     S1 S2
2     S3 S4
3     S4 S5
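A minimal sketch of the two edge-selection schemes. The six weights come from the slide, but which pair of states each weight belongs to did not survive extraction, so the edge list below is an assumed assignment chosen for illustration. The sketch also assumes that an edge whose two vertices are both already processed is skipped.

```python
def traverse(edges, scheme):
    """Return the vertex pairs in the order they are considered.

    edges:  dict mapping (u, v) -> weight
    scheme: "I"  (largest edge incident on an already processed vertex)
            "II" (largest edge among all remaining edges)
    """
    remaining = dict(edges)
    all_vertices = {v for pair in edges for v in pair}
    processed, order = set(), []
    while processed != all_vertices:
        # Edges that still have at least one vertex without a new signature.
        useful = [pair for pair in remaining if not set(pair) <= processed]
        candidates = useful
        if scheme == "I" and processed:
            incident = [pair for pair in useful if set(pair) & processed]
            candidates = incident or useful  # fall back if the graph is disconnected
        best = max(candidates, key=lambda pair: remaining[pair])
        order.append(best)
        processed |= set(best)
        del remaining[best]
    return order


# Assumed endpoints for the weights shown on the slide.
edges = {
    ("S1", "S2"): 500, ("S3", "S4"): 400, ("S1", "S3"): 300,
    ("S4", "S5"): 300, ("S2", "S3"): 200, ("S3", "S5"): 100,
}
print(traverse(edges, "I"))   # [('S1', 'S2'), ('S1', 'S3'), ('S3', 'S4'), ('S4', 'S5')]
print(traverse(edges, "II"))  # [('S1', 'S2'), ('S3', 'S4'), ('S4', 'S5')]
```

With this assumed graph the sketch visits the vertex pairs in the same order as the two tables above. Scheme I follows the neighborhood of the states already handled, while Scheme II jumps to the heaviest edge anywhere in the graph.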
Routing Flexibility
Only the shortest paths are considered for re-routing messages
Number of possible unique shortest paths: $C_{m+n}^{m} = \binom{m+n}{m}$, where the source is (xs, ys), the destination is (xd, yd), m = |xd − xs|, and n = |yd − ys|
Alternate link utilization vectors (ALUV): the set Ai,p of LUVs for all alternate (shortest) paths available to a message sent by Mi,p
Re-routing: replacement of the current LUV of Mi,p with a new LUV selected from the corresponding ALUVs
Routing flexibility = |Ai,p| (i.e. the number of alternate link utilization vectors in an ALUV set)
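The count can be verified by enumerating the paths directly: a shortest path is any ordering of m horizontal and n vertical unit moves. In this sketch the function name `shortest_paths` is hypothetical, and the convention that increasing y means North is an assumption.

```python
from itertools import combinations
from math import comb


def shortest_paths(src, dst):
    """Enumerate every shortest path from src to dst as a string of unit moves."""
    (xs, ys), (xd, yd) = src, dst
    m, n = abs(xd - xs), abs(yd - ys)
    x_step = "E" if xd > xs else "W"
    y_step = "N" if yd > ys else "S"   # assumption: y grows towards North
    paths = []
    # Choose which m of the m + n hops are horizontal; the rest are vertical.
    for x_positions in combinations(range(m + n), m):
        chosen = set(x_positions)
        paths.append("".join(x_step if i in chosen else y_step
                             for i in range(m + n)))
    return paths


paths = shortest_paths((0, 0), (2, 1))
print(paths)                    # ['EEN', 'ENE', 'NEE']
print(len(paths), comb(3, 2))   # 3 3
```

For m = 2 and n = 1 the formula gives 3!/(2! 1!) = 3, which matches the three enumerated paths. Each path corresponds to one alternate LUV.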
Problem Formulation
The routing associated with a CC can be changed only once
Selecting the new utilization vectors should not degrade the performance of the default routing scheme
Selecting alternate re-routings can increase network contention
A performance constraint was introduced for re-routing: avoid increasing the value of the largest entry in any original link signature
For example:
  Default LS: (10, 40, 10, 10, 0, 0, 0, 0)
  Undesirable alternative: (10, 50, 0, 10, 0, 0, 0, 0)
  Accepted alternative: (40, 20, 10, 0, 0, 0, 0, 0)
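The performance constraint reduces to comparing the largest entries of two link signatures. A minimal sketch, with a hypothetical function name, applied to the three vectors in the example:

```python
def meets_performance_constraint(default_ls, candidate_ls):
    """True if the candidate does not raise the largest entry of the default LS."""
    return max(candidate_ls) <= max(default_ls)


default_ls = (10, 40, 10, 10, 0, 0, 0, 0)
undesirable = (10, 50, 0, 10, 0, 0, 0, 0)   # busiest link carries 50 > 40
accepted = (40, 20, 10, 0, 0, 0, 0, 0)      # busiest link carries 40 <= 40

print(meets_performance_constraint(default_ls, undesirable))  # False
print(meets_performance_constraint(default_ls, accepted))     # True
```

The accepted alternative also uses three links instead of four, which is the kind of reduction the optimizer looks for.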
Heuristic
For each Mi,p in network states Sa and Sb that has not yet been assigned a new routing, calculate its LUV and ALUVs
Calculate the LS of Sa and Sb
Compute num_links (the total number of links used in Sa and Sb); the goal is to reduce this value as much as possible under the performance constraint
Sort the CCs in Sa and Sb into a sequence with ascending routing flexibilities
Start with the CC that has the lowest routing flexibility and assign a proper route to it
Assign appropriate routes to the CCs, one by one, until all commands in Sa and Sb have been processed
Cont.
Choosing a route for Mi,p (selects a new LUV for Mi,p by considering all the re-routing options captured in Ai,p):
  For each alternate re-routing, check whether the performance constraint is satisfied with respect to state Sa
  If the performance constraint is met, compute the new link signature for the state that the command belongs to
  Recalculate num_links
  Select the re-routing that leads to the minimum num_links value
Once a CC is given a new LUV, this command is not considered again when processing the other vertex-pairs
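The heuristic for one vertex pair can be sketched as a greedy loop. This is an illustration under stated assumptions, not the authors' implementation: all names are hypothetical, num_links is taken to be the number of links used by at least one of the two signatures, the constraint is checked against both states of the pair, and the toy data (five links, three commands A, B, C) is invented.

```python
def signature(state, luv, n_links):
    """Link signature of a state: element-wise sum of the LUVs of its CCs."""
    total = [0] * n_links
    for cc in state:
        for j, packets in enumerate(luv[cc]):
            total[j] += packets
    return tuple(total)


def num_links(signatures):
    """Number of links used by at least one of the given signatures."""
    return sum(1 for counts in zip(*signatures) if any(counts))


def reassign_pair(sa, sb, luv, aluv, n_links, fixed=frozenset()):
    """Greedy re-routing for the vertex pair (Sa, Sb); returns the new LUV map."""
    states = (sa, sb)
    luv = dict(luv)
    # Performance constraint: never exceed the largest entry of an original LS.
    limits = [max(signature(s, luv, n_links)) for s in states]
    # Least flexible commands first; names break ties deterministically.
    pending = sorted((sa | sb) - fixed, key=lambda cc: (len(aluv[cc]), cc))
    for cc in pending:
        best, best_count = luv[cc], None
        for option in aluv[cc]:
            trial = {**luv, cc: option}
            sigs = [signature(s, trial, n_links) for s in states]
            if any(max(sig) > limit for sig, limit in zip(sigs, limits)):
                continue  # this re-routing would overload a link
            count = num_links(sigs)
            if best_count is None or count < best_count:
                best, best_count = option, count
        luv[cc] = best
    return luv


# Invented data: A can move onto the links that B already uses.
luv = {"A": (10, 10, 0, 0, 0), "B": (0, 0, 10, 10, 0), "C": (0, 0, 0, 0, 20)}
aluv = {"A": [(10, 10, 0, 0, 0), (0, 0, 10, 10, 0)],
        "B": [(0, 0, 10, 10, 0)],
        "C": [(0, 0, 0, 0, 20)]}
sa, sb = {"A", "B", "C"}, {"B", "C"}
new_luv = reassign_pair(sa, sb, luv, aluv, n_links=5)
print(new_luv["A"])  # (0, 0, 10, 10, 0)
```

In the toy data, moving A onto B's links cuts the links in use from five to three while the busiest link still carries 20 packets, so the constraint holds. A caller processing several vertex pairs would pass the already re-routed commands in `fixed` so that each CC is changed only once.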
Complexity
Computational complexity: $O(N \cdot K \cdot C_{m+n}^{m})$
  N: number of network states
  K: number of send operations
  $C_{m+n}^{m} = \binom{m+n}{m}$: largest routing flexibility in an m x n mesh
Example
Cont.
Cont.
Code Rewriter
Responsible for providing a version of the message send operation that incorporates the compiler-determined routing information
Message header for send1,3: 1 0110 11 0001110000000
Message header for send1,7: 1 0100 11 1010000000000
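A sketch of how a code rewriter could build such a header from a chosen path, using the header layout from the earlier slide (1-bit flag, 4-bit counter, 2-bit orientation, 13-bit command sequence). The function name is hypothetical. The sketch assumes that commands are written left to right and zero-padded, and the two paths used below are inferred from the headers rather than stated on the slide.

```python
# Orientation -> (direction selected by command 0, direction selected by command 1).
ORIENTATIONS = {
    "00": ("N", "E"),
    "01": ("N", "W"),
    "10": ("S", "E"),
    "11": ("S", "W"),
}


def encode_header(path):
    """Encode a list of hop directions as a compiler-directed message header."""
    if not 1 <= len(path) <= 13:
        raise ValueError("the command sequence holds between 1 and 13 hops")
    for orientation, (zero_dir, one_dir) in ORIENTATIONS.items():
        if set(path) <= {zero_dir, one_dir}:
            commands = "".join("0" if d == zero_dir else "1" for d in path)
            # flag = 1, 4-bit hop counter, orientation, zero-padded commands
            return f"1 {len(path):04b} {orientation} {commands:0<13}"
    raise ValueError("a shortest path uses at most one vertical and one horizontal direction")


print(encode_header(["S", "S", "S", "W", "W", "W"]))  # 1 0110 11 0001110000000
print(encode_header(["W", "S", "W", "S"]))            # 1 0100 11 1010000000000
```

Under these assumptions the two inferred paths reproduce the headers shown for send1,3 and send1,7.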
Handling Deadlocks
Re-routings change the behavior of the default X-Y routing scheme
An acyclic channel dependency graph is the necessary and sufficient condition for avoiding deadlocks
Incorporating a deadlock handling routine that breaks cycles within the channel dependency graph:
  Reduces the probability of experiencing a deadlock at runtime
  Cannot completely eliminate deadlocks, so the dynamic, hardware-supported deadlock avoidance rule employed by the Alpha 21364 network architecture is used
Handling deadlocks when they occur results in both extra latency and power consumption
Experiments
1) Simulation Environment and Benchmarks
Implemented a flit-level on-chip interconnection network simulator
Network: 5x5 configuration
Link speed: 1 Gb/sec
Switch input port buffer size: 64 flits
Flit size: 128 bits
Packet size: 16 flits
The communication links can be shut down independently, using a time-out based mechanism
Time-out counter threshold: 1.5 μsec (based on preliminary analysis)
Time to switch a link to the active state: 1 μsec
Energy overhead of switching: 140 μJ
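A simplified model of the time-out mechanism shows why longer idle periods matter. It assumes a link is shut down once it has been idle for the time-out threshold and stays off until the next message needs it; the function name and the idle-period lists are invented for illustration.

```python
TIMEOUT_US = 1.5   # time-out counter threshold from the configuration above


def link_off_time(idle_periods_us, timeout_us=TIMEOUT_US):
    """Return (total time the link is off in microseconds, number of shutdowns)."""
    off_time, shutdowns = 0.0, 0
    for idle in idle_periods_us:
        if idle > timeout_us:
            # The link stays on during the time-out, then is off for the rest.
            off_time += idle - timeout_us
            shutdowns += 1
    return off_time, shutdowns


# The same 6 microseconds of idle time, fragmented versus contiguous.
print(link_off_time([1.0] * 6))  # (0.0, 0)
print(link_off_time([6.0]))      # (4.5, 1)
```

Six short idle periods never reach the threshold, so the link is never turned off, while a single period of the same total length keeps the link off for 4.5 μsec. Each shutdown also incurs the wake-up delay and switching energy listed above, so fewer and longer idle periods are preferable.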
Cont.
Experiments were performed with three different versions of each benchmark:
1. Default routing
2. Scheme I
3. Scheme II
All versions use the underlying hardware-based link shutdown scheme.
Cont.
Code sizes: 63 - 8,612 C lines
Dataset sizes: 68.9KB - 1,866.4KB
Increase in compilation time (including profiling): 89% (3Step-log) - 236% (Lame)
No deadlock was observed
Results
Benchmark    Brief Description                CG Nodes  CG Edges  Network Energy  Network Latency
Morph2       Morphological operations         338       1081      75.5 (64.9%)    380.4 (8.8%)
Disc         Speech/music discriminator       816       2937      99.2 (46.3%)    123.6 (6.9%)
Jpeg         Compression for still images     524       1729      92.7 (55.8%)    445.1 (10.3%)
Viterbi      A graphical Viterbi decoder      622       2239      72.5 (32.9%)    150.8 (9.8%)
Rasta        Speech recognition               498       1424      118.1 (50.7%)   219.5 (6.2%)
3Step-log    Logarithmic search motion est.   127       396       15.2 (62.4%)    107.4 (5.7%)
Full-search  Full search motion est.          136       448       13.5 (48.0%)    95.6 (12.3%)
Hier         Hierarchical motion est.         138       503       20.4 (56.3%)    151.9 (7.3%)
Phods        Parallel hierarchical motion est. 128      440       16.7 (66.6%)    111.3 (10.4%)
Epic         Image data compression           1144      4516      103.9 (30.7%)   420.4 (6.1%)
Lame         MP3 encoder                      2062      7526      80.1 (55.0%)    272.1 (9.0%)
FFT          Fast Fourier transform           416       1747      87.2 (55.9%)    253.3 (7.4%)
Cont.
[Figure: link utilization]
[Figure: percentage reductions in leakage energy consumption]
Cont.
[Figure: percentage increases in network cycles and overall execution time]
Cont.
[Figure: sensitivity to the number of nodes (Scheme I). The results with Scheme II are similar]
[Figure: sensitivity to the input size (Scheme I). The results with Scheme II are similar]
Conclusion
The proposed approach confines link usage to a small set of links, which increases the idle periods of the remaining links
Hardware schemes are more effective when used with the proposed technique