Profile-Driven Energy Reduction in Network-on-Chips

CSE 8383 – 2nd Presentation

Ranya Alawadhi

Source: Li, F., Chen, G., Kandemir, M., and Kolcu, I. 2007. Profile-driven energy reduction in network-on-chips. SIGPLAN Not. 42, 6 (Jun. 2007), 394-404.

Agenda

Motivation

Contribution

Introduction

The Technique

Results

Conclusion

Motivation

Increasing on-chip power consumption demands power-aware designs

Recent research shows that voltage/frequency scaling on communication links and shutting down idle links can significantly reduce NoC power consumption

Both techniques work best when communication links have long idle periods

Contribution

A profile-driven compiler optimization that increases the length of the idle periods of communication links in a two-dimensional, on-chip mesh network by maximizing communication link reuse

Introduction

Targeted applications: array/loop-intensive embedded programs

Targeted NoC: two-dimensional mesh used by a single application at a time

Benchmarks: 12 data-intensive embedded applications

Results: reduces leakage energy by more than 35% on average (compared to a pure hardware-based link power management scheme)

Architectural Model: Network Abstraction

[Figure: two-dimensional mesh in which every node consists of a CPU, a memory, and a network interface (NI) attached to a switch (S). Each switch contains a crossbar, input and output buffers, North/South/East/West ports, and an interface to the local processing unit.]

Architectural Model: Hardware Support for Compiler-Directed Message Routing

The compiler attaches routing information to each message-send operation in the code

The switch design was extended to handle two types of routing schemes:

Default X-Y routing

Compiler-directed routing

Cont.

Packet header:

Flag: indicates which routing mechanism to use

Flag = 0: default X-Y routing

Header layout: Flag, Destination, ...

Cont.

Flag (1 bit): 1 selects compiler-directed routing

Counter (4 bits): number of hops along the path

Orientation (2 bits): used along with the routing command sequence

Routing command sequence (13 bits): tells each switch which output port to forward the packet to

Header layout: Flag, Counter, Orientation, Routing command sequence, ...

Orientation | Routing Command | Routing Decision
00 | 0 | N
00 | 1 | E
01 | 0 | N
01 | 1 | W
10 | 0 | S
10 | 1 | E
11 | 0 | S
11 | 1 | W
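The header fields and the orientation table can be turned into a small decoder. This is only a sketch: the function name `decode_header` is hypothetical, and it assumes that routing commands are consumed left to right, one bit per hop, which the slides do not state explicitly.

```python
# (orientation, routing command) -> output port, copied from the table above
ROUTING_TABLE = {
    ("00", "0"): "N", ("00", "1"): "E",
    ("01", "0"): "N", ("01", "1"): "W",
    ("10", "0"): "S", ("10", "1"): "E",
    ("11", "0"): "S", ("11", "1"): "W",
}

def decode_header(header):
    """Return the list of hop directions encoded in a 20-bit header string.

    Assumption: commands are read left to right, one bit per hop.
    """
    bits = header.replace(" ", "")
    if len(bits) != 20 or set(bits) - {"0", "1"}:
        raise ValueError("expected 1 + 4 + 2 + 13 = 20 header bits")
    if bits[0] != "1":
        raise ValueError("flag is 0: packet uses default X-Y routing")
    hops = int(bits[1:5], 2)      # 4-bit counter
    orientation = bits[5:7]       # 2-bit orientation
    commands = bits[7:20]         # 13-bit routing command sequence
    return [ROUTING_TABLE[(orientation, c)] for c in commands[:hops]]
```

For instance, `decode_header("1 0110 11 0001110000000")` (the header the slides later list for send1,3) yields three hops South followed by three hops West under this reading.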

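Optimizing Link Reuse

[Figure: tool flow. The Profiler takes the Parallel Code and produces the Communication Graph and the Link Signature; the Link Reuse Optimizer turns these into the Modified Communication Graph and the Optimized Link Signature; the Code Rewriter then generates the Optimized Parallel Code.]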

Network State and Link Signature

A parallel program consists of n parallel threads P1, P2, … , Pn

Pp is scheduled to run on the pth mesh node

Communication command (CC): a send operation

Cp = {M1,p, M2,p, ... , Mk,p, ... , Mq,p}

Cp: set of CCs in the program code of Pp

Mk,p: kth CC in the code of Pp

q: total number of CCs in the program code of Pp

Network state: the set of messages under transmission

Si = {Mk,p | a message sent by Mk,p is in transmission over the mesh}

S0 = ∅ represents the state in which no message is in transmission

Cont.

Link utilization vector (LUV): a vector $u_{k,p}$ whose jth element gives the number of packets sent by Mk,p and transferred through the jth communication link of the mesh

Link signature (LS): represents the link utilization at a network state Si

$s_i = \sum_{M_{k,p} \in S_i} u_{k,p}$

$\Theta(w)$: a function that returns the set of links used by $w$, where $w$ is an LS or an LUV

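Example

[Figure: 2x2 mesh with nodes P0, P1, P2, P3 and directed links l0,1, l1,0, l2,3, l3,2, l0,2, l2,0, l1,3, l3,1, with messages m1,0, m1,1 and m1,2 in transit]

S1 = {M1,0, M1,1, M1,2}

Vector | l0,1 | l1,0 | l2,3 | l3,2 | l0,2 | l2,0 | l1,3 | l3,1
$u_{1,0}$ | 20 | 0 | 0 | 0 | 0 | 0 | 20 | 0
$u_{1,1}$ | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0
$u_{1,2}$ | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0
$s_1$ | 20 | 0 | 20 | 0 | 0 | 0 | 40 | 0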
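The link signature of this example is the element-wise sum of the three link utilization vectors. A minimal sketch that reproduces it, with hypothetical helper names (`link_signature`, `used_links`) and the link order taken from the slide:

```python
# Link order and utilization vectors as listed in the example
LINKS = ["l0,1", "l1,0", "l2,3", "l3,2", "l0,2", "l2,0", "l1,3", "l3,1"]
LUV = {
    "M1,0": (20, 0, 0, 0, 0, 0, 20, 0),
    "M1,1": (0, 0, 0, 0, 0, 0, 20, 0),
    "M1,2": (0, 0, 20, 0, 0, 0, 0, 0),
}

def link_signature(state, luv):
    """Element-wise sum of the LUVs of all commands in a network state."""
    return tuple(sum(luv[cc][j] for cc in state) for j in range(len(LINKS)))

def used_links(vector):
    """Theta: the set of links with non-zero utilization in an LS or LUV."""
    return {LINKS[j] for j, packets in enumerate(vector) if packets > 0}

S1 = {"M1,0", "M1,1", "M1,2"}
s1 = link_signature(S1, LUV)
```

Here `s1` matches the last vector of the example, and `used_links(s1)` names the three links that are busy in state S1.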

Communication Graph

The network transitions from a state, Si, to another state, Sj, in two situations:

1. A new message is sent by Mk,p

Sj = Si ∪ {Mk,p}

2. A message sent by Mk,p arrives at its destination node

Sj = Si − {Mk,p}

Cont.

Communication graph (CG):

Captures the communication behavior of a program

Undirected graph

Vertex: network state

Edge (Si, Sj): transition between Si and Sj

Weight (Wi,j): number of transitions taking place between Si and Sj

Built through profiling

[Figure: example communication graph over the states S1 to S5, built up step by step, with edge weights 500, 400, 300, 300, 200 and 100]

Profiling

The profiler keeps track of the current network state Si

The program notifies the profiler each time a node sends a message or a message arrives at its destination

When a notification is received, the profiler computes the new state, Sj, and increases the value of Wi,j
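The profiling loop can be sketched as a small state machine. The class and method names below are hypothetical, and the sketch assumes that a network state is represented as a set of communication-command identifiers.

```python
from collections import Counter

class Profiler:
    """Tracks the current network state and counts state transitions."""

    def __init__(self):
        self.state = frozenset()   # S0: no message in transmission
        self.weights = Counter()   # undirected edge {Si, Sj} -> Wi,j

    def _transition(self, new_state):
        if new_state == self.state:
            return                 # no state change, nothing to record
        self.weights[frozenset((self.state, new_state))] += 1
        self.state = new_state

    def on_send(self, cc):
        # Sj = Si U {Mk,p}
        self._transition(self.state | {cc})

    def on_arrival(self, cc):
        # Sj = Si - {Mk,p}
        self._transition(self.state - {cc})
```

Keying the counter by an unordered pair of states keeps the graph undirected, so a send and the matching arrival both add to the same edge weight.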

Restating the Problem

When going from one state to another at runtime, the goal is to reuse the same set of links as much as possible

Each vertex in a CG has a default link signature (obtained using the default X-Y routing)

The compiler's task is to re-assign link signatures to vertices

Traversing a Communication Graph

Traversing network states to assign them new link signatures:

1. Start with the edge with the largest weight

2. Perform the signature re-assignment on the associated vertices

3. Select the next edge:

Scheme I: the one with the largest weight among the edges that are incident on the already selected vertices

Scheme II: the one with the largest weight among all the remaining edges

4. Perform the signature re-assignment

5. Repeat 3 & 4 until all vertices are processed
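The two edge-selection schemes can be sketched as follows. The function name is hypothetical, and so is the assignment of weights to particular edges in `EXAMPLE_CG` (the slides list the weights but the edge placement did not survive extraction). The sketch also assumes that an edge whose two vertices are both already processed is skipped.

```python
def traversal_order(edges, scheme):
    """Return the vertex pairs in the order they would be processed.

    edges:  dict mapping (Si, Sj) -> weight Wi,j
    scheme: "I"  picks among edges incident on already processed vertices,
            "II" picks among all remaining edges.
    """
    remaining = dict(edges)
    vertices = {v for edge in edges for v in edge}
    processed, order = set(), []
    while processed != vertices:
        # Assumption: edges between two processed vertices are skipped.
        candidates = {e: w for e, w in remaining.items()
                      if not set(e) <= processed}
        if scheme == "I" and processed:
            incident = {e: w for e, w in candidates.items()
                        if set(e) & processed}
            candidates = incident or candidates  # fallback: disconnected graph
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        order.append(best)
        processed.update(best)
        del remaining[best]
    return order

# Hypothetical weight placement for a five-state graph
EXAMPLE_CG = {
    ("S1", "S2"): 500, ("S3", "S4"): 400, ("S1", "S3"): 300,
    ("S4", "S5"): 300, ("S2", "S5"): 200, ("S3", "S5"): 100,
}
```

With this weight placement, Scheme I visits four vertex pairs and Scheme II visits three, because Scheme II is free to jump to a heavy edge elsewhere in the graph.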

Example

[Figure: the example communication graph over the states S1 to S5, with edge weights 500, 400, 300, 300, 200 and 100]

Scheme I

Step | Vertex-pair considered
1 | S1 S2
2 | S1 S3
3 | S3 S4
4 | S4 S5

Scheme II

Step | Vertex-pair considered
1 | S1 S2
2 | S3 S4
3 | S4 S5

Routing Flexibility

Only the shortest paths are considered for re-routing messages

Number of possible unique shortest paths = $\binom{m+n}{m}$, where the source is (xs, ys), the destination is (xd, yd), m = |xd − xs| and n = |yd − ys|

Alternate link utilization vectors (ALUV): the set Ai,p of LUVs for all alternate (shortest) paths available to a message sent by Mi,p

Re-routing: replacement of the current LUV of an Mi,p with a new LUV selected from the corresponding ALUV set

Routing flexibility = |Ai,p| (i.e. the number of alternate link utilization vectors in the ALUV set)
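The count of shortest paths follows from choosing which m of the m+n hops move along the X dimension. A minimal sketch, with hypothetical function names, that computes the routing flexibility and enumerates the shortest paths as sequences of mesh coordinates:

```python
from itertools import combinations
from math import comb

def routing_flexibility(src, dst):
    """Number of unique shortest paths between two mesh nodes."""
    m = abs(dst[0] - src[0])
    n = abs(dst[1] - src[1])
    return comb(m + n, m)

def shortest_paths(src, dst):
    """Enumerate every shortest path as a list of (x, y) nodes."""
    (xs, ys), (xd, yd) = src, dst
    m, n = abs(xd - xs), abs(yd - ys)
    step_x = 1 if xd >= xs else -1
    step_y = 1 if yd >= ys else -1
    paths = []
    # Choose which of the m + n hops are taken along the X dimension
    for x_hops in combinations(range(m + n), m):
        x, y = xs, ys
        path = [(x, y)]
        for hop in range(m + n):
            if hop in x_hops:
                x += step_x
            else:
                y += step_y
            path.append((x, y))
        paths.append(path)
    return paths
```

Each enumerated path corresponds to one candidate LUV, so `len(shortest_paths(src, dst))` equals the routing flexibility of a command sending from `src` to `dst`.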

Problem Formulation

The routing associated with a CC can be changed only once

Selecting the new utilization vectors should not degrade the performance of the default routing scheme

Selecting alternate re-routings can increase network contention

A performance constraint is therefore imposed on re-routing: avoid increasing the value of the largest entry in any original link signature

For example:

Default LS: (10, 40, 10, 10, 0, 0, 0, 0)

Undesirable alternative: (10, 50, 0, 10, 0, 0, 0, 0)

Accepted alternative: (40, 20, 10, 0, 0, 0, 0, 0)
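One way to read the performance constraint is as a comparison of the largest entries of the two signatures. The sketch below takes that reading, which is consistent with the three vectors in the example; the function name is hypothetical.

```python
def meets_performance_constraint(original_ls, candidate_ls):
    """True if the candidate signature does not raise the largest entry."""
    return max(candidate_ls) <= max(original_ls)

DEFAULT_LS = (10, 40, 10, 10, 0, 0, 0, 0)
UNDESIRABLE = (10, 50, 0, 10, 0, 0, 0, 0)
ACCEPTED = (40, 20, 10, 0, 0, 0, 0, 0)
```

The accepted alternative moves the peak load of 40 packets to a different link without raising it, while the undesirable one raises the peak from 40 to 50.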

Heuristic

For each Mi,p in network states Sa and Sb that has not yet been assigned a new routing, calculate its LUV and ALUV

Calculate the LS of Sa and Sb

Compute num_links (the total number of links used in Sa and Sb); the goal is to reduce this value as much as possible under the performance constraint

Sort the CCs in Sa and Sb into a sequence with ascending routing flexibilities

Start with the CC that has the lowest routing flexibility and assign a proper route to it

Assign appropriate routes to the remaining CCs, one by one, until all commands in Sa and Sb have been processed

Cont.

Choosing a route for an Mi,p (selecting a new LUV for Mi,p by considering all the re-routing options captured in Ai,p):

For each alternate re-routing, check whether the performance constraint is satisfied with respect to state Sa

If the performance constraint is met, compute the new link signature for the state that the command belongs to

Recalculate num_links

Select the re-routing that leads to the minimum num_links value

Once a CC is given a new LUV, this command is not considered again when processing the other vertex-pairs
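The greedy selection for one vertex pair can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: all names are hypothetical, the constraint is checked against both states of the pair, and the limits are taken from the signatures at the time the pair is processed rather than from a separately stored default.

```python
def signature(state, luv, n_links):
    """Link signature: element-wise sum of the LUVs in a state."""
    return tuple(sum(luv[cc][j] for cc in state) for j in range(n_links))

def num_links(states, luv, n_links):
    """Total number of distinct links used across the given states."""
    used = set()
    for state in states:
        sig = signature(state, luv, n_links)
        used |= {j for j, packets in enumerate(sig) if packets > 0}
    return len(used)

def reroute_pair(sa, sb, luv, aluv, assigned, n_links):
    """Greedily re-route the unassigned commands of states sa and sb.

    luv:      command -> current LUV (updated in place)
    aluv:     command -> list of alternate LUVs (including the default)
    assigned: set of commands that already received a new LUV
    """
    states = (sa, sb)
    # Assumption: the peak of each signature on entry is the limit to respect
    limits = [max(signature(s, luv, n_links)) for s in states]
    # Lowest routing flexibility first; the name breaks ties deterministically
    todo = sorted((sa | sb) - assigned, key=lambda cc: (len(aluv[cc]), cc))
    for cc in todo:
        best, best_cost = luv[cc], num_links(states, luv, n_links)
        for alt in aluv[cc]:
            trial = {**luv, cc: alt}
            if any(max(signature(s, trial, n_links)) > lim
                   for s, lim in zip(states, limits)):
                continue  # violates the performance constraint
            cost = num_links(states, trial, n_links)
            if cost < best_cost:
                best, best_cost = alt, cost
        luv[cc] = best
        assigned.add(cc)  # never reconsidered for later vertex pairs
    return luv
```

Processing the least flexible commands first fixes the links that cannot be avoided, so the more flexible commands can then be steered onto those links.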

Complexity

Computational complexity: $O\left(N \cdot K \cdot \binom{m+n}{m}\right)$

N: number of network states

K: number of send operations

$\binom{m+n}{m}$: largest routing flexibility in an m x n mesh

Example

Code Rewriter

Responsible for providing a version of the message send operation that incorporates the compiler-determined routing information

Message header for send1,3: 1 0110 11 0001110000000

Message header for send1,7: 1 0100 11 1010000000000
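A sketch of how such headers could be generated from a path follows. The function name is hypothetical, and the path directions used in the usage note are inferred from the two headers above under the assumption that routing commands are written left to right, one bit per hop, and padded with zeros.

```python
# Orientation bits -> (direction for command 0, direction for command 1)
ORIENTATIONS = {
    "00": ("N", "E"),
    "01": ("N", "W"),
    "10": ("S", "E"),
    "11": ("S", "W"),
}

def encode_header(path):
    """Build a compiler-directed routing header from a list of hop directions."""
    if not 1 <= len(path) <= 13:
        raise ValueError("the command sequence holds at most 13 hops")
    for bits, directions in ORIENTATIONS.items():
        if set(path) <= set(directions):
            commands = "".join(str(directions.index(d)) for d in path)
            # flag (1) | counter (4) | orientation (2) | commands (13)
            return f"1 {len(path):04b} {bits} {commands.ljust(13, '0')}"
    raise ValueError("a shortest path uses at most one vertical and one horizontal direction")
```

Under these assumptions, `encode_header(["S", "S", "S", "W", "W", "W"])` reproduces the header listed for send1,3 and `encode_header(["W", "S", "W", "S"])` reproduces the one listed for send1,7.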

Handling Deadlocks

Re-routings change the behavior of the default X-Y routing scheme

An acyclic channel dependency graph is the necessary and sufficient condition for avoiding deadlocks

A deadlock handling routine that breaks cycles within the channel dependency graph is incorporated:

Reduces the probability of experiencing a deadlock at runtime

Cannot completely eliminate deadlocks, so the dynamic, hardware-supported deadlock avoidance rule employed by the Alpha 21364 network architecture is also used

Handling deadlocks when they occur costs both extra latency and extra power
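Checking a set of routes for cyclic channel dependencies can be sketched with a depth-first search. This only illustrates the acyclicity test; it is not the cycle-breaking routine from the paper, and the function names and the representation of a route as a list of link names are assumptions.

```python
def build_cdg(routes):
    """Channel dependency graph: link a -> link b if some route uses b right after a."""
    cdg = {}
    for route in routes:
        for a, b in zip(route, route[1:]):
            cdg.setdefault(a, set()).add(b)
            cdg.setdefault(b, set())
    return cdg

def has_cycle(cdg):
    """Depth-first search with three colors; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in cdg}

    def visit(node):
        color[node] = GREY
        for succ in cdg[node]:
            if color[succ] == GREY:
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in cdg)
```

For example, four routes that each turn one corner of a 2x2 mesh in the same rotational direction produce a cyclic dependency, whereas a single two-link route does not.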

Experiments

1) Simulation Environment and Benchmarks

A flit-level on-chip interconnection network simulator was implemented

Network: 5x5 configuration

Link speed: 1 Gb/sec

Switch input port buffer size: 64 flits

Flit size: 128 bits

Packet size: 16 flits

The communication links can be shut down independently, using a time-out based mechanism

Time-out counter threshold: 1.5 μsec (based on preliminary analysis)

Time to switch a link to the active state: 1 μsec

Energy overhead of switching: 140 μJ
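As a rough worked figure from the parameters above, the time one packet occupies a link can be estimated, assuming the link moves bits at the stated 1 Gb/sec and ignoring switch and buffering delays (assumptions that the slides do not spell out):

```latex
t_{\text{packet}} = \frac{16\ \text{flits} \times 128\ \text{bits/flit}}{10^{9}\ \text{bits/s}}
                  = \frac{2048\ \text{bits}}{10^{9}\ \text{bits/s}}
                  = 2.048\ \mu\text{s}
```

Under these assumptions a single packet keeps a link busy for about 2 μsec, the same order of magnitude as the 1.5 μsec time-out threshold and the 1 μsec reactivation time.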

Cont.

Experiments were performed with three different versions of each benchmark:

1. Default routing

2. Scheme I

3. Scheme II

All versions use the underlying hardware-based link shutdown scheme

Cont.

Code sizes: 63 to 8,612 C lines

Dataset sizes: 68.9KB to 1,866.4KB

Increase in compilation time (including profiling): from 89% (3Step-log) to 236% (Lame)

No deadlock was observed

Results

Benchmark Name | Brief Description | CG Size (Node) | CG Size (Edge) | Network Energy | Network Latency
Morph2 | Morphological operations | 338 | 1081 | 75.5 (64.9%) | 380.4 (8.8%)
Disc | Speech/music discriminator | 816 | 2937 | 99.2 (46.3%) | 123.6 (6.9%)
Jpeg | Compression for still images | 524 | 1729 | 92.7 (55.8%) | 445.1 (10.3%)
Viterbi | A graphical Viterbi decoder | 622 | 2239 | 72.5 (32.9%) | 150.8 (9.8%)
Rasta | Speech recognition | 498 | 1424 | 118.1 (50.7%) | 219.5 (6.2%)
3Step-log | Logarithmic search motion est. | 127 | 396 | 15.2 (62.4%) | 107.4 (5.7%)
Full-search | Full search motion est. | 136 | 448 | 13.5 (48.0%) | 95.6 (12.3%)
Hier | Hierarchical motion est. | 138 | 503 | 20.4 (56.3%) | 151.9 (7.3%)
Phods | Parallel hierarchical motion est. | 128 | 440 | 16.7 (66.6%) | 111.3 (10.4%)
Epic | Image data compression | 1144 | 4516 | 103.9 (30.7%) | 420.4 (6.1%)
Lame | MP3 encoder | 2062 | 7526 | 80.1 (55.0%) | 272.1 (9.0%)
FFT | Fast Fourier transform | 416 | 1747 | 87.2 (55.9%) | 253.3 (7.4%)

Cont.

[Figure: Link utilization]

[Figure: Percentage reductions in leakage energy consumption]

Cont.

[Figure: Percentage increases in network cycles and overall execution time]

Cont.

[Figure: Sensitivity to the number of nodes (Scheme I). The results with Scheme II are similar.]

[Figure: Sensitivity to the input size (Scheme I). The results with Scheme II are similar.]

Conclusion

The proposed approach confines link usage to a small set of links in order to lengthen the idle periods of the remaining links

Hardware link power management schemes are more effective when used together with the proposed technique
