profile-driven energy reduction in network-on-chips

Profile-Driven Energy Reduction in Network-on-Chips, 8383 – 2nd Presentation, Ranya Alawadhi


Profile-Driven Energy Reduction in Network-on-Chips. 8383 – 2nd Presentation. Ranya Alawadhi. Source: Li, F., Chen, G., Kandemir, M., and Kolcu, I. 2007. Profile-driven energy reduction in network-on-chips. SIGPLAN Not. 42, 6 (Jun. 2007), 394-404.

Page 1: Profile-Driven Energy Reduction in Network-on-Chips

Profile-Driven Energy Reduction in Network-on-Chips

8383 – 2nd Presentation

Ranya Alawadhi

Page 2: Profile-Driven Energy Reduction in Network-on-Chips

CSE 8383

Source

Li, F., Chen, G., Kandemir, M., and Kolcu, I. 2007. Profile-driven energy reduction in network-on-chips. SIGPLAN Not. 42, 6 (Jun. 2007), 394-404.

Page 3: Profile-Driven Energy Reduction in Network-on-Chips

Agenda

Motivation

Contribution

Introduction

The Technique

Results

Conclusion

Page 4: Profile-Driven Energy Reduction in Network-on-Chips

Motivation

Increasing on-chip power consumption demands power-aware designs

Recent research shows that voltage/frequency scaling on communication links and shutting down idle links can significantly reduce NoC power consumption

These techniques work best when communication links have long idle periods

Page 5: Profile-Driven Energy Reduction in Network-on-Chips

Contribution

A profile-driven compiler optimization that lengthens the idle periods of communication links in a two-dimensional on-chip mesh network by maximizing communication link reuse

Page 6: Profile-Driven Energy Reduction in Network-on-Chips

Introduction

Targeted applications: array/loop-intensive embedded programs

Targeted NoC: two-dimensional mesh used by a single application at a time

Data tested: 12 data-intensive embedded applications

Results: reduces leakage energy by more than 35% on average (compared with a pure hardware-based link power management scheme)

Page 7: Profile-Driven Energy Reduction in Network-on-Chips

Architectural model: Network Abstraction

[Figure: two-dimensional mesh of switches (S); each switch is attached to a node made up of a network interface (NI), a CPU, and a memory. Switch detail: a crossbar with input and output buffers, North, South, East, and West ports, and an interface to the local processing unit.]

Page 8: Profile-Driven Energy Reduction in Network-on-Chips

Architectural model: Hardware support for Compiler-Directed Message Routing

The compiler attaches routing information to each message-send operation in the code

The switch design was extended to handle two types of routing schemes: default X-Y routing and compiler-directed routing
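As a reference point for the default scheme, X-Y (dimension-order) routing can be sketched as below. This is a minimal sketch rather than the switch logic from the paper: the function name `xy_route`, the (x, y) coordinate convention, and the choice that North increases y are illustrative assumptions.

```python
def xy_route(src, dst):
    """Return the hops (E/W/N/S) taken by X-Y routing from src to dst.

    Assumption: nodes are (x, y) pairs, East increases x, North increases y.
    """
    (xs, ys), (xd, yd) = src, dst
    hops = []
    # First travel along the X dimension until the column matches.
    hops += ["E" if xd > xs else "W"] * abs(xd - xs)
    # Then travel along the Y dimension until the row matches.
    hops += ["N" if yd > ys else "S"] * abs(yd - ys)
    return hops
```

Because the route is fully determined by source and destination, the default scheme needs only the destination in the header; compiler-directed routing has to carry the hop choices explicitly.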

Page 9: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

Packet header:

Flag: indicates which routing mechanism to use

flag = 0: X-Y routing

Header layout: | Flag | Destination | . . . |

Page 10: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

flag (1 bit): 1 selects compiler-directed routing

counter (4 bits): number of hops along the path

orientation (2 bits): used along with the routing command sequence

routing command sequence (13 bits): tells the switch which output port to forward the packet to

Header layout: | Flag | Counter | Orientation | Routing command sequence |

Orientation       00  00  01  01  10  10  11  11

Routing Command   0   1   0   1   0   1   0   1

Routing Decision  N   E   N   W   S   E   S   W
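The header fields and the orientation table above are enough to decode a route. The following is a minimal sketch, assuming the command sequence is read left to right with one bit per hop and that bits beyond the hop count are padding; the name `decode_header` and the space-separated string format are hypothetical.

```python
# Orientation selects which pair of directions the 1-bit commands refer to
# (values taken from the orientation table on this slide).
DECISION = {
    "00": {"0": "N", "1": "E"},
    "01": {"0": "N", "1": "W"},
    "10": {"0": "S", "1": "E"},
    "11": {"0": "S", "1": "W"},
}


def decode_header(header):
    """Decode a 20-bit compiler-directed header into a list of output ports.

    Assumption: commands are consumed left to right, one bit per hop.
    """
    bits = header.replace(" ", "")
    if len(bits) != 20 or bits[0] != "1":
        raise ValueError("expected a 20-bit header with flag = 1")
    hops = int(bits[1:5], 2)   # 4-bit counter
    orientation = bits[5:7]    # 2-bit orientation
    commands = bits[7:20]      # 13-bit routing command sequence
    if hops > len(commands):
        raise ValueError("counter exceeds the 13 available commands")
    return [DECISION[orientation][c] for c in commands[:hops]]
```

Under these assumptions, the header `1 0110 11 0001110000000` given for send1,3 on the Code Rewriter slide decodes to six hops: S, S, S, W, W, W.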

Page 11: Profile-Driven Energy Reduction in Network-on-Chips

Optimizing Link Reuse

[Figure: optimization flow. Components: Parallel Code, Profiler, Link Reuse Optimizer, Code Rewriter, Optimized Parallel Code. Intermediate data: Communication Graph, Link Signature, Modified Communication Graph, Optimized Link Signature.]

Page 12: Profile-Driven Energy Reduction in Network-on-Chips

Network State and Link Signature

A parallel program consists of $n$ parallel threads $P_1, P_2, \ldots, P_n$

$P_p$ is scheduled to run on the $p$th mesh node

Communication command (CC): a send operation

$C_p = \{M_{1,p}, M_{2,p}, \ldots, M_{k,p}, \ldots, M_{q,p}\}$

$C_p$: set of CCs in the program code of $P_p$

$M_{k,p}$: $k$th CC in the code of $P_p$

$q$: total number of CCs in the program code of $P_p$

Network state: set of messages under transmission

$S_i = \{M_{k,p} \mid \text{a message sent by } M_{k,p} \text{ is in transmission over the mesh}\}$

$S_0 = \emptyset$ represents the state in which no message is in transmission

Page 13: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

Link utilization vector (LUV): $u_{k,p}$ is a vector whose $j$th element gives the number of packets sent by $M_{k,p}$ and transferred through the $j$th communication link of the mesh

Link signature (LS): represents the link utilization at a network state $S_i$: $s_i = \sum_{M_{k,p} \in S_i} u_{k,p}$

$\Theta(\cdot)$: a function that returns the set of links used by an LS or LUV

Page 14: Profile-Driven Energy Reduction in Network-on-Chips

Example

[Figure: 2x2 mesh with nodes P0, P1, P2, P3, directed links l0,1, l1,0, l2,3, l3,2, l0,2, l2,0, l1,3, l3,1, and messages m1,0, m1,1, m1,2.]

Links      l0,1  l1,0  l2,3  l3,2  l0,2  l2,0  l1,3  l3,1

u_{1,0} = (20     0     0     0     0     0     20    0)

u_{1,1} = (0      0     0     0     0     0     20    0)

u_{1,2} = (0      0     20    0     0     0     0     0)

s_1     = (20     0     20    0     0     0     40    0)

$S_1 = \{M_{1,0}, M_{1,1}, M_{1,2}\}$
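The numbers in this example can be checked mechanically. The sketch below assumes the link signature of a state is the element-wise sum of the LUVs of the messages in that state, which is what the example's numbers show; the helper names `link_signature` and `theta` are illustrative.

```python
# Link order used by the vectors in the example.
LINKS = ["l0,1", "l1,0", "l2,3", "l3,2", "l0,2", "l2,0", "l1,3", "l3,1"]


def link_signature(luvs):
    """Element-wise sum of the LUVs of all messages in a network state."""
    return [sum(column) for column in zip(*luvs)]


def theta(vector):
    """Set of link names with a non-zero entry in an LS or LUV."""
    return {LINKS[j] for j, packets in enumerate(vector) if packets > 0}


u_1_0 = [20, 0, 0, 0, 0, 0, 20, 0]
u_1_1 = [0, 0, 0, 0, 0, 0, 20, 0]
u_1_2 = [0, 0, 20, 0, 0, 0, 0, 0]
s_1 = link_signature([u_1_0, u_1_1, u_1_2])
```

Here `s_1` reproduces the vector (20, 0, 20, 0, 0, 0, 40, 0), and `theta(s_1)` contains the three links l0,1, l2,3, and l1,3.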

Page 15: Profile-Driven Energy Reduction in Network-on-Chips

Communication Graph

The network transitions from a state $S_i$ to another state $S_j$ in two situations:

1. A new message is sent by $M_{k,p}$: $S_j = S_i \cup \{M_{k,p}\}$

2. A message sent by $M_{k,p}$ arrives at its destination node: $S_j = S_i \setminus \{M_{k,p}\}$

Page 16: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

Communication graph (CG):

Captures the communication behavior of a program

Undirected graph

Vertex: a network state

Edge ($S_i$, $S_j$): a transition between $S_i$ and $S_j$

Weight ($W_{i,j}$): number of transitions taking place between $S_i$ and $S_j$

Built through profiling

[Figure: a communication graph built up step by step, ending with vertices S1 to S5 and edge weights 300, 500, 400, 200, 100, 300.]

Page 17: Profile-Driven Energy Reduction in Network-on-Chips

Profiling

The profiler keeps track of the current network state $S_i$

The program notifies the profiler each time a node sends a message or a message arrives at its destination

When a notification is received, the profiler computes the new state $S_j$ and increments the value of $W_{i,j}$
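The profiling step can be sketched as a small state tracker. This is an illustration under stated assumptions, not the paper's profiler: the class name `Profiler`, the string identifiers for communication commands, and the assumption that each CC has at most one message in flight at a time are all hypothetical.

```python
from collections import Counter


class Profiler:
    """Tracks the current network state and counts state transitions."""

    def __init__(self):
        self.state = frozenset()  # S0: no message in transmission
        self.weights = Counter()  # weight per undirected edge {Si, Sj}

    def _move_to(self, new_state):
        # The CG is undirected, so the edge key ignores direction.
        self.weights[frozenset([self.state, new_state])] += 1
        self.state = new_state

    def on_send(self, cc):
        """A message is sent by communication command cc: Sj = Si U {cc}."""
        self._move_to(self.state | {cc})

    def on_arrive(self, cc):
        """A message sent by cc reaches its destination: Sj = Si - {cc}."""
        self._move_to(self.state - {cc})
```

After a profiling run, the keys of `weights` are the CG edges and the distinct states appearing in those keys are the CG vertices.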

Page 18: Profile-Driven Energy Reduction in Network-on-Chips

Restate the problem

When going from one state to another at runtime, the goal is to reuse the same set of links as much as possible

Each vertex in a CG has a default link signature (obtained using the default X-Y routing)

The compiler’s task is to re-assign link signatures to vertices

Page 19: Profile-Driven Energy Reduction in Network-on-Chips

Traversing a Communication Graph

Traversing network states to assign them new link signatures:

1. Start with the edge with the largest weight

2. Perform the signature re-assignment on the associated vertices

3. Select the next edge:

Scheme I: the one with the largest weight among the edges that are incident on the selected vertices

Scheme II: the one with the largest weight among all the remaining edges

4. Perform the signature re-assignment

5. Repeat 3 and 4 until all vertices are processed

Page 20: Profile-Driven Energy Reduction in Network-on-Chips

Example

[Figure: the communication graph with vertices S1 to S5 and edge weights 300, 500, 400, 200, 100, 300, shown at successive traversal steps.]

Scheme I                         Scheme II

Step  Vertex-pair considered     Step  Vertex-pair considered

1     S1, S2                     1     S1, S2

2     S1, S3                     2     S3, S4

3     S3, S4                     3     S4, S5

4     S4, S5
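The two edge-selection schemes can be sketched as follows. The transcript does not preserve which edge carries which weight, so the `EDGES` assignment below is a hypothetical one chosen to be compatible with the orders in the table above. Skipping edges whose endpoints have both been processed is also an assumption, as is the function name `traversal_order`.

```python
# Hypothetical assignment of the six weights to edges (not from the slides).
EDGES = {
    ("S1", "S2"): 500, ("S1", "S3"): 300, ("S3", "S4"): 400,
    ("S4", "S5"): 300, ("S2", "S5"): 200, ("S3", "S5"): 100,
}


def traversal_order(edges, scheme):
    """Return the vertex pairs in the order they are considered.

    scheme "I":  largest edge incident on the already-selected vertices.
    scheme "II": largest edge among all remaining edges.
    Assumption: an edge whose two endpoints are both processed is skipped.
    """
    vertices = {v for pair in edges for v in pair}
    done, order = set(), []
    while done != vertices:
        candidates = [
            (w, pair) for pair, w in edges.items()
            if not set(pair) <= done  # must bring in an unprocessed vertex
            and (scheme == "II" or not done or set(pair) & done)
        ]
        if not candidates:  # Scheme I on a disconnected graph: restart
            candidates = [(w, p) for p, w in edges.items()
                          if not set(p) <= done]
        _, pair = max(candidates, key=lambda c: c[0])
        order.append(pair)
        done |= set(pair)
    return order
```

With this edge assignment, `traversal_order(EDGES, "I")` visits (S1, S2), (S1, S3), (S3, S4), (S4, S5), and `traversal_order(EDGES, "II")` visits (S1, S2), (S3, S4), (S4, S5), matching the two columns of the table.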

Page 21: Profile-Driven Energy Reduction in Network-on-Chips

Routing Flexibility

Only the shortest paths are considered for re-routing messages

Number of possible unique shortest paths: $\binom{m+n}{m}$, for source $(x_s, y_s)$ and destination $(x_d, y_d)$ with $m = |x_d - x_s|$ and $n = |y_d - y_s|$

Alternate link utilization vectors (ALUV): the set $A_{i,p}$ of LUVs for all alternate (shortest) paths available to a message sent by $M_{i,p}$

Re-routing: replacement of the current LUV of an $M_{i,p}$ with a new LUV selected from the corresponding ALUV set

Routing flexibility = $|A_{i,p}|$ (i.e., the number of alternate link utilization vectors in an ALUV set)
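The shortest-path count can be confirmed by enumeration: a shortest path is a sequence of m horizontal and n vertical hops, so choosing the positions of the horizontal hops fixes the path. The sketch below assumes (x, y) coordinates with East increasing x and North increasing y; `shortest_paths` is an illustrative name.

```python
from itertools import combinations
from math import comb


def shortest_paths(src, dst):
    """Enumerate every shortest path from src to dst as a string of hops."""
    (xs, ys), (xd, yd) = src, dst
    m, n = abs(xd - xs), abs(yd - ys)
    horizontal = "E" if xd >= xs else "W"
    vertical = "N" if yd >= ys else "S"
    paths = []
    # Choose which m of the m + n hops are horizontal; the rest are vertical.
    for positions in combinations(range(m + n), m):
        chosen = set(positions)
        paths.append("".join(horizontal if i in chosen else vertical
                             for i in range(m + n)))
    return paths


# The enumeration agrees with the closed form C(m + n, m).
assert len(shortest_paths((0, 0), (2, 1))) == comb(2 + 1, 2)
```

For a message from (0, 0) to (2, 1) this yields the three paths EEN, ENE, and NEE, so its routing flexibility is 3.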

Page 22: Profile-Driven Energy Reduction in Network-on-Chips

Problem formulation

The routing associated with a CC can be changed only once

Selecting the new utilization vectors should not degrade performance relative to the default routing scheme

Selecting alternate re-routings can increase network contention

A performance constraint was therefore introduced for re-routing: avoid increasing the value of the largest entry in any original link signature

For example:

Default LS: (10, 40, 10, 10, 0, 0, 0, 0)

Undesirable alternative: (10, 50, 0, 10, 0, 0, 0, 0)

Accepted alternative: (40, 20, 10, 0, 0, 0, 0, 0)
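The performance constraint reduces to a comparison of maxima. A minimal sketch, with the hypothetical name `satisfies_constraint`, applied to the three signatures from the example:

```python
def satisfies_constraint(default_ls, candidate_ls):
    """True if the candidate does not raise the largest entry of the default LS."""
    return max(candidate_ls) <= max(default_ls)


default_ls = [10, 40, 10, 10, 0, 0, 0, 0]
undesirable = [10, 50, 0, 10, 0, 0, 0, 0]
accepted = [40, 20, 10, 0, 0, 0, 0, 0]
```

The undesirable alternative is rejected because its largest entry (50) exceeds the default maximum (40). The accepted alternative moves load between links but keeps the maximum at 40, and it uses three links instead of four.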

Page 23: Profile-Driven Energy Reduction in Network-on-Chips

Heuristic

For each $M_{i,p}$ in network states $S_a$ and $S_b$ that has not yet been assigned a new routing, calculate its LUV and ALUV

Calculate the LS of $S_a$ and $S_b$

Compute num_links (total number of links used in $S_a$ and $S_b$)

Sort the CCs in $S_a$ and $S_b$ into a sequence with ascending routing flexibility

Start with the CC that has the lowest routing flexibility and assign a proper route to it

Assign appropriate routes to the remaining CCs, one by one, until all commands in $S_a$ and $S_b$ are processed

Goal: reduce the num_links value as much as possible under the performance constraint

Page 24: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

Choosing a route for an $M_{i,p}$ (selecting a new LUV for $M_{i,p}$ by considering all the re-routing options captured in $A_{i,p}$):

For each alternate re-routing, check whether the performance constraint is satisfied with respect to state $S_a$

If the performance constraint is met, compute the new link signature for the state that the command belongs to

Recalculate num_links

Select the re-routing that leads to the minimum num_links value

Once a CC is given a new LUV, it is not considered again when processing the other vertex pairs
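The route-selection steps for one vertex pair can be sketched as a greedy loop. This is an illustration under assumptions, not the authors' implementation: all names are hypothetical, LUVs are plain lists indexed by link, the constraint is checked against every state of the pair that contains the command, and ties keep the current route.

```python
def signature(state, luv):
    """Link signature of a state: element-wise sum of its messages' LUVs."""
    return [sum(col) for col in zip(*(luv[cc] for cc in state))]


def num_links(states, luv):
    """Number of distinct links used across the given states."""
    used = set()
    for state in states:
        used |= {j for j, x in enumerate(signature(state, luv)) if x > 0}
    return len(used)


def reroute_pair(sa, sb, luv, alts, assigned):
    """Greedily re-route the unassigned CCs of states sa and sb (in place).

    luv maps each CC to its current LUV, alts maps each CC to its ALUV set,
    assigned is the set of CCs that already received a new route.
    """
    states = [sa, sb]
    # Largest entry of each original signature: the performance bound.
    bound = [max(signature(s, luv), default=0) for s in states]
    pending = sorted((set(sa) | set(sb)) - assigned,
                     key=lambda cc: (len(alts[cc]), cc))
    for cc in pending:  # lowest routing flexibility first
        best, best_links = luv[cc], num_links(states, luv)
        for option in alts[cc]:
            trial = {**luv, cc: option}
            ok = all(max(signature(s, trial)) <= b
                     for s, b in zip(states, bound) if cc in s)
            links = num_links(states, trial)
            if ok and links < best_links:
                best, best_links = option, links
        luv[cc] = best
        assigned.add(cc)  # a CC is re-routed only once
    return num_links(states, luv)
```

A caller would invoke `reroute_pair` once per vertex pair, in the order produced by Scheme I or Scheme II, sharing the same `luv` and `assigned` objects across calls so that earlier decisions are not revisited.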

Page 25: Profile-Driven Energy Reduction in Network-on-Chips

Complexity

Computational complexity: $O\left(N \cdot K \cdot \binom{m+n}{m}\right)$

$N$: number of network states

$K$: number of send operations

$\binom{m+n}{m}$: largest routing flexibility in an $m \times n$ mesh

Page 26: Profile-Driven Energy Reduction in Network-on-Chips


Example

Page 27: Profile-Driven Energy Reduction in Network-on-Chips


Cont.

Page 28: Profile-Driven Energy Reduction in Network-on-Chips


Cont.

Page 29: Profile-Driven Energy Reduction in Network-on-Chips

Code Rewriter

Responsible for providing a version of the message-send operation that incorporates the compiler-determined routing information

Message header for send1,3: 1 0110 11 0001110000000

Message header for send1,7: 1 0100 11 1010000000000
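Header strings of this shape can be produced from a hop list using the field widths and orientation table given earlier in the deck. The sketch below is an assumption-laden illustration: the name `encode_header`, the left-to-right command order, the zero padding of unused command bits, and the choice of the first orientation that covers the hops are not confirmed by the slides, although they are consistent with both headers shown.

```python
# Direction pair addressed by each orientation: (command 0, command 1).
ORIENTATIONS = {
    "00": ("N", "E"), "01": ("N", "W"),
    "10": ("S", "E"), "11": ("S", "W"),
}


def encode_header(hops):
    """Build a compiler-directed header string for a list of hops (N/S/E/W)."""
    if len(hops) > 13:
        raise ValueError("the 13-bit command sequence holds at most 13 hops")
    for orientation, pair in ORIENTATIONS.items():
        if set(hops) <= set(pair):
            break
    else:
        raise ValueError("hops must fit one vertical and one horizontal direction")
    # One command bit per hop, padded with zeros to 13 bits.
    commands = "".join(str(pair.index(h)) for h in hops).ljust(13, "0")
    # Fields: flag (1 bit), counter (4 bits), orientation (2 bits), commands.
    return f"1 {len(hops):04b} {orientation} {commands}"
```

Under these assumptions, the hop list S, S, S, W, W, W encodes to the send1,3 header above, and W, S, W, S encodes to the send1,7 header.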

Page 30: Profile-Driven Energy Reduction in Network-on-Chips

Handling Deadlocks

Re-routings change the behavior of the default X-Y routing scheme

An acyclic channel dependency graph is the necessary and sufficient condition for avoiding deadlocks

A deadlock handling routine that breaks cycles within the channel dependency graph is incorporated. It:

Reduces the probability of experiencing a deadlock at runtime

Cannot completely eliminate deadlocks, so the dynamic, hardware-supported deadlock avoidance rule employed by the Alpha 21364 network architecture is used

Handling deadlocks when they occur costs both extra latency and extra power
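Checking whether a channel dependency graph is acyclic is a standard depth-first search. The sketch below is a generic cycle test, not the deadlock handling routine from the paper; the adjacency-dictionary representation and the name `has_cycle` are assumptions.

```python
def has_cycle(cdg):
    """Detect a cycle in a dependency graph given as {channel: [next channels]}."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in cdg.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: a cycle exists
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(cdg))
```

A re-routing pass could run such a test after each route change and reject or adjust a route that introduces a cycle, which is one way to read "breaking cycles" on this slide.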

Page 31: Profile-Driven Energy Reduction in Network-on-Chips

Experiments

1) Simulation Environment and Benchmarks

Implemented a flit-level on-chip interconnection network simulator

Network: 5x5 configuration

Link speed: 1 Gb/sec

Switch input port buffer size: 64 flits

Flit: 128 bits

Packet size: 16 flits

The communication links can be shut down independently, using a time-out based mechanism

Time-out counter threshold: 1.5 μsec (based on preliminary analysis)

Time to switch a link to the active state: 1 μsec

Energy overhead of switching: 140 μJ
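The effect of the time-out mechanism on a single link can be sketched with the parameters above. This is a simplified model under assumptions: a link is shut down once it has been idle for the threshold, it is woken when the idle period ends, and the wake-up latency is not modelled. The function name `shutdown_stats` is hypothetical.

```python
TIMEOUT_US = 1.5          # idle time before a link is shut down
WAKEUP_ENERGY_UJ = 140.0  # energy overhead of switching back to active


def shutdown_stats(idle_periods_us):
    """Return (time spent off in us, number of wake-ups, wake-up energy in uJ)."""
    off_time, wakeups = 0.0, 0
    for idle in idle_periods_us:
        if idle > TIMEOUT_US:
            # The link stays on while the counter runs, then is off for the rest.
            off_time += idle - TIMEOUT_US
            wakeups += 1
    return off_time, wakeups, wakeups * WAKEUP_ENERGY_UJ
```

In this model, two idle periods of 2.0 μsec give 1.0 μsec off with two wake-ups, while one idle period of 4.0 μsec gives 2.5 μsec off with a single wake-up. This is why lengthening idle periods helps the hardware scheme.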

Page 32: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

Experiments were performed with three different versions of each benchmark:

1. Default routing

2. Scheme I

3. Scheme II

All versions use the underlying hardware-based link shutdown scheme.

Page 33: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

Code sizes: 63 to 8,612 lines of C

Dataset sizes: 68.9 KB to 1,866.4 KB

Increase in compilation time (including profiling): 89% (3Step-log) to 236% (Lame)

No deadlock was observed

Page 34: Profile-Driven Energy Reduction in Network-on-Chips

Results

Benchmark    Brief Description                 CG Nodes  CG Edges  Network Energy  Network Latency

Morph2       Morphological operations          338       1081      75.5 (64.9%)    380.4 (8.8%)

Disc         Speech/music discriminator        816       2937      99.2 (46.3%)    123.6 (6.9%)

Jpeg         Compression for still images      524       1729      92.7 (55.8%)    445.1 (10.3%)

Viterbi      A graphical Viterbi decoder       622       2239      72.5 (32.9%)    150.8 (9.8%)

Rasta        Speech recognition                498       1424      118.1 (50.7%)   219.5 (6.2%)

3Step-log    Logarithmic search motion est.    127       396       15.2 (62.4%)    107.4 (5.7%)

Full-search  Full search motion est.           136       448       13.5 (48.0%)    95.6 (12.3%)

Hier         Hierarchical motion est.          138       503       20.4 (56.3%)    151.9 (7.3%)

Phods        Parallel hierarchical motion est. 128       440       16.7 (66.6%)    111.3 (10.4%)

Epic         Image data compression            1144      4516      103.9 (30.7%)   420.4 (6.1%)

Lame         MP3 encoder                       2062      7526      80.1 (55.0%)    272.1 (9.0%)

FFT          Fast Fourier transform            416       1747      87.2 (55.9%)    253.3 (7.4%)

Page 35: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

[Figure: link utilization]

[Figure: percentage reductions in leakage energy consumption]

Page 36: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

[Figure: percentage increases in network cycles and overall execution time]

Page 37: Profile-Driven Energy Reduction in Network-on-Chips

Cont.

[Figure: sensitivity to the number of nodes (Scheme I). The results with Scheme II are similar.]

[Figure: sensitivity to the input size (Scheme I). The results with Scheme II are similar.]

Page 38: Profile-Driven Energy Reduction in Network-on-Chips

Conclusion

The proposed approach confines link usage to a small set of links, which increases the idle periods of the remaining links

Hardware schemes are more effective when used with the proposed technique