IEEE 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007)



Reducing Energy Consumption of On-Chip Networks Through a Hybrid Compiler-Runtime Approach∗

Guangyu Chen, Microsoft Corporation

[email protected]

Feihui Li and Mahmut Kandemir, Computer Science and Engineering Department,

Pennsylvania State University, {feli,kandemir}@cse.psu.edu

Abstract

This paper investigates a compiler-runtime approach for reducing power consumption in the context of Network-on-Chip (NoC) based chip multiprocessor (CMP) architectures. Our proposed approach is based on the observation that the same communication patterns across the nodes of a mesh based CMP repeat themselves in successive iterations of a loop nest. The approach collects the link usage statistics during the execution of the first few iterations of a given loop nest and computes the slack (allowable delay) for each communication transaction. This information is subsequently utilized in selecting the most appropriate voltage levels for the communication links (and the corresponding frequencies) in executing the remaining iterations of the loop nest. The results with the benchmarks from the MediaBench suite show that not only does this hybrid approach generate better energy savings than a pure hardware-directed voltage scaling scheme, but it also leads to much less performance degradation than the latter. Specifically, the average energy savings achieved by the pure hardware based scheme and our approach are 24.9% and 38.1%, respectively, and the corresponding performance overhead numbers are 8.3% and 2.1%. Our results also show that the hybrid approach generates much better savings than two recently proposed pure compiler based schemes. In addition, our experimental evaluation indicates that the energy savings obtained through the proposed approach are very close to optimal savings (within 3%) under the same performance bound.

1 Introduction

NoC (Network-on-Chip) architectures emerged as an alternative solution to on-chip point-to-point buses in complex SoC designs. They contain on-chip routers that support communication between different computation blocks, and are expandable in the sense that they can be reconfigured to handle different communication patterns (which would be very costly to handle in the case of fixed point-to-point buses). In addition, they can easily respond to fault conditions where one or more connections are disabled. NoC architectures are also promising from the signal synchronization viewpoint since the routers can act as pipeline stages across long wires.

∗This work is supported in part by NSF Career Award #0093082, and a grant from GSRC.

Power consumption of NoC is a critical problem, as previous research [21] reports that NoC can be responsible for up to 36% of the overall power consumption of a SoC. The results from our own experiments also show that NoC can contribute a large fraction of overall on-chip power, as shown in Figure 1 for the MediaBench applications. Consequently, there have been several efforts in the past five years or so in addressing the power consumption of NoC based systems. Note that reducing power consumption of NoC based systems is particularly important for battery-operated embedded computing systems. Past research has studied this power problem from different angles, including application mapping, data encoding, leakage optimizations, and link voltage scaling.

This paper investigates automated compiler support for reducing power consumption of an NoC based two-dimensional mesh architecture that uses a static (deterministic) routing algorithm. In our approach, the NoC is exposed to the compiler through an interface. The goal is to let the compiler modify the application source code and manage power consumption of the communication links through voltage scaling. Prior research [30, 33, 20] showed that dynamic scaling of the voltage/frequency of communication links is an effective way to reduce energy consumption of NoCs. The key to the success of dynamic voltage scaling (DVS) is to scale the voltage/frequency of each communication link to the right level at the right time. Our proposed approach is based upon the observation that the same communication patterns across the nodes of the NoC tend to repeat themselves across the successive iterations of a loop nest. The approach takes advantage of this observation by collecting link usage statistics during the execution of the first few iterations of a given loop nest and computing the allowable delays (slacks) for each communication transaction and the communication bandwidths of the links


Figure 1. The distribution of energy consumption between computation (including instruction executions and memory accesses) and communication (energy consumed in the network links and routers) on a 6 × 6 mesh under 0.10μm.

that are not fully utilized. This information is subsequently utilized in selecting the most appropriate voltage levels for the communication links (and the corresponding frequencies) in executing the remaining iterations of the loop nest. In other words, this compiler-runtime hybrid approach divides the execution of each loop nest that encloses communication among the mesh nodes into two parts. In the first part (called the "startup phase"), we gather statistics on link usage at runtime through a hardware-supported interface, and in the second part (called the "stable phase"), we use this collected information to reduce the link voltage levels as much as possible without affecting communication latency.

We implemented this approach using an optimizing compiler. Figure 2 illustrates the interaction of our hybrid approach with the rest of compilation. We performed experiments with the applications in the MediaBench suite [2] and compared our experimental results to a pure hardware-based link voltage/frequency scaling scheme. The experimental results reveal that the hybrid approach generates better energy savings than the hardware-based scheme and incurs much less performance overhead. Specifically, with the default values of our simulation parameters and including all runtime overheads, the average energy savings achieved by the pure hardware based scheme and our approach are 24.9% and 38.1%, respectively, and the corresponding performance overhead numbers are 8.3% and 2.1%. Our results also show that the hybrid approach generates much better savings than two recently proposed pure compiler based schemes. In addition, our experimental evaluation indicates that the energy savings obtained through the proposed approach are very close to optimal savings (within 3%) under the same performance bound.

The remainder of this paper is structured as follows. In the next section, we discuss the options for scaling down the voltages of communication links in an on-chip mesh network without degrading the performance of the application. The details of our hybrid approach are presented in Section 3. An experimental evaluation of the proposed approach and a comparison with prior work are given in Section 4. Related work is discussed in Section 5. Finally, our conclusions are presented in Section 6.

Figure 2. Interaction of our hybrid approach with the rest of compilation. A sequential code is first parallelized into multiple processes before applying our approach.

2 Link Voltage Scaling Options

2.1 On-Chip Mesh Network

In this paper, we focus on an M × N (M rows, N columns) mesh architecture as depicted in Figure 3 for M = N = 3. Each node of this mesh consists of a processor, a small memory module, and a switch. The memory module is assumed to be divided into instruction memory and data memory components. Since our focus is on the network power, we assume that both instruction and data memory components are under the compiler's control. The node at the ith (i = 0, 1, ..., M − 1) row and jth (j = 0, 1, ..., N − 1) column is labeled with an integer ID: i × N + j. Figure 4 gives the structure of a switch in this mesh. Each switch has five in-coming ports and five out-going ports. The first in-coming port and the first out-going port are connected to the local processor (in the same node as this switch). The remaining four in-coming ports and four out-going ports, on the other hand, are connected to the switches in the neighboring nodes via a set of wires. Each in-coming port contains a queue to buffer the messages that cannot be immediately forwarded to the next link. When this queue is full, the out-going port (in a neighboring switch) that is connected to this in-coming port is blocked. The switch also provides a control interface that allows the local processor to read and set the state of each in-coming/out-going port (we will elaborate on this issue further when we discuss our link control interface in Section 3.1).

We use link(i, j) to denote the directed physical connection from a node (Ni) to one of its neighbors (Nj), i.e., the communication link from Ni to Nj. We refer to nodes Ni and Nj as the sender and receiver of link(i, j), respectively. We assume that each pair of adjacent nodes, Ni and Nj, are connected by a pair of links, namely, link(i, j) and link(j, i). Each link consists of a pair of ports (an out-going port of the sender switch and an in-coming port of the receiver switch) and the wires that connect these two ports.


A parallel program consists of a set of parallel processes running on different nodes of the mesh. A process sends messages to another process through a logical connection (or connection for short). A logical connection consists of multiple links if the sender and receiver processes are running on two nodes that are not adjacent to each other. In this paper, we use C(s, d) to denote the set of links in the connection from the source node Ns to the destination node Nd. The set of links used in a connection depends on the packet routing algorithm used by the mesh. Although, in this paper, we assume that the mesh uses an XY-routing algorithm [13], our scheme is not bound to any particular routing algorithm; it can work with any static routing algorithm. Since, under a static routing scheme, a connection can be unambiguously identified by the set of communication links used in this connection, we also use C(s, d) to denote the connection from Ns to Nd.
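As a concrete illustration, the link set C(s, d) under XY routing can be computed as in the sketch below. This is our own rendering of standard dimension-order (XY) routing over the paper's node numbering (ID = i × N + j); the paper does not provide code, and the function name is ours.

```python
def xy_route(s, d, N):
    """Ordered list of directed links C(s, d) for XY routing on an
    M x N mesh whose nodes carry integer IDs i * N + j (row i, col j).
    Packets travel along the X dimension (columns) first, then Y (rows).
    This is a sketch of standard XY routing, not code from the paper."""
    si, sj = divmod(s, N)
    di, dj = divmod(d, N)
    links, cur = [], s
    # X phase: step one column at a time toward the destination column.
    step = 1 if dj > sj else -1
    for j in range(sj, dj, step):
        nxt = si * N + (j + step)
        links.append((cur, nxt))
        cur = nxt
    # Y phase: step one row at a time toward the destination row.
    step = 1 if di > si else -1
    for i in range(si, di, step):
        nxt = (i + step) * N + dj
        links.append((cur, nxt))
        cur = nxt
    return links
```

On the 3 × 3 mesh of Figure 3, a connection from node 0 to node 8 first crosses the top row and then descends the rightmost column.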

2.2 Scaling Link Voltage

A parallel program may require multiple connections during its execution, and these connections may share some communication links. The packets transferred over such connections have to contend for the shared links. Such contentions on the links may increase the transmission time of packets. In the rest of this section, we present two different schemes that take advantage of this observation. Our approach calculates the voltages of communication links using both methods described below and selects the lower voltage/frequency level suggested.

2.2.1 Link Throughput Based Voltage Scaling

Recall that a communication link consists of an out-going port and an in-coming port in two neighboring switches (see Figure 4). A packet transferred through a communication link first enters the buffer of the out-going port (in the sender switch), and then propagates to the in-coming port along the wires connecting these two ports. We define the data rate of a link as the maximum number of data packets that can be transferred from the out-going port to the in-coming port of this link during a unit of time. The data rate of a link is determined by the voltage of the out-going port and the in-coming port of this link. In our discussion, we use μi,j to denote the throughput and λi,j to denote the data rate of link(i, j). We define the throughput of a link as the number of packets that are forwarded from the in-coming port of this link to other links (in the same switch as the in-coming port) during a unit period of time. It should be noted that the throughput of a link is limited by the throughputs of the links to which this link is connected. For example, in Figure 5, in switch Sj, link(i, j) is connected to link(j, k1), link(j, k2), link(j, k3), and link(j, k4). Consequently, the maximum throughput of link(i, j) is limited by the throughputs of link(j, k1), link(j, k2), link(j, k3), and link(j, k4). Specifically, we have μi,j ≤ ∑r=1..4 pr μj,kr, where pr is the fraction of the packets (with respect to the total number of the packets transferred over link(i, j)) that are forwarded to link(j, kr).

In a network under heavy traffic, the contention on the bottleneck links can be severe. A link that forwards packets to a bottleneck link can be congested because the bottleneck link cannot accept the packets fast enough. If a link is congested, its input queue fills up; and, after that point, no more packets can flow into this link until at least one packet in the queue is forwarded to another link. As a result, the throughput of the congested link can be much lower than its data rate. Therefore, when congestion happens, while the bandwidth of the bottleneck links is fully utilized, the bandwidth of the congested links is underutilized. Based on this observation, one can reduce the voltages/frequencies of the congested links to conserve energy without significantly degrading the overall performance of the application. An indication of a communication link being congested during a given time period is that the queue associated with this link never becomes empty during this period. This link throughput based voltage scaling strategy operates as follows. If we find that, during a given period, the queue associated with link(i, j) never becomes empty and that μi,j < λi,j, we reduce the voltage of link(i, j) to the lowest level v such that f(v) ≥ μi,j, where f(v) is the maximum data rate that a communication link can provide at voltage level v. Since reducing the data rate of a link that is not congested may hurt the performance of the parallel application, we apply throughput based voltage scaling only to the congested links whose queues never become empty during a given period of time.
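The selection rule just described fits in a few lines. The sketch below is our illustration, not code from the paper; the table of (voltage, f(v)) pairs is a hypothetical input, since the available voltage levels are platform-specific.

```python
def select_voltage(levels, mu, lam, queue_was_empty):
    """Throughput-based scaling rule of Section 2.2.1 (our sketch).

    levels          -- list of (voltage v, max data rate f(v)) pairs,
                       sorted ascending by voltage (hypothetical table)
    mu, lam         -- measured throughput and current data rate of the link
    queue_was_empty -- True if the link's queue ever became empty

    Returns the voltage to switch to, or None to leave the link unchanged.
    """
    # Scale only congested links: queue never emptied and throughput
    # stayed below the data rate the link currently provides.
    if queue_was_empty or mu >= lam:
        return None
    # Ascending scan: the first level with f(v) >= mu is the lowest one.
    for v, rate in levels:
        if rate >= mu:
            return v
    return None
```

For a link whose queue never emptied, with μ = 1.5 packets/unit against λ = 4.0, the rule picks the lowest level whose data rate still covers 1.5.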

2.2.2 Link Slack Based Voltage Scaling

For the communication links whose queues may become empty during a given period of time, there can still be opportunities to scale down voltage/frequency without significantly degrading the system performance. Figure 6 illustrates such an example. Figure 6(a) shows a connection C(a, d) consisting of three links: link(a, b), link(b, c), and link(c, d). We assume that a packet, m1, is being transmitted along this connection. Let us further assume that, at time t0, link(a, b) starts transferring packet m1 to link(b, c).1 This transfer completes

1More precisely, we start transferring packet m1 from the in-coming port of link(a, b) to the in-coming port of link(b, c). This procedure involves two steps. In the first step, the cross-bar forwards packet m1 from the queue in the in-coming port of link(a, b) to the buffer in the out-going port of link(b, c). In the second step, m1 is transferred from the out-going port of link(b, c) to the in-coming port of link(b, c) through the wires connecting these two ports. The time spent in the first step is determined by the speed of the cross-bar, and the time spent in the second step is determined by the data rate of link(b, c). Since the time spent in the first step is much shorter than that in the second step, we can omit the delay due to the cross-bar without significantly affecting the results of our analysis.


Figure 3. A 3 × 3 mesh network.

Figure 4. The structure of a switch.

Figure 5. The maximum throughput μi,j is limited by the throughputs of link(j, k1), link(j, k2), link(j, k3), and link(j, k4).

at time t1, i.e., at this time, all the bits of packet m1 are stored in the queue of the in-coming port of link(b, c). Suppose now that link(c, d) is shared by multiple connections, and it is currently busy transferring a packet (mx) on behalf of some other connection. As a result, m1 has to wait in the queue of the in-coming port of link(b, c) until time t2, when link(c, d) finishes transferring mx. Figure 6(b) depicts the timing for this scenario. In this figure, we observe a gap between times t1 and t2. The length of this gap, t2 − t1, is referred to as the slack for packet m1 at link(b, c). The existence of slacks indicates that we can reduce the data rates (frequency) of some communication links in a connection without delaying the final delivery time of the packets transferred by this connection. For example, we can reduce the data rate of link(b, c) from 1/(t1 − t0) to 1/(t2 − t0) – and scale its voltage down – without delaying the time when m1 arrives at its destination Nd.
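The arithmetic of this example is simple enough to state directly. The helper below is our illustration of the slack computation; the variable names follow Figure 6 (t0, t1, t2) and are not taken from any code in the paper.

```python
def scaled_rate(t0, t1, t2):
    """Slack-based rate reduction of Section 2.2.2 (our sketch).

    A packet starts entering link(b, c) at t0, finishes arriving at t1,
    but cannot be forwarded until t2. The gap t2 - t1 is its slack, and
    the link could have run at 1/(t2 - t0) instead of 1/(t1 - t0)
    without delaying the packet's final delivery.

    Returns (slack, reduced data rate)."""
    slack = t2 - t1
    return slack, 1.0 / (t2 - t0)
```

With t0 = 0, t1 = 4, and t2 = 10, the packet sits idle for 6 time units, so the link's data rate can drop from 1/4 to 1/10 with no effect on delivery time.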

A communication link may transfer multiple packets. Based on the timing of these packets, the link slacks can be classified into two types. The first type of slack is shown in Figure 6(b). One can see from this figure that link(a, b) starts transferring the second packet, m2, after link(b, c) starts transferring the first packet, m1, to the next link. A link slack (for packet m1) of this type starts from the time point t1, when a packet is completely received by the in-coming port of a link, and it ends at the time point t2, when this link starts transferring this packet to the next link on this route. The second type of link slack is shown in Figure 6(c). One can see that link(a, b) starts transferring the second packet, m2, before link(b, c) starts transferring the first packet, m1, to the next link. A link slack (for packet m1) of this type starts from the time point t1, when a packet is completely received by the in-coming port of a link, and it ends at the time point t′0, when another packet starts to use the same link.

3 Hybrid Approach to Link Voltage Scaling

While our approach is applicable to any application, the specific application domain we focus on in this work is embedded computing. The proposed approach collects link usage

Figure 9. Overview of our approach. Our approach takes a parallelized program as input.

statistics during the execution of the first few iterations of a loop in each process (Section 3.1). This information is subsequently utilized in selecting the most appropriate voltage levels for the communication links in executing the remaining iterations of the loop (Section 3.2).

3.1 Hardware Support

In this subsection, we describe the hardware interface exposed to the compiler. Figure 7 shows the structure of a link from switch Si (on node Ni) to switch Sj (on node Nj). In both the out-going port and the in-coming port of this link, there is a voltage control logic that controls the voltage/frequency of the circuit in the corresponding port. The voltage/frequency of the in-coming port is controlled by the program running on the local processor through the control interface of the switch (see Figure 4). The voltage/frequency of the out-going port, however, is controlled by the voltage monitor. Specifically, in our approach, when the program running on node Nj sets the voltage/frequency of the in-coming port, the voltage monitor of the out-going port (on node Ni) that is connected to this in-coming port detects the voltage/frequency change in the in-coming port, and sets the voltage/frequency for the out-going port circuit accordingly.

An in-coming port also accommodates four registers (CK, QE, TR, and SL) that are accessible by the local processor through the control interface of the corresponding switch. The


(a) Packet m1 is being transmitted along connection C(a, d) = {link(a, b), link(b, c), link(c, d)}.

(b) Link slacks of the first type.

(c) Link slacks of the second type.

Figure 6. Link slacks.

Figure 7. Structure of a link.

Figure 8. State transitions for the slack counting logic. The text attached to an arrow gives the condition that triggers the corresponding transition.

content of register CK is increased by one at each clock cycle. This register counts the number of cycles elapsed since its last reset. QE is a flag register whose value is set to one when the in-coming port queue becomes empty. Once the value of QE is set to one, it does not switch back to zero until the program running on the local processor explicitly resets it. The content of register TR is increased by one when a packet in the queue is forwarded by the cross-bar. Therefore, this register keeps track of the throughput of the corresponding link. Finally, register SL counts the number of slack cycles of the corresponding link (slacks are explained in Section 2.2.2), and is controlled by a slack counting logic. When the slack counting logic is in the "count" state, SL is increased by one at each clock cycle; when the slack counting logic is in the "stop" state, the value of SL is not changed. Figure 8 shows the state transitions for the slack counting logic. The counting logic switches to the "count" state when a packet is completely received by the in-coming port; it switches to the "stop" state when the cross-bar of the switch starts forwarding a packet in the queue of the in-coming port to the buffer of an out-going port (for the first type of slack described in Section 2.2.2), or when a packet starts to use the Rx component of the in-coming port (for the second type of slack described in Section 2.2.2).
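The slack counting logic is a two-state machine and can be modeled in software. The class below is our functional sketch of the behavior described above and in Figure 8; it is not the paper's hardware, and the method names are ours.

```python
class SlackCounter:
    """Software model (our sketch) of the slack counting logic of Figure 8.
    Register SL increments on each clock cycle spent in the "count" state."""

    def __init__(self):
        self.state = "stop"
        self.SL = 0

    def clock(self):
        # One clock cycle: SL advances only while counting slack.
        if self.state == "count":
            self.SL += 1

    def packet_received(self):
        # A packet is completely received by the in-coming port -> count.
        self.state = "count"

    def packet_forwarded(self):
        # Cross-bar starts forwarding the packet (first slack type) -> stop.
        self.state = "stop"

    def rx_busy(self):
        # Another packet starts using the Rx component (second type) -> stop.
        self.state = "stop"
```

Driving the model with a receive event, three idle cycles, and a forward event leaves SL at 3, matching the first-type slack of Figure 6(b).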

We use the function call "getSwitchRegister(R, i)" to read the contents of register R (where R can be CK, QE, TR, or SL) of the ith in-coming port in the local switch, i.e., the switch which is in the same node as the processor that executes this function call. Similarly, we use the function call "setSwitchRegister(R, i, v)" to set the contents of register R of the ith in-coming port in the local switch, where v is the new value of the register. In our experimental evaluation, the overheads due to maintaining voltage monitors, counting slacks, and reading/updating the contents of the associated registers are accounted for.
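As an illustration of how the startup phase might interpret these registers at the end of a measurement period, consider the sketch below. It is our own stand-in: the values would really come from getSwitchRegister calls on the local switch, replaced here by a plain dictionary so the logic is self-contained.

```python
def analyze_port(regs):
    """Interpret the four per-port registers of Section 3.1 after a
    measurement period (our sketch). `regs` maps register name -> value,
    standing in for getSwitchRegister(R, i) reads.

    Returns (throughput per cycle, congested?). A link is flagged as
    congested when its queue never became empty, i.e., QE is still 0 --
    the same condition used by the throughput-based scheme of Section 2.2.1.
    """
    cycles = regs["CK"]               # cycles elapsed since last reset
    throughput = regs["TR"] / cycles  # packets forwarded per cycle
    congested = (regs["QE"] == 0)     # queue never emptied during the period
    return throughput, congested
```

A port that forwarded 25 packets over 100 cycles with QE still at 0 would be reported as congested with a throughput of 0.25 packets/cycle, making it a candidate for the throughput-based scaling rule.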

3.2 Compiler Support

Figure 9 gives the overview of our approach, i.e., it zooms in on the part marked as "Our Approach" in Figure 2. Our approach takes a message-passing based parallel code as input. It first partitions each loop nest in each parallel process code into a set of voltage scaling regions (or voltage regions for short). A voltage scaling region is a region of code for which we scale the voltages/frequencies of the links in the mesh. We set the link voltages upon entering a voltage region. Within the region, however, the link voltages are not changed. Therefore, the voltage region is the basic unit for our link voltage scaling. After marking the voltage regions, our compiler breaks each loop nest into the startup phase and the stable phase. During the startup phase, the link usage information for the different voltage scaling regions is collected individually so that we can determine and set the suitable link voltage levels to be used during the stable phase for the different voltage regions separately. It should be emphasized that all the necessary information for scaling the voltage of a communication link can be obtained locally. Each node is responsible for scaling the voltages of its local links, i.e., the links connected to the switch on this node. That is, a node does not need to exchange any information with any other node to determine the voltage/frequency levels for its local links.

3.2.1 Determining Voltage Scaling Regions

We partition each loop nest that contains inter-node communication into a set of voltage scaling regions. Each voltage scaling region contains loops or loop nests such that a communication pattern repeats itself at every iteration. Our partitioning


algorithm tries to put as many loops or loop nests as possible in the same voltage scaling region so that we can minimize the number of link voltage changes (i.e., minimize the overheads). The loop nests with significantly different communication behaviors, however, should not be put in the same voltage scaling region since this can either reduce energy benefits or increase network latency.

Before discussing the details of our algorithm, let us first define the communication pattern (or pattern for short) for a loop nest L in the process code running on the kth mesh node as follows:

Sk(L) =
    ⟨s(k, L), r(k, L)⟩  if L does not enclose any inner loop;
    ⟨S, R⟩              if L encloses at least one inner loop, and all the inner loops in L have the same pattern ⟨S, R⟩;
    ε                   otherwise.

In the expression above, function s(k, L) gives a vector whose ith element is the number of "send" statements in the body of loop L (running on the kth mesh node) that send data packets to the ith mesh node. Similarly, function r(k, L) gives a vector whose ith element is the number of "receive" statements in the body of loop L (running on the kth mesh node) that receive data packets from the ith mesh node. We omit code that is not enclosed by a loop. Note that the behavior of an application is mainly dictated by the code portions enclosed by its loops, because code enclosed by loops is executed much more frequently than code not contained in loops. Therefore, omitting non-loop code does not significantly affect the results of our analysis.

The important point in our communication pattern definition is that, if two loop nests have different communication patterns, they are not likely to exercise the communication links in the same way, and thus they should not be put in the same voltage region. On the other hand, if two loop nests have the same communication pattern (other than ε), they are expected (though not guaranteed) to exercise the communication links in a similar way, and consequently, if they are adjacent to each other, they can be placed into the same voltage region. Although loops with the same counts of communication instructions do not necessarily have the same communication behavior, loops with different counts of communication instructions are very likely to have different communication behaviors. Consequently, our approach at least avoids clustering loops with different communication behaviors into the same voltage region. Further, if a loop nest has the pattern ε, we know that this loop nest encloses multiple inner loops with different communication patterns. Therefore, the body of this loop nest should be partitioned into multiple voltage regions.

Figure 10 gives our compiler algorithm for determining voltage scaling regions in a given loop nest L. Our algorithm marks the start point of each voltage scaling region such that the program code between two consecutive start points belongs to the same voltage region. It first computes the communication pattern for loop nest L as explained above. If this pattern is not ε, we treat the entire loop nest as a single voltage scaling region without further partitioning. On the other hand, if the pattern of this loop is ε, we call function partition(L) to partition this loop nest, since it contains multiple inner loops with different communication patterns. When partitioning a loop nest L containing inner loops L1, L2, ..., Ln, we first compute the communication pattern for each inner loop. After that, we put adjacent inner loops with the same pattern (other than ε) into the same voltage region. For each inner loop with pattern ε, however, we recursively call function partition() to further partition it.

Figure 12 gives an example showing how our partitioning algorithm works. On the left side, we show a loop nest structure; the figure on the right represents this loop nest as a tree. Each node of the tree corresponds to a loop in the given loop nest, and an edge (i, j) indicates that loop i encloses loop j. Loops 3, 5, 6, 7, and 8 are the inner-most loops, i.e., they do not enclose any loop within them. The communication patterns of these inner-most loops can be calculated by counting the number of send and receive statements in their bodies, as discussed above. For clarity of presentation, let us assume that the inner-most loops 3, 5, 6, and 7 have the pattern ⟨S1, R1⟩, whereas the inner-most loop 8 has a different pattern, ⟨S2, R2⟩. Since all the inner loops enclosed by loop 2 have the same pattern ⟨S1, R1⟩, the communication pattern for loop 2 is computed as ⟨S1, R1⟩. On the other hand, since the inner loops enclosed by loop 4 have different patterns, the communication pattern for loop 4 is set to ε. Similarly, loop 1 has the pattern ε. Since loops 2 and 3 are adjacent to each other and have the same communication pattern, they are placed into the same voltage region (region 1). Loops 7 and 8, however, are placed into different voltage regions, since their communication patterns differ.
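The pattern propagation and partitioning on this example can be sketched as follows. This is a minimal reconstruction, not our compiler code: the Loop class is hypothetical, and the labels "P1" and "P2" stand in for ⟨S1, R1⟩ and ⟨S2, R2⟩.

```python
# Illustrative sketch: propagate communication patterns up the loop tree,
# then partition loops with pattern "eps" into voltage regions, following
# the algorithms of Figures 10 and 12.

EPS = "eps"  # stands for the pattern epsilon

class Loop:
    def __init__(self, name, pattern=None, children=()):
        self.name, self.pattern, self.children = name, pattern, list(children)

def compute_pattern(loop):
    """Leaf patterns are given; an enclosing loop inherits its inner
    loops' pattern if they all agree (and are not eps), else eps."""
    if not loop.children:
        return loop.pattern
    pats = [compute_pattern(c) for c in loop.children]
    loop.pattern = pats[0] if all(p == pats[0] != EPS for p in pats) else EPS
    return loop.pattern

def partition(loop, regions=None):
    """Return the list of region start points (names of loops)."""
    if regions is None:
        regions = []
    prev = None
    for child in loop.children:
        if child.pattern == EPS:
            partition(child, regions)   # recurse into mixed-pattern loops
            prev = None
        elif child.pattern != prev:     # a new voltage region starts here
            regions.append(child.name)
            prev = child.pattern
    return regions

# The loop nest of Figure 12: loop 1 encloses loops 2, 3, and 4;
# loop 2 encloses 5 and 6; loop 4 encloses 7 and 8.
tree = Loop("1", children=[
    Loop("2", children=[Loop("5", "P1"), Loop("6", "P1")]),
    Loop("3", "P1"),
    Loop("4", children=[Loop("7", "P1"), Loop("8", "P2")]),
])
compute_pattern(tree)
print(partition(tree))  # -> ['2', '7', '8']
```

The output matches the narrative above: one region starts at loop 2 (covering loops 2 and 3), and loops 7 and 8 start two separate regions inside loop 4.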

3.2.2 Code Transformation for Dynamic Link Voltage Scaling

Figure 11 shows our compiler algorithm that generates the startup and stable phases for a given loop nest L. This algorithm first invokes the partitioning algorithm given in Figure 10 to mark the start point of each voltage scaling region in the loop nest. After that, it calls function generateStartPhase(L) to generate the code for the startup phase, and generateStablePhase(L) to generate the stable phase. Note that splitting a loop nest into the startup and stable phases does not change the order in which loop iterations are executed, and thus no data dependency is violated. In the startup phase, for each voltage scaling region, we create a data structure called the sampling context. Before the backward jump of each loop, we insert a call to function takeSampling() to collect the link usage information for each iteration of this loop. The information collected is stored in the sampling context of the corresponding voltage region.


determineVoltageScalingRegions(L) {
    compute the communication patterns for all the loops in L;
    if (S(L) != ε)
        insert a region start point before the entry of loop L;
    else
        partition(L);
}

partition(L) {
    // Assume that S[1], S[2], ..., S[n] are the communication patterns
    // for loops L1, L2, ..., Ln, respectively, where L1, L2, ..., Ln
    // are the inner loops enclosed by L.
    if (S[1] == ε)
        partition(L1);
    else
        insert a region start point before the entry of loop L1;
    for (i = 2; i <= n; i++) {
        if (S[i] == ε)
            partition(Li);
        else if (S[i] != S[i-1])
            insert a region start point before the entry of loop Li;
    }
}

Figure 10. Compiler algorithm for determining the voltage scaling regions for a loop nest L.

Transformation(L) {
    determineVoltageScalingRegions(L);
    generateStartPhase(L);
    generateStablePhase(L);
}

generateStartPhase(L) {
    // p1, p2, ..., pn are the region start points in loop L,
    // and pn+1 is the end point of loop L.
    adjust the iteration range of the outer-most loop of loop nest L such that
        the startup phase takes the first 5% of the iterations of L;
    for (i = 1; i <= n; i++) {
        create a sampling context cx[i];
        for each loop whose backward jump is in the region between pi and pi+1 {
            insert "takeSampling(cx[i])" before the backward jump of this loop;
        }
        insert "temp[i] = v[i]; v[i] = finalize(cx[i])" at point pi;
    }
    insert code to terminate the startup phase when the voltages are stabilized;
}

generateStablePhase(L) {
    // Assume that p1, p2, ..., pn are the region start points in L,
    // and pn+1 is the end point of loop L.
    adjust the iteration range of the outer-most loop of loop nest L such that
        the stable phase takes the remaining iterations of L;
    for (i = 1; i <= n; i++)
        insert "setVoltage(v[i])" at the point marked by pi;
}

Figure 11. Compiler algorithm that generates the startup and stable phases for a loop nest L.

Since different voltage regions have different communication behaviors, we use a separate sampling context for each region; this makes it possible to determine the suitable link voltages for each voltage region separately. At the end of the startup phase, we insert calls to function finalize() to calculate the suitable voltage levels for each voltage region. As mentioned earlier, this function calculates the voltage for each local link using both the throughput-based and slack-based voltage scaling techniques, and then selects the lower of the two voltages suggested for the link. In the stable phase, for each voltage scaling region, we insert calls to function setVoltage() to set the voltage levels of the links connected to the local switch to the values determined by the corresponding finalize() call.
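A minimal sketch of the resulting two-phase execution is given below. It is illustrative only: finalize() here applies just the throughput-based criterion (the slack-based one is analogous), using the voltage/rate pairs of Table 3, and the runtime calls are modeled as plain Python functions.

```python
# Illustrative sketch (not the generated code): a loop split into a
# startup phase that samples per-iteration link demand and a stable
# phase that runs at the voltage chosen by finalize().

VOLTAGES = [(0.6, 1.00), (0.8, 1.33), (1.0, 1.66), (1.2, 2.00)]  # (V, Gb/s), Table 3

def finalize(samples_gbps):
    """Pick the lowest voltage whose link rate covers the peak sampled load."""
    need = max(samples_gbps)
    for v, rate in VOLTAGES:
        if rate >= need:
            return v
    return VOLTAGES[-1][0]  # demand exceeds all rates: use the highest level

def run(iterations, demand_gbps):
    startup = max(1, iterations // 20)      # first 5% of the iterations
    cx = []                                  # sampling context for this region
    for i in range(startup):                 # --- startup phase ---
        cx.append(demand_gbps(i))            # takeSampling(cx)
    v = finalize(cx)                         # choose the link voltage
    for i in range(startup, iterations):     # --- stable phase ---
        pass                                 # setVoltage(v) applied; body runs
    return v

print(run(100, lambda i: 1.2))  # peak demand 1.2 Gb/s -> 0.8
```

With a sampled peak demand of 1.2 Gb/s, the 1.33 Gb/s rate at 0.8 V suffices, so the stable phase runs the link at 0.8 V rather than the full 1.2 V.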

Figure 12. An example showing how to partition a loop nest into voltage scaling regions. Each node on the right part represents a loop shown on the left part. An edge (i, j) indicates that loop i encloses loop j.

4 Experiments

4.1 Platform and Benchmarks

We used nine benchmarks from the MediaBench suite [2] to test the effectiveness of our approach. We do not include results for the remaining two benchmarks in this suite, ghostscript and mesa, because we could not execute them to completion in our simulation setup. Table 1 lists the benchmark codes used in this paper along with some important statistics. We manually parallelized these applications and inserted explicit communication calls (that implement message passing) in the process codes. In parallelizing each application code, we tried to exploit the coarse-grain parallelism to the extent allowed by the data dependencies in the loop nests; however, we did not apply any inter-procedural optimization (these applications do not seem to benefit from inter-procedural analysis). In order to minimize the frequency and volume of data communication among the mesh nodes, we also applied, whenever possible, several source-code-level message optimizations, including message coalescing, message vectorization, message reuse, and message aggregation. The specific versions of these optimizations used in our implementation are from [35]. These parallelized and communication-optimized codes are then fed to our approach (implemented on top of the SUIF compiler infrastructure [4]) and optimized for energy reduction, as detailed in Section 3. The additional compilation time overhead introduced by our approach, over the parallelization and communication optimizations, was below 30% for all the benchmark codes used in the experiments. By analyzing the compiler-generated codes, we observed that our approach was able to detect very large voltage regions in these applications; the number of voltage regions detected in our benchmarks varied between 3 and 28. Also, the static code size increase caused by our approach was less than 10% for all the benchmarks (ranging from 6.7% to 9.8%).


Benchmark   Description                            Message   Comm. Volume   Energy   Execution
                                                   Count     (KB)           (mJ)     Cycles (×10^6)
adpcm       Adaptive audio coding                  851       438.21         14.23    74.27
epic        Experimental image compression         1382      796.53         46.54    211.05
g721        CCITT voice compression                1818      921.36         187.40   986.37
gsm         European standard for speech coding    1023      581.07         128.71   705.33
jpeg        Lossy compression for still images     793       308.64         27.16    136.91
mpeg2       Lossy compression for video            977       618.40         154.75   911.54
pegwit      Elliptical public-key encryption       995       584.26         119.09   688.42
pgp         IDEA/RSA public-key encryption         886       883.37         158.84   865.34
rasta       Speech recognition                     1054      694.78         81.32    592.60

Table 1. The applications used in the experiments. The values in the last two columns are collected when no power optimization is employed. The energy values shown include both switching energy and leakage energy.

Parameter                      Value
Communication Network
  Mesh size                    6 × 6
  Voltage switching overheads  150 ns delay, 250 nJ energy
  Flit size                    39 bits
Mesh Node
  Processor core               In-order execution, single-issue
  On-chip memory               8KB instruction, 16KB data

Table 2. The default values of our major simulation parameters.

Voltage (V)       1.2    1.0    0.8    0.6
Rate (Gb/s)       2.00   1.66   1.33   1.00
Energy (nJ/bit)   0.45   0.31   0.20   0.11

Table 3. The default link voltage/frequency levels.

The increase in the dynamic instruction count due to our optimization was negligible for all the benchmarks tested; the reason is that we do not change the original number of iterations in the optimized loops (we just restructure the loops). Note that the code memory requirements of our benchmarks are much smaller than their data memory requirements. The third and fourth columns of Table 1 give, respectively, the number and volume of inter-node communication messages issued at the source code level after parallelization, accumulated across all mesh nodes. The last two columns give the network energy consumption and the number of cycles spent executing each application when no power optimization is employed. In the rest of this section, we use the term default scheme to denote an approach that does not use any network power optimization. Even in this default scheme, however, a link that is not used by the application is never activated. The values shown in the last two columns of Table 1 are for the default scheme.

To obtain the power and performance numbers, we built a custom network simulator on top of Orion [37], a cycle-accurate energy/performance simulator for communication networks. This simulator allows each mesh node to inject communication messages (based on the parameters specified in the communication calls inserted in each process code). Apart from the network, our custom tool also simulates the performance/power behavior of each mesh node using a Wattch [7] based model.

Figure 13. The network energy consumptions.

Figure 14. The percentage increase in execution cycles.

While our main goal in the experimental evaluation is the energy and performance behavior of the network itself, we simulate node behavior as well, in order to measure the energy/performance overheads incurred in collecting statistics during the startup phases and in processing the collected statistics at the end of the startup phases. Table 2 gives the default values of our simulation parameters. The default communication network we focus on is a 6 × 6 mesh. Each node in this mesh consists of a switch and computing resources (a single-issue, in-order, 32-bit embedded processor core; an 8KB scratch-pad memory for instructions; and a 16KB scratch-pad memory for data). The internal structures of these components were given earlier in Figure 3. We assume that this network employs packet switching and XY routing [13]. All the communication links can operate at a variety of voltage/frequency levels [20]. In our experiments, unless otherwise stated, we assume that four voltage/frequency levels are available for each link. Table 3 lists the per-bit energy consumption at each voltage level. The values in Tables 2 and 3 are similar to those used in prior work [31, 34, 36]. The energy consumed in transferring an S-bit packet over a communication link is computed as S·Ev, where Ev is the per-bit energy cost at voltage level v, as shown in Table 3.


In addition, switching between two different voltage levels incurs a 150 ns delay and a 250 nJ energy overhead. Note that our simulation environment reports, as output, both energy (including leakage and dynamic components) and performance statistics. The values shown in the last four columns of Table 1 were obtained using the parameter values in Table 2. Recall that the energy distribution between computation and communication was given earlier in Figure 1.
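For example, using the parameters of Tables 2 and 3, the per-packet link energy and the voltage-switching overhead can be computed as follows (a worked example, not part of the simulator):

```python
# Worked example using the values of Tables 2 and 3: transferring an
# S-bit packet over a link costs S * E_v nJ, where E_v is the per-bit
# energy at voltage level v; a voltage change adds a 250 nJ overhead.

ENERGY_NJ_PER_BIT = {1.2: 0.45, 1.0: 0.31, 0.8: 0.20, 0.6: 0.11}  # Table 3
SWITCH_OVERHEAD_NJ = 250.0                                         # Table 2

def packet_energy_nj(bits, voltage, voltage_changed=False):
    e = bits * ENERGY_NJ_PER_BIT[voltage]
    return e + (SWITCH_OVERHEAD_NJ if voltage_changed else 0.0)

# A 10-flit packet with 39-bit flits (Table 2) sent at 0.8 V:
print(packet_energy_nj(10 * 39, 0.8))        # -> 78.0
# The same packet sent right after scaling the link down to 0.8 V:
print(packet_energy_nj(10 * 39, 0.8, True))  # -> 328.0
```

The second value shows why frequent voltage changes are costly: a single switch (250 nJ) can dwarf the transfer energy of a small packet.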

For comparison purposes, we also implemented a hardware scheme based on the technique proposed by Shang et al. in [30]. In their approach, a hardware component continuously collects statistics on the network traffic through each link and, based on the collected statistics, predicts the future behavior of the application and adjusts the link voltage levels to conserve energy. Instead of using the throughput and the number of slack cycles as our approach does, the voltage decision logic of their approach is based on three metrics: link utilization, input buffer utilization, and input buffer age. Link utilization is the fraction of time that a link is transferring packets; input buffer utilization captures how many packets, on average, are waiting in the queue associated with each link in a given period; and input buffer age reflects how long each packet remains in the queue, waiting to be transferred. In our implementation, we hand-tuned the parameters of this hardware scheme to maximize its energy savings while keeping the resulting performance overheads as low as possible. In presenting our experimental results, we use the terms "hardware based" and "hybrid" to denote this hardware based approach and our hybrid approach, respectively. Later in our experiments, we also compare our hybrid approach to two recently proposed pure compiler based schemes.
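For illustration, these three metrics could be computed over a sampling window as sketched below. This is our reconstruction of the idea in [30]; the function and its inputs are hypothetical, not the hardware design of that paper:

```python
# Hypothetical reconstruction of the three per-link metrics sampled by
# the hardware DVS scheme of [30] over one control window.

def link_metrics(window_cycles, busy_cycles, queue_len_samples, packet_waits):
    """busy_cycles: cycles the link spent transferring flits in the window;
    queue_len_samples: input buffer occupancy sampled each cycle;
    packet_waits: cycles each departing packet waited in the input buffer."""
    link_utilization = busy_cycles / window_cycles                  # fraction of busy cycles
    buffer_utilization = sum(queue_len_samples) / len(queue_len_samples)  # avg packets queued
    buffer_age = sum(packet_waits) / len(packet_waits) if packet_waits else 0.0  # avg wait
    return link_utilization, buffer_utilization, buffer_age

m = link_metrics(1000, 250, [0, 1, 2, 1], [10, 30])
print(m)  # -> (0.25, 1.0, 20.0)
```

Note that all three quantities measure how busy a link is, not how much each packet could be delayed, which is the "throughput oriented" versus "delay oriented" distinction drawn later in this section.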

4.2 Results

We start by presenting the energy consumption results. Figure 13 gives the network energy consumption values for our benchmarks under the hardware scheme and the hybrid approach. These values are normalized with respect to those given in the fifth column of Table 1. We see that the network energy savings with the hybrid and hardware based schemes are 38.1% and 24.9%, respectively, on average. The additional energy savings the hybrid scheme achieves over the hardware scheme can be explained as follows. In the hybrid approach, a program in the stable phase sets the link voltage levels proactively. Further, the hybrid approach changes the voltage of each link to the suitable level directly, whereas the hardware approach has to step through all the intermediate voltage levels until it reaches the desired voltage level. While it is possible to design a hardware scheme that transitions to the target voltage directly, such a scheme requires more hardware to perform accurate prediction at runtime (later we also evaluate an alternate hardware scheme with different metrics).

The power savings shown in Figure 13 are not without cost. Voltage scaling typically incurs some increase in execution cycles, shown in Figure 14 for our benchmarks. From these results we see that the average performance degradations due to the hardware scheme and our approach are 8.3% and 2.1%, respectively. The hybrid approach causes less performance degradation than the hardware approach for the following reasons. First, the hybrid approach scales link voltage levels based on the maximum throughput and the number of slack cycles for each link, while the hardware approach scales voltages based on link utilization, input buffer utilization, and input buffer age. The metrics used in the hybrid approach directly reflect the potential for scaling down the voltage level of a link without delaying the time at which each message arrives at its destination. In contrast, the metrics used in the hardware based approach are "throughput oriented" rather than "delay oriented". As reported in [30], the hardware approach achieves 4.6X power savings, with a 2.5% reduction in network throughput and a 15.2% increase in network latency on average. Second, the hybrid approach incurs fewer voltage changes (each of which incurs a certain delay) than the hardware based approach, since the former can, in the stable phase, set the suitable voltage level for each link directly, whereas the latter must step through all the intermediate voltage levels. Considering Figures 13 and 14 together, we conclude that, from both the network energy and performance viewpoints, the hybrid scheme is more successful than hardware-based voltage scaling.

However, since our approach modifies the application code, it also incurs some performance and energy penalty in the mesh nodes. The performance overheads incurred at the NoC nodes are already captured in the results presented in Figure 14. Table 4, on the other hand, summarizes the energy overheads incurred in the mesh nodes as a result of using the hybrid approach. These overhead numbers include all the dynamic and leakage overheads that occur in the CPUs and memory components. We see that the average energy overhead at the mesh nodes due to our approach is about 1.13%.

We now discuss the total energy savings achieved by our approach when considering not only the network but also the mesh nodes. The total energy consumption values (normalized with respect to the default scheme) resulting from our hybrid approach and the hardware-based scheme are given in Table 5. These results include both computation and communication, and also capture all the overheads resulting from our approach. We see that, on average, the hybrid and hardware-based schemes reduce the total energy consumption of the default scheme by 10.78% and 4.29%, respectively. We also observe that, for some benchmarks, the hardware based approach actually increases the overall energy consumption (over the default scheme) due to the increased execution time caused by the higher network latency. Our conclusion is that the hybrid scheme performs much better than the hardware-based scheme from the total energy consumption viewpoint as well, even when all overheads are accounted for.


Benchmark     adpcm  epic  g721  gsm  jpeg  mpeg2  pegwit  pgp  rasta
Overhead (%)  1.5    1.5   1.9   0.7  0.9   0.8    1.1     1.3  0.5

Table 4. The percentage increases in the node energy consumptions due to the hybrid approach.

Scheme              adpcm  epic   g721   gsm   jpeg  mpeg2  pegwit  pgp   rasta
Hybrid (%)          87.1   93.5   93.6   91.7  92.1  81.0   82.9    91.2  89.8
Hardware Based (%)  93.5   102.5  103.3  96.3  98.1  86.8   91.0    93.7  96.2

Table 5. The normalized total energy consumptions of the hybrid and hardware based schemes, including both the computation and communication related components and all overheads.

4.3 Comparison with Pure Compiler Based Approaches and an Optimal Scheme

There has been recent research on software-directed communication link power minimization. These studies suggest analyzing the application code and inserting explicit link power management calls into the code, either (1) to temporarily shut down the communication links that are unused by the current communication (note that links not used by the application at all are not activated anyway in the default scheme), or (2) to scale down the voltage/frequency of lightly used links. To see how our hybrid scheme compares to such pure compiler based schemes, we also implemented the compiler-based link shutdown scheme described in [10] and the compiler-based voltage/frequency scaling scheme described in [11]. The results with these schemes are presented in Figure 15. For each benchmark, the first bar gives the normalized energy consumption with the scheme in [11], and the second bar gives the results with the scheme in [10]. The third bar reproduces the results obtained through our hybrid approach, and the last bar shows the results when our hybrid approach is applied after the scheme in [10]. In this combined approach, the scheme in [10] determines the links unused by the current communication pattern, and the hybrid scheme scales down the voltage for the remaining links. The main observation from these results is that the hybrid scheme generates much better results than the schemes in [10] and [11]. The reason is that those two prior schemes rely solely on compiler support for extracting the communication pattern (statically) and for selecting the suitable voltage/frequency levels and the set of links to shut down, based on the compiler-exposed NoC model. Such pure compiler based schemes may not be very successful when the communication pattern cannot be fully analyzed at compile time (note that this non-analyzability does not necessarily mean a lack of link reuse). In contrast, our hybrid approach takes the runtime communication behavior into account, and therefore catches opportunities for link reuse that could not be caught by static analysis alone. As a result, the hybrid scheme achieves higher energy savings than these two compiler-based schemes. We also see from the results in Figure 15 that the best savings are obtained when our hybrid approach is combined with the compiler-based link shutdown scheme.

While the network energy savings achieved by our approach are significant, it is also important to check how close our approach comes to the optimal network energy savings. We performed another set of experiments to compare our savings against a scheme that selects the optimal voltage level for each communication link at any time (from the four voltage levels in our default configuration). The results with this optimal (and probably unimplementable) scheme are presented in Figure 16 (see Footnote 2). We also reproduce the results with our hybrid approach from Figure 13 for convenience. We see that the average network energy savings achieved by the optimal scheme is 41.2%. This value is not significantly higher than the corresponding average saving obtained by our hybrid approach (38.1%). Therefore, we conclude that our approach comes very close to the optimal one as far as network energy savings are concerned. In order to explain the difference between our scheme and the optimal one, we also give, in Figure 17, a breakdown of our energy losses. Compared to the optimal approach, our scheme has two sources of loss. First, we consume extra network energy in the startup phases, where we operate at the highest available voltage level. Second, our approach can select sub-optimal voltage levels (since it is a heuristic). We see from Figure 17 that, for all our applications, most of our losses are due to selecting sub-optimal voltage levels. This means that, by employing a more sophisticated voltage selection scheme, we might be able to reduce the gap between the hybrid scheme and the optimal approach even further. Exploring this issue is part of our future agenda.

5 Related Work

Voltage scaling is an important technique used frequently for power optimization in computer systems [8]. Many commercial CPUs (e.g., Transmeta [3] and Intel XScale [1]) now provide interfaces that allow software to control voltage levels to reduce energy consumption. Proposed CPU voltage scaling techniques include [8, 9, 15, 29, 39, 38, 40].

Footnote 2: We calculate the optimal network energy for an application as follows. First, we track the transmission time ti (the time for transmitting a packet from the source node to the destination node) for each packet mi when our hybrid scheme is applied. We then compute the minimum energy required to transmit each packet mi individually, regardless of the other traffic in the network, under the constraint that the transmission time for mi does not exceed ti. By summing the minimum energy consumption over all packets, we obtain the optimal network energy consumption for the application. That is, the optimal energy consumption is the minimum energy consumed by the network during the execution of an application, under the constraint that the transmission time for each packet does not exceed that observed under our hybrid scheme.
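The per-packet minimization in this footnote can be sketched as follows for a single link (our reconstruction for illustration; the function name and the single-hop simplification are assumptions):

```python
# Illustrative sketch of the per-packet lower bound behind the "optimal"
# scheme: pick the lowest-energy voltage level whose rate still lets the
# packet finish within the transmission time t_i observed under the
# hybrid scheme. Single-link case shown for simplicity.

LEVELS = [(0.6, 1.00, 0.11), (0.8, 1.33, 0.20),
          (1.0, 1.66, 0.31), (1.2, 2.00, 0.45)]  # (V, Gb/s, nJ/bit), Table 3

def optimal_packet_energy_nj(bits, t_ns):
    """Minimum energy (nJ) to move `bits` over one link within t_ns."""
    for v, rate_gbps, e_per_bit in LEVELS:      # lowest energy first
        if bits / rate_gbps <= t_ns:            # 1 Gb/s = 1 bit per ns
            return bits * e_per_bit
    raise ValueError("deadline tighter than the fastest level allows")

# A 390-bit packet that took 340 ns under the hybrid scheme can be sent
# at 0.8 V, since 390 / 1.33 is about 293 ns <= 340 ns:
print(optimal_packet_energy_nj(390, 340))  # -> 78.0
```

Summing this bound over all packets yields the "optimal" curve of Figure 16; it is a lower bound because it ignores contention between packets.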


Figure 15. The normalized energy consumption values (with respect to the default scheme) with different schemes.

Figure 16. A comparison of our approach with the optimal scheme.

Figure 17. The breakdown of the energy losses when compared to the optimal scheme.

Chheda et al. [12] use both static and runtime IPC statistics to adaptively adjust the voltage and speed of CPU components, conserving energy while still meeting the applications' target performance constraints.

In the area of multiprocessors, due to increasing link bandwidths, the power consumed by interconnection networks is becoming an increasingly significant portion of the total power consumption, for both NoCs and large-scale networks [18, 25]. Kim and Horowitz [20] proposed link design techniques in which the links can operate at different voltage/frequency levels. Based on this variable voltage/frequency link design, Shang et al. [30] presented and evaluated a history-based dynamic voltage scaling (DVS) scheme for communication links. Their approach is hardware-based; a detailed comparison between their approach and our hybrid approach was given in Section 4. Soteriou et al. [33] proposed a software-directed DVS technique for reducing the energy consumption of communication links; to determine the appropriate voltage levels for the links, their approach requires off-line profiling of the application. Kim et al. [19] proposed a dynamic link shutdown (DLS) technique for chip-to-chip networks. They compared DVS and DLS, and demonstrated that a scheme integrating both DVS and DLS provided the best energy savings, with around 5% performance degradation.

Besides voltage scaling and link shutdown, another approach to reducing the energy consumption of a network-on-chip system is based on task mapping. Shin and Kim [31] use genetic algorithms to determine task assignment, tile mapping, routing path allocation, task scheduling, and link speed assignment for applications running on a NoC based system. Ascia et al. [5] proposed another genetic algorithm that allows the user to specify a particular optimization goal, such as performance or energy consumption. Hu and Marculescu [17] proposed an algorithm that maps a given set of IP blocks onto a generic regular NoC and constructs a deadlock-free routing function such that the total energy consumption due to communication is minimized. The prior compiler work on chip multiprocessors [22, 23, 26] targeted exploiting parallelism; in contrast, our work targets energy reduction in the communication network. There are also prior efforts that use compiler-based techniques for reducing the energy consumption of NoCs. Li et al. [24] proposed a compiler directed technique that shuts down some links to save leakage energy. Chen et al. [10] proposed reducing the energy consumption of NoCs using compiler-directed communication link allocation. Chen et al. [11] proposed a compiler directed DVS technique for communication links. All these efforts rely on static analysis of the application code; our hybrid approach differs from them in that it uses runtime information to determine the voltages of the communication links.

Several other efforts have been devoted to optimizing the power consumption of interconnection networks. The works in [37, 14, 27] build power models for interconnection networks, as well as interconnection network simulators. Wang et al. [36] analyzed the power dissipation of existing network microarchitectures; they also devised several power-efficient network microarchitectures, including the segmented crossbar, the cut-through crossbar, and the write-through buffer. Soteriou and Peh [34] explored the design space of power-aware link management. In the context of NoCs, Benini and De Micheli [6] identified possible approaches to energy savings, including node-centric and network-centric techniques; Simunic and Boyd [32] later implemented several of these techniques using a closed-loop control model. Raghunathan et al. [28] presented a survey of energy-efficient on-chip communication. Our approach differs from these prior efforts in that we employ optimizing compiler technology to reduce network energy.

6 Conclusions

Compared to bus-based SoC systems, NoC architectures offer better scalability, easier IP core reuse, and higher parallelism. However, power consumption can become a critical design issue in NoC based systems. While circuit and architectural techniques are certainly very important in reducing the power consumption of NoCs, software can also play an important role. Prior software work has mainly focused on application mapping onto NoC based systems.


Our goal, instead, is to employ an optimizing compiler in managing the voltages/frequencies of the communication links. Specifically, we implemented a hybrid approach that divides the task of power minimization between the compiler and the runtime. In this approach, the compiler modifies the application source code such that the first few iterations of a loop nest are used to collect statistics on link usage and to decide the suitable voltage and frequency levels for the communication links. The remaining iterations of the loop execute at these selected voltage/frequency levels, so that power consumption is reduced without excessively impacting performance. This approach is automated within a compiler and tested using applications from the MediaBench suite. The results obtained show that the proposed hybrid approach yields much higher energy savings than pure hardware based and pure compiler based energy reduction schemes. They also indicate that our approach generates energy savings very close to those obtainable by an optimal scheme.

References

[1] Intel XScale technology. http://www.intel.com/design/intelxscale/.

[2] MediaBench. http://cares.icsl.ucla.edu/MediaBench/.

[3] Transmeta processor. http://www.transmeta.com/crusoe.

[4] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proc. Seventh SIAM Conference on Parallel Processing for Scientific Computing, Feb. 1995.

[5] G. Ascia, V. Catania, and M. Palesi. Multi-objective mapping for mesh-based NoC architectures. In Proc. the International Conference on Hardware/Software Codesign and System Synthesis, Sept. 2004.

[6] L. Benini and G. D. Micheli. Powering networks on chips: energy-efficient and reliable interconnect design for SoCs. In Proc. the 14th International Symposium on Systems Synthesis, 2001.

[7] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proc. the International Symposium on Computer Architecture, pages 83–94, 2000.

[8] T. D. Burd and R. W. Brodersen. Design issues for dynamic voltage scaling. In Proc. the International Symposium on Low Power Electronics and Design, pages 9–14, New York, NY, USA, 2000. ACM Press.

[9] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits, 35(11):294–295, Nov. 2000.

[10] G. Chen, F. Li, and M. Kandemir. Compiler-directed channel allocation for saving power in on-chip networks. In Proc. Symposium on Principles of Programming Languages, Charleston, SC, Jan. 2006.

[11] G. Chen, F. Li, M. Kandemir, and M. J. Irwin. Reducing NoC energy consumption through compiler-directed channel voltage scaling. In Proc. Conference on Programming Language Design and Implementation, Ottawa, Canada, June 2006.

[12] S. Chheda, O. Unsal, I. Koren, C. M. Krishna, and C. A. Moritz. Combining compiler and runtime IPC predictions to reduce energy in next generation architectures. In Proc. the 1st Conference on Computing Frontiers, pages 240–254, New York, NY, USA, 2004. ACM Press.

[13] J. B. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks. Morgan Kaufmann Publishers, 2002.

[14] N. Eisley and L.-S. Peh. High-level power analysis of on-chip networks. In Proc. the 7th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Sept. 2004.

[15] C.-H. Hsu and U. Kremer. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In Proc. the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2003.

[16] J. Hu and R. Marculescu. Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures. In Proc. the Design Automation and Test in Europe, Mar. 2003.

[17] J. Hu and R. Marculescu. Energy- and performance-aware mapping for regular NoC architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(4):551–562, Apr. 2005.

[18] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.2, Oct. 2004.

[19] E. J. Kim, K. H. Yum, G. Link, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, M. Yousif, and C. R. Das. Energy optimization techniques in cluster interconnects. In Proc. the International Symposium on Low Power Electronics and Design, Aug. 2003.

[20] J. Kim and M. Horowitz. Adaptive supply serial links with sub-1V operation and per-pin clock recovery. In Proc. the International Solid-State Circuits Conference, Feb. 2002.

[21] J. S. Kim, M. B. Taylor, J. Miller, and D. Wentzlaff. Energy characterization of a tiled architecture processor with on-chip networks. In Proc. the International Symposium on Low Power Electronics and Design, 2003.

[22] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-time scheduling of instruction-level parallelism on a RAW machine. In Proc. the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1998.

[23] W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proc. the 35th International Symposium on Microarchitecture, Nov. 2002.

[24] F. Li, G. Chen, M. Kandemir, and M. J. Irwin. Compiler-directed proactive power management for networks. In Proc. Conference on Compilers, Architectures and Synthesis of Embedded Systems, 2005.

[25] S. S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb. The Alpha 21364 network architecture. IEEE Micro, 22(1):26–35, Jan. 2002.

[26] R. Nagarajan, D. Burger, K. S. McKinley, C. Lin, S. W. Keckler, and S. K. Kushwaha. Static placement, dynamic issue (SPDI) scheduling for EDGE architectures. In Proc. International Conference on Parallel Architectures and Compilation Techniques, Oct. 2004.

[27] C. S. Patel. Power constrained design of multiprocessor interconnection networks. In Proc. the International Conference on Computer Design, USA, 1997.

[28] V. Raghunathan, M. B. Srivastava, and R. K. Gupta. A survey of techniques for energy efficient on-chip communication. In Proc. the 40th Design Automation Conference, 2003.

[29] H. Saputra, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. Hu, C.-H. Hsu, and U. Kremer. Energy-conscious compilation based on voltage scaling. In Proc. the ACM Joint Conf. on Languages, Compilers, and Tools for Embedded Systems and Software and Compilers for Embedded Systems, June 2002.

[30] L. Shang, L.-S. Peh, and N. K. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proc. the International Symposium on High-Performance Computer Architecture, Feb. 2003.

[31] D. Shin and J. Kim. Power-aware communication optimization for networks-on-chips with voltage scalable links. In Proc. the International Conference on Hardware/Software Codesign and System Synthesis, Sept. 2004.

[32] T. Simunic and S. Boyd. Managing power consumption in networks on chip. In Proc. the Conference on Design, Automation and Test in Europe, 2002.

[33] V. Soteriou, N. Eisley, and L.-S. Peh. Software-directed power-aware interconnection networks. In Proc. Conference on Compilers, Architecture and Synthesis for Embedded Systems, Sept. 2005.

[34] V. Soteriou and L.-S. Peh. Design space exploration of power-aware on/off interconnection networks. In Proc. the 22nd International Conference on Computer Design, Oct. 2004.

[35] C.-W. Tseng. An optimizing Fortran D compiler for MIMD distributed-memory machines. PhD thesis, CS Dept., Rice University, TX, Jan. 1993.

[36] H. Wang, L. Peh, and S. Malik. Power-driven design of router microarchitectures in on-chip networks. In Proc. the 36th International Conference on Microarchitecture, Dec. 2003.

[37] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion: a power-performance simulator for interconnection networks. In Proc. the 35th International Symposium on Microarchitecture, Nov. 2002.

[38] M. Weiser, B. Welch, A. J. Demers, and S. Shenker. Scheduling for reduced CPU energy. In Proc. the Operating Systems Design and Implementation, pages 13–23, 1994.

[39] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks. A dynamic compilation framework for controlling microprocessor energy and performance. In Proc. the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 271–282, 2005.

[40] F. Xie, M. Martonosi, and S. Malik. Compile-time dynamic voltage scaling settings: opportunities and limits. In Proc. the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2003.