tech report nwu-eecs-07-09: a dynamic-programming algorithm for reducing the energy consumption of...

8/6/2019 Tech Report NWU-EECS-07-09: A Dynamic-Programming Algorithm for Reducing the Energy Consumption of Pipelined System-Level Streaming Applications

1/9

Electrical Engineering and Computer Science Department

Technical ReportNWU-EECS-07-09November 15, 2007

A Dynamic-Programming Algorithm for Reducing the Energy Consumption of Pipelined System-Level Streaming Applications

N. Liveris, H. Zhou, P. Banerjee

Abstract

In this paper we present a System-Level technique for reducing energy consumption. The

technique is applicable to pipelined applications represented as chain-structured graphsand targets the energy overhead of switching between active and sleep mode. Theoverhead is reduced by increasing the number of consecutive executions of the pipelinestages. The technique has no impact on the average throughput. We derive upper boundson the number of consecutive executions and present a dynamic-programming algorithmthat finds the optimal solution using these bounds. For specific cases we derive a qualitymetric that can be used to trade quality of the result for running-time.

Keywords: Energy Consumption, Streaming Applications, Synchronous Data Flow Graphs


2/9

A Dynamic-Programming Algorithm for Reducing the Energy Consumption of Pipelined System-Level Streaming Applications

N. LiverisNorthwestern UniversityEvanston, IL 60208 USA

H. ZhouNorthwestern UniversityEvanston IL 60208 USA

P. BanerjeeHP Labs

Palo Alto, CA 94301 USA

In this paper we present a System-Level technique for reducing en-ergy consumption. The technique is applicable to pipelined applica-tions represented as chain-structured graphs and targets the energyoverhead of switching between active and sleep mode. The over-head is reduced by increasing the number of consecutive executionsof the pipeline stages. The technique has no impact on the averagethroughput. We derive upper bounds on the number of consecutiveexecutions and present a dynamic-programming algorithm that ndsthe optimal solution using these bounds. For specic cases we derivea quality metric that can be used to trade quality of the result for running-time.

! # "$ & %

Synchronous Dataow Graphs (SDFs) are considered a useful wayto model Digital Signal Processing applications [8]. This is becausein most cases the portions of DSP applications, where most of theexecution-time is spent, can be described by processes or actors withconstant rates of data consumption and production.

Energy consumption is one quality metric for digital integratedcircuits. The main sources of energy consumption are dynamic andstatic power dissipation. Static or leakage power is expected to be-come the dominant power dissipation component for future tech-nologies [4]. Therefore, techniques to reduce the leakage power areneeded.

Work on leakage reduction at the higher levels of design has beenfocused on replacing cells or submodules of the design with oneswith the same functionality but higher threshold voltage (e.g. [6]).Although these techniques can lead to signicant reductions, they are

not applicable to parts of the design that come as hard cores or whenthe available time slack changes, even with a low frequency, e.g. bythe user of the system. In these cases, techniques are needed thatare adaptive to environment changes and do not require resynthesisof IP cores. Such techniques include Dynamic Voltage Scaling [2],Adaptive Body Biasing [7], and Power Gating [4]. In this paper wefocus on the latter technique.

With power gating, a hardware module is shut down when it isidle. This way the stand-by leakage of the module is reduced. Theswitching from active to sleep mode and back to active has an energypenalty caused mainly by the loading of the nodes to normal V dd levels [4]. In this work we try to decrease energy consumption byreducing the number of times the mode switch occurs.

In this paper we try to nd the number of consecutive iterationsfor each pipeline stage of a chain-structured SDF graph. This prob-

lem is similar to vectorization [11], but in our case instead of trying tomaximize the consecutive number of executions, we try to maximizethe energy savings taking into account the energy penalty paid byadding more buffers to each channel. Dynamic programming tech-niques have been used to determine a schedule for a chain-structuredSDF, so that the memory requirements are minimized [9]. In our ap-proach, the buffer requirements are increased, whenever this increaseleads to a reduction of the total energy consumption.

The throughput of the application does not change after applyingour method. Moreover, our method guarantees that any latency in-crease does not cause data loss. In general for streaming multimedia

S iS i-1 S i+1 S |S|S 0 S 1 S |S|+1

x

+

x

x

x

+

+

R

R

R

R

R

R

R

buffer

buffer

buffer

buffer

S i-1 S i+1

Figure 1: System structure. At the rst level of hierarchy the systemis a pipelined chain-structured graph. The processes (nodes) of thesecond level can be independently power gated. Cross edges betweenpipeline stages are implemented using buffers.

applications throughput constraints are important and less emphasisis put on latency [5]. Our technique is applicable only to streamingapplications, for which a latency increase is acceptable.

In Section 2 we explain the model we use to describe pipelinedsystem-level applications. Section 3 denes the problem we try totackle. In Section 4 the theoretical issues of the problem are ad-dressed, while Section 5 describes an algorithm that can be used tosolve it. Finally, in Sections 6 and 7 we present experimental results

and draw conclusions.' (

! 0 ) 13 2 4 )5 5 %7 68 & %

In thissection we describe the model we use for system-level pipelinedapplications. Table 1 summarizes the denitions of the symbols usedin this paper. In Figure 1 the structure of the model can be seen.

'! 9@ B AD C

8 %

EG F

H "0 I "$ )5

F

2 Q P0

In an SDF G RT S V U E V each node represents a process and each edgea channel, in which the tail produces data and the head consumesdata. We assume a global clock for the whole system. Functions p : E WY X , c : E W X , and w : E Wb ad c0 represent the production,consumption rates, and the number of initial tokens (delays) of eachchannel.

In order for an SDF to be executable with bounded memory, thesystem q R 0 should have non-trivial solutions, where is the topol-ogy matrix of G [8]. The vector with the minimum positive integersin the solution space, q, is called the repetition vector and each en-try represents the number of times the corresponding node shouldbe executed during each complete cycle of the graph. An SDF iscalled consistent if it has a repetition vector and the system does notdeadlock [10].

A proper subset of E in the graph may not have either a tail ora head. These are the input and output edges with which the SDFcommunicates with its environment.


3/9


4/9

S U V3 R the fact that event happens before happens. If and are not ordered by the relation, S U Vv R and S U V R, the twoevents can occur in any order, even at the same time. An event canbe the starting time or the ending time of the execution of a node.We can extend this relation to the execution of instances of stages.More specically, we denote as i j j the fact that the ending timeof instance i of stage happens before the starting time of instance jof stage . The relation j is also transitive.

The edges of the graph dene precedence constraints that restrict

the number of available schedules that can be generated. Since allnodes (processes and stages) may carry state from one iteration tothe next k k 1 y y qv : vk l 1 j vk .

The buffer size of a channel should be large enough to store themaximum number of tokens present at that channel at any time. Sup-pose that j denes a consistent and admissible schedule, then:

m

e u f v h0 p E fm

i pq

f let jmax

maxu

j n j pr

i

0o

v j ui w , then

b S u U v V R maxi

S i p S u U v V jmax c S u U v VH V8 w S u U v V (2)

The above formula holds because during the ith instance of the pro-ducer S i 1 V# p S u U v V tokens have already been produced and p S u U v Vare being produced during that iteration. Meanwhile, jmax instancesof the consumer have already completed execution and, therefore,

jmax c S u U v V tokens have been consumed. To the total number of tokens present we need to add the w S u U v V initial tokens.

We assume that the token production at the inputs is periodic.That means that if a complete cycle is executed within Lcc , then foreach input edge i the period is Li R

Lccq i , where qi is the number of

instances in a complete cycle. Note that, since the input graph is as-sumed to be consistent, we do not need to worry about the existenceof the qi values. Each stage should have an average invocation periodof Ls R Lccqs . Therefore, l S G s V Ls .

L

l(Gs i)

executing

idle

buffer

bufferbuffer

bufferS i-1 S i+1S i

Figure 2: Execution of stage si with xi R 1

3 L

3 l(Gs)

executing

idle

buffer

bufferS i-1 S i+1

bufferbuffer

bufferbuffer

bufferbufferbufferbufferbufferbuffer

S i-1

Figure 3: Execution of stage si with xi R 3

# $ 1 ) b P8 H "$ 1 5 & %

As we saw in Equation 1, energy can be saved by switching the op-eration mode of some processes when enough idle time is available.One way to increase the energy savings could be to consecutivelyexecute the stage for an integer number of times x 1 and then al-low its processes to be in sleep mode for a longer interval. This mayincrease the buffer requirements for the input and output channels.

However, the switching mode penalty can now be shared across dif-ferent instances for some of the processes of the stage.

The transformation can be described as replacing stage s by s ,whose ring rules can be derived by multiplying by x the number of required inputs tokens for s. Moreover, s and s have the same pro-cess graph, which for s is repeated x times for each invocation, and,therefore, l S G s V R x l S G s V . For the edges connected to s k S t U s V E : c S t U s z V{ R x c S t U s V and k S s U t V| E : p S s } U t V R x p S s U t V .

It is easy to prove that the topology matrix of the graph G after

the transformation has the same rank and since the graph is acyclic,the graph is still consistent [8].An example is shown in Figures 2 and 3. In the rst gure for

stage si the x value equals 1. Every L cycles si is idle for L l S G si Vcycles. If this interval is long enough, some of the processes of s i canbe switched to sleep mode. The penalty for switching from active tosleep and back to active is E sm every L cycles for those processes.In Figure 3 the addition of 2 extra buffers allows si to execute forthree consecutive times. The idle time increases and potentially moreprocesses can be shut down. Besides that, the penalty for the modeswitch E sm is paid once every 3 L cycles for each process. If thereis a change in the input rate and the slack becomes zero, the twoadditional buffers can be shut down and the stage can operate as inthe rst case. We assume that such changes happen with a very lowfrequency, e.g. the changes are caused by the user of the system, andthere is a small set of predened values for the input rate. For eachof these values we solve the following problem.

Given a multirate, consistent, hierarchical graph G R~ S V U E V withthe rst level of hierarchy being a chain-structured multigraph, and aquality metric , nd the number of consecutive executions xs d X foreach stage s, so that the energy savings are not less than S 1 V5 E max ,where E max are the maximum energy savings that can theoreticallybe achieved by any solution to this problem.

C)5 0 ) & % I 0 1 5 6$ 1 0 & %

In this section we reduce the search space of the solution. The solu-tion space of the problem is X S . Using properties of the problem,the quality metric, and the energy penalty on the additional bufferswe nd an upper bound on the x values, making the solution spacenite. This upper bound affects the complexity of the proposed algo-

rithm and its running-time as shown in Sections 5 and 6.

3 L

3 l(Gs) v executing

Gs and v idle

Gs executing

l(v)

3L-3 l(Gs)

3L-3 l(Gs) + l(Gs) - l(v)=3L - ( 2l(Gs)+l(v) )

Figure 4: Idle time for type-1 processes

3 L

3 l(Gs)v executing

Gs and v idle

Gs executing

l(v)3L-3 l(Gs)

3L-3 l(Gs) + l(Gs) - l(v)=3L - ( 2l(Gs)+l(v) )

l(Gs)-l(v)l(Gs)-l(v)

3L - 3l(v)

Figure 5: Idle time for type-2 processes


5/9

{ 9z)

$ F5 5 %

! I )5 & )

Using Equation (1) we can explore the energy savings that can beobtained by any process. Suppose that G s is the graph representingpipeline stage s and xs is the number of consecutive executions of G s .We can distinguish two types of processes.

6$ )E&

) & )5

Processes v G s for which

S l S G s V l S v VH V$ P S v V E sm S v V (3)

are Type-1 processes.Process v is invoked xs times with a period of l S G s V cycles (Fig-

ure 4). Let st S v V and et S v V , st S G s V and et S G s V be the start and endtime of intervals l S v V , l S G s V , respectively. In the rst iteration v canbe invoked after st S v V st S G s V cycles and in the xs th iteration it canbe put to sleep mode for et S G s V et S v V . Therefore, in the rst andlast iterations v is in active mode for l S G s V l S v V . In the rest xs 2 it-erations v is not switched to sleep mode, since Inequality (3) suggeststhat this would cause an energy loss. Therefore, the total time spentin active mode after xs Ls cycles is S xs 2 V# l S G s V l S G s V l S v V or

S xs 1 V0 l S Gs V5 l S v V and the energy savings in this case are:

E s S v V RT S xs Ls S xs 1 V$ l S G s V l S v VG V P S v V E sm S v V

Therefore, on average the energy savings in Ls cycles are:

E s S v V{ RT S LsS xs 1 V l S G s V8 l S v V

xsV P S v V

E sm S v Vxs

(4)

Lemma 1 The energy savings after L s cycles of Type-1 process vGs are upper bounded by S Ls l S Gs VG V$ P S v V .

Proof: : Follows from Equation 4 for x W .From Equation (4) we can express the energy savings difference

obtained by increasing xs from x1 to x2 as

E s e v hG e x2 f x1 h x2 x1 x2 x1

E sm e v hI P e v hI G e l e Gs hI l e v h h} (5)

which is always greater than zero, since x2 x1 and because of Inequality (3). Since E s S v V S x2 U x1 V is positive, E s S v V is an increasing

function of xs for all Type-1 processes.6$ )

EH ' ) & )5

Processes v G s for which

S l S G s V l S v VH V$ P S v V E sm S v V (6)

are classied as Type-2 processes.In this case during each of the x 1 executions of the pipeline

stage G s , the process can be put to idle mode for l S G s V l S v V cycles.

Lemma 2 The energy savings of a Type-2 process v G s are independent of x s .

Proof: Let xs be the number of consecutive executions of G s . Thenthe energy savings for process v in xs Ls cycles are:

E s

e v h e xs

Ls

e xs

1 hI l e Gs

hI l e v h hI P e v hI E sm

e v hG

savings due to idle time after the xs th ex. of Gs

g

g e xs 1 hI e l e Gs hI l e v h hI P e v hI E sm e v h}G

savings during the rst xs 1 ex. of G se xs Ls xs l e v h hI P e v hI xs E sm e v h

which means that the energy savings in Ls cycles are

E s S v V R S Ls l S v VG V$ P S v V E sm S v V

Therefore, the energy savings are independent of xs .For both Type-1 and Type-2 processes we need to multiply the

above ndings for Ls by qs to obtain the energy savings in Lcc cycles.{ 97 '

)8 0

) 0 1 ! )5

In Section 2.3 we saw that the energy penalty on an edge is a non-linear, increasing function E S e U b S e VG V with respect to the buffer sizeb S e V . Therefore, it is important to study how the buffer size of a crossedge is affected by the transformation, in order to estimate the energy

penalty.Determining the minimum buffer sizes for a sequential deadlock free schedule has been done in the past [1]. However, in our case wewant to nd the buffer sizes for a given parallel schedule, for whichwe are allowed to make as few assumptions as possible. For thatreason we use formula (2).

A simplistic approach would be to consider the buffer size of across edge to be an increasing function of the x values of the stages.Even though this approach is simplistic it helps us draw some usefulconclusions for the more general cases.

%7 5 )A

8 )

If the input graph is a unirate graph, a more real-istic approach would be to consider the buffer size as the lcm func-tion of the x values of the adjacent stages. In the graph before thetransformation is applied q R 11 y y y 1 and each stage is invoked once

every L cycles. After the transformation the average rate of invoca-tion for each instance remains the same. In lcm S xi U xic

1 V! L cycleswe know that si is executed

lcm xi xi 1 xi times, which correspond to

lcm S xi U xic

1 V instances before the transformation. Moreover, s ic

1 isinvoked lcm xi xi 1 xi 1 times during the lcm S xi U xi

c

1 V L cycles. There-

fore, if sk ic

1j sli , then d 1 N such that k l

k RT Sl

lcm xi xi 1 xi

d 1 V$lcm S xi U xi

c

1 V

xic

1

Because of (2), b S e V R max l S l p S e V k c S e VH V w S e V . For l R d 2lcm xi xi 1

xi , where d 2 X , b S e V is maximized:

be

eh

d 2lcm e xi f xi i 1 h

xi pe

ehI e

d 2 d 1hI

lcm e xi f xi i 1 h xi i 1 c

e

eh g

we

eh

Since after the transformation p S e V R xi and c S e V R xic

1 ,

b e e h d 2lcm e xi f xi i 1 h

xi xi e d 2 d 1 hI

lcm e xi f xi i 1 h xi i 1

xi i 1 g w e e h

b S e V R d 1 lcm S xi U xic

1 V8 w S e V

In this case the buffer size and, because of that, the energy penaltyare increasing functions with respect to the lcm S xi U xi

c

1 V .

("$ 1 & %7 5 )

A8 )

The multirate case is similar to the unirate caseexcept that the entries in the repetition vector need to be taken intoaccount as well. Without the transformation, stage s i completes qi

executions and stage si c 1 completes qi c 1 executions during one com-plete cycle. After the transformation that is not necessarily true.However, for the transformed graph we know that lcm S xi U xi

c

1 U qi U qic

1 V

denes a period during which si completeslcm xi xi 1 qi qi 1

xi and sic

1lcm xi xi 1 qi q i 1

xi 1executions. Solving as for the unirate case, we can

nd that the buffer sizes, b S e V R d 1 lcm S xi U xic

1 U qi U qic

1 V8 w S e V arean increasing function of lcm S xi U xi

c

1 U qi U qic

1 V .Since in all the above cases the information that we have about

each edge is that they are increasing functions of b S e V , we can col-lapse all edges between two stages to one. The new function is given


6/9

by: E i i c 1 p R econnecting i to i+1 E p S e U b S e VH V . Function E i i

c

1 p is also an

increasing function with respect to the buffer sizes on all channelsbetween stages i and i 1.

{ 9

)

$ F

5 5 %

8 %7 %

xmaxFrom the previous sections we can derive the formula for the totalenergy savings E t R s v Gs E s S v V

Si 0 E p

i ic

1 . This is thefunction we want to maximize. In this section we derive a bound onthe x values of the stages to prune the search space of the problem.

We start again from the simplistic case assuming that the buffer sizesare an increasing function of the xis and move to more realistic cases.

We denote as b S i V the buffer sizes of edges that connect stages iand i 1.

Let v G s be a Type-1 process and

C S v V

R

L P S v V l S Gs V$ P S v V E sm S v V P S v V$ & S l S Gs V l S v VG VG V

a constant for that node that depends only on the input graph. Let

C S G V R minv Type-1

C S v V

Then the following Lemma can be proved.

Lemma 3 If G is a unirate graph and b S i V{ R f S xi U xic

1 V are increas-ing functions with respect to both x i and x i

c

1 , and x R x1 x2 y y y x Sis the optimal solution resulting in maximum total energy savings E maxt , then for any 0 1 and x max R

1 C G , there exists x R

x1 x2 y y y x S with k i 1 y y@ x S x : 1 xi xmax , for which the total energy

savings E t are greater or equal to S 1 V$ E maxt .

Proof: Startingwith x R x1 x2 y y y x S we can construct x R x1 x2 y y y x Sby letting:

xi R xi : xi xmax

xmax : xi xmax

Suppose that E maxt are the total energy savings obtained by x and E t are the total energy savings obtained by x .

Energy savings for all Type-2 processes are the same for x and x ,since they are independent of the x values (Lemma 2).

The values of b R f S x0 U x1 V f S x1 U x2 V y y y f S x S 7 U x Sc

1 V are all greateror equal to the values b R f S x0 U x1 V f S x1 U x2 V& y y y f S x S U x S

c

1 V . (Thevalues of x0 and x S

c

1 cannot change as stages 0 and x S x 1 are ex-

ternal). And since E i i c 1P S b S i VG V is also an increasing function of b S i V ,the energy penalty on edges for x are less or equal to the ones for x.

Therefore, the total energy for x is at least as much as for x con-sidering Type-2 processes and cross edges. The energy savings dif-ference for Type-1 processes is the upper bound of the energy differ-ence E maxt E

newt . For any stage si for which xi xmax the energy

difference is again zero. For each v G si with v being a Type-1process and xi xmax , we have from Equation (5) that

E s e vf xi hI E s e vf xmax h xi xmax xi xmax

e E sm e v hI P e v hI G e l e G hI l e v h h h

for L cycles.We want E s v xi } l E s v xmax E s v xi , where is a real number with

0 1, which denotes the desired quality ratio, for all Type-1processes v.

E s v xi l E s v xmax E s v xi

Lemma 1

E s v xi l E s v xmax L P v l l G P v

Equation 5

xi l xmax xi xmax

E sm v l P v } l G l l v @ z

L P v l l G P v @

C v L

P v l G P v E sm v P v l G l v

xi l xmax xi xmax

C v

1 C v

c

1 xi

xmax

xmax it is sufcient:

1 C v G

xmax

where C S

vV

depends on the input graph. For xmax of the graph G weneed to nd C S G V{ R min v Type-1 C S v V .For that xmax the total energy savings E t for x are S 1 V5 E maxt

E t .For example, if the designer chooses R 0 y 05, we can nd xmax

from the input graph and . Then Lemma 3 states that there exists x ,whose entries are all less or equal to xmax and the energy savings for x are E t 0 y 95 E

maxt .

%7 5 )~ t 0 6

C

A similar approach can be followed for b S i V3 R f i S lcm S xi U xi

c

1 VG V , where for all i 1 y y} x S x , f i : X W X is an increasingfunction.

Lemma 4 If G is a unirate graph, f i : X W X is an increasing func-tion, b S i V R f i S lcm S xi U xi

c

1 VG V for each cross edge S i U i 1 V , and x R

x1 x2y y y

x S is the optimal solution resulting in maximum total energysavings E maxt , then for any 0 1 and x max R S

1 C G V

2 , there

exists x R x1 x2 y y y x S with k i 1 y y@ x S x : 1 xi xmax , for which thetotal energy savings are greater or equal to S 1 V$ E maxt .

Proof: First we assume that xi l 1 xmax and xic

1 xmax and showhow we can replace xi of the solution x by xi xmax :

xi R

xi : xi xmaxm n : xi xmax U xg m n xmax

max S m U n V : xi xmax U m n xmax xgm n m n : xi xmax U xg m n

where m R gcd S xi l 1 U xi V , n R gcd S xi U xic

1 V . The value xg is givenby xg R 1

C G, and xmax R xg 2 .

As before, Type-2 process energy savings are independent of thevalues of xi . For Type-1 processes the choice of xg and the fact that xi either is equal to xi or has a value xg implies that the energysavings of x are bounded from below by S 1 V multiplied with theenergy savings of x (Lemma 3).

Therefore, in order to prove the lemma, it is sufcient to showthat lcm S xi l 1 U xi V{ lcm S xi l 1 U xi V and lcm S xi U xi

c

1 V{ lcm S xi U xic

1 V .It is easy to see that it holds for xi xmax . For xi xmax we have:

i. if m n xmax , then xi R max S m U n V , and we can assume xi R mwithout loss of generality. Then,

lcm xi 1 xi I xi 1 xi

gcd xi 1 xi xi 1 & m

m xi 1 max xi 1 xi lcm xi 1 xi

lcm xi xi 1 lcm xi xi 1

m xi 1gcd m xi 1

k m xi 1gcd xi xi 1

mgcd m l n

k mn

m ngcd m l n

k m xi

which is always true because

m ngcd e m f k n h4

m ngcd e m f n hv

lcm e m f n h

xi


7/9

as xi is a common multiple of both m and n. In this case m xg asm n xmax R xg 2 and m xmax as m R gcd S xi l 1 U xi V , with xi l 1 xmax .

ii. if xg m n xmax , then xi R m n and

lcm e xi 1 f xi h xi 1 m n

gcd e xi 1 f m n h xi 1 x i

m xi 1 xi

m lcm e xi 1 f xi h

The same way we can prove lcm S xic

1 U xi V lcm S xic

1 U xi V

iii. if m n xg , then xiR

xgm n m nIn this case:

lcm e xi 1 f xi h xi 1 xi

gcd e xi 1 f xi h xi 1 x i

m xi 1 xi

m lcm e xi 1 f xi h

The same way we can prove that lcm S xi U xic

1 V lcm S xi U xic

1 V .The proof so far works for the case of a single value xi xmax

with xi l 1 and xic

1 less or equal to xmax . If instead of a single value xithere is a number of consecutive values xl

c

1 UG y y y U xr l 1 that are greaterthan xmax with xl U xr xmax , we follow the procedure above using xl as xi l 1 and xr as xi

c

1 . The value xi given by the procedure is thevalue of all xl

c

1 UH y y y U xr l 1 . From the above we know that lcm S xl U xlc

1 V!

lcm S xl U xlc

1 V and lcm S xr l 1 U xr VD lcm S xr l 1 U xr V . For all other crossedges with l k r : lcm S xk U xk

c

1 V R xi xmax lcm S xk U xk c

1 V

since both xk and xk c

1 are greater than xmax . Values xl and xr alwaysexist as x0 R x S

c

1 R 1.

("# 1 & %7 5 ) t 0 6

C

For multirate graphs the buffer sizes dependon the q values. Using a similar approach as for unirate graphs couldresult in describing xmax as a function of q. The q values though cangrow exponentially with the input graph [10] and, therefore, a moregeneral method is needed to derive xmax .

From Lemma 1 we can derive a bound on the energy savings thatcan be achieved. Let E s S V be the sum of the savings for Type-1 pro-cesses when x goes to innity, and E s S 1 V when x R 1. Also let E p S 1 Vbe the value of the energy penalty when x R 1. Let ymaxi be the mini-

mum value, for which the energy penalty becomes E i i c 1 p S e U ymaxi V{ E s S V E s S 1 V E p S 1 V . We know that increasing yi to a value greaterthan ymaxi can cause only energy loss, since the savings cannot be-

come greater than E s S V and E i i

c

1 p S e U yi V is increasing with respect

to yi . Therefore, any yi ymaxi causes an energy penalty that exceedsany energy savings obtained by the Type-1 processes.

Since the energy penalty for all edges is already given (mostprobably in form of an array of values), binary search can be ap-plied to each of the S x S x 1 V functions E i i c 1 p to nd ymaxi . Thebinary search procedure can start with a very large value Y as themaximum value for y that is determined by computational precisionlimits or area constraints. We know that yi R lcm S xi U xi

c

1 U qi U qic

1 V .Since we also have xmax i lcm S xmax i U xi

c

1 U qi U qic

1 V# R yi and xmax i

lcm S xi l 1 U xmax i U qi l 1 U qi V# R yi l 1 , it holds xmax i min S yi l 1 U yi V . If foreach stage i xmax i R min S yi l 1 U yi V , then xmax R max i S xmax i V can bechosen as the maximum value for the whole design. Any increase of x above that value for any of the stages causes energy loss comparedto the case, in which all x values are 1.

Lemma 5 For the optimal solution x of the multirate problem the following property holds: k i 1 y y@ x S x : 1 xi xmax .

Proof: Suppose xini R 1 1 y y y 1 . Suppose that x is the optimal solu-tion and xi x : xi xmax . Then from the above discussion it holds E t S x V E t S xini V , which is a contradiction.

This method can be applied to a unirate graph as well, and, there-fore, we use it in conjunction with the approaches for unirate graphsdescribed above. We use the minimum of the two xmax values pro-duced. The running time of the binary search method describedabove is O S x S xH log Y V .

Algorithm DP-for x valuesInput : A chain structured SDF graph G e S f E h representingthe pipeline stages, a quality metric which will be used

if G is unirate, and functions E i i i 1

p e xi f xi i 1 h returning theenergy overhead for cross edges between stages i and i+1.Output : Two arrays xbest n S n f n S n f xmax f xmax and es n S n f n S n f xmax f xmax

from which the optimal solution can be extracted.InitEs(G);

xmax DetermineXmax(G, );for i

1 to x S x do // main diagonal d R 0for xi l 1 1 to xmax do

for xic

1 1 to xmax dofor xi 1 to xmax do

enews E is S xi V E

i l 1 i p S xi l 1 U xi V E

i ic

1 p S xi U xi

c

1 Vif S es i U i U xi l 1 U xi

c

1 enews V thenes i U i U xi l 1 U xi

c

1 enews xbest i U i U xi l 1 U xi

c

1 xifor i

1 to x S x 1 do // init step for d R 1for xi l 1 1 to xmax do

for xic

2 1 to xmax dofor xnode 1 to xmax do

// xnode represents x i and x ic

1 in this loopenew1s es i U i U xi l 1 U xnode E

ic

1s S xnode V

E i c 1 i c 2 p S xnode U xic

2 V

enew2s es i 1 U i 1 U xnode U xic

2 E is S xnode V

E il 1 i

p S xi l 1 U xnode Venews max S e

new1s U e

new2s V

node

S enew1s enew2s V

if S es i U i 1 U xi l 1 U xic

2 # enews V thenes i U i 1 U xi l 1 U xi

c

2 enews xbest i U i 1 U xi l 1 U xi

c

2 S node U xnode Vfor d

2 to x S x 1 do //diagonal count for i

1 to x S x d do j

i d for k

1 to j i 1 dofor xi l 1 1 to xmax do

for x jc

1 1 to xmax dofor

xi c k

1to

xmaxdo

enews es i U i k 1 U xi l 1 U xic

k E i c k s S xic

k VG

es i k 1 U j U xic

k U x jc

1if S es i U j U xi l 1 U x j

c

1 # enews V thenes i U j U xi l 1 U x j

c

1 enews xbest i U j U xi l 1 U x j

c

1 S i k U xic

k VReturn xbest U es ;

Figure 6: Pseudocode for the dynamic programming algorithm

2

5

0 % 0 %

F

# 1 "8 & %

In this section we describe a dynamic programming algorithm whichcan determine the x values for maximum energy savings given a qual-ity metric. The algorithm is needed because the size of the solutionspace is still large after bounding the x values with xmax . Exhaustivesearch requires O S xmax S V steps to nd the x values for maximumenergy savings.

In Figure 6 the algorithm can be seen. The inputs are the graphwhich is partitioned in pipeline stages and a quality metric in casethe graph is unirate. After initialization, the algorithm determines xmax using the procedures described in Section 4. The purpose of therest of the algorithm is to solve independently the problem for eachsubchain and combine the solutions to nd the optimal solution forthe chain-structured graph.

The intuition behind the DP solution is that the values xi l 1 and


8/9

Final

(i,j)

j

i

1 |S|1

|S|

i j |S|1 i+1i-1 j+1... ... ...

xi-1

x j+1

11

xmax

xmax

Figure 7: Dynamic Programming Algorithm. The solution for sub-chain S i U j V depends only on values xi l 1 and x j

c

1 . In a xmax 2 arraythe best conguration of S i U j V is stored for each value combinationof xi l 1 and x j

c

1 .

x jc

1 are the only external values that can affect the optimal solutionfor a subchain from stage i to stage j . More specically, if i UG y y y U j is asubchain with1 i j x S x , the best conguration for thissubchain,i.e. the vector of x values xi UH y y y U x j that provides maximum energysavings, depends only on the x values of the stage exactly beforethe subchain, i.e., xi l 1 , and the stage after the subchain, i.e., x j

c

1 ,

(Figure 7). Therefore, an xmax 2 matrix can be constructed storing themaximum energy savings that can be obtained for that subchain foreach value of the pair S xi l 1 U x j

c

1 V . Such a matrix can gradually bebuilt for all possible subchains of the problem. This array is denotedas es x S x z x S x xmax xmax in the algorithm of Figure 6. Asan example,element es i U j U xi l 1 U x j

c

1 holds the best conguration for subchainstarting at stage i and ending at stage j, when the x value for stagei 1 is xi l 1 and for stage j 1 it is x j

c

1 .After nding xmax the algorithm starts by creating the es array

for subchains of length 0. The entries lled during this phase arethe ones on the main diagonal of the simplied array es of Figure 7.For each es i i the xmax 2 matrix is built from the energy savings forstage i and the energy penalty of both cross edges S i 1 U i V& U S i U i 1 Vfor that stage. In the second phase the algorithm lls the entriesfor subchains with two elements. Finally, in the third phase the en-

ergy savings for all remaining subchains are found. The reason forthe separate treatment of subchains with two and more than two el-ements is to make sure that the energy penalty for the same crossedge is not taken twice into account. The maximum energy savingsfor the whole graph are stored at position es 1 x S x H 1 1 . This entryrepresents the whole chain with x0 R x S

c

1 R 1. As mentioned be-fore, we assume that stages s0 and s S

c

1 are external and we haveno control over them. Therefore, their x values remain 1. Array xbest z x S x U x S x U xmax U xmax stores the decision taken at each step and in-formation necessary to retrieve the optimal solution.

The algorithm searches all possible values from 1 to xmax for x,at each subproblem and, therefore, it solves each subproblem opti-mally. Moreover, since the subproblems are independent, the algo-rithm nds the solution with the maximum total energy savings forall 1 xi xmax .

At each step the algorithm computes the energy savings and en-ergy penalty using functions E s and E p. The function for the energysavings can be implemented as described in Section 4. More speci-cally, during initialization, i.e. InitEs(G) step, we can nd the energysavings for the Type-2 processes, which are independent of x and,therefore, we do not need to recompute them during the iterations of the algorithm. For Type-1 processes of each stage s we can use thefollowing formula to nd the energy savings for a specic x

E s S x V R v G s

Ls P S v Vx 1

x v G sl S G s V$ P S v V

1 x v Gs

S l S v V$ P S v V5 E sm S v VH V

It is clear from the equation above that all summations can be com-puted during the initialization step (InitEs). Then E s can be computedin constant time for each new value of x. It is assumed that the func-tions E p are given by the user in the form of an array and, therefore,the energy penalty for a pair of x values can be returned in constanttime. Consequently, the algorithms complexity is O S x S x 3 xmax 3

x S xH log Y V and its memory space requirements are O S x S x2

xmax2

V .

Theorem 1 The solution found by the dynamic programming algo-rithm produces total energy savings E algt S xmax V , which are at least

S 1 V! E maxt , if the energy penalty at the cross edges S i U i 1 V isan increasing function of both x i U xi

c

1 and x max is given by x max R1

C G .

Proof: From Lemma 3, we know that there is at least one solutionfor which k i U 1 xi xmax . for which E t S 1 V$ E maxt . Since thealgorithm produces the maximum energy savings for all solutionswith 1 xi xmax , E

algt E t E

algt S 1 V$ E maxt .

Theorem 2 The solution found by the dynamic programming algo-rithm produces total energy savings E alg

t S xmax V , which are at least

S 1 V# E maxt , if the energy penalty at the cross edges S i U i 1 V is anincreasing function of lcm S xi U xi

c

1 V and x max is given by x max R xg 2 , xg 1 C G .

Proof: Can be proven similarly to Theorem 1 using Lemma 4.

Theorem 3 The solution found by the dynamic programming algo-rithm produces total energy savings E algt S xmax VD R E

maxt , if the en-

ergy penalty at the cross edges S i U i 1 V is an increasing function of lcm S xi U xi

c

1 U qi U qic

1 V and x max is given by the binary search proce-dure described above for multirate graphs.

Proof: Let x be the optimal solution. Then k i with 1 i x S x , itholds that 1 xi xmax (Lemma 5). Since the algorithm nds theoptimal solution within that space, it nds the optimal solution to theproblem.

# 8 6# )8 %7 ) 1 | )5 "# 1

The algorithm was implemented as a C++ program taking consistentgraphs as an input and determining the x values for each pipelinestage. For the experiments we normalized the power of all compo-nents using the static power of the 32-bit latch. The static power of 32-bit output multipliers was set to 25 and the 32-bit cla adders 4times that of the latch. The static power of the decoding logic for thechannels was considered at the same level as the static power of thelatches. On the channels the dynamic power increase with x, causedby the extra wiring and control, was considered 50% of the staticpower increase. The switching mode overhead was considered equalwith the energy savings obtained by 10 cycle time slack.

We applied the algorithm on three pipelined architectures. The

rst is the CD to DAT sample-rate conversion graph adopted from [9].Each of the 3 SDF actors of the multirate graph was considered apipeline stage. The FIR lters were assumed to be 4-tap lters imple-mented with multipliers. Upsampling, ltering, and downsamplingunits were considered independent processes forming together onestage. So, in total there were 3 stages in the rst level of hierarchyexecuting in parallel. The second application was the unirate 10-stage pipelined K-means clustering with euclidean distances adoptedfrom [3]. The third was a 3-stage pipelined architecture for K-meansclustering. In the latter case the rst two, intermediate four, and lastfour pipeline stages of the 10-stage pipelined K-means were merged


9/9

tech report nwu-eecs-07-09: a dynamic-programming algorithm for reducing the energy consumption of...

Documents