improving the energy behavior of block buffering using compiler … · 2008-03-11 · improving the...

Improving the Energy Behavior of BlockBuffering Using Compiler Optimizations

M. KANDEMIR

Pennsylvania State University

J. RAMANUJAM

Louisiana State University

and

U. SEZER

University of Wisconsin

On-chip caches consume a significant fraction of the energy in current microprocessors. As a result,

architectural/circuit-level techniques such as block buffering and sub-banking have been proposed

and shown to be very effective in reducing the energy consumption of on-chip caches. While there

has been some work on evaluating the energy and performance impact of different block buffering

schemes, we are not aware of software solutions to take advantage of on-chip cache block buffers.

This article presents a compiler-based approach that modifies code and variable layout to take

better advantage of block buffering. The proposed technique is aimed at a class of embedded codes

that make heavy use of scalar variables. Unlike previous work that uses only storage pattern

optimization or only access pattern optimization, we propose an integrated approach that uses

both code restructuring (which affects the access sequence) and storage pattern optimization (which

determines the storage layout of variables). We use a graph-based formulation of the problem and

present a solution for determining suitable variable placements and accompanying access pattern

transformations. The proposed technique has been implemented using an experimental compiler

and evaluated using a set of complete programs. The experimental results demonstrate that our

A preliminary version of this article appeared in Proceedings of the ACM/IEEE InternationalSymposium on Low Power Electronics and Design (ISLPED), ACM, New York, 2001. This article

enhances the preliminary version by explaining how the proposed technique interacts with register

allocation and by presenting more experimental data.

M. Kandemir is funded in part by National Science Foundation (NSF) grant CCR-0093082 and

by the Pittsburgh Digital Greenhouse through a grant from the Commonwealth of Pennsylvania,

Department of Community and Economic Development.

J. Ramanujam is supported in part by NSF grant CCR-007380 and NSF Young Investigator Award

CCR-9457768.

Authors’ addresses: M. Kandemir, Pennsylvania State University, Department of Computer Sci-

ence and Engineering, University Park, PA 16802; email: [email protected]; J. Ramanujam,

Louisiana State University, Department of Electrical and Computer Engineering Baton Rouge,

LA 70803; email: [email protected]; U. Sezer, University of Wisconsin, Department of Electrical and

Computer Engineering Madison, WI 53706; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is

granted without fee provided that copies are not made or distributed for profit or direct commercial

advantage and that copies show this notice on the first page or initial screen of a display along

with the full citation. Copyrights for components of this work owned by others than ACM must be

honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,

to redistribute to lists, or to use any component of this work in other works requires prior specific

permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515

Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected]© 2006 ACM 1084-4309/06/0100-0228 $5.00

ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 1, January 2006, Pages 228–250.

Improving the Energy Behavior of Block Buffering • 229

approach leads to significant energy savings. Based on these results, we conclude that compiler

support is complementary to architecture and circuit-based techniques to extract the best energy

behavior from a cache subsystem that employs block buffering.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—Caches memo-ries; D.3.4 [Programming Languages]: Processors—Compilers, optimization

General Terms: Design, Performance

Additional Key Words and Phrases: Energy optimizations, compiler transformations, block buffer-

ing, embedded systems, data cache

1. INTRODUCTION

On-chip caches are a major source of energy consumption in current micropro-cessors. For example, Edmondson et al. [1995] report that the on-chip cache inDEC Alpha 21264 consumes approximately 25% of the on-chip energy. Circuitand architectural techniques that implement alternative cache organizations(e.g., sub-banking, bitline segmentation, and block buffering [Kamble andGhose 1997]) have been shown to be very effective in reducing the energyconsumption of on-chip caches. A block buffer is a small line buffer insertedbetween the processor and the first-level on-chip cache to hold the mostrecently accessed cache line (block). If the cache line that resides in the blockbuffer is accessed again, this request can be satisfied from the block buffer, andthe main on-chip cache (both data and tag arrays) can be disabled during thisaccess to save energy. While previous research [Su and Despain 1995; Ghoseand Kamble 1999; Kin et al. 1997; Esakkimuthu et al. 2000] investigateddifferent block buffering schemes and evaluated their impact on energy andperformance, to the best of our knowledge, no previous study consideredcompiler support for block buffering.

In this article, we present a compiler-based approach that modifies codeand variable (data) layout to take better advantage of block buffering. Theapplication domain that this technique targets includes a class of embed-ded codes that make heavy use of scalar variables. That is in contrast to avast amount of locality-oriented work done for array-dominated applicationsin the optimizing compiler area (see, for example, Wolfe’s book [Wolfe 1996]and the references therein). While previous work such as that of Panda et al.[1997] and Panda [1998] addresses optimizations for improving cache local-ity of scalar-dominated codes, their approach is limited to variable placement(called storage pattern or storage sequence optimization in this article). In com-parison, the approach proposed in this paper employs both storage patternand access pattern optimizations. Specifically, this article makes the followingcontributions:

—It presents an access pattern (access sequence) optimization technique tomaximize the benefits of block buffering for data accesses in scalar-dominatedembedded codes;

—It presents a unified (integrated) optimization strategy that employs bothaccess pattern and storage pattern (variable placement) optimizations formulti-basic block codes; and

ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 1, January 2006.

230 • M. Kandemir et al.

—It quantifies the benefits from the proposed techniques over a straightforwardblock buffering scheme that does not make use of any compiler support usingfour complete programs.

Our experimental results indicate that the proposed techniques improve datacache energy consumption by 45.8% on average, and by as much as 49.7% insome cases when no register allocation is performed. In the existence of an ag-gressive register allocation, the percentage energy benefits range from 17.0% to43.9% for data cache, and from 6.8% to 18.8% for the entire data memory sys-tem. Based on these results, we believe that compiler support is complementaryto architecture and circuit-based techniques to extract the best energy behaviorfrom the cache subsystem equipped with a block buffer.

The remainder of this article is organized as follows. Section 2 gives back-ground material on block buffering and compiler optimizations. Section 3 de-scribes the problem and Section 4 formulates it on a graph-based structure.Section 5 gives a solution strategy for a single basic block (a straight line ofcode without branching/conditionals) case, and Section 6 presents a solutionfor multiple basic block case (i.e., whole procedure). Section 7 discusses exten-sions to our basic scheme and Section 8 presents experimental data showing theeffectiveness of the proposed technique. Section 9 concludes with a summaryand an outline of the planned future work.

2. BACKGROUND

2.1 Block Buffering

Block buffering is an extension of conventional cache architectures that usessmall (one-block wide) line buffers. The key idea is to keep the most recentlyaccessed cache line (block) in the block buffer so that the following requestcan access the data from the block buffer if it targets the same cache line (thisclearly depends on the block-level data locality exhibited by the application). Asnoted by Ghose and Kamble [1999], this not only saves accesses to data arraysof the on-chip cache but, at the same time, also saves the access to the tagarrays. Su and Despain [1995] propose a single block buffer structure; Ghoseand Kamble [1999] extend this structure to multiple buffers (for set-associativecaches), and report as much as 75% savings in power dissipation as comparedto a conventional cache architecture when block buffering is combined withother energy-saving hardware optimizations. Bellas et al. [1998] propose theuse of a cache that sits between the CPU and the instruction cache to improveperformance and energy consumption.

A simplified block buffering scheme is depicted in Figure 1. This is similarto the architecture proposed by Ghose and Kamble [1999]. An access to theblock buffer-augmented cache is performed in two cycles but at the rate of oneaccess per cycle using a two-phase clock. In the first cycle, the last set number iscompared to the corresponding field of the address issued by CPU to determineif the current access is to the same set as the previous one. If it is, the tag anddata array sensing are disabled. Otherwise, the selected set is latched in andthe set number for the current access is moved into the latch that holds the



Fig. 1. Operation of block buffering.

set number for the last access. In the second cycle, normal tag comparison isdone and if it succeeds, word multiplexing is performed as in the case of normalcaches.

2.2 Compiler Optimizations for Data Locality

A vast majority of research in compiler optimizations has targeted array-dominated codes using source-level transformations to make better use ofcaches. Among the techniques investigated are loop restructurings (e.g., looppermutations and tiling [Wolfe 1996]), data transformations (e.g., dimensionreindexings [Kandemir and Ramanujam 2000], integrated data and loop trans-formations [Kandemir et al. 1998, 1999; Kulkarni et al. 2000, 2001], dynamicmemory layout modifications [Mellor-Crummey et al. 1999]), and locality en-hancing transformations for pointer-intensive codes [Chilimbi et al. 2000]. Animportant characteristic of the techniques in the first two groups mentionedis that they focus on multilevel nested loops and exploit the regularity due toaffine expressions in array subscript functions. Consequently, it is possible tomodel the problem using linear algebraic representations [Wolfe 1996] and/orpolyhedral tools [Kelly et al. 1995], and solve it using linear transformationsof iteration and/or data spaces [Catthoor et al. 1998]. The codes that are tar-geted by the approach presented in this paper, however, are quite different fromthose as they involve sequences of scalar assignments and conditional flow ofcontrol. Thus, they do not lend themselves to polyhedral formulations; instead,a graphical representation might be preferable.

A unique aspect of our approach is that the unified strategy that we discusscan optimize an entire procedure that might consist of multiple basic blocks.It operates on a control flow graph (CFG) representation of the procedure, andpropagates information (obtained through compiler analysis as explained later)between basic blocks to obtain a globally (procedure-wide) acceptable solution.Our experimental results show that such a global approach is superior to thetechniques that optimize each basic block in isolation.



3. PROBLEM DESCRIPTION

In this section, we describe the problem addressed in this article. UntilSection 7, we assume that no register allocation is performed and all variablesaccessed from memory (through cache). We also assume the the architectureor the back-end compiler does not modify the order of load/store operations.In Section 7, we discuss in detail how our approach interacts with registerallocation.

The success of a block buffering scheme is strongly dependent on the variableaccess pattern. For example, consider the following code fragment that consistsof two assignment statements:

a = c + a + bb = c + b + a

Assuming that the addition operation (+) is left-associative, the variable accesspattern (access sequence) imposed by these two statements is c,a,b,a, c,b,a,b.If we have a single block buffer of one scalar element (word) wide, such anaccess pattern will not take advantage of block buffering as no variable in thesequence is accessed twice in a row. Note, however, that using commutativityand associativity properties of the addition operation, this program fragmentcan be rewritten as:

a = c + b + ab = a + c + b

The new access sequence is c,b,a,a,a,c,b,b. While this transformed code frag-ment is semantically equivalent to the original one, it exploits the mentionedblock buffer much better as it has a total of three repetitions: two for variable a(in subsequence a,a,a) and one for variable b (in subsequence b,b). Therefore,we can expect three block buffer hits (i.e., data accesses satisfied from the blockbuffer) from this new sequence. Now, consider the following code fragment:

a = 1c = c - 1d = a + b

The original access sequence is a,c,c,a,b,d and an optimization scheme thatis limited to intrastatement transformations (e.g., those exploit commutativityand associativity) cannot generate a better sequence. However, if we apply aninterstatement transformation that interchanges the order of the second andthe third assignments, we obtain the following fragment:

a = 1d = a + bc = c - 1

This new fragment leads to an access sequence of a,a,b,d,c,c which is betterthan the original one as the second access to variable a in this sequence resultsin a buffer hit. These two examples show that access sequence optimizations cangenerate improvements (over original codes); that is, they increase the numberof hits in the block buffer.



Let us now assume that our block buffer can hold k (> 1) variables (i.e., aline size of k elements). In this case, given a code fragment, we can increasethe block buffer hit rate (which is defined as the ratio between the number ofblock buffer hits and the total number of data accesses) by using both accesssequence and storage sequence (variable layout) optimizations. Consider thefollowing code fragment, assuming that k = 2 and that the storage order of thevariables is a,b,c,d, the first variable being at the head of a cache line, that is,aligned to the cache line boundary.

a = b + dc = c + d + ad = b + ad = a + c + d

The original access sequence is b, d, a, c, d, a, c, b, a, d, a, c, d, d.For the given storage sequence a, b, c, d, the block buffer hit rate is 4/14; thehits are due to the subsequences c, d, b, a, c, d and d, d, which map ontothe same cache line, and therefore, the block buffer.

Now, we will illustrate the effects of (i) optimizing only the storage sequence;(ii) optimizing only the access pattern (e.g., changing the access sequence byexploiting properties of the computation such as operator commutativity andassociativity); and (iii) combining (integrating) access and storage sequenceoptimizations.

Optimizing only the Storage Sequence. Without making any changes tothe given access sequence, if we change the storage sequence to a, d, b, c(derived using the approach due to Panda et al. [1997]), we get the best(among those that change only the storage sequence) block buffer hit rateof 6/14.

Optimizing only the Access Sequence. Given the original storage sequence ofa, b, c, d, consider the following (equivalent) transformed code:

a = d + bc = a + c + dd = b + ad = a + c + d

The right-hand side expressions in the first two statements have been re-ordered to improve the block buffer hit rate. The transformed (changed) ac-cess sequence is d, b, a, a, c, d, c, b, a, d, a, c, d, d. The best hitrate possible for this new access sequence is 7/14, which for example, can berealized using the storage sequence a, b, c, d.

Optimizing both the Access Sequence and Storage Sequence. Starting fromthe original code, consider the use of the storage sequence a, d, b, c togetherwith the following (equivalent) transformed code:

a = b + dc = a + d + cd = b + ad = a + c + d



The transformed (changed) access sequence is b, d, a, a, d, c, c, b, a,d, a, c, d, d. With the storage sequence a, d, b, c, we get a hit rate of8/14. Note that, this hit rate is higher than the best achievable for the givencode segment using either only storage sequence or only access sequence op-timizations. Thus, in general, an integrated framework that uses both accessand storage sequence optimizations can be very useful in practice.

We can conclude from these examples that in order to increase the blockbuffer hit rate, we need to maximize the number of consecutive accesses to agiven line. This can be done by modifying the order of variable access (accesssequence optimization), by modifying the storage order of variables in memory(storage pattern optimization), or by a combination of these (unified optimiza-tion). Note that, when k is one, access sequence optimization is the only option.

4. REPRESENTATION

It should be emphasized that the access pattern optimization problem is impor-tant for two major reasons. First, it is the only available optimization strategywhen the cache line is single element wide or when storage (variable layout)optimizations are not applicable. Second, the access pattern optimization strat-egy can be used as a part of a global optimization framework that uses bothstorage pattern and access pattern transformations. Section 6 discusses suchan optimization framework that can handle an entire procedure.

We represent the problem of access pattern optimization using a graph-basedstructure and solve it using a longest path algorithm. We define two data ele-ments as neighbors if they map on the same cache line and the distance betweenthem (in memory) is less than k (the cache line size) elements. Note that theneighboring elements are brought into cache (when they are accessed) andhence into the block buffer at the same time. We also define virtual lines inmemory, each of which holding a group of neighboring elements.

Given a basic block (i.e., a block of sequential assignment statements withouta branch except maybe at the end of the sequence [Aho et al. 1986]), we use alayout transition graph (LTG) to show the connections between elements thatare mapped on the same cache line. Specifically, a layout transition graph ofa basic block is a directed graph LT G(V , E) where each node (vertex) vi ∈ Vrepresents the occurrence of a variable in the basic block, and a bi-directionaledge e = (vi, vj ) ∈ E from a node vi to a node vj indicates that the variablesrepresented by vi and vj are neighbors (i.e., they are in the same virtual line).An LTG also contains an edge between vi to vj if these two nodes represent theoccurrences of the same variable.

For ease of exposition, we divide a given LTG into layers, each of whichcorresponding to an assignment statement in the basic block. If the basic blockcontains K statements, each variable vi in the j th statement from top (denotedsj where 1 ≤ j ≤ K ) is assumed to belong to the variable set of sj ; we expressthis concept using the notation vi ∈ sj . When there is no confusion, we willabuse the notation sj to denote both the statement and its variable set.

A given variable set si can also be divided into two logical subsets: onethat contains the variable on the left-hand side (LHS), and one that contains



the variables on the right-hand side (RHS). For a variable set si, the first setis denoted by siL and the second set is denoted by siR . As will be discussedlater in this section, the edges of the LTG can be used to traverse the nodes inthe graph, which, in turn, corresponds to intrastatement and interstatementtransformations.

We define a traversal of (the variables in) a given LTG as a set of pathsthat collectively visit each and every node only once without mixing accessesto the variables from different statements. While a given LTG shows the stor-age connections (relations) between the variables in the basic block, it doesnot dictate any traversal. Note that a traversal of the LTG corresponds to anordered sequence of variable accesses (i.e., access pattern). Consequently, dif-ferent traversals correspond to different access patterns. It is well known fromdata dependence theory [Wolfe 1996] that there are some traversals (of vari-ables during execution) that are not valid (i.e., semantically correct). To elimi-nate some of the invalid traversals from the LTG, we constrain it by eliminatingany edge e = (vi, vj ) ∈ E if going from vi to vj during the program execution(i.e., touching the variables represented by vi and vj immediately one after an-other) does not preserve the original semantics of the code. This pruned LTGis called constrained LTG (CLTG) in this article and is the main data structurewhich the optimizations we employ operate on.

To illustrate how a LTG and a CLTG are constructed, let us consider thefollowing code fragment (basic block):

a = i + b + 3je = g + h + kb = 2d + 4a - 5k = l + f

Let us assume that the variables are stored in memory in the order ofa,b,c,d,e,f,g, h,i,j,k,l and that k = 4. Consequently, we have three vir-tual lines: {a,b,c,d}, {e,f,g,h}, and {i,j,k,l} (We assume perfect alignment).Figure 2(i) shows the four layers corresponding to four statements in thisfragment. Each layer is delimited using dashed lines and corresponds to anindividual statement. For example, labeling the first statement as s1, we haves1L = {a} and s1R = {i,b,j}. Given the storage sequence (virtual line mapping)above, Figure 2(v) shows the LTG for this code fragment. Note that there isa bidirectional edge between two nodes whenever the corresponding variablesare neighbors (i.e., they reside in the same virtual line). Figures 2(ii), (iii), and(iv), on the other hand, show the contributions (that can be called sub-LTGs)coming from the three virtual lines (group of neighbors) mentioned above. Itshould be noted that a different alignment in memory (virtual line mapping)would generate a totally different LTG.

We see that the LTG shown in Figure 2(v) is very dense. However, assumingthat the statements in the fragment will not be broken into sub-statements andthat the variable accesses from different statements are not mixed, a simpleanalysis of the code fragment reveals that many of the edges in this LTG cannotbe traversed by a legal access pattern. For example, there is no way that the edgefrom i to k be taken by any given schedule (access pattern) as a LHS variableneeds to be touched between these two variables. Data dependence constraints



Fig. 2. (i) Four layers (corresponding to four statements) for a basic block. (ii–iv) sub-LTGs induced

by different virtual lines. (v) the overall LTG. (vi) the corresponding CLTG. (vii) four representative

paths in the CLTG. (viii) the final traversal of nodes.

might also help one to eliminate a number of edges (e.g., the one from a in thefirst statement to d in the third statement). Eliminating all these illegal (un-acceptable) edges (transitions) gives us the CLTG shown in Figure 2(vi). Notethat, the graph in Figure 2(vi) contains very few edges, compared to Figure 2(v).

The optimization process described in the next section operates on the CLTG.Before moving to the optimization phase, let us first formalize the constraintsthat allow us derive a CLTG from a given LTG. A constrained layout transitiongraph, written CLT G(V ′, E ′), is a subgraph of the LT G(V , E) such that V ′ =V and E ′ contains all the edges in E except those that can lead to an incorrector infeasible code transformation (transition). Note that the construction ofthe CLTG subsumes both the intrastatement constraints (i.e., evaluation rulesthat need to be obeyed when processing the RHS expression) and the inter-statement constraints (i.e., data dependence and other constraints betweendifferent statements). In mathematical terms, to build the CLTG, the followingedges of the LTG should be dropped:

—Any edge (vi, vj ) ∈ E such that vi ∈ skR , vj ∈ sk′ R with k �= k′

—Any edge (vi, vj ) ∈ E such that vi ∈ skR , vj ∈ sk′L with k �= k′

—Any edge (vi, vj ) ∈ E such that vi ∈ skL and vj ∈ sk′L, where k �= k′ andsk′ R �= ∅

—Any edge (vi, vj ) ∈ E such that traversing this edge would break expressionevaluation rules or data dependence.

While a CLTG shows only the legal edges, this does not necessarily meanthat traversing the shown edges will always result in correct codes. For ex-ample, accessing two nodes, vi and vj consecutively, itself may not break anydependence; however, after this access sequence, it may not be possible to gen-erate legal code due to the a new restriction (in the access order) resulting fromthe said transition between vi and vj . Note also that this may not be detectable



during the construction of the CLTG since we do not know at that point whatthe access sequence (traversal order) will be (e.g., whether the edge (vi, vj ) willactually be visited by the traversal).

5. SOLUTION

We formulate the problem of modifying a given basic block code for effective useof the available block buffer of k elements as one of determining a path cover[Cormen et al. 1990] and a traversal order on the CLTG. To generate correctcode (i.e., to preserve the semantics of the basic block), we impose the followingconditions on the traversal order:

—Each node in the CLTG (i.e., a variable occurrence in the basic block) shouldbe visited.

—For a given layer in the CLTG corresponding to the statement sk , all nodesin skR should be visited before the node in skL.

—Once the traversal reaches a layer corresponding to the statement sk , itshould finish all variables in that layer (i.e., the set skL ∪ skR) before movingto another layer.

—All data dependences and other restrictions such as latency constraints orexpression evaluation constraints should be preserved.

Based on these observations, our approach determines a traversal order andduring the traversal it also transforms the underlying code fragment (basicblock). The objective of the traversal (and that of transformation) is to minimizethe cost of traversal. In our context, the traversal cost is defined as the number oftransitions (i.e., successive variable accesses) that do not have a correspondingedge in the CLTG. This is because each such transition accesses two variables(one after another) that reside in different virtual lines, and consequently, thesecond access (in the transition) cannot exploit the data currently residing inthe block buffer (i.e., results in a block buffer miss). Recall that we assumea single block buffer that can hold k consecutive elements. Therefore, oneway of minimizing the cost is to traverse as many edges from the CLTG aspossible.

In the following, we present the description of an algorithm that takes asinput a CLTG and generates as output a traversal (an access sequence) and allthe necessary (interstatement and intrastatement) transformations to obtainthis access sequence. Given a CLTG, the algorithm first detects the longest di-rected path (i.e., the path that contains the maximum number of edges in thesame direction). It then transforms the portion of the CLTG (which contains asubset of the statements in the original basic block) in accordance with thislongest path. Finding the longest path in a given directed graph is straight-forward, and takes O(N 3) time where N is the number of nodes in the graph[Cormen et al. 1990]. Transforming the program code in accordance with thelongest path is more challenging as explained next.

Let us now discuss the transformations imposed by a path. Figure 3 showsthe situations that may demand code (access sequence) transformations dur-ing the optimization process. The first situation corresponds to the case where



Fig. 3. Example cases that might require code transformations.

there is an edge from a node vi in skR to the node vj in skL. In this case, theRHS node in question should be made the last node accessed on the RHS (ifit is not already the last node accessed on the RHS). The second situation isthe one in which we have an edge (in the CLTG) from the LHS node vi of astatement sk to a node vj in sk′ R of statement sk′ where k �= k′. In this case, wemay need two types of transformations: first, if sk′ and sk are not consecutive,that should be made consecutive (an inter-statement transformation); second,if vj is not the first variable accessed in the sk′ R , it should be made so (an in-trastatement transformation). The third situation corresponds to the case inwhich we have an edge from a node vi in skR to another node vj in skR . Inthis case, we need an intra-statement transformation that should bring thesetwo nodes together. In cases where a variable vj is needed to be both the firstvariable accessed in sk′ (due to the edge from sk to sk′ ) and be the last variableaccessed (due to the intra-statement edge in sk′ ), we need a conflict resolutionscheme.

We illustrate the operation of this transformation mechanism using the ex-ample in Figure 2. Recall that in this example we have three virtual lines:{a,b,c,d}, {e,f,g,h}, and {i,j,k,l}. The longest path in the CLTG in Fig-ure 2(vi) is marked (1) in Figure 2(vii). It contains two nodes (b and a) fromthe first statement and three nodes (a, d, and b) from the third statement.In order to realize this path (i.e., to touch the variables at runtime in the or-der indicated by this path), variable b in the first statement should be thevariable that is accessed last on the RHS (just before the LHS variable a),the third statement should be moved to just below the first statement (aninter-statement transformation) to ensure successive accesses to variable a,the variables a and d (in the original third statement) should be accessed oneafter another (already satisfied as there are only two variables on the RHS),and d should be the last variable accessed on the RHS. Note that after per-forming these transformations the access orders for variables i and j (in thefirst statement) are also fixed. The approach next moves to the second longestpath (marked (2) in Figure 2(vii)) and performs the transformations indicatedby it. It is important to stress that the transformations that would be per-formed based on the second (longest) path cannot override (nullify) those per-formed based on the longest (previously optimized) path. Note that after pro-cessing path (2), the access order of all the variables in the code is fixed. The



approach verifies this by checking the remaining paths (which are marked (3)in the figure) and making sure that the nodes contained in them have all beenprocessed.

Figure 2(viii) shows the final access sequence (traversal). The dashed edges(arrows) correspond to transitions that contribute the cost of the traversal asthey do not have corresponding edges in the CLTG. In this particular case,the overall cost is 4. Note that the cost of the traversal directly corresponds tothe number of block buffer misses. It is easy to verify that the traversal costwould be 11 had we not performed any transformation. Considering that thetotal number of variable accesses in this example is 14, we observe a significantimpact of the code transformations used.

6. MULTIPLE BASIC BLOCKS

Up until now, we have focused on a single basic block and presented a tech-nique that employs access sequence transformations built upon a graph-basedrepresentation. In reality, most embedded codes have multiple basic blocks con-nected to each other through control dependences such as loop structures andconditional branching. Below we present an approach that optimizes a givenprocedure that might contain multiple basic blocks using a mix of data (vari-able) layout (storage sequence) and access sequence transformations.

It should be noted that if we restrict ourselves to access sequence transfor-mations, it is relatively straightforward to extend the technique presented inthe previous section to multiple basic blocks as follows. Assume that the pro-cedure being optimized is represented using a control flow graph (CFG) whereeach node represents a basic block and each directed edge represents a pos-sible transition (transfer of control) between two basic blocks. The proposedapproach starts with the most frequently executed basic block (which might bedetermined through profiling and static analysis where applicable) and opti-mizes it as explained in the previous section. It then visits each basic blocksin the order of decreasing execution frequency and uses the same algorithm totransform its access sequence.

A more interesting problem (and that is the one addressed in this section),however, is to optimize a given CFG using both access sequence and storagesequence transformations. In order to achieve this we need to employ a storagesequence optimization strategy. Panda et al. [1997] and Panda [1998] presenta powerful strategy for this purpose. Their strategy first obtains an access se-quence and then builds a closeness graph (CG). The CG has a node for everyvariable used in the code. The two nodes, vi and vj , in the CG are connectedusing an edge with a weight of wij if the distance between them in the accesspattern is ≤ k (line size) and the corresponding edge (transition) is taken wij

times. After constructing the CG, the next step is to group the variables intoclusters of k elements. As explained in Panda et al. [1997], a higher edge weightin the CG between two variable represents a potential reduction in the numberof memory accesses (i.e., increase in the cache hit rate) if the two variables arestored close to each other in memory. They propose a heuristic that determinesthe temporally correlated variables and stores them consecutively as much as



possible. Note that in our terminology, each cluster in Panda et al. [1997] cor-responds to a virtual line.

In principle, an approach that integrates access sequence optimizations andstorage sequence optimizations in a unified framework should generate betterresults than pure access sequence optimizations and pure storage sequenceoptimizations.

Before discussing how the approach in Panda et al. [1997] can be combined(integrated) with the access transformation technique presented in this paper,let us address a subproblem that will be needed in the integration process. Theproblem that we address below is to optimize the access and storage pattern of abasic block assuming that only some (not all) of the variables have fixed memorylocations and the remaining variables do not have fixed locations (i.e., they areyet to be assigned to storage locations). Let us call these two groups of variablesFA and FN , respectively. The objective of the optimization process is then to findan access sequence (to come up with an access sequence transformation) andto find a storage sequence for the variables in FN . We propose the followingthree-step approach to solve this subproblem:

—Use access sequence transformations to obtain an access sequence (for thevariables in FA) compatible with the storage sequence of the variables in FA.These transformations will typically modify the access sequence of some ofthe variables in FN as well. Let us call the set of these variables FN ′ (⊆ FN ).

—Use storage sequence optimizations for the variables in FN ′ to obtain a stor-age sequence (variable layout) which is compatible with the access sequencedetermined in the previous step.

—If the set FN −FN ′ is not empty, then determine a suitable storage sequencefor the variables in this set. In most of the cases encountered in practice, thisset is empty.

Note that the first step uses our approach explained in the previous sectionwhereas the second step use Panda et al.’s [1997] approach.

We are now ready to present our unified approach to procedure-wide (global)optimization for effective utilization of block buffering. Our approach operateson the CFG representation of the procedure and starts by ranking the basicblocks according to decreasing execution frequencies (which might be obtainedthrough static analysis or profiling as mentioned earlier). It then starts theoptimization process with the most frequently executed basic block and opti-mizes this basic block using only storage sequence optimizations proposed byPanda et al. [1997]. After optimizing this block, the storage order of the vari-ables accessed by this block is known. The approach then moves to the secondmost frequently executed basic block and optimizes this basic block using thethree-step approach discussed in the previous paragraph. After optimizing thisblock, the set of variables whose storage locations are determined is updatedand the approach moves to optimize the next basic block, and so on. Therefore,two distinguishing characteristics of the approach are:

—during the optimization process, it gives priority in optimization to the mostfrequently executed basic blocks; and



Fig. 4. An example control flow graph (CFG).

—it propagates variable layouts between basic blocks to reach a globally(procedure-wide) acceptable solution.

It should be noted that once a memory location has been determined for avariable (during optimization of a basic block), it is never changed later (duringthe optimization of a less frequently executed basic block); that is, the approachdoes not backtrack.

To illustrate the operation of this approach, we consider the CFG shown inFigure 4. Let us assume without loss of generality that the basic blocks areordered according to decreasing execution frequency as B4, B0, B2, B3, and B1and that k = 2. Note that a total of eight variables, namely a, b, c, d, e, f,g, and h, are referenced in this CFG. Our approach starts with B4 and opti-mizes it using Panda et al.’s scheme which indicates that variables a and c,and similarly variables g and b should be stored consecutively; that is, a andc should form a virtual line and g and b should form a virtual line. Havingfixed the storage order (sequence) for four variables, we move to B0 (the sec-ond most frequently executed basic block). Since c and a (and similarly b andg) have already been decided to be stored one after another (in the same vir-tual line), we apply a commutativity transformation and transform the state-ment g = b + a to g = a + b. The next block to be optimized is B2. In op-timizing this block, our approach first applies commutativity transformationto the first statement to makes accesses to variables b and g successive. Itthen captures the temporal proximity between d and h and stores them con-secutively (in a single virtual line). After that, it moves to B3 and changesthe access order of variables in the second statement (a = a + h) to bring ac-cesses to d (in the first statement) and h (in the second statement) together.Finally, the approach handles B1 by storing e and f, consecutively (in thesame virtual line). A possible storage order satisfying these transformationsis a,c,g,b,d,h,e,f in which we have four virtual lines: {a,c}, {g,b}, {d,h},and {e,f}.



7. EXTENSIONS TO THE BASIC APPROACH

7.1 Multiple Blocks

So far, we have considered only a single block buffer. The proposed techniquecan be extended to work with multiple block buffers. In this case, after obtainingthe storage and access sequences, we need to place the virtual lines in memoryso as to minimize the number of conflict misses (in the block buffer) betweenthe virtual lines. We can achieve this by employing the cluster optimizationproposed by Panda et al. [1997]. The idea is to build a cluster (virtual line) in-terference graph in which each node represents a virtual line (determined by ourapproach in the previous section), and the weight of an edge between two nodesindicates the number of potential conflict misses in the block buffer (cache) dueto these two clusters. Since this cluster (virtual line) assignment problem can beshown to be NP-hard, a heuristic can be used so that the nodes (clusters) withhigh number of potential conflict misses (i.e., high edge weights) are assignedto consecutive memory locations to minimize the chances for conflict misses.

7.2 Impact of Register Allocation

Until now, we have assumed no register allocation; that is, all variables areaccessed from memory (through the cache plus block buffer system). A registerallocation mechanism can have two impacts on the energy consumption of amemory system. First, it filters out a number of memory references that wouldotherwise end up in the cache and (maybe) off-chip memory. This filtering, ingeneral, leads to a reduction on the overall memory system energy consumption.Second, it can distort the regularity in data accesses to cache memory. Thissecond impact can be expected to lower the percentage benefits due to ourstrategy. On one hand, there would be fewer references to the block buffer,hence less scope for optimization. On the other hand, those references that endup in the block buffer exhibit less regularity. This obviously makes a high-levelcompiler optimization that targets block buffering less predictable.

There are two ways of handling the interaction between our approach andregister allocation. The first approach is to do nothing. In other words, we cancontinue to optimize the code as if all data references would go to cache, andexpect that the references that actually go to cache at runtime would, hopefully,still exhibit a regularity so that our optimization would bring some benefit. Thesecond approach is more involved as explained below.

Note that our strategy as explained so far, has been applied at the source-level. This obviously makes it difficult to take into account the interactions withlow-level optimizations performed by the back-end compiler. However, it is alsopossible to apply our approach at the low level, possibly even after register allo-cation and instruction scheduling. Such a strategy will bring two main benefits.First, we can clearly see (at the low level) which references are register allo-cated (therefore, they do not need to be considered by our approach). Second, wecan have a better idea about the orders of loads and stores. Consequently, ourmodified approach proceeds as follows. The CLTG is constructed after registerallocation has been performed and all register-allocated variables are omitted



from consideration. In the multiple basic blocks case, once the storage locationsof all variables that are not register-allocated have been determined, the re-maining variables are assigned locations from a separate region of memory;that is, they do not interfere with the former group of variables.

It should be stressed, however, that modifying the access pattern at low levelis more challenging than doing it at high level. For example, going from anevaluation order a + b + c to c + a + b in the low-level representation of thecode not only changes the order of loads but may also affect the register usage.However, this potential problem does not create an excessively negative impactwith our current implementation as we do not move load/store instructionsbeyond basic block boundaries.

7.3 Impact of Load/Store Re-ordering

As mentioned earlier in the article, if the back-end optimizer changes the rela-tive order of loads/stores, the effectiveness of our strategy may reduce as suchre-orderings can make access pattern assumptions at the high level invalid.Implementing our optimization strategy at the low level solves this load/storeproblem partially (i.e., as far as the compiler’s view of the order of loads/storesis concerned, if we run our strategy after instruction scheduling). This hasdrawbacks in that reordering the instructions (that were ordered by the in-struction scheduler to reduce stalls) may have an adverse effect by increas-ing stalls. A detailed modeling of the potential interaction between instructionscheduling and our strategy is beyond the scope of this article. We are cur-rently exploring the problem of integrating instruction scheduling with ourstrategy. This problem can be much worse for superscalar processors that per-form out-of-order (OOO) execution through which the order of instructions canbe modified at runtime in a manner that may not be predictable at compiletime.

It should also be mentioned that many embedded systems (in particular,those aiming at real-time environments) do not perform aggressive load/storescheduling at runtime as such optimizations make the execution time predictionhighly difficult, in general [Wolf 2001].

8. EXPERIMENTAL EVALUATION

This section provides results of the experiments we performed to evaluate theproposed optimization scheme. Our scheme has been implemented within theSUIF compilation framework [Amarasinghe et al. 1996] and has been evaluatedusing four codes: int mxm, an integer matrix multiply program (that containsone initialization and one multiplication nest); full search, a motion estima-tion code; fft, a discrete Fourier analysis code; and flt, a filtering routine.Since our strategy targets codes that make frequent use of scalar variables, apre-pass in the compiler unrolls the loops in the code completely, and replacesarray references with their scalar counterparts. For example, array referencessuch as a[1][2] and b[5][3] are converted to scalar variables a12 and b53, re-spectively. After this optimization, each element can be treated independentlyfrom others.



The data set sizes (respectively the number of basic blocks) for int mxm,full search, fft, and flt are 196K (respectively, 2), 71K (respectively, 11),224K (respectively, 8), and 128K (respectively, 12). For each code, four differentversions have been evaluated: the original code, a version that uses only storagelayout optimizations (denoted s opt) [Panda et al. 1997], a version that usesonly access sequence optimizations (denoted a opt), and a version that usesboth storage layout and access sequence optimizations (denoted s+a opt).

Our first set of experiments measure the energy benefits of our approachwhen no register allocation is performed. The input to our implementation is acode written in C. The SUIF pass implemented first applies loop unrolling andconverts array references to scalars as mentioned above. After this step, it ap-plies our strategy explained in Section 6, which transforms access patterns anddetermines storage locations for variables. Then, this optimized output code inC and the original code are compared with each other. To eliminate registerallocation, all the versions are compiled using a custom Sparc-based back-endwhere no scalar variable is permanently register allocated. The block buffermiss rates are obtained through the use of an in-house simulator built upon theShade infrastructure [Cmelik and Keppel 1994]. To calculate energy consump-tions, we employ the on-chip energy formulations in Shiue and Chakrabarti[1999]. All the graphs shown here are for an 8K, direct-mapped, write-backdata cache with a single block buffer. Experiments with 2-way and 4-way asso-ciative caches exhibited similar trends; so, they are not presented here. Also,when we increase the number of block buffers (to 4 and 8), we obtained largerenergy savings (though slightly sub-linear in terms of the number of blockbuffers). Beyond 8 buffers, we noticed that the buffers themselves consume alarge amount of energy.

Figure 5 gives the percentage improvements (reductions) in data cache en-ergy consumption with respect to the original version (unoptimized code) fortwo different values of k (1 and 4). Note that, the unoptimized code also takesadvantage of block buffering depending on the amount of block-level data local-ity in the code. We can make two major observations from these graphs. First,there is no clear choice between s opt and a opt as none of them dominates theother. This means that both access pattern optimizations and storage patternoptimizations need to be considered by the compiler. Second, the unified strat-egy (s+a opt) outperforms the other two optimization schemes over all codesand buffer sizes. When k = 1, the average percentage improvements broughtabout by s opt, a opt, and s+a opt are 26.1%, 26.7, and 48.2%, respectively.

Figure 6 presents reductions in the overall data memory system energy(main memory energy plus data cache energy, including the block buffer). Wesee that the average energy reductions (over all codes and all buffer sizes) dueto s opt, a opt, and s+a opt are 12.6%, 12.2%, and 21.4%, respectively. Theseresults clearly demonstrate that the unified strategy generates much betterresults than pure storage pattern or pure access pattern optimizations.

Next, we performed a set of experiments in which the impact of register allo-cation on the effectiveness of our approach was gauged. Figure 7 gives the per-centage reductions in data cache energy consumption with respect to the orig-inal version. To perform these experiments, we first coded a simple back-end,



Fig. 5. Data cache energy percentage improvement over original codes (without register

allocation).

which takes the SUIF output (of the unoptimized code) and generates Sparccode. Then, our back-end implementation, after register allocation, applied theoptimization technique discussed in this article. The register allocator employedhere closely follows the Chaitin’s graph-coloring-based approach [Chaitin 1982].We see from Figure 7 that, when k = 1, the average percentage data cache en-ergy improvement due to s+a opt is 35.7%, which is 12.5% worse than whenno register allocation is employed. Similarly, the results illustrated in Figure 8show that (when register allocation is used) our approach improves overall datamemory energy by 16.4%. These results indicate that, even in the existence ofan aggressive register allocation, large energy benefits are possible by usingcompiler-directed storage pattern and access pattern optimizations in concert.

In the next set of experiments, we measured the impact of load/store re-ordering. For this purpose, we modified our back-end so that loads and storesare moved to exploit as much instruction level parallelism (ILP) as possible(without performing any load speculation). Since the observed trends are thesame for all the codes in our experimental suite, we present only the resultsfor flt when the register allocator is activated and the unified optimizationstrategy (s+a opt) is used. The results presented in Figure 9 show that, withload/store re-ordering, the data cache energy and overall data memory energybrought by our approach is 24.7% and 9.1%, respectively. It should be notedthat load/store speculation can even take these benefits to lower values.



Fig. 6. Overall memory energy percentage improvement over original codes (without register

allocation).

Before moving to our concluding remarks, we would like to discuss two im-portant issues: performance impact of our approach and its impact on overallenergy savings. The impact of our approach on performance (execution cycles)can be evaluated under two scenarios. In the first scenario, we assume thatthe block buffer access and the cache access are serialized. That is, we first ac-cess the block buffer and then to the cache if we miss in the block buffer. In thiscase, the impact of our approach on performance is directly dictated by the blockbuffer hit rate. Specifically, if the block buffer hit rate is high, we can expectperformance benefits (in addition to energy benefits). If, on the other hand, theblock buffer hit rate is low, this can affect the overall performance since eachmiss in the block buffer brings 1 cycle additional penalty. The second scenario,whose implementation we followed up to this point in the article, is based onthe implementation strategy suggested by Ghose and Kamble [1999], in whichthe buffer and cache accesses are performed concurrently, using a two-phaseclock. In this case, the cache access is terminated early if the block buffer accessturns out to be a hit. In this case, the performance behavior of our scheme is lessaffected by the buffer hit rate, compared to the first scenario described above.

Figure 10 gives the performance behavior of our strategy (s+a opt) underthese two scenarios. Each bar in this graph gives the execution cycles, nor-malized with respect to the case without any optimization (i.e., the originalversion). Two trends can be observed from this graph. First, our approach



Fig. 7. Data cache energy percentage improvement over original codes (with register allocation).

improves overall performance under both the scenarios. The average execu-tion cycle reductions for the first (serial access) and the second (parallel access)scenarios are 9.4% and 13.1%, respectively. Second, as expected, the executioncycle savings are lower with the first scenario. The reason that we still achieveexecution cycle savings with this scenario is the fact that the block buffer hitrates are generally high, thereby removing the potential cache access in manycircumstances. The experimental results so far demonstrate that our approachis beneficial from both energy and performance viewpoints.

Finally, it is also important to consider the overall energy savings achieved byour approach. We start by observing that since our approach does not increaseexecution cycles (under both the scenarios discussed above), we do not incurany extra leakage consumption (over the original codes). Our experimentationwith these benchmark codes showed that, without any optimization, the datamemory system energy (including the data cache and the data main memory)constitutes about 31% of the total energy consumption (for a simple five-stagepipelined embedded processor, excluding the input/output units), indicatingthat data memory system is an important target from the energy angle. Sincethe results in Figures 6 and 8 show that our approach is able to reduce overalldata memory system energy by 16–21% when averaged over all benchmarkcodes in our experimental suite, one can expect around 5–6.5% savings, on theaverage, in total energy consumption.



Fig. 8. Overall memory energy percentage improvement over original codes (with register

allocation).

Fig. 9. Data cache energy and overall memory energy percentage improvements over original

codes when load/store re-ordering is performed (with register allocation).

9. CONCLUSIONS AND FUTURE WORK

This article presented a compiler-based approach that modifies code and vari-able layout to take better advantage of block buffering. The proposed techniqueis aimed at a class of embedded codes that make heavy use of scalar variables.



Fig. 10. Overall performance improvement over original codes (without register allocation and

when k = 1.)

Unlike previous work that uses only storage pattern optimization, we use anintegrated approach that employs both code restructuring and storage patternoptimizations. Our solution works for whole procedures, that is, multiple basicblocks. We have implemented our solution in an experimental compiler. Exper-iments with several codes demonstrate that our solution results in up to 24%savings in overall memory energy without register allocation and 18% withregister allocation.

Work in progress includes the investigation of different ways of combiningstorage layout and code restructuring transformations, incorporating partition-ing of variables for multiple block buffers, and studying the impact of staticsingle assignment. We also plan to investigate how storage pattern decisionsmade for a given procedure can be propagated to other procedures in the sameapplication.

REFERENCES

AHO, A. V., SETHI, R., AND ULLMAN, J. 1986. Compilers: Principles, Techniques, and Tools. Addison-

Wesley, Reading, MA.

AMARASINGHE, S. P., ANDERSON, J. M., WILSON, C. S., LIAO, S.-W., MURPHY, B. R., FRENCH, R. S.,

LAM, M. S., AND HALL, M. W. 1996. Multiprocessors from a software perspective. IEEEMicro.

BELLAS, N., HAJJ, I., STAMOULIS, G., AND POLYCHRONOPOULOS, C. 1998. Architectural and compiler

support for energy reduction in the memory hierarchy of high-performance microprocessors. In

Proceedings of the 1998 ACM/IEEE International Symposium on Low Power Electronics andDesign, ACM, New York, 70–75.

CATTHOOR, F., WUYTACK, S., GREEF, E. D., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE, A. 1998.

Custom Memory Management Methodology—Exploration of Memory Organization for EmbeddedMultimedia System Design. Kluwer Academic Publishers.

CHAITIN, G. 1982. Register allocation and spilling via graph coloring. In Proceedings of theSIGPLAN’82 Symposium on Compiler Construction, ACM, New York, 98–105.

CHILIMBI, T., LARUS, J., AND HILL, M. 2000. Making pointer-based data structures cache conscious.

IEEE Comput.CMELIK, B. AND KEPPEL, D. 1994. Shade: A fast instruction-set simulator for execution profiling. In

Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of ComputerSystems, ACM, New York, 128–137.



CORMEN, T., LEISERSON, C., AND RIVEST, R. 1990. Introduction to Algorithms. MIT Press, Cambridge,

MA.

EDMONDSON, J., FISCHER, T. C., JAIN, A. K., MEHTA, S., MEYER, J. E., PRESTON, R. P., RAJAGOPALAN, V.,

SOMANATHAN, C., TAYLOR, S. A., WOLRICH, G. M., RUBINFELD, P. I., BANNON, P. J., BENSCHNEIDER, B. J.,

BERNSTEIN, D., CASTELINO, R. W., COOPER, E. M., DEVER, D. E., AND DONCHIN, D. R. 1995. Internal

organization of the Alpha 21164, a 300 MHz 64-bit quad-issue CMOS RISC microprocessor. Digit.Tech. J. 7, 1, 119–135.

ESAKKIMUTHU, G., VIJAYKRISHNAN, N., KANDEMIR, M., AND IRWIN, M. 2000. Memory system energy:

Influence of hardware-software imizations. In Proceedings of the ISLPED’00 ACM/IEEE Inter-national Symposium on Low Power Electronics and Design. ACM, New York.

GHOSE, K. AND KAMBLE, M. 1999. Reducing power in superscalar processor caches using sub-

banking, multiple line buffers and bit-line segmentation. In Proceedings of the ACM/IEEEInternational Symposium on Low Power Electronics and Design, ACM, New York, 70–75.

KAMBLE, M. AND GHOSE, K. 1997. Energy-efficiency of vlsi caches: a comparative study. In Pro-ceedings of the International Conference on VLSI Design, 261–267.

KANDEMIR, M., CHOUDHARY, A., RAMANUJAM, J., AND BANERJEE, P. 1998. Improving locality using

loop and data transformations in an integrated framework. In Proceedings of the InternationalSymposium on Microarchitecture.

KANDEMIR, M., CHOUDHARY, A., SHENOY, N., BANERJEE, P., AND RAMANUJAM, J. 1999. A linear alge-

bra framework for automatic determination of imal data layouts. IEEE Trans. Parall. Distrib.Syst. 10, 2, 115–135.

KANDEMIR, M. AND RAMANUJAM, J. 2000. Data relation vectors: A new abstraction for data imiza-

tions. In Proceedings of the International Conference on Parallel Architectures and CompilationTechniques.

KELLY, W., MASLOV, V., PUGH, W., ROSSER, E., SHPEISMAN, T., AND WONNACOTT, D. 1995. The Omega

library interface guide. Technical Report CS–TR–3445, CS Dept., University of Maryland.

KIN, J., GUPTA, M., AND MANGIONE-SMITH, W. 1997. The filter cache: An energy-efficient memory

structure. In Proceedings of the MICRO-30, 184–193.

KULKARNI, C., CATTHOOR, F., AND MAN, H. D. 2000. Advanced data layout imization for multimedia

applications. In Proceedings of the Workshop on Parallel and Distributed Computing in Image,Video, and Multimedia Processing, held in conjunction with IPDPS’00.

KULKARNI, C., MIRANDA, M., GHEZ, C., CATTHOOR, F., AND MAN, H. D. 2001. Cache conscious data

layout organization for embedded multimedia applications. In Proceedings of the 4th ACM/IEEEDesign and Test in Europe Conference. ACM, New York.

MELLOR-CRUMMEY, J., WHALLEY, D., AND KENNEDY, K. 1999. Improving memory hierarchy perfor-

mance for irregular applications. In Proceedings of the ACM International Conference on Super-computing. ACM, New York.

PANDA, P. 1998. Memory minimization and exploration for embedded systems. Ph.D. dissertation.

University of California, Irvine, Irvine, CA.

PANDA, P., DUTT, N., AND NICOLAU, A. 1997. Memory data organization for improved cache perfor-

mance in embedded processor applications. ACM Trans. Des. Auto. Elec. Sys. 2, 4.

SHIUE, W. AND CHAKRABARTI, C. 1999. Memory exploration for low power embedded systems. Tech.

Rep., Arizona State Univ.

SU, C. AND DESPAIN, A. 1995. Cache design tradeoffs for power and performance imization: A case

study. In Proceedings of the ISLPED95, ACM/IEEE International Symposium on Low PowerElectronics and Design, ACM, New York, 63–68.

WOLF, W. 2001. Computers as Components: Principles of Embedded Computing System Design.

Morgan-Kaufmann, San Francisco, CA.

WOLFE, M. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley, Reading,

MA.

Received January 2003; revised March 2003; accepted March 2005


improving the energy behavior of block buffering using compiler … · 2008-03-11 · improving the...

Documents