
PEAK—A Fast and Effective Performance Tuning System via Compiler Optimization Orchestration

ZHELONG PAN and RUDOLF EIGENMANN

Purdue University

Compile-time optimizations generally improve program performance. Nevertheless, degradations caused by individual compiler optimization techniques are to be expected. Feedback-directed optimization orchestration systems generate optimized code versions under a series of optimization combinations, evaluate their performance, and search for the best version. One challenge to such systems is to tune program performance quickly in an exponential search space. Another challenge is to achieve high program performance, considering that optimizations interact. Aiming at these two goals, this article presents an automated performance tuning system, PEAK, which searches for the best compiler optimization combinations for the important code sections in a program. The major contributions made in this work are as follows: (1) An algorithm called Combined Elimination (CE) is developed to explore the optimization space quickly and effectively; (2) Three fast and accurate rating methods are designed to evaluate the performance of an optimized code section based on a partial execution of the program; (3) An algorithm is developed to identify important code sections as candidates for performance tuning by trading off tuning speed and tuned program performance; and (4) A set of compiler tools is implemented to automate optimization orchestration. Orchestrating optimization options in SUN Forte compilers at the whole-program level, our CE algorithm improves performance by 10.8% over the SPEC CPU2000 FP baseline setting, compared to 5.6% improved by manual tuning. Orchestrating GCC O3 optimizations, CE improves performance by 12% over O3, the highest optimization level. Applying the rating methods, PEAK reduces tuning time from 2.19 hours to 5.85 minutes on average, while achieving equal or better program performance.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Code generation, compilers, optimization, run-time environments

General Terms: Performance

Additional Key Words and Phrases: Performance tuning, optimization orchestration, dynamic compilation

This work was supported, in part, by the National Science Foundation under Grants No. EIA-0103582, CCF-0429535, and CNS-0650016. Authors' address: School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907-1285; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © 2008 ACM 0164-0925/2008/05-ART17 $5.00 DOI 10.1145/1353445.1353451 http://doi.acm.org/10.1145/1353445.1353451


ACM Reference Format: Pan, Z. and Eigenmann, R. 2008. PEAK—A fast and effective performance tuning system via compiler optimization orchestration. ACM Trans. Program. Lang. Syst. 30, 3, Article 17 (May 2008), 43 pages. DOI = 10.1145/1353445.1353451 http://doi.acm.org/10.1145/1353445.1353451

1. INTRODUCTION

Compiler optimizations generally yield significant performance improvements in many programs on modern architectures. Nevertheless, the potential for performance degradation in certain program patterns is known to compiler writers and many programmers. The state of the art is to let programmers deal with this problem through compiler options. For example, a programmer can switch off an optimization after finding that it causes performance degradation. The presence of these options reflects the inability of today's compilers to make the best optimization decision at compile time. In this article, we refer to this process of finding the best optimization combination for a target program as optimization orchestration. The large number of compiler optimizations, the subtle interactions between optimizations, the sophistication of computer architectures, and the complexity of the program itself make this optimization orchestration problem difficult to solve.

Our PEAK system automates the optimization orchestration process in a fast and effective way. It consists of a few run-time and compile-time subsystems with the goal of finding effective optimization settings for important code segments of an application. PEAK employs a profiling system to create a call graph and to identify these important code segments, which we call tuning sections. For each tuning section, PEAK uses a fast and effective search technique to determine the best set of optimization flags with which to compile. Because the search technique needs to fairly compare the effectiveness of one optimization vector against another, this article presents accurate rating methods as well.

In more detail, PEAK generates a series of experimental versions, compiled under different optimization combinations, for each tuning section. The performance of each experimental version is rated based on a partial execution of the program, that is, a few invocations of the tuning section during a training run of the program. An orchestration algorithm chooses a sequence of experimental optimization combinations, based on these ratings, until performance returns diminish. Finally, the tuned versions of all segments are combined for the production run of the program.

To achieve the two goals of high program performance and fast tuning speed, this article develops fast and effective orchestration algorithms for empirically searching the optimization space [Pan and Eigenmann 2006]. It also presents fast and accurate rating methods to speed up performance evaluation [Pan and Eigenmann 2004b]. Compiler tools are implemented to analyze and to instrument the target program automatically.

Figure 1 shows the steps of our envisioned performance tuning scenario. Steps (1) to (4) are taken before tuning to construct a tuning driver for a given program. In these steps, our compiler analyzes and instruments the source code, selecting candidates for performance tuning and applying accurate rating methods.


Fig. 1. Work flow of the PEAK system: Compile-time pre-tuning steps identify the important program sections to be tuned and suitable methods for comparing the measured performance. The actual tuning, through empirical search, happens during an offline execution, similar to a profile run. The individually optimized tuning sections are combined in a post-tuning step.

During performance tuning at Step (5), the tuning driver continually runs the program under a training input until the best version is found for each tuning section. In this step, the system dynamically generates and loads optimized versions, and rates these versions. After tuning, in Step (6), each tuning section is compiled under its best optimization combination and linked to the main program to generate the final tuned version.

This article makes the following specific contributions:

(1) An optimization orchestration algorithm is developed to explore the optimization space quickly and effectively. This algorithm achieves equal or better performance than related algorithms, but with less tuning time (57% of the closest alternative) [Pan and Eigenmann 2006].

(2) Three accurate and fast performance rating methods based on a partial execution of the program are designed to improve rating accuracy and to reduce tuning time. These rating methods reduce tuning time from several hours to several minutes, while achieving equal or higher program performance.

(3) An algorithm to select important code segments as candidates for performance tuning is presented. The selected tuning sections cover most (typically more than 90%) of the total execution time. Each of them is invoked many times (typically several hundred) in one run of the program. A high execution-time coverage leads to high program performance; a large number of tuning-section invocations leads to fast tuning, because it means that many optimized versions can be evaluated in one run of the program.

(4) An automatic performance tuning system via optimization orchestration is implemented. The PEAK compiler analyzes and instruments the source program before the tuning phase. The PEAK runtime system explores the optimization space and evaluates the performance of optimized versions automatically.

(5) Optimization orchestration performance is measured and analyzed comprehensively for GCC and the SUN Forte compiler, with a focus on GCC. The experiments are done for SPEC CPU2000 benchmarks on a Pentium 4 machine and a SPARC II machine.


The remainder of this article is organized as follows. Section 2 presents a fast and effective orchestration algorithm named Combined Elimination (CE). Section 3 shows fast and accurate rating methods to evaluate the performance of optimized tuning-section versions based on a partial execution of the program. Section 4 develops an algorithm for selecting the important code sections in a program as tuning sections. Section 5 shows the design of our PEAK system, which tunes program performance at tuning-section granularity. Section 6 evaluates PEAK in terms of tuning time and tuned program performance. Section 7 discusses related work. Section 8 concludes this article.

2. A FAST AND EFFECTIVE OPTIMIZATION ORCHESTRATION ALGORITHM

Today's compilers have evolved to the point where they present to programmers a large number of optimization options. For example, GCC compilers include 38 options, roughly grouped into three optimization levels, O1 through O3. On the other hand, compiler optimizations interact in unpredictable manners, as many have observed [Chow and Wu 1999; Kisuki et al. 1999; Pan and Eigenmann 2004a; Pinkers et al. 2004; Triantafyllis et al. 2003]. How do we search for the best optimization combination for a given program in order to achieve the best performance? We call this process optimization orchestration, and this section aims at developing a fast and effective algorithm to do so. In Section 2.1, we develop a feedback-directed algorithm, Combined Elimination (CE), to orchestrate on-off options. All the GCC O3 optimization options are of this on-off type. In Section 2.2, we extend the algorithm to handle multivalued options. One example of a multivalued option is the "-unroll" option in SUN Forte compilers [Sun 2000], which takes an argument indicating the degree of loop unrolling.

2.1 The CE Algorithm to Tune On-Off Options

Focusing on on-off options, similar to several other papers [Chow and Wu 1999; Pan and Eigenmann 2004a, 2004b; Pinkers et al. 2004], we define the goal of optimization orchestration as follows:

Given a set of compiler optimization options {F1, F2, . . . , Fn}, find the combination that minimizes the program execution time. Each option Fi has two possible values: on and off. (Here, n is the number of optimizations.)

We first develop two simple feedback-directed algorithms: (1) Batch Elimination (BE) identifies harmful optimizations and removes them in a batch. (2) Iterative Elimination (IE) successively removes harmful optimizations. Based on the above two algorithms, we design our final algorithm, Combined Elimination (CE). In our previous paper [Pan and Eigenmann 2006], we applied these algorithms to tune program performance at the whole-program level. In this article, we will apply CE at the tuning-section level, using the PEAK system.

2.1.1 Batch Elimination (BE). The idea of Batch Elimination (BE) is to identify the optimizations with negative effects and turn them off all at once. The negative effect of one optimization, Fi, can be represented by its Relative Improvement Percentage (RIP), RIP(Fi), which is the relative performance difference of the two versions with and without Fi, T(Fi = on) and T(Fi = off).


(1) Compile the code under the baseline B = {F1 = on, F2 = on, . . . , Fn = on}. Execute the generated version to get the baseline rating TB.
(2) For each optimization Fi, switch it off from B and compile the code, execute the generated version to get T(Fi = off), and compute RIPB(Fi = off) according to Equation (2).
(3) Disable all optimizations that cause performance degradation, i.e., whose RIPs are negative, to generate the final, tuned version of the code.

Fig. 2. Pseudo-code of batch elimination.

RIP(Fi) can be computed according to Eq. (1).

\[ \mathrm{RIP}(F_i) = \frac{T(F_i = \mathrm{off}) - T(F_i = \mathrm{on})}{T(F_i = \mathrm{on})} \times 100\%. \tag{1} \]

If the orchestration algorithm is applied at the whole-program level, T(Fi = off) is the execution time of the whole program with Fi turned off; if the algorithm is applied at the tuning-section level, T(Fi = off) is the rating generated based on the execution times of a few invocations of the tuning section compiled with Fi turned off. (We will discuss the rating methods in detail in Section 3.)

BE switches on all optimizations as the baseline. The performance improvement of switching off Fi from the baseline B, relative to the baseline performance, can be computed by Eq. (2).

\[ \mathrm{RIP}_B(F_i = \mathrm{off}) = \frac{T(F_i = \mathrm{off}) - T_B}{T_B} \times 100\%. \tag{2} \]

If RIPB(Fi = off) < 0, the optimization Fi has a negative effect. BE eliminates the optimizations with negative RIPs in a batch to generate the final, tuned version. This algorithm has a complexity of O(n), where n is the number of optimizations. Figure 2 lists the pseudo-code.
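To make the structure of BE concrete, the following minimal Python sketch walks the option set once and batch-disables every option with a negative RIP. The compile_and_rate callback and the dictionary flag representation are illustrative assumptions, not PEAK's actual interface; the callback is presumed to compile the code under the given on/off assignment and return its measured rating (lower is better).

```python
# A minimal sketch of Batch Elimination (BE). `compile_and_rate(flags)` is a
# hypothetical callback that builds the code with the given on/off assignment
# and returns a time-like rating (lower is better); it is not PEAK's API.

def rip_off(t_off: float, t_base: float) -> float:
    """RIP of switching one option off, relative to the baseline (Eq. 2)."""
    return (t_off - t_base) / t_base * 100.0

def batch_elimination(options, compile_and_rate):
    baseline = {f: True for f in options}        # Step 1: all optimizations on
    t_base = compile_and_rate(baseline)
    final = dict(baseline)
    for f in options:                            # Step 2: probe each option
        trial = dict(baseline)
        trial[f] = False
        t_off = compile_and_rate(trial)
        if rip_off(t_off, t_base) < 0:           # Step 3: a negative RIP means
            final[f] = False                     # the option hurts; drop it
    return final
```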

2.1.2 Iterative Elimination (IE). IE takes the interaction of optimizations into consideration, using a greedy method. Unlike BE, which turns off all the optimizations with negative effects at once, IE iteratively turns off the one optimization with the most negative effect at a time.

IE starts with the baseline that switches on all the optimizations. After computing the RIPs of the optimizations in accordance with Eq. (2), IE switches off the one optimization with the most negative effect from the baseline. This process repeats with all remaining optimizations, until none of them causes performance degradation. The complexity of IE is O(n²). Figure 3 shows the pseudo-code.

2.1.3 Combined Elimination (CE). CE, our final algorithm, combines the ideas of the two algorithms just described. It has an iterative structure similar to IE's. However, in each iteration, CE applies the idea of BE: after identifying the optimizations with negative effects, CE tries to eliminate these optimizations one by one in a greedy fashion.

We have shown in Pan and Eigenmann [2006] that IE achieves better program performance than BE, since IE considers the interaction of optimizations using a greedy method.


(1) Let B be the baseline optimization combination. Let S be the set of optimizations forming the search space. Initialize these two sets: S = {F1, F2, . . . , Fn} and B = {F1 = on, F2 = on, . . . , Fn = on}.
(2) Compile and execute the code under the baseline setting to get the baseline rating TB.
(3) For each optimization Fi ∈ S, switch Fi off from B and compile the code, execute the generated version to get T(Fi = off), and compute the RIP of Fi relative to the baseline B, RIPB(Fi = off), according to Equation (2).
(4) Find the optimization Fx that causes the most performance degradation, i.e., whose RIP has the most negative value. Remove Fx from S, and set Fx to off in B.
(5) Repeat Steps (2), (3) and (4) until all options in S have non-negative RIPs. B represents the final optimization combination.

Fig. 3. Pseudo-code of iterative elimination.

(1) Let B be the baseline optimization combination. Let S be the set of optimizations forming the search space. Initialize these two sets: S = {F1, F2, . . . , Fn} and B = {F1 = on, F2 = on, . . . , Fn = on}.
(2) Compile and execute the code under the baseline setting to get the baseline rating TB.
(3) Measure the RIPB(Fi = off) of each optimization option Fi in S relative to the baseline B.
(4) Let X = {X1, X2, . . . , Xl} be the list of optimization options with negative RIPs. These optimizations cause performance degradation. X is sorted in increasing order, i.e., the first element, X1, causes the most degradation. Remove X1 from S and set X1 to off in B. (B is changed in this step.) For i from 2 to l:
(a) Measure the RIP of Xi relative to the baseline B.
(b) If the RIP of Xi is negative, i.e., Xi causes performance degradation, remove Xi from S and set Xi to off in B.
(5) Repeat Steps (2), (3), and (4) until all options in S have non-negative RIPs. B represents the final optimization combination.

Fig. 4. Pseudo-code of combined elimination.

But, when the interactions have only small effects, BE may perform close to IE and provide the solution more quickly. CE takes advantage of both BE and IE. When the optimizations interact weakly, CE eliminates the optimizations with negative effects in one iteration, just like BE. Otherwise, CE eliminates them iteratively, like IE. As a result, CE achieves both good program performance and fast tuning speed. CE has a complexity of O(n²). The pseudo-code is listed in Figure 4.
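The CE steps of Figure 4 translate directly into code. The sketch below reuses the hypothetical compile_and_rate callback from the BE sketch above; RIPs are kept as fractions rather than percentages, which does not change any comparison.

```python
# A minimal sketch of Combined Elimination (CE), following Fig. 4. The
# `compile_and_rate(flags)` callback is the same illustrative assumption as
# in the BE sketch; lower ratings are better.

def combined_elimination(options, compile_and_rate):
    search = set(options)                  # S: options still being considered
    base = {f: True for f in options}      # B: baseline, initially all on
    while True:
        t_base = compile_and_rate(base)    # Step 2
        rips = {}
        for f in search:                   # Step 3: probe each remaining option
            trial = dict(base); trial[f] = False
            rips[f] = (compile_and_rate(trial) - t_base) / t_base
        negative = sorted((f for f in search if rips[f] < 0),
                          key=lambda f: rips[f])   # most harmful first
        if not negative:
            return base                    # Step 5: nothing hurts anymore
        first, rest = negative[0], negative[1:]
        search.remove(first); base[first] = False  # Step 4: drop the worst
        for f in rest:                     # then re-check the rest greedily,
            t_base = compile_and_rate(base)        # against the updated B
            trial = dict(base); trial[f] = False
            if compile_and_rate(trial) < t_base:
                search.remove(f); base[f] = False
```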

2.2 The GCE Algorithm to Tune Multi-Valued Options

Generally, to tune multivalued options, the optimization orchestration problem can be described as follows:

Given a set of optimization options {F1, F2, . . . , Fn}, where each Fi has Ki possible values {Vi,j, j = 1..Ki} (i = 1..n), find the combination that minimizes the program execution time. (Moreover, one Fi can actually contain multiple optimizations that have a high possibility of interaction; the possible values of this Fi are then all possible combinations of the values of these optimizations.)

We name our algorithm to handle multivalued options General Combined Elimination (GCE). GCE has an iterative structure similar to CE's, with a few extensions. The initial baseline of GCE is the default optimization setting used in the compiler. In each iteration of GCE, all nondefault values of the remaining options in the search space S are evaluated; for each of these options, GCE records the value causing the most negative RIP, which is computed according to Eq. (2).


(1) Let B be the baseline option setting. Let S be the set of optimizations forming the search space. Initialize B = {F1 = f1, F2 = f2, . . . , Fn = fn | fi is the default value of option Fi} and S = {F1, F2, . . . , Fn}.
(2) Measure the RIPs of all the non-default values of the options in S relative to the baseline B, that is, RIPB(Fi = Vi,j) s.t. Fi ∈ S and Vi,j ≠ fi. The definition of RIP is the same as in Eq. (2).
(3) Let X = {X1 = x1, X2 = x2, . . . , Xl = xl} be the list of options with negative RIPs, where xi has the most negative RIP among the possible values of option Xi. This means that X1 = x1, X2 = x2, . . . , and Xl = xl improve the baseline performance, and value xi has the best performance among all the possible values of option Xi (i = 1..l). X is sorted in increasing order, that is, the first element in X, X1, has the best performance among the Xi (i = 1..l). Remove X1 from S and set X1 in B to be x1. (B is changed in this step.) For i from 2 to l:
(a) Measure RIPB(Xi = xi).
(b) If it is negative, i.e., setting Xi = xi achieves better performance than the baseline, remove Xi from S and set Xi in B to be xi.
(4) Repeat Step (2) and Step (3) until all options in S have non-negative RIPs. B represents the final option setting.

Fig. 5. Pseudo-code of general combined elimination.

GCE tries to apply the recorded values of the options with negative RIPs one by one in a greedy fashion, just like CE. When none of the remaining options in S has a value improving the performance, GCE gives the final optimization setting for the program. GCE has a complexity of O(Σ_{i=1..n} Ki × n). In most cases, O(Σ_{i=1..n} Ki) = O(n), so roughly the complexity is still O(n²). (The techniques developed in Kisuki et al. [2000] could also be included in GCE to handle options with a large number of possible values, for example, blocking factors.) The pseudo-code of GCE is shown in Figure 5.
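As a sketch, the multivalued extension changes only the probing step: instead of one off trial per option, GCE tries every non-default value and remembers the best. The values and defaults dictionaries and the compile_and_rate callback below are illustrative assumptions, not PEAK's interface.

```python
# A minimal sketch of GCE (Fig. 5). `values` maps each option to its possible
# values, `defaults` to its compiler-default value; `compile_and_rate` is the
# same hypothetical callback as in the CE sketch above.

def general_combined_elimination(values, defaults, compile_and_rate):
    search = set(values)              # S: options still in the search space
    base = dict(defaults)             # B: starts at the compiler defaults
    while search:
        t_base = compile_and_rate(base)
        best = {}                     # per option: (most negative RIP, value)
        for f in search:              # Step 2: probe all non-baseline values
            for v in values[f]:
                if v == base[f]:
                    continue
                trial = dict(base); trial[f] = v
                r = (compile_and_rate(trial) - t_base) / t_base
                if r < best.get(f, (0.0, None))[0]:
                    best[f] = (r, v)
        if not best:
            return base               # Step 4: no value improves the baseline
        order = sorted(best, key=lambda f: best[f][0])
        first = order[0]              # Step 3: commit the biggest winner
        base[first] = best[first][1]; search.remove(first)
        for f in order[1:]:           # re-validate the rest against updated B
            t_base = compile_and_rate(base)
            trial = dict(base); trial[f] = best[f][1]
            if compile_and_rate(trial) < t_base:
                base[f] = best[f][1]; search.remove(f)
    return base
```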

3. FAST AND ACCURATE RATING METHODS

In this article, we tune performance for important code sections of a program, called tuning sections (TS). A tuning section is essentially a procedure including all its callees. We define one invocation of a TS as the total execution of this TS during one call to that procedure. To tune a program at this tuning-section level, we apply the CE algorithm to search for an effective optimization combination for each TS independently. Noticing that each TS is invoked many (usually hundreds or thousands of) times, we develop rating methods to evaluate the performance of one optimized version based on a small number of invocations of the TS, called a window. In this way, a partial execution of the program can be used to rate the performance of an optimized version. It leads to a large speedup of the tuning process compared to whole-program tuning, which uses the total execution time of the whole program to evaluate the performance.

Rating based on a number of invocations of the tuning sections leads to fast tuning speed, but the workload may change from one invocation to the next. Directly averaging the execution times of a number of invocations is not always accurate. This section presents the rating methods that compare, in a fair way, the execution times of the optimized versions under different invocations. The key ideas of our rating methods are as follows: Context-Based Rating (CBR) identifies and compares invocations of a tuning section that have the same workload, in the course of the program run. Model-Based Rating (MBR) formulates the relationship between different workloads, which it factors into the comparison. Re-execution-Based Rating (RBR) directly re-executes a tuning section under the same input for a fair comparison.


(In our previous paper [Pan and Eigenmann 2004b], we presented automated compiler techniques to analyze the source code for choosing the most appropriate rating method for each tuning section.)

The rating methods are applied in an offline performance tuning scenario as follows. Before tuning, the program is partitioned by our PEAK compiler into a number of tuning sections. The tuning system runs the program one or several times under a training input, while dynamically generating and swapping in/out new optimized versions for each TS. The performance of these versions is compared using the proposed rating methods. The winning version will be used in the final tuned program.

3.1 Context Based Rating (CBR)

Context-based rating identifies the invocations of a tuning section that have the same workload in the course of program execution. The PEAK compiler finds the set of context variables, which are the program variables that influence the execution time of the tuning section. For example, the context variables include the variables that determine the conditions of control regions, such as if or loop constructs. (These variables can be function parameters, global variables, or static variables.) Thus, the context variables determine the workload of a tuning section. We define the context of one TS invocation as the set of values of all context variables. Therefore, each context represents one unique workload.

CBR rates one optimized version under a certain context by using the average execution time of several invocations. (Typically, this number is in the tens.) The best versions for different contexts may be different, in which case CBR could report the context-specific winners. PEAK makes use of only the best version under the most important context, which covers most (e.g., more than 80%) of the execution time spent in the TS. This major context is determined by one profile run of the program. (If a tuning section has no major context, or the number of invocations of the major context is too small, the MBR method described next is preferred.)

In summary, the rating of a version v, R(v), is computed according to Eq. (3), where x is the most time-consuming context of version v, T(i, x) is the execution time of the ith invocation under context x, and w is the window size, the number of invocations used to rate one version. Var(v), in Eq. (4), records the variance of the measurement.

\[ R(v) = \sum_{i=1..w} T(i, x) / w \tag{3} \]
\[ \mathrm{Var}(v) = \sum_{i=1..w} \big(T(i, x) - R(v)\big)^2 / w. \tag{4} \]
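A small sketch of the windowed computation in Eqs. (3) and (4), assuming invocation times have already been grouped by context; the context tuples and timing numbers are illustrative.

```python
# A minimal sketch of CBR's rating over a window (Eqs. 3 and 4). Contexts are
# represented as tuples of context-variable values; the data is illustrative.

from statistics import fmean

def cbr_rating(times_by_context, major_context, window):
    samples = times_by_context[major_context][:window]
    r = fmean(samples)                            # Eq. (3): mean over window
    var = fmean((t - r) ** 2 for t in samples)    # Eq. (4): its variance
    return r, var

times = {(64, True): [1.02e-3, 0.98e-3, 1.01e-3],   # major context
         (8, True): [1.1e-4]}                       # minor context, ignored
print(cbr_rating(times, (64, True), window=3))
```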

We presented in Pan and Eigenmann [2004b] the compiler analysis to find the context variables so as to determine the applicability of CBR. The algorithm traverses each control statement and recursively finds the related variables; that is, it finds all of the input variables that may influence the values used in control statements.


All these variables are considered to be context variables. If there exist one or more nonscalar context variables, there is usually no major context. So, in this case, CBR is not applicable. Similarly, if a context variable is floating-point, CBR is not applicable. To reduce the context match overhead during tuning, we eliminate the runtime constants from the context variable list. The runtime-constant variables are found in the same profile run that determines the major context.

3.2 Model-Based Rating (MBR)

Model-based rating formulates mathematical relationships between different contexts of a tuning section and adjusts the measured execution time accordingly. In this way, different contexts become comparable.

The execution time of a tuning section consists of the execution time spent in all of its basic blocks:

\[ T_{TS} = \sum_{b=1..m} (T_b \times C_b). \tag{5} \]

TTS is the execution time in one invocation of the whole tuning section; Tb is the execution time in one entry to basic block b; Cb is the number of entries to basic block b in the TS invocation; and m is the number of basic blocks.

If the numbers of entries to two basic blocks, Cb1 and Cb2, are linearly dependent on each other in every TS invocation through the whole run of the program (that is, Cb1 = α · Cb2 + β, where α and β are constants), our compiler merges the items corresponding to these basic blocks into one component. Hence, MBR uses the following execution time estimation model:

\[ T_{TS} = \sum_{i=1..n} (T_i \times C_i). \tag{6} \]

TTS consists of several components, each of which has a component count Ci and a component time Ti.

During tuning, PEAK collects execution times of a number of invocations of an optimized version. It gathers a TS-execution-time vector, Y, and a component-count matrix, C, in which Y(j) is the TTS in the jth invocation and C(i, j) is the ith component count Ci in the jth invocation. Solving the following linear regression problem yields the component-time vector T:

\[ Y = T \times C \tag{7} \]

Here, T = (T1, T2, . . . , Tn) represents the component-time vector of one particular version. The version with smaller Ti's performs better. Hence, MBR may compare different versions using the rating, R(v), computed based on their T vectors according to the following equations.

\[ R(v) = \sum_{i=1..n} (T_i \times Cavg_i) \tag{8} \]
\[ \mathrm{Var}(v) = \sum_{j=1..w} \Big( Y_j - \sum_{i=1..n} (T_i \times C_{i,j}) \Big)^2 \Big/ w. \tag{9} \]


RBR(TuningSection TS):
1. Swap Version 1 and Version 2
2. Save Modified_Input(TS)
3. Run the preconditional version
4. Restore Modified_Input(TS)
5. Time Version 1
6. Restore Modified_Input(TS)
7. Time Version 2
8. Return the two execution times

Fig. 6. Re-execution-based rating method.

Cavgi is the average count of component i during one whole run of the program. These data are obtained from the profile run. And w is the number of invocations used to compute the rating. The variance of the rating, Var(v), is the residual error of this linear regression.

Adding a counter into the code may have side effects on the compiler optimizations. To avoid this, we can require that the value of the counter be determined before the entry to the tuning section.

If there are many (for example, ten) components in the execution time model, a large number of invocations needs to be measured in order to perform an accurate linear regression. MBR would lead to a long tuning time in this case and so is not applied. Instead, RBR is applied, which is described next.
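The regression step of Eqs. (7) to (9) is an ordinary least-squares problem. The sketch below solves it with NumPy; the count matrix, invocation times, and average counts are illustrative stand-ins for data gathered by PEAK's instrumentation.

```python
# A minimal sketch of MBR's rating via linear regression (Eqs. 7-9).

import numpy as np

def mbr_rating(C, Y, C_avg):
    """C: n x w component counts, Y: w invocation times, C_avg: length n."""
    T, *_ = np.linalg.lstsq(C.T, Y, rcond=None)   # solve Y ≈ T x C  (Eq. 7)
    r = float(T @ C_avg)                          # Eq. (8): estimated time
    residual = Y - T @ C                          # per-invocation fit error
    var = float(np.mean(residual ** 2))           # Eq. (9): residual variance
    return r, var

C = np.array([[10, 20, 10, 30],                   # counts of component 1
              [ 5,  5, 15,  5]], dtype=float)     # counts of component 2
Y = np.array([1.2, 1.9, 2.0, 2.6])                # measured T_TS per invocation
print(mbr_rating(C, Y, C_avg=np.array([17.5, 7.5])))
```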

3.3 Re-execution Based Rating (RBR)

Figure 6 shows the basic idea of RBR: the input data is saved before each invocation of the TS (Step (2)), Version 1 is timed (Step (5)), the input is restored (Step (6)), and Version 2 is executed (Step (7)). These two execution times can be compared directly to decide which version is better, since both versions are executed with the same input. Suppose that the execution times of these two versions are Tv1 and Tv2. Then, the performance of Version 2 relative to Version 1 is Rv2/v1:

\[ R_{v2/v1} = T_{v1} / T_{v2}. \tag{10} \]

If Rv2/v1 is larger than 1, Version 2 performs better than Version 1. Otherwise, Version 2 performs worse. For multiple versions, we compare their performance relative to the same base version. For example, if Rv2/v1 is less than Rv3/v1, Version 3 performs better than Version 2. In our tuning system, we use the average of the Rvx/vb across a number of TS invocations as the rating of Version vx relative to the base version vb. The rating of v is computed based on Eq. (11), where w is the number of invocations. Similar to CBR and MBR, we compute the rating variance Var(v).

\[ R(v) = \sum_{i=1..w} R_{v/vb}(i) / w \tag{11} \]
\[ \mathrm{Var}(v) = \sum_{i=1..w} \big(R_{v/vb}(i) - R(v)\big)^2 / w. \tag{12} \]

To improve the rating accuracy, RBR (1) swaps Version 1 and Version 2 at each invocation, so that their order does not bias the result (Step (1)), and (2) inserts a preconditional version before Version 1 to bring the used data into cache (Step (3)).


In addition, to reduce the roll-back overhead, RBR saves and restores only the input variables that are modified in the invocation, the set Modified_Input(TS), which is the intersection of the input set and the def set of TS:

\[ \mathit{Modified\_Input}(TS) = \mathit{Input}(TS) \cap \mathit{Def}(TS). \tag{13} \]

Our compiler finds the input variables using a combination of compile-time and run-time techniques. If it fails to do so during static analysis, for example, due to complicated memory aliases, our compiler instruments the source code to get the input during tuning. So, generally, RBR is applicable to all tuning sections, except the ones that call library functions with side effects, such as malloc, free, and I/O operations. Currently, we do not tune these tuning sections. As future work, these library functions with side effects could be rewritten so that their execution can be rolled back. In our experiments, RBR is applicable to all SPEC CPU2000 FP benchmarks and half of the INT benchmarks.
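One invocation of the RBR protocol in Figure 6 can be sketched as follows. The save/restore helpers, the callable versions, and the Python timer are all illustrative; in PEAK the versions are dynamically loaded compiled code and the checkpoint covers exactly the Modified_Input(TS) set. Step (1), swapping the two versions between invocations, is left to the caller.

```python
# A minimal sketch of one RBR comparison (Fig. 6). `save_modified_input` and
# `restore_modified_input` are hypothetical checkpoint helpers over the
# Modified_Input(TS) set; v1 and v2 are two compiled versions of the section.

import time

def rbr_compare(v1, v2, ts_input, save_modified_input, restore_modified_input):
    snapshot = save_modified_input(ts_input)      # Step 2: checkpoint input

    v1(ts_input)                                  # Step 3: precondition the
    restore_modified_input(ts_input, snapshot)    # caches; Step 4: roll back

    t0 = time.perf_counter(); v1(ts_input)        # Step 5: time Version 1
    t_v1 = time.perf_counter() - t0
    restore_modified_input(ts_input, snapshot)    # Step 6: roll back again

    t0 = time.perf_counter(); v2(ts_input)        # Step 7: time Version 2
    t_v2 = time.perf_counter() - t0
    return t_v1, t_v2                             # Step 8; Eq. (10) then gives
                                                  # R_{v2/v1} = t_v1 / t_v2
```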

3.4 The Use of Rating Methods in PEAK

We have presented three rating methods: CBR, MBR, and RBR. Context-based rating (CBR) has the least overhead but is not applicable to code without a major context. Model-based rating (MBR) works for code without a major context, but is not applicable to irregular programs. Re-execution-based rating (RBR) can be applied to almost all programs; however, its overhead is the highest among the three. Generally, the applicability of these three rating approaches increases in the order of CBR, MBR, and RBR; so does the overhead.

Before tuning, our PEAK compiler divides the target program into several tuning sections. The original source program is analyzed according to the techniques presented in the previous sections. The details about the compiler analysis can be found in Pan and Eigenmann [2004b]. After one profile run, the PEAK compiler finds the major context, if it exists, for CBR, and the execution time model for MBR. From the static analysis and profile information, the PEAK compiler decides which applicable rating method should be used for each tuning section, in the priority order of CBR, MBR, and RBR. Again, CBR is applicable if there is a major context whose execution time is more than 80% of the total execution time spent in the tuning section. MBR is applicable if there are only a few (fewer than 10) components in its execution time model. Then, the PEAK compiler inserts three kinds of instrumentation code into the source program to construct a tuning driver: (1) code to activate performance tuning; (2) code to measure the execution times and to trigger the rating methods; and (3) code to facilitate the rating methods, for example, context match code for CBR and the preconditional version for RBR. A sketch of this selection policy follows.
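```python
# A minimal sketch of the rating-method selection policy described above.
# The profile fields are illustrative stand-ins for the profile-run data;
# the 80% and 10-component thresholds come from the text.

def choose_rating_method(profile):
    if profile["major_context_fraction"] > 0.80:
        return "CBR"                  # a dominant context exists
    if profile["model_components"] < 10:
        return "MBR"                  # the execution time model is small
    if not profile["calls_side_effect_libs"]:
        return "RBR"                  # re-execution can be rolled back
    return None                       # section is left untuned (Section 3.3)

print(choose_rating_method({"major_context_fraction": 0.6,
                            "model_components": 4,
                            "calls_side_effect_libs": False}))
```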

During tuning, the tuning driver generates the rating, R(v), and the rating variance, Var(v), across a number of TS invocations, called a window. The tuning driver compares the R(v) of different versions to determine which version is the best. In summary, R(v) and Var(v) are computed as follows.

—CBR: Suppose that T(i, x) is the execution time of the ith invocation under context x. R(v) and Var(v) under context x are the mean and the variance of T(i, x), i = 1..w, where w is the window size. They are computed in accordance with Eqs. (3) and (4).



—MBR: R(v) is the execution time estimated from the execution time model, and Var(v) is the residual error of the linear regression. They are computed in accordance with Eqs. (8) and (9).

—RBR: Suppose that Rv/vb(i) is the relative performance of version v over base version vb at the ith invocation. R(v) and Var(v) are the mean and the variance of Rv/vb(i), i = 1..w, in accordance with Eqs. (11) and (12).

To improve rating accuracy, the tuning system applies two optimizations:

(1) The tuning system identifies and eliminates measurement outliers, which are far away from the average. Such data may result from system perturbations, such as interrupts.

(2) The tuning system uses a dynamic approach to determining the window size. It continually executes and rates a version until the rating variance Var(v) falls below a threshold. This optimization is applied based on the observation that Var(v) decreases with an increasing window size. A sketch of both optimizations follows.
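```python
# A minimal sketch of the two accuracy optimizations. The 3-sigma outlier
# cutoff, the variance threshold, and `measure_once` (one timed invocation of
# the current version) are illustrative assumptions, not PEAK's exact values.

from statistics import fmean, pstdev

def drop_outliers(samples, k=3.0):
    """Discard samples more than k standard deviations from the mean."""
    mu, sigma = fmean(samples), pstdev(samples)
    return [s for s in samples if abs(s - mu) <= k * sigma] or samples

def rate_with_dynamic_window(measure_once, var_threshold, max_window=200):
    """Grow the window until the rating variance drops below the threshold."""
    samples = []
    while len(samples) < max_window:
        samples.append(measure_once())
        if len(samples) < 5:
            continue                   # too few points to judge stability
        clean = drop_outliers(samples)
        r = fmean(clean)
        var = fmean((s - r) ** 2 for s in clean)
        if var <= var_threshold:
            return r, var, len(samples)
    clean = drop_outliers(samples)
    r = fmean(clean)
    return r, fmean((s - r) ** 2 for s in clean), len(samples)
```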

4. TUNING SECTION SELECTION

This section deals with the problem of how to select the important code sections in a program as the tuning sections, in order to achieve fast tuning speed and high tuned-program performance. To do so, tuning sections need to meet the following requirements.

(1) A tuning section should be invoked a large number of times, for example, more than 100 times, in one run of the program. A large number of invocations usually leads to fast tuning. For tuning section TSi, let the average number of invocations used to rate one optimized version be N1(TSi), the total number of invocations of the tuning section be Nt(TSi), and the number of optimized versions rated in one run of the program be Nv(TSi):

\[ N_v(TS_i) = N_t(TS_i) / N_1(TS_i). \tag{14} \]

So, when the number of invocations Nt(TSi) is large, the PEAK system may rate a large number, Nv(TSi), of versions in each run of the program, which means fast tuning.¹ If there are multiple tuning sections, the tuning time is bound by the slowest one. Denote the smallest number of invocations among the tuning sections as Nmin:

\[ N_{min} = \min_i N_t(TS_i). \tag{15} \]

So, the first requirement for good section selection is a large Nmin.

(2) Tuning sections should cover the largest possible part of the program.

The coverage of the tuning sections is computed based on the execution times. Denote the time spent in tuning section TSi as Tt(TSi), which includes the time spent in all the functions/subroutines invoked within this tuning section, and the total execution time of the program as Ttotal:

¹Although N1(TSi) may not be the same for different tuning sections, because it is determined dynamically as described in the previous section, generally a large Nt(TSi) still means a large Nv(TSi).



\[ \mathrm{Coverage} = \sum_i \frac{T_t(TS_i)}{T_{total}} \times 100\%. \tag{16} \]

A large coverage means that a large percentage of the program is tuned via optimization orchestration. So, the system can achieve good program performance.

(3) A tuning section should be big enough that the average execution time spent in one invocation is sufficiently large, for example, greater than 100 μsec. The average execution time Tavg(TSi) is computed from the number of invocations, Nt(TSi), and the execution time spent in TSi, Tt(TSi):

\[ T_{avg}(TS_i) = T_t(TS_i) / N_t(TS_i). \tag{17} \]

We do not choose tiny tuning sections, because they usually result in low timing accuracy, even though we use a high-resolution timer. Low timing accuracy results in low rating accuracy and a large number of invocations per version, N1(TSi).

This section presents a tuning section selection algorithm that meets all three requirements. Section 4.1 shows the call graph, annotated with execution time profiles, used for tuning section selection. Based on the call graph, the problem of tuning section selection is formally defined in Section 4.2. Section 4.3 presents our tuning section selection algorithm. In this algorithm, nodes for recursive functions are merged to construct a simplified call graph, which is a directed acyclic graph. An algorithm is designed to select the tuning sections by maximizing the program coverage under a given constraint on Nmin, working on the simplified call graph. The final algorithm iteratively calls the previous algorithm to trade off the coverage and Nmin.

4.1 Profile Data for Tuning Section Selection

From the previous section, a tuning section is selected based on its number of invocations and its execution time. These data are collected from a profile pass. In our implementation, we use the call graph profile generated by gprof. This call graph profile shows how much time is spent in each function and its children, how many times each function is called, and how many times the function calls its children during the profile run.

Our tuning section selection algorithm reads the output of gprof [Graham et al. 1982] and generates a call graph G = (V, E).² This call graph is a directed graph. It has one source (root) node, δ, whose in-degree is 0, and a set, Θ, of sink (leaf) nodes whose out-degrees are 0. Each node v ∈ V identifies a function.

²Here, call graph G contains the dynamic calls during the profile run, not the static calls in the program code. So, if a function call is not executed during the profile run, G does not include this call. However, a static call graph generated by a compiler should include this call. For the purpose of tuning section selection, the dynamic call graph is good enough. After tuning sections are selected, our compiler tools analyze and transform the program. During this process, a static call graph is used. Our tuning section selection algorithm can be applied to the static call graph as well, if profile information is assigned to its nodes and edges.


δ identifies the function main. The nodes in Θ identify the functions that do not call any other function. Each edge e ∈ E identifies a function call. The associated profile information is as follows.

\[ v = \{fn\} \tag{18} \]
\[ e = \{s, t, n, tm\} \tag{19} \]

fn(v) is the function name of the node. s(e) identifies the caller node; t(e) identifies the callee node; n(e) is the number of invocations of t(e) made by s(e); tm(e) is the time spent in t(e) and its callees, when t(e) is called from s(e).

Ideally, we could truncate the program at any place, for example, at the beginning of a basic block, to create a tuning section. In practice, we choose tuning sections at the procedure level, since we use the call graph profile generated by gprof. If we had accurate basic block profiles, we could select the tuning sections at the basic-block level. The tuning section selection algorithm would be similar to the one presented in the next sections. The difference would be that a bigger graph would be used, with basic blocks as the nodes instead of functions/subroutines.

4.2 A Formal Description of the Tuning Section Selection Problem

Tuning section selection aims at partitioning the program into several components. Basically, a tuning section starts with an entry function and includes all the subroutines called directly or indirectly by the entry function. If one subroutine is called by two different tuning sections, it is replicated into these two tuning sections. (Such replication does not lead to code explosion, because the number of tuning sections is fixed and usually small, around three.) In the representation of the call graph in Section 4.1, tuning section selection tries to find a set of single-entry-node regions (subgraphs) in G; each region is a tuning section. The entry function of a region identifies that region. One region can overlap with other regions, although it would be better if no regions overlap.

The problem of tuning section selection can be described, in a formal way, as an optimal edge cut problem. Given call graph G = (V, E), find an edge cut (Γ, Λ) so as to maximize the invocation numbers and the coverage of the tuning sections. Here, Γ and Λ are a partition of the node set V, such that Γ contains the source node δ, and Λ contains the set of sink nodes Θ. This edge cut (Γ, Λ) is the set of edges, each of which leaves Γ and enters Λ. The edge cut determines the set of tuning sections in two steps. (1) Find all the edges in this cut whose average execution times are greater than Tlb, the lower bound on the average execution time; that is, for each edge e ∈ (Γ, Λ), put e in set S if tm(e)/n(e) ≥ Tlb. (2) The edges in set S point to the selected tuning sections; that is, make the entry-node set T = {v | v = t(ei), ei ∈ S}. Each node v in set T identifies a tuning section. (Tuning sections are the subgraphs led by the entry function v.) Figure 7 gives an example.

The tuning section selection algorithm maximizes the invocation numbers and the coverage of the tuning sections, which are computed from the aforementioned S and T as follows:


Fig. 7. An example of tuning section selection. The graph is a call graph with node a as the main function. The weights on an edge are the number of invocations and, in parentheses, the execution time. The optimal edge cut is (Γ = {a, c}, Λ = {b, d, e, f}), shown by the dashed curve. Edges (a, b) and (c, f) are chosen as the S set. Edge (c, e) in the cut (Γ, Λ) is not included in S, because its average execution time, 1/20000, is less than Tlb = 1e−4. There are two tuning sections, led by node b and node f: T = {b, f}. The numbers of invocations of b and f are 1000 and 200, respectively, so Nmin = 200. The coverage of this optimal tuning section selection is (80 + 18)/100 = 0.98, where the total execution time, Ttotal, is 100.

(1) The number of invocations to the tuning section v is denoted as Nt(v).

\[ N_t(v) = \sum_{e \in S,\, t(e) = v} n(e). \tag{20} \]

One goal of the tuning section selection algorithm is to maximize the smallest Nt(v), denoted as Nmin:

\[ N_{min} = \min_{v \in T} N_t(v) \tag{21} \]

(2) The other goal of the tuning section selection algorithm is to maximize the execution coverage:

\[ \mathrm{Coverage} = \sum_{e \in S} tm(e) \big/ T_{total} \tag{22} \]

Ttotal is the total execution time of the program.
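These quantities are easy to compute once the cut edges are chosen. The sketch below evaluates Eqs. (20) to (22) on the two selected edges of the Figure 7 example; the (caller, callee, n, tm) tuple encoding is illustrative.

```python
# A minimal sketch of the selection metrics (Eqs. 20-22). S holds the selected
# cut edges as (caller, callee, n, tm) tuples; values mirror the Fig. 7 example.

def selection_metrics(S, t_total):
    entries = {callee for (_, callee, _, _) in S}        # entry-node set T
    n_t = {v: sum(n for (_, c, n, _) in S if c == v)     # Eq. (20)
           for v in entries}
    n_min = min(n_t.values())                            # Eq. (21)
    coverage = sum(tm for (_, _, _, tm) in S) / t_total  # Eq. (22)
    return n_t, n_min, coverage

S = {("a", "b", 1000, 80), ("c", "f", 200, 18)}
print(selection_metrics(S, t_total=100))   # Nmin = 200, coverage = 0.98
```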

The tuning section selection problem does not always have a reasonable solution. For example, suppose that a program has only one procedure, main(), which contains a loop consuming most of the execution time. If main is chosen as the tuning section, Nmin is 1 and the coverage is 100%. Otherwise, the coverage is 0%. The first solution degrades to whole-program tuning. The second solution does not find any tuning section. Neither of them is acceptable. In fact, we should use the loop body in main as a tuning section. Using a call graph profile, the selection algorithm cannot identify the loops within a procedure. Extra work is needed to find the loop and extract its body into a separate procedure. We call this step of the algorithm extra code partitioning. After this step, the loop body appears in the call graph profile, and it is then chosen as a tuning section by the selection algorithm. Finding the procedures that are not selected as tuning sections but are worth extra code partitioning is another task of the algorithm.


4.3 The Tuning Section Selection Algorithm

From the previous section, the tuning section selection problem can be viewed as a constrained max-cut problem. (The original max-cut problem is NP-complete [Karp 1972].) In this section, we develop a greedy algorithm to select the tuning sections so as to get maximal Nmin and coverage. This algorithm solves the problem in two steps. (1) We design an algorithm that aims to maximize coverage, under the constraint that the number of invocations of each selected tuning section is larger than a lower bound Nlb. (2) The final algorithm raises Nlb gradually to trade off the coverage. It aims to achieve a large Nmin by tolerating a small decrease in coverage. These two steps are presented in Section 4.3.2 and Section 4.3.3, respectively. We discuss the handling of recursive functions in Section 4.3.1.

4.3.1 Dealing with Recursive Functions. To call a recursive function, the program makes an initial call to the function. Then the function is called by itself, in the case of self-recursion, or by its callees, in the case of mutual recursion. Both self-recursive calls and mutually-recursive calls are referred to as recursive calls, which are different from the initial call. Our PEAK system treats initial calls to a recursive function as normal function calls; recursive calls can be viewed as loop iterations, which are ignored by tuning section selection. In other words, the tuning section selection algorithm does not choose the call graph edges that correspond to recursive calls, but only the edges corresponding to initial calls.

In a call graph, the functions (nodes) that recursively call themselves or each other form cycles (including self-loops). To exclude recursive calls from tuning section selection, our algorithm identifies the cycles and ignores the edges that appear in the cycles. We do this through a call graph simplification process. This process merges the nodes involved in a common cycle into one node, removes the edges used inside a cycle, and adjusts the edges entering or leaving the merged nodes. This process uses strongly connected components to find the nodes and edges that appear in a cycle. It adjusts the profile data as well. The pseudo-code of this simplification process is shown in Figure 8. An example is shown in Figure 9.

The simplified call graph has the self-recursive calls removed and a new node for each group of functions that recursively call each other. This graph is a Directed Acyclic Graph (DAG). The call graph simplification algorithm maintains the profile information for the new nodes and edges, so the graph carries the same profile information as described in Section 4.1.
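The cycle-merging step can be sketched with an off-the-shelf strongly-connected-components routine. The sketch below uses networkx and string node names, and it simplifies Figure 8 in one respect: it does not reproduce Step (3d)'s special handling of the entry nodes of a recursive component, so it is an approximation of the full algorithm.

```python
# A minimal sketch of call graph simplification via SCC merging, in the spirit
# of Fig. 8. Edge attributes n and tm carry the profile data; node names are
# assumed to be strings. Step (3d) of Fig. 8 is deliberately omitted here.

import networkx as nx

def simplify_call_graph(g: nx.DiGraph) -> nx.DiGraph:
    g = g.copy()
    g.remove_edges_from(list(nx.selfloop_edges(g)))  # drop self-recursive calls
    rep = {}                                         # node -> merged node name
    dag = nx.DiGraph()
    for scc in nx.strongly_connected_components(g):
        name = "".join(sorted(scc))                  # e.g., {b, e} -> "be"
        dag.add_node(name)
        for v in scc:
            rep[v] = name
    for u, v, data in g.edges(data=True):
        ru, rv = rep[u], rep[v]
        if ru == rv:
            continue                                 # inner edge of a cycle
        if dag.has_edge(ru, rv):                     # merge parallel edges,
            dag[ru][rv]["n"] += data["n"]            # summing profile data
            dag[ru][rv]["tm"] += data["tm"]
        else:
            dag.add_edge(ru, rv, n=data["n"], tm=data["tm"])
    return dag
```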

4.3.2 Maximizing Tuning Section Coverage under Nlb. This section describes a tuning section selection algorithm, which aims to achieve as large a coverage as possible, under the constraint that the number of invocations of each selected tuning section is larger than a lower bound Nlb. This algorithm selects the tuning sections and puts their entry functions into set T. It finds the functions that are worth code partitioning and puts them into set M. Besides Nlb, this algorithm uses two other parameters: (1) Tlb, the lower bound on average execution times; and (2) Plb, the lower bound on the execution percentage for a code section worth code partitioning.


Subroutine G = GenerateSimplifiedCallGraph(profile)

(1) Construct the call graph G = (V, E) according to the profile data, as described in Section 4.1. For each node v ∈ V, v = {fn}: fn(v) is the function name. For each edge e ∈ E, e = {s, t, n, tm}: s(e) is the caller node; t(e) is the callee node; n(e) is the number of invocations to t(e) made by s(e); tm(e) is the time spent in t(e) and its callees when t(e) is called from s(e).

(2) Remove loops in G. (A loop identifies a self-recursive call.)

(3) For each strongly connected component SCC that contains more than one node, do the following to remove the cycles by merging the nodes in SCC. (Such strongly connected components identify mutually-recursive calls.)
    (a) Construct a new node u for SCC. The name of u, fn(u), is the concatenation of the names of all the nodes in SCC.
    (b) Remove the inner edges, that is, the edges both starting from and ending in SCC.
    (c) Keep all the edges leaving SCC and set their starting node to be the new node u. Merge the edges that start from u and end at the same node, and sum up the profile data of the merged edges.
    (d) For each node v in SCC, remove v if there is no edge entering v. Otherwise, add a new edge e, which leaves v and enters the new node u. The profile information for this e is the sum of the profile data of all the edges entering v:

        n(e) = \sum_{x \in E,\ t(x) = v} n(x)    (23)

        tm(e) = \sum_{x \in E,\ t(x) = v} tm(x)    (24)

Fig. 8. The pseudo-code for call graph simplification. The algorithm generates a call graph from profile data, then detects and discards recursive calls. Hence, the call graph is simplified to a directed acyclic graph.
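As a minimal sketch (our illustration, not the PEAK implementation), the profile-annotated graph of Figure 8 could be represented in C as follows; the field names mirror the notation in the figure:

typedef struct Node {
    const char *fn;        /* fn(v): the function name */
} Node;

typedef struct Edge {
    Node  *s, *t;          /* s(e): caller node; t(e): callee node */
    long   n;              /* n(e): invocations of t(e) made by s(e) */
    double tm;             /* tm(e): time spent in t(e) and its callees
                              when t(e) is called from s(e) */
} Edge;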

Fig. 9. An example of call graph simplification. The graph is a call graph with node a as the main function. c is a self-recursive function; b and e recursively call each other. The weights on an edge are the number of invocations and, in parentheses, the execution time. After simplification, the loop at node c is discarded. The strongly connected component {b, e} is merged into one node be. The entry node b of this strongly connected component is kept, and a new edge (b, be) is added. Edges (b, f) and (e, f) are merged into (be, f). The profile data on edges (b, be) and (be, f) are updated.


Tlb is determined by the timing accuracy of the PEAK system; we use 100 μsec in our experiments. Plb is used to determine whether a code section is worth code partitioning; we use 0.02, which means that a code section is worth tuning if its execution time is greater than 2% of the total execution time. Nlb will be adjusted to trade off the tuning section coverage in the final tuning section selection algorithm described in Section 4.3.3; the optimal Nlb picked by the final algorithm usually ranges from tens to thousands.


Subroutine [T, M] = MaxCoverage(profile, Nlb, Tlb, Plb)
There are three thresholds used in this algorithm: Nlb, the lower bound on numbers of invocations; Tlb, the lower bound on average execution times; Plb, the lower bound on the execution percentage for a code section worth partitioning. The algorithm selects the tuning sections and puts their entry functions into set T. The functions that are worth code partitioning are put into M.

(1) Construct the simplified call graph G = (V, E) according to Figure 8, that is, G = GenerateSimplifiedCallGraph(profile).

(2) Clear the selection flag of each edge e ∈ E: f(e) = 0. Clear the execution time due to the selected tuning sections for each node v: TX(v) = 0. Empty sets T and M.

(3) Mark the edges whose average execution times are less than Tlb, that is, for each edge e ∈ E, set f(e) = −1 if tm(e)/n(e) < Tlb. (These edges will not be counted when summing the profile data of the edges for one node.)

(4) Sort all nodes in topological order: v1, v2, . . . , that is, if there is an edge (u, v), node u appears before node v. The nodes will be traversed in this order.

(5) For node vi (i = 1, 2, . . . ), compute the total number of invocations to vi:

    n(v_i) = \sum_{e \in E,\ t(e) = v_i,\ f(e) = 0} n(e)    (25)

If n(vi) is greater than Nlb, put vi into T and set f(e) = 1 for every edge with t(e) = vi; then update the profile information of G by calling UpdateProfile(G, vi, TX).

(6) Put node v into M if v is not selected but consumes a large amount of execution time, that is, if

    \sum_{e \in E,\ t(e) = v,\ f(e) \neq 1} tm(e) − TX(v) > Plb × Ttotal.

(These nodes may be partitioned to improve tuning section coverage.)

Fig. 10. Tuning section selection algorithm to maximize program coverage under the lower bound on the number of TS invocations, Nlb. This algorithm traverses the simplified call graph from the top down to find the code sections whose numbers of invocations are greater than Nlb. In addition, the algorithm finds the functions that may be partitioned to improve tuning section coverage.


In order to maximize the coverage, the algorithm traverses the call graph from the top down, in topological order, to select the code sections that meet the requirements. When a tuning section is selected, the profile data are updated to reflect the execution times and invocation numbers after excluding the selected tuning section. The execution time due to the selected tuning section is deducted from the execution time of its ancestors as well. (Recall that the execution time on node v includes the time spent in v itself and in the descendants of v.) After the selection process finishes, the remaining execution time on each node v is used to judge whether it is worth code partitioning; it is, if its remaining execution time is greater than Plb of the total execution time.

Figure 10 shows the pseudo-code of the algorithm that maximizes the tuning section coverage. The algorithm constructs an acyclic call graph, annotated with profile information after removing the recursive calls, according to the algorithm described in Section 4.3.1. It ignores the edges whose average execution time is less than threshold Tlb when computing the execution profile of a node. It goes through the call graph in topological order to find the nodes whose numbers of invocations are greater than threshold Nlb. These nodes are selected as entry functions to the tuning sections. When a node is selected, the profile information of the relevant edges is adjusted to reflect the number of invocations and the execution time spent in the rest of the program. In the end, the algorithm finds the functions worth code partitioning, using the residual execution time after excluding the selected tuning sections.
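As a hypothetical numeric illustration of step (5) in Figure 10: with Nlb = 400, if node v has two unmarked incoming edges with n = 150 and n = 300 invocations, then n(v) = 450 > Nlb, so v is selected as a tuning-section entry. If one of the edges had instead been marked in step (3) (f(e) = −1), only the other edge would count toward n(v) and v would not qualify.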


Subroutine UpdateProfile(G = (V, E), vi, TX)
This algorithm removes the execution time due to vi from each edge reachable from vi, and adds the time due to vi into TX for each ancestor of vi. (TX(v) records the time spent in v due to all the selected tuning sections. Manual code partitioning will use TX to estimate the remaining execution time after tuning section selection.)

Update the profile information of the edges that are reachable from node vi:

(1) Set the execution frequency of all nodes except vi to 0, q(v) = 0, while q(vi) = 1.0, which means vi is executed 100% within the chosen tuning section.

(2) For node vj (j = i+1, i+2, . . . ), compute the execution frequency of vj from the execution frequencies of its parents:

    q(v_j) = \frac{\sum_{e \in E,\ t(e) = v_j} q(s(e)) \times n(e)}{\sum_{e \in E,\ t(e) = v_j} n(e)}    (26)

(3) For node vj (j = i+1, i+2, . . . ), adjust the profile information of the edges entering vj, where t(e) = vj:

    n(e) = n(e) \times (1 − q(s(e)))    (27)

    tm(e) = tm(e) \times (1 − q(s(e)))    (28)

For the nodes that reach node vi, record how much execution time is spent due to vi. This time will be excluded when choosing code partitioning candidates:

(1) For each node v, set the time spent in v due to vi, p(v), to 0.

(2) Set the time spent in vi: p(v_i) = \sum_{e \in E,\ t(e) = v_i} tm(e).

(3) For node vj (j = i, i−1, . . . , 1), update the time spent in its parents due to the time spent in vi. For each edge e that enters vj, that is, t(e) = vj, adjust p(s(e)) as follows:

    p(s(e)) = p(s(e)) + \frac{tm(e)}{\sum_{a \in E,\ t(a) = v_j} tm(a)} \times p(v_j)    (29)

(4) For node vj (j = i−1, i−2, . . . , 1), update the time spent in vj due to all the selected tuning sections: TX(vj) = TX(vj) + p(vj).

Fig. 11. Updating the profile data after vi is chosen as the entry function to a tuning section. The updated profile reflects the execution times and invocation numbers after excluding the chosen tuning section.


When a tuning section is selected, the profile data need to be updated in order to exclude the execution information (time and number of invocations) due to the selected tuning section. This requires a context-sensitive inclusive profile, which lists the direct and indirect callers of a function together with the execution information of this function; such a profile should show the execution time and number of invocations of each function call, split per call path. Unfortunately, gprof does not provide this information. Instead, we use an estimation algorithm, described in Figure 11, which handles the descendants and the ancestors of the selected node v separately.

For each descendant u, the algorithm estimates how often u is directly or indirectly called from the selected node v, out of the total invocations to u. This execution frequency of u is computed from the execution frequencies of u's parents and the numbers of invocations to u directly from those parents, assuming the execution frequency of v is 100%. Equation (30) does the estimation, where q(u) is the execution frequency of u.

    q(u) = \frac{\sum_{e \in E,\ t(e) = u} q(s(e)) \times n(e)}{\sum_{e \in E,\ t(e) = u} n(e)}    (30)

Knowing the execution frequencies of all the descendants, the profile information for all the edges reachable from v can be estimated in accordance with Eqs. (31) and (32).

    n(e) = n(e) \times (1 − q(s(e)))    (31)

    tm(e) = tm(e) \times (1 − q(s(e)))    (32)

The algorithm traverses the node list forwards from node v. (The nodes are sorted in topological order.)
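As a hypothetical illustration of Eqs. (30)-(32): suppose u is called 100 times from parent x1 with q(x1) = 1.0 and 200 times from parent x2 with q(x2) = 0. Then q(u) = (1.0 × 100 + 0 × 200)/300 = 1/3; the edge from x1 is scaled by 1 − q(x1) = 0 and disappears from the residual profile, while the edge from x2 is left unchanged.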

For each ancestor u of the selected node v, the algorithm estimates how much execution time is spent in u due to v, denoted p(u). The time spent in the selected node v, p(v), is equal to \sum_{e \in E,\ t(e) = v} tm(e). We assume that the time spent in u due to v is distributed over u's parents proportionally to the time spent in u when u is called from each parent x. That is, p(u) is distributed to p(x) proportionally to tm(e), where e leaves x and enters u. So, p(u) is distributed in accordance with the following equation:

    p(s(e)) = p(s(e)) + \frac{tm(e)}{\sum_{a \in E,\ t(a) = u} tm(a)} \times p(u)    (33)

The algorithm traverses the node list backwards from node v.
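As a hypothetical illustration of Eq. (33): if p(v) = 12 seconds and v is entered by an edge from x1 with tm = 8 seconds and an edge from x2 with tm = 4 seconds, the backward pass adds 8/12 × 12 = 8 seconds to p(x1) and 4 seconds to p(x2), and then distributes those amounts further up toward the main function in the same manner.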

4.3.3 The Final Tuning Section Selection Algorithm. The previous section describes how to maximize the tuning section coverage under the constraint that the number of invocations to each selected tuning section must be larger than Nlb. This section aims to achieve a large Nmin while tolerating a small decrease in coverage. It does this by raising Nlb gradually, trading coverage for invocation counts.

Two new parameters are introduced to the final algorithm.

(1) Clb, the lower bound on the tuning section coverage. If the coverage of the selected tuning sections is smaller than Clb, code partitioning is necessary. We set it to 80% of the total execution time.

(2) Rub, the upper bound on the coverage drop rate. The coverage drop rate is computed from two tuning section selection solutions as follows, where Nmin2 is larger than Nmin1:

    R = \frac{coverage_1 − coverage_2}{N_{min2} − N_{min1}}    (34)

If, on average, increasing Nmin by 1 makes the coverage drop by more than Rub, the algorithm has found a trade-off point. In our experiments, we tolerate a 1% decrease of the coverage if Nmin can be improved by 100; so, we set Rub to 0.01/100 = 1e−4.


Subroutine [T, M] = TSSelection(profile, Rub, Clb, Tlb, Plb)
There are four thresholds used in this algorithm: Rub, the upper bound on the coverage drop rate; Clb, the lower bound on the tuning section coverage; Tlb, the lower bound on average execution times; Plb, the lower bound on the execution percentage for a code section worth code partitioning. The algorithm selects the tuning sections and puts their entry functions into set T. The functions that are worth code partitioning are put into M.

(1) Initialization: Tbest = φ, Mbest = φ, coveragebest = 0, Nbest = 0, Nlb = 10.

(2) Use the algorithm in Figure 10 to maximize the coverage:
    [T, M] = MaxCoverage(profile, Nlb, Tlb, Plb)
    Compute coverage and Nmin of the selected tuning sections T.

(3) If coverage is less than Clb, do the following.
    (a) If Tbest is φ, code partitioning is necessary. Print T and M. Stop.
    (b) Else, Tbest and Mbest are the solution. Print Tbest and Mbest. Stop.

(4) Trade off coverage and Nmin as follows.
    (a) If Tbest is φ, record this solution: Tbest = T, Mbest = M, coveragebest = coverage, Nbest = Nmin.
    (b) Else, compute the coverage drop rate R:

        R = \frac{coverage_{best} − coverage}{N_{min} − N_{best}}    (35)

    If the drop rate R is less than Rub, record this solution: Tbest = T, Mbest = M, coveragebest = coverage, Nbest = Nmin. Otherwise, the trade-off point is found. Print Tbest and Mbest. Stop.

(5) Set Nlb = Nmin and go to Step (2).

Fig. 12. The final tuning section selection algorithm. This algorithm achieves both a large Nmin and a high coverage. It iteratively uses the method shown in Figure 10 to maximize the tuning section coverage under a series of thresholds Nlb, until the optimal Nlb is found.

Table I. Tuning Section Selection for mgrid. The Best Nlb is 400. The Optimal coverage and Nmin are 0.957 and 2000, Respectively

iteration   Nlb    coverage   Nmin
1           10     0.998      400
2           400    0.957      2000
3           2000   0.808      2400

Figure 12 shows the pseudo-code of this tuning section selection algorithm. The algorithm iteratively uses the method shown in Figure 10 to maximize the tuning section coverage under a series of thresholds Nlb. The Nlb of the next iteration, Nlb2, is set to the Nmin obtained from the previous iteration. We note that this Nmin is greater than the old Nlb of the previous iteration, Nlb1, and that any threshold value in [Nlb1, Nmin) gives the same solution for maximizing the coverage. Using Nmin from the previous iteration as the new threshold value for Nlb therefore makes the trade-off process fast. The process finishes when the coverage drops below Clb or the coverage drop rate exceeds Rub.

For example, Table I shows the result of each iteration when applying this algorithm to the benchmark mgrid. The second iteration yields the optimal result, with coverage = 0.957 and Nmin = 2000. (The initial Nlb is 10, a reasonable lower boundary for our rating methods to achieve faster tuning than whole-program tuning.)
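Working through Eq. (35) with the numbers in Table I: from iteration 1 to 2, R = (0.998 − 0.957)/(2000 − 400) ≈ 2.6e−5, which is below Rub = 1e−4, so the iteration-2 solution is recorded; from iteration 2 to 3, R = (0.957 − 0.808)/(2400 − 2000) ≈ 3.7e−4 > Rub, so the trade-off point is reached and the recorded iteration-2 solution is final.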


Fig. 13. Block diagram of the PEAK performance tuning system. The blocks in the diagram are the components of PEAK and the input and output of these components. Steps (1) to (6) correspond to the ones shown in Figure 1.

5. THE PEAK SYSTEM

Section 2 presented a fast and effective orchestration algorithm, which iteratively generates optimized code versions and evaluates their performance, based on the execution time under a training input, to search for the best version. Noticing that a partial execution of the program may enable faster performance tuning, we developed three rating methods. These methods are applied to important code sections of the program, called tuning sections. The rating methods, presented in Section 3, evaluate the performance of an optimized version of a tuning section based on a number of invocations of the tuning section. Section 4 showed an algorithm that selects the tuning sections of a program so as to maximize both the program execution time coverage and the numbers of invocations to the tuning sections, aiming at high tuned program performance and fast tuning speed. This section puts everything together to construct an automated performance tuning system, PEAK. Section 5.1 shows the design of PEAK. Special implementation problems related to runtime code generation and loading are discussed in Section 5.2. Section 5.3 gives an example of using the PEAK system.

5.1 The Design of PEAK

Figure 13 displays a block diagram of the PEAK system, showing its components. The PEAK system has two major parts: the PEAK compiler and the PEAK runtime system. The PEAK compiler is used before tuning, while the PEAK runtime system is used during tuning.

(1) The tuning section selector chooses the important code sections as the tuning sections, using the call graph profile generated by gprof. It applies the algorithm described in Section 4. The output of this step is a list of function names, each of which identifies a tuning section.

(2) The rating method consultant analyzes the source program to find the applicable rating methods for each tuning section. The compiler techniques developed in Section 3 are implemented here. This tool annotates the program with information about the context variables for CBR and the performance model for MBR.

(3) The PEAK instrumentation tool applies the appropriate rating method to each tuning section, after a profile run that finds the major context for CBR and the model parameters for MBR, according to Section 3. It adds the initialization and finalization functions that activate the PEAK runtime system, as well as functions to load and save the tuning state of previous runs, since the performance tuning driver may run the program multiple times in Step (5). Each tuning section is extracted into a separate file, which will be compiled in Step (5) under different optimization combinations to generate optimized versions.

(4) The instrumented code is compiled and linked with the PEAK runtime library, forming the performance tuning driver. The PEAK runtime system implements the three rating methods described in Section 3 and the CE optimization orchestration algorithm developed in Section 2. Special functions for dynamically loading the binary code during tuning are also included in the PEAK runtime system. (The generation of the tuning driver and of the optimized tuning section versions is done by the backend compiler, in this article the GCC compiler.)

(5) The performance tuning driver iteratively runs the program under a training input until optimization orchestration finishes for all tuning sections. At each invocation of a tuning section, the driver takes over control. It runs and times the current experimental version and decides whether more invocations are needed to rate the performance of this version. After the rating of this version is done (i.e., when the rating variance is small enough), the driver generates new experimental versions according to the orchestration algorithm. (The tuning sections are tuned independently.) The tuning process ends when the best version is found for each tuning section.

(6) After the tuning process finds the best optimized version of each tuning section, these best versions are linked to the main program to generate the final tuned code.

5.2 Dynamic Code Generation and Loading

PEAK tunes program performance by executing the program under a training input. It generates and loads binary code at runtime. (This distinguishing feature is inherited from the ADAPT [Voss and Eigenmann 2000, 2001] infrastructure.) To do so, PEAK uses the dynamic linking facility functions dlopen, dlsym, dlclose and dlerror. Basically, these functions enable PEAK to load binary code into memory and to resolve the address of the experimental version contained in that binary. (The binary code is generated by the backend compiler, GCC, under the given optimization options.)
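The following minimal sketch illustrates such a loading step (it is our illustration, not the PEAK code; the file path and symbol name are hypothetical):

#include <dlfcn.h>
#include <stdio.h>

typedef void (*FunType)(void);

/* Load one experimental version of a tuning section from a shared
   object and resolve its entry symbol. Returns NULL on failure;
   *handle_out receives the dlopen handle so the version can later be
   unloaded with dlclose once it has been rated. */
static FunType load_version(const char *so_path, const char *symbol,
                            void **handle_out)
{
    void *handle = dlopen(so_path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return NULL;
    }
    dlerror();                          /* clear any stale error state */
    FunType fun = (FunType)dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL) {
        fprintf(stderr, "dlsym: %s\n", err);
        dlclose(handle);
        return NULL;
    }
    *handle_out = handle;
    return fun;
}

A call such as load_version("./calc1_v3.so", "calc1_", &handle) would then return a pointer through which the experimental version can be invoked and timed.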

To generate an optimized version of an individual tuning section, PEAK extracts the section into a separate source file. This file includes the entry function of the tuning section and all its direct and indirect callees; it can be compiled and optimized separately. Since callees are included in the tuning section, inlining can be performed during code generation. If a procedure is called in two tuning sections, it is replicated and renamed in the corresponding source files. Such replication does not lead to code explosion, because the number of tuning sections is fixed and usually small, around three. Our PEAK compiler performs the above tasks automatically through source-to-source translation.


SUBROUTINE calc1
...
DO j = 1, n, 1
   DO i = 1, m, 1
   ...
   ENDDO
ENDDO
DO j = 1, n, 1
...
ENDDO
DO i = 1, m, 1
...
ENDDO
...
RETURN
END

Fig. 14. The tuning section calc1 in swim.


To load an optimized version successfully and correctly, PEAK needs to pay attention to the global variables used in the tuning sections, especially the static variables in C and the common blocks in Fortran. Because multiple versions of the same tuning section are invoked during one run of the program, PEAK ensures that each global variable used in these versions is located at the same address when the versions are loaded.
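One way to achieve this with a GCC-based toolchain (an assumption on our part, since the article does not spell out the mechanism) is to export the driver's symbols to the dynamic loader, for example by linking the tuning driver with "gcc -rdynamic ... -ldl" and building each experimental version as a position-independent shared object, for example "g77 -shared -fPIC -O3 ... -o calc1_v3.so calc1.f" (file names hypothetical). References to common blocks and other external data in the loaded versions then bind to the single definitions in the driver rather than to private copies; C static variables would additionally need to be given external linkage during the source-to-source translation, which we assume the PEAK compiler handles.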

5.3 An Example of Using PEAK

This section uses the benchmark swim to illustrate the six tuning steps of PEAK.

(1) The tuning section selector uses a gprof output and selects four tuning sections: calc1, calc2, calc3 and loop3500. Figure 14 shows the source code of the tuning section calc1 as an example.

(2) The rating method consultant analyzes the source code to find the applicable rating methods. For calc1, all three rating methods are applicable. The context variables are n and m. After one profile run, PEAK finds that n and m are runtime constants. So, there is only one context for calc1, and context-based rating will be applied.

(3) The PEAK instrumentation tool adds PEAK runtime library calls to the source program. Each tuning section is extracted into a separate file, which will be compiled under different optimization combinations to generate experimental versions during performance tuning. The instrumentation code is added to each tuning section and to the entry and exits of the program.


extern void calc1_old_();

void calc1_()
{
    hrtime_t t0, t1;
    typedef void(*FunType)();
    FunType fun = NULL;
    hrtime_t tm;

    //Experiment with the current optimized version
    if( (fun=(FunType)DCGetVersion(0)) != NULL ) {
        t0 = gethrtime();
        fun();                  //invoke the experimental version
        t1 = gethrtime();
        tm = t1 - t0;           //time the current invocation
        DCMonRecordCBR(0, tm);  //rating generation
        return;
    }

    //Optimization orchestration is done
    calc1_old_();
}

Fig. 15. The tuning section calc1 instrumented by the PEAK compiler.

—Each entry to a tuning section is redirected to the corresponding instrumented code. Figure 15 shows the instrumented calc1, which calls two PEAK runtime functions, DCGetVersion and DCMonRecordCBR.

(a) DCGetVersion implements the optimization orchestration algorithm and dynamic code generation and loading. It returns the current experimental version. If the previous version has already been rated, this call generates and loads a new optimized version.

(b) DCMonRecordCBR passes the execution time obtained from a high-resolution timer to the context-based rating method. The rating method rates the current version based on a small number of invocations. The orchestration algorithm uses the ratings of previous versions to guide the generation of new optimized versions.

(c) The argument, 0, of these two functions identifies the tuning section of calc1.

In the code, calc1_old_ is the original version of the tuning section.
—At the entry of the entire program, the function init_dc is called to set up the tuning state. It initializes the tuning state of the four tuning sections, sets up the orchestrated GCC optimizations, specifies the optimization orchestration algorithm and the rating method for each tuning section, and loads the tuning state from the previous run, because multiple runs of the program may be involved during performance tuning.

—At the exits of the program, the function exit_dc is called to save the tuning state.

(4) The instrumented program is compiled and linked with the PEAK runtime library to generate the performance tuning driver.


(5) During performance tuning, the tuning driver repeatedly runs under a training input until optimization orchestration is done for all tuning sections. Its output shows the best optimization combination for each tuning section. An example output for calc1 on a Pentium 4 machine is “g77 -c -O3 -fno-strength-reduce -fno-rename-registers -fno-align-loops calc1.f”. This means that three optimizations, strength-reduce, rename-registers and align-loops, should be turned off relative to the baseline O3.

(6) In the final version generation stage, each tuning section uses the corresponding best optimization combination, as determined in the previous step. To generate the final program version, these optimized tuning sections are linked to the main program, which is compiled under the default optimization combination.

6. EXPERIMENTAL RESULTS

6.1 GCE Performance on SUN Forte Compilers

In Pan and Eigenmann [2006], we compared our CE algorithm with other optimization orchestration algorithms and showed that CE takes the least tuning time (57% of the closest alternative) while achieving the same program performance. In this section, we conduct an experiment to evaluate the GCE algorithm for whole-program tuning using SUN Forte compilers. GCE is our generalized CE algorithm described in Section 2.2, aimed at multi-valued compiler options. We compare the performance achieved by our GCE algorithm with that of manual tuning, based on the SPEC CPU2000 performance results [SPEC 2000]. In such a result, there is a base setting and there are multiple peak settings for the optimization options. The base setting is common to all benchmarks; we use it as the baseline performance. The peak settings may differ across benchmarks and achieve better performance than the base setting. We compare GCE with the peak settings.

The experiments are conducted on a Sun Enterprise 450 SPARC II machine. We evaluate the GCE algorithm using the optimization options of the Forte Developer 6 compilers. The baseline option setting is “-fast -xcrossfile -xprofile”. Table II lists the tuned options. The Forte compilers have some options that can be passed directly to the compiler components. These options are passed via the “-W” option for the C compiler and the “-Qoption” option for the Fortran and C++ compilers. In the table, we list the “-Qoption” form only (the “-W” options are similar).

Figure 16 shows the performance results. GCE tunes performance using the train dataset as input; the final performance is evaluated with the ref dataset. In summary, GCE achieves equal or better performance for each benchmark. On average, for floating point benchmarks, GCE achieves a 10.8% improvement relative to the base setting, compared to 5.6% for the peak settings. For integer benchmarks, GCE achieves 8.1%, compared to 4.1% for the peak settings.


Table II. Optimization Options Orchestrated by GCE

Flag Name                      Experimented Values                  Meaning
-xarch                         v8, v8plus, generic                  target architecture instruction set
-xO                            3, 4, 5                              optimization level
-xalias_level                  std, strong, basic                   alias level
-stackvar                      on/off                               using the stack to hold local variables
-d                             y, n                                 allowing dynamic libraries
-xrestrict                     %all, %none                          pointer-valued parameters as restricted pointers
-xdepend                       on/off                               data dependence test and loop restructuring
-xsafe=mem                     on/off                               assuming no memory protection violations
-Qoption iropt -crit           on/off                               optimization of critical control paths
-Qoption iropt -Abopt          on/off                               aggressive optimizations of all branches
-Qoption iropt -whole          on/off                               whole program optimizations
-Qoption iropt -Adata_access   on/off                               analysis of data access patterns
-Qoption iropt -Mt             500, 1000, 2000, 6000, default       max size of a routine body eligible for inlining
-Qoption iropt -Mr             6000, 12000, 24000, 40000, default   max code increase due to inlining per routine
-Qoption iropt -Mm             6000, 12000, 24000, default          max code increase due to inlining per module
-Qoption iropt -Ma             200, 400, 800, default               max level of recursive inlining

6.2 Results of Tuning Section Selection

This section evaluates the tuning section selection algorithm described in Section 4. The default threshold values are used, that is, Rub = 1e−4, Clb = 80%, Tlb = 100 μsec, Plb = 2%. Again, Rub is the upper bound on the coverage drop rate; Clb is the lower bound on the tuning section coverage; Tlb is the lower bound on average execution times; Plb is the lower bound on the execution percentage for a code section worth code partitioning. Focusing on scientific programs, we analyze the FP benchmarks in detail in Section 6.2.1. The results on the INT benchmarks are presented in Section 6.2.2. The data are listed in Table III and Table IV, respectively.

6.2.1 SPEC CPU2000 FP Benchmarks. Table III shows the results of our tuning section selection algorithm, demonstrating that it achieves the goal of maximizing both the program coverage and the number of invocations to the tuning sections. In most benchmarks, the coverage is 90% or higher. Three codes needed extra code partitioning.

The minimum number of invocations, Nmin, ranges from hundreds to thousands for all benchmarks except wupwise. Wupwise contains many small functions with average execution times on the order of microseconds, below the chosen threshold Tlb. When we lower the threshold to 1 μsec, the algorithm finds tuning sections that are called a very large number of times (22528000). This solution necessitates a high-resolution timer for accurate measurements, which is available in our system.


[Figure 16 appears here: two bar charts of the performance improvement percentage (%), (a) for the SPEC CPU2000 FP benchmarks and (b) for the SPEC CPU2000 INT benchmarks, each plotting “SPEC peak performance / base performance” and “GCE performance / base performance” for every benchmark and the average.]

Fig. 16. Program performance achieved by the GCE algorithm vs. the performance of the manually tuned results (peak setting). Higher is better. In all cases, GCE achieves equal or better performance. On average, GCE nearly doubles the improvement of the manually tuned peak settings.


The three benchmarks that needed extra partitioning are equake, sixtrack and swim. In these codes, our tuning section selection algorithm identifies procedures that take a large execution time but have only a few invocations. In these procedures, the extra partitioning step extracts the body of the important loop into a separate procedure. After partitioning, the new procedure covers a significant part of the program execution time with a large number of invocations. Extra code partitioning, while automatable, was done manually in our experiments.


Table III. Tuning-Section Selection Results for SPEC CPU2000 FP Benchmarks. (The three benchmarks that needed extra code partitioning are annotated with ‘*’. The last row, wupwise+, uses a smaller Tlb = 1 μsec.)

Benchmark   coverage   Nmin       # of TS
ammp        88.6       127        3
applu       97.9       250        5
apsi        87.8       720        9
art         99.9       250        2
equake      54.6       2709       1
equake*     99.0       2709       1
mesa        96.9       4000       1
mgrid       95.7       2000       4
sixtrack    10.4       208        2
sixtrack*   97.9       1693       2
swim        83.9       198        3
swim*       99.2       198        4
wupwise     91.7       22         1
wupwise+    83.0       22528000   2

Table IV. Tuning-Section Selection Results for SPEC CPU2000 INT Benchmarks. (The benchmarks annotated with ‘+’ use smaller Tlb values.)

Benchmark   coverage   Nmin    # of TS
bzip2       99.9       22      6
crafty      100.0      1272    1
gap         96.6       24      3
gcc         90.7       109     1
gzip        41.7       3668    3
gzip+       84.1       4021    5
mcf         46.6       5235    1
mcf+        74.7       5235    2
parser      91.6       309     3
perlbmk     76.2       11      5
perlbmk+    92.6       11      6
twolf       98.0       120     1
vortex      78.0       6209    2
vortex+     95.5       4000    7
vpr         52.7       10746   1
vpr+        98.3       10746   2


6.2.2 SPEC CPU2000 INT Benchmarks. Table IV shows that our tuning section selection algorithm achieves the goal of maximizing the program coverage and the number of invocations to the tuning sections for the INT benchmarks as well.

In general, the profiles of the INT benchmarks differ from those of the FP benchmarks: (1) the INT benchmarks have flatter profiles with more functions in the call graphs; (2) the call graphs of the INT benchmarks are deeper; and (3) the functions in the INT benchmarks usually have small average execution times. Due to (3), the default threshold on the average execution time, Tlb = 100 μsec, prohibits the selection of some important functions in gzip, mcf, perlbmk, vortex, and vpr. After using a smaller Tlb, ranging from 0.01 μsec to 1 μsec, their program coverage is improved. (For these benchmarks, a high-resolution timer is needed to generate accurate performance ratings for the selected tuning sections.)


Table V. Optimization Options in GCC 3.3.3

F1  rename-registers             F2  inline-functions
F3  align-labels                 F4  align-loops
F5  align-jumps                  F6  align-functions
F7  strict-aliasing              F8  reorder-functions
F9  reorder-blocks               F10 peephole2
F11 caller-saves                 F12 sched-spec
F13 sched-interblock             F14 schedule-insns2
F15 schedule-insns               F16 regmove
F17 expensive-optimizations      F18 delete-null-pointer-checks
F19 gcse-sm                      F20 gcse-lm
F21 gcse                         F22 rerun-loop-opt
F23 rerun-cse-after-loop         F24 cse-skip-blocks
F25 cse-follow-jumps             F26 strength-reduce
F27 optimize-sibling-calls       F28 force-mem
F29 cprop-registers              F30 guess-branch-probability
F31 delayed-branch               F32 if-conversion2
F33 if-conversion                F34 crossjumping
F35 loop-optimize                F36 thread-jumps
F37 merge-constants              F38 defer-pop


In bzip2, gap and perlbmk, some important functions are invoked only tens of times, which brings down Nmin, the minimum number of invocations to the tuning sections. For the other benchmarks, Nmin ranges from hundreds to thousands.

6.3 PEAK Performance

6.3.1 Experimental Environment for PEAK. We evaluate PEAK using the optimization options of the GCC 3.3.3 compiler on two different computer architectures: Pentium 4 and SPARC II. Our reasons for choosing GCC are that this compiler is widely used, has many easily accessible compiler optimizations, and is portable across many different computer architectures.

In this section, we use all 38 optimization options implied by “O3”, the highest optimization level. These options are listed in Table V and are described in the GCC manual [GNU 2005].

We take our measurements using all SPEC CPU2000 benchmarks written in F77 and C, which are amenable to GCC. To differentiate the effect of compiler optimizations on integer (INT) and floating-point (FP) programs, we display the results of these two benchmark categories separately. Our overall tuning process is similar to profile-based optimization: a train dataset is used to tune the program, and a different input, the SPEC ref dataset, is usually used to measure performance. To separate the performance effects attributed to the tuning algorithms from those caused by the input sets, we measure program performance under both the train and ref datasets.

For whole-program tuning, to ensure accurate measurements and to eliminate perturbation by the operating system, we re-execute each code version multiple times in a single-user environment, until the three least execution times are within a range of [−1%, 1%]. In most of our experiments, each version is executed exactly three times; hence, the impact on tuning time is negligible.
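The stopping rule can be made concrete with a small sketch (ours, not the PEAK code; the article does not state the reference point of the ±1% window, so we assume the median of the three smallest times):

#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Return 1 when the three smallest of n measured times lie within
   +/-1% of their median, i.e., the measurements are stable enough. */
static int stable_enough(const double *times, int n)
{
    double sorted[64];
    if (n < 3 || n > 64)
        return 0;
    memcpy(sorted, times, n * sizeof(double));
    qsort(sorted, n, sizeof(double), cmp_double);
    return sorted[0] >= 0.99 * sorted[1] && sorted[2] <= 1.01 * sorted[1];
}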

In our experiments, the same code version may be generated under different optimization combinations.3 This observation allows us to reduce tuning time: we keep a repository of the code versions generated under different optimization combinations, which lets us memorize and reuse their performance results. Different orchestration algorithms use their own repositories and are affected in similar ways, so our comparison remains fair.
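As footnote 3 notes, identical binaries can be detected by direct comparison. A minimal sketch of such a repository check (our illustration; the file paths are hypothetical):

#include <stdio.h>

/* Return 1 if the two generated binaries are byte-for-byte identical,
   so that a stored performance rating can be reused instead of
   re-measuring the version. */
static int same_binary(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int same = (fa != NULL && fb != NULL);
    while (same) {
        int ca = fgetc(fa), cb = fgetc(fb);
        if (ca != cb)
            same = 0;          /* contents (or lengths) differ */
        else if (ca == EOF)
            break;             /* both files ended together: identical */
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return same;
}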

6.3.2 Metrics. Two important metrics characterize the behavior of optimization orchestration:

(1) The program performance of the best optimized version found by the orchestration algorithm. We define it as the performance improvement percentage of the best version relative to the base version under the highest optimization level, O3.

(2) The total tuning time spent in the orchestration process. Because the execution times of different benchmarks are not the same, we use the metric of normalized tuning time (NTT). NTT is the tuning time (TT) normalized by the time of evaluating the whole program optimized under O3, that is, one compilation time (CTB) plus three execution times (ETB) of the base version:

NTT = TT/(CTB + 3 × ETB). (36)

A good optimization orchestration method should achieve both high program performance and short normalized tuning time.
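For instance, with hypothetical numbers: if a benchmark's base version takes CTB = 30 seconds to compile and ETB = 60 seconds to run, and an orchestration algorithm spends TT = 2 hours, then NTT = 7200/(30 + 3 × 60) ≈ 34.3; that is, the tuning cost equals about 34 compile-and-evaluate cycles of the O3 baseline.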

6.3.3 SPEC CPU2000 FP Benchmarks. Figure 17 shows the normalized tuning time of whole-program tuning and of the PEAK system. (For PEAK, the tuning time includes the time spent in all six tuning steps.) On average, the normalized tuning time is reduced from 68.3 to 3.36, so PEAK gains a tuning speedup of 20.3. The benchmarks with high speedups usually have large numbers of invocations to their tuning sections, as shown in Table III; this agrees with our assumption during tuning section selection in Section 4. On average, the absolute tuning time is reduced from 2.19 hours to 5.85 minutes. Applying the rating methods of Section 3 thus significantly improves the tuning time.

Figure 18 shows the tuning time percentages of the six tuning steps. Most of the time is spent in Step (5), the performance tuning (PT) stage. The second largest portion of the tuning time is spent in Step (1), the tuning section selection (TSS) stage; this is because Step (1) does a profile run, which in some cases (e.g., wupwise) takes more time than a normal run of the program. The last large portion of the tuning time is spent in Step (2), the rating method analysis (RMA) stage. Some of that time is spent in data flow analysis; some is spent in the profile run that obtains the context parameters for CBR and the execution model parameters for MBR. For programs with a large code size (e.g., ammp and mesa), compiling the source code into the internal representation of our source-to-source compiler also accounts for a big portion of the tuning time. (This compilation time comes from the compiler infrastructure we use to implement the PEAK compiler.)

3Comparing the binaries generated under two different optimization combinations via the UNIX utility diff can show whether the two binaries are identical.


[Figure 17 appears here: a bar chart of normalized tuning time (Whole vs. PEAK) for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise and their geometric mean. Whole-program tuning: 62.22, 50.99, 105.76, 69.23, 63.14, 89.28, 50.59, 87.32, 36.96, 102.97 (GeoMean 68.28); PEAK: 2.33, 7.06, 11.21, 4.03, 1.79, 2.33, 3.38, 4.22, 2.59, 1.61 (GeoMean 3.36).]

Fig. 17. Normalized tuning time of whole-program tuning and the PEAK system for SPEC CPU2000 FP benchmarks on Pentium 4. Lower is better. On average, PEAK gains a speedup of 20.3 over whole-program tuning.

[Figure 18 appears here: a stacked bar chart of the percentage of total tuning time spent in each of the six stages (TSS, RMA, CI, DG, PT, FVG) for each FP benchmark and the average.]

Fig. 18. Tuning time percentage of the six stages shown in Figure 1 for SPEC CPU2000 FP benchmarks on Pentium 4. (TSS: tuning section selection, RMA: rating method analysis, CI: code instrumentation, DG: driver generation, PT: performance tuning, FVG: final version generation.) The most time-consuming steps are PT, TSS and RMA.



[Figure 19 appears here: a bar chart of the relative performance improvement percentage (%) of Whole_Train, PEAK_Train, Whole_Ref and PEAK_Ref for each FP benchmark and the geometric mean.]

Fig. 19. Program performance improvement relative to the baseline under O3 for SPEC CPU2000 FP benchmarks on Pentium 4. Higher is better. All benchmarks use the train dataset as the input to the tuning process. Whole_Train (PEAK_Train) is the performance achieved by whole-program tuning (the PEAK system) under the train dataset. Whole_Ref and PEAK_Ref use the ref dataset to evaluate the tuned program performance, but still the train dataset for tuning. PEAK achieves equal or better program performance than whole-program tuning.


The results on SPARC II are similar. The normalized tuning time is reduced from 63.42 to 4.88, a tuning speedup of 13.0; the absolute tuning time is reduced from 9.83 hours to 43.7 minutes. (Our SPARC II machine is slower than our Pentium 4 machine.)

Figure 19 shows the program performance achieved by whole-program tuning and by the PEAK system on Pentium 4. We use the train dataset as the input to the tuning process.

The first two bars show the performance of the final tuned version under the same train dataset for whole-program tuning and for the PEAK system. For all benchmarks, PEAK achieves equal or better performance; it outperforms whole-program tuning by 1.5% on applu. Some benchmarks, such as equake and mesa, have only one tuning section. Others, such as art, swim and mgrid, have similar code structure in the selected tuning sections, so these tuning sections favor similar optimizations. These two observations explain why PEAK does not outperform whole-program tuning significantly in terms of tuned program performance.


A fair performance evaluation should use an input different from the training input. To this end, we use the ref dataset to evaluate the performance of the tuned version. (The train dataset remains the input to the tuning process.) The results are shown by the last two bars. Using the different input still achieves performance similar to the first two bars. So, our tuning scenario does find an optimal combination of the compiler optimizations, one that performs much better than the default optimization configuration.

On average, PEAK improves the performance by 12.0% and 12.1% with respect to the train dataset and the ref dataset, respectively, while whole-program tuning improves it by 11.9% and 11.7%. The results on SPARC II are similar: PEAK improves the performance by 4.1% and 3.7% with respect to train and ref, and so does whole-program tuning. So, PEAK achieves equal or better program performance than whole-program tuning.

6.3.4 SPEC CPU2000 INT Benchmarks. Integer benchmarks generally have irregular code structures with many conditional statements. Moreover, the use of pointers complicates rating-method analysis. As a result, only re-execution-based rating may be applicable. On the other hand, the selected tuning sections in some benchmarks use library functions with side effects, for example, the memory allocation functions. For these benchmarks, our current implementation fails to tune, because it cannot roll back the execution of the tuning section.

To illustrate the performance of PEAK on INT benchmarks, we experiment with five benchmarks, bzip2, crafty, gzip, mcf and parser, which do not use functions with side effects in their tuning sections.

Figure 20(a) shows the normalized tuning time. PEAK speeds up the tuning process for INT benchmarks as well. On average, the normalized tuning time is reduced from 73.48 to 13.37, and the absolute tuning time is reduced from 71.64 minutes to 14.67 minutes. bzip2 does not gain a high speedup, due to its small Nmin of 22. Beyond that, there are three major reasons why the speedup for INT benchmarks is not as high as that for FP benchmarks: (1) due to the irregularity of the code, INT benchmarks can only use re-execution-based rating, which introduces more overhead than the other two rating methods used for FP benchmarks; (2) the PEAK compiler spends more time analyzing INT benchmarks, which generally have more code, especially crafty and parser; and (3) unlike FP benchmarks, the tuning sections in INT benchmarks generally do not have similar code structure, so different tuning sections may favor different optimizations, which leads to more experimental versions, especially in parser.

Figure 20(b) shows the tuning time percentages of the six tuning steps. The most time-consuming components are performance tuning, rating method analysis, and tuning section selection. Due to the larger source code size, rating method analysis and code instrumentation spend more time on INT benchmarks than on FP benchmarks.

Figure 21 shows the tuned program performance. On average, PEAK improves the performance by 4.6% and 4.2% with respect to the train dataset and the ref dataset, while whole-program tuning improves it by 4.4% and 4.2%, respectively. PEAK achieves lower performance than whole-program tuning for gzip, because of its low program coverage of 84%. For parser, PEAK achieves much better performance than whole-program tuning.


(a) [Bar chart: normalized tuning time of whole-program tuning and the PEAK system for SPEC CPU2000 INT benchmarks on Pentium 4; lower is better. Whole: 58.88, 80.23, 78.68, 79.71, 72.32 (GeoMean 73.48); PEAK: 24.17, 16.18, 5.33, 4.33, 47.26 (GeoMean 13.37) for bzip2, crafty, gzip, mcf and parser. On average, PEAK gains a speedup of 5.5.]

(b) [Stacked bar chart: tuning time percentage of the six stages for SPEC CPU2000 INT benchmarks on Pentium 4. (TSS: tuning section selection, RMA: rating method analysis, CI: code instrumentation, DG: driver generation, PT: performance tuning, FVG: final version generation.) The most time-consuming steps are PT, RMA and TSS.]

Fig. 20. PEAK tuning time for INT benchmarks on Pentium 4.


7. RELATED WORK

Our work is motivated by the observation that, in some situations, compiler optimizations may degrade performance. Attempts to avoid such degradations at their source are important related work. Doing so is difficult, as compile-time performance models are intrinsically limited by the knowledge available about input data and the characteristics of the execution platform. Even more difficult is modeling optimization interactions, which can be subtle and unexpected. In addition to compile-time methods, some work has made use of runtime techniques.


[Figure 21 appears here: a bar chart of the relative performance improvement percentage (%) of Whole_Train, PEAK_Train, Whole_Ref and PEAK_Ref for bzip2, crafty, gzip, mcf, parser and the geometric mean.]

Fig. 21. Program performance improvement relative to the baseline under O3 for SPEC CPU2000 INT benchmarks on Pentium 4. Higher is better. All benchmarks use the train dataset as the input to the tuning process. Whole_Train (PEAK_Train) is the performance achieved by whole-program tuning (the PEAK system) under the train dataset. Whole_Ref and PEAK_Ref use the ref dataset to evaluate the tuned program performance, but still the train dataset for tuning.


One attempt to alleviate the problem underlying optimization orchestration is to improve the compiler optimizations themselves, so as to reduce their potential performance degradation. However, this approach has two problems: (1) compilers can hardly consider the interactions between all optimizations, given the large number of optimizations and the complexity of their interactions; and (2) the compile-time performance models used in the compiler are limited by the unavailability of program input data and by insufficient knowledge of the target architecture. Many projects try to improve optimization performance in these two respects, using either compile-time or runtime techniques.

Some approaches combine a few optimizations into one big pass that considers the interactions between those optimizations. For example, Wolf et al. [1996] develop an algorithm that applies fission, fusion, tiling, permutation and outer loop unrolling to optimize loop nests. Similarly, Click and Cooper [1995] show that combining constant propagation, global value numbering, and dead code elimination leads to more optimization opportunities. Nevertheless, it would be very difficult, or even impossible, to combine all optimizations into one pass that removes all possible performance degradation, because of the complexity of the compilation task and the subtle interactions between optimizations.


Some approaches postpone optimizations until runtime, when accurate knowledge about the target architecture and the program input can be supplied to the compiler:

(1) Similar to JVM JIT compilers [Adl-Tabatabai et al. 1998; Arnold et al. 2000, 2002; Cierniak et al. 2000], several projects aim at achieving both portability and performance. DCG [Engler and Proebsting 1994] proposes a retargetable dynamic code generation system; VCODE [Engler 1996] provides a machine-independent interface for native machine code generation; DAISY [Ebcioglu and Altman 1997] generates VLIW code on-the-fly to emulate existing architectures on a VLIW architecture.

(2) Several projects generate optimized code using runtime input via Runtime Specialization [Consel and Noel 1996]. RCG and Fabius [Lee and Leone 1996; Leone and Lee 1998] automatically translate ML programs into code that generates native binary code at runtime. Calpa [Mock et al. 2000] and DyC [Grant et al. 1999] form a staged compiler [Auslander et al. 1996]: Calpa annotates the program at compile-time; DyC creates a runtime compiler from the annotated program; this runtime compiler in turn generates the executable using runtime values.

(3) Some approaches re-optimize binaries based on runtime information. Dynamo [Bala et al. 2000; Duesterwald and Bala 2000] (for HP workstations) and DynamoRIO [Bruening et al. 2003] (for IA-32 machines) find and optimize hot traces of statically generated native binaries at runtime. Similarly, in Merten et al. [1999, 2001] and Nystrom et al. [2001], techniques are developed to detect hot spots and to generate new traces for runtime optimization via hardware support. ADAPT [Voss and Eigenmann 2000, 2001] compiles code intervals under different optimization configurations at runtime, and chooses the best version for each interval. Continuous Program Optimization [Kistler and Franz 2003] continually adjusts the storage layouts of dynamic data structures to enhance data locality and re-schedules instructions based on runtime profiling.

(4) Besides the above runtime compilation systems, some develop specific runtime optimization techniques. For example, a runtime data and iteration reordering technique is applied to improve data locality in Strout et al. [2003]. The LRPD test [Rauchwerger and Padua 1999; Dang et al. 2002] speculatively executes candidate loops in parallel form and re-executes the loops serially when a data dependence is detected at runtime. Rus et al. [2002] analyze memory references at both compile time and runtime to help parallelization. Runtime path profiling is proposed to help compiler optimizations in Nandy et al. [2003]. Dynamic Feedback [Diniz and Rinard 1997] produces several versions under different synchronization optimization policies and automatically chooses the best version by periodically sampling the performance of each version.

The above techniques defer optimizations to runtime; hence, they need to reduce or amortize the additional overhead introduced into program execution. In some systems, the baseline performance is much worse than that of the optimized code, so there is an opportunity to amortize the overhead through the performance improvement from the optimizations. For example, Fabius [Lee and Leone 1996; Leone and Lee 1998] improves ML code, JIT compilers [Adl-Tabatabai et al. 1998; Arnold et al. 2000, 2002; Cierniak et al. 2000] improve byte code, and Dynamo [Bala et al. 2000; Bruening et al. 2003] improves unoptimized library code. Some systems [Voss and Eigenmann 2000, 2001] off-load the compilation job to another processor. Others [Consel and Noel 1996; Lee and Leone 1996; Leone and Lee 1998; Mock et al. 2000; Grant et al. 1999; Auslander et al. 1996] generate a small code generator before the production run, so as to reduce the compilation overhead at runtime. Still others [Merten et al. 1999, 2001; Nystrom et al. 2001] use hardware to reduce the profiling overhead.

The major goal of this article is to find the best compiler optimization combination for scientific programs. These programs are mostly written in C or Fortran, and their baseline is already compiled efficiently under the default optimization setting. By contrast, a large number of optimization options, 38 GCC flags in our experiments, are involved in advanced optimizations. As the time for navigating this space is substantial, our work takes an offline optimization approach. Similar to profile-based optimization, this article tunes program performance under a training input, creating an optimized final version, which is then used at runtime. The tuning process can also be applied in between production runs.

This article takes the commonly used feedback-directed optimization approach. In this approach, many binary code versions, generated under different experimental optimization combinations, are evaluated. The performance of these versions is compared using either measured execution times or profile-based estimates. Iteratively, the orchestration algorithm uses this information to decide on the next experimental optimization combinations, until its convergence criteria are reached. In the end, the optimization orchestration algorithm yields the final optimized version for the entire program or for important code sections in the program.
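To make this loop concrete, the following minimal sketch drives an orchestration algorithm with measured execution times. It is an illustration only, not PEAK's implementation: the three flag names, the build command, and the propose callback are hypothetical stand-ins, whereas the actual experiments orchestrate all 38 GCC O3 options.

    import subprocess
    import time

    # Hypothetical subset of on/off options; the experiments in this
    # article use all 38 GCC O3 flags.
    FLAGS = ("-fgcse", "-funroll-loops", "-fschedule-insns")

    def build_and_time(combo):
        """Compile one experimental version (combo is a tuple of booleans,
        one per flag) and measure its execution time on a training input."""
        opts = [f if on else f.replace("-f", "-fno-", 1)
                for f, on in zip(FLAGS, combo)]
        subprocess.run(["gcc", "-O3", *opts, "-o", "app", "app.c"], check=True)
        start = time.perf_counter()
        subprocess.run(["./app"], check=True)
        return time.perf_counter() - start

    def orchestrate(propose):
        """Feedback-directed loop: 'propose' inspects all results so far
        and returns the next combination to try, or None once converged."""
        results = {}
        combo = propose(results)
        while combo is not None:
            results[combo] = build_and_time(combo)
            combo = propose(results)
        return min(results, key=results.get)  # best version found

Under such a driver, exhaustive search, random search [Haneda et al. 2005], and our Combined Elimination differ only in the propose policy that is plugged in.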

Many performance tuning systems use this feedback-directed approach. For example, ATLAS [Whaley and Dongarra 1998] generates numerous variants of matrix multiplication to search for the best one on a specific target machine. Similarly, Iterative Compilation [Kisuki et al. 1999] searches through the transformation space to find the best block sizes and unrolling factors. Meta Optimization [Stephenson et al. 2003] uses machine-learning techniques to adjust several compiler heuristics automatically.

The above three projects [Whaley and Dongarra 1998; Kisuki et al. 1999; Stephenson et al. 2003] focus on a relatively small number of optimization techniques, while this article tunes all optimizations that are controlled by compiler options (e.g., all 38 GCC O3 optimization options in our experiments).

Several projects target the same optimization orchestration problem as this article. Random search [Haneda et al. 2005] and machine learning [Agakov et al. 2006; Cavazos and O’Boyle 2006] have been introduced to solve this problem. The Optimization-Space Exploration (OSE) compiler [Triantafyllis et al. 2003] defines sets of optimization configurations and an exploration space, which is traversed to find the best configuration for the program, using compile-time performance estimates as feedback. Statistical Selection (SS) in Pinkers et al. [2004] uses orthogonal arrays [Hedayat et al. 1999] to compute the performance effects of the optimizations based on a statistical analysis of profile information, which, in turn, is used to find the best optimization combination. Compiler Optimization Selection [Chow and Wu 1999] applies fractional factorial design to optimize the selection of compiler options. Option Recommendation [Granston and Holler 2001] chooses PA-RISC compiler options intelligently for an application, using heuristics based on information from the user, the compiler, and the profiler. (Different from finding the best optimization combination, the Adaptive Optimizing Compiler [Cooper et al. 2002] uses a biased random search to discover the best order of optimizations.4 In Kulkarni et al. [2004, 2006], a genetic algorithm is used to solve this phase-ordering problem. Genetic algorithms are powerful but converge slowly; two approaches are proposed to make the genetic algorithm faster by avoiding unnecessary executions and reducing the number of generations.)

4Usually, a compiler does not have an option to specify the order of the optimizations. So, this article does not compare to Cooper et al. [2002], although the techniques developed in this article can be extended to search for the best order of optimizations.

Two major issues remain for optimization orchestration.

(1) How do we search through the optimization space quickly and effectively, given that the search space is huge due to the interactions between compiler optimizations? Complex approaches, such as the exhaustive search in ATLAS [Whaley and Dongarra 1998] and the automatic theorem prover in Denali [Joshi et al. 2002], would be prohibitively slow for a real application. Also, while heuristics for special environments can be found [Granston and Holler 2001], this article looks for a more general solution.

(2) How do we evaluate the optimized versions quickly and accurately? The most accurate method is to use real execution times to evaluate the performance of the experimental versions. However, given the large number of experimental versions and the execution time of a real application, this method takes excessively long. Attempts to develop performance models that reduce this time [Triantafyllis et al. 2003] come at the cost of significantly lower program performance. In Cooper et al. [2005], a “virtual execution” technique is introduced to derive instruction counts from branch profile information. This method is fast but not as accurate as using real execution times, because instruction counts cannot fully represent machine cycles; for example, it can hardly reflect the effectiveness of code scheduling. Moreover, it may fail when the control flow of a program is significantly changed. In Lau et al. [2006], a system similar to our PEAK is implemented for a JVM. It simply uses average execution time to evaluate performance, hoping that with a large number of samples the evaluation becomes accurate (a sketch of such a timing-based comparison follows this list).
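As an illustration of the trade-offs in issue (2), the sketch below rates two optimized versions of one code section from per-invocation timings gathered during a partial execution, stopping as soon as a simple confidence test separates the sample means. The 95% threshold and the sampling interface are assumptions made for illustration; this is not PEAK's actual CBR, MBR, or RBR method.

    import statistics

    def faster_version(times_a, times_b, z=1.96):
        """Compare two versions of a code section from lists of
        per-invocation execution times. Returns 'A' or 'B' once the
        difference of the sample means exceeds the ~95% confidence
        bound, or None while more samples are still needed."""
        if min(len(times_a), len(times_b)) < 2:
            return None
        mean_a = statistics.fmean(times_a)
        mean_b = statistics.fmean(times_b)
        # Standard error of the difference between the two sample means.
        se = (statistics.variance(times_a) / len(times_a) +
              statistics.variance(times_b) / len(times_b)) ** 0.5
        if abs(mean_a - mean_b) > z * se:
            return "A" if mean_a < mean_b else "B"
        return None  # not yet statistically separable; keep sampling

A tuning driver would append one timing per invocation of the section and stop as soon as a winner is returned, so only a small fraction of a full program run needs to be timed.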

The goal of our PEAK system is to tune the performance of the important code sections in a program, quickly and effectively, by orchestrating the optimizations controlled by compiler options. PEAK is an automated system targeting the above two issues. First, a fast and effective feedback-directed algorithm is designed to search through the optimization space while considering the interactions between the optimizations. The effectiveness of our algorithm stems from its ability to navigate the search space rapidly where optimizations do not interact, but to consider interactions where they exist. Second, PEAK uses fast and accurate performance evaluation methods based on a partial execution of the program.
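The following is a simplified sketch of the idea behind our search algorithm, Combined Elimination: start from a baseline with all options on and repeatedly switch off the single most harmful option, re-evaluating the surviving candidates against the updated baseline. The measure helper, which returns the execution time of a version built with exactly the given flags enabled, is an assumption (build_and_time above could serve); the full CE algorithm additionally batches eliminations within one iteration to save runs.

    def combined_elimination(flags, measure):
        """Simplified Combined Elimination: iteratively eliminate the
        option whose removal yields the largest measured improvement,
        until no remaining option degrades performance.
        'measure(enabled)' is an assumed helper returning the execution
        time of a version compiled with the flag set 'enabled'."""
        enabled = set(flags)
        base_time = measure(enabled)
        while True:
            # Relative improvement from turning each option off.
            rips = {f: (base_time - measure(enabled - {f})) / base_time
                    for f in enabled}
            harmful = {f: r for f, r in rips.items() if r > 0}
            if not harmful:
                return enabled  # no option hurts performance anymore
            worst = max(harmful, key=harmful.get)
            enabled.remove(worst)  # eliminate the most harmful option
            base_time = measure(enabled)

Because most options do not interact, the loop terminates after a few eliminations; interactions that do exist are still captured, since every improvement is re-measured against the current baseline.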

8. CONCLUSIONS

The techniques developed in this article have led to the creation of an automated performance tuning system called PEAK. PEAK searches for the best compiler optimization combinations for important tuning sections in a program, using a fast and effective optimization orchestration algorithm, Combined Elimination. Three fast and accurate rating methods (CBR, MBR, and RBR) are developed to evaluate the performance of the optimized versions based on a partial execution of the program. The PEAK compiler selects the important code sections for performance tuning, analyzes the source program to determine the applicable rating methods, and instruments the source code to construct a tuning driver. The tuning driver adopts a feedback-directed tuning approach: it continually runs the program, loads experimental versions generated under different optimization combinations, rates these versions based on their execution times, and explores the optimization space according to the generated ratings, until the best optimization combination is found for each tuning section. In addition to these functionalities, the PEAK runtime system provides the facilities to load executables dynamically at runtime.

PEAK achieves fast tuning speed and high tuned program performance. When our Combined Elimination (CE) algorithm is applied to the whole program, program performance improves over GCC O3 by 12% for the SPEC CPU2000 FP benchmarks and by 4% for the INT benchmarks. Using the SUN Forte compilers, CE improves performance by 10.8%, compared to the 5.6% achieved by manual tuning, for the FP benchmarks, and by 8.1%, compared to 4.1%, for the INT benchmarks. After applying the rating methods to individual tuning sections, PEAK reduces tuning time from 2 hours to 4.9 minutes for the FP benchmarks and from 1.2 hours to 13 minutes for selected INT benchmarks, while achieving equal or better program performance.

REFERENCES

ADL-TABATABAI, A.-R., CIERNIAK, M., LUEH, G.-Y., PARIKH, V. M., AND STICHNOTH, J. M. 1998. Fast, effective code generation in a just-in-time Java compiler. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation. ACM, New York, 280–290.

AGAKOV, F., BONILLA, E., CAVAZOS, J., FRANKE, B., FURSIN, G., O’BOYLE, M. F. P., THOMSON, J., TOUSSAINT, M., AND WILLIAMS, C. K. I. 2006. Using machine learning to focus iterative optimization. In CGO ’06: Proceedings of the International Symposium on Code Generation and Optimization (Washington, DC). IEEE Computer Society Press, Los Alamitos, CA, 295–305.

ARNOLD, M., FINK, S., GROVE, D., HIND, M., AND SWEENEY, P. F. 2000. Adaptive optimization in the Jalapeno JVM. In Proceedings of the 15th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications. ACM, New York, 47–65.


ARNOLD, M., HIND, M., AND RYDER, B. G. 2002. Online feedback-directed optimization of Java. In Proceedings of the 17th ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications. ACM, New York, 111–129.

AUSLANDER, J., PHILIPOSE, M., CHAMBERS, C., EGGERS, S. J., AND BERSHAD, B. N. 1996. Fast, effective dynamic compilation. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, 149–159.

BALA, V., DUESTERWALD, E., AND BANERJIA, S. 2000. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation. ACM, New York, 1–12.

BRUENING, D., GARNETT, T., AND AMARASINGHE, S. 2003. An infrastructure for adaptive dynamic optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO ’03). IEEE Computer Society Press, Los Alamitos, CA.

CAVAZOS, J. AND O’BOYLE, M. F. P. 2006. Method-specific dynamic compilation using logistic regression. In OOPSLA ’06: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. ACM, New York, 229–240.

CHOW, K. AND WU, Y. 1999. Feedback-directed selection and characterization of compiler optimizations. In Proceedings of the 2nd Workshop on Feedback-Directed Optimizations, Israel.

CIERNIAK, M., LUEH, G.-Y., AND STICHNOTH, J. M. 2000. Practicing JUDO: Java under dynamic optimizations. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation. ACM, New York, 13–26.

CLICK, C. AND COOPER, K. D. 1995. Combining analyses, combining optimizations. ACM Trans. Program. Lang. Syst. 17, 2, 181–196.

CONSEL, C. AND NOEL, F. 1996. A general approach for run-time specialization and its application to C. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, New York, 145–156.

COOPER, K. D., GROSUL, A., HARVEY, T. J., REEVES, S., SUBRAMANIAN, D., TORCZON, L., AND WATERMAN, T. 2005. ACME: Adaptive compilation made efficient. In LCTES ’05: Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM, New York, 69–77.

COOPER, K. D., SUBRAMANIAN, D., AND TORCZON, L. 2002. Adaptive optimizing compilers for the 21st century. J. Supercomput. 23, 1, 7–22.

DANG, F., YU, H., AND RAUCHWERGER, L. 2002. The R-LRPD test: Speculative parallelization of partially parallel loops. In Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS ’02). IEEE Computer Society Press, Los Alamitos, CA.

DINIZ, P. C. AND RINARD, M. C. 1997. Dynamic feedback: An effective technique for adaptive computing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, 71–84.

DUESTERWALD, E. AND BALA, V. 2000. Software profiling for hot path prediction: Less is more. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 202–211.

EBCIOGLU, K. AND ALTMAN, E. R. 1997. DAISY: Dynamic compilation for 100% architectural compatibility. In Proceedings of the Annual International Symposium on Computer Architecture. 26–37.

ENGLER, D. R. 1996. VCODE: A retargetable, extensible, very fast dynamic code generation system. SIGPLAN Notices 31, 5, 160–170.

ENGLER, D. R. AND PROEBSTING, T. A. 1994. DCG: An efficient, retargetable dynamic code generation system. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 263–272.

GNU. 2005. GCC online documentation. http://gcc.gnu.org/onlinedocs/.

GRAHAM, S. L., KESSLER, P. B., AND MCKUSICK, M. K. 1982. GPROF: A call graph execution profiler. In Proceedings of the SIGPLAN Symposium on Compiler Construction. ACM, New York, 120–126.

GRANSTON, E. D. AND HOLLER, A. 2001. Automatic recommendation of compiler options. In Proceedings of the 4th Workshop on Feedback-Directed and Dynamic Optimization (FDDO-4).

GRANT, B., PHILIPOSE, M., MOCK, M., CHAMBERS, C., AND EGGERS, S. J. 1999. An evaluation of staged run-time optimizations in DyC. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation. ACM, New York, 293–304.


HANEDA, M., KNIJNENBURG, P. M. W., AND WIJSHOFF, H. A. G. 2005. Generating new general compiler optimization settings. In ICS ’05: Proceedings of the 19th Annual International Conference on Supercomputing. ACM, New York, 161–168.

HEDAYAT, A., SLOANE, N., AND STUFKEN, J. 1999. Orthogonal Arrays: Theory and Applications. Springer-Verlag, New York.

JOSHI, R., NELSON, G., AND RANDALL, K. 2002. Denali: A goal-directed superoptimizer. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation. ACM, New York, 304–314.

KARP, R. 1972. Reducibility among combinatorial problems. In Proceedings of the Symposium on the Complexity of Computer Computations. Plenum Press, New York, 85–103.

KISTLER, T. AND FRANZ, M. 2003. Continuous program optimization: A case study. ACM Trans. Program. Lang. Syst. 25, 4, 500–548.

KISUKI, T., KNIJNENBURG, P. M. W., AND O’BOYLE, M. F. P. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings of the IEEE PACT. IEEE Computer Society Press, Los Alamitos, CA, 237–248.

KISUKI, T., KNIJNENBURG, P. M. W., O’BOYLE, M. F. P., BODIN, F., AND WIJSHOFF, H. A. G. 1999. A feasibility study in iterative compilation. In Proceedings of the International Symposium on High Performance Computing (ISHPC ’99). 121–132.

KULKARNI, P., HINES, S., HISER, J., WHALLEY, D., DAVIDSON, J., AND JONES, D. 2004. Fast searches for effective optimization phase sequences. In PLDI ’04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation. ACM, New York, 171–182.

KULKARNI, P. A., WHALLEY, D. B., TYSON, G. S., AND DAVIDSON, J. W. 2006. Exhaustive optimization phase order space exploration. In CGO ’06: Proceedings of the International Symposium on Code Generation and Optimization (Washington, DC). IEEE Computer Society Press, Los Alamitos, CA, 306–318.

LAU, J., ARNOLD, M., HIND, M., AND CALDER, B. 2006. Online performance auditing: Using hot optimizations without getting burned. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, New York.

LEE, P. AND LEONE, M. 1996. Optimizing ML with run-time code generation. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, 137–148.

LEONE, M. AND LEE, P. 1998. Dynamic specialization in the Fabius system. ACM Comput. Surv. 30, 3es, 23.

MERTEN, M. C., TRICK, A. R., BARNES, R. D., NYSTROM, E. M., GEORGE, C. N., GYLLENHAAL, J. C., AND HWU, W. W. 2001. An architectural framework for runtime optimization. IEEE Trans. Comput. 50, 6, 567–589.

MERTEN, M. C., TRICK, A. R., GEORGE, C. N., GYLLENHAAL, J. C., AND HWU, W. W. 1999. A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization. In Proceedings of the 26th Annual International Symposium on Computer Architecture. IEEE Computer Society Press, Los Alamitos, CA, 136–147.

MOCK, M., CHAMBERS, C., AND EGGERS, S. J. 2000. CALPA: A tool for automating selective dynamic compilation. In Proceedings of the International Symposium on Microarchitecture. 291–302.

NANDY, S., GAO, X., AND FERRANTE, J. 2003. TFP: Time-sensitive, flow-specific profiling at runtime. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing (LCPC).

NYSTROM, E. M., BARNES, R. D., MERTEN, M. C., AND HWU, W. W. 2001. Code reordering and speculation support for dynamic optimization systems. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.

PAN, Z. AND EIGENMANN, R. 2004a. Compiler optimization orchestration for peak performance. Tech. Rep. TR-ECE-04-01, School of Electrical and Computer Engineering, Purdue University.

PAN, Z. AND EIGENMANN, R. 2004b. Rating compiler optimizations for automatic performance tuning. In SC2004: Proceedings of the High Performance Computing, Networking and Storage Conference. (10 pages).

PAN, Z. AND EIGENMANN, R. 2006. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO). (12 pages).


PINKERS, R. P. J., KNIJNENBURG, P. M. W., HANEDA, M., AND WIJSHOFF, H. A. G. 2004. Statistical selection of compiler options. In Proceedings of the IEEE Computer Society’s 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS ’04) (Volendam, The Netherlands). IEEE Computer Society Press, Los Alamitos, CA, 494–501.

RAUCHWERGER, L. AND PADUA, D. A. 1999. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Trans. Parallel Distrib. Syst. 10, 2 (Feb.), 160–180.

RUS, S., RAUCHWERGER, L., AND HOEFLINGER, J. 2002. Hybrid analysis: Static & dynamic memory reference analysis. In Proceedings of the 16th International Conference on Supercomputing. ACM, New York, 274–284.

SPEC. 2000. SPEC CPU2000 results. http://www.spec.org/cpu2000/results.

STEPHENSON, M., AMARASINGHE, S., MARTIN, M., AND O’REILLY, U.-M. 2003. Meta optimization: Improving compiler heuristics with machine learning. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation. ACM, New York, 77–90.

STROUT, M. M., CARTER, L., AND FERRANTE, J. 2003. Compile-time composition of run-time data and iteration reorderings. In Proceedings of the 2003 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, New York.

SUN. 2000. Forte C 6/Sun WorkShop 6 Compilers C User’s Guide. http://docs.sun.com/app/docs/doc/806-3567.

TRIANTAFYLLIS, S., VACHHARAJANI, M., VACHHARAJANI, N., AND AUGUST, D. I. 2003. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization. 204–215.

VOSS, M. AND EIGENMANN, R. 2000. ADAPT: Automated de-coupled adaptive program transformation. In Proceedings of the International Conference on Parallel Processing. IEEE Computer Society Press, Los Alamitos, CA, 163–170.

VOSS, M. J. AND EIGENMANN, R. 2001. High-level adaptive program optimization with ADAPT. In Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 93–102.

WHALEY, R. C. AND DONGARRA, J. 1998. Automatically tuned linear algebra software. In Proceedings of SuperComputing 1998: High Performance Networking and Computing.

WOLF, M. E., MAYDAN, D. E., AND CHEN, D.-K. 1996. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture. ACM, New York, 274–286.

Received May 2006; revised November 2006; accepted May 2007
