
Int J Parallel Prog (2011) 39:296–327
DOI 10.1007/s10766-010-0161-2

Milepost GCC: Machine Learning Enabled Self-tuning Compiler

Grigori Fursin · Yuriy Kashnikov · Abdul Wahid Memon · Zbigniew Chamski · Olivier Temam · Mircea Namolaru · Elad Yom-Tov · Bilha Mendelson · Ayal Zaks · Eric Courtois · Francois Bodin · Phil Barnard · Elton Ashton · Edwin Bonilla · John Thomson · Christopher K. I. Williams · Michael O'Boyle

Received: 12 February 2009 / Accepted: 30 November 2010 / Published online: 7 January 2011
© Springer Science+Business Media, LLC 2011

Abstract Tuning compiler optimizations for rapidly evolving hardware makes porting and extending an optimizing compiler for each new platform extremely challenging. Iterative optimization is a popular approach to adapting programs to a new architecture automatically using feedback-directed compilation. However, the large number of evaluations required for each program has prevented iterative compilation from widespread take-up in production compilers. Machine learning has been proposed to tune optimizations across programs systematically but is currently limited to a few transformations, long training phases and critically lacks publicly released, stable tools. Our approach is to develop a modular, extensible, self-tuning optimization infrastructure to automatically learn the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior and optimizations. In this paper we describe Milepost GCC, the first publicly-available

G. Fursin (B) · Z. Chamski · O. Temam
INRIA Saclay, Parc Club Orsay Universite, 3 rue Jean Rostand, 91893 Orsay, France
e-mail: [email protected]

G. Fursin · Y. Kashnikov · A. W. Memon
University of Versailles Saint Quentin en Yvelines, 45 avenue des Etats Unis, 78000 Versailles, France

M. Namolaru · E. Yom-Tov · B. Mendelson · A. Zaks
IBM Research Lab, Haifa University Campus, Mount Carmel, 31905 Haifa, Israel

E. Courtois · F. Bodin
CAPS Entreprise, 4 Allée Marie Berhaut, 35000 Rennes, France

P. Barnard · E. Ashton
ARC International, St. Albans AL1 5HE, UK

E. Bonilla · J. Thomson · C. K. I. Williams · M. O'Boyle
University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, UK


open-source machine learning-based compiler. It consists of an Interactive Compilation Interface (ICI) and plugins to extract program features and exchange optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size and compilation time of a new program on a given architecture. Part of the MILEPOST technology together with the low-level ICI-inspired plugin framework is now included in the mainline GCC. We developed machine learning plugins based on probabilistic and transductive approaches to predict good combinations of optimizations. Our preliminary experimental results show that it is possible to automatically reduce the execution time of individual MiBench programs, some by more than a factor of 2, while also improving compilation time and code size. On average we are able to reduce the execution time of the MiBench benchmark suite by 11% for the ARC reconfigurable processor. We also present a realistic multi-objective optimization scenario for the Berkeley DB library using Milepost GCC and improve execution time by approximately 17%, while reducing compilation time and code size by 12% and 7% respectively on an Intel Xeon processor.

Keywords Machine learning compiler · Self-tuning compiler · Adaptive compiler · Automatic performance tuning · Machine learning · Program characterization · Program features · Collective optimization · Continuous optimization · Multi-objective optimization · Empirical performance tuning · Optimization repository · Iterative compilation · Feedback-directed compilation · Adaptive compilation · Optimization prediction · Portable optimization

    1 Introduction

Designers of new processor architectures attempt to bring higher performance and lower power across a wide range of programs while keeping time to market as short as possible. However, static compilers fail to deliver satisfactory levels of performance as they cannot keep pace with the rate of change in hardware evolution. Fixed heuristics based on simplistic hardware models and a lack of run-time information mean that much manual retuning of the compiler is needed for each new hardware generation. Typical systems now have multiple heterogeneous reconfigurable cores making such manual compiler tuning increasingly infeasible.

The difficulty of achieving portable performance has led to empirical iterative compilation for statically compiled programs [7,64,17,44,16,46,18,77,27,67,37,38,26,29,15,30], applying automatic compiler tuning based on feedback-directed compilation. Here the static optimization model of a compiler is replaced by an iterative search of the optimization space to empirically find the most profitable solutions that improve execution time, compilation time, code size, power and other metrics. Needing little or no knowledge of the current platform, this approach can adapt programs to any given architecture. This approach is currently used in library generators and adaptive tools [84,56,71,68,1,25]. However it is generally limited to searching for combinations of global compiler optimization flags and tweaking a few fine-grain transformations within relatively narrow search spaces. The main barrier to its wider


use is the currently excessive compilation and execution time needed in order to optimize each program. This prevents a wider adoption of iterative compilation for general-purpose compilers.

Our approach to solve this problem is to use machine learning, which has the potential of reusing knowledge across iterative compilation runs, gaining the benefits of iterative compilation while reducing the number of executions needed. The objective of the Milepost project [58] is to develop compiler technology that can automatically learn how to best optimize programs for configurable heterogeneous processors based on the correlation between program features, run-time behavior and optimizations. It also aims to dramatically reduce the time to market of configurable or frequently evolving systems. Rather than developing a specialized compiler by hand for each configuration, Milepost aims to produce optimizing compilers automatically.

A key goal of the project is to make machine learning based compilation a realistic technology for general-purpose production compilers. Current approaches [60,12,72,2,11] are highly preliminary, limited to global compiler flags or a few transformations considered in isolation. GCC was selected as the compiler infrastructure for Milepost as it is currently the most stable and robust open-source compiler. GCC is currently the only production compiler that supports more than 30 different architectures and has multiple aggressive optimizations making it a natural vehicle for our research. Each new version usually features new transformations and it may take months to adjust each optimization heuristic, if only to prevent performance degradation in any of the supported architectures. This further emphasizes the need for an automated approach.

We use the Interactive Compilation Interface (ICI) [41,40] that separates the optimization process from a particular production compiler. ICI is a plugin system that acts as a middleware interface between production compilers such as GCC and user-definable research plugins. ICI allowed us to add a program feature extraction module and to select arbitrary optimization passes in GCC. In the future, the compiler-independent ICI should help transfer Milepost technology to other compilers. We connected Milepost GCC to a public collective optimization database at cTuning.org [14,28,32]. This provides a wealth of continuously updated training data from multiple users and environments.

In this paper we present experimental results showing that it is possible to improve the performance of the well-known MiBench [36] benchmark suite automatically using iterative compilation and machine learning on several platforms including x86 (Intel and AMD) and the ARC configurable core family. Using Milepost GCC, after a few weeks of training, we were able to learn a model that automatically improves the execution time of some individual MiBench programs by a factor of more than 2 while improving the overall MiBench suite by 11% on the reconfigurable ARC architecture, often without sacrificing code size or compilation time. Furthermore, our approach supports general multi-objective optimization where a user can choose to minimize not only execution time but also code size and compilation time.

This paper is organized as follows: this section provided motivation for our research and developments. It is followed by Sect. 2 describing the experimental setup used throughout the article. Section 3 describes how iterative compilation can deliver multi-objective optimization. Section 4 describes the overall Milepost collaborative infrastructure while Sect. 5 describes our machine learning techniques used to predict


good optimizations based on program features and provides experimental evaluation. It is followed by the sections on related and future work.

    2 Experimental Setup

The tools, benchmarks, architectures and environment used throughout the article are briefly described in this section.

    2.1 Compiler

We considered several compilers for our research and development including Open64 [65], LLVM [53], ROSE [70], Phoenix [69] and GCC [34]. GCC was selected as it is a mature and popular open-source optimizing compiler that supports many languages, has a large community, is competitive with the best commercial compilers, and features a large number of program transformation techniques including advanced optimizations such as the polyhedral transformation framework (GRAPHITE) [78]. Furthermore, GCC is the only extensible open-source optimizing compiler that supports more than 30 processor families. However, our developed techniques are not compiler dependent. We selected the latest GCC 4.4.x as the base for our machine-learning enabled self-tuning compiler.

    2.2 Optimizations

There are approximately 100 flags available for tuning in the most recent version of GCC, all of which are considered by our framework. However, it is impossible to validate all possible combinations of optimizations due to their number. Since GCC was not originally designed for iterative compilation it is not always possible to explore the entire optimization space by simply combining multiple compiler optimization flags, because some of them are initiated only with a given global GCC optimization level (-Os, -O1, -O2, -O3). We overcome this issue by selecting a global optimization level -O1 .. -O3 first and then either turning on a particular optimization through its corresponding -f flag or turning it off using the corresponding -fno- flag. In some cases, certain combinations of compiler flags or passes cause the compiler to crash or produce incorrect program execution. We reduce the probability of such cases by comparing outputs of programs with reference outputs.
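As an illustration of this selection scheme, the sketch below (Python, not part of the Milepost tool chain) assembles one such combination: a global -O level plus an explicit on/off decision for every flag under consideration. The flag subset, the benchmark file name and the 50% toggle probability (used later in Sect. 3) are placeholders.

```python
import random

# Illustrative subset of the ~100 GCC flags considered for tuning;
# the real framework works with the full set available in GCC 4.4.x.
FLAGS = ["unroll-all-loops", "tree-vectorize", "schedule-insns",
         "inline-functions", "gcse", "peel-loops"]

def random_combination(p_on=0.5):
    """Pick a global optimization level, then explicitly enable (-f...)
    or disable (-fno-...) every flag under consideration."""
    opts = [random.choice(["-O1", "-O2", "-O3"])]
    for flag in FLAGS:
        prefix = "-f" if random.random() < p_on else "-fno-"
        opts.append(prefix + flag)
    return opts

# Example: build a gcc command line for one randomly drawn combination.
# 'benchmark.c' stands in for one of the MiBench sources.
combination = random_combination()
command = ["gcc"] + combination + ["benchmark.c", "-o", "benchmark.bin"]
print(" ".join(command))
```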

    2.3 Platforms

    We selected two general-purpose and one embedded processor for evaluation:

AMD: a cluster of 16 AMD Opteron 2218, 2.6 GHz, 4 GB main memory, 2 MB L2 cache, running Debian Linux Sid x64 with kernel 2.6.28.1 (provided by GRID5000 [35])


Intel: a cluster of 16 Intel Xeon EM64T, 3 GHz, 2 GB main memory, 1 MB L2 cache, running Debian Linux Sid x64 with kernel 2.6.28.1 (provided by GRID5000)

ARC: FPGA implementation of the ARC 725D reconfigurable processor, 200 MHz, 32 KB L1 cache, running Linux ARC with kernel 2.4.29

We specifically selected platforms that have been in the market for some time but are not outdated to allow a fair comparison of our optimization techniques with default compiler optimization heuristics that had been reasonably hand-tuned.

    2.4 Benchmarks and Experiments

We use both embedded and server processors so we selected the MiBench/cBench [36,29,28] benchmark suite for evaluation, covering a broad range of applications from simple embedded functions to larger desktop/server programs. Most of the benchmarks have been rewritten to be easily portable to different architectures; we use dataset 1 in all cases. We encountered problems while compiling 4 tiff programs on the ARC platform and hence used them only on the AMD and Intel platforms.

We use OProfile [66] with hardware counter support to perform non-intrusive function-level profiling during each run. This tool may introduce some overhead, so we execute each compiled program three times and average the execution and compilation time. In the future, we plan to use more statistically rigorous approaches [75,33]. For this study, we selected the most time consuming function from each benchmark for further analysis and optimization. If a program has several hot functions depending on a dataset, we analyze and optimize them one by one and report them separately. Analyzing the effects of interactions between multiple functions on optimization is left for future work.

    2.5 Collective Optimization Database

All experimental results were recorded in the public Collective Optimization Database [14,28,32] at cTuning.org, allowing independent analysis of our results.

    3 Motivation

This section shows that tuning optimization heuristics of an existing real-world compiler for multiple objectives such as execution time, code size and compilation time is a non-trivial task. We demonstrate that iterative compilation can effectively solve this problem, however often with excessive search costs that motivate the use of machine learning to mitigate the need for per-program iterative compilation and learn optimizations across programs based on their features.


    3.1 Multi-Objective Empirical Iterative Optimization

Iterative compilation is a popular method to explore different optimizations by executing a given program on a given architecture and finding good solutions to improve program execution time and other characteristics based on empirical search.

We selected 88 program transformations of GCC known to influence performance, including inlining, unrolling, scheduling, register allocation, and constant propagation. We selected 1000 combinations of optimizations using a random search strategy with 50% probability to select each flag and either turn it on or off. We use this strategy to allow uniform unbiased exploration of unknown optimization search spaces. In order to validate the resulting diversity of program transformations, we checked that no two combinations of optimizations generated the same binary for any of the benchmarks using the MD5 checksum of the assembler code obtained through the objdump -d command. Occasionally, random selection of flags in GCC may result in invalid code. In order to avoid such situations, we validated all generated combinations of optimizations by comparing the outputs of all benchmarks used in our study with the recorded outputs during reference runs when compiled with the -O3 global optimization level.
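A minimal sketch of such validation is shown below, assuming a hypothetical mapping from flag strings to compiled binaries: each binary is fingerprinted with the MD5 checksum of its objdump -d listing to detect combinations that generate identical code, and the program output is compared against the recorded -O3 reference output. This is illustrative and not the actual CCC implementation.

```python
import hashlib
import subprocess

def binary_fingerprint(binary):
    """MD5 checksum of the disassembled code, obtained through objdump -d."""
    asm = subprocess.run(["objdump", "-d", binary],
                         capture_output=True, text=True, check=True).stdout
    return hashlib.md5(asm.encode()).hexdigest()

def output_matches_reference(binary, args, reference_output):
    """Run the compiled benchmark and compare its output with the output
    recorded during the reference (-O3) run to filter out miscompilations."""
    result = subprocess.run([binary] + args, capture_output=True, text=True)
    return result.returncode == 0 and result.stdout == reference_output

# Hypothetical bookkeeping: flag string -> path of the binary it produced.
evaluated = {"-O2 -fno-ivopts": "./bench_O2_noivopts"}
reference = "expected output\n"   # recorded from the -O3 reference run

seen_fingerprints = set()
valid_combinations = []
for flags, binary in evaluated.items():
    fp = binary_fingerprint(binary)
    if fp in seen_fingerprints:
        continue                  # identical code was already evaluated
    seen_fingerprints.add(fp)
    if output_matches_reference(binary, [], reference):
        valid_combinations.append(flags)
```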

Figure 1 shows the best execution time speedup achieved for each benchmark over the highest GCC optimization level (-O3) after 1000 iterations across the 3 selected architectures. It confirms results from previous research on iterative compilation and demonstrates that it is possible to outperform GCC's highest default optimization level for most programs using random iterative search for good combinations of optimizations. Several benchmarks achieve more than 2 times speedup while on average we reached speedups of 1.33 and 1.4 for Intel and AMD respectively and a smaller speedup of 1.15 for ARC, likely due to its simpler architecture and lower sensitivity to program optimizations.

However, the task of an optimizing compiler is not only to improve execution time but also to balance code size and compilation time across a wide range of programs and architectures. The violin graphs¹ in Fig. 2 show high variation of execution time speedups, code size improvements and compilation time speedups during iterative compilation across all benchmarks on the Intel platform. Multi-objective optimization in such cases depends on end-user usage scenarios: improving both execution time and code size is often required for embedded applications, improving both compilation and execution time is important for data centers and real-time systems, while improving only execution time is common for desktops and supercomputers.

As an example, in Fig. 3, we present the execution time speedups vs code size improvements and vs compilation time for susan_c on the AMD platform. Naturally, depending on the optimization scenario, users are interested in optimization cases on the frontier of the program optimization area. Circles on these graphs show the 2D frontier that improves at least two metrics, while squares show optimization cases where the speedup is also achieved on the third optimization metric and is more than some threshold (compilation time speedup is more than 2 in the first graph and code size

1 Violin graphs are similar to box graphs, showing the probability density in addition to min, max and interquartile range.


Fig. 1 Maximum execution time speedups over the highest GCC optimization level (-O3) using iterative compilation with uniform random distribution after 1,000 iterations on 3 selected architectures

improvement is more than 1.2 in the second graph). These graphs demonstrate that for this selected benchmark and architecture there are relatively many optimization cases that improve execution time, code size and compilation time simultaneously. This is because many flags turned on for the default optimization level (-O3) do not influence this program or even degrade performance and take considerable compilation time.
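The following sketch illustrates how such frontier points could be selected from recorded optimization cases; the data layout and the example numbers are hypothetical, not measurements from our experiments. Circles correspond to cases not dominated in the (execution time speedup, code size improvement) plane, squares to those that additionally exceed a threshold on the third metric.

```python
def frontier_2d(cases):
    """Return cases that are not dominated in (exec_speedup, size_improvement):
    a case is dropped if some other case is at least as good in both metrics
    and strictly better in one."""
    kept = []
    for c in cases:
        dominated = any(
            o["exec"] >= c["exec"] and o["size"] >= c["size"]
            and (o["exec"] > c["exec"] or o["size"] > c["size"])
            for o in cases)
        if not dominated:
            kept.append(c)
    return kept

# Hypothetical optimization cases: execution time speedup, code size
# improvement and compilation time speedup for one benchmark.
cases = [
    {"flags": "-O3 -funroll-all-loops", "exec": 1.20, "size": 0.95, "comp": 1.1},
    {"flags": "-O2 -fno-ivopts",        "exec": 1.05, "size": 1.10, "comp": 2.3},
    {"flags": "-O1 -fschedule-insns",   "exec": 0.98, "size": 1.25, "comp": 3.0},
]

circles = frontier_2d(cases)                     # 2D frontier (circles in Fig. 3)
squares = [c for c in circles if c["comp"] > 2]  # also fast to compile (squares)
```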

Figure 4 summarizes code size improvements and compilation time speedups achievable on the Intel platform across evaluated programs with execution time speedups within 95% of the maximum available during iterative compilation. We can observe that in some cases we can improve execution time, code size and compilation time at the same time, for example for susan_c and dijkstra. In some other cases, without degrading execution time relative to the default optimization level (-O3), we can improve compilation time considerably (more than 1.7 times) and code size, for example for jpeg_c and patricia. Throughout the rest of the article, we will consider improving execution time of primary importance, then code size and compilation time. However, our self-tuning compiler can work with other arbitrary optimization scenarios. Users may provide their own plugins to choose optimal solutions, for example using a Pareto distribution as shown in [37,38].

The combinations of flags corresponding to Fig. 4 across all programs and architectures are presented² in Table 1. Some combinations can reduce compilation time by 70% which can be critical when compiling large-scale applications or for cloud computing services where a quick response time is critical. The diversity of compiler

2 The flags that do not influence execution time, code size or compilation time have been iteratively and automatically removed from the original combination of random optimizations using the CCC framework to simplify the analysis of the results.


Fig. 2 Distribution of execution time speedups, code size improvements and compilation time speedups on the Intel platform during iterative compilation (1,000 iterations)

optimizations involved demonstrates that the compiler optimization space is not trivial and the compiler's best optimization heuristic (-O3) is far from optimal. All combinations of flags found per program and architecture during this research are available


Fig. 3 Distribution of execution time speedups, code size improvements and compilation time speedups for benchmark susan_c on the AMD platform during iterative compilation. Depending on optimization scenarios, good optimization cases are depicted with circles on the 2D optimization area frontier and with squares where the third metric is more than some threshold (compilation time speedup > 2 or code size improvement > 1.2)

on-line in the Collective Optimization Database [14] to allow end-users to optimize their programs or enable further collaborative research.

Finally, Fig. 5 shows that it may take on average 70 iterations before reaching 95% of the speedup available after 1000 iterations (averaged over 10 repetitions), and that this number is heavily dependent on the programs and architectures. Such a large number of iterations is needed due to an increasing number of aggressive optimizations available in the compiler where multiple combinations of optimizations can considerably increase or decrease performance, change code size and compilation time.
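One way to estimate such a number of iterations, given the recorded speedups of one benchmark, is sketched below; the data is hypothetical and the averaging over random orderings is an assumption rather than the exact procedure used in the article.

```python
import random

def iterations_to_reach(speedups, fraction=0.95, repetitions=10):
    """Average number of evaluations before the running best speedup reaches
    a given fraction of the best speedup found over the whole search,
    averaged over several random orderings of the evaluated combinations."""
    target = fraction * max(speedups)
    total = 0
    for _ in range(repetitions):
        order = random.sample(speedups, len(speedups))  # random evaluation order
        best = 0.0
        for i, s in enumerate(order, start=1):
            best = max(best, s)
            if best >= target:
                total += i
                break
    return total / repetitions

# Hypothetical speedups recorded for 1000 random combinations of one benchmark
speedups = [random.uniform(0.5, 1.6) for _ in range(1000)]
print(iterations_to_reach(speedups))
```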

The experimental results of this section suggest that iterative compilation can effectively generalize and automate the program optimization process but can be too time-consuming.


Fig. 4 Code size improvements and compilation time speedups for optimization cases with execution time speedups within 95% of the maximum available on the Intel platform (as found by iterative compilation)


Fig. 5 Number of iterations needed to obtain 95% of the available speedup using iterative compilation with uniform random distribution

Hence it is important to speed up iterative compilation. In the next section, we present the Milepost framework which speeds up program optimization through machine learning.


Table 1 Best found combinations of Milepost GCC flags to improve execution time, code size and compilation time after iterative compilation (1000 iterations) across all evaluated benchmarks and platforms

-O1 -fcse-follow-jumps -fno-tree-ter -ftree-vectorize
-O1 -fno-cprop-registers -fno-dce -fno-move-loop-invariants -frename-registers -fno-tree-copy-prop -fno-tree-copyrename
-O1 -freorder-blocks -fschedule-insns -fno-tree-ccp -fno-tree-dominator-opts
-O2
-O2 -falign-loops -fno-cse-follow-jumps -fno-dce -fno-gcse-lm -fno-inline-functions-called-once -fno-schedule-insns2 -fno-tree-ccp -fno-tree-copyrename -funroll-all-loops
-O2 -finline-functions -fno-omit-frame-pointer -fschedule-insns -fno-split-ivs-in-unroller -fno-tree-sink -funroll-all-loops
-O2 -fno-align-jumps -fno-early-inlining -fno-gcse -fno-inline-functions-called-once -fno-move-loop-invariants -fschedule-insns -fno-tree-copyrename -fno-tree-loop-optimize -fno-tree-ter -fno-tree-vrp
-O2 -fno-caller-saves -fno-guess-branch-probability -fno-ira-share-spill-slots -fno-tree-reassoc -funroll-all-loops -fno-web
-O2 -fno-caller-saves -fno-ivopts -fno-reorder-blocks -fno-strict-overflow -funroll-all-loops
-O2 -fno-cprop-registers -fno-move-loop-invariants -fno-omit-frame-pointer -fpeel-loops
-O2 -fno-dce -fno-guess-branch-probability -fno-strict-overflow -fno-tree-dominator-opts -fno-tree-loop-optimize -fno-tree-reassoc -fno-tree-sink
-O2 -fno-ivopts -fpeel-loops -fschedule-insns
-O2 -fno-tree-loop-im -fno-tree-pre
-O3 -falign-loops -fno-caller-saves -fno-cprop-registers -fno-if-conversion -fno-ivopts -freorder-blocks-and-partition -fno-tree-pre -funroll-all-loops
-O3 -falign-loops -fno-cprop-registers -fno-if-conversion -fno-peephole2 -funroll-all-loops
-O3 -falign-loops -fno-delete-null-pointer-checks -fno-gcse-lm -fira-coalesce -floop-interchange -fsched2-use-superblocks -fno-tree-pre -fno-tree-vectorize -funroll-all-loops -funsafe-loop-optimizations -fno-web
-O3 -fno-gcse -floop-strip-mine -fno-move-loop-invariants -fno-predictive-commoning -ftracer
-O3 -fno-inline-functions-called-once -fno-regmove -frename-registers -fno-tree-copyrename
-O3 -fno-inline-functions -fno-move-loop-invariants

    4 Milepost Optimization Approach and Framework

As shown in the previous section, iterative compilation can considerably outperform existing compilers but at the cost of excessive recompilation and program execution during optimization search space exploration. Multiple techniques have been proposed to speed up this process. For example, the ACOVEA tool [1] utilizes genetic algorithms; hill-climbing search [27] and run-time function-level per-phase optimization evaluation [30] have been used, as well as the use of a Pareto distribution [37,38] to find multi-objective solutions. However, these approaches start their exploration of optimizations for a new program from scratch and do not reuse any prior optimization knowledge across different programs and architectures.

The Milepost project takes an orthogonal approach based on the observation that similar programs may exhibit similar behavior and require similar optimizations, so it is possible to correlate program features and optimizations, thereby predicting good transformations for unseen programs based on previous optimization experience


[60,12,72,2,39,11,32]. In the current version of Milepost GCC we use static program features (such as the number of instructions in a method, number of branches, etc.) to characterize programs and build predictive models. Naturally, since static features may not be enough to capture run-time program behavior, we plan to add plugins to improve program and optimization correlation based on dynamic features (performance counters [11], microarchitecture-independent characteristics [39], reactions to transformations [32] or semantically non-equivalent program modifications [31]).

The next section describes the overall framework and is followed by a detailed description of Milepost GCC and the Interactive Compilation Interface. This is then followed by a discussion of the features used to predict good optimizations.

    4.1 Milepost Adaptive Optimization Framework

The Milepost framework shown in Fig. 6 uses a number of components including (i) a machine learning enabled Milepost GCC with Interactive Compilation Interface (ICI) to modify internal optimization decisions, (ii) a Continuous Collective Compilation Framework (CCC) to perform iterative search for good combinations of optimizations and (iii) a Collective Optimization Database (COD) to record compilation and execution statistics in the common repository. Such information is later used as training data for the machine learning models. We use the public COD that is hosted at cTuning.org [14,28,32]. The Milepost framework currently proceeds in two distinct phases, in accordance with typical machine learning practice: training and deployment.

Training During the training phase we need to gather information about the structure of programs and record how they behave when compiled under different optimization


Fig. 6 Open framework to automatically tune programs and improve default optimization heuristics using predictive machine learning techniques: Milepost GCC with Interactive Compilation Interface (ICI) and program features extractor, CCC Framework to train ML models and predict good optimization passes, and COD optimization repository at cTuning.org


settings. Such information allows machine learning tools to correlate aspects of program structure, or features, with optimizations, building a strategy that predicts good combinations of optimizations.

In order to train a useful model, a large number of compilations and executions are needed as training examples. These training examples are generated by CCC [13,28], which evaluates different combinations of optimizations and stores execution time, profiling information, code size, compilation time and other metrics in a database. The features of the program are also extracted from Milepost GCC and stored in the COD. Plugins allow fine grained control and examination of the compiler, driven externally through shared libraries.

Deployment Once sufficient training data is gathered, multiple machine learning models can be created. Such models aim to correlate a given set of program features with profitable program transformations to predict good optimization strategies. They can later be re-inserted as plugins back into Milepost GCC or deployed as web services at cTuning.org. The latter method allows continuous update of the machine learning model based on collected information from multiple users. When encountering a new program, Milepost GCC determines the program's features and passes them to the model to predict the most profitable optimizations to improve execution time or other metrics depending on the user's optimization requirements.

    4.2 Milepost GCC and Interactive Compilation Interface

Current production compilers often have fixed and black-box optimization heuristics without the means to fine-tune the application of transformations. This section describes the Interactive Compilation Interface (ICI) [40] which unveils a compiler and provides opportunities for external control and examination of its optimization decisions with minimal changes. To avoid the pitfall of revealing the intermediate representation and libraries of the compiler to a point where it would overspecify too many internal details and prevent further evolution, we choose to control the decision process itself, granting access only to the high-level features needed for effectively taking a decision. Optimization settings at a fine-grained level, beyond the capabilities of command line options or pragmas, can be managed through external shared libraries, leaving the compiler uncluttered. By replacing default optimization heuristics, execution time, code size and compilation time can be improved.

We decided to implement ICI for GCC and transform it into a research self-tuning compiler to provide a common stable extensible compiler infrastructure shared by both academia and industry, aiming to improve the quality, practicality and reproducibility of research, and make experimental results immediately useful to the community.

The internal structure of ICI is shown in Fig. 7. We separate ICI into two parts: low-level compiler-dependent and high-level compiler-independent, the main reason being to keep high-level iterative compilation and machine learning plugins invariant when moving from one compiler to another. At the same time, since plugins now extend GCC through external shared libraries, experiments can be performed with no further modifications to the underlying compiler.


Fig. 7 GCC Interactive Compilation Interface: (a) original GCC, (b) Milepost GCC with ICI and plugins

External plugins can transparently monitor execution of passes or replace the GCC Controller (Pass Manager), if desired. Passes can be selected by an external plugin which may choose to drive them in a very different order than that currently used in GCC, even choosing different pass orderings for each and every function in the program being compiled. This mechanism simplifies the introduction of new analysis and optimization passes to the compiler.

In an additional set of enhancements, a coherent event and data passing mechanism enables external plugins to discover the state of the compiler and to be informed as it changes. At various points in the compilation process events (IC Event) are raised indicating decisions about transformations. Auxiliary data (IC Data) is registered if needed.

Using ICI, we can now substitute all default optimization heuristics with external optimization plugins to suggest an arbitrary combination of optimization passes during compilation without the need for any project or Makefile changes. Together with additional routines needed for machine learning, such as program feature extraction, our compiler infrastructure forms the Milepost GCC. We also added a -Oml flag which calls a plugin to extract features, queries machine learning model plugins and substitutes the default optimization levels.

In this work, we do not investigate optimal orders of optimizations since this requires detailed information about dependencies between passes to detect legal orders; we plan to provide this information in the future. Hence, we examine the pass orders generated by compiler flags during iterative compilation and focus on selecting or deselecting appropriate passes that improve program execution time, compilation time or code


size. In the future, we will focus on fine-grain parametric transformations in MILEPOST GCC [40] and combine them with the POET scripting language [86].

    4.3 Static Program Features

Our machine learning models predict the best GCC optimizations to apply to an input program based on its program structure or program features. The program features are typically a summary of the internal program representation and characterize essential aspects of a program that help to distinguish between good and bad optimizations.

The current version of ICI allows us to invoke auxiliary passes that are not part of GCC's default compiler optimization heuristics. These passes can monitor and profile the compilation process or extract data structures needed for generating program features.

During compilation, a program is represented by several data structures implementing the intermediate representation (tree-SSA, RTL etc.), control flow graph (CFG), def-use chains, loop hierarchy, etc. The data structures available depend on the compilation pass currently being performed. For statistical machine learning, the information about these data structures is encoded in a constant size vector of numbers (i.e. features); this process is called feature extraction and facilitates reuse of optimization knowledge across different programs.

We implemented an additional ml-feat pass in GCC to extract static program features. This pass is not invoked during default compilation but can be called using an extract_program_static_features plugin after any arbitrary pass, when all data necessary to produce features is available.

In Milepost GCC, feature extraction is performed in two stages. In the first stage, a relational representation of the program is extracted; in the second stage, the vector of features is computed from this representation. In the first stage, the program is considered to be characterized by a number of entities and relations over these entities. The entities are a direct mapping of similar entities defined by the language reference, or generated during compilation. Examples of such entities are variables, types, instructions, basic blocks, temporary variables, etc.

A relation over a set of entities is a subset of their Cartesian product. The relations specify properties of the entities or the connections among them. We use a notation based on logic for describing the relations; Datalog is a Prolog-like language but with a simpler semantics, suitable for expressing relations and operations upon them [83,79].

To extract the relational representation of the program, we used a simple method based on the examination of the include files. The main data structures of the compiler are built using struct data types, having a number of fields. Each such struct data type may introduce an entity, and its fields may introduce relations over the entity representing the including struct data type and the entity representing the data type of the field. This data is collected by the ml-feat pass.

In the second stage, we provide a Prolog program defining the features to be computed from the Datalog relational representation, extracted from the compiler's internal data structures in the first stage. The extract_program_static_features plugin invokes


    Table 2 List of static program features currently available in Milepost GCC V2.1

ft1 Number of basic blocks in the method
ft2 Number of basic blocks with a single successor
ft3 Number of basic blocks with two successors
ft4 Number of basic blocks with more than two successors
ft5 Number of basic blocks with a single predecessor
ft6 Number of basic blocks with two predecessors
ft7 Number of basic blocks with more than two predecessors
ft8 Number of basic blocks with a single predecessor and a single successor
ft9 Number of basic blocks with a single predecessor and two successors
ft10 Number of basic blocks with two predecessors and one successor
ft11 Number of basic blocks with two successors and two predecessors
ft12 Number of basic blocks with more than two successors and more than two predecessors
ft13 Number of basic blocks with number of instructions less than 15
ft14 Number of basic blocks with number of instructions in the interval [15, 500]
ft15 Number of basic blocks with number of instructions greater than 500
ft16 Number of edges in the control flow graph
ft17 Number of critical edges in the control flow graph
ft18 Number of abnormal edges in the control flow graph
ft19 Number of direct calls in the method
ft20 Number of conditional branches in the method
ft21 Number of assignment instructions in the method
ft22 Number of binary integer operations in the method
ft23 Number of binary floating point operations in the method
ft24 Number of instructions in the method
ft25 Average number of instructions in basic blocks
ft26 Average number of phi-nodes at the beginning of a basic block
ft27 Average number of arguments for a phi-node
ft28 Number of basic blocks with no phi nodes

a Prolog compiler to execute this program, resulting in a vector of features (as shown in Table 2) which later serves to detect similarities between programs, build machine learning models and predict the best combinations of passes for new programs. We provide more details about aggregation of semantical program properties for machine learning based optimization in [63].
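The actual pipeline extracts Datalog relations from GCC's internal data structures and computes the feature vector with a Prolog program; the simplified sketch below only illustrates the kind of counting involved for a few of the features in Table 2, using a toy control flow graph, and is not the Milepost implementation.

```python
# Toy CFG: basic block name -> list of successor block names, plus a
# per-block instruction count and a branch count (purely illustrative data).
cfg = {
    "bb1": ["bb2", "bb3"],
    "bb2": ["bb4"],
    "bb3": ["bb4"],
    "bb4": [],
}
instructions = {"bb1": 5, "bb2": 12, "bb3": 7, "bb4": 3}
conditional_branches = 1  # e.g. the branch closing bb1

def predecessors(cfg):
    """Invert the successor map to obtain predecessors of each block."""
    preds = {bb: [] for bb in cfg}
    for bb, succs in cfg.items():
        for s in succs:
            preds[s].append(bb)
    return preds

preds = predecessors(cfg)
features = {
    "ft1":  len(cfg),                                      # basic blocks in the method
    "ft2":  sum(1 for s in cfg.values() if len(s) == 1),   # blocks with a single successor
    "ft3":  sum(1 for s in cfg.values() if len(s) == 2),   # blocks with two successors
    "ft5":  sum(1 for p in preds.values() if len(p) == 1), # blocks with a single predecessor
    "ft16": sum(len(s) for s in cfg.values()),             # edges in the control flow graph
    "ft20": conditional_branches,                          # conditional branches in the method
    "ft24": sum(instructions.values()),                    # instructions in the method
    "ft25": sum(instructions.values()) / len(cfg),         # average instructions per block
}
print(features)
```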

    5 Using Machine Learning to Predict Good Optimization Passes

The Milepost approach to learning optimizations across programs is based on the observation that programs may exhibit similar behavior for a similar set of optimizations [2,32], and hence we try to apply machine learning techniques to correlate their features with the most profitable program optimizations. In this case, whenever we are


    Table 2 continued

ft29 Number of basic blocks with phi nodes in the interval [0, 3]
ft30 Number of basic blocks with more than 3 phi nodes
ft31 Number of basic blocks where the total number of arguments for all phi-nodes is greater than 5
ft32 Number of basic blocks where the total number of arguments for all phi-nodes is in the interval [1, 5]
ft33 Number of switch instructions in the method
ft34 Number of unary operations in the method
ft35 Number of instructions that do pointer arithmetic in the method
ft36 Number of indirect references via pointers (* in C)
ft37 Number of times the address of a variable is taken (& in C)
ft38 Number of times the address of a function is taken (& in C)
ft39 Number of indirect calls (i.e. done via pointers) in the method
ft40 Number of assignment instructions with the left operand an integer constant in the method
ft41 Number of binary operations with one of the operands an integer constant in the method
ft42 Number of calls with pointers as arguments
ft43 Number of calls where the number of arguments is greater than 4
ft44 Number of calls that return a pointer
ft45 Number of calls that return an integer
ft46 Number of occurrences of integer constant zero
ft47 Number of occurrences of 32-bit integer constants
ft48 Number of occurrences of integer constant one
ft49 Number of occurrences of 64-bit integer constants
ft50 Number of references of local variables in the method
ft51 Number of references (def/use) of static/extern variables in the method
ft52 Number of local variables referred in the method
ft53 Number of static/extern variables referred in the method
ft54 Number of local variables that are pointers in the method
ft55 Number of static/extern variables that are pointers in the method
ft56 Number of unconditional branches in the method

given a new unseen program, we can search for similar programs within the training set and suggest good optimizations based on their optimization experience.

In order to test this assumption, we selected the combination of optimizations which yields the best performance for a given program on AMD ("reference" in Fig. 8). We then applied all these best combinations to all other programs and reported the performance difference ("applied to"). It is possible to see that a fairly large number of programs share similar optimizations.

In the next subsections we introduce two machine learning techniques to select combinations of optimization passes based on the construction of a probabilistic model or a transductive model on a set of M training programs, and then use these models to predict good combinations of optimization passes for unseen programs based on their features.


Fig. 8 Percentage difference between the speedup achievable after iterative compilation for the "applied to" program and the speedup obtained when applying the best optimization from the "reference" program to the "applied to" program on AMD. A blank entry means that the best optimization was not found for this program

There are several differences between the two models: first, in our implementation the probabilistic model assumes each attribute is independent, whereas the proposed transductive model also analyzes interdependencies between attributes. Second, the probabilistic model finds the closest programs from the training set to the test program, whereas the transductive model attempts to generalize and identify good combinations of flags and program attributes. Therefore, it is expected that in some settings programs will benefit more from the probabilistic approach, whereas in others programs will be improved more by using the transductive method, depending on training set size, number of samples of the program space, as well as program and architecture attributes.

In order to train the two machine learning models, we generated 1000 random combinations of flags turned either on or off as described in Sect. 3. Such a number of runs is small relative to the size of the optimization space yet it provides enough optimization cases and sufficient information to capture good optimization choices. The program features for each benchmark, the flag settings and execution times formed the training data for each model. All experiments were conducted using leave-one-out cross-validation. This means that for each of the N programs, the other N − 1 programs


are used as training data. This guarantees that each program is unseen when the model predicts good optimization settings, to avoid bias.
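A minimal sketch of this leave-one-out protocol is shown below; the datasets structure and the train and predict_flags callables are placeholders for whichever model and data layout is being evaluated, not the actual experimental harness.

```python
def leave_one_out(datasets, train, predict_flags):
    """For each program, train on the remaining N-1 programs and predict a
    combination of flags for the held-out one, so that every prediction is
    made for a program the model has never seen."""
    predictions = {}
    for name, held_out in datasets.items():
        training = {n: d for n, d in datasets.items() if n != name}
        model = train(training)
        predictions[name] = predict_flags(model, held_out["features"])
    return predictions
```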

    5.1 Probabilistic Machine Learning Model

Our probabilistic machine learning method is similar to that of [2] where a probability distribution over good solutions (i.e. optimization passes or compiler flags) is learnt across different programs. This approach has been referred to as Predictive Search Distributions (PSD) [8]. However, unlike prior work [2,8] where such a distribution is used to focus the search of compiler optimizations on a new program, we use the learnt distribution to make one-shot predictions on unseen programs. Thus we do not search for the best optimization, we automatically predict it.

Given a set of training programs T^1, ..., T^M, which can be described by feature vectors t^1, ..., t^M, and for which we have evaluated different combinations of optimization passes (x) and their corresponding execution times (or speed-ups) y, so that we have for each program T^j an associated dataset D_j = {(x_i, y_i)}_{i=1}^{N_j}, with j = 1, ..., M, our goal is to predict a good combination of optimization passes x^* minimizing y when a new program T^* is presented.

We approach this problem by learning a mapping from the features of a program t to a distribution over good solutions q(x | t, θ), where θ are the parameters of the distribution. Once this distribution has been learnt, prediction for a new program T^* is straightforward and is achieved by sampling at the mode of the distribution. In other words, we obtain the predicted combination of flags by computing:

\[ \mathbf{x}^{*} = \operatorname*{argmax}_{\mathbf{x}} \; q(\mathbf{x} \mid \mathbf{t}, \theta). \qquad (1) \]

In order to learn the model it is necessary to fit a distribution over good solutions to each training program beforehand. These solutions can be obtained, for example, by using uniform sampling or by running an estimation of distribution algorithm (EDA, see [47] for an overview) on each of the training programs. In our experiments we use uniform sampling and we choose the set of good solutions to be those optimization settings that achieve at least 98% of the maximum speed-up available in the corresponding program-dependent dataset.

Let us denote the distribution over good solutions on each training program by P(x | T^j) with j = 1, ..., M. In principle, these distributions can belong to any parametric family. However, in our experiments we use an IID model where each of the elements of the combination is considered independently. In other words, the probability of a good combination of passes is simply the product of each of the individual probabilities corresponding to how likely each pass is to belong to a good solution:

\[ P(\mathbf{x} \mid T^{j}) = \prod_{\ell=1}^{L} P(x_{\ell} \mid T^{j}), \qquad (2) \]

where L is the length of the combination.


Fig. 9 Euclidean distance for all programs based on static program features normalized by feature 24 (number of instructions in a method)

Once the individual training distributions P(x | T^j) are obtained, the predictive distribution q(x | t, θ) can be learnt by maximization of the conditional likelihood or by using k-nearest neighbor methods. In our experiments we use a 1-nearest neighbor approach (Fig. 9 shows Euclidean distances between all programs with a visible clustering). In other words, we set the predictive distribution q(x | t, θ) to be the distribution corresponding to the training program that is closest in feature space to the new (test) program.
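Putting the pieces together, a simplified sketch of this prediction scheme might look as follows. The data layout (binary flag vectors with recorded speedups, feature vectors normalized as for Fig. 9) is assumed for illustration, and thresholding each per-flag probability at 0.5 is one way to sample at the mode of the IID distribution of Eq. (2); this is a sketch rather than the actual Milepost plugin.

```python
import math

def good_solution_distribution(runs, threshold=0.98):
    """Fit the per-flag probabilities P(x_l | T^j): among the combinations that
    reach at least 98% of the best speedup for this program, estimate how often
    each flag is switched on. 'runs' is a list of (flag_vector, speedup)."""
    best = max(s for _, s in runs)
    good = [x for x, s in runs if s >= threshold * best]
    length = len(good[0])
    return [sum(x[l] for x in good) / len(good) for l in range(length)]

def nearest_program(test_features, training_features):
    """1-nearest neighbour in feature space using Euclidean distance;
    'training_features' maps program names to feature vectors."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(training_features,
               key=lambda name: dist(test_features, training_features[name]))

def predict_flags(test_features, training_features, distributions):
    """One-shot prediction: take the distribution of the closest training
    program and sample at its mode (a flag is on iff its probability > 0.5)."""
    neighbour = nearest_program(test_features, training_features)
    return [1 if p > 0.5 else 0 for p in distributions[neighbour]]
```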

Figure 10 compares the speedups achieved after iterative compilation using 1000 iterations and 50% probability of selecting each optimization on AMD and Intel with those achieved after one-shot prediction using the probabilistic model or simply after selecting the best combination of optimizations from the closest program. Interestingly, the results suggest that simply selecting the best combination of optimizations from a similar program may not perform well in many cases; this may be due to our random optimization space exploration technique: each good combination of optimizations includes multiple flags that do not influence performance or other metrics on a given program, however some of them can considerably degrade performance on other programs. On the contrary, the probabilistic approach helps to filter away non-influential flags statistically and thereby improve predictions.


    Fig. 10 Speedups achieved when using iterative compilation on a AMD and b Intel with random searchstrategy (1,000 iterations; 50% probability to select each optimization;), when selecting best optimizationfrom the nearest program and when predicting optimization using probabilistic ML model based on programfeatures

    5.2 Transductive Machine Learning Model

In this subsection we describe a new transductive approach where optimization combinations themselves are used as features for the learning algorithm, together with program features. The model is then queried for the best combination of optimizations out of the set of optimizations that the program was compiled with. Many learning algorithms can be used to build the ML model. In this work we used a decision tree model [23] to ease analysis of the resulting model.


As in the previous section, we try to predict whether a specific optimization combination will obtain at least 95% of the maximal speedup possible. The feature set is comprised of the flags/passes and the extracted program features obtained from Milepost GCC. Denoting the vector of features extracted from the i-th program by t_i, i = 1, . . . , M, and the possible optimization passes by x_j, j = 1, . . . , N, we train the ML model with a set of features which is the cross-product x × t, such that each feature vector is a concatenation of x_j and t_i. This is akin to multi-class methods which rely on single binary classifiers (see [24] for a detailed discussion of such methods). The target for the predictor is whether this combination of program features and flags/passes will give a speedup of at least 95% of the maximal speedup.
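
For concreteness, a sketch of how such a training set could be assembled and fed to an off-the-shelf decision tree (illustrative only; the field names and the use of scikit-learn are assumptions, not the paper's implementation):

    from sklearn.tree import DecisionTreeClassifier

    # programs: list of dicts with 'features' (program feature vector) and
    # 'runs' (list of (flags, speedup) pairs for the sampled combinations)
    def build_dataset(programs):
        X, y = [], []
        for prog in programs:
            best = max(speedup for _, speedup in prog['runs'])
            for flags, speedup in prog['runs']:
                X.append(list(flags) + list(prog['features']))   # x_j ++ t_i
                y.append(1 if speedup >= 0.95 * best else 0)
        return X, y

    # X, y = build_dataset(training_programs)
    # model = DecisionTreeClassifier().fit(X, y)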

Once a program is compiled with different optimization settings (either an exhaustive sample or a random sample of optimization combinations), all successfully compiled settings are used as a query for the learned model together with the program features, and the flag setting which is predicted to have the best speedup is used. If several settings are predicted to have the same speedup, the one which exhibited, on average, the best speedup on the training-set programs is used.
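
The query step could then look like the following sketch (continuing the assumptions above; predict_proba is taken to give the model's estimated probability that a combination is within 95% of the best, and tie-breaking by average training-set speedup is not shown):

    def choose_flags(model, features, compiled_combinations):
        # score every combination the program was actually compiled with
        queries = [list(flags) + list(features) for flags in compiled_combinations]
        scores = model.predict_proba(queries)[:, 1]
        best = max(range(len(scores)), key=lambda i: scores[i])
        return compiled_combinations[best]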

Figure 11 compares the speedups achieved after iterative compilation (1,000 iterations, 50% probability of selecting each optimization) on ARC with those achieved after one-shot prediction using the probabilistic and transductive models. It shows that our probabilistic model can automatically improve the default optimization heuristics of GCC by 11% on average while reaching 100% of the achievable speedup in some cases. On the other hand, the transductive model improves GCC by only a modest 5%. However, in several cases it outperforms the probabilistic model (susan_s, dijkstra, rijndael_e, qsort1 and stringsearch1), likely due to a different mechanism of capturing the importance of program features and optimizations. Moreover, the transductive (decision tree) model has the advantage that its results are much easier to analyze. For example, Fig. 12 shows the top levels of the decision tree learnt for ARC. The leaves indicate the probability that the optimization and program feature combinations which reach these nodes will achieve at least 95% of the maximal speedup for a benchmark. Most of the features found at the top levels characterize the control flow graph (CFG). This is somewhat expected, since the structure of the CFG is one of the major factors that may affect the efficiency of several optimizations. Other features relate to the applicability of the address-taken operator to functions, which may affect the accuracy of the call graph and of subsequent analyses using it. To improve the performance of both models, we intend to analyze the quality and importance of program features and their correlation with optimizations in the future.

    5.3 Realistic Optimization Scenario of a Production Application

Experimental results from the previous section show how to optimize several standard benchmarks using Milepost GCC. In this section we show how to optimize a real production application using Milepost technology combined with the machine learning model from Sect. 5.1. For this purpose, we selected the open-source BerkeleyDB library (BDB), which is a popular high-performance database written in C with APIs to most other languages. For evaluation purposes we used an official internal


[Figure omitted: bar chart; y-axis: execution time speedup (0.8–1.6); one group of bars per benchmark plus AVERAGE; series: upper bound (iterative compilation), predicted optimization using probabilistic ML model, predicted optimization using transductive ML model.]
Fig. 11 Speedups achieved when using iterative compilation on ARC with a random search strategy (1,000 iterations; 50% probability to select each optimization) and when predicting the best optimizations using the probabilistic ML model and the transductive ML model based on program features

[Figure omitted: decision tree whose top-level splits test ft6 (number of basic blocks with two predecessors and one successor) < 18, ft38 (number of times the address of a function is taken, & in C) < 16.5, ft9 (number of basic blocks with a single predecessor and two successors) < 15.5, ft16 (number of edges in the control flow graph) < 193.5, and whether -O < 0.5; the leaves give the probability that combinations reaching them achieve at least 95% of the maximal speedup.]
Fig. 12 Top levels of decision trees learnt for ARC


benchmarking suite and provided support for the CCC framework to perform iterative compilation in the same manner as described in Sect. 3, in order to find the upper bounds for execution time, code size and compilation time.

For simplicity, we decided to use the probabilistic machine learning model from Sect. 5.1. Since BDB is relatively large (around 200,000 lines of code), we selected the 3 hottest functions, extracted features for each function using Milepost GCC and calculated the Euclidean distance to all programs from our training set (MiBench/cBench) to find the five most similar programs. Then, depending on the optimization scenario, we selected the best optimizations from those programs to (a) improve execution time while not degrading compilation time, (b) improve code size while not degrading execution time and (c) improve compilation time while not degrading execution time. Figure 13 shows the execution time speedups, code size improvements and compilation time speedups over the -O3 optimization level achieved when applying the selected optimizations from the most similar programs to BerkeleyDB for these three optimization scenarios. These speedups are compared to the upper bound for the respective metrics achieved after iterative compilation (200 iterations) for the whole program. The programs on the X-axis are sorted by distance, starting from the closest program. In the case of improving execution time, we observe significant speedups across the functions. For improving compilation time we are far from the optimal solution, because that objective is naturally associated with the lowest optimization level, while we also focused on not degrading the execution time obtained with -O3. Overall, the best results were achieved when applying optimizations from the tiff programs, which are closer in feature space to the hot functions selected from BerkeleyDB than any other program of the training set.
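
Schematically, this workflow amounts to a per-function nearest-neighbor lookup followed by reusing the stored best flags for the chosen objective. A sketch under assumed data structures (names are illustrative; training[p]['best'][objective] is a hypothetical record of the best known combination for program p and a given objective):

    import math

    def closest_programs(function_features, training, k=5):
        # rank training programs by Euclidean distance in feature space
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        return sorted(training,
                      key=lambda p: dist(function_features,
                                         training[p]['features']))[:k]

    def candidate_flags(hot_function_features, training, objective='time', k=5):
        # collect the best recorded combinations of the k closest programs
        return [training[p]['best'][objective]
                for p in closest_programs(hot_function_features, training, k)]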

We added information about the best optimizations from these 3 optimization scenarios to the open online Collective Optimization Database [14] to help users and researchers validate and reproduce such results. These optimization cases are referenced by the following cTuning RUN_ID reference numbers: 24857532370695782, 17268781782733561 and 9072658980980875. The default run related to the -O3 optimization level is referenced by 965827379437489142. We also added support for the pragma #ctuning-opt-case UID, which allows end-users to explicitly force Milepost GCC to apply to a given code section the combinations of optimizations found by other users during empirical collective search and referenced by UID in COD, instead of using machine learning.

    6 Related Work

Automatic performance tuning techniques are now widely adopted to improve different characteristics of code empirically. They search automatically for good optimization settings, applying multiple compilations and executions of a given program while requiring little or no knowledge of the current platform, so programs can be adapted to any given architecture. Originally developed to improve the performance of various small kernels using a few parametric transformations across multiple architectures, where static compilers fail to deliver the best performance [84,45,56,71,81,6,85], these


[Figure omitted: three bar charts (a), (b), (c); y-axis: speedup/improvement (0.8–1.2); bars for the maximum after iterative compilation and for optimizations taken from tiffmedian, tiffdither, tiff2rgba, patricia and stringsearch1; series: execution time speedup, compilation time speedup, code size improvement.]
Fig. 13 Execution time speedups (a), code size improvements (b) and compilation time speedups (c) for BerkeleyDB on Intel when applying optimizations from the 5 closest programs from MiBench/cBench (based on Euclidean distance using static program features of the 3 hottest functions) under several optimization scenarios

techniques have been extended to larger applications and a richer set of optimizations [7,44,16,46,18,77,27,21,51,67,86,25,68,78].

Though popular for library generators and embedded systems, iterative compilation is still not widely adopted by general-purpose compilers, mainly due to excessively long optimization times. Multiple genetic and probabilistic approaches have been developed to speed up the optimization of a given program on a given architecture [64,17,73,37,4,38,26]. Furthermore, novel dynamic and hybrid (static and dynamic) adaptation techniques have been proposed to speed up the evaluation of optimizations at run-time [80,49]. In [30], we showed the possibility of speeding up iterative compilation by


several orders of magnitude using static function cloning with pre-optimized versions for various objectives and low-overhead run-time optimization evaluation, which also enabled adaptive binaries reactive to run-time changes in the program and environment. In [27,31], we introduced a new technique to quickly determine a realistic lower bound on the execution time of memory-intensive applications by converting array accesses to scalars in various ways, without preserving the semantics of the code, in order to detect performance anomalies and identify code sections that can benefit from empirical optimizations. All these techniques can effectively learn the optimization space of an individual program and its optimization process, but they still do not learn optimizations across programs and architectures.

Calder et al. [10] presented a new approach to predict branches for a new program based on the behavior of other programs. They used neural networks and decision trees to map static features associated with each branch to a prediction of whether the branch will be taken, and managed to slightly reduce the branch misprediction rate on a set of C and Fortran programs. Moss and McGovern et al. [62,59] incorporated a reinforcement learning model into a compiler to improve code scheduling; however, no absolute performance improvements were reported. Monsifrot et al. [61] presented a classifier based on decision tree learning to determine which loops to unroll. Stephenson and Amarasinghe [72] also predicted unroll factors using nearest-neighbor classification and support vector machines. In our previous work [2,11] we used static or dynamic (performance counter) code features with the SUIF, Intel and PathScale compilers to predict a set of multiple optimizations that improve execution time for new programs based on similarities between previously optimized programs. Liao et al. [82] applied machine learning to performance counters and decision trees to choose hardware prefetcher configurations. Several researchers [9,50,55] attempted to characterize program input in order to predict the best code variant at run-time using several machine learning methods, including automatically generated decision trees and statistical modeling. Other works [42,39,22] used machine learning for performance prediction and hardware-software co-design.

Though machine learning techniques demonstrate a good potential to speed up the iterative compilation process and facilitate reuse of optimization knowledge across different programs and architectures, the training phase can still be very long. Techniques for continuous optimization can effectively speed up training of machine learning models. Anderson et al. [3] presented a practical framework for continuous and transparent profiling and analysis of computing systems, though unfortunately this work did not continue and no machine learning was used. Lattner and Adve [48] and Lu et al. [54] described frameworks for lifelong program optimization, but without providing details on the practical collection of data and optimization strategies across runs. Other frameworks [5,74] can collect profile information across multiple runs of users and continuously alter run-time decisions in Java virtual machines, while we focus on production static compilers and predictive modeling to correlate program features with program optimizations. In previous work [28,32] we presented an open-source framework for statistical collective optimization that can leverage the experience of multiple users with static compilers and collect run-time profile data transparently in an open public database for further machine learning processing. In [32], we also presented a new technique to characterize programs based on their reaction to transformations, which


can be an alternative, portable approach to program characterization using static or dynamic program features.

We found many of the above approaches highly preliminary, limited to a few transformations and global flags, and rarely accompanied by publicly released open-source tools or experimental data to reproduce results. In contrast, the main goal of the Milepost project is to make machine learning based multi-objective optimization a realistic, automatic, reproducible and portable technology for general-purpose production compilers.

    7 Conclusions and Future Work

The main contribution of this article is the first practical attempt to move empirical multi-objective iterative optimization and machine learning research techniques into production compilers, to deliver an open collaborative R&D infrastructure based on the popular GCC compiler, and to connect it to the cTuning.org optimization repository to help end-users optimize their applications and allow researchers to reproduce and improve experimental results. We show that Milepost GCC has the potential to automate the tuning of compiler heuristics for a wide range of architectures and multi-objective optimizations such as improving execution time, code size, compilation time and other constraints, while considerably simplifying overall compiler design and time to market.

We released all Milepost/cTuning infrastructure and experimental data as open source at cTuning.org [57,19,20] to be immediately useful to end users and researchers. We hope that Milepost GCC, connected to cTuning.org's public collaborative tools and databases with a common API, will open many new practical opportunities for systematic and reproducible research in the area of empirical multi-objective optimization and machine learning. Some of Milepost's technology is now included in mainline GCC.

We continue to extend the Interactive Compilation Interface [41,40] to abstract high-level optimization processes from compiler internals and provide finer-grain tuning for performance, power, compilation time and code size. We also expect to combine ICI with the POET scripting language [86] and pragmas to unify fine-grain program tuning. Future work will connect LLVM, ROSE, Path64 and other compilers to our framework. We are also integrating our framework with the collective optimization approach [32] to reduce or completely remove the overheads of the training stage with limited benchmarks, architectures and datasets. Collective optimization also makes it possible to define truly representative benchmarks based on classical clustering techniques.

Our framework now facilitates deeper analysis of interactions among optimizations and investigation of the influence of program inputs and run-time state on program optimizations in large applications. We are also extending Milepost/cTuning technology to improve machine learning models and analyze the quality of program features in order to search for optimal sequences of optimization passes or polyhedral transformations [51,78]. We started combining Milepost technology with machine-learning based auto-parallelization and predictive scheduling techniques [52,43,76]. We have also started investigating staged compilation techniques to balance static and dynamic optimizations using machine learning in LLVM or Milepost GCC4CIL connected to the Mono virtual machine. We plan to connect architecture simulators to our


framework to enable software and hardware co-optimization. Finally, we will investigate adaptive and machine learning techniques for parallelization on heterogeneous multi-core architectures and power-saving prediction for large data centers and supercomputers.

Acknowledgments This research was generously supported by the EU FP6 035307 Project Milepost (MachIne Learning for Embedded PrOgramS opTimization) [58]. We would like to thank the GRID5000 [35] community for providing computational resources to help validate the results of this paper. We are grateful to Prof. William Jalby for providing financial support to Abdul Wahid Memon and Yuriy Kashnikov to work on this project. We would like to thank Ari Freund, Björn Franke, Hugh Leather, our colleagues from the Institute of Computing Technology of the Chinese Academy of Sciences and users from the GCC, cTuning and HiPEAC communities for interesting discussions and feedback during this project. We would like to thank Cupertino Miranda and Joern Rennecke for their help in improving the Interactive Compilation Interface [41]. We would also like to thank Diego Novillo and the GCC developers for practical discussions about the implementation of the ICI-compatible plugin framework in GCC. We are grateful to Google for their support to extend Milepost GCC during the Google Summer of Code'09 program. We would also like to thank Yossi Gil for proof-reading this paper and the anonymous reviewers for their insightful comments.

    References

1. ACOVEA: Using Natural Selection to Investigate Software Complexities. http://www.coyotegulch.com/products/acovea
2. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O'Boyle, M., Thomson, J., Toussaint, M., Williams, C.: Using machine learning to focus iterative optimization. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO) (2006)
3. Anderson, J., Berc, L., Dean, J., Ghemawat, S., Henzinger, M., Leung, S., Sites, D., Vandevoorde, M., Waldspurger, C., Weihl, W.: Continuous profiling: where have all the cycles gone? In: Proceedings of the 30th Symposium on Microarchitecture (MICRO-30) (1997)
4. Arcuri, A., White, D.R., Clark, J., Yao, X.: Multi-objective improvement of software using co-evolution and smart seeding. In: Proceedings of the 7th International Conference on Simulated Evolution And Learning (SEAL'08) (2008)
5. Arnold, M., Welc, A., Rajan, V.T.: Improving virtual machine performance using a cross-run profile repository. In: Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'05) (2005)
6. Barthou, D., Donadio, S., Carribault, P., Duchateau, A., Jalby, W.: Loop optimization using hierarchical compilation and kernel decomposition. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO) (2007)
7. Bodin, F., Kisuki, T., Knijnenburg, P., O'Boyle, M., Rohou, E.: Iterative compilation in a non-linear optimisation space. In: Proceedings of the Workshop on Profile and Feedback Directed Compilation (1998)
8. Bonilla, E.V., Williams, C.K.I., Agakov, F.V., Cavazos, J., Thomson, J., O'Boyle, M.F.P.: Predictive search distributions. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 121–128, New York, NY, USA (2006)
9. Brewer, E.: High-level optimization via automated statistical modeling. In: Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 80–91 (1995)
10. Calder, B., Grunwald, D., Jones, M., Lindsay, D., Martin, J., Mozer, M., Zorn, B.: Evidence-based static branch prediction using machine learning. ACM Transactions on Programming Languages and Systems (TOPLAS) (1997)
11. Cavazos, J., Fursin, G., Agakov, F., Bonilla, E., O'Boyle, M., Temam, O.: Rapidly selecting good compiler optimizations using performance counters. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO), March (2007)
12. Cavazos, J., Moss, J.: Inducing heuristics to decide whether to schedule. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (2004)
13. CCC: Continuous Collective Compilation Framework for iterative multi-objective optimization. http://cTuning.org/ccc


14. COD: Public collaborative repository and tools for program and architecture characterization and optimization. http://cTuning.org/cdatabase
15. Chen, Y., Huang, Y., Eeckhout, L., Fursin, G., Peng, L., Temam, O., Wu, C.: Evaluating iterative optimization across 1000 data sets. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June (2010)
16. Cooper, K., Grosul, A., Harvey, T., Reeves, S., Subramanian, D., Torczon, L., Waterman, T.: ACME: adaptive compilation made efficient. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES) (2005)
17. Cooper, K., Schielke, P., Subramanian, D.: Optimizing for reduced code space using genetic algorithms. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pp. 1–9 (1999)
18. Cooper, K., Subramanian, D., Torczon, L.: Adaptive optimizing compilers for the 21st century. J. Supercomput. 23(1) (2002)
19. cTuning CC: cTuning Compiler Collection that can convert any traditional compiler into adaptive, machine learning enabled self-tuning infrastructure using Milepost GCC with ICI, CCC framework, cBench, COD public repository and cTuning.org web-services. http://cTuning.org/ctuning-cc
20. cTuning.org: public collaborative optimization center with open source tools and repository to systematize, simplify and automate design and optimization of computing systems while enabling reproducibility of results
21. Donadio, S., Brodman, J.C., Roeder, T., Yotov, K., Barthou, D., Cohen, A., Garzaran, M.J., Padua, D.A., Pingali, K.: A language for the compact representation of multiple program versions. In: Proceedings of the International Workshop on Languages and Compilers for Parallel Computing (LCPC) (2005)
22. Dubach, C., Jones, T.M., Bonilla, E.V., Fursin, G., O'Boyle, M.F.: Portable compiler optimization across embedded programs and microarchitectures using machine learning. In: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), December (2009)
23. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New York (2001)
24. El-Yaniv, R., Pechyony, D., Yom-Tov, E.: Better multiclass classification via a margin-optimized single binary problem. Pattern Recognit. Lett. 29(14), 1954–1959 (2008)
25. ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto/index.html
26. Franke, B., O'Boyle, M., Thomson, J., Fursin, G.: Probabilistic source-level optimisation of embedded programs. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES) (2005)
27. Fursin, G.: Iterative Compilation and Performance Prediction for Numerical Applications. PhD thesis, University of Edinburgh, United Kingdom (2004)
28. Fursin, G.: Collective tuning initiative: automating and accelerating development and optimization of computing systems. In: Proceedings of the GCC Developers' Summit, June (2009)
29. Fursin, G., Cavazos, J., O'Boyle, M., Temam, O.: MiDataSets: creating the conditions for a more realistic evaluation of iterative optimization. In: Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2007), January (2007)
30. Fursin, G., Cohen, A., O'Boyle, M., Temam, O.: A practical method for quickly evaluating program optimizations. In: Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), pp. 29–46, November (2005)
31. Fursin, G., O'Boyle, M., Temam, O., Watts, G.: Fast and accurate method for determining a lower bound on execution time. Concurrency 16(2–3), 271–292 (2004)
32. Fursin, G., Temam, O.: Collective optimization. In: Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), January (2009)
33. Georges, A., Buytaert, D., Eeckhout, L.: Statistically rigorous Java performance evaluation. In: Proceedings of the Twenty-Second ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA) (2007)
34. GCC: the GNU Compiler Collection. http://gcc.gnu.org
35. GRID5000: A nationwide infrastructure for large scale parallel and distributed computing research. http://www.grid5000.fr
36. Guthaus, M.R., Ringenberg, M.R., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: MiBench: a free, commercially representative embedded benchmark suite. In: Proceedings of the IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December (2001)


37. Heydemann, K., Bodin, F.: Iterative compilation for two antagonistic criteria: application to code size and performance. In: Proceedings of the 4th Workshop on Optimizations for DSP and Embedded Systems, colocated with CGO (2006)
38. Hoste, K., Eeckhout, L.: COLE: compiler optimization level exploration. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO) (2008)
39. Hoste, K., Eeckhout, L.: Comparing benchmarks using key microarchitecture-independent characteristics. In: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pp. 83–92, California, USA, October (2006)
40. Huang, Y., Peng, L., Wu, C., Kashnikov, Y., Rennecke, J., Fursin, G.: Transforming GCC into a research-friendly environment: plugins for optimization tuning and reordering, function cloning and program instrumentation. In: 2nd International Workshop on GCC Research Opportunities (GROW), colocated with HiPEAC'10 Conference, January (2010)
41. ICI: Interactive Compilation Interface is a unified plugin system to convert black-box production compilers into interactive research toolsets for application and architecture characterization and optimization. http://cTuning.org/ici
42. Ipek, E., McKee, S.A., de Supinski, B.R., Schulz, M., Caruana, R.: Efficiently exploring architectural design spaces via predictive modeling. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 195–206 (2006)
43. Jimenez, V., Gelado, I., Vilanova, L., Gil, M., Fursin, G., Navarro, N.: Predictive runtime code scheduling for heterogeneous architectures. In: Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), January (2009)
44. Kisuki, T., Knijnenburg, P., O'Boyle, M.: Combined selection of tile sizes and unroll factors using iterative compilation. In: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 237–246 (2000)
45. Kisuki, T., Knijnenburg, P., O'Boyle, M.: Combined selection of tile sizes and unroll factors using iterative compilation. In: Proceedings of the IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 237–246 (2000)
46. Kulkarni, P., Zhao, W., Moon, H., Cho, K., Whalley, D., Davidson, J., Bailey, M., Paek, Y., Gallivan, K.: Finding effective optimization phase sequences. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pp. 12–23 (2003)
47. Larrañaga, P., Lozano, J.A.: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer, Norwell (2001)
48. Lattner, C., Adve, V.: LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, March (2004)
49. Lau, J., Arnold, M., Hind, M., Calder, B.: Online performance auditing: using hot optimizations without getting burned. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'06) (2006)
50. Li, X., Garzaran, M.J., Padua, D.A.: Optimizing sorting with machine learning algorithms. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) (2007)
51. Long, S., Fursin, G.: A heuristic search algorithm based on unified transformation framework. In: Proceedings of the 7th International Workshop on High Performance Scientific and Engineering Computing (HPSEC-05), pp. 137–144 (2005)
52. Long, S., Fursin, G., Franke, B.: A cost-aware parallel workload allocation approach based on machine learning techniques. In: Proceedings of the IFIP International Conference on Network and Parallel Computing (NPC 2007), number 4672 in LNCS, pp. 506–515. Springer, September (2007)
53. LLVM: the low level virtual machine compiler infrastructure. http://llvm.org
54. Lu, J., Chen, H., Yew, P.-C., Hsu, W.-C.: Design and implementation of a lightweight dynamic optimization system. J. Instruction-Level Parallel. 6 (2004)
55. Luo, L., Chen, Y., Wu, C., Long, S., Fursin, G.: Finding representative sets of optimizations for adaptive multiversioning applications. In: 3rd Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'09), colocated with HiPEAC'09 Conference, January (2009)
56. Matteo, F., Johnson, S.: FFTW: an adaptive software architecture for the FFT. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1381–1384, Seattle, WA, May (1998)


57. MILEPOST GCC: public collaborative R&D website. http://cTuning.org/milepost-gcc
58. MILEPOST project archive (MachIne Learning for Embedded PrOgramS opTimization). http://cTuning.org/project-milepost
59. McGovern, A., Moss, E.: Scheduling straight-line code using reinforcement learning and rollouts. In: Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann, San Mateo (1998)
60. Monsifrot, A., Bodin, F., Quiniou, R.: A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, Applications, LNCS 2443, pp. 41–50 (2002)
61. Monsifrot, A., Bodin, F., Quiniou, R.: A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the Tenth International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA), LNCS 2443, pp. 41–50 (2002)
62. Moss, J., Utgoff, P., Cavazos, J., Precup, D., Stefanovic, D., Brodley, C., Scheeff, D.: Learning to schedule straight-line code. In: Advances in Neural Information Processing Systems (NIPS), pp. 929–935. Morgan Kaufmann (1997)
63. Namolaru, M., Cohen, A., Fursin, G., Zaks, A., Freund, A.: Practical aggregation of semantical program properties for machine learning based optimization. In: P