
  • Predicting Good Compiler Transformations

    Using Machine Learning

    Edwin V. Bonilla

    Master of Science

    Artificial Intelligence

    School of Informatics

    University of Edinburgh

    2004

    Abstract

    This dissertation presents a machine learning solution to the compiler optimisation problem

    focused on a particular program transformation: loop unrolling. Loop unrolling is a very

    straightforward but powerful code transformation mainly used to improve Instruction Level

    Parallelism and to reduce the overhead due to loop control. However, loop unrolling can also

    be detrimental, for example, when instruction cache performance degrades because the unrolled loop body becomes too large. Additionally, the effect of the interactions between loop unrolling and other

    program transformations is unknown. Consequently, determining when and how unrolling

    should be applied remains a challenge for compiler writers and researchers. This project

    works under the assumption that the effect of loop unrolling on the execution times of

    programs can be learnt based on past examples. Therefore, a regression approach able to

    learn the improvement in performance of loops under unrolling is presented. This novel

    approach differs from previous work ([Monsifrot et al., 2002] and [Stephenson and

    Amarasinghe, 2004]) because it does not formulate the problem as a classification task but as

    a regression task. Great effort has been invested in the generation of clean and reliable

    data in order to make it suitable for learning. Two different regression algorithms have been

    used: Multiple Linear Regression and Classification and Regression Trees (CART).

    Although the accuracy of the methods is questionable, the realisation of final speed-ups on

    seven out of twelve benchmarks indicates that something has been gained with this learning

    process. A maximum re-substitution improvement of 18% has been achieved, together with overall performance improvements of 2.5% for Linear Regression and 2.3% for the CART algorithm. The present work is the beginning of an ambitious project that attempts to build a compiler that can learn to optimise programs, and it can undoubtedly be improved in the near future.

    Acknowledgements

    Special thanks to my supervisor Dr. Chris Williams for his invaluable advice and

    comprehensive revision of my progress throughout this project.

    Thanks to Dr. Michael O'Boyle and Dr. Grigori Fursin for the discussions held during our

    meetings that made possible the creation of the dataset that has been used in this project.

    Thanks to Catalina Voroneanu and Bonny Quick for their patience when revising some

    drafts of this dissertation.

    Supported by the Programme Alβan, the European Union Programme of High Level

    Scholarships for Latin America, identification number (E03M14650CO).

    Declaration

    I declare that this thesis was composed by myself, that the work contained herein is my own

    except where explicitly stated otherwise in the text, and that this work has not been

    submitted for any other degree or professional qualification except as specified.

    (Edwin V. Bonilla)

    Table of Contents

    Introduction
        Overview and Motivation
        Project Objectives
        Organisation

    Chapter One: Literature Review
        1.1 Introduction
        1.2 Tuning heuristics and recommending program transformations
        1.3 Learning in a particular program transformation: loop unrolling
        1.4 Summary

    Chapter 2: Background on Compiler Optimisation
        2.1 Introduction
        2.2 Definition of compilation
        2.3 Compiler organisation
        2.4 The purpose of a compiler
        2.5 An Optimising Compiler
            2.5.1 Goals of Compiler Optimisation
            2.5.2 Considerations for program transformations
            2.5.3 The process of transforming a program for optimisation
            2.5.4 The problem of interaction
            2.5.5 Types of program transformations
            2.5.6 The scope of optimisation
            2.5.7 Some common transformations
        2.6 Loop Unrolling
            2.6.1 Definition
            2.6.2 Implementation
            2.6.3 Advantages of loop unrolling
            2.6.4 Disadvantages of loop unrolling
            2.6.5 Interactions, again
            2.6.6 Candidates for unrolling
        2.7 Summary

    Chapter 3: Data Collection
        3.1 Introduction
        3.2 The Benchmarks
        3.3 Implementation of loop unrolling
            3.3.1 Which loops should be unrolled?
            3.3.2 Initial experiments
            3.3.3 Loop level profiling
        3.4 Generating the targets
            3.4.1 Preparing the benchmarks
            3.4.2 Selecting loops
            3.4.3 Profiling
            3.4.4 Filtering
            3.4.5 Running the search strategy
        3.5 Technical Details
            3.5.1 The platform
            3.5.2 The compiler
            3.5.3 The timers precision
        3.6 The results in summary
        3.7 Feature extraction
        3.8 The representation of a loop
        3.9 Summary

    Chapter 4: Data Preparation and Exploratory Data Analysis
        4.1 Introduction
        4.2 The general framework for data integration
        4.3 Formal representation of the data
        4.4 Is this data valid?
            4.4.1 Statistical analysis
        4.5 Pre-processing the targets
            4.5.1 Filtering
            4.5.2 Dealing with outliers
            4.5.3 Target transformation
        4.6 Pre-processing the features
            4.6.1 Rescaling
            4.6.2 Feature selection and feature transformation
        4.7 Summary

    Chapter 5: Modelling and Results
        5.1 Introduction
        5.2 The regression approach
        5.3 Learning methods used
            5.3.1 Multiple Linear Regression
            5.3.2 Classification and Regression Trees
        5.4 Parameters setting
        5.5 Measure of performance used
        5.6 Experimental Design
            5.6.1 Complete dataset
            5.6.2 K-fold cross-validation
            5.6.3 Leave One Benchmark Out cross-validation
            5.6.4 Realising speed-ups
        5.7 Results and Evaluation
            5.7.1 Complete dataset
            5.7.2 K-fold cross-validation
            5.7.3 Leave One Benchmark Out Cross-validation
            5.7.4 Realising speed-ups
            5.7.5 Feature construction
            5.7.6 Comparison to related work
        5.8 Summary and Discussion

    Conclusions
    Bibliography

    Introduction

    Overview and Motivation

    The continuously increasing demand for high-performance computational resources has

    required great effort on behalf of hardware designers in order to develop advanced

    components that make current applications possible. Thus, microprocessor architectures have become more complex and have come to provide processing rates that until a few years ago were affordable only to a select group of users.

    However, the resources offered by current processors are significantly underexploited, which

    limits the possibility of running applications at maximum speed.

    This bottleneck is mainly a consequence of the limitations of the compiler, the very complex application used to transform the source code of programs written in a high-level programming language into machine-dependent code. Indeed, the main concern of compiler

    writers centres not upon very well studied tasks at the front of the compilation process (such

    as parsing and lexical analysis) but upon the discovery of ways in which to optimise the

    execution time of programs. Hence, optimisation is not considered an additional feature of

    the compilation process but has become a crucial component of existing compilers.

    However, achieving high performance on modern processors is an extremely difficult task

    and compiler writers have to deal with NP-complete problems.

    Numerous program transformations have been created in order to guarantee optimal use of

    the resources of existing architectures. Nonetheless, all these transformations fail in the sense

    that the compiler optimisation problem is far from being optimally solved. In fact, although

    these transformations should, in theory, lead to a significant improvement in the execution time of programs, they have instead become new problems to be solved.

    Certainly, due to the great variety of transformations that are currently available, the number

    of parameters each transformation involves and the unknown and almost enigmatic effect of

    the interactions among these transformations, the compiler optimisation problem has been

    transferred to the task of discovering when and how program transformations should be

    applied.

    Considering the huge search space in compiler optimisation, compiler writers commonly rely

    on heuristics that suggest when the application of a particular transformation can be

    profitable. The usual approach adopted is running a set of benchmarks and setting the

    parameters of the heuristics on these benchmarks. Although valid, this approach has been shown to exploit the power of the transformations insufficiently, leading to suboptimal solutions or sometimes even to a degradation in the execution time of programs. This is understandable given that the heuristics adopted may be specific to

    particular types of programs and target architectures.

    This project tackles the compiler optimisation problem from a different point of view. It does

    not attempt to tune heuristics previously constructed in order to optimise programs, but

    investigates the use of machine learning techniques with the aim of discovering when a

    particular transformation can be beneficial. Machine Learning techniques have been

    successfully applied to different areas, from business applications to star/galaxy

    classification. The type of learning adopted in this project is the one based on past examples

    (training data) from which a model can be constructed. This model must be provided with

    predictive power, i.e. although it is built on a set of training examples, it must be able to

    successfully perform on novel data.

    Project Objectives

    The principal goal of this project is to use supervised machine learning techniques to

    predict good compiler transformations for programs. Therefore, if a number of examples

    of good transformations for programs are given, the aim is to use machine learning methods

    to predict these transformations based on features that characterise the programs. Given the

    great number of program transformations that are available, this project will focus on a

    relatively simplified version of the problem by applying machine learning techniques to

    predict the effect of a particular transformation: loop unrolling.

    Loop unrolling is a very straightforward but powerful program transformation mainly used

    to improve Instruction Level Parallelism (ILP, the execution of several instructions at the same time) and to reduce the overhead due to loop control. Studies on loop unrolling date

    from about 1979 when Dongarra and Hinds (1979) investigated the effect of this

    transformation on programs written in Fortran. However, understanding the interaction of the

    parameters that may affect it as well as its effects on the execution time of the ultimate code

    generated by the compiler is still a challenge. The main problem to be solved in loop

    unrolling is deciding whether this transformation is beneficial or not and, yet further, how

    much a loop should be unrolled in order to optimise the execution time of programs.

    Previous work has focused on building classifiers in order to predict the profitability of loop

    unrolling [Monsifrot et al., 2002] and how much this transformation should be applied

    [Stephenson and Amarasinghe, 2004]. Although well-founded, these approaches may

    encounter some difficulties when noisy measurements are present and when the data used for

    learning is limited. This project goes beyond the classification problem and proposes a

    regression approach to predict the improvement in performance that unrolling can

    produce on a particular loop. The regression approach is a more general formulation of the

    classification problem, and the previous formulations can be recovered from it. Furthermore, the machine learning solution to loop unrolling based on a regression method is smoother than the one obtained by classification methods, given how noisy the measurements may be and the limited number of training examples that are available.
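    To make the contrast with classification concrete, one plausible way of stating the regression target is sketched below. The notation is illustrative only; the exact target actually used, and the pre-processing applied to it, are defined in Chapters 3 to 5.

        s(l, u) = T(l, 1) / T(l, u)

    Here T(l, u) denotes the measured execution time of loop l unrolled with factor u, with u = 1 denoting the original (rolled) loop, so s(l, u) is the speed-up obtained by unrolling. A regression model estimates s(l, u) from loop features, and the earlier formulations can be recovered from it: unrolling is predicted to be beneficial whenever the estimated s(l, u) exceeds 1 for some u, and the predicted best unroll factor is the u with the largest estimated speed-up.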

    Bearing in mind that the process of applying machine learning techniques requires great

    effort in the preparation and analysis of the data to make it suitable for learning and the

    correct use of appropriate techniques for the regression task, the following specific

    objectives have been stated for the project:

    To develop a systematic approach that enables the application of machine learning

    techniques to compiler optimisation focused on loop unrolling.

    To investigate the appropriateness of learning based on past examples in compiler

    optimisation.

    To propose a regression solution able to determine the most influential parameters

    involved in loop unrolling and able to predict the effect of this transformation on the

    execution time of programs.

    To assess the quality and the impact of the results obtained by two different

    regression methods used to predict the effect of loop unrolling.

    Organisation

    This dissertation assumes that the reader does not have a background in compilers and,

    therefore, an entire chapter will be dedicated to explaining the most relevant issues involved

    in compilation and compiler optimisation. Furthermore, the most important terminology

    related to compiler optimisation needed to understand a specific concept or procedure will be

    progressively explained. Similarly, although no separate section has been devoted to

    providing a background on machine learning techniques, those concepts that are believed to

    be crucial to understanding what is stated will be described as each chapter is developed. The

    organisation of this dissertation is presented below.

    Chapter One provides the Literature Review, describing previous work that involves

    machine learning and/or artificial intelligence techniques applied to compiler optimisation.

    Each approach is briefly explained, giving some details about the implementation and the

    learning technique used. A discussion concerning their advantages and drawbacks is

    presented and an explanation about how this previous work has influenced and motivated the

    present project is given.

    Chapter Two presents a Background on Compiler Optimisation. It aims to familiarise the

    reader with the terminology involved in compiler optimisation, which will be used

    throughout this dissertation, and give the background necessary to understand subsequent

    chapters. Initially, the definition of compilation is provided, followed by an explanation of

    the most important issues in compiler optimisation. The concept of program transformation

    is briefly described and some code transformations are mentioned. A special section is

    devoted to loop unrolling, which is the program transformation of interest in the present

    dissertation.

    Chapter Three describes the Data Collection process that has been carried out in this project.

    It presents the most relevant factors that were taken into account to execute the experiments

    and to generate the data available for learning. In this chapter, the benchmarks that were used

    to construct the dataset are described; the implementation used for loop unrolling is

    explained; some technical issues regarding the platform, the compiler and the optimisation

    level used for the experiments are provided; the results of the data collection are summarised

    and the features extracted from the loops are given.

    Chapter Four presents the Data Preparation and Exploratory Data Analysis that was

    carried out after the data collection process. It describes the process of cleaning, analysing

    and preparing the data in order to make it suitable for learning. This chapter provides an

    insight into the data that will be used for modelling, explaining how the data is processed

    throughout different stages and is made available for analysis and learning.

    Chapter Five presents the section of Modelling and Results. This chapter explains the

    machine learning solution to loop unrolling based on the regression approach and describes

    the results obtained with the techniques used. It details how the regression problem is

    formulated and why it is considered more general as compared to previous approaches that

    applied machine learning techniques to loop unrolling. It provides an overview of the

    regression methods that have been used in this project and defines the measure of

    performance that has been used to evaluate the methods. It presents and evaluates the results

    obtained and compares the solution found with previous work.

    Finally, the conclusions obtained after completing this project are presented with a

    summary of the work that has been done, the achievements obtained and the future work that

    is proposed for continuing research in the area of compiler optimisation with machine

    learning.

    An additional clarification needs to be provided before reading the following chapters. At

    some stages the term data mining is used to describe the work that has been developed in this

    project; sometimes this term seems to be equated with machine learning. However, the

    reader should bear in mind that although there is not a defined line that separates these fields,

    it is assumed that data mining is the complete process of discovering knowledge from data

    and that machine learning is only one important step in this process. Thus, even though the

    main goal of this project is to apply machine learning techniques to compiler optimisation,

    the work that has been developed is essentially a data mining application.

    Chapter One

    Literature Review

    1.1 Introduction

    This section describes the most relevant work involving machine learning and/or artificial

    intelligence techniques applied to compiler optimisation. Although there are other

    approaches that have used machine learning for specific tasks in compiler optimisation, they

    are not mentioned in this section as they are distant from the central idea and the aims of this

    project. As described throughout this chapter, the attractive idea of applying machine

    learning to compiler optimisation is relatively new. Some authors have proposed solutions

    that are easy to implement and automate and others have preferred methods that are difficult

    to deal with and can take a long time to obtain results.

    The previous work that is presented in this chapter has been divided into two groups: the use

    of machine learning as a general approach to compiler optimisation and the application of

    machine learning methods to a specific program transformation. Thus, while section 1.2

    presents previous work that has attempted to tune several code transformations by using

    Evolutionary Computing and Case-Based Reasoning, Section 1.3 explains how Decision

    Trees and Nearest Neighbours have been applied to learning in a particular program

    transformation: loop unrolling. Each approach will be briefly described, giving some details

    of the implementation and the learning technique used; a discussion concerning their

    advantages and drawbacks will be presented; finally, a summary and an explanation about

    how this previous work has influenced and motivated the present project will be given in

    section 1.4.

    1.2 Tuning heuristics and recommending program transformations

    Stephenson et al. (2003) used evolutionary computing to build up priority functions. Priority

    functions are, in some way, ubiquitous in constructing heuristics for compiler optimisation.

    In other words, compiler writers commonly rely on the assumption that a specific

    optimisation technique is strongly tied to a certain function, called priority function. This

    function involves some of the parameters that possibly affect a particular heuristic. Under

    this assumption, they used Genetic Programming [Koza, 1992] to search the priority function

    solution space. Their work was focused on three different heuristics: hyperblock formation,

    register allocation and data prefetching, using Trimaran [Trimaran] and the Open Research

    Compiler [ORC] for their experiments.

    Although it may be appealing for many computer scientists to construct programs based on

    evolution and natural/sexual selection, this approach has more drawbacks than advantages.

    To avoid extending this section more than necessary, I will mention only four aspects that may discourage the use of Genetic Programming and evolutionary computing for compiler optimisation. The first reason is related to a very popular term in machine learning: overfitting. Plainly explained, overfitting is the effect of fitting a model to the training data in such a way that it is unable to generalise and to perform successfully on novel data. In fact, the generalisation of Genetic Programming for this type of problem has not been clearly

    demonstrated and the results of the work of Stephenson et al. (2003) reflect this effect.

    Secondly, given that the so-called fitness function used to evaluate candidate solutions is strongly connected to the execution time of the programs, carrying out the task of selecting and evolving the solution may take a very long time. Furthermore, and this is the third justification for avoiding Genetic Programming for this problem, different runs of the technique on the same input can lead to different results, which is referred to as instability of the solution. Finally, but no less important, the adjustment of the parameters of

    the heuristic may be transferred to the tuning of the parameters of the technique itself.

    In an attempt to build an interactive tool to provide the user with a guide to performance tuning, Monsifrot and Bodin (2001) developed a framework based on Case-Based Reasoning

    (CBR). The purpose was to advise the user on possible code transformations in order to

    reduce the execution time of programs. They adapted the general idea of Case-Based

    Reasoning consisting in learning from past cases (See [Shen and Smaill, 2003] pages 72-78

    for an introduction or [Kolodner, 1993] for a thorough description) to the compiler

    optimisation problem by detecting fragments of code (i.e. loops) candidates to optimise,

    checking their similarities with other past cases and reusing the solution of these cases. The

    system was implemented using TSF [Bodin et al., 1998] and codes written in Fortran 77. The

    features used to characterise the loops were selected according to four different categories:

    loop structure, arithmetic expressions, array references, and data dependences. Loop

    transformations such as unrolling innermost and outer loops, unroll-and-jam and loop

    blocking were considered.

    Like Nearest Neighbours, Case-Based Reasoning belongs to Instance-Based Learning

    methods, which classify new instances according to their similarities with a set of training

    examples ([Mitchell, 1997] pages 230-245). Although Case-Based Reasoning can be

    considered a more sophisticated version of Nearest Neighbour methods, the work done in

    [Monsifrot and Bodin, 2001] does not detail important stages in CBR such as the

    modification of the prior solution or the repair of the proposed solution when it is not

    successful.

    An aspect to highlight in this approach is that it presents a wide variety of sensible and

    important features to characterise programs and, more specifically, to describe loops, that

    can be crucial when working with machine learning techniques. These features, also called

    indices in the CBR terminology, make possible the identification of past cases that can be

    reused and modified to provide the solution for a given problem. However, there are several

    caveats to mention about this approach. In the first place, an insufficient number of loops

    were used in the experiments, and they cannot be considered representative of programs. In fact, the initial experiments were on only one benchmark containing 64 loops. This clearly limits the capacity of the method to generalise to new problems. Furthermore, only one

    specific program was used to test the performance of the system, which reinforces the idea of

    biased results. Nevertheless, this is understandable given the difficulty of collecting data for

    this kind of application. Finally, possibly the greatest drawback of this solution is that it contributes little to the main goal of reducing the effort of compiler writers on optimisation. Certainly, unlike traditional compilers, the system suggests modifications without checking them for legality. Therefore, a lot of work is still left to the user, who is responsible for this task.

    1.3 Learning in a particular program transformation: loop unrolling

    In a more pragmatic approach, which was indeed the motivation for the present dissertation,

    Monsifrot et al. (2002) concentrated their efforts on a particular transformation

    for optimising the execution time of programs: loop unrolling. Based on a characterisation of

    loops from different programs written in Fortran 77, they wanted to investigate if it was

    possible to learn a rule that could predict when unrolling was beneficial or detrimental. In

    this case, unrolling was implemented at the source code level using TSF [Bodin et al., 1998]

    and the experiments were performed with the GNU Fortran Compiler [g77]. For Learning,

    they applied Decision Trees, which is an appropriate technique when the readability of the

    results is needed. Briefly, their aim was to build a binary classifier able to decide whether to

    perform loop unrolling or not.

    Even though their results were not so exciting, which is explainable because only one code

    transformation was taken into account, the methodology itself must be highlighted as well as

    the endeavour to obtain the loop abstraction. Nevertheless, it is necessary to mention several

    limitations about their work. Firstly, there is no reference to how much a loop should be

    unrolled and how this decision should be taken. In fact, this is an important issue in loop

    unrolling, because the technique can be advantageous for a specific unroll factor but

    detrimental for another one. Secondly, given the methods they used to carry out feature

    extraction, two loops with the same representation (loop abstraction) can belong to different

    classes: positive or negative, where positive refers to the case when unrolling causes an

    improvement in the execution time and negative to the opposite situation. This noisy

    training data seems to have severely affected their results. Finally, although it is explicitly

    recognised in the original paper, the fact of having a large number of negative examples and

    only a small number of positive examples also affected the performance of their technique.

    However, there was no effort to implement any solution to deal with this unbalanced data.

    Following the work of Monsifrot et al. (2002), Stephenson and Amarasinghe (2004) have

    recently published a technical report that describes how to go beyond the binary

    classification about the suitability of loop unrolling. In fact, besides predicting whether

    unrolling is beneficial or not, they also tried to predict the best unroll factor for loops. Unlike

    Monsifrot et al. (2002), who applied unrolling at the source code level, they implemented

    unrolling at the back-end of the compiler, and used the Open Research Compiler [ORC] for

    their experiments with codes written in C, Fortran and Fortran 90. As in [Monsifrot and

    Bodin, 2001], their technique was also based on Instance-Based Learning methods. In fact,

    they used Nearest Neighbours, a very simple machine learning method commonly used for

    classification, although it can also be used for regression. In summary, their work was focused on

    solving the multi-class problem of predicting the unroll factor for a loop that guaranteed its

    minimum execution time.

    Although not comparable, given the differences in the experimental methodology and the

    machine learning algorithm used, their results were relatively better than those obtained by

    Monsifrot et al. (2002). Aware of the variability in the execution times of the loops, they

    invested a lot of effort in the instrumentation of the loops and ran each experiment 30 times,

    taking the median as the representative for each training data point. Although this approach

    is very sensible and valid, a lot can be gained if information about the behaviour of the

    execution time of the loops under unrolling is maintained. Thus, it is possible to formulate a

    more general solution to the problem that directly models the improvement in performance

    of the loops when they are unrolled. This is the solution proposed in the present dissertation

    and it will be denoted as the regression approach.

    1.4 Summary

    This chapter has presented the relevant previous work that has applied machine learning to

    compiler optimisation. In general, machine learning has been used in order to provide a

    solution that can be applied to different optimisation problems but also has been focused on

    the application of the techniques to a specific code transformation: loop unrolling, which is

    the interest of the present project. Certainly, it is necessary to remark that this dissertation

    has been strongly motivated by the last two pieces of work described above, and many of

    their ideas were analysed to formulate a more general approach that can easily deal with

    variability and noisiness and can generalise more appropriately when encountering novel

    data.

    Chapter 2

    Background on Compiler Optimisation

    2.1 Introduction

    This chapter presents an overview of compilation and describes the most important concepts

    related to compiler optimisation. It is not the aim of this chapter to provide an in-depth study

    about compilers. Rather, this section attempts to familiarise the reader with the

    terminology involved in compiler optimisation, which will be used throughout this

    dissertation, and give the background necessary to understand subsequent chapters. A

    general-to-specific methodology is used to associate the different topics. Thus, the definition

    of compilation is provided in section 2.2 and the organisation of a compiler is given in

    section 2.3. The aim of a compiler is described in section 2.4 and the most important issues

    in compiler optimisation are explained in section 2.5, where the concept of program

    transformation is briefly described and some code transformations are mentioned. Section

    2.6 is devoted to loop unrolling, which is the program transformation of interest in the

    present dissertation. Finally a summary of the chapter is presented in section 2.7.

    2.2 Definition of compilation

    The term compilation can be defined as the process of transforming the source code of a

    program (in a high-level language) into an object code (in machine language) executable on

    a target machine. Although this definition can be seen as more restrictive and less general

    than others (see [Cooper and Torczon, 2004] for an alternative definition of compilation) it is

    sufficient to understand the following concepts related to compiler optimisation. Hence, the

    most basic idea of a compiler can be understood as a black box whose input is the source code of a program written in a high-level language and whose output is executable code for a specific machine, as shown in Figure 2.1.

    This mysterious black box is actually a very complex program responsible for making the

    job easier for many programmers. Indeed, it hides the complexity of translating an easy-to-use high-level language into a less understandable machine-dependent code. Usually,

    commercial compilers provide features beyond the definition given above, such as debugging support or even an Integrated Development Environment (IDE), and these features are

    commonly included in the functionality of the compiler itself. However, the interest of this

    section is focused on understanding this principal function of translation, how it is

    performed, which issues may affect it and how the code generated can be improved. To

    achieve this goal, it is necessary to start by describing the internal structure of a compiler.

    2.3 Compiler organisation

    In a very simple form, a modern compiler can be thought of as a three-layer structure where the

    output of one layer is the input of the following one. These three layers, namely the front-end, the optimiser and the back-end, perform different tasks that influence the ultimate code

    generated. A possible structure for a modern compiler is shown in Figure 2.2. The functions

    performed by each layer are:

    The Front-end is responsible for the lexical analysis and parsing. It checks if the source

    code satisfies the static constraints of the language in which it is implemented. Finally, it

    converts the code into a more convenient form called Intermediate Representation (IR).

    The Optimiser takes the intermediate representation of the program and applies a series of

    transformations that can possibly improve the object code.

    The Back-end receives the representation of the transformed code and converts it into a

    language that the specific machine can understand. This language explicitly deals with the

    management of physical resources available in the target architecture.

    [Figure 2.1: Basic form of a compiler. The source code (program) is the input to the compiler and the object code (executable) is the output.]

    Although Figure 2.2 presents compilation as a sequential process, sometimes this labour

    division between the layers is not clearly delimited. For example, after applying some

    transformations it could be necessary to perform additional analyses on the code that has been modified. However, it is instructive to bear in mind that there are three essential tasks performed by a compiler: analysis and transformation of the code into an intermediate

    representation, improvement of the intermediate representation with the help of several

    program transformations and translation of the code into a machine-understandable

    language.

    2.4 The purpose of a compiler

    Having explained the compiler organisation as a three-layer structure with well-defined

    functions, two questions can be raised about the final output of the compilation process.

    Firstly, what has been gained with the application of this process? Secondly, does the

    compilation of the program make sense if the meaning of the initial code is changed?

    [Figure 2.2: A possible structure of a modern compiler. The source code enters the front-end (convert the code into IR), then the optimiser (apply transformations), then the back-end (convert into assembly code), which produces the object code.]

    The first question goes to the core of the compilation process. Even in the case when the

    optimisation phase is not applied, a lot has been gained because it is the compilation process that makes the program executable on the target machine. Certainly, if compilers were not available, programmers would have to write their applications directly in assembly code,

    which clearly would take much more time than using a high level programming language1.

    Hence, it can be concluded that a compiler improves the initial code.

    Now, it is possible to answer the second question with a simple statement: If the compilation

    process changes the meaning of the initial program, all the effort invested by the compiler is useless. Unquestionably, a compiler must preserve the meaning of the program being compiled [Cooper and Torczon, 2004]. This preservation is usually referred to in the literature

    as the correctness of the compiler.

    2.5 An Optimising Compiler

    So far, the idea of a modern compiler has been introduced as a structure composed of three

    layers: the front-end, the optimiser and the back-end. The functions of the front-end and the

    back-end have been clearly explained but the role of the optimiser has been purposely

    described in a general way by using the term improvement. The following subsections

    discuss what optimisation means and why it is important in the compilation process.

    2.5.1 Goals of Compiler Optimisation

    Compiler optimisation is mainly related to the ability of the compiler to optimise or improve

    the generated code rather than enhancing the compilation process itself. Therefore, although

    it might be very important to reduce the compilation time, and in fact, it is an issue to take

    into account in compiler optimisation, the first goal of an optimising compiler is to discover

    opportunities in the program being compiled in order to apply some transformations to

    improve the code. Improvement, in this case, can refer to several goals depending on what the user actually requires. For example, the user might be interested in reducing the execution time of a program. However, the goal could also be to generate code that

    1 A person with a basic knowledge of computer science might argue that programmers could use interpreters. However, interpreters are also programs that transform source code into machine-understandable code. The difference is that they translate and execute line by line rather than the whole program at once, as happens with compilers. Additionally, it is widely accepted that compiled code is much more efficient than interpreted code.

    occupies the least possible space, or even a trade-off between speeding up the program and

    reducing the size of the code that has been created. Additionally, it also might be interesting

    to guarantee an efficient use of the resources of the target machine, for example memory,

    registers and cache. In general, there is not a unique objective when talking about compiler

    optimisation. Nevertheless, compiler optimisation is commonly associated only with the purpose of speeding up programs, and indeed, this is the goal the present dissertation tries to achieve. Therefore, the term optimisation will henceforth indicate the effect of reducing the

    execution time of programs.

    Compilers can improve the execution time of programs by carrying out subtasks such as

    minimisation of the number of operations executed by the program, efficient management of

    computational resources (cache, registers, and functional units) and minimisation of the

    number of accesses to the main memory. These tasks, jointly executed by the compiler, can greatly benefit not only the end-user of the program but also programmers and even hardware designers. The end-user logically profits from optimisation because the application executes faster. Programmers can ignore the details of producing the appropriate code for a

    particular machine and concentrate on high-level structures and a good application design.

    Finally, hardware designers can be confident that compilers can appropriately exploit the

    capabilities of their products.

    2.5.2 Considerations for program transformations

    As expressed above and depicted in Figure 2.2, a compiler can optimise programs by

    applying code transformations. These transformations will hopefully improve the code

    generated by reducing its execution time. However, there are several issues to consider when

    applying a particular transformation: correctness, profitability and compilation time.

    2.5.2.1 Correctness: As for the general compilation process, the transformation applied must be correct. The principle is basically the same: the code produced by a transformation

    must preserve the meaning of the input code. In other words, if the meaning of the program

    changes, the transformation should not be applied. Correctness is also referred to in the

    literature as the legality of the transformation. Bacon et al. (1994) provide a more formal

    definition of legality:

    A transformation is legal if the original and the transformed programs produce

    exactly the same output for identical executions
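    As a hypothetical illustration (this example is not from the thesis), the two C loops below behave very differently under loop reversal, one of the transformations listed later in Table 2.1: reversing the first loop is legal because its iterations are independent, whereas reversing the second changes the output because of a loop-carried dependence.

        /* Hypothetical illustration of legality (not taken from the dissertation). */
        #include <stddef.h>

        void independent(double *a, size_t n, double c)
        {
            /* Iterations are independent: reversing the iteration order is legal,
               since the final contents of a[] are identical.                       */
            for (size_t i = 0; i < n; i++)
                a[i] = a[i] + c;
        }

        void carried(double *a, size_t n, double c)
        {
            /* Loop-carried dependence: each iteration reads the value written by
               the previous one, so reversing the iteration order changes the
               output and the reversal would therefore be illegal.                  */
            for (size_t i = 1; i < n; i++)
                a[i] = a[i - 1] + c;
        }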

    However, in this case, the legality of the transformation is not sufficient. Since we are

    considering the process of transforming the code in order to optimise the program, a

    transformation must also be profitable.

    2.5.2.2 Profitability: A transformation applied to a particular fragment of a program must

    improve the ultimate code generated. In other words, the process of applying a particular

    transformation is expected to produce an improvement in the execution time of the program.

    This is in fact the purpose of applying a transformation for optimisation, but in many cases,

    the effect of transforming a program is not noticeable or, even worse, it leads to a detriment

    in performance of the final code generated2.

    Finally, although sometimes neglected, the compilation time is also an important factor in

    determining whether a transformation is actually beneficial or not. If the transformation

    considerably increases the compilation time of the program, there should be serious doubts

    about its application. Ideally, there should be a trade-off between the improvement in execution time and the increase in compilation time.

    2.5.3 The process of transforming a program for optimisation

    The transformation of a program for optimisation can be divided into three steps as follows:

    2.5.3.1 Identification: The compiler has to identify which part of the code can be optimised and which transformations may be applied to it.

    2.5.3.2 Verification: The legality of each transformation must be ensured, i.e. that the

    transformation does not change the meaning of the program.

    2.5.3.3 Conversion: Refers to the process of applying a particular transformation.

    2 Certainly, if all the transformations guaranteed an improvement in the performance of the final code, there would be no reason for research in this area or for the present dissertation. Therefore, the questions of how and when to apply a particular transformation constitute the major problem in compiler optimisation.

    The transformation process is depicted in Figure 2.3.

    Of these three steps, the identification phase, i.e. the process of recognising fragments of the program susceptible to optimisation and deciding which transformations can potentially improve the code, is the most complicated stage. Compiler writers commonly rely on heuristics to determine these transformations and the parameters that define them. However, given that many parameters may be involved and that one transformation may affect the applicability of subsequent transformations, these heuristics commonly lead to

    suboptimal solutions.

    2.5.4 The problem of interaction

    As explained above, compiler optimisation is a very complex problem not only because

    many parameters may be involved in each transformation but also because one

    transformation can impede or enable the applicability of another transformation. The latter is

    known as the effect of interaction among the different transformations. Certainly, the actual

    performance of the compiled code depends on the resulting outcome of the interactions

    among the transformations that were applied during the whole compilation process. In

    general, while some transformations are demonstrated to be beneficial for a particular

    program when applied independently, they can degrade the execution time of the program if

    they are sequentially executed.

    Thus, considering that optimisation problems such as register allocation and instruction

    scheduling are NP-complete in themselves and that different transformations can interact

    with each other, the compiler optimisation problem is far from being optimally solved.

    [Figure 2.3: Program transformations for optimisation. The code (IR) goes through identification (which part of the code?), verification (is it legal?) and conversion (apply the transformation), producing the transformed code.]

    2.5.5 Types of program transformations

    Program transformations can be split into two classes depending on which part of the

    compiler they are applied to. Thus, they can be classified as:

    2.5.5.1 Machine-independent transformations: convert an intermediate

    representation (IR) of the program into another intermediate representation. Consequently,

    the code generated does not depend on a specific machine or architecture. However, since

    the code produced is not the final code and may be susceptible to further changes, the

    profitability of these transformations cannot be ensured. High-level transformations that

    perform tasks such as eliminating redundancies, eliminating unreachable or useless code and enabling other transformations can be considered to belong to this group. Loop unrolling is, in general, an example of a machine-independent transformation.

    2.5.5.2 Machine-dependent transformations: Machine-dependent transformations are

    also called machine-level transformations. They convert the intermediate representation of

    the program directly into assembly code. Thus, the code generated is tied to a specific

    architecture. Those transformations that consider particularities of the target architecture

    belong to this group, for example, instruction scheduling, instruction selection and register

    allocation.

    2.5.6 The scope of optimisation

    Program transformations can be applied at different levels of the code. For example, they can

    be applied to statements, basic blocks3, innermost loops, general loops, procedures (intra-procedural) and the whole program (inter-procedural). Depending on this level of granularity, the complexity of the analysis increases and applying a particular

    transformation turns out to be more costly because it increases the compilation time of the

    program. Loop level transformations are very important as loops are considered to be the

    places where programs spend most of their time.

    3 A basic block can be defined as straight-line code; in other words, a block of code with no branches.

    Table 2.1: Some common transformations for compiler optimisation (taken from [Bacon et al., 1994])

        Loop Transformations: Loop Interchange, Loop Skewing, Loop Reversal, Loop Blocking (Tiling), Loop Pushing, Loop Fusion, Loop Peeling, Loop Code Motion, Loop Normalisation, Loop Unrolling

        Memory Access Transformations: Memory Alignment, Array Expansion, Array Contraction, Scalar Replacement, Code Co-location, Array Padding

        Redundancy Elimination: Unreachable Code Elimination, Useless Code Elimination, Dead Variable Elimination, Common Subexpression Elimination, Short-Circuiting

        Procedure Call Transformations: Frame Collapsing, Procedure Inlining, Parameter Promotion

        Partial Evaluation: Constant Propagation, Constant Folding, Algebraic Simplification


2.5.7 Some common transformations

To give an idea of the type and number of program transformations available for compiler optimisation, some of them are listed in Table 2.1 (taken from [Bacon et al., 1994]). Several of these transformations are very well known and extensively studied in the literature. For a good review of the meaning and implementation of these transformations, see [Bacon et al., 1994].

As it is the focus of the present dissertation, loop unrolling is described in more detail in section 2.6.

2.6 Loop Unrolling

Loop unrolling is a very straightforward but powerful program transformation mainly used to improve Instruction Level Parallelism, ILP (the execution of several instructions at the same time), and to reduce the overhead due to loop control. Although extensively studied in the literature ([Dongarra and Hinds, 1979] and [Davidson and Jinturkar, 2001]), it continues to be of interest to the compiler community. Understanding the interaction of the parameters that may affect it, as well as its effects on the execution time of the final code, is still a challenge.

2.6.1 Definition

Loop unrolling is the replication of the loop body a certain number of times u, called the unroll factor. As the loop body is replicated, the loop bounds (the loop termination code) must be adjusted in order to guarantee that the loop body is executed exactly the same number of times as in the rolled (original) version. To handle any leftover iterations, a prologue or epilogue is added before or after the unrolled loop. To illustrate clearly how loop unrolling works, the example shown in Figure 2.4, taken from [Bacon et al., 1994], depicts a loop that has been unrolled twice. The notation used does not belong to a specific language, although it is very similar to Fortran 77 except for the way elements of arrays are accessed. The left side of Figure 2.4 shows the original loop, composed of only one statement (the loop body), with an iteration step of 1 (the default). The right side shows the loop


unrolled using a factor u=2. Thus, the loop body is replicated twice4, the array accesses are modified accordingly and the iteration step is changed to 2. Since the value of the trip count (n-2) is unknown at compile time, an epilogue has been added to guarantee that the unrolled loop performs all the iterations of the original loop.

2.6.2 Implementation

Loop unrolling can be implemented by hand (manually) or by the compiler (automatically). It is manually implemented, as in Figure 2.4, either by the programmer or by a software transformation tool that works on top of the compiler. It can be automatically implemented by the compiler on the source code, on an intermediate representation of the program or on the assembly code (back-end), i.e. on an optimised version of the program. Implementing loop unrolling at the source code level or on an intermediate representation of the program can be more profitable than at the back-end of the compiler; for example, it might make the code more amenable to the application of other program transformations. However, it can also be unfavourable, because it may impede the application of other transformations. Arguably, one of the reasons to implement loop unrolling at the back-end of the compiler is that its profitability can be almost ensured when applied to one of the final representations of the program.

    4 There is no general agreement in the literature about considering the rolled version of the loop as u=1 or u=0. The former notation, which is believed to be more understandable, will be used throughout this dissertation.

Original Loop:

    do i = 2, n-1
        a[i] = a[i] + a[i-1] * a[i+1]
    end do

Loop unrolled twice:

    do i = 2, n-2, 2
        a[i] = a[i] + a[i-1] * a[i+1]
        a[i+1] = a[i+1] + a[i] * a[i+2]
    end do
    if (mod(n-2,2) = 1) then
        a[n-1] = a[n-1] + a[n-2] * a[n]     (epilogue)
    end if

Figure 2.4: Original loop (left) and loop unrolled by a factor u=2 (right) (taken from [Bacon et al., 1994])
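To make the mechanics of Figure 2.4 concrete, the following short sketch (plain Python used purely for illustration; it is not part of the thesis tool chain) reproduces the rolled and the unrolled-with-epilogue computation and checks that both yield identical results over a range of trip counts.

    import random

    # Arrays are treated as 1-based; index 0 is allocated but unused.
    def rolled(a, n):                        # original loop: do i = 2, n-1
        for i in range(2, n):
            a[i] = a[i] + a[i - 1] * a[i + 1]

    def unrolled_u2(a, n):                   # loop unrolled by a factor u = 2
        i = 2
        while i <= n - 2:                    # do i = 2, n-2, 2
            a[i]     = a[i]     + a[i - 1] * a[i + 1]
            a[i + 1] = a[i + 1] + a[i]     * a[i + 2]
            i += 2
        if (n - 2) % 2 == 1:                 # epilogue: one leftover iteration
            a[n - 1] = a[n - 1] + a[n - 2] * a[n]

    for n in range(3, 30):                   # the two variants agree for every trip count
        a = [random.random() for _ in range(n + 2)]
        b = list(a)
        rolled(a, n)
        unrolled_u2(b, n)
        assert a == b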


An important point to note is that, when correctly applied, unrolling is an always-legal transformation.

    In general, loop unrolling can offer several advantages that may improve the execution time

    of programs. However, these benefits can be diminished by some side effects of the

    transformation.

2.6.3 Advantages of loop unrolling

As mentioned above, loop unrolling can be considered a beneficial transformation because it may:

o Improve Instruction Level Parallelism (ILP). Instruction Level Parallelism refers to the ability to execute multiple instructions simultaneously. Hence, if the size of the loop body is increased, the number of instructions that can be scheduled in an out-of-order mode5 also increases, so that more instructions can be executed in parallel [Monsifrot et al., 2002].

o Reduce the overhead due to loop control. Loop overhead is caused by the increments of the loop variable, the tests applied to this variable and the branch operations. All of these operations are executed fewer times, since the number of iterations is reduced in proportion to the replication of the loop body. For example, in Figure 2.4 the loop overhead is reduced by half. Therefore, if the loop executes a considerable number of iterations, the improvement in the execution time due to the reduction of the loop overhead is appreciable.

    Additionally, loop unrolling can also:

o Enable other transformations. Loop unrolling applied at an early stage can give the code the appropriate shape for other transformations, for example, common subexpression elimination.

o Eliminate loop copy operations. Sometimes copy operations are necessary when the loop calculates a value that is needed in a subsequent iteration, which is known in the jargon of compilers as a loop-carried data dependency. These copy operations can be eliminated by unrolling.

o Improve memory locality (register, data cache or TLB6). Memory locality is improved when accesses to local memory resources are performed efficiently. For example, on the right side of Figure 2.4 the values a[i] and a[i+1] are each referenced twice within the unrolled body, so they can be reused rather than reloaded. Thus, the number of loads per iteration is reduced from 3 to 2.

5 This term refers to the situation when the instructions are not executed in the specific order given by the program.

2.6.4 Disadvantages of loop unrolling

By far the major drawback of loop unrolling is that it can degrade the performance of the instruction cache. Depending on the size of the loop body after unrolling (which grows with the unroll factor used) and on the size, organisation and replacement policy of the instruction cache, some instructions may no longer fit in the instruction cache and must be fetched from main memory, which can be orders of magnitude slower than the cache. When this happens, it is said that cache misses have occurred. The number of cache misses can certainly affect the final execution time of the program and diminish any benefit from loop unrolling; in the worst case, unrolling can degrade the execution time and slow the program down.

Besides degrading the instruction cache, loop unrolling can have other negative effects. For example, as the loop body becomes bigger, the number of instructions increases, which may increase the number of address calculations and make the instruction scheduling problem more complex. Moreover, additional loads and stores may be needed, causing a greater demand for registers; it is then said that the register pressure has increased. Register pressure is a measure of the ratio between the number of registers demanded by a program and the number of registers actually available on a particular machine. If the register pressure becomes much greater than 1, some register values have to be saved into main memory and freed for other purposes; in this case, it is said that the registers have been spilled [Bacon et al., 1994].

    6 TLB stands for Translation Lookaside Buffer. It is a table in the processor that maps virtual addresses into real addresses of memory pages that have been recently referenced.


Finally, another disadvantage of loop unrolling is that it may prevent the application of other optimisation techniques. In fact, after loop unrolling some transformations may no longer be profitable or may simply not be applicable.

2.6.5 Interactions, again

It might seem contradictory that one of the advantages of loop unrolling is that it enables some transformations while one of its drawbacks is that it limits the applicability of others. The explanation for this apparent contradiction is the already familiar hurdle of interactions. The interactions between most compiler optimisation techniques are still poorly understood, and the results can vary depending on the input program, the target architecture and the very high-dimensional space of transformations and their parameters.

2.6.6 Candidates for unrolling

Having explained the most important features of loop unrolling and described its positive and negative effects, one can form a general idea about which types of loops are candidates for unrolling in order to improve the execution time of programs. However, it is easier to suggest which loops are not good candidates. In general, loops with a very low trip count, with a large body, or containing procedure calls or branches are not very suitable for unrolling [Nielsen, 2004]. Nevertheless, these features are rather ambiguous: it is difficult to specify what counts as a low trip count or a large loop body, or how many (and how large) procedure calls or how many branches are acceptable. Thus, as expressed above, loop unrolling still remains an area of great interest for the compiler community.

2.7 Summary

This chapter has presented the most important issues involved in compilation and compiler optimisation that are necessary to understand later chapters and the purpose of this dissertation. The organisation of a compiler and the process of applying program transformations have been described, emphasising the principles of correctness and profitability. As it is the focus of the present dissertation, loop unrolling has been studied in detail, explaining why it is an important program transformation, what its advantages and disadvantages are and why the problem of determining when and how to apply this transformation is a challenge.


Additionally, important terminology about compilers has been introduced throughout this chapter. Expressions such as legality, intermediate representation, front-end, back-end, basic blocks, cache misses and register spilling were explicitly defined. Finally, the problem of interactions among the different optimisation techniques has also been mentioned and described as a hurdle for the compiler optimisation problem.


    Chapter 3

    Data Collection

3.1 Introduction

One of the most difficult obstacles to applying machine learning to compiler optimisation is the process of generating clean, reliable and sufficient data. As explained in chapter two, compiler optimisation is concerned with improving, in a specific sense, the final code generated by the compiler; in this case, improvement refers to the reduction in the execution time of programs. Therefore, for any approach that attempts to apply machine learning techniques in order to reduce the execution time of programs, the process of creating or evaluating the data depends strongly on this execution time.

    In fact, regardless of whether the solution proposed is based on supervised or unsupervised

    learning, the data that is utilised to build a specific model must arise in some way from the

    execution time of the programs or from the execution time of parts of them. For example, it

    was explained in Chapter One that Stephenson et al. (2003) used Evolutionary Computing to

    construct priority functions. For this approach, the process of evaluating the candidate

    solutions may require a very long time. Indeed, it is necessary to compute the whole

    population of functions on different programs in order to ascertain which are beneficial and

    actually lead to an improvement in performance. Similarly, it was also mentioned in Chapter

    One that Stephenson and Amarasinghe (2004) used Nearest Neighbours with the aim of

    predicting the best unroll factor for loops. The label for each loop was constructed by finding

    the unroll factor that guaranteed its minimal execution time. Each experiment was executed

    thirty times due to the variability of the measurements. Although not specified in the original

    paper, let us consider the ideal case where the interactions between loops are negligible and

    the programs are executed by using the same unroll factor for all the loops on each run.

    Using a maximum unroll factor of eight, each program should be executed at least 30 x 8 =

    240 times. As in the traditional approach used by compiler writers when building heuristics

    for optimisation, machine learning techniques for compiler optimisation should also work on

    a set of benchmarks that can be considered representative for specific tasks and challenging

    for current computers. Normally, these benchmarks have at least hundreds of lines and can


    take in general about 1 or 2 minutes for a normal execution. Hence, if a program must be run

    240 times and its execution time is one minute, obtaining the data for the loops within the

    program will take about 4 hours. Now, say for simplicity that on average a program contains

    about 20 loops that can be considered for unrolling. If one wanted to include 1000 loops in

    the training data, it would take about (1000 loops) x (1 program / 20 loops) x (4h) = 200

    hours. Obviously, this number can significantly increase with the preparation of the

    programs for a particular transformation, the compilation time, the effect of the

    instrumentation and the process of analysing and cleaning the data. Consequently, generating

    sufficient data for this type of application can be a time-consuming activity.
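The arithmetic above can be reproduced in a few lines; the following snippet (purely illustrative, using the same assumed figures of one minute per run and twenty candidate loops per program) recomputes the estimate.

    # Back-of-the-envelope estimate of the data-collection cost discussed above
    # (illustrative only; the figures are the assumptions made in the text).
    runs_per_program  = 30 * 8                   # 30 repetitions x 8 unroll factors = 240 runs
    minutes_per_run   = 1                        # assumed execution time of one run
    hours_per_program = runs_per_program * minutes_per_run / 60.0   # about 4 hours
    loops_per_program = 20                       # assumed number of candidate loops per program
    target_loops      = 1000                     # desired number of loops in the training data
    total_hours = (target_loops / loops_per_program) * hours_per_program
    print(total_hours)                           # -> 200.0 hours, before compilation and cleaning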

However, there are still questions to be asked about the meaning of sufficient. Are 1000 loops sufficient for learning? Should there be more, or could there be fewer? These are very difficult questions and, unfortunately, the present dissertation does not attempt to answer them, as they are beyond the scope of this project. Nevertheless, other issues involved in the

    data collection process are also important and need to be mentioned. Particularly, the details

    about how the experiments were carried out in order to generate the data must be explained.

    This not only contributes to a better understanding of the present dissertation but also allows

    other researchers to replicate the results obtained here.

    As clearly recognised by the data mining community, at least 60% of the time in a data

    mining application is devoted to understanding, analysing, pre-processing and cleaning the

    data. This project is not an exception and great effort has been invested in generating clean

    and appropriate data as well as in performing its analysis and pre-processing. Whilst the

    following chapter deals with the data analysis phase, this chapter presents the most relevant

    factors that were taken into account in carrying out the experiments and generating the data

    available for learning. Initially, the benchmarks that were used to construct the dataset are

    described in section 3.2. The implementation used for loop unrolling is explained in section

    3.3, providing useful information about the granularity of the instrumentation and the

    assumptions involved in this data generation process. The general process that has been

    followed is explained in section 3.4. Subsequently, some technical issues regarding the

    platform, the compiler and the optimisation level used for the experiments are provided in

    section 3.5. The results of the data collection are summarised in section 3.6 and the features

    extracted from the loops are described in section 3.7. The final representation of a loop

    composed of its features and execution times is presented in section 3.8. Finally, a summary

    of the data collection process and the experiments performed is given in section 3.9.


3.2 The Benchmarks

Three important factors may influence the selection of the programs that are used to generate the data for compiler optimisation with machine learning: programming language, type of application and execution time.

    Programming language: Several programming languages can be considered for applying

    machine learning to compiler optimisation. In fact, other researchers have included programs

    written in Java [Long, 2004] and Fortran 77 [Monsifrot et al., 2002], or have used

    benchmarks from a mixture of sources such as Fortran 77, Fortran 90 and C [Stephenson and

    Amarasinghe, 2004]. Although it is possible to have different programming languages in a

    set of benchmarks and to include the language in which the programs are written as an

    additional feature of each loop, it is worth focusing only on Fortran 77 for at least two

    reasons. Firstly, there is a great deal to be said for optimising programs that are written in

    Fortran, as it is considered the scientific programming language and most high-performance

computing applications have been developed in this language. Secondly, Fortran 77 lacks pointers, whose presence can restrict the application of some program transformations. Other issues, such as portability of the code to different platforms, can also affect the choice of programming language.

    Type of application: Benchmarks are designed to investigate the capabilities of different

    platforms and the architectures for particular tasks. For example, some benchmarks are

    demanding on floating point operations, others are challenging on integer operations or

    target specific applications such as graphics or digital signal processing. In principle, it

    would be valuable if one could consider a wide variety of applications and include them in

    the set of programs to analyse. However, as explained above, time limitations generally

    place constraints upon the types of benchmarks that may be used. For this project, since we

    are not considering a specific target such as network applications or multimedia programs, it

    is important to focus on numerical applications that are demanding enough for current

    computers. Therefore, the benchmarks should be as realistic as possible, given the final goal

    of building a solution that can generalise over real programs.

Execution time: Execution time plays an important role in choosing the benchmarks, given the time constraints one faces when generating the data for learning. It is certainly a

    trade-off between how challenging a program is for a particular machine and how long it

    takes to execute on that machine. Programs for which the execution time is very low will


probably not be rich in the features necessary to build a model that performs properly on novel data. On the other hand, programs that are rich in features and demanding for a particular machine will probably take a long time to execute, and one cannot afford to include them in the dataset. However, caution must be taken with programs that have a long execution time. The long execution time of some benchmarks may be attributable to the complexity of their input rather than to the richness of any demanding features they may have. In other words, it is sometimes the input of a benchmark, rather than the complexity of the program itself, that makes the execution time long. Therefore, these programs might not be important to analyse either.

    An additional consideration for choosing the programs when building the dataset for

    compiler optimisation is the bias introduced by selecting specific types of applications. For

    example, one could include some very simple computational kernels such as matrix

    multiplication for which a transformation like loop unrolling is known to be beneficial to

    some extent. Therefore, it would be possible to create more examples by varying some of the

    parameters of the kernels such as the trip count and the array sizes. Since the final goal of the

    learning technique is finding a solution that provides an improvement in performance for the

programs, the results would be biased towards these simple kernels. However, it is also true that numerical applications may include such computational kernels, so a lot can be gained by having them in the training data. Therefore, it would be profitable for the learning

    technique to have these simple kernels along with other more complex programs.

    Having considered the issues that may affect the choice of the benchmarks, the programs

    used for the experiments belong to the suites SPEC CFP95 [SPEC95] and VECTORD

    [Levine et al., 1991]. The benchmarks taken from [SPEC95] are scientific applications

    written in Fortran. These numerical programs, intensive in floating point operations,

    represent a variety of real applications and are still challenging for current computers. One of

these benchmarks, namely 110.applu, was significantly affected by the instrumentation, and

    another benchmark called 145.fpppp did not have appropriate loops to be unrolled.

    Therefore, they were discarded; only eight of ten possible programs from this suite have

    been used. The suite VECTORD [Levine et al., 1991] contains a variety of subroutines

    written in Fortran intended to test the analysis capabilities of a vectorising compiler. These

    subroutines include different types of loops whose features may be encountered in other

    applications.


    The description of each benchmark specifying the name, the number of lines, the number of

    subroutines, the area of application and the specific task performed is shown in Table 3.1

    (for the case of SPEC CFP95 benchmarks this information was obtained from the official

    web site [SPEC95]).

Benchmark     # Lines / # subroutines   Application Area                         Specific Task
101.tomcatv   190/1      Fluid Dynamics / Geometric Translation   Generation of a two-dimensional boundary-fitted coordinate system around general geometric domains.
102.swim      429/6      Weather Prediction                       Solves shallow water equations using finite difference approximations.
103.su2cor    2332/35    Quantum Physics                          Masses of elementary particles are computed in the Quark-Gluon theory.
104.hydro2d   4292/42    Astrophysics                             Hydrodynamical Navier-Stokes equations are used to compute galactic jets.
107.mgrid     484/12     Electromagnetism                         Calculation of a 3D potential field.
125.turb3d    2101/23    Simulation                               Simulates turbulence in a cubic area.
141.apsi      7361/96    Weather Prediction                       Calculates statistics on temperature and pollutants in a grid.
146.wave      7764/105   Electromagnetics                         Solves Maxwell's equations on a Cartesian mesh.
vector        5302/135   Variety of vectorial routines            Tests the analysis capabilities of a vectorising compiler.

Table 3.1: Description of the Benchmarks


3.3 Implementation of loop unrolling

As explained in section 2.6.2, loop unrolling can be implemented at the source code level, on an intermediate representation of the program or at the back-end of the compiler. For the experiments that generated the data in this project, a software framework that works on top of the compiler has been used; in other words, loop unrolling has been implemented at the source code level. This framework, developed in [Fursin, 2004], is mainly written in Java and provides a platform-independent tool to assist the user in compiler optimisation. The software, based on feedback-directed program restructuring ([Fursin, 2004], page 63), searches for the best possible code transformations and their parameters in order to minimise the execution time of programs.

    The unrolling algorithm used is a generalised version of the algorithm described in section

    2.6.1 and it is shown in Figure 3.1 (taken from [Fursin, 2004]). In this generalised version,

    the loop body is replicated u times in the first loop and an additional loop is introduced to

handle the leftover iterations.

3.3.1 Which loops should be unrolled?

As in other approaches ([Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004]), only innermost loops were chosen to be unrolled. Although this may be the most common choice throughout the literature, one should bear in mind that there are cases where unrolling outer loops can be beneficial, as explained in [Nielsen, 2004]. However, in order to keep the complexity of the transformer low and to guarantee the legality of the transformations, outer loops were not considered. This is not very restrictive; many innermost loops can be significantly improved by unrolling.

3.3.2 Initial experiments

Due to the characteristics of the transformation framework used to perform unrolling, the initial experiments executed to generate the data were based on program-level timing. Firstly, innermost loops are chosen from a particular program and the maximum unroll factor is set to eight (U = 8). The framework runs a program U times, corresponding to the different unroll factors for a loop, recording the execution time of the whole program for each run. Subsequently, the best unroll factor found for that loop is fixed and the software executes the program for each unrolled version of the following loop. This process is repeated until all the


    loops are unrolled. In this case, it is said that the framework follows a systematic search

    strategy. This strategy works under the assumption that there is no interaction between loops.

    In other words, it means that fixing an unroll factor for a specific loop will not severely

    affect the performance of the execution of another loop. Furthermore, there is one advantage

    of having program-level profiling for measuring the execution times: the intrusion caused by

    the instrumentation is negligible. Indeed, since only the execution time for the whole

    program is measured, the performance of each loop is minimally affected by this

    instrumentation. However, there is a major drawback to carrying out the experiments in this

    way: the time needed to obtain the data for a specific program grows with the maximum

    unroll factor used and with the number of loops considered for unrolling. Thus, the number

    of times a program must be executed in order to be analysed is U x L, where U is the

    maximum unroll factor and L is the number of loops to be analysed. If, for example, a

    program contains forty loops and the maximum unroll factor used is eight, the program will

need to be executed 40 x 8 = 320 times. If one wished to include a considerable number of programs, the process of generating the data would be extremely time-consuming. Despite

this fact, the initial experiments followed this approach. Unfortunately, an additional inconvenience was discovered after obtaining the data for all the benchmarks: for

    most of the programs, the improvement found by the search strategy was only comparable to

    the variability of their execution times. Using signal-processing terminology, the signal of

    improvement was swamped by the noise. This fact utterly impeded the use of any criterion

    for selecting the loops and including them in the dataset. Therefore, it was necessary to

adopt a loop-level granularity.
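For illustration, the program-level systematic search can be sketched as follows (plain Python; run_program and set_unroll_factor are hypothetical stand-ins for the operations of the actual [Fursin, 2004] framework). The sketch makes explicit that the number of whole-program executions is U x L.

    import random

    def run_program():                            # stand-in: pretend whole-program timing
        return random.uniform(40.0, 60.0)

    def set_unroll_factor(loop, u):               # stand-in: the real framework rewrites the source
        pass

    def systematic_search(loops, U=8):
        # Program-level systematic search: U whole-program runs per loop, U * L runs in total.
        best, runs = {}, 0
        for loop in loops:                        # loops are handled one at a time
            times = {}
            for u in range(1, U + 1):             # try every unroll factor for this loop
                set_unroll_factor(loop, u)        # factors already chosen remain fixed
                times[u] = run_program()          # record the whole-program execution time
                runs += 1
            best[loop] = min(times, key=times.get)
            set_unroll_factor(loop, best[loop])   # fix the best factor before the next loop
        return best, runs

    best, runs = systematic_search([f"loop{i}" for i in range(40)], U=8)
    print(runs)                                   # -> 320 runs for 40 loops and U = 8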

3.3.3 Loop-level profiling

Unlike program-level profiling, which measures the execution time of the whole program, loop-

    level profiling measures the execution time of each loop within the program. In the case of

    determining the execution time after unrolling, timers are inserted around each loop and the

    program is run U times (one for each unroll factor), setting up the same unroll factor for all

    the loops during each run. As in the case of program-level profiling, there is an assumption

    of independence in this process. There are no runs that involve different unroll factors for

    different loops. Indeed, analysing all the possible combinations of unroll factors for all the

loops within a program is simply not feasible. Additionally, there is a drastic reduction in the number of times a program must be executed: in this case, the number of runs is equal to the

    maximum unroll factor used. This fact dramatically reduces the time invested in obtaining


    the data, given that it is not dependent on the number of loops a program may have. There is,

    however, a great shortcoming when following this process: the effect of the

    instrumentation. Certainly, given that timing functions with their own variables and

    instructions were inserted around the loops in order to determine their execution time, the

    performance of a loop may be affected by this instrumentation. For example, some loop

    instructions cannot be kept in the cache because other instructions corresponding to the

instrumentation code already occupy that space. The effect of the instrumentation is especially noticeable for those loops that are called many times in the program, and it is compensated only if the loops take a considerable amount of time to execute. Therefore,

    caution must be taken when selecting the loops to be included in the dataset, avoiding those

    loops for which the execution time and/or trip count is very low.
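A rough sketch of the loop-level scheme is given below (plain Python with a toy stand-in for an instrumented benchmark; the real framework inserts timing calls into the Fortran source instead). The key point is that only U runs are needed, regardless of the number of loops.

    import time
    from collections import defaultdict

    # Toy stand-in for an instrumented benchmark with two candidate loops; the
    # unroll factor u only illustrates the interface and is ignored by the toy bodies.
    def toy_program(u):
        yield "loop1", lambda: sum(i * i for i in range(200_000))
        yield "loop2", lambda: sum(range(400_000))

    def profile_loops(instrumented_program, U=8):
        # One program run per unroll factor; all loops use the same factor in a run.
        elapsed = defaultdict(dict)
        for u in range(1, U + 1):                          # only U runs, independent of L
            for loop_id, body in instrumented_program(u):  # the program reaches each loop
                start = time.perf_counter()                # timer inserted "around" the loop
                body()
                elapsed[u][loop_id] = elapsed[u].get(loop_id, 0.0) + time.perf_counter() - start
        return elapsed

    times = profile_loops(toy_program, U=4)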

Original loop:

    do i = 1, n
        S1[i]
        S2[i]
    end do

Unrolled loop (unroll factor = u):

    do i = 1, n, u
        S1[i]
        S2[i]
        S1[i+1]
        S2[i+1]
        ...
        S1[i+u-1]
        S2[i+u-1]
    end do
    do j = i, n
        S1[j]
        S2[j]
    end do

The first loop contains the loop body replicated u times; the second loop processes the remaining elements.

Figure 3.1: Generalised version of loop unrolling (taken from [Fursin, 2004])
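The index pattern behind Figure 3.1 can be checked with a small sketch (illustrative Python; the bound of the main loop is written explicitly here so that only complete groups of u indices are processed before the remainder loop takes over).

    # Every index 1..n is visited exactly once, in order, for any unroll factor u.
    def visited_indices(n, u):
        visited, i = [], 1
        while i + u - 1 <= n:              # main loop: body replicated u times per iteration
            visited.extend(range(i, i + u))
            i += u
        visited.extend(range(i, n + 1))    # do j = i, n: processing the remaining elements
        return visited

    for n in range(1, 50):
        for u in range(1, 9):
            assert visited_indices(n, u) == list(range(1, n + 1))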


3.4 Generating the targets

As explained above, the process of generating the targets, i.e. the execution times, followed a loop-level granularity. The process starts by preparing the benchmarks for the transformation tool and ends after running each program under the different unroll factors. The steps involved in this process are briefly explained as follows.

3.4.1 Preparing the benchmarks

Some of the benchmarks may not be handled appropriately by the transformation tool. Specifically, some loop constructions are problematic for the transformer and must be converted into a form that the framework can properly manage.

3.4.2 Selecting loops

Having prepared a specific program for the framework, the next step is selecting the loops that are believed to be appropriate for unrolling. In general, to avoid introducing bias into the data, the only restriction is that they must be innermost loops. Additionally, loops containing calls to subroutines were not considered, given the difficulty of determining the actual effect of unrolling when other loops may be involved within the called subroutine.

3.4.3 Profiling

The loops must be profiled in order to calculate their execution time and to identify which loops are insignificant. Loops whose execution time is very low are considered insignificant because no real improvement can be measured for them.

3.4.4 Filtering

After the profiling step, it is possible to determine which loops should not be included in the dataset. As mentioned above, the instrumentation has an intrusive effect that may severely affect loops with a low trip count or a low execution time. In general, if the execution time of a loop is less than a threshold T, the loop is discarded.


3.4.5 Running the search strategy

This step involves running the benchmarks for each unroll factor, with a maximum unroll factor U common to all the loops and benchmarks.

The process of generating the targets followed the steps explained above, with a filtering threshold T = 0.4 seconds and a maximum unroll factor U = 8. Additionally, each benchmark (for all unroll factors) was executed ten times, henceforth referred to as the number of runs R = 10, in order to account for the variability of the execution times.
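Putting the steps of sections 3.4.1 to 3.4.5 together, the target-generation procedure can be summarised by the following sketch (illustrative Python with synthetic stand-ins for the profiling and execution steps, not the actual tooling), using the parameters T = 0.4 s, U = 8 and R = 10 given above.

    import random

    T, U, R = 0.4, 8, 10                       # filtering threshold (s), max unroll factor, repetitions

    # Hypothetical stand-ins for the real tooling, used only to make the flow concrete.
    def innermost_loops(benchmark):
        return [f"{benchmark}:loop{i}" for i in range(5)]

    def profile_loop(loop):                    # baseline execution time of one loop (seconds)
        return random.uniform(0.0, 5.0)

    def run_benchmark(benchmark, loops, unroll):   # one instrumented run: per-loop times
        return {l: random.uniform(0.3, 5.0) for l in loops}

    def generate_targets(benchmark):
        loops = innermost_loops(benchmark)                    # 3.4.2: select innermost loops
        loops = [l for l in loops if profile_loop(l) >= T]    # 3.4.3/3.4.4: profile and filter
        targets = {}
        for u in range(1, U + 1):                             # 3.4.5: run the search strategy
            for _ in range(R):                                # R repetitions per unroll factor
                for loop, t in run_benchmark(benchmark, loops, u).items():
                    targets.setdefault((loop, u), []).append(t)
        return targets                                        # R execution times per (loop, u)

    targets = generate_targets("102.swim")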

    In order to completely describe the data collection process it is necessary to mention some

    technical issues regarding the hardware and software resources that were utilised to execute

    the experiments.

    3.5 Technical Details

3.5.1 The platform

A dual Intel(R) XEON(TM) machine running at 2.00GHz, with 512 KB of level-2 cache and 4GB of RAM, has been used for the experiments. The operating system installed on this machine is Red Hat Linux (kernel 2.4.20-24.9).

3.5.2 The compiler

The GNU Fortran compiler [G77], gcc-3.2.2-5, has been used. As remarked in section 2.6.5, the interactions between loop unrolling and other transformations constitute a potential problem. Hence, the optimisation level chosen for the experiments was -O2 with no additional flags. Unrolling at the compiler level was switched off by avoiding the options -funroll-loops and -funroll-all-loops; otherwise the compiler could apply unrolling to the already unrolled code.

3.5.3 Timer precision

The transformation framework uses the C function clock() to measure the execution time at loop level. This function was found to have a precision of 0.01 seconds on the machine used for the experiments: the minimum elapsed time it can detect is 0.01 seconds. In other words, it would not be


    possible to detect a difference between two runs less than this precision. Considering that the

    threshold used for filtering the loops is 40 times the precision on this particular machine, it

    does not represent a problem for this project.

3.6 The results in summary

The results of the data collection process are shown in Table 3.2. In order to facilitate its execution, the suite VECTORD [Levine et al., 1991] has been divided into four different programs, given the independence among its subroutines. The execution times for the original code and for the instrumented code (with timers) are given, together with the contribution of each benchmark to the dataset in terms of the number of loops. An important observation from Table 3.2 is that, for most of the benchmarks, the execution time is not severely affected by the instrumentation of the code. In fact, only three benchmarks, namely 107.mgrid, 125.turb3d and 141.apsi, experienced a notable increase in their execution time. The benchmark most influenced by the instrumentation was 125.turb3d, due to the great number of times some of its loops are called within the program. However, unlike 110.applu, which was discarded, 125.turb3d was kept, as it could affordably be executed several times and most of its loops have an acceptable execution time.

3.7 Feature extraction

So far it has been explained how the execution times of selected loops have been generated for different unroll factors and multiple runs. Explicitly, the result of the process described above is the execution time of each loop after unrolling with u = 1, ..., U. Each execution was repeated R times in order to capture its variability for a specific unroll factor. Hence, the data collection process generated L x U x R execution times, where L is the number of loops, U is the maximum unroll factor considered and R is the number of repetitions of each run. Bearing in mind that the approach followed in this project is to build a regression model able to learn the improvement in performance of the execution time of loops, these quantities are called the targets. Therefore, a regression model must be able to learn a function7 for which the output is a value based on the targets for a specific loop and

7 Actually, as will be explained in chapter 5, a different model will be built for each unroll factor, i.e. there will be U functions, each corresponding to a different unroll factor.


    for which the input is a characterisation of this loop. This section describes the features

    extracted from the programs that were used to characterise the loops.

                         Execution Time (sec.)
Benchmark       Original      Instrumented      # Loops
101.tomcatv     40.8          45.2              5
102.swim        39.9          43.0              3
103.su2cor      51.3          53.5              15
104.hydro2d     61.4          61.6              35
107.mgrid       44.5          102.5             15
125.turb3d      73.3          594.7             12
141.apsi        54.6          161.2             23
146.wave5       42.5          59.4              25
Vectord_1       146.3         146.4             51
Vectord_2       148.3         148.7             49
Vectord_3       163.9         167.9             13
Vectord_4       160.0         160.2             2
Total number of loops                           248

Table 3.2: Results of the data collection process

    One of the most important issues in a data mining application is selecting the right features

    for learning. In fact, the characterisation of a problem should be carried out by selecting

    those features that are believed to influence the targets to learn. As explained in Chapter

    Two, there are many factors that may influence loop unrolling, such as hardware components

of the target architecture, other code transformations applied after unrolling and characteristics of the program itself. It is not unreasonable to characterise a loop for unrolling based on its static features, i.e. those that can be determined at compilation time. However, it is necessary to emphasise that dynamic features, i.e. those that are determined at execution time, are also important and may not be captured by the static representation; for example,

    the number of cache misses. The characterisation of loops in this project is mainly based on


    static features and only two of them, namely the trip count and the total number of times the

    loop is called, were determined during the execution of the programs. This loop abstraction

    is mostly based on the description presented by [Monsifrot and Bodin, 2001] and [Monsifrot

    et al., 2002]. The loop characterisation presented by [Stephenson and Amarasinghe, 2004] is

    not applicable to the present approach given the differences in the implementation of loop

    unrolling.

    The features extracted to characterise the loops are shown in Table 3.3. Feature 1 (called) is

    the total number of times the loop is called within the program. Therefore, it represents the

    number of times the outer loops are executed and the subroutine containing the loop is

    called. The size (feature 2) of the loop refers to the number of statements within the loop. A

    statement may contain one or more lines. The trip count (feature 3) of the loop determines

    the number of times the body of the loop is executed. Given that for most of the loops this

    feature is unknown at compilation time, it was determined during the execution of the

    programs. For some loops it was found that the trip count was variable depending on some

    parameters of a subroutine. In these cases a weighted average was calculated. Feature 4

    considers the number of calls to proper functions of the language. Feature 5 (Branches)

    refers to the number of if statements within the loop. Feature 6 is the nested level of the loop.

    Features 7 and 8 represent the number of array accesses within the loop depending upon

    whether an array element is loaded or stored. A less straightforward feature is the number of

    array element reuses (feature 9). It attempts to measure dependency among different

    iterations. Although dependency analysis is a very complicated topic in compilers, a simple

    approach has been taken. The number of reuses of a particular array is computed as the total

    number of elements involved in its update when it is controlled by the iteration variable. An

    example is given in Figure 3.2 where the number of reuses is three. Finally, feature 10

    (Floating) considers the number of floating-point operations and feature 11 (indAcc)

    represents the number of indirect array accesses. An indirect array access occurs when an

    array is used as an index of another array.

    do i = 1, N
        a[i] = a[i] + a[i-1] * a[i+1]
    end do

Figure 3.2: An example of a loop containing three array element reuses


    Index Name Description

    1 Called The number of times the loop is called

    2 Size The number of statements within the loop

    3 Trip The trip count of the loop

    4 Sys The number of calls to proper functions of the language

    5 Branches The number of if statements

    6 Nested The nested level of the loop

    7 Loads The number of loads

    8 Stores The number of stores

    9 Reuses The number of array element reuses

    10 Floating The number of floating point operations

    11 IndAcc The number of indirect array accesses

    Table 3.3: Features extracted to characterise loops
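As an illustration of this characterisation, the feature vector for the simple loop of Figure 3.2 might look as follows (the values are examples only; called and trip are dynamic quantities that would only be known after running the program, and the nesting level is assumed to be one).

    # Illustrative feature vector for the loop of Figure 3.2 (example values only).
    features = {
        "called":   1,       # assumed: the loop is entered once in the program
        "size":     1,       # one statement in the loop body
        "trip":     1000,    # assumed value of N observed at run time
        "sys":      0,       # no calls to functions of the language
        "branches": 0,       # no if statements
        "nested":   1,       # assumed nesting level
        "loads":    3,       # a[i], a[i-1] and a[i+1] are read
        "stores":   1,       # a[i] is written
        "reuses":   3,       # three array elements involved in the update (see the text)
        "floating": 2,       # one addition and one multiplication
        "indacc":   0,       # no indirect array accesses
    }
    x = [features[name] for name in features]    # the characterisation used for learning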

3.8 The representation of a loop

The features described above, together with the execution times over R repetitions for each unroll factor, compose the dataset constructed by the data collection process. Figure 3.3 shows the representation of a datapoint, where the specific unroll factor used has been omitted for simplicity. In other words, for each unroll factor a loop has a representation of the form shown in Figure 3.3.

    x1  x2  ...  xN  |  t1  t2  ...  tR
    (features)          (execution times over R repetitions)

Figure 3.3: The representation of a datapoint
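In code, a datapoint for a given loop and unroll factor is simply the concatenation of its feature vector with the R measured execution times; a minimal sketch (using NumPy, with made-up numbers) is:

    import numpy as np

    N, R = 11, 10                                    # number of features and of repetitions
    x = np.array([1, 1, 1000, 0, 0, 1, 3, 1, 3, 2, 0], dtype=float)   # features x1..xN (example)
    t = np.array([0.52, 0.51, 0.53, 0.52, 0.52,      # execution times t1..tR for one unroll
                  0.51, 0.54, 0.52, 0.51, 0.52])     # factor (made-up values, in seconds)
    datapoint = np.concatenate([x, t])               # the row depicted in Figure 3.3
    assert datapoint.shape == (N + R,)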


3.9 Summary

This chapter has described the data collection process and has remarked upon the difficulties

    that appear during the generation of data when applying machine learning to compiler

    optimisation. In general, constructing a dataset for this purpose can be a time-demanding

    activity. Great effort has been invested in this project in order to generate clean and reliable

    data. Hence, competitive benchmarks have been chosen in order to keep the bias towards

    simple applications as minimal as possible. Loop unrolling has been applied to innermost

    loops within these benchmarks by using a framework developed in [Fursin, 2004]. Only

    loops with sufficient execution time have been taken into account in order to reduce the

    effect of the instrumentation. The general process utilised for generating the targets

    (execution times for loops) consists of preparing the benchmark, selecting loops, profiling

    and filtering the loops and running the programs using different unroll factors. Unroll factors

    from 1 to 8 were used and each run was repeated 10 times in order to consider variability.

    After having determined which loops should be included, a feature extraction process was

    performed over each loop using a set of 11 features that were selected mainly based on static

    characteristics of the loops. Finally, the representation of a datapoint (loop) composed of

    features and execution times resulting after the data collection process has been summarised.


    Chapter 4

    Data Preparation and Exploratory Data Analysis

4.1 Introduction

The last chapter described the data collection process and provided technical and

    methodological details about how the experiments were carried out in this project. These

    experiments have generated the data that in principle will be available for learning. As has

    been highlighted, the construction of a dataset that can be used by learning techniques with

    the aim of optimising the execution time of programs is a time-demanding activity.

    Consequently, great effort has been invested in producing a considerable amount of data that

    can be reliably used by machine learning techniques. This effort has mainly been focused on

    determining the criteria used for choosing the benchmarks; the preparation of these

    benchmarks; the selection of appropriate loops to be unrolled; the generation of the

    execution times (also called the targets) and the selection of correct features that describe the

    loops and constitute the characterisation of the problem. Unfortunately, the raw data

    produced by the data collection process is not suitable to be directly used by any machine

    learning technique and it needs to be pre-processed and refined. There are several reasons for

    this. Firstly, it must be recalled that the ultimate goal of loop unrolling is optimising a

    program by determining whether a particular unroll factor is beneficial or not, i.e. if

    unrolling a loop a specific number of times represents a significant improvement with

    respect to the case of maintaining the original (rolled) version of the loop. Hence, it is

    necessary to apply a validation process to the data in order to establish if there is an actual

    improvement in performance to be learnt. In other words, if unrolling does not yield a

    potential reduction in the execution times of the loops that are included in the dataset, the

    process of modelling the data and trying to learn from it is useless. The second reason to

    avoid using the unprocessed data that has been collected is that it can be very difficult to

    learn from the pure execution times. Indeed, since the interest of this project is mainly

    focused on predicting the effectiveness of loop unrolling, the execution times do not directly

    reflect the improvement that needs to be learnt. Therefore, these execution times must be


    transformed into another more convenient representation that explicitly indicates how good

    or bad the transformation is for a particular loop. Furthermore, some loops that were not

    removed after being profiled may still need to be discarded from the dataset, as their mean

execution time is low. Similarly, some measurements deviate markedly from the distribution of the execution times for a particular loop under a specific unroll factor. These outliers should also be removed from the dataset, as they may degrade the learning process. The

    final reason for which pre-processing the data is essential relates to the transformation of the

    features that constitute the representation of the loops. In fact, this representation must be

    rescaled in order to prevent some features being considered more important than others.

    This chapter tackles the problems mentioned above and provides an insight into the data that

    will be used for modelling. The rest of this chapter is organised as follows. Section 4.2

    explains how the data is processed throughout different stages (by using several software

    resources) and is made available for analysis and learning. Section 4.3 provides a formal

    representation of the data and introduces the notation that will be used in subsequent

    sections. Section 4.4 analyses the data and validates its suitability in order to be used by

    machine learning techniques. Section 4.5 presents the importance of pre-processing the

    execution times, giving details about the elimination of some loops from the dataset, the

    transformation of the targets and the detection and treatment of outliers. Section 4.6 explains

    how the features are also pre-processed in order to facilitate the application of learning

    techniques. Finally, section 4.7 summarises the most important aspects of this chapter.

4.2 The general framework for data integration

It was explained in chapter three how the benchmarks selected for this project were used to

    generate the execution times at a loop-level profile with the aid of a software tool that works

    on top of the compiler. From a more general perspective, the generation of the targets and

    the selection of the features for loops are only two steps during the whole process of making

    the data available for analysis and learning. In fact, the framework used for applying loop

unrolling and obtaining the execution times, which will be referred to in this section as EOS [Fursin, 2004], produces a set of files that must be parsed and from which the targets must be extracted. In order to automate this task, a program written in Java, referred to in this section

    as Java Parser, was developed. This program reads the files produced by EOS and transforms

    the results into easy-to-load text files. The files produced by the Java Parser are loadable by

    Matlab subroutines that integrate them with the features extracted from the programs.


    Furthermore, these subroutines make possible the analysis, exploration and pre-processing of

    the data and communicate with modelling subroutines also written in Matlab. This process

    is shown in Figure 4.1.

4.3 Formal representation of the data

The data that has been generated by repeatedly executing the programs using the unrolled version of the loops can be denoted as t_{j,k}^{u}, representing the execution time of the jth run (j = 1, 2, ..., R) of loop k (k = 1, 2, ..., L), which has been unrolled u times (u = 1, 2, ..., U).

    Here, the notation used in a great part of the literature about loop unrolling has been adopted,

    where an unroll factor of one (u = 1) corresponds to the original version of the loop, i.e. with

    no unrolling at all.

    Similarly, the features explained in section 3.7 describe a particular loop. Hence, the

characterisation of loop k can be denoted as a vector x_k whose elements x_{i,k} (i = 1, 2, ..., N) represent the features that compose the loop.

Figure 4.1: The general framework for data integration. The benchmarks are processed by EOS, which produces the execution times; the Java Parser converts these into raw data files which, together with the extracted features, are pre-processed, modelled and turned into results in Matlab.


4.4 Is this data valid?

The first question to be answered before starting to pre-process the data that has been

    collected is whether this data can actually be useful and suitable for applying learning

    techniques. Therefore, the aim of this section is to ascertain the validity and suitability of the

    data by finding out if the improvement in performance of the loops for unroll factors

    different from one (u=1) represents an actual reduction of the execution times and is not only

    a consequence of the variability of the data. In section 3.3.2 it was explained that the data

    resulting from the initial experiments using program-level profiling was discarded. In fact,

    the improvement obtained by loop unrolling in most of the benchmarks was only comparable

    to the variability of the execution times under no unrolling. Therefore, there were no criteria

    to select the loops that should be included in the dataset. The problem to be explained in this

    section is essentially the same but the aim now is to determine if the final data obtained

    represents a potential improvement that could be predicted by machine learning techniques.

    In other words, considering that the final goal in compiler optimisation is related to the

    realisation of speed-ups on a set of programs, the focus here is on the maximum possible

    improvement that may be reached by any learning technique. Although the detrimental effect

    of loop unrolling is also important to this research, the data will prove inappropriate should

    no loops be found to be improved by unrolling.

4.4.1 Statistical analysis

To enable the assessment of the suitability of the data, the measurements have been repeated

    ten times in this project. A sensible approach to establishing the appropriateness of the data

    aims to validate the significance of the improvement with the aid of statistics. The objective

    is to determine how many loops in the dataset are improved by unrolling and when this

    improvement is statistically significant. Using the notation introduced in section 4.3, a

    particular loop k has associated R execution times (due to repetitions) for each unroll factor

    u. Therefore, it is reasonable to compare the R measurements for each unroll factor with the

    execution times for the rolled (original) version of the loop. Although it may be possible to

apply a t-test to validate the hypothesis of different means for u=1 and u>1, that is, H_0: \bar{t}_k^{u=1} = \bar{t}_k^{u=i} with i > 1, it would be necessary to apply U-1 tests, which can considerably

    increase the probability of drawing at least one incorrect conclusion. Therefore, a one-way

    analysis of variance (one-way ANOVA) followed by a multiple comparison (multi-


    comparison) procedure has been applied. Additionally, only improvements in performance

    but not detriments have been considered.

    In general, the purpose of ANOVA is to test if the means of several groups are significantly

    different. In our case, the groups are the unroll factors considered. Thus, ANOVA tests the

    null hypothesis that the means do not differ by using the p-values of the F-test. Essentially,

    the F-test measures the between-groups variance compared to the within-groups variance.

    Therefore, if the variance between groups is considerably greater than the variance within

    groups, there are serious doubts about the null hypothesis. Since we are interested in

    establishing if there is a significant difference between the mean execution times of the

    original loops and the unrolled version of the loops, a multi-comparison procedure is

needed. Multi-comparison procedures also determine whether this difference is positive (i.e. an

    improvement in performance) or negative (i.e. a detriment in performance). All the tests

were executed at the 5% level of significance and Tukey's honestly significant difference

    criterion was used as the critical value for the multi-comparison procedure (for a complete

    description of ANOVA and multiple comparison procedures see [Neter et al., 1996] pages

    663-701 and pages 725-738).
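As an illustration of this procedure, the following Python sketch applies the same test to a single loop. It is not the dissertation's original implementation (which relied on MATLAB), and the container `times`, holding the R = 10 repeated execution times per unroll factor, is a hypothetical structure.

```python
# A minimal sketch (not the dissertation's MATLAB code) of the test applied to a
# single loop: a one-way ANOVA across the unroll factors followed by Tukey's HSD
# comparison against the rolled version (u = 1), at the 5% significance level.
# `times` is a hypothetical dict mapping each unroll factor to its R = 10
# repeated execution times.
import numpy as np
from scipy import stats

def significantly_improved_factors(times: dict, alpha: float = 0.05) -> list:
    """Unroll factors whose mean execution time is significantly lower than for u = 1."""
    factors = sorted(times)                          # e.g. [1, 2, ..., 8]
    samples = [np.asarray(times[u]) for u in factors]

    # One-way ANOVA: do the group means differ at all?
    _, p_anova = stats.f_oneway(*samples)
    if p_anova >= alpha:
        return []

    # Tukey's honestly significant difference for all pairwise comparisons.
    tukey = stats.tukey_hsd(*samples)
    baseline = factors.index(1)
    improved = []
    for i, u in enumerate(factors):
        if u == 1:
            continue
        # statistic[i, baseline] = mean(times[u]) - mean(times[1]); only a
        # significant *reduction* of the execution time counts as an improvement.
        if tukey.pvalue[i, baseline] < alpha and tukey.statistic[i, baseline] < 0:
            improved.append(u)
    return improved
```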

    In order to provide a better understanding of whether unrolling is considered significantly

    beneficial or not regarding the procedure explained above, Figure 4.2 shows the box plots of

    the execution times for two different loops. The lower and upper lines of the boxes represent

    the 25th and the 75th percentile of the execution times for each unroll factor. The horizontal

    lines in the middle of the boxes are the medians. The dashed lines (also called whiskers)

    represent the variability of the execution times throughout R=10 repetitions for each unroll

    factor, where the top is the maximum execution time and the bottom is the minimum. Thus,

    the upper part of Figure 4.2 shows the case of a loop for which unrolling does not correspond

    to a significant improvement in performance. In fact, although the execution times for u=3,

u=5 and u=6 are less than the minimum for u=1, they are not considered significant due to the

    variability of the measurements. The opposite case is shown in the lower part of Figure 4.2.

    For this loop, the variability of the execution times is low and the improvement in

    performance due to loop unrolling, e.g. for u=4 is significant. It is clear for this case that loop

    unrolling is beneficial as there is no overlapping between the execution times for u=1 and for

    u=4.


    The procedure explained above was applied to all the loops in the dataset and the results are

    summarised in Table 4.1. The contribution to the dataset of each benchmark in terms of the

    total number of loops and in terms of the number of loops for which unrolling causes a

    statistically significant improvement of performance is shown. It can be seen in Table 4.1

    that the number of loops that can be improved by unrolling in the SPEC CFP95 [SPEC95]

    suite is considerably less than in the VECTORD [Levine et al., 1991] benchmarks.

    Furthermore, 42% of the loops may be significantly improved by loop unrolling in the whole

    set of benchmarks. It is necessary to emphasise that the loops included in Table 4.1 can also

    be negatively affected by unrolling because some unroll factors may be detrimental to the

    performance of a specific loop. However, since we are analysing the case of how many loops

    in the dataset can be improved by loop unrolling, it is possible to conclude that the reduction

    in the execution time of the loops included in the dataset is not only caused by the variability

    of the measurements but also by the effect of the transformation. Thus, the data that has been

    collected can be used by machine learning techniques in order to model its improvement in

    performance.

    Until now, it has been considered whether there is a positive impact of loop unrolling on the

    set of benchmarks. However, it may be of interest to formulate two additional questions

    regarding the behaviour of the execution times that have been collected for this project.

    Primarily, is there any negative impact of the transformation on these benchmarks?

    Secondly, how much can unrolling affect the execution time of these loops? To answer these

    questions, it is necessary to explain the pre-processing stage of the execution times, i.e. how

    the targets have been transformed in order to facilitate their analysis and to appropriately

    apply the modelling techniques.

4.5 Pre-processing the targets

As indicated by Han and Kamber (2001), pre-processing the data can significantly improve

    and ease the application of learning techniques. In this case, we are interested in eliminating

    some execution times that should not be considered because of their low magnitude;

    transforming these execution times into other values that may be more appropriate for

    learning; and detecting some strange data-points (outliers) that are somehow anomalous and

    do not follow the general behaviour of similar data-points.


    Figure 4.2: Insignificant (top) and significant (bottom) effect of loop unrolling


Benchmark        # Loops   # Loops with improvement for u > 1
101.tomcatv          5          2
102.swim             3          1
103.su2cor          15          5
104.hydro2d         35          3
107.mgrid           15          1
125.turb3d          12          4
141.apsi            23          6
146.wave5           25          3
Vectord1            51         38
Vectord2            49         39
Vectord3            13          1
Vectord4             2          2
Total              248        105

Table 4.1: Number of loops that may benefit from unrolling

4.5.1 Filtering

In section 3.4.4 program-profiling permitted the recognition of loops with a low execution

    time and their removal from the set of loops that were considered for unrolling. However,

    after the data collection process, loops with very low execution time remained, which went

undetected during that phase. Since all the repetitions are available at this stage, it is

    possible to determine the behaviour of the execution time for the original (rolled) version of

    a specific loop. Therefore, a similar criterion may be established to remove loops with low

    execution time from the dataset.

Let us consider the R repetitions of the execution time for the original version (with no unrolling) of loop k, $t^{\,1}_{j,k}$. A straightforward criterion to eliminate or maintain this loop can be

    ascertained by deciding whether its mean is greater than a threshold T or not. Thus, the

    mean can be computed by:


$$\bar{t}^{\,1}_{k} = \frac{1}{R}\sum_{j=1}^{R} t^{\,1}_{j,k} \qquad (4.1)$$

Therefore, the criterion is to keep loop k in the dataset if $\bar{t}^{\,1}_{k} > T$ and to remove it otherwise.
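A minimal sketch of this filtering step is given below; `rolled_times` is a hypothetical dictionary mapping each loop to its R repetitions for u = 1, and `threshold` stands for the value T chosen to discard loops that run too briefly to be timed reliably.

```python
# A minimal sketch of the filtering criterion above. `rolled_times` is a
# hypothetical dict mapping each loop to its R repetitions t_{j,k}^1 for u = 1,
# and `threshold` stands for the value T used to discard loops that run too
# briefly to be timed reliably.
import numpy as np

def filter_short_loops(rolled_times: dict, threshold: float) -> dict:
    """Keep only the loops whose mean rolled execution time exceeds T."""
    return {k: t for k, t in rolled_times.items() if np.mean(t) > threshold}
```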

In the worst case, though, the performance of some loops is deteriorated by more than 100%.

    In conclusion, although a great number of loops are not appreciably affected by unrolling, it

    is worth predicting when they are improved or deteriorated by the transformation and how

    much the execution time of these loops may be influenced. The predictive modelling used to

    tackle this problem will be explained in the following chapter. However, additional pre-

    processing needs to be applied in order to make the data suitable for the prediction

    techniques.

    4.6 Pre-processing the features

4.6.1 Rescaling

The values of the features selected to characterise the loops may significantly influence the

    learning techniques. For example, some features may outweigh others. A common approach

    to diminish this effect and make the variables have similar magnitudes is rescaling each

    feature to have zero mean and unit variance.

    Thus, the update of feature i for loop k is obtained as follows:


$$x_{i,k} \leftarrow \frac{x_{i,k} - \bar{x}_{i}}{s_{i}} \qquad (4.3)$$

Where i = 1, 2, ..., N; $\bar{x}_{i}$ is the mean and $s_{i}$ is the standard deviation of the variable in the training data. Namely:

$$\bar{x}_{i} = \frac{1}{K}\sum_{k=1}^{K} x_{i,k} \qquad (4.4)$$

$$s_{i}^{2} = \frac{1}{K-1}\sum_{k=1}^{K} \left(x_{i,k} - \bar{x}_{i}\right)^{2} \qquad (4.5)$$

    Where K is the number of loops considered in the training data.
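A small sketch of this standardisation is shown below, assuming the loops are arranged as rows of hypothetical `train` and `test` feature matrices and that the test data is scaled with the training statistics, as equations (4.4) and (4.5) require.

```python
# A small sketch of the rescaling in equations (4.3)-(4.5): every feature is
# standardised with the mean and standard deviation computed on the training
# loops only. `train` and `test` are hypothetical K x N feature matrices
# (loops as rows, features as columns).
import numpy as np

def rescale(train: np.ndarray, test: np.ndarray):
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)     # unbiased estimate, as in (4.5)
    return (train - mean) / std, (test - mean) / std
```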

4.6.2 Feature selection and feature transformation

Feature selection is the choice of a subset of features from all the variables available for

    learning. Working with a great number of features can demand more training data that may

    be difficult to obtain. Additionally, some features can be found to be irrelevant and the

    variables that mainly influence the learning can be a reduced subset of them. Domain

    knowledge or very well known techniques such as forward selection and backward

    elimination help to identify those features that can be discarded without worsening the

    performance of the learning algorithm. For example, from the set of 11 features presented in

    Table 3.3, it may be suggested that only five variables are determinant for loop unrolling,

    namely the number of floating point operations, the number of array element reuses, the

    number of loads by iteration, the number of stores and the number of if statements.

    Although the amount of features selected for this project is acceptable compared to the size

    of the dataset, it is important to find which features are found to be the most important for

    the learning techniques used. This will provide a better understanding of loop unrolling. The

    best features for each technique and the method used to obtain them will be explained in the

    following chapter.

    Feature transformation is the process of creating new features in order to considerably

    improve the performance of the learning techniques. Again, domain knowledge can help to

construct the right ones. For example, an intelligent variable to be constructed for applying code transformations is the ratio of memory references to floating-point operations. This feature tells us how the memory is affected in relation to the number of operations performed per


    iteration, and may not be easily found by a specific learning technique. How this feature is

    explicitly constructed and its effect on the learning process will also be explained in the

    following chapter.

4.7 Summary

This chapter has provided an insight into the execution times resulting from the data

collection process and has described the need for the pre-processing stage in order to make

    the data suitable for learning. The general framework in which the data has been prepared

    and made available to the learning algorithms has been described. The execution times have

    been analysed in terms of the improvement obtained by applying loop unrolling and it has

    been found that about 42% of the loops that constitute the dataset can be, statistically,

    significantly improved. This fact is important because it validates the appropriateness of the

    benchmarks in order to build models that can potentially improve the performance of

    programs. The pre-processing stage has been divided into two parts: the pre-processing of

    the targets and the preparation of the features. The targets (the execution times) have been

    filtered in order to guarantee that only loops with a sufficient execution time are maintained.

    After discarding some loops due to their low execution times, the detection of outliers has

taken place and has been focused on identifying those loops for which some execution times are considerably greater or smaller than the median for a specific unroll factor. These outliers

    have also been eliminated from the dataset. As the final phase in the target pre-processing

    stage, the execution times have been transformed into more suitable magnitudes that directly

    indicate the improvement or detriment in performance of unrolling with respect to the mean

    execution time of the original loop. These magnitudes are the actual targets for the learning

    techniques explained in the following chapter. The new values obtained made possible a

    deeper study of the effect of loop unrolling on the benchmarks used. Specifically, it was

    found that approximately from 40% to 50% of the loops included in the dataset were not

    considerably affected by unrolling and consequently, their improvement was nearly zero. It

    was also discovered that a great number of loops were positively affected by the

    transformation. Roughly, from 30% to 40% of the loops experienced a reduction in their

    execution time when they were unrolled. Nevertheless, some loops (20%) were found to be

    negatively affected by the transformation. These results provide a good understanding of the

appearance of the data and demonstrate how difficult it is to build a decision rule for unrolling loops.


    Figure 4.3: Number of Loops vs. Mean Improvement in Performance (%)


    Finally, in order to avoid some features being considered more important than others for the

    learning algorithms, each variable that characterises the loops was rescaled to have zero

    mean and unit variance.

    To conclude, with the results obtained by the analysis phase and the pre-processing stage

    explained in this chapter, the data is ready to be used by learning algorithms in order to build

    a model capable of predicting how beneficial or detrimental unrolling can be for a particular

    loop.


    Chapter 5

    Modelling and Results

5.1 Introduction

Chapter Four has emphasised the importance of cleaning, analysing and preparing the data

    that has been collected in order to make it suitable for learning. This suitability relates to the

    maximum improvement in performance that may be achieved when applying machine

    learning techniques. However, the term learning has not been explicitly defined within the

    context of this project. Here, we are referring to learning from data. Thus, the aim is to build

    a model able to learn from past examples by discovering the underlying structure of the data

    in order to accurately predict on novel data. Therefore, in this project, learning is focused on

    predictive modelling, i.e. given a dataset called the training set, the goal is to build a model

    that can generalise and successfully perform on data that has not been seen before.

In general, predictive modelling can be thought of as the process of constructing a function that

    maps a vector of input variables (predictor variables) onto one or more output variables

    (response variables) [Hand et al., 2001]. This function is constructed based on a training

    dataset but is expected to successfully perform on new data. Predictive modelling is

    commonly divided into two different approaches: classification and regression. In a

    classification problem, the output variable is categorical. In other words, the targets, i.e. the

values of the output variable are discrete and are commonly referred to as classes or labels. The works by [Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004], described in Chapter One, are examples of classification problems. The former attempted to solve the

    binary classification problem of predicting whether loop unrolling is beneficial or not

    (targets as classes 1 or 0) for a particular loop given its static representation (values of the

    input variables). The latter was also focused on loop unrolling but stated the classification

    task as a multi-class problem where the possible labels for the output variable are the unroll

    factors that were considered from one to eight. Clearly, both cases aim to predict a

    categorical variable based on a set of features of the loops. Thus, they used machine learning

    techniques to achieve that goal. This project goes beyond the classification problem and

    proposes a regression approach to predict the improvement in performance that unrolling can


produce on a particular loop. Undoubtedly, the regression approach is a more general

    formulation of the problem and previous solutions represent particular cases. Furthermore,

    the machine-learning solution to loop unrolling based on a regression method is smoother

than the one obtained by classification methods given the degree of noisiness the

    measurements may have. Indeed, the likely event of two loops with the same characterisation

    (the same values of the set of features that represent them) having different best unroll

    factors constitutes a great difficulty for the classification task. However, the case of two

    loops with a slightly different improvement in performance but the same representation is

not a problem for the regression approach. Although it is not the main goal of predictive modelling, it is advantageous if the results of the techniques used for regression are

    readable and understandable. This will provide a better insight into the problem. This chapter

    presents the machine learning solution to loop unrolling based on the regression approach

    and describes the results obtained with the techniques used. The organisation of this chapter

    is presented below.

    Section 5.2 explains in detail how the regression problem is formulated and why it is

    considered a more general view compared to previous approaches that apply machine

    learning techniques to loop unrolling. Section 5.3 presents an overview of the regression

    methods that have been used in this project and section 5.4 provides information about the

    parameters that are involved in each method. Section 5.5 describes the measure of

    performance that has been used to evaluate the methods. Section 5.6 provides the

    experimental methodology that was adopted to obtain the results. Section 5.7 presents and

    evaluates these results and compares the solution found with previous work. Finally, section

    5.8 summarises the most important theoretical concepts and experimental results presented

    in this chapter.

5.2 The regression approach

The regression approach to applying machine learning to loop unrolling proposed in this

    project attempts to build a function for a particular unroll factor u. This function aims to map

    a representation of a loop onto an expected improvement in performance. In other words, the

    input of this function is a vector of features for a specific loop and the output is the expected

    improvement in performance that the loop can achieve after being unrolled u times. Two

    important further clarifications need to be mentioned. Firstly, a different function is

    constructed for each unroll factor, i.e. there will be U functions that represent the behaviour


    of the performance of a loop when it is unrolled. Secondly, the output has been called

    improvement but it can also be a detriment in performance.

In a more formal way, let us consider the magnitudes obtained by the transformation applied to the execution times of loop k (described by equation 4.2), $y^{u}_{j,k}$ (j = 1, 2, ..., R). These values represent the improvement in performance of loop k under a specific unroll factor u

    throughout R repetitions. This particular loop is characterised by the N-dimensional vector of

    features xk that have been scaled to have zero mean and unit variance (see Equation 4.3).

    Although it could be possible for learning algorithms to predict a vector of targets for a

    specific loop, let us convert the repetitions through index j into a sole magnitude by

    calculating its mean.

$$\bar{y}^{\,u}_{k} = \frac{1}{R}\sum_{j=1}^{R} y^{\,u}_{j,k} \qquad (5.1)$$

Where, for the sake of clarity, the bar over $\bar{y}^{\,u}_{k}$ has been omitted in what follows and the magnitude is written simply as $y^{u}_{k}$. However, it should always be borne in mind that this magnitude represents the mean improvement in performance throughout R repetitions. Thus, the data available for learning can be represented as $(\mathbf{x}_{k}, y^{u}_{k})$, k = 1, ..., L, where L is the number of loops that are considered in the training data. Hence, a regression approach for this data attempts to construct a function of the form $f^{u} = f(\mathbf{x}; \boldsymbol{\beta}^{u})$, where $\boldsymbol{\beta}^{u}$ is the parameter vector of the model⁸. Thus, given a loop k described by its feature vector $\mathbf{x}_{k}$, it is possible to obtain $f^{u}_{k}$ that predicts its mean improvement in performance when it is unrolled using the factor u, i.e. $y^{u}_{k}$.

⁸ This notation assumes that a set of parameters are involved in the model. However, for non-parametric models, these parameters may not exist. For simplicity, we consider that the notation used can refer to both parametric and non-parametric models.
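An illustrative sketch of how the regression targets could be assembled is given below; the array names are assumptions rather than part of the original implementation, and the averaging follows equation (5.1).

```python
# An illustrative sketch (array names are assumptions): the R repeated
# improvements of every loop are averaged, as in equation (5.1), to give a
# single regression target per loop and unroll factor.
import numpy as np

def build_targets(improvements: np.ndarray) -> np.ndarray:
    """improvements: (L, U, R) array of y_{j,k}^u; returns the (L, U) means y_k^u."""
    return improvements.mean(axis=2)
```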

    Two advantages of this approach over previous work need to be highlighted: robustness and

    generality, and they are described below.

    Robustness: The classification solution may be severely affected by the noisiness of the

    measurements and the limited number of examples. Indeed, as explained in Chapter One, the

binary classification approach described in [Monsifrot et al., 2002] is likely to have two loops with the same feature vector but assigned to different classes. This represents a

    potential problem for any classifier if it is not appropriately treated. Additionally, even in the

    case of the multi-class approach presented in [Stephenson and Amarasinghe, 2004], the

    learning process may become very difficult. Certainly, there may not be sufficient examples

    that indicate the correct unroll factor for a type of loop and, therefore, the best factor to be

    used by unrolling will be erroneously predicted. These complications are not present in the

    regression approach where the aim is to build a function that smoothly predicts the

    improvement in performance. Consequently, the solution provided is more robust to the

    intrinsic variability of the data.

    Generality: The multi-class problem of establishing the best unroll factor for loops is a

    particular case of the regression approach. Clearly, the most appropriate unroll factor to be

    used for a specific loop can be easily determined by calculating the greatest improvement in

performance obtained by the different models constructed (u = 1, ..., U). The binary decision

    of whether unrolling is beneficial or not is also straightforward.

5.3 Learning methods used

A wide variety of modelling techniques is available to solve regression problems. In general,

    these techniques can be divided into two groups: parametric and non-parametric models.

    Whilst the former makes assumptions about the underlying distribution that generates the

data, the latter is more data-driven and does not place constraints on the form of the function that constitutes a good representation of the data. It is necessary to remark that in our context a good representation does not necessarily imply the best fit for the training data but

    relates to the predictive power the algorithm can have when novel examples are presented.

    Parametric models may assume, for example, that the underlying function that generates the

    data is linear on the parameters of the model. This function can be a linear combination of

    the input variables or a combination of other functions such as polynomials, exponentials or

    gaussians. Among non-parametric models, other methods are important such us kernel

    machines (Support Vector Regression), Decision Trees or Artificial Neural Networks. Other

    techniques such as Instance-Based Learning (e.g. Nearest Neighbours methods) can also be

    used for regression problems, although they are more popular for classification.

    Although there is not a perfect method that can be used for regression, some techniques may

    be more appropriate for the problem this project deals with. Considering that the number of


    examples in the training data is limited, very complicated methods such as Neural Networks

    should be avoided. In fact, there is latent danger of overfitting the data and, although this can

    be diminished in Neural Networks by regularisation, this technique may demand more

    training examples than those currently available. Another important issue for the present

    research is to give an interpretation to the results obtained by a specific method. Thus,

methods for which interpretability of the results is very difficult or even impossible, such as Neural Networks, are not convenient for this problem. Hence, it seems that Occam's razor is applicable to this problem: try the simplest hypothesis first. Certainly, the complexity of

    the problem is unknown and a lot of effort can be saved if a simple model is found to be

    sufficient to solve it. In other words, if it is possible to find a simple method with a good

    performance on data that is not included in the training set, it should be preferable over other

    more complex models with comparable performance.

    Therefore, the first method used to carry out the predictive modelling in this project is Linear

    Regression. Since the input for the regression problem in this project is multivariate, i.e. a

    vector instead of a simple scalar magnitude, this method will be referenced as Multiple

    Linear Regression. Multiple Linear Regression is a simple but extensively used technique to

    predict a variable based on a linear combination of the input variables. Thus, this method

    works under the assumption that there is no interaction among the predictor variables, i.e.

    one variable does not have an effect on the others. Furthermore, it is a linear additive model,

    i.e. the effects of the variables are independent and they are linearly added to produce the

    final prediction. Although these assumptions seem very restrictive, Linear Regression has

    demonstrated to perform well even in some cases where the relation between the predictor

    variables and response variable is known to be non-linear. Another reason for which Linear

    Regression represents a good approach to be attempted is that the parameter vector obtained

    by the model may indicate the relevance of each variable. Thus, with this linear model, it is

    possible to ascertain the variables that most influence loop unrolling.

    However, in many situations the relation between the input variables and the output variable

    is very complex and cannot be properly modelled by linear techniques. In these cases, it is

    necessary to apply a non-linear approach that may have better predictive power. To consider

    the general case in which no assumptions about the underlying process that generates the

data are made, a non-parametric model has been adopted. Namely, Decision Tree Learning

    has been used as the second choice of the regression approach in this project. Being more

    specific, the Classification and Regression Trees (CART) algorithm described in [Breiman et


    al., 1993] has been applied. Besides modelling non-linear relations between the predictor

    variable and the response variable, Decision Trees can provide good insight into the features

    that are more important for loop unrolling given that its results are produced in a tree-

    structured fashion. However, it is also recognised that if the number of training examples is

not sufficient to provide a representative sample of the true population, Decision Trees can

    also overfit the training data (see [Mitchell, 1997] pages 66-69). In general, Decision Trees

    may suffer from fragmentation, repetition and replication [Han and Kamber, 2001], which

    can deteriorate the accuracy and comprehensibility of the tree. Fragmentation occurs when

    the rules created by the tree (the branches of the tree) are only based on a small subset of the

    training examples tending to be statistically insignificant. Repetition takes place when one

    attribute is evaluated several times along the same branch of the tree. Replication occurs

    when some portions of the tree (subtrees) are duplicated. These drawbacks appear to be

    serious when applying Decision Trees Learning to solve the present regression approach.

    Nevertheless, several improvements have been developed in the recent years in order to

    overcome these problems. Essentially, if a good pruning algorithm is applied to the primary

    tree created by the method and the performance is roughly the same, these problems can be

    diminished. Therefore, a lot can be gained with a non-linear and non-parametric technique

    to solve the regression problem, being especially cautious to avoid overfitting the data.

    The following sections will present a brief overview of each method in order to provide a

    better understanding of how they work and to describe the most important issues to take into

    consideration for each of them.

5.3.1 Multiple Linear Regression

Multiple Linear Regression models the response variable based on a linear combination of

    the predictor variables. In an N-dimensional input space, a linear regression model tries to fit

    a hyperplane to the training data. Hence, a linear model for the regression approach can be

    stated as:

$$f^{u} = \boldsymbol{\beta}^{u} \cdot \mathbf{x} + \beta^{u}_{0} \qquad (5.2)$$

Where the upper index u has been used to emphasise that one model must be created for each unroll factor. The parameters of the model, also called regression coefficients, are $\boldsymbol{\beta}^{u}$ and $\beta^{u}_{0}$. $\boldsymbol{\beta}^{u}$ is an N-dimensional vector corresponding to the coefficients of each variable and $\beta^{u}_{0}$ is the free parameter or bias.

Therefore, given a loop described by its N-dimensional vector of features $\mathbf{x}_{k}$, it is possible to find the predicted mean improvement in performance under the unroll factor u by calculating:

$$f^{u}_{k} = \boldsymbol{\beta}^{u} \cdot \mathbf{x}_{k} + \beta^{u}_{0} \qquad (5.3)$$

In order to fit this model to the data given by $(\mathbf{x}_{k}, y^{u}_{k})$, where k = 1, ..., K, and K is the number of loops considered in the training set, it is possible to find the parameters of the model that minimise the mean square error (MSE) between the predictions $f^{u}_{k}$ and the actual values $y^{u}_{k}$. These parameters can be found by:

$$\boldsymbol{\beta}^{u} = \left(\mathbf{X}^{u\mathsf{T}}\mathbf{X}^{u}\right)^{-1}\mathbf{X}^{u\mathsf{T}}\mathbf{y}^{u} \qquad (5.4)$$

Where now, $\boldsymbol{\beta}^{u}$ is an (N+1)-dimensional vector (its last component is $\beta^{u}_{0}$); $\mathbf{X}^{u}$ is a matrix of dimensions K×(N+1) for which each row is a different data-point and the last column is composed of ones; and $\mathbf{y}^{u}$ is a vector containing the actual target values for each data-point.

    In general, in order to find the solution given by equation (5.4), the calculation of the inverse

    is not necessary and numerical techniques of linear algebra are used instead.
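A minimal sketch of this fitting procedure is shown below, assuming the rescaled feature vectors are stacked in a matrix `X` and the mean improvements for each unroll factor are held in a hypothetical dictionary `targets`; a numerical least-squares routine is used instead of forming the inverse in (5.4).

```python
# A minimal sketch of fitting one linear model per unroll factor. X is the
# (K x N) matrix of rescaled feature vectors and `targets` a hypothetical dict
# mapping each unroll factor to its K mean improvements.
import numpy as np

def fit_linear_models(X: np.ndarray, targets: dict) -> dict:
    """Return, per unroll factor, the (N+1)-vector [beta^u, beta^u_0]."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])    # append the bias column of ones
    params = {}
    for u, y in targets.items():
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        params[u] = beta
    return params

def predict_improvement(beta: np.ndarray, x: np.ndarray) -> float:
    """Equation (5.3): predicted mean improvement of one loop under one factor."""
    return float(beta[:-1] @ x + beta[-1])
```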

5.3.1.1 Understanding the regression coefficients: It is clear that the components of the parameter vector $\boldsymbol{\beta}^{u}$ are the coefficients of the input variables. Recalling that the features have been scaled to have zero mean and unit variance, the value of $\beta^{u}_{i}$ indicates the effect of the variable $x_{i}$ when all the other variables are held constant. Thus, these coefficients measure

    the importance of the variables in order to predict the mean improvement in performance. If,

    for example, one coefficient of a variable is nearly zero, that variable can be considered an

    irrelevant predictor.


5.3.2 Classification and Regression Trees

Decision Trees constitute a learning technique commonly used for classification problems

although, as in the case of the CART algorithm, they are also applicable to regression problems. The CART algorithm, comprehensively described in [Breiman et al., 1993], does not make any

    assumptions about the distributions of the independent or dependent variables. The aim of

    the algorithm is to produce a set of rules able to accurately predict the dependent variable

    based on the values of the independent variables. The resulting rules can be seen in a tree-

    structured fashion. It works by recursively partitioning the data into smaller subsets, testing

    the value of one variable at each node. The criterion used to split the data each time is based

    on an impurity measure and CART exhaustively tests all the variables at each node. At each

    leaf of the tree (terminal node) the predicted value is calculated as the mean of the dependent

    variable in the subset that meets the conditions of the respective branch. Given that the size

    of the tree can considerably increase at the risk of overfitting, CART computes the final tree

    by determining the subtree with minimal cost by using cross-validation. It is worth noting

    that although computing the mean at the leaf nodes as the possible predicted values may

    sound unsophisticated, the subsets corresponding to these terminal nodes are believed to be

    as homogeneous as possible. Furthermore, after pruning, the final tree may provide

    understandable rules that potentially give a better insight into the interactions among the

    variables.

5.4 Parameters setting

Unlike other machine learning methods, Multiple Linear Regression and CART require

    minimal tweaking in order to determine the best parameters of the algorithms for which the

    greatest performance may be obtained. Certainly, there are no parameters to tune in Multiple

Linear Regression. Since the implementation of CART provided in MATLAB's Statistics Toolbox was used, the splitting criterion or measure of impurity was the Least Squares function. Additionally, pruning was performed by means of cross-validation.
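An analogous configuration could be sketched with scikit-learn (the dissertation itself used MATLAB, so the following is only an illustrative equivalent): a regression tree grown with the squared-error splitting criterion and pruned by cross-validated cost-complexity selection.

```python
# An illustrative scikit-learn sketch of the CART setup described above: a
# regression tree using the least-squares (squared-error) splitting criterion,
# with the pruning strength chosen by cross-validation over the tree's
# cost-complexity pruning path.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

def fit_pruned_cart(X: np.ndarray, y: np.ndarray) -> DecisionTreeRegressor:
    base = DecisionTreeRegressor(criterion="squared_error", random_state=0)
    # Candidate pruning strengths taken from the cost-complexity pruning path.
    alphas = base.cost_complexity_pruning_path(X, y).ccp_alphas
    search = GridSearchCV(base, {"ccp_alpha": np.unique(alphas)}, cv=10,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    return search.best_estimator_
```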


5.5 Measure of performance used

In order to evaluate the performance of the algorithms used and to carry out comparisons

    between them, the Standardised Mean Square Error (SMSE) has been used. It can be

    computed by:

$$\mathrm{SMSE}^{u} = \frac{\sum_{k=1}^{K}\left(f^{u}_{k} - y^{u}_{k}\right)^{2}}{\sum_{k=1}^{K}\left(y^{u}_{k} - \bar{y}^{u}\right)^{2}} \qquad (5.5)$$

Where $y^{u}_{k}$ are the actual improvements in performance given by (5.1), $f^{u}_{k}$ are the values predicted by the model and $\bar{y}^{u}$ is the mean improvement in performance throughout all the K loops considered within the validation or test set. The SMSE describes how good the

    predictions of the model are compared to the case of always predicting the mean, i.e. without

    knowledge of the problem. Therefore, small values of SMSE indicate a good performance of

    the algorithm and they are expected to be less than 1.
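A direct transcription of equation (5.5) as a small Python helper is shown below (array names are illustrative only).

```python
# Equation (5.5): the squared prediction error normalised by the error of
# always predicting the mean of the test targets.
import numpy as np

def smse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    residual = np.sum((y_pred - y_true) ** 2)
    baseline = np.sum((y_true - y_true.mean()) ** 2)
    return residual / baseline
```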

5.6 Experimental Design

This section describes the design of the experiments that were carried out in order to evaluate

    the performance of the learning algorithms used. There are three objectives with the

    realisation of these experiments:

    1. To provide an interpretation of the parameters and the results obtained by the

    methods used regarding the features that were considered for loop unrolling.

    2. To compare the accuracy of the methods and determine the technique that obtained

    the best performance.

    3. To ascertain the quality of the predictions obtained and the possible impact of these

    predictions on the set of benchmarks used.

5.6.1 Complete dataset

The first experiments were performed on the whole set of loops that were collected, i.e. the

    total number of loops were used for training and validation. Although the results in terms of


    the performance of each method do not represent an expected performance on novel data,

    these experiments were carried out in order to analyse the results obtained by each method

    and to interpret the importance of the variables involved in loop unrolling.

5.6.2 K-fold cross-validation

Given that testing a model on the same dataset that was used for training does not represent a

    real measure of performance of the machine learning methods used, the second set of

experiments was carried out using K-fold cross-validation. This procedure has proved

    to be a good methodology to evaluate the performance of a learning algorithm and does not

    require a dataset with a great number of examples. The procedure is explained as follows:

1. Divide the training data into K randomly chosen, non-overlapping subsets (folds) S1, S2, ..., SK of approximately the same size.

2. Repeat for i varying from 1 to K:

    a. Take the subset Si as the test set and the remaining subsets (S1, ..., Si-1, Si+1, ..., SK) as the training set.

    b. Train the methods (Linear Regression and CART) with the training set

    previously built.

    c. Evaluate the models constructed in the test set, i.e. determine the predicted

    values for each model.

    d. Assess the performance of each algorithm by using the SMSE.

The procedure explained above was executed for each unroll factor u (u = 1, 2, ..., 8) and the

    number of folds used was K=10.
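A sketch of this procedure for a single unroll factor is shown below; `fit` and `predict` are generic callables standing in for Linear Regression or CART, and the SMSE of equation (5.5) is computed inline.

```python
# A sketch of K-fold cross-validation for one unroll factor. `fit` and
# `predict` are hypothetical callables standing in for Linear Regression or
# CART; the SMSE of equation (5.5) is computed per fold and averaged.
import numpy as np

def kfold_smse(X, y, fit, predict, K=10, seed=0):
    """Average SMSE of a model over K random, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])
        pred, truth = predict(model, X[test_idx]), y[test_idx]
        scores.append(np.sum((pred - truth) ** 2) /
                      np.sum((truth - truth.mean()) ** 2))
    return float(np.mean(scores))
```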

5.6.3 Leave One Benchmark Out cross-validation

Although the K-fold cross-validation procedure is widely accepted by the machine learning

    community to evaluate the performance of learning algorithms, the real goal in compiler

    optimisation with machine learning is to optimise complete programs that have not been seen

    before. Therefore, an alternative procedure called Leave One Benchmark Out (LOBO) cross-

    validation has been followed. This time, given a set of B benchmarks, the models must be

    trained with B-1 benchmarks and the performance of the algorithms must be evaluated in the

    benchmark that was not used for training. Therefore, it is the same cross-validation


procedure explained above, except that the subsets are not randomly chosen but are selected to be the individual benchmarks S1, ..., SB. Hence, there are B folds instead of K, where B is the number

    of benchmarks used (B=12).
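The same idea can be sketched with benchmark-based folds, assuming a hypothetical array `benchmark` records the source benchmark of every loop.

```python
# A sketch of Leave One Benchmark Out cross-validation: the loops of one
# benchmark form the test set and the remaining B - 1 benchmarks the training
# set. `benchmark` is a hypothetical array naming the source benchmark of
# each loop; `fit` and `predict` are the same generic callables as above.
import numpy as np

def lobo_smse(X, y, benchmark, fit, predict):
    """SMSE per held-out benchmark."""
    scores = {}
    for b in np.unique(benchmark):
        test = benchmark == b
        model = fit(X[~test], y[~test])
        pred, truth = predict(model, X[test]), y[test]
        scores[b] = np.sum((pred - truth) ** 2) / np.sum((truth - truth.mean()) ** 2)
    return scores
```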

5.6.4 Realising speed-ups

The accuracy of the predictions calculated for each learning technique in terms of the SMSE

    provides a measure that indicates the performance of the learning algorithms used for this

    specific problem. However, it does not present the actual improvement obtained with the

values predicted on each loop and consequently, on each benchmark. The valid procedure to be followed is to determine the best unroll factor for each loop based on the predictions obtained and to set up the programs with the unrolled version of the loops in order to realise the final speed-ups. However, time limitations have prevented the present project from carrying out this task. Alternatively, with the initial data already collected and the results of LOBO

    cross-validation, it is possible to obtain the expected improvement in performance that each

    loop could experience under the assumptions of no interactions between loops and a

    negligible effect of the instrumentation. This procedure has been adopted and its results are

    described in section 5.7.4.

    5.7 Results and Evaluation

5.7.1 Complete dataset

Multiple Linear Regression and CART algorithm have been applied separately to the set of

    benchmarks. The Standardised Mean Square Error (SMSE) for each model that has been

    constructed is shown in Table 5.1. In the case of u=1 the SMSE by definition is 1 given that

    this factor is considered the point of reference and there is no improvement over itself.

    Recalling that the SMSE measures how good the predictions obtained by the model are

    compared to the situation of always predicting the mean, Multiple Linear Regression shows

    very little improvement over always predicting the mean. CART algorithm considerably

    outperforms the linear regression model. Since the same dataset for training is used to

    validate the model, this good performance is rather unrealistic and it may indicate that the

    algorithm is overfitting the data. However, the application of the models to the same data

    that is used for their construction attempts to be explanatory rather than to provide a measure


    of accuracy. In both cases, we are trying to determine those variables that are considered

    more important for each model.

    5.7.1.1 Analysing the regression coefficients: As explained above, in the linear regression model the parameter vector may provide an indication of the effect of each

    predictor variable on the output variable. The parameter vector for the Linear Model

    constructed using the whole dataset is shown in Table 5.2. For each model, the coefficients

    of the five most important features are shown shaded, i.e. those features with the greatest

    conditional effect (negative or positive).

    The absolute conditional effect of each feature throughout all the models has been calculated

    by:

$$\sum_{u=2}^{U}\left|\beta^{u}_{i}\right| \qquad (5.6)$$

    This has been used to determine the best five features and they are presented below.

    SPEC CFP95

    Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    MLR 1.000 0.898 0.929 0.911 0.890 0.905 0.898 0.909

    CART 1.000 0.287 0.318 0.385 0.384 0.346 0.427 0.431

    VECTORD

    Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    MLR 1.000 0.898 0.971 0.932 0.970 0.965 0.968 0.943

    CART 1.000 0.306 0.649 0.463 0.595 0.565 0.586 0.517

    SPEC CFP95 + VECTORD

    Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    MLR 1.000 0.959 0.979 0.971 0.986 0.987 0.986 0.980

    CART 1.000 0.310 0.649 0.419 0.606 0.585 0.629 0.460

    Table 5.1: SMSE when applying Multiple Linear Regression (MLR) and CART to

    the complete set of benchmarks


    1. Stores: The number of array stores

    2. Size: The number of statements within the loop

    3. Reuses: The number of array element reuses

    4. Nested: The nested level of the loop

    5. Floating: The Number of floating point operations

    By far, the number of array stores and the size of the loop body are the features that the

    linear model has encountered to most influence the improvement in performance of the loops

    under unrolling. This is understandable as the former provides an indication of the memory

    references and the demand on registers the code may have and the latter is crucial to

    determine whether the instructions in the loop body may be kept in the cache. However,

    other features such as the number of if statements, the number of array loads or even the trip

    count, are found not to be so important for the linear models. Indeed, the trip count does not

    seem to provide useful information and is ranked among the best five features only when

    u=2. Other variables such as the number of array element reuses that attempts to represent

    data dependency among iterations and the number of floating point operations are also found

to be relevant for these models. Surprisingly, the nested level of the loop is ranked fourth

    among all the features. However, a cautionary comment should be mentioned given that the

    linear models that have been constructed do not seem to accurately represent the data, as the

    values of SMSE are approximately one. Certainly, the predictions are only slightly better

    than always predicting the mean and the findings explained above may only be a

    consequence of the poor performance of the linear model.

    5.7.1.2 Analysing the trees: With the aim of obtaining an explanatory model of the data,

    CART was applied to the whole dataset with no pruning, i.e. allowing overfitting. The

    results shown in Table 5.1 reflect this fact. The most informative feature, i.e. the one placed

at the root of the tree, was the trip count for u=2, 4, 8 and the number of floating-point operations

    for u=3, 5, 6, 7. This is in contrast with the results obtained with linear regression where the

    trip count was found not to be relevant for the predictions.

    Some rules that show how the algorithm fits the data may be of interest. For example, for

    u=2 it is found that IF the loop is called between 349800 and 650000 times within the

    program AND the loop body ranges between 4 and 8 statements AND the trip count is

    greater than 755 AND the loop has maximum 1 branch AND the number of loads is greater

    than 2 AND the loop has only 1 store AND maximum 1 floating-point operation THEN the


expected improvement in performance is approximately 38%. In the opposite case, IF the number of times the loop is called, the trip count and the number of branches are the same as before, but the number of loads per iteration is 2 or 3, AND the loop has 2 or more

    stores AND 2 or more floating-point operations THEN the expected detriment in

    performance is roughly 23%.

    Variable u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    Called -0.958 -1.205 -1.023 -1.042 -0.849 -1.206 -1.049

    Size 3.318 5.387 3.967 4.779 2.850 4.742 3.827

    Trip -1.000 -0.434 -0.492 -0.505 -0.427 -0.534 -0.828

    Sys -0.838 -0.993 -0.782 -0.695 -0.433 -1.516 -1.038

    Branches 0.911 0.539 0.288 0.178 0.532 1.147 0.380

    Nested -0.716 -0.559 -1.504 -1.918 -1.783 -1.521 -2.008

    Loads 0.925 1.249 0.236 1.205 1.588 1.448 1.630

    Stores -3.911 -4.616 -5.179 -5.044 -4.315 -4.976 -4.987

    Reuses 1.724 1.514 2.566 2.419 2.647 1.602 2.418

    Floating -1.113 -2.137 -1.044 -1.237 -1.233 -1.208 -1.615

    IndAcc -0.173 0.268 -0.009 -0.214 -0.201 -0.177 -0.443

    Bo (bias) 1.804 1.668 4.587 2.370 2.872 3.772 5.143

    Table 5.2: Parameter vectors of the linear models using the whole dataset

    Although these prediction rules may seem unsophisticated, they represent how the loops in

    the set of benchmarks selected are affected by unrolling. If it is recalled that the subsets in

    which the data is partitioned by CART are as homogeneous as possible, they may provide a

    better understanding of how unrolling works on this set of benchmarks.

5.7.2 K-fold cross-validation

Given that training and testing the models in the same dataset does not represent the actual

    predictive power of the algorithms, K-fold cross-validation has been used to test the

    accuracy of the predictions. The results for the set of benchmarks used are shown in Table

    5.3, Table 5.4 and Table 5.5. Each table shows the average (in bold) of the SMSE throughout

    all the folds for the algorithms applied to each set of benchmarks. Although there are several


    cases when the SMSE is found to be less than one (highlighted cells), in general, the

    accuracy of the predictions for linear regression and CART algorithm is not good. The

    predictions obtained do not outperform the simple mean predictor used in the calculation of

    the SMSE. However, it is worth noting that some SMSE calculations indicate a good

    performance of the algorithm for a specific fold (e.g. in Table 5.3 for CART algorithm when

    using k=9 and u=6) but they are compensated by other folds in which the algorithm performs

    poorly. Comparatively, linear regression and CART algorithm have roughly the same

    performance, although as shown in Table 5.3 and Table 5.5, Linear Regression consistently

    outperforms CART.

5.7.3 Leave One Benchmark Out Cross-validation

As explained in section 5.6.3 a more realistic approach to test the algorithms is to use Leave

    One Benchmark Out (LOBO) cross-validation. The SMSE values for Linear Regression and

CART algorithm are shown in Table 5.6 and Table 5.7. Unfortunately, the overall results are, again, an indication of poor performance.

    For the case of Multiple Linear Regression (Table 5.6) the loops within the benchmarks

    102.swim, 141.apsi and vectord3 are predicted better than baseline, but these low values of

    SMSE are compensated by the poor performance of the algorithm in other benchmarks such

as 104.hydro2d and 125.turb3d. For benchmarks with a considerable number of loops that can be improved by unrolling, such as vectord1 and vectord2 (see Table 4.1 in Chapter 4), the SMSE values are only slightly different from 1, so it could be expected that these benchmarks would be correctly predicted when determining the best unroll factor to be used.

    As with Multiple Linear Regression, CART algorithm (Table 5.7) shows quite good

    performance when predicting on the benchmark 102.swim (u = 3, 4, 7 and 8). Additionally,

    unlike Linear Regression, CART shows good SMSE values on the benchmark 125.turb3d

    that has 5 loops for which unrolling is significantly beneficial (see table 4.1 in chapter 4).

    However, CART also has a bad performance in 101.tomcatv, vectord3 and vectord4. For this

    last benchmark composed of two loops with very little improvement, the algorithm makes

    predictions greater than 50%, causing the SMSE to reach values greater than 300.


    Linear Regression

    Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    K = 1 0.960 0.912 0.887 0.899 0.839 0.876 0.864

    K = 2 1.105 1.058 1.021 0.917 1.094 1.096 0.992

    K = 3 0.942 0.886 1.050 1.433 1.349 0.781 0.682

    K = 4 0.975 0.967 1.150 1.090 1.025 1.116 1.079

    K = 5 0.961 0.986 0.925 1.024 1.072 0.918 0.941

    K = 6 1.916 2.473 1.473 1.257 1.494 1.986 1.365

    K = 7 1.315 1.488 1.334 1.350 1.750 1.001 1.197

    K = 8 1.315 1.452 1.284 1.023 0.904 1.296 1.283

    K = 9 1.072 1.361 1.524 1.330 1.021 1.329 1.273

    K = 10 0.948 0.962 0.888 0.813 0.835 1.069 0.868

    AVG. 1.151 1.254 1.154 1.114 1.138 1.147 1.054

    CART

    FOLD u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    K=1 1.400 1.164 1.319 1.078 0.905 1.228 0.825

    K=2 3.336 1.707 4.385 1.616 6.883 3.440 3.507

    K=3 1.097 2.658 1.452 2.220 2.380 2.079 2.387

    K=4 1.542 1.018 1.165 1.571 1.208 2.015 1.085

    K=5 1.550 1.268 0.909 0.807 1.106 0.565 0.756

    K=6 2.596 1.206 1.191 1.189 1.218 1.579 0.969

    K=7 1.060 2.247 1.240 1.331 0.757 0.718 0.763

    K=8 2.744 4.179 3.193 9.976 1.151 2.861 2.867

    K=9 1.976 1.769 0.940 1.359 0.244 1.299 1.155

    K=10 0.650 0.650 0.747 1.006 1.184 1.027 0.994

    AVG. 1.795 1.787 1.654 2.215 1.704 1.681 1.531

    Table 5.3: SMSE K-fold cross-validation for SPEC CFP95


    Linear Regression

    Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    K = 1 0.963 0.740 1.297 1.086 1.043 0.897 1.084

    K = 2 1.016 1.429 1.153 1.033 1.075 1.129 1.158

    K = 3 0.934 0.996 0.826 0.994 0.994 0.996 0.940

    K = 4 17.42 13.75 3.618 4.624 5.308 4.107 3.242

    K = 5 1.053 1.312 0.900 1.163 1.010 1.122 0.959

    K = 6 1.731 1.656 1.445 1.625 1.554 1.239 1.258

    K = 7 1.093 1.192 1.057 1.201 1.201 1.268 1.109

    K = 8 0.797 0.850 1.039 1.088 1.421 1.182 1.048

    K = 9 1.078 1.569 1.016 1.529 1.534 1.640 1.039

    K = 10 1.090 1.800 2.011 2.723 2.809 2.477 1.676

    AVG. 2.717 2.529 1.436 1.707 1.795 1.606 1.351

    CART

    FOLD u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    K=1 0.899 2.007 1.810 1.863 2.353 2.529 1.483

    K=2 1.844 6.094 1.570 6.769 7.186 7.719 1.081

    K=3 0.869 0.827 0.728 0.784 0.743 0.831 2.922

    K=4 1.345 1.070 0.771 0.742 0.713 0.889 0.963

    K=5 1.595 4.826 1.677 3.197 3.073 4.198 2.964

    K=6 1.599 1.610 1.050 0.793 0.746 0.682 0.575

    K=7 1.794 1.581 1.874 3.552 3.475 3.432 1.876

    K=8 0.938 1.348 1.173 1.278 0.964 1.343 0.995

    K=9 0.990 0.625 1.046 1.512 1.340 1.374 1.713

    K=10 1.269 1.379 1.307 1.162 1.412 1.329 1.366

    AVG. 1.314 2.137 1.301 2.165 2.200 2.433 1.594

    Table 5.4: SMSE K-fold cross-validation for VECTORD


    Linear Regression

    Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    K = 1 1.076 1.034 0.991 1.037 1.020 1.052 0.978

    K = 2 1.022 1.000 1.044 1.006 1.022 1.004 1.055

    K = 3 0.972 0.960 0.942 0.960 0.986 0.903 0.946

    K = 4 0.919 1.016 1.052 1.077 1.051 1.090 1.051

    K = 5 1.000 1.047 0.960 0.945 0.971 1.014 1.037

    K = 6 1.192 1.149 1.052 1.062 1.034 1.038 1.030

    K = 7 1.032 0.981 1.057 1.028 1.030 1.034 1.032

    K = 8 0.931 0.852 0.914 0.927 0.993 0.871 0.954

    K = 9 0.935 0.943 0.975 0.999 0.988 0.963 0.970

    K = 10 1.303 1.353 2.247 1.332 1.621 1.876 1.691

    AVG. 1.038 1.034 1.123 1.037 1.072 1.084 1.074

    CART

    FOLD u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    K=1 1.054 0.678 0.887 1.818 1.094 0.566 1.185

    K=2 1.180 0.835 0.529 0.766 0.752 0.769 0.717

    K=3 0.638 1.898 1.544 2.420 2.459 1.724 2.371

    K=4 1.119 1.678 0.978 1.561 1.267 1.717 0.996

    K=5 1.661 1.531 1.825 1.833 1.551 1.139 1.117

    K=6 1.116 1.171 1.721 1.677 1.541 1.694 1.830

    K=7 0.907 1.332 0.781 1.105 1.070 1.291 0.645

    K=8 1.270 1.172 0.884 1.593 1.219 1.359 0.720

    K=9 1.511 8.758 1.068 3.444 1.009 2.384 1.089

    K=10 1.513 1.533 1.255 2.091 2.764 2.027 1.138

    AVG. 1.197 2.058 1.147 1.831 1.473 1.467 1.181

    Table 5.5: SMSE K-fold cross-validation for SPEC CFP95 + VECTORD


    Benchmark u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    101.tomcat 5.917 0.869 0.985 0.573 2.691 0.612 1.838

    102.swim 0.279 0.625 0.636 0.739 0.717 0.756 0.705

    103.su2cor 1.832 1.079 1.793 1.352 1.573 1.479 1.602

    104.hydro2d 8.086 38.60 19.28 42.64 29.03 23.60 26.76

    107.mgrid 0.889 1.681 2.096 2.997 4.354 2.505 2.775

    125.turb3d 2.085 9.685 20.75 17.63 6.313 10.96 9.547

    141.apsi 1.031 1.025 0.913 0.911 0.900 0.898 0.987

    146.wave5 1.343 1.616 1.856 2.715 2.365 2.464 2.381

    Vectord1 0.976 0.996 0.987 1.010 1.018 1.003 1.003

    Vectord2 1.064 1.102 1.042 1.057 1.048 1.068 1.037

    Vectord3 0.958 0.976 0.943 0.727 0.802 0.838 0.895

    Vectord4 1.487 1.485 4.748 0.527 1.869 2.274 0.030

    AVG. 2.162 4.978 4.670 6.073 4.390 4.038 4.130

    Table 5.6: SMSE LOBO cross-validation for Multiple Linear Regression

5.7.4 Realising speed-ups

Considering that the ultimate task in loop unrolling is to determine the unroll factor to be

    used for each loop, an additional procedure has been followed in order to measure the impact

    of the predictions described above. Given that the actual improvement in performance of the

    loops is available for each unroll factor, it is possible to establish the best improvement in

    performance that a loop can achieve. Similarly, the predictions provided by the algorithms

    (obtained with LOBO cross-validation) can be used to ascertain the best predicted unroll

    factor for each loop. Therefore, the expected improvement in performance a loop may

    experience under this unroll factor can be found. Let us call this magnitude the predicted


    improvement, although it should be clear that it is the actual improvement of each loop based

    on the unroll factor suggested by the predictions of the algorithms.
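A sketch of how this predicted improvement could be computed is given below, assuming hypothetical (L × U) arrays holding the measured and the predicted mean improvement of every loop under every unroll factor (the latter obtained with LOBO cross-validation).

```python
# For each loop, pick the unroll factor with the largest *predicted*
# improvement and report the loop's *measured* improvement under that factor.
# `actual` and `predicted` are hypothetical (L x U) arrays.
import numpy as np

def predicted_improvement(actual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
    """Measured improvement of each loop under the unroll factor its model prefers."""
    best_u = predicted.argmax(axis=1)           # factor suggested by the predictions
    return actual[np.arange(actual.shape[0]), best_u]
```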

    Benchmark u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8

    101.tomcat 4.861 6.862 6.244 4.185 13.68 8.461 3.907

    102.swim 22.76 0.739 0.575 1.499 1.284 0.646 0.795

    103.su2cor 6.600 4.158 1.753 1.754 1.560 1.317 0.949

    104.hydro2d 1.953 2.274 1.218 3.075 3.181 2.042 1.896

    107.mgrid 1.656 2.401 0.674 2.903 2.187 2.069 2.731

    125.turb3d 1.468 2.425 0.658 1.392 0.787 1.252 0.514

    141.apsi 1.253 2.020 1.455 1.635 3.770 2.328 2.539

    146.wave5 0.672 1.579 0.611 5.272 3.133 1.262 0.789

    Vectord1 1.169 1.122 1.239 1.002 1.242 1.145 1.296

    Vectord2 0.882 4.690 1.009 2.469 2.678 3.126 1.309

    Vectord3 5.459 1.891 6.883 4.882 4.799 5.201 3.859

    Vectord4 386.5 328.3 389.4 210.8 95.33 344.6 481.3

    AVG. 36.27 29.87 34.31 20.07 11.13 31.12 41.82

    Table 5.7: SMSE LOBO cross-validation for CART algorithm

Figure 5.1 shows plots of the predicted improvement against the best improvement for Multiple Linear Regression and the CART algorithm. The ideal place for each point is on the diagonal of the first quadrant, i.e. where the predicted improvement equals the best improvement. However, it is also acceptable to lie below the diagonal (being above the diagonal is impossible) but above the x-axis: although the exact improvement is not predicted, a lot can still be gained if the unroll factor suggested by the algorithms is used. Both Linear Regression and CART suggest unroll factors for which a notable improvement in performance is achieved.


However, they also make predictions that degrade performance for loops that could potentially be improved, as can be seen from the points below the x-axis. Negative values lying on the y-axis indicate that the best improvement is zero, yet the predicted unrolling hurts performance.

Because overplotting in Figure 5.1 makes it difficult to analyse the actual effect of the predictions, the best improvement and the predicted improvement have been divided into ten classes, and a histogram counting the number of loops that belong to each pair of classes has been computed. The results are shown in Figure 5.2. Given that there are only a few differences between Linear Regression and CART, the subsequent analysis refers to the results of both algorithms (unless explicitly stated). As before, it is desirable to be on the main diagonal of the plot, because this indicates that the effect of the predictions is the best possible improvement. Approximately 57% of the loops are correctly affected by the predictions. Furthermore, a large number of loops remain unaffected by the predictions because their maximum improvement is nearly zero; the bar corresponding to the fifth class of both the predicted improvement and the best improvement represents roughly 28% of the loops. It is also interesting to calculate the percentage of loops for which unrolling cannot produce an improvement in performance but which are negatively affected by the predictions. These loops are represented by the bars for which the predicted improvement class is less than five while the best improvement class is five; roughly 17% of the loops belong to this group. The bars for which the best improvement class is greater than five and the predicted improvement class is five correspond to loops that could potentially be improved by unrolling but for which the predictions do not produce a significant effect; 13% (for Linear Regression) and 17% (for CART) of the loops fall into this group.
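The class-based counting described above can be reproduced, in outline, as follows; the improvement arrays are placeholders, so the sketch only illustrates the bookkeeping, not the actual figures.

    import numpy as np

    rng = np.random.default_rng(0)
    best_impr = rng.uniform(-5, 70, size=248)    # placeholder per-loop best improvements (%)
    pred_impr = rng.uniform(-150, 70, size=248)  # placeholder per-loop predicted improvements (%)

    # Class boundaries (%) as listed alongside Figure 5.2
    edges = np.array([-150, -100, -50, -20, -5, 5, 10, 20, 40, 60, 80])

    # 10x10 table of loop counts: rows = best-improvement class, columns = predicted-improvement class
    counts, _, _ = np.histogram2d(best_impr, pred_impr, bins=[edges, edges])
    total = counts.sum()

    frac_correct = np.trace(counts) / total    # predicted class equals the best class
    frac_neutral = counts[4, 4] / total        # both in class 5, (-5, 5]: unrolling makes little difference
    frac_hurt = counts[4, :4].sum() / total    # best class 5 but predicted class below 5: needlessly degraded
    frac_missed = counts[5:, 4].sum() / total  # best class above 5 but prediction leaves the loop unaffected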

    Besides analysing the effect of the predictions on the complete set of loops, the unroll factors

    suggested by the algorithms have been used to ascertain the reduction (or increase) in the

    execution time of each loop by using the initial data that has been collected. These

    reductions in execution time have been used to determine the impact of the predictions on

    the total execution time of the benchmarks. Let us denote this improvement of performance

    as hypothetical or re-substitution improvement because it has not been obtained by

    additional executions of the programs. It relies on the assumptions that there is no interaction

    between loops and that the effect of the instrumentation is negligible. The results for all the

    benchmarks used are shown in Figure 5.3. The results obtained by the Linear Regression


model led to an improvement in performance for seven benchmarks, whilst one benchmark, namely 125.turb3d, remained nearly unaffected, and four programs experienced an increase in their execution times. The improvements obtained by using the results of the CART algorithm are similar, except that in this case 125.turb3d was negatively affected and its performance worsened. It can be seen that the SPEC CFP95 benchmarks are more

    difficult to optimise than the VECTORD benchmarks. The maximum improvement reached

    for the former was roughly 6% while the latter was improved at a maximum of

    approximately 18%. The mean improvement in performance throughout all the benchmarks

achieved by the Linear Regression model was about 2.5%, while the CART algorithm gave 2.3% on average. Finally, the benchmark most negatively affected by erroneous predictions of the algorithms was vectord3, which experienced a slowdown of approximately 7%.
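A minimal sketch of this re-substitution calculation, under the stated assumptions that per-loop savings are additive and the instrumentation overhead is negligible; the data structure and the numbers are purely illustrative.

    # Hypothetical per-benchmark records: total measured run time plus, for each
    # instrumented loop, its original time and its time at the suggested unroll factor.
    benchmarks = {
        "vectord1": {
            "total_time": 100.0,                  # seconds
            "loops": [(12.0, 10.5), (8.0, 8.4)],  # (time without unrolling, time at predicted factor)
        },
    }

    for name, bench in benchmarks.items():
        # Assumption: loops do not interact, so the per-loop savings can be summed directly.
        saved = sum(orig - unrolled for orig, unrolled in bench["loops"])
        improvement = 100.0 * saved / bench["total_time"]
        print(f"{name}: re-substitution improvement {improvement:.1f}%")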

If Figure 5.3 is compared with the maximum possible improvement obtainable by unrolling, shown in Figure 5.4 for all the benchmarks, the negative effect of the predictions on some benchmarks can be understood: the maximum improvement that can be reached on these benchmarks is insignificant. Similarly, the good effect obtained for vectord1 and vectord2 can also be explained by this maximum improvement.

5.7.5 Feature construction

As explained in section 4.6.2, some features may be irrelevant for the learning techniques;

    others can be transformed into a more suitable form, for example binary features; and new

    features can be constructed in order to improve the accuracy of the predictions, for example,

the ratio of memory references to floating-point operations. Binary features were created for sparse variables such as the number of proper functions of the language or the number of branches within the loop, but no improvement was found. Similarly, the ratio of memory references to floating-point operations was created as a new feature, but unfortunately this variable did not affect the performance of the algorithms.
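For illustration, the kinds of derived features mentioned above could be constructed along the following lines; the column names and values are hypothetical rather than taken from the actual feature set of Chapter Four.

    import numpy as np

    # Hypothetical raw feature columns for a handful of loops
    n_branches = np.array([0, 2, 0, 1])    # branches inside the loop body
    mem_refs = np.array([14, 6, 20, 3])    # memory references per iteration
    flops = np.array([7, 0, 10, 3])        # floating-point operations per iteration

    # Binarised version of a sparse count: does the loop contain any branch at all?
    has_branch = (n_branches > 0).astype(float)

    # Ratio of memory references to floating-point operations,
    # guarding against loops that contain no floating-point operations.
    mem_to_flop = mem_refs / np.maximum(flops, 1)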

5.7.6 Comparison to related work

In principle, there is no direct point of comparison with the work done in [Monsifrot et al., 2002] or

    the one in [Stephenson and Amarasinghe, 2004] due to the differences in the benchmarks


    used and the formulation of the problem. Certainly, their goal was to build a classifier while

    the approach that has been adopted in this project is the construction of a regression model.

    Although in [Monsifrot et al., 2002] unrolling was also implemented at the source code level,

    a classification task was developed and, consequently, the results were presented in a

    different way. Additionally, the decision of the specific unroll factor to be used was

    transferred to the compiler and it was not determined by the classifier. This considerably

    simplifies the problem given that a loop may be improved by one unroll factor but negatively

    affected by another one.

For illustrative purposes only, if a classifier were constructed from the results obtained with Linear Regression or the CART algorithm, it would have an accuracy of approximately 75%, which is comparable with the total accuracy obtained in [Monsifrot et al., 2002] without boosting (79.4%). Their speed-ups were on average 6.2% and 4% on two different machines. However, they give no indication of the maximum speed-up achievable in their data.
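The exact conversion behind this illustrative accuracy figure is not reproduced here; one plausible way to collapse the regression output into a binary decision, sketched below with placeholder arrays, is to decide to unroll whenever some factor is predicted to help and to compare that decision with whether unrolling can actually help.

    import numpy as np

    rng = np.random.default_rng(0)
    actual_impr = rng.normal(0.0, 5.0, size=(248, 7))     # measured improvement (%) per loop and factor
    predicted_impr = rng.normal(0.0, 5.0, size=(248, 7))  # LOBO cross-validated predictions (%)

    # Collapse the regression output into a binary decision: unroll iff some factor is predicted to help
    decide_unroll = predicted_impr.max(axis=1) > 0
    should_unroll = actual_impr.max(axis=1) > 0

    accuracy = np.mean(decide_unroll == should_unroll)
    print(f"classification-style accuracy derived from the regression model: {accuracy:.1%}")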

In the case of the work developed by [Stephenson and Amarasinghe, 2004], it is even more difficult to establish comparisons because they implemented unrolling in the back-end of the compiler. They also adopted a classification approach and obtained speed-ups of 6% on the SPEC benchmarks. However, they included programs from SPEC INT95 that were not considered in this project. Furthermore, their dataset included programs for which an improvement of 30% to 50% was possible, which is greater than the maximum obtainable speed-ups for the programs used in this project.

5.8 Summary and Discussion

This chapter has addressed the regression approach for predicting the expected improvement in performance of loops under unrolling. This approach has been argued to be more robust and more general than the classification approach. It is more robust because it deals more easily with noisy measurements. It is more general because both the binary decision of whether unrolling should be applied and the multi-class prediction of the best unroll factor can easily be derived from the results of the regression approach.


Two different modelling techniques have been used to tackle this regression problem: Multiple Linear Regression and Classification and Regression Trees (CART). Multiple Linear Regression is a straightforward technique that assumes that the output variable is a linear combination of the input variables; despite its simplicity, it has proved successful for many types of problem. The CART algorithm is a decision-tree method used for classification and regression. It attempts to model non-linear dependencies between the output variable and the predictor variables without making assumptions about the distribution of the data. To measure the predictive power of the techniques, the Standardised Mean Square Error (SMSE) has been used, which compares the accuracy of the predictions against the very simple baseline of always predicting the mean.

Three different types of experiment have been designed in order to test the models constructed for the regression approach: using the complete dataset, using K-fold cross-validation, and using Leave One Benchmark Out (LOBO) cross-validation. When working with the complete dataset, the algorithms were trained and tested on the same data in order to obtain an explanatory model that gave a better insight into the problem and identified the variables that are relevant for loop unrolling. The regression coefficients indicated that the most influential variables were the number of array stores, the number of statements within the loop, the number of array element reuses, the nested level of the loop and the number of floating-point operations. Apart from the nested level of the loop, which was unexpectedly ranked fourth among the most important variables, all the other features have a demonstrated impact on loop unrolling. The analysis of the trees built by the CART algorithm determined that the trip count and the number of floating-point operations were the most relevant features. Additionally, interesting rules involving the size of the loop body, the number of memory references, the number of floating-point operations and the trip count emerged from the trees, providing an explanation of how the regression was performed on the data.
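As an illustration of this modelling pipeline, the sketch below uses scikit-learn's LinearRegression and DecisionTreeRegressor as stand-ins for the Multiple Linear Regression and CART implementations actually used, and computes the SMSE against the mean predictor (values below one beat the mean); the feature matrix and targets are random placeholders.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(248, 10))  # standardised loop features (zero mean, unit variance)
    y = rng.normal(size=248)        # improvement in performance (%) for a given unroll factor

    def smse(y_true, y_pred):
        # Standardised MSE: MSE of the model divided by the MSE of always predicting the mean
        return np.mean((y_true - y_pred) ** 2) / np.mean((y_true - y_true.mean()) ** 2)

    linear = LinearRegression().fit(X, y)
    tree = DecisionTreeRegressor().fit(X, y)  # unpruned, so re-substitution error will be optimistic

    print("SMSE (linear, re-substitution):", smse(y, linear.predict(X)))
    print("SMSE (tree, re-substitution):  ", smse(y, tree.predict(X)))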

    Considering that the predictive power of the techniques is what describes the effectiveness of

    the regression approach, the results obtained for the cross-validation methodology and the

    Leave One Benchmark Out strategy are, in general, disappointing. Certainly, although for

    some cases the SMSE values were an indication of good performance, they were strongly

    overshadowed by other cases where the performance of the algorithms was poor.

    Without being discouraged by the low accuracy of the predictions, the models that were

constructed by the Linear Regression approach and the CART algorithm were used to suggest an


    unroll factor for each loop. Thus, these unroll factors were considered in order to ascertain a

    possible improvement in performance that the loops could obtain. It was found that in most

    of the cases, the unroll factors suggested by the algorithms produced an improvement in

    performance that, to some extent, agreed with the best obtainable improvements. In other

words, many of the unroll factors suggested by the predictions of the algorithms led to an

    improvement in performance of the loops. However, some loops were negatively affected by

    incorrect predictions.

Additionally, given the time constraints placed on the present project, the benchmarks were

    not executed with the predicted unrolled version of the loops. An alternative procedure based

    on the data that was collected was performed in order to realise speed-ups on each

    benchmark. This procedure reused the execution times of the data and worked under the

    assumptions of no interaction between loops and a negligible effect of the instrumentation.

    Hence, the new execution times for the benchmarks were calculated based on the predictions

suggested by the algorithms with the LOBO cross-validation results. It was found that seven of the twelve benchmarks might experience an improvement in performance, where the

    maximum speed-up found was approximately 18%. However, other benchmarks might

    experience an increase in their execution times if the predictions were adopted. The worst

detriment in performance was found to be roughly 7%. The mean improvement in performance throughout all the benchmarks was about 2.5% for the linear model and 2.3%

    for the results obtained by CART.

    It may seem rather contradictory that the accuracy of the regression techniques was found to

be poor, yet the unroll factors suggested by the results led to an improvement in performance that is notable, considering the maximum improvement obtainable by unrolling

    on the set of benchmarks. There are two reasons that may explain this fact. Firstly, some

    loops may always be improved by unrolling regardless of the unroll factor used. That is, the

    execution time of some loops may be effectively reduced by any unroll factor greater than

    one. However, there is still an indication of a good effect of the predictions given that, as

    shown in Chapter Four, most of the loops do not experience an improvement in performance

    greater than 5%. Indeed, and this is the second reason for the speed-ups obtained despite the

    poor performance of the algorithms, the models constructed may somehow have learnt when

    the loops experience a positive effect under unrolling and, although the accuracy of their

    predictions is low, the best unroll factor suggested by their results still represents a good

    improvement for the loops.


The explanation given above does not attempt to soften the poor predictive power, in terms of the SMSE, achieved by the techniques used, but rather raises a question regarding the

    appropriateness of the regression approach for the problem tackled in this project. I still

    believe it is a more general and suitable approach than the construction of a classifier to

    directly predict the best unroll factor. The classification approach is more susceptible to

    erroneous decisions given that it does not smoothly model the behaviour of the execution

times of the loops under unrolling. However, it may also be true that the regression approach needs more training data, different features to capture the variations in the execution times of the loops under unrolling, and even different types of techniques to solve the problem. These considerations can be adopted by future work in order to improve the results

    obtained in this project.


Figure 5.1: Predicted Improvement (%) vs. Best Improvement (%) in performance for the Linear Regression model (top) and CART (bottom)


Class    Improvement (%)
1        (-150, -100]
2        (-100, -50]
3        (-50, -20]
4        (-20, -5]
5        (-5, 5]
6        (5, 10]
7        (10, 20]
8        (20, 40]
9        (40, 60]
10       (60, 80]

Figure 5.2: Histogram of Best Improvement vs. Predicted Improvement for Multiple Linear Regression (top) and CART (bottom)


    Figure 5.3: Re-substitution Improvement found by Multiple Linear Regression

    (top) and CART (bottom)


    Figure 5.4: Maximum possible improvement obtainable by unrolling


    Conclusions

    This dissertation has addressed the problem of compiler optimisation with machine learning.

It has been shown that, even when only one transformation such as loop unrolling is considered, the problem of determining when and how this transformation should be applied remains a challenge for compiler writers and researchers. This is evident because loop unrolling may be beneficial for some loops but detrimental for others.

Furthermore, this transformation may have a positive effect on a loop with one unroll factor and a negative effect with another. One of the most important reasons to apply loop unrolling is that it exposes Instruction Level Parallelism. Unrolling can also eliminate loop copy operations and improve memory locality. However, the transformation may degrade the instruction cache depending on factors such as the size of the loop body, the size of the cache and the unroll factor used. Additionally, it may increase the number of address calculations and make the instruction scheduling problem more complex. Finally, due to the effect of interactions, loop unrolling may enable or prevent other transformations. This is probably the most complicated issue when studying loop unrolling, given that analysing the interaction with another transformation in isolation may be unrealistic, while evaluating its effect in combination with a large number of program transformations is infeasible.

These factors motivate the use of machine learning techniques to predict when unrolling can be detrimental or beneficial for a particular loop. This approach

    assumes that program transformations can be learnt from past examples; more specifically,

    that the parameters involved in loop unrolling can be learnt based on the behaviour of the

    execution times of loops that are described by a set of features.

    Despite previous work having focused on stating the optimisation problem with loop

    unrolling as a classification task, throughout this project the belief has been held that a

    regression approach could more appropriately model the behaviour of a loop under unrolling.

This may be understood from the fact that, for the problem this project has dealt with, classifiers could be severely affected by noisy measurements and by the limited number of examples in the training set. Furthermore, the decisions that a classification approach provides, namely whether to apply unrolling to a specific loop and which unroll factor is best, can easily be obtained from the results of the regression approach.


    The belief has also been held that, as in most data mining applications, the stage of pre-

    processing the data to make it suitable for learning is the phase to which most time must be

dedicated, given that without reliable data the learning process proves nonsensical. Thus,

    much effort was focused on the process of collecting, cleaning, analysing and preparing the

    data to be used by machine learning techniques. Judiciously following this process, the

    features and the execution times of the loops were extracted and transformed in order to be

    used by the regression approach.

    Two different algorithms were used for regression in order to predict the improvement in

    performance a loop can experience when it is unrolled a specific number of times: Multiple

    Linear Regression and Classification and Regression Trees (CART). The former was not

    only selected on the basis of its simplicity, but also on the basis of its successful performance

    across different types of applications. The latter was chosen as it considers non-linear

    relationships and does not make assumptions about the underlying distribution of the data.

    The results in terms of the accuracy measured by the Standardised Mean Square Error

(SMSE) were disappointing, as in most cases the predictions obtained with the algorithms

    could not outperform the very simple mean predictor used by the SMSE. Nonetheless, these

    predictions were used in order to ascertain the best unroll factor for loops and determine the

    expected improvement in performance the benchmarks that were considered could

experience. Due to the time constraints placed on this project, the programs could not be

    executed with the predictions suggested by the algorithms. The initial data that was collected

was used instead, in order to compute the final results. Thus, the realisation of speed-ups was carried out under the assumption that there is no interaction between loops and that the

    effect of the instrumentation is negligible.

    It was found that most of the loops could experience an improvement in performance by

    using the unroll factor suggested by the predictions of the algorithms. Certainly, it was found

that seven out of twelve benchmarks could be improved with the predictions, where the

    maximum speed-up found was approximately 18%. Four programs in the case of the linear

model, and five programs in the case of the CART algorithm, were negatively affected by

    unrolling when using the factors predicted. On average, an overall improvement in

    performance of 2.5% (for Linear Regression) and 2.3% (for CART) could be achieved.


    Two possible explanations were considered after analysing the results obtained. Firstly, the

    fact that some loops are positively influenced by unrolling, regardless of the unroll factor

    used. Secondly, although the accuracy of the predictions was poor, the models enabled the

    discovery of circumstances in which unrolling could effectively improve the execution time

    of loops. In conclusion, both factors have influenced the improvements obtained. However,

although improvements in performance have been achieved, it would be naïve to state that all the results of the project are exciting. Other approaches, albeit using different

    data, have obtained superior speed-ups. It would also be extreme to assume that the

    improvement in performance of loops under unrolling is unpredictable. It is difficult to

    determine a unique reason responsible for the poor performance of the algorithms used. It is

    necessary, however, to remark that the number of loops utilised in this project is

    considerably less than in similar approaches. Certainly, whilst only 248 loops were included

    in the dataset in this project, previous work accumulated data for more than 1000 loops.

    Additionally, the set of mostly static features utilised for loops may not be sufficient to

    accurately predict the improvement in performance of loops under unrolling. Finally, other

regression methods, different from Linear Regression and the CART algorithm, could perform better on the dataset that has been created.

Thus, immediate future work following on from the present project should focus on the addition of further examples to the existing dataset, remaining cautious that

    any new data added is suitable for use by machine learning techniques. The dataset may be

    complemented not only with more examples but also with a greater number of features.

    Dynamic features can be determined by frameworks specialised in program analysis and

    program transformations. Furthermore, it is worth trying other regression methods different

from Multiple Linear Regression and the CART algorithm.

Going beyond learning for loop unrolling, there is a great number of applications in compiler optimisation that could be addressed: for example, discerning better heuristics for other program transformations, or working with more complex problems such as instruction

    scheduling or register allocation. However, there is still much work to be done in connection

    with the problem of interactions among different transformations. Given that evaluating all

the possible optimisation options with which a program could be executed is infeasible, one could have a set of training examples for which these optimisations have been recorded,

    and it would be worthwhile predicting the levels of optimisation for which a program may

    achieve maximum speed. Nonetheless, it should be recalled that the creation of a good


    dataset usable for machine learning techniques is a time-consuming activity. Therefore, the

    use or the creation of software tools that could automate this process can provide a great

    benefit for the research in this area.


    Bibliography

    [Bacon et al., 1994] Bacon, D., Graham, S. and Sharp, O. (1994). Compiler Transformations

    for High-Performance Computing. In ACM Computing Survey, Vol. 26, Issue 4, Pages 345

    420.

[Bodin et al., 1998] Bodin, F., Mével, Y. and Quiniou, R. (1998). A User Level Program Transformation Tool. In Proceedings of the International Conference on Supercomputing.

[Breiman et al., 1993] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1993). Classification and Regression Trees. Chapman and Hall, Boca Raton.

    [Calder et al., 1997] Calder, B., Grunwald, D., Jones, M., Lindsay, D., Martin, J., Mozer, M.

and Zorn, B. (1997). Evidence-Based Static Branch Prediction Using Machine Learning. In

    ACM Transactions on Programming Languages and Systems, Vol. 19, No. 1, Pages 188-

    222.

[Cooper and Torczon, 2004] Cooper, K. and Torczon, L. (2004). Engineering a Compiler. Morgan Kaufmann.

    [Davidson and Jinturkar, 2001] Davidson, J. and Jinturkar, S. (2001). An Aggressive

    Approach to Loop Unrolling. Technical Report: CS-95-26. University of Virginia, USA.

    [Dongarra and Hinds, 1979] Dongarra, J. and Hinds, A. (1979). Unrolling Loops in

    FORTRAN. Software-Practice and Experience, Vol. 9, Pages 219-226.

    [Fursin, 2004] Fursin, G. (2004). Iterative Compilation and Performance Prediction for

    Numerical Applications. PhD thesis, University of Edinburgh, Edinburgh, UK.

    [G77] GNU Fortran Compiler. http://gcc.gnu.org/.

    [Han and Kamber, 2001] Han, J. and Kamber, M. (2001). Data Mining: Concepts and

    Techniques. Morgan Kaufmann.



    [Hand et al., 2001] Hand, D., Mannila, H. and Smyth, P. (2001). Principles of Data Mining.

    MIT Press.

    [Hays and Winkler, 1971] Hays, W. and Winkler, R. (1971). Statistics: Probability, Inference

    and Decision. Volume I. Holt, Rinehart and Winston, Inc.

[Kolodner, 1993] Kolodner, J. (1993). Case-Based Reasoning. Morgan Kaufmann.

    [Koza, 1992] Koza J. (1992). Genetic Programming: On the Programming of Computers by

    Means of Natural Selection. MIT Press.

    [Levine et al., 1991] Levine, D., Callahan, D. and Dongarra, J. (1991). A test suite for

    vectorizing Fortran compilers. Double precision.

    [Long, 2004] Long, S. (2004). Adaptive Java Optimisation Using Machine Learning

    Techniques. PhD thesis, University of Edinburgh, Edinburgh, UK.

[Mitchell, 1997] Mitchell, T. (1997). Machine Learning. McGraw Hill.

[Monsifrot and Bodin, 2001] Monsifrot, A. and Bodin, F. (2001). Computer Aided Hand Tuning (CAHT): Applying Case-Based Reasoning to Performance Tuning. In Proceedings of the

    15th ACM International Conference on Supercomputing (ICS-01), pages 196-203. ACM

    Press, Sorrento, Italy.

[Monsifrot et al., 2002] Monsifrot, A., Bodin, F. and Quiniou, R. (2002). A Machine

    Learning Approach to Automatic Production of Compiler Heuristics. In Artificial

    Intelligence: Methodology, Systems, Applications, pages 41-50.

    [Neter et al., 1996] Neter, J., Kutner, M., Nachtsheim, C. and Wasserman, W. (1996).

    Applied Linear Statistical Models, 4th Edition. Irwin.

    [Nielsen, 2004] Nielsen, P. (2004). Lecture notes on Advanced Computer Systems. The

    University of Auckland, New Zealand. Downloadable from:

    http://www.esc.auckland.ac.nz/teaching/Engsci453SC/.



[ORC] Open Research Compiler for Itanium™ Processor Family.

    http://ipf-orc.sourceforge.net/.

    [Shen and Smaill, 2003] Shen, Q. and Smaill, A. (2003). Lecture Notes on Knowledge

    Representation. The University of Edinburgh, UK. Downloadable from:

    http://www.inf.ed.ac.uk/teaching/modules/kr/notes/index.html.

    [SPEC95] The Standard Performance Evaluation Corporation.

    http://www.specbench.org/.

    [Stephenson and Amarasinghe, 2004] Stephenson, M. and Amarasinghe, S. (2004).

    Predicting Unroll Factors Using Nearest Neighbors. MIT-TM-938.

[Stephenson et al., 2003] Stephenson, M., Martin, M., O'Reilly, U.-M. and Amarasinghe, S. (2003). Meta Optimization: Improving Compiler Heuristics with Machine Learning. In Proceedings of the SIGPLAN '03 Conference on Programming Language Design and

    Implementation, San Diego, CA.

    [Trimaran] Trimaran, an Infrastructure for Research in Instruction-Level Parallelism.

    http://www.trimaran.org.


