efficient execution paradigms for parallel heterogeneous …952645/... · 2016. 9. 2. · efficient...

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2016

Digital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 1405

Efficient Execution Paradigmsfor Parallel HeterogeneousArchitectures

KONSTANTINOS KOUKOS

ISSN 1651-6214ISBN 978-91-554-9654-8urn:nbn:se:uu:diva-300831

Dissertation presented at Uppsala University to be publicly examined in ITC/1111,Lägerhyddsvägen 2, Uppsala, Friday, 30 September 2016 at 13:00 for the degree of Doctor ofPhilosophy. The examination will be conducted in English. Faculty examiner: Prof. MargaretMartonosi (Dept. of Computer Science, Princeton University).

AbstractKoukos, K. 2016. Efficient Execution Paradigms for Parallel Heterogeneous Architectures.Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science andTechnology 1405. 54 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9654-8.

This thesis proposes novel, efficient execution-paradigms for parallel heterogeneousarchitectures. The end of Dennard scaling is threatening the effectiveness of DVFS in futurenodes; therefore, new execution paradigms are required to exploit the non-linear relationshipbetween performance and energy efficiency of memory-bound application-regions. To attackthis problem, we propose the decoupled access-execute (DAE) paradigm. DAE transformsregions of interest (at program-level) in two coarse-grain phases: the access-phase and theexecute-phase, which we can independently DVFS. The access-phase is intended to prefetchthe data in the cache, and is therefore expected to be predominantly memory-bound, while theexecute-phase runs immediately after the access-phase (that has warmed-up the cache) and istherefore expected to be compute-bound.

DAE, achieves good energy savings (on average 25% lower EDP) without performancedegradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memorylevel parallelism (MLP) of memory-bound regions, which results in performance improvementsof memory-bound applications. To automatically transform application-regions to DAE, wepropose compiler techniques to automatically generate and incorporate the access-phase(s) inthe application. Our work targets affine, non-affine, and even complex, general-purpose codes.Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamicenvironments, and handle codes with statically unknown access-phase overheads. In general,applications automatically-transformed to DAE by our compiler, maintain (or even exceed insome cases) the good performance and energy efficiency of manually-optimized DAE codes.

Finally, to ease the programming environment of heterogeneous systems (with integratedGPUs), we propose a novel system-architecture that provides unified virtual memory with lowoverhead. The underlying insight behind our work is that existing data-parallel programmingmodels are a good fit for relaxed memory consistency models (e.g., the heterogeneous race-freemodel). This allows us to simplify the coherency protocol between the CPU – GPU, as well asthe GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDPover the corresponding SC implementation.

Keywords: Decoupled Execution, Performance, Energy, DVFS, Compiler Optimizations,Heterogeneous Coherence

Konstantinos Koukos, Department of Information Technology, Division of Computer Systems,Box 337, Uppsala University, SE-75105 Uppsala, Sweden.

© Konstantinos Koukos 2016

ISSN 1651-6214ISBN 978-91-554-9654-8urn:nbn:se:uu:diva-300831 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-300831)

Dedicated to my very supportive wife,for tolerating me those five —full of deadlines— years.

Also, dedicated to my son,for all these sleepless —sometimes productive— nights.

List of papers

This thesis is based on the following papers, which are referred to in the text

by their Roman numerals.

I Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos,

and Stefanos Kaxiras. Towards more efficient execution: a decoupledaccess-execute approach. In Proceedings of the 27th international

ACM conference on International conference on supercomputing

(ICS ’13). 2013. ACM, New York, NY, USA, 253-262.

DOI=http://dx.doi.org/10.1145/2464996.2465012

II Alexandra Jimborean, Konstantinos Koukos, Vasileios Spiliopoulos,

David Black-Schaffer, and Stefanos Kaxiras. Fix the code. Don’t tweakthe hardware: A new compiler approach to Voltage-Frequency scaling.In Proceedings of Annual IEEE/ACM International Symposium on

Code Generation and Optimization (CGO ’14). 2014. ACM, New

York, NY, USA, Pages 262 , 11 pages.


III Konstantinos Koukos, Per Ekemark, Georgios Zacharopoulos,

Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean.

Multiversioned decoupled access-execute: the key to energy-efficientcompilation of general-purpose programs. In Proceedings of the 25th

International Conference on Compiler Construction (CC ’16). 2016.

ACM, New York, NY, USA, 121-131.


IV Konstantinos Koukos, Alberto Ros, Erik Hagersten, and Stefanos

Kaxiras. Building Heterogeneous Unified Virtual Memories (UVMs)without the Overhead. ACM Transactions on Architecture and Code

Optimization 13, 1, Article 1. March 2016, 22 pages.

DOI=http://dx.doi.org/10.1145/2889488

Reprints were made with permission from the publishers.

Other publications not accounted in the thesis:

• Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, and

Stefanos Kaxiras. Towards power efficiency on task based, decoupledaccess execute models. In 4th Workshop on Parallel Programming and

Run-Time Management Techniques for Many-core Architectures

(PARMA ’13), 2013.

• Jonatan Waern, Per Ekemark, Konstantinos Koukos, Stefanos Kaxiras

and Alexandra Jimborean. Profiling-Assisted Decoupled Access-Execute.In High Performance Energy Efficient Embedded Systems, 4th Edition,

(HIP3ES ’16). 2016. http://arxiv.org/abs/1601.01722

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.1 The end of Dennard scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2 Compiler challenges to automate DAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Extend, adapt, and apply to general-purpose applications . . . . . . . . . . 11

1.4 Improve the programmability of heterogeneous systems . . . . . . . . . . . . 12

1.5 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Understanding the decoupled access-execute paradigm . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1 The key idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2 Task-based DAE; motivation and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.3 DAE case-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Compiler support for the decoupled access-execute paradigm . . . . . . . . . . . . . . 20

3.1 Affine code transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 Non-Affine code transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3 General purpose code transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 Multi-version decoupled access-execute (SMVDAE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 SMVDAE definitions and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.2 SMVDAE code generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5 Building heterogeneous UVMs without the overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.1 Synchronization and memory consistency model . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2 VIPS-G, heterogeneous system coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3 Optimizations for GPGPU computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Related-Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

8 Svensk sammanfattning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

9 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

1. Introduction

Computers nowadays, are inextricably linked to science and to our every-day

life. Although the digital revolution of the last century has affected almost ev-

ery other science, some scientific fields own their exponential development in

those marvelous machines. High performance computing have become a ne-

cessity today for fields like biology and pharmaceutical research, physics and

cosmology, economics, and forecast. Even in our every day life, we are sur-

rounded by computers not only to browse the world-wide-web and our e-mail,

but also for social media, entertainment and communication. Furthermore,

numerous other devices (e.g., digital cameras, microwaves, cars, etc) contain

at least one embedded processor. Even mobile phones and tablets today, are

build with such sophisticated processors that easily outperform the capabil-

ities of desktop computers of the previous decade. This widespread-use of

computers comes with a surprisingly large energy footprint; it is estimated

that computers today are responsible for 10% of the total world’s electricity

consumption [52].

Since we rely so much on computers, significant effort from computer ar-

chitects and engineers is dedicated to deliver higher performance and improve

the energy efficiency of every new computer generation. This effort is two-

fold: (i) improve the memory system, and (ii) exploit the opportunities to save

energy based on the program demands.

The memory system [29] as evolved in the last years, refers to a memory hi-

erarchy that consists of several levels of caches between the processor and the

main memory (DRAM) —while it extends to the disk and non-volatile storage

memories that are outside the focus of this study. Each level of the hierarchy

has different characteristics such as: size, access-latency (smaller caches are

faster), data-transfer energy, and meet different technology trends and costs

(with registers and smaller caches being more expensive to produce per byte).

Ideally, to exploit the full potential of state-of-the-art micro-architectures we

would like to keep the data as close as possible to the execution units. Un-

fortunately, real life applications have a significantly bigger memory footprint

than the processor caches, which makes it impossible to keep all data close to

the execution units and therefore, they stress the performance gap between the

processor and the main memory.

In many cases the memory system limitations, throttle the performance of

applications with relatively big memory footprints, especially when they per-

form irregular memory accesses; for brevity we refer to them as memory-

bound applications in the rest of this thesis. For these applications, dynamic

9

voltage-frequency scaling DVFS [24, 8] has been the staple of energy saving

over the past years. DVFS provides quadratic energy savings for at most linear

performance degradation1, and can be applied at different levels —for exam-

ple, at system level with the help of the OS, at program level with assistance

from the compiler, or entirely at the hardware level [34]. The software decou-pled access-execute (DAE) technique proposed in this thesis (Paper I) [39],

operates at program level with assistance from the compiler (Papers II & III)

[31, 40]. This work is detailed in Chapters 2–4. The rest of the current chapter

discusses the challenges that this thesis is addressing.

1.1 The end of Dennard scalingIn early 2000s the increasing frequency trend of the previous decades started

to faint, as a result of materials properties that make frequency scaling beyond

4GHz at a reasonable power/thermal budget really difficult. This led to the

introduction of multi-cores, because there was still room to scale the transis-

tors (according to Dennard’s scaling). The underlying insight behind Dennard

scaling [14] was that transistors can get smaller with constant power density,

maintained due to a linear voltage and current decrease. Dennard scaling,

however, comes to end due to the exponential increase of leakage power at

low voltages, which restricts voltage scaling and undermines the effectiveness

of DVFS. With voltage scaling range expected to shrink even more in future

processor generations [28], optimizing the program execution for energy effi-

ciency becomes a great challenge.

Frequency scaling (decrease) alone, provides linear power decease at the

cost of linear performance loss, which restricts the potential for energy savings

in predominantly compute-bound applications. The performance of memory-

bound applications, however, is not linearly affected by frequency scaling,

because they spent most of their execution time waiting for data to come from

the main memory. Therefore, their performance is limited by the memory,

which allow us to decrease the CPU frequency with a small performance

degradation while achieving near-linear energy savings. Unfortunately, real

life applications are neither entirely compute- nor memory-bound. Instead,

they consist of various compute- to memory-bound phases, where each has

its own optimal energy efficiency at a different frequency. Additionally, the

DVFS transition-latency for the hardware incurs overheads that prohibits its

use at very fine-granularity (instruction-granularity) even in future processors

with on chip voltage regulators [37, 38].

To attack these problems, we propose a software decoupled access-execute(DAE) paradigm (Paper I) [39]. They key idea of DAE is to transform the exe-

cution in two distinct, coarse-grain phases: one predominantly memory-bound

1Power ∝ AC fV 2 where C is the load capacitance, V is the supply voltage, A is the activity

factor and f is the operating frequency. AC is referred as effective capacitance Ce f f .

10

phase intended to prefetch the data to a private cache, and another compute-

bound that will use the already prefetched data in its execution. With those

phases designed to operate at a coarse granularity (task, or function), a low la-

tency (nano-scale) DVFS can optimize the energy efficiency of each one sepa-

rately with negligible overhead. Furthermore, our technique increases memory

level parallelism (MLP), and therefore led to performance improvements (up

to 15%) on irregular, memory-bound applications (as shown in Paper I).

1.2 Compiler challenges to automate DAE

The DAE technique discussed in Paper I [39], offers a great potential to im-

prove the energy efficiency (on average 25% energy-delay product [17] im-

provements), while at the same time maintains the performance of applica-

tions, as if they were running at the highest frequency. There was an obstacle,

however, for this technique to be generally applicable to all types of appli-

cations. That was the lack of a fully-automated way to transform them into

DAE.

To overcome this limitation and allow automatic DAE transformations (ini-

tially of task-parallel applications) at task granularity we develop an LLVM

[47, 48] compiler pass as proposed in Paper II [31]. The main goal of Paper

II [31] is to relief the programmer from the responsibility to manually craft

the access-phase for the DAE execution; instead the compiler was automati-

cally generating an access-phase for each task, and properly integrate it into

the program’s execution. This poses new challenges for the compiler; how

to handle different codes, such as affine of non-affine codes? How to auto-

matically generate an access phase that introduces minimal overhead due to

prefetch instructions, and yet maintains most of the effectiveness of DAE for

different classes of applications?

1.3 Extend, adapt, and apply to general-purposeapplications

The decoupled access-execute model has a great potential to improve energy

efficiency in the multi-core era, yet as we move forward to heterogeneous sys-

tems, the programming models evolve to take advantage of the heterogeneity

of those systems, which integrate only a few sophisticated (and power-hungry)

CPU cores, with many simpler (and energy efficient) GPU cores. The uprising

use of general-purpose GPU (GPGPU) computing in many fields and applica-

tions creates new challenges. Since most of the regular code can be easily

parallelized and run in a GPU or an accelerator, what is left for the sophis-

ticated CPU cores is highly irregular regions of code [3]. The situation is

aggravated by the fact that in heterogeneous systems the CPU cores might be

11

sharing resources (such as the memory bandwidth) with the GPU or the accel-

erators. This might further reduce their energy efficiency since the memory

hierarchy becomes slower as it gets more congested.

Heterogeneity is not only a great challenge for the computer architecture,

but also for the programing models and the compilers. To address this up-

coming challenge, in Paper III [40] we extend DAE methodology (and the

compiler infrastructure) for general-purpose applications. This introduces new

challenges for the compiler that now has to generate efficient access phases

for loops or basic-blocks with complex control flow, indirections and pointer

chasing. Furthermore, we consider that the optimal energy efficiency may

vary depending on dynamic information (such as execution-conditions and

system utilization); therefore, a static-only approach can exploit only partial

benefit. To address these issues we propose multi-version DAE (Paper III)

[40]. Multi-version DAE uses a combination of runtime/compiler support,

where the compiler generates different access phases —with different indirec-

tion depths each— and the runtime system instruments and selects the best

performing access phase for each application, for the given execution2 (tak-

ing into consideration the system utilization —for example, sharing last level

cache and memory bandwidth).

1.4 Improve the programmability of heterogeneoussystems

The decoupled access-execute methodology discussed in this thesis, is in-

tended to improve the energy efficiency (and in some cases the performance)

of homogeneous many-core systems, as well as one end of heterogeneous sys-

tems —that is, the latency aware CPU cores. The energy inefficiency in these

systems is a result of the performance gap between the memory system and

the processing unit, which creates stalls during the program execution on ir-

regular or memory bound applications. On the other end of heterogeneous

systems we typically have a GPU with limited general-purpose computing ca-

pabilities. The GPU is highly optimized and specialized for graphics, while its

general purpose computation capabilities are limited to a data-parallel execu-

tion model —for example, CUDA [53] or OpenCL [73]. GPUs are throughput

oriented devices with higher bandwidth compared to CPUs, and their archi-

tecture and execution model are capable of hiding most of the stalls of the

memory system. This is achieved with scheduling mechanisms (provided by

the hardware) that allow very fast switching between idle and ready to execute

threads. Since there is always work to be executed in the GPU’s execution

2This study is focused on the compiler side; therefore, it is limited in automatically generating

all possible different access phases, and omits an extended study of the runtime integration and

overheads.

12

units due to these mechanisms, the potential of running DAE on the GPU is

very limited. Therefore, for this part of the heterogeneous system, we shift our

attention on how to improve the programmability rather than the execution ef-

ficiency, which is already highly utilized.

One of the main limitations of heterogeneous systems is the ease of use,

with explicit data copies between host and device memories and synchroniza-

tion, being a heavy burden for the programmer. Both industry (e.g., HSA

foundation members [42, 26]) and academia [58, 60, 70, 21, 20] envision fu-

ture heterogeneous systems to employ a unified virtual memory (UVM) across

devices. The main design-challenge in a heterogeneous system with UVM is

to optimize it to maintain the latency sensitive nature of CPU cores even when

shared resources are under heavy stress due to high GPU utilization, especially

when cache coherence is employed system-wide. Another challenge, is to pro-

vide an energy efficient address translation scheme for the GPU with respect to

the data parallel execution model, and the massively multi-threaded architec-

ture of this device. Therefore, having a private TLB per compute unit3 might

not be the most energy efficient approach. To address these issues, in Paper

IV [41] we propose a heterogeneous system architecture capable of providing

UVM system-wide, with respect to the execution nature of the application on

each device (latency sensitive for the CPU, throughput oriented for the GPU).

In our approach we: (i) adopt the heterogeneous race free (HRF) [25] memory

model while keeping synchronization semantics backwards compatible, (ii)

propose an energy efficient GPU memory management unit (MMU), and (iii)

provide hardware coherence for the entire system.

1.5 Thesis organization

The rest of this thesis is structured as follows. Chapter 2, explains the de-

coupled access-execute paradigm, discusses its applicability, its limitations,

and evaluates a proof of concept on real systems using task-parallel applica-

tions with hand-tuned access phases. Chapter 3, details our compiler approach

to transform different types of code to DAE and evaluates the compiler effi-

ciency in this task. Chapter 4, explores the potential of multi-version DAE.

With multi-version DAE capable to adapt in different system loads and han-

dle general purpose code on the CPU, we shift our attention to heterogeneous

architectures. For that Chapter 5, explores the design space to provide unified

virtual memory in future heterogeneous systems, with minimal overheads for

the coherence and the address translations for the GPU. Chapter 7 summarizes

Chapters 2–5, while Chapter 6 discusses the related-work. Finally, Chapter 8

provides a summary of this thesis in Swedish (Svensk sammanfattning).

3Compute unit is a GPU core, using OpenCL terminology. NVIDIA’s terminology use the term

Stream Multiproccesor (SM).

13

2. Understanding the decoupledaccess-execute paradigm

One of the most widely used energy saving techniques of the last decades:

DVFS, is loosing most of its effectiveness due to the end of Dennard scaling, as

already discussed in Chapter 1. To address this challenge, and considering that

future processor generations will feature low latency, per-core DVFS [37, 38],

we propose a novel execution paradigm: the decoupled access-execute (DAE)

paradigm. Our first approach on DAE (Paper I [39]) targets task based appli-

cations (as explained in this chapter). This approach was later on extended to

automatically handle affine and non-affine codes with the help of a compiler

(Paper II [31]), and target general purpose applications (Paper III [40]) —as

explained in Chapters 3 and 4.

2.1 The key idea

To provide a deeper understanding of our technique, Figure 2.1 demonstrates

an execution example of a coupled (CAE) versus a decoupled (DAE) exe-

cution. The coupled execution is the original program execution that contains

memory accesses mixed with arithmetic operations that use the data when they

arrive from the memory hierarchy. Diagrams (a)–(c) in Figure 2.1 show the

processing-unit utilization under different frequencies. When a data-access

(load instruction) misses in the cache the processing-unit may stall, if it does

not have enough outstanding, independent instructions to execute.

Figure 2.1. Example of coupled execution under different core frequencies (a,b and c)

and decoupled execution (d)

14

In cases (a) and (b) a long latency load (LLL) causes the execution unit to

stall. Although it does not produce any useful work while stalled, the execu-

tion unit keeps consuming power for the corresponding frequency. Reducing

the frequency, however, reduces the power consumption, but it also makes

the arithmetic instructions execute slower. Therefore, it can be predominantly

beneficial only for program phases that have a lot of stalls, while it harms the

performance (and in some cases even the energy efficiency) of computation-

ally intensive programs or program-regions.

Figure 2.1.c shows an example of inefficiently scaling-down the frequency.

In this case some stall remains, while for the computationally intensive regions

the performance is degraded. An ideal DVFS implementation, would try to

set a low frequency for the memory-bound regions and instantly switch to a

high frequency for the compute-bound regions of the application. The DVFS

transition latency, however, prohibits this technique at very fine granularity (a

few instructions); instead state-of-the-art DVFS techniques select an interme-

diate frequency, which offers a compromise between performance losses and

energy savings. To overcome this limitation and maintain both good perfor-

mance (as of the highest frequency) and good energy savings (as if we could

apply ideal DVFS) we propose DAE, which transforms the code in two coarse

grain phases that we can independently set optimal frequencies1.

The first one; the access-phase is a skeleton of the original code-region that

maintains only the load instructions and converts them into software prefetch

instructions (at the deepest indirection level). All stores and arithmetic in-

structions (with the exception of address calculations) are removed to create

a side-effect free phase. This phase is intended to prefetch all the data that

the next phase (the execute-phase) will use. Therefore, all of the stalls that a

normal execution would have, are eliminated within the access-phase, which

makes it a good candidate for execution at a low frequency.

The second one; the execute-phase is the original (unmodified) region that

runs “hopefully”without stalls, since the vast majority of long-latency misses

are resolved in the previous phase. Therefore, this phase runs at a high fre-

quency and maintains an overall high performance to the application, while the

energy savings comes from the access-phase that runs on a lower frequency

(with a minimal impact on performance). Overall, DAE achieves both high

performance (as if the application was running entirely at a high frequency)

and good energy savings (as if the application was running at an optimal fre-

quency).

1Our approach requires low transition-latency for the frequency and per-core DVFS. Research

proposals envision future systems to feature such characteristics. In Paper I [39] we find that

when DAE is applied in a system with a DVFS transition latency below 500ns, provides good

energy savings for memory bound applications.

15

2.2 Task-based DAE; motivation and methodology

The wide-spread use of multi-core and many-core systems of the last decade,

necessitates a new approach to efficiently program those inherently parallel

machines. Parallel programming, has practically been proved a challenging

task for the programmer, especially for high performance computing. This

challenge led computer scientists to develop task-based programming mod-

els that reduce the complexity of typical parallel programming environments

(e.g., POSIX threads and message passing MPI). Several frameworks have

been developed over the last years such as: StarSs [56, 4], SMPSS [51], In-

tel TBB [55], Cilk5 [16], TPC [76], etc. The main advantages of task-based

programming environments are:

• Their ability to simplify synchronization.

• Their load balancing (e.g., work-stealing) capabilities.

• They provide abstractions to simplify heterogeneous programming.

Therefore, in Papers I [39] and II [31], we adopt the task based program-

ming model for DAE. In addition to the advantages described above, a task-

based model allow us to adjust the data granularity of each task, and also pro-

vides a well defined scope (task-scope) to apply DAE. Therefore, by carefully

adjusting task granularity to make its dataset fit in a private cache we are able

to set the DVFS region limits to a coarser granularity, which incurs negligible

DVFS transition overhead.

In our approach we are using a custom runtime system that implements each

task as a C/C++ function. The role of the runtime is to orchestrate the parallel

execution; that is, to allow for task synchronization, and take care of the task

scheduling and load balancing. Our runtime is kept minimal and omits sup-

port for sophisticated point-to-point synchronization between tasks (we also

omit dependency analysis and rely on simple barriers, which leads to a data

race free execution for the tasks within each epoch). This decision helps us

minimize the runtime overhead and allow the execution of fine-grained tasks.

Furthermore, the runtime system is responsible for profiling the tasks and

adjusting the frequency of the access- and execute-phases of each task. To

determine the optimal frequency for each phase, we employ a linear to IPC

power model in combination with online time profiling of each phase of the

task (as described in Paper I [39] Section 3.3). The role of the power model

is two-fold: (i) provides an energy estimate per phase, even when different

tasks are scheduled in different cores and their execution overlaps, and (ii)

allow us to predict the optimal frequency for each phase of the task (given a

hardware specific power model), and therefore to optimize either for energy,

or for energy-delay product (EDP). To validate the accuracy of our model

we compare the total power estimate against the measured value using the

methodology described in [39], and [72], with an average error of < 5%.

16

2.3 DAE case-study

0

100

200

300

400

fmin ------> fmax

Tim

e (

se

c)

PrefetchOverheadTask Time

4-DAE2-DAE1-DAE4-CAE2-CAE1-CAE

0

1000

2000

3000

4000

5000

6000

fmin ------> fmax

En

erg

y (

Jo

ule

)

Access EnergyExecute Energy

4-DAE2-DAE1-DAE4-CAE2-CAE1-CAE

Figure 2.2. The impact of core frequency on execution time and energy for CG with

1,2 and 4 threads. For CAE we use frequencies from fmin (1.6 GHz) to fmax (3.4 GHz)

with a step of 400 MHz while for DAE we use fmin for the access-phase and fmin to

fmax with a step of 400 MHz for the execute phase.

Figure 2.2 shows the execution time and energy of a decoupled (DAE)

as opposed to a contemporary (CAE) execution for different frequencies and

number of threads. In this case study we use the minimum frequency (1.6 GHz)

for the access-phase of DAE execution and vary the frequency of the execute-

phase, similar to the corresponding CAE execution. We observe that DAE

maintains the performance of the corresponding CAE execution on the same

frequency despite running the access-phase at the lowest frequency. An one-

to-one comparison between DAE and CAE shows no performance degradation

in any combination of frequency and number of threads by DAE. The key ob-

servation of Figure 2.2 is the energy efficiency as the frequency scales. Since

a significant amount of the execution-time is spent within the access-phase,

as we scale-up the frequency for the execute-phase (and therefore the perfor-

mance) the potential for energy savings in DAE increases. In case of 4 threads

at the highest frequency we observe around 25% energy savings over the cor-

responding CAE execution (with exactly the same performance).

17

2.4 Evaluation

0.4

0.6

0.8

1

1.2

1.4

1.6

Cholesky LU FFT LBM CG LibQ Cigar G.Mean

Tim

e (

No

rma

lize

d t

o M

ax F

req

ue

ncy)

Coupled A-E (Optimal frequency)Decoupled A-E (Min/Max frequency)Decoupled A-E (Optimal frequency)

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Cholesky LU FFT LBM CG LibQ Cigar G.Mean

ED

P (

No

rma

lize

d t

o M

ax F

req

ue

ncy)

Coupled A-E (Optimal frequency)Decoupled A-E (Min/Max frequency)Decoupled A-E (Optimal frequency)

Figure 2.3. Normalized execution Time and EDP of CAE versus DAE for Intel Sandy-

bridge using 4 threads (baseline is CAE at maximum frequency).

The decoupled access-execute paradigm provides a unique potential to im-

prove energy efficiency without compromises in performance. Figure 2.3

shows the execution time and EDP for different applications2. For CAE we

evaluate the performance and EDP for two different frequencies: maximum

and optimal-EDP (where maximum-frequency CAE is used as baseline), while

for DAE we evaluate a min/max and an optimal-EDP policy. The min/max is a

naive policy that runs the access-phase in the minimum and the execute-phase

at the maximum frequency. However, this might not be the optimal neither for

the access-phase because sometimes it might contain complex address calcu-

lations that benefit from a slightly higher (than the min) frequency, nor for the

execute-phase because there might be some remaining stalls, and therefore the

optimal-EDP is at a slightly lower frequency (than the maximum). For that, we

2 EDP [17] is the product of energy by time. It emphasizes energy saving with respect to

performance (EDP ∝ Ce f f fV 2T 2) where: Ce f f is the effective capacitance, f is the frequency,

V is the voltage and T is the execution time.

18

integrate the power model in the runtime-system to predict the optimal-EDP

frequency (based on each phase’s IPC).

-60

-40

-20

0

20

40

60

0 200 400 600 800 1000

% E

DP

Sa

vin

g

DVFS Transition Latency (ns)

Memory-BoundCompute-Bound

Best-caseWorst-case

Figure 2.4. Simulation of future DVFS transition latency using 4 threads for the aver-

age of memory bound versus compute bound applications

Performance-wise not only we do not introduce any degradation, but on

the contrary; for memory bound applications we achieve improvements of up

to 15% due to the MLP3 improvement of the prefetching nature of DAE. In

terms of EDP the benefit is two-fold; DAE improves the EDP both due to

the overall performance improvement, and due to the energy savings of the

access-phase. For the memory bound applications we measure EDP gains of

up to 50%, while on average we have an ≈ 25% improvement. This study

however, assumes a very-fast (almost instant) DVFS transition latency4.

Figure 2.4 models the impact of DVFS transition latency on DAE for dif-

ferent classes of applications. Memory-bound applications have a great po-

tential for energy savings, but they are also more prone to DVFS transition

latency because the duration of each of their phases is rather fine-grain. In

contrast, compute-bound applications have a potential for slight benefit, but

they are also less affected by the transition overhead because the duration of

their execute-phase is rather coarse-grain (such applications typically involve

floating point arithmetic). For the average of all applications we observe that

a DVFS transition-latency below 500ns allows for energy and EDP improve-

ments to memory-bound applications, without harming the energy or EDP of

compute-bound applications. For contemporary systems that do not feature

low-overhead, per-core DVFS, DAE can be used with disabled DVFS func-

tionality. This maintains a fraction of the benefit for memory-bound appli-

cations; that is, performance improvements (and the EDP saving that derives

from them) due to increased MLP.

3Memory-level parallelism (MLP) is the architecture ability to support multiple outstanding

memory operations.4The evaluation methodology is detailed in Section 3.4 of Paper I [39]

19

3. Compiler support for the decoupledaccess-execute paradigm

The decoupled access-execute paradigm relies its efficiency on the access-

phase to prefetch the right data while keeping the instruction overhead min-

imal. A straight-forward approach to generate the access-phase is to create

a lightweight skeleton of the execute-phase that omits all unnecessary arith-

metic calculations and stores. To maintain correctness the access-phase must

be side-effect-free. Therefore, stores to variables outside the task-scope, or

function-calls that directly or indirectly modify values outside the task-scope

are forbidden in the access-phase. In Paper I [39], we used manually crafted

and optimized access-phases based on theses guidelines.

The challenge of Paper II [31] is to fully automate this procedure with sup-

port from the compiler (e.g., as an LLVM pass [47, 43]). Conceptually, the use

of a prefetching access-phase before the actual task increases the number of

instructions and adds overhead to the total execution. This overhead is com-

pensated by the speedup of the execute-phase. However, there is no guarantee

that the total execution time (both access and execute phases) in DAE will be

smaller or similar to the original (CAE) execution time. To ensure that DAE

will not penalize the total execution time due to excessive instruction overhead

in the access-phase, the latter must be kept instruction-minimal with respect

to the following:

• No duplicate address prefetching (or calculation). The data used in

the execute-phase must be prefetched exactly once in the access-phase

even in deep loop nests with high temporal locality. Similarly, address

calculations should also be kept minimal, which suggests loops transfor-

mations to reduce the original loop nest in some cases (e.g., affine-codes)

• No store prefetching. Empirically we found that stores are less critical

to create stalls in the pipeline; therefore, we decided not to prefetch them

to keep the access-phase more lightweight.

Although Paper I [39] targets task parallel applications, DAE is applicable

to various types of codes. Our effort to build a compiler infrastructure for DAE

starts with Paper II [31], in which we target affine and non-affine codes similar

to Paper I and extends to Paper III [40] in which we add support for general-

purpose codes (e.g., SPEC-CPU (2006) [22]). The current chapter (Sections

3.1–3.2) details the transformations presented in Paper II [31], and introduces

(Section 3.3) the approach of Paper III [40], which is detailed in Chapter 4.

20

3.1 Affine code transformationAffine code transformations apply to loop nests with static control structures,

where data accesses can be represented as affine functions of the enclosing

loop indices. Typical examples of such codes are linear algebra codes (e.g.,

matrix multiplication, LU, Cholesky, etc). For these codes, we employ the

polyhedral model [6, 67] (available in LLVM through PolyLib [57]) to gener-

ate the access-phase. The polyhedral model is a powerful mathematical model

that provides a geometric representation of software loops. This allows the

compiler to reason about complex loop transformations such as loop unrolling,

fusion, fission, and parallelization. In our compiler pass for DAE, the polyhe-

dral model is used to determine the set of unique memory addresses accessed

by each task. Therefore, it allows us to generate an access-phase with the

minimal loop nest to prefetch all the addresses accessed in the task.

Listing 1 Loop nest accessing different arrays

( a ) /∗ Execu te v e r s i o n ∗ /

f o r ( i = 0 ; i < Block ; i ++)

f o r ( j = i +1 ; j < Block ; j ++)

f o r ( k =0; k < Block ; k ++)

A[ j ] [ k ] −= D[ j ] [ i ] ∗ A[ i ] [ k ] ;

( b ) /∗ Access v e r s i o n ∗ /

f o r ( i = 0 ; i < Block ; i ++)

f o r ( j = 0 ; j < Block ; j ++)

p r e f e t c h : A[ i ] [ j ]

p r e f e t c h : D[ i ] [ j ]

Listings 1 and 2 and show an example of loop nests accessing different

arrays or different tiles on the same array and their corresponding access-

phases1. In Listing 1 the original task is accessing tiles on different arrays

(A and D). Therefore, we first compute the convex hull for each array and

then merge them into the same loop nest (only if they have the same number

of iterations) as shown in the second code-region of Listing 1. Otherwise, we

generate two separate prefetch loop nests. 2.

Listing 2 illustrates another case where the original task is accessing dif-

ferent tiles on the same array. In this case a convex hull would include (in

1Paper II [31] discusses the details of each case illustrated in Listings 2 and 1. This thesis gives

a high-level overview and omits technical details.2One may observe that for array D only the elements below the diagonal are accessed, while the

loop bounds are the same as with array A. This creates a dilemma whether we should generate

prefetches for the upper part of array as used in Listing 1 or use separate prefetch loop nests

for A and D. Through experimentation we find that a single loop nest has less overhead, and

therefore use it in the rest of Paper II and this thesis

21

Listing 2 Loop nest accessing blocks of the same array

( a ) /∗ Execu te v e r s i o n ∗ /

f o r ( i = 0 ; i < Block ; i ++)

f o r ( j = i +1 ; j < Block ; j ++)

f o r ( k = i +1; k < Block ; k ++)

A[ Ax+ j ] [ Ay+k ] −= A[ Dx+ j ] [ Dy+ i ] ∗ A[ Ax+ i ] [ Ay+k ] ;

( b ) /∗ Access v e r s i o n ∗ /

f o r ( i =0 ; i <Block ; i ++)

f o r ( j =0 ; j <Block ; j ++)

p r e f e t c h : A[ Ax+ i ] [ Ay+ j ]

p r e f e t c h : A[ Dx+ i ] [ Dy+ j ]

addition to the blocks) the entire area between the accessed-blocks. To over-

come this excessive prefetching we separate the memory accesses into classes

with the same parameters (e.g., Ax and Ay for class A, and Dx and Dy for classD). Similar with the previous case we create two separate convex hulls and

merge them only if the loop bounds are the same, as shown in the last code-

region of Listing 2. This approach minimizes the instruction overhead, while

it allows for prefetching of the exact address ranges.

3.2 Non-Affine code transformation

For non-affine codes, which are non amenable to polyhedral optimizations we

use simple heuristics to generate the access-phase. We start this approach by

generating a simplified skeleton of the original code that is further optimized

to produce acceptable results in a wide variety of access patterns following the

guidelines discussed in the beginning of this chapter.

One of the major roadblocks to generating efficient access-phases is that

maintaining the entire control flow graph (CFG) may result in an access-phase

very heavyweight, which replicates a lot of the original computation. This is

very common when there is a control flow data-dependency. In this case we

either have to preserve the entire CFG part required to calculate the control

flow statements, or exclude the entire block from the access-phase. Unfortu-

nately, there is no rule of thumb for this dilemma. In Paper II [31], we find

through experimentation that keeping the access-phase minimal yielded better

results for the set of evaluated applications.

22

3.3 General purpose code transformation

Although a naive, lean access-phase approach for non-affine codes as the one

used in Paper II [31] yields sufficient results for scientific applications, it is

not suitable for all types of general purpose code (e.g., SPEC-CPU (2006)

[22]). In fact, there is no single remedy for the wide variety of access patterns

and control flow graphs met on general purpose applications. Therefore, the

compiler must consider the trade-off between address-calculation overhead

and prefetching effectiveness.

Furthermore, various architecture-specific parameters (such as cache size

and memory bandwidth) and non (such as system utilization) affect the ben-

efits of prefetching. To address these issues, Paper III [40] proposes a soft-

ware multi-versioning scheme, which uses indirection-measure as the prefetch

throttle. By controlling the number of indirections for the generated prefetch

instructions we are able to selectively exclude heavy-weight prefetches (with

deep indirections), whose overhead outweigh the benefit. However, transfor-

mation for these codes is closely intertwined with multi-versioning; therefore,

they are detailed in Chapter 4.

3.4 Evaluation

Our primary concern when automatically generating access-phases for the

DAE paradigm is the instruction overhead. A small instruction-overhead can

be easily tolerated in contemporary CPUs, especially when stalls exist —as

in the cases that DAE targets. Excessive overhead however, may result in a

severe performance degradation and reduce the potential for energy savings.

Therefore, our first concern is to ideally maintain the performance of a CAE

execution at highest frequency, or at least introduce a minimal degradation as

a trade-off for the energy/EDP savings. Figure 3.1.a shows the normalized

execution time for different execution modes such as: CAE at its optimal EDP

frequency, hand-crafted (manual) DAE both with min./max. and OptEDP poli-

cies, and compiler generated DAE (both min./max. and OptEDP).

Overall, DAE applications with auto-generated by the compiler access-

phases perform similar or even better than manually-crafted ones. On average

a naive DVFS policy (min./max.) introduces a minimal performance degrada-

tion ≈ 1−2% (worst-case up to 10%). Figure 3.1.b shows the energy savings

from DVFS between DAE (hand-crafted versus compiler-generated) and CAE.

Our first observation is that compiler-generated DAE slightly outperforms the

hand-crafted versions, while CAE achieves around 7% better energy savings

at the cost of 15% performance degradation. In terms of EDP as shown in

Figure 3.1.c, we achieve the same EDP with CAE at its optimal frequency

(≈ 25% lower EDP) but with significantly higher performance (on average

15% faster).

23

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

LU Chol. FFT LBM LibQ Cigar CG G.Mean

(a) Time (Normalized to Max Frequency)

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6


(b) Energy (Normalized to Max Frequency)

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6


CAE (Optimal f.)Manual DAE (Min/Max f.)Manual DAE (Optimal f.)

Compiler DAE (Min/Max f.)Compiler DAE (Optimal f.)

(c) EDP (Normalized to Max Frequency)

Figure 3.1. Performance, Energy, and EDP results on a quadcore Intel Sandybridge,

assuming state-of-the-art 500ns frequency scaling transition latency.

24

4. Multi-version decoupled access-execute(SMVDAE)

General purpose codes (e.g., SPEC-CPU (2006) [22]) is the most challenging

class of applications to optimize for DAE. Their control flow graph is more

complex than typical scientific or linear algebra codes, with data dependen-

cies in conditional statements, while their memory access-pattern may involve

pointer chasing or deep indirections. Such access patterns are amenable to

prefetching, since they create a lot of stalls; however, we cannot statically

determine how much address calculation overhead can be tolerated on each

application.

To attack this problem we propose a software multi-version decoupled access-execute (SMVDAE) scheme (Paper III [40]). In this approach we statically

generate different access-phases, with different number of indirections —for

brevity we call them access-versions— and we select the best-performing ver-

sion at runtime. This technique not only allow us to optimize codes with

statically unknown prefetching overhead, but also allow us to optimize the ap-

plication under different system utilization. For example, in a multicore sys-

tem where a single-threaded application runs in isolation (not sharing memory

bandwidth or last level cache) a more lightweight prefetching DAE scheme

may yield the best EDP, while if the same application runs in parallel with

other processes (different applications, or instances of the same application),

a deeper level of indirection in the access-phase could yield better results.

This adaptation is particularly useful in heterogeneous systems, where the

CPU may share the same memory with the GPU, which generates tremendous

pressure to the memory system. Furthermore, with the increasing offloading

of easy to parallelize code to GPUs and accelerators in the heterogeneous era

[3], CPU is expected to run only the complex and memory-bound parts of the

applications. This necessitates an adaptive, dynamic approach as the one we

propose in Paper III [40]. Our approach is not only adaptive (selects the best

access-version when DAE is able to provide benefits), but is also self-healing

—that is, it reverts back to CAE when it detects that DAE does not provide the

requested benefits, so it always runs the application in its optimal mode.

25

4.1 SMVDAE definitions and methodology

x = a [* b +c - > d [* e ]]; t1 = load bt2 = gep c, 0, index_of_dt3 = load t2t4 = load et5 = gep t3, 0, t4t6 = load t5t7 = add t1, t6t8 = gep a, 0, t7x = load t8

A load statement in C and its SSA representation, resembling the LLVM IR; “gep” is the GetElementPointerInst indexing instruction from LLVM.

Figure 4.1. The number of indirections of an address A is given by the number of

loads required to compute A

SMVDAE provides us with a knob: the number of indirections, to control

how light/heavy-weight an access-phase can be to provide benefits. We de-

fine number of indirections as a metric for the number of intermediate loads

required to calculate the effective address for a memory location. Figure 4.1

shows an example of a load operation and its intermediate representation in

LLVM (LLVM-IR). In this example a total of four intermediate loads (t1, t3,

t4, and t6) are required to calculate the effective address of x. Algorithm 3

summarizes the procedure to measure indirection-count.

Algorithm 3 Finding the number of indirections for an address A

1. If address A is not produced by an instruction (e.g. a global variable),

stop; the level of indirection is 0. Otherwise, continue.

2. Find the set D of dependencies for A (by following steps 2-4 in Algo-

rithm 4, using D in place of K).

3. The level of indirection is the number of load instructions present in D.

Figure 4.2 shows the overall system design. At user-level (source code)

no modification of the application is required. The compiler is responsible to

generate all different access versions up to a user defined maximum number

of indirections. The programmer only needs to provide hints to the compiler

for the region of interest (ROI) to optimize using DAE. This is a list of func-

tion names, which can be acquired through profiling. This list of functions

is provided as a compilation parameter through an external text file. Further-

more, the compiler can perform more complex transformations for DAE such

as loop chunking and apply DAE to each chunk separately. This ensures that

each chunk’s data-footprint is small enough to fit in a private cache. Option-

ally, the programmer is allowed to select a fixed number of indirections for the

entire execution or set the maximum indirection number.

After the compiler has generated all the available access-phase versions a

runtime system is responsible to profile and orchestrate the execution. The

26

Runtime

libVER libDVFS

Versionselection

Timeprofiling

IPCprofiling

Frequencyselection

LLVM

Version Generation

Platform Independent Platform SpecificUser Application

Figure 4.2. The overall system design demonstrating all system components to support

multi-versioning

runtime consists of two parts; one part is system independent and is responsi-

ble for time profiling and version selection, while the other is platform specific

and profiles the IPC through hardware counters, embeds a power-model to es-

timate execution energy, and optionally DVFS (when applicable) each DAE

phase to its optimal frequency.1

loopbody

for(;;) {

}

SMVDAE

Multiple Access phasesEach Access phase is constructed to prefetch di erent amount of data. Early Access phases prefetch less data, but are lightweight. Later Access phases prefetch more data, are more e ective but also more heavyweight.

ExecuteThe execute phase is unique as it is merely the original code, which became compute bound asmore data is brought to L1 by Access.

A0 E E EA1 EA1 EA1A1 A2 ECAE ...

for(granularity) { // select Access

}

E

A0 A2A1

for(;;) {

}

Com

pile

-tim

eRu

n-ti

me Loop execution in slices of granularity=10

Iterations0 10 20 30 40 50 60

The loop is executed in slices, each slice evaluating rst the original loop version, next each pair Access-Execute. The best performing Access-Execute pair is selected to complete the loop.slice 0 slice 1 slice 2 slice 3 slice 4 slice 5 ...

Figure 4.3. SMVDAE compilation and execution model: The compiler generates mul-

tiple Access versions which are evaluated at runtime and the best performing Access-

Execute pair is selected to complete the loop execution.

For the DVFS runtime, we extend the linear to IPC power model used in

our previous work of Papers I [39] and II [31] to support newest architectures

1In the implementation of Paper III [40], we focus our study on the compiler techniques to

generate multi-version DAE. Therefore, we omit the details and the evaluation of a selection-

runtime for future work. Instead, we perform a brute-force evaluation by generating all available

versions and evaluate each one for the entire program execution.

27

(e.g., Intel Haswell [18]). Since the DVFS transition latency is very costly, we

use a methodology similar to Paper I to emulate low-latency per core DVFS,

while performance measurements are conducted in real hardware at the opti-

mal frequency suggested by the power model. For the version selection we use

a simple heuristic —that is, to select the best-performing access version with

respect to the total access-execute runtime. For this we evaluate all access-

execute pairs as shown in Figure 4.3 including a CAE run (only execute), once

at the beginning of each loop and use that for the rest of the program execution.

This approach trades off adaptivity to minimize selection overhead.2

4.2 SMVDAE code generation

Algorithm 4 Set K collects the instructions preserving the CFG

1. Add all instructions that directly define the CFG to set K.

2. For each instruction I in K, add all instructions that produce a value

required by I.

3. If I is a load instruction, add all store instructions that may store the

value that I reads.

4. Repeat steps 2 and 3 until no choice of I will cause further instructions

to be added to K.

Since multi-versioning allow us to select how light/heavy-weight the access-

phase can be, the code generation for SMVDAE is entirely different from our

prior work for non-affine codes of Paper II [31]. In contrast to our previous

work, we do not employ heuristics to omit parts of the CFG in the access-

phase with respect to their overhead (as in Paper II). Instead, we build a CFG

skeleton with a minimal set of instructions that preserve the entire controlflow graph of the original function, and adjust the overhead by controlling the

number of indirections.

Algorithm 4 summarizes the methodology to collect the instructions to gen-

erate a CFG skeleton. The algorithm first adds all instructions that directly de-

fine the CFG (conditional/unconditional branches and jumps) into a set. Then

it processes each instruction within the set by adding its dependent instruc-

tions (steps 2 – 3). Finally, the algorithm terminates when no new instructions

can be added.

A major concern with this approach is to preserve a side-effect-free access-

phase. For that, we perform an alias analysis using the built-in LLVM Alias-Analysis pass. If we detect that a store instruction aliases with a global variable

or an address outside the scope of the access-phase, then the access-phase is

considered non-safe and is discarded. Furthermore, the access-phase is con-

2An extensive study of the selection-runtime overhead and adaptivity is left for future work

28

sidered non-safe, when there is a CFG dependence to a non-read-only func-

tion call. The methodology to detect store aliasing is detailed in Paper III [40]

(Section 3.1.1). Section 3.3 of the latter proposes different solutions to handle

statically unknown memory dependencies.

4.3 Evaluation

0

0.2

0.4

0.6

0.8

1

0 2 3

Nor

m. E

xec.

Tim

e

Time(A) Time(X)

0

0.2

0.4

0.6

0.8

1

0 2 3

Nor

m. E

nerg

y

Energy(A) Energy(X)

(a) Mutli-version performance/energy for mcf

0

0.2

0.4

0.6

0.8

1

0 1 2 9 10

Nor

m. E

xec.

Tim

e

Time(A) Time(X)

0

0.2

0.4

0.6

0.8

1

0 1 2 9 10

Nor

m. E

nerg

y

Energy(A) Energy(X)

(b) Mutli-version performance/energy for mri-g

Figure 4.4. Normalized execution time and energy for all generated versions for mcf,and mri-g. The x axis displays the maximum number of indirections per version.

Experiments were conducted on an Intel i7-4790K CPU.

We begin the evaluation of multi-version DAE with a case-study of two

applications: mcf and mri-g, as shown in Figure 4.4. In this study we show

the benefit of multi-versioning for codes in which we cannot statically predict

the best version (indirection count) for the access-phase. For each of the ap-

plications we observe that the number of indirections has a significant impact

both on performance and energy efficiency. For mcf the best performance and

energy efficiency is achieved with the maximum number of indirections in the

access-phase (up to 3 indirections) —that is, 30% in performance and 45%

energy. In the case of mcf all different access versions have an impact on per-

formance and energy efficiency, while only one yielded the optimal results,

exceeding by far all other available versions.

29

A different situation is observed for mri-g, in which the first three versions

yield no benefit (neither in performance nor energy), while as we increase the

number of indirections we improve both performance and energy. The optimal

access version for this application uses 9 indirections3, and achieves ≈ 17%

performance improvement and ≈ 20% energy savings.

0

0.2

0.4

0.6

0.8

1

1.2

Nor

mal

ized

Exec

utio

n Ti

me

1P (A) 1P (X) 4P (A) 4P (X)

0

0.2

0.4

0.6

0.8

1

1.2

g.mean

1P 4P

(a) Execution time for different system loads

0

0.2

0.4

0.6

0.8

1

1.2

Nor

mal

ized

Ener

gy

1P (A) 1P (X) 4P (A) 4P (X)

0

0.2

0.4

0.6

0.8

1

1.2

g.mean

1P 4P

(b) Energy for different system loads

Figure 4.5. Normalized execution time and energy for a selection of SPEC-CPU

(2006) benchmarks, under different system loads on an Intel i7-2600K. 1P is 1 in-

stance of the application in isolation, versus 4 instances (4P) of the same application

(one per core).

Beyond optimizing the access-phase for codes that their optimal access-

version cannot be statically known, SMVDAE, also allow us to optimize the

3The optimal access version is not always the one with the more indirections. In mri-g it is the

second before maximum indirections with small difference from the max-indirections version.

In other applications it might be an intermediate version, although for brevity we exclude those

in our case study of Figure 4.4.

30

access-phase with respect to the system load. Figure 4.5 shows the benefit

provided by SMVDAE when an application runs in isolation, versus running

multiple instances of the same application (that share resources). In Figure

4.5 we use a selection of memory bound applications from SPEC-CPU (2006)

benchmark suite, where DAE is predominantly beneficial. The first observa-

tion in this figure is that DAE has a great potential for improvements (both in

energy and performance) when the system is under heavy load (e.g., running

4 instances). Optimizing for different system loads separates the applications

in two categories: (i) highly apt to DAE with respect to both performance and

energy, and (ii) apt to DAE under heavy load with respect to energy savings.

In the first category we classify mcf and milc. These applications provide

good performance improvements and energy savings when run in SMVDAE

in both cases (1 instance and 4 instances). In the second category we clas-

sify libQ, soplex, sphinx and lbm. These applications do not provide perfor-

mance improvements (with exception of lbm that yields a small speedup in 4

instances). However, they have a potential for energy savings when running

under heavy system load (4 instances) because a significant amount of their

execution time is spent within the access-phase, for which we scale down the

frequency. On average for this selection of memory bound SPEC applications

we have a ≈ 10% performance improvement and up to ≈ 35% energy savings.

We conclude this section by evaluating the majority of SPEC CPUTM2006

in two different Intel architectures (i7-2600K and I7-4790K), as shown in Fig-

ure 4.6. The newer Intel processor (i7-4790K) has a slightly increased fre-

quency, which augments the gap between the processing-unit and the mem-

ory system. Therefore, this architecture yields even more promising results

for DAE. More specifically, we achieve over 20% energy savings and ≈ 25%

lower EDP, while in the best case we achieve over 60% lower EDP and energy

savings. For the heavily compute-bound applications multi-versioning either

reverts the execution to the original CAE to avoid hurting the performance due

to excessive prefetching overhead, or select a very lightweight access-version,

which yields minimal benefits.

The SMVDAE study (Paper III [40]) concludes our effort to optimize en-

ergy efficiency of many-core and multi-core CPUs. To summarize; so far in

this thesis, we proposed decoupled access-execute paradigm and evaluated its

performance and efficiency in real systems (e.g., Intel CPUs). We proposed

a compiler methodology to automatically handle various classes of applica-

tions (affine, non-affine and complex general purpose). Finally, we propose

a software multi-version DAE (SMVDAE), which allow us: (i) to optimize

codes with statically unknown prefetching overheads and deep indirections,

and (ii) to adapt under different system loads. The DAE paradigm yields very

good energy and EDP savings due to its ability to optimize energy without

compromises in performance. The next chapter proposes a novel, complete

heterogeneous system architecture with unified virtual memory, to ease pro-

grammability and improve the overall performance and energy efficiency.

31

0

0.2

0.4

0.6

0.81

1.2

Normalized Time / Energy / EDP

Tim

e (A

)Ti

me

(X)

Ener

gy (A

)En

ergy

(X)

EDP

(Tot

al)

0

0.2

0.4

0.6

0.81

g.m

ean

Tim

eEn

ergy

EDP

0

0.2

0.4

0.6

0.81

1.2

Normalized Time / Energy / EDP

Tim

e (A

)Ti

me

(X)

Ener

gy (A

)En

ergy

(X)

EDP

(Tot

al)

0

0.2

0.4

0.6

0.81

g.m

ean

Tim

eEn

ergy

EDP

Figu

re4.

6.O

ver

all

per

form

ance

,en

erg

y,an

dE

DP

imp

rovem

ents

on

Inte

li7

-26

00

Kan

di7

-47

90

K

32

5. Building heterogeneous UVMs without theoverhead

As already prefaced, the decoupled access-execute paradigm is not a good fit

for GPUs, because their hardware is already designed to tolerate long latency

stalls. Therefore, we do not provide a DAE version for GPUs; instead we try

to improve the programmability of fused (CPU-GPU) heterogeneous systems.

The main obstacle on these systems is the difficulty to maintain data coher-

ent across devices. In a heterogeneous system without unified virtual mem-ory (UVM), the programmer has to copy data back and forth to the device to

maintain consistency. Furthermore, for this to be efficiently implemented the

programmer should consider the benefits of overlapping data transfers with

computation in the GPU, which makes this approach even more complicated.

Various research proposals [59, 58, 20, 70] recognize the need for UVM.

However, there are two different tendencies based on the memory consistency

model. The first, employs sequential consistency to provide UVM (e.g., Power

et al. [59] and [58]), while the second adopts a more relaxed memory consis-

tency, as in Hechtman et al. [20] and Singh et al. [70]. Our experimentation

with the state-of-the-art simulator for heterogeneous systems [59], shows that

using sequential consistency across a heterogeneous system introduces long

access-latencies to the CPU when accessing data owned by the GPU. Further-

more, a naive directory-approach such as the one used in [59] suffers great

performance losses due to directory and GPU’s last-level-cache contention (as

quantitatively evaluated in Paper IV [41]). Therefore, we adopt a release con-

sistency memory model, and more specifically the heterogeneous-race-free

HRF [25] model. Our effort is many-fold and is outlined below:

1. Ease of use. A pointer is after all a pointer; therefore, our proposal

employs unified virtual memory (UVM) between devices.

2. Optimize each device with respect to its needs, and in harmony withthe entire system. The CPU is after all latency-sensitive; therefore, in

our approach we try to protect the CPU from slow interactions with the

GPU.

3. Keep the architecture simple. Employing sequential-consistency across

an entire heterogeneous system, introduces a huge bottleneck: the direc-

tory and the last level GPU cache. Therefore, in our approach we use a

directory-less coherence protocol, which simplifies the architecture.

4. Optimize the GPU for general-purpose GPU (GPGPU) computing.

Finally, we optimize the write-policy of the GPU caches and the GPU

memory management unit (MMU) for GPGPU applications.

33

5.1 Synchronization and memory consistency model

Figure 5.1. Different synchronization primitives of a heterogeneous x86/CUDA sys-

tem.

GPUs today, use a weakly-ordered memory model to make writes visible

in different levels of the memory hierarchy. This requires some effort from

the programmer to define the synchronization scopes, but provides a potential

to reduce NoC traffic since writes do not need to become visible to the entire

system immediately. Therefore, explicit synchronization is standard in het-

erogeneous GPGPU computing today. Furthermore, state-of-the-art proposals

such as the heterogeneous-race-free HRF [25] model, recognize the benefits

of explicit synchronization.

In our system proposal we take into consideration both backwards compat-

ibility (for legacy GPGPU codes) and future-system proposals (e.g., HRF).

Therefore, in Paper IV [41] we adopt a relaxed-consistency memory model

similar to HRF. Figure 5.1 shows a high-level overview of the synchronization

scheme employed in our proposal. For the CPU, we use pthread_barrier_wait()or pthread_mutex_lock() /unlock() for synchronization within CPU cluster.

These synchronization-primitives selectively-flush the first level cache of the

cores involved (write-back the dirty-shared blocks, and invalidate all the shared

blocks after writing-back). As long as these data are not required by the GPU,

no further action is required in the system. To make these changes visible in

the GPU, however, we facilitate the programmer with the cpu_flush_llc() prim-

itive, which flushes of the last level CPU cache, making all updates within the

CPU cluster visible in the entire system.

Similarly, GPU synchronization is explicitly declared by the programmer

for each scope. For the GPU synchronization, we adopt the CUDA [53, 54]

synchronization semantics (unmodified) in our system, as shown on the right-

side of Figure 5.1. For example, _threadfence_block() is used to make writes

performed by the calling thread visible to all threads in the thread-block. Sim-

34

ilarly, threadfence() ensures that writes become visible by all threads in the

device, while _threadfence_system() makes them visible system-wide. Al-

though these synchronization semantics are seemingly similar to the program-

mer, they incur different overheads, and their implementation faces different

challenges (based on the system-components that participate), as explained in

the next section.

5.2 VIPS-G, heterogeneous system coherence

Our starting point to efficiently support the HRF memory model, is by devel-

oping a coherence protocol that can truly address the needs of a heterogeneous

system, and preserve the characteristics of each device (e.g., CPU is latency

sensitive, while GPU is throughput-oriented). For that we develop a heteroge-

neous variation of the VIPS family of protocols [64, 35, 63]. The key idea of

the VIPS protocol (and most of its variations) is to provide a simple, scalableprotocol, that rely on explicit synchronization primitives to create race-free

execution regions of the applications (epochs). The basic version of VIPS:

VIPS-M [64], which we employ for the CPU coherence, is flat, directory-lessand uses page-classification stored in the TLB and the page-table. Each page

can be either private, shared, or invalid, which are the stable states of the

protocol. For the private pages a write-back policy is selected to reduce un-

necessary traffic to the lower-level, while for the shared pages a write-through

policy is employed. To reduce traffic for shared blocks we use coalesced (de-

layed) writes.

Based on the VIPS protocol specifications blocks belonging to private pages

can cross synchronization points without flushing. In contrast to them, shared

blocks have to be written-back the next level of the memory hierarchy and

self-invalidate. This guarantees that the data will become visible to other cores

after synchronization and that the calling thread will not read stale data, since

all shared data are invalidated and the new values are read from the next level

of the hierarchy. In VIPS-M version [64] which we employ in our approach

this classification is held at the TLB and the page-table. Such an approach is

flat —that is, when a page becomes shared it is considered shared for all cores

even between different clusters, and non-adaptive —that is, when a page be-

comes shared it remains shared until it is evicted from the page-table. Further

variations of the protocol such as VIPS-H (hierarchical) [63] use more elabo-

rate mechanisms to maintain classification hierarchically across clusters.

Although VIPS-M offers the advantage of simplicity and is employed un-

modified within the CPU cluster, its non-adaptive nature prevents us from us-

ing it to maintain coherence across devices. VIPS-H is also non-adaptive;

furthermore, it requires multi-level classification, and involves multiple inter-

rupts in the transition of a locally-shared page to globally-shared. This pre-

vents us from using it to maintain coherence between devices. Therefore, for

35

Figure 5.2. Logical diagram of a homogeneous - hierarchical (on the left), versus an

asymmetrical (heterogeneous) - adaptive (on the right) memory organization for data

classification.

inter-device coherence between the CPU and the GPU we develop VIPS-G, a

heterogeneous variation of the VIPS-M protocol that enables adaptive classifi-

cation between devices, while allowing VIPS-M to operate unmodified within

the CPU cluster. Figure 5.2 shows a logical diagram of VIPS-G (right) as

opposed to a homogeneous - hierarchical variation of the protocol (left).

The key idea to provide adaptability between devices is to initially lookup if

the GPU contains the page before accessing the page-table. This information

is volatile, and is available to the CPU through a mirror-TLB (which acts as

a shadow-directory). If an entry exists in the mirror-TLB, then the page is

shared and therefore, a write-through policy is selected; otherwise the page is

considered private and a write-back policy can be safely employed for all the

blocks of the page. This allows a page that is shared only between CPU cores

to be treated as private by the CPU’s last level cache, for as long as it is not in

use by the GPU, become shared when the GPU access it and for as long as it

exists in its TLB, and finally revert back to private when it is evicted from the

GPU TLB1.

5.3 Optimizations for GPGPU computingGPUs are graphics accelerators with limited general-purpose computing ca-

pabilities (GPGPU). Their programming model is more restrictive compared

to CPUs, and the hardware has also been designed for graphics, geometri-

cal transformations and fast texture processing. Although these devices have

caches and they become increasingly important, some design decisions of the

past were not taken with GPGPU application in mind. Therefore, in our work

beyond providing UVM, we also try to optimize them for GPGPU computing.

State-of-the-art approaches such as [59] and [58], envision GPUs to employ a

non-allocate write-invalidate policy for the GPU L1s. Although such a policy

1The adaptability between devices, involves two transitions: private to shared and shared toprivate. The mechanisms to handle these transitions are detailed in Paper IV [41]

36

might be sufficient for graphics, and further simplify the coherence within the

device, it has a huge impact on the performance of GPGPU applications as we

show on our work of Paper IV [41]. Therefore, in our approach we employ a

write-allocate policy for all GPU caches (L1s and L2).

To maintain coherence within the GPU cluster we use the standard approach

of a valid/invalid VI protocol that provides relaxed memory consistency (em-

ployed both in commercial GPUs and in academic studies [59, 58, 20, 70]).

Therefore, no data classification is used within the GPU. Instead, all valid

data are treated as shared and the L1s are flushed at synchronization points.

Writing-back all dirty blocks at once during synchronization, however, gener-

ates burst-traffic that could enormously increase the synchronization overhead.

Therefore, for shared data we use a write-through policy. This has the draw-

back of generating continuous traffic on writes, which is potentially unneces-

sary when many writes to the same block may coalesce due to spatial/temporal

locality. Therefore, to reduce this traffic we employ small coalescing write-

buffers near the caches for the shared blocks.

Finally, for the GPU we propose an energy-efficient memory management

unit (MMU) that is built upon virtual coherence and a shared TLB. Driven by

the coherence-protocol that does not create any upwards traffic (from L2 to

the L1s) we employ virtual coherence for the GPU L1s similar to [35]. This

eliminates the need for private TLBs at each GPU core (as opposed to other

academic proposals [60]); instead, we use a shared TLB which is accessed

before accessing the L2. For as long as we hit in the GPU L1, no translation

is required. The translation is required only for accessing the last level (L2)

cache of the GPU. Our MMU approach, combines the high-effectiveness of a

big shared-TLB with only a fraction of the energy requirements of an MMU

with private TLBs. Our final optimization in the system, is to maintain the

state of the cache-block (V/I bit) externally in the TLB (similar to [68, 69])

for the CPU-L1s and the GPU-L2. This allows for fast bulk reset of the shared

blocks of each page selectively, or for all shared pages.

Figure 5.3 shows the overall design of our system including the TLBs,

caches, and address spaces. The CPU is a typical multi-core with few OoO

cores having private, discrete instruction- and data-caches, and TLBs; the

CPU cores share a big L2 cache. The CPU-L1 caches are virtually-indexed,

physically-tagged (VIPT), while the L2 is physically-indexed, physically-tagged

(PIPT). Near the CPU-L2 is the mirror-TLB, which is used in inter-device co-

herence. The GPU is modeled using a custom many-core design. In this design

we have a unified instruction-data L1 cache per core and, a shared L2 and TLB

across all cores. The GPU L1s are virtually-indexed virtually-tagged (VIVT),

while the L2 —similar to the CPU-L2— is PIPT. The page-classification is

maintained at the page-table and the TLBs; therefore, those mechanisms also

have a role of maintaining coherence in the system (with the help of the OS).

37

Figure 5.3. The memory hierarchy proposed for a fused CPU/GPU system showing

the address spaces, the caches, and the TLBs.

5.4 Evaluation

We evaluate our work using gem5-gpu [59] simulator, which unites gem5 [7]

and GPGPU-Sim simulators [5]. Both support detailed simulation mode. The

coherency is modeled using the RUBY system as available in gem5. For the

power estimates a combination of GPU-Wattch [44] and McPAT [45] is used.

As baseline we use the reference implementation of gem5-gpu. This imple-

mentation, uses VI-Hammer protocol; a directory-protocol based on MOESI

for the CPU and VI for the GPU. Our experimentation shows that the direc-

tory is the bottleneck of the system, and severely penalizes the accesses that

go through it or modify the coherence state. Related work (Power et al. [58])

also recognizes the problem and proposes a regional coherence approach (us-

ing a regional-directory). In our work this bottleneck is inherently removed by

removing the directory, while page classification is kept at the TLBs, which

is a similar approach to regional directories, except that it has minimal extra

overhead.

1

1.5

2

2.5

3

backprop bfs

gaussianhotspot

lavaMD ludsrad

streamcluster

particlefilte

rstencil

octreepart

wavegmeanV

IPS

-G S

pe

ed

up

(o

ve

r V

I-H

am

me

r)

VIPS-G

Figure 5.4. Normalized speedup.

38

We perform our evaluation using a subset of the Rodinia [9] benchmark

suite, which has been modified by the gem5-gpu community to work with uni-

fied memory address space. Additionally, we include a 7-point stencil kernel

from parboil [74] benchmark suite and two applications: wave and octreepart,

from the workloads used in [70]. Figure 5.4 show the speedup of our approach,

compared to a naive directory approach. On average we get ≈ 45% speedup

and ≈ 45% EDP reduction (≈ 20% energy reduction) as shown in Figure 5.5.

In some cases —for example, lavaMD and streamcluster the speedup is up

to 3x. There are two reasons for the performance improvements (detailed in

Paper IV): (i) the GPU-L1 miss-ratio is significantly lower in our approach

and, (ii) the access latencies are also significantly lower because our approach

eliminates a significant bottleneck: the directory, from the system. Finally,

Figure 5.6 shows the detailed energy benefits for the most important system

components. For the GPU-core we observe 20% energy reduction, mostly due

to the achieved speedup. For the MMU we observe that our approach requires

only a fraction of the energy required by the reference (≈ 30%), while both the

NoC and cache energy is reduced by on average 20% and 25% respectively.

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

backprop bfs

gaussianhotspot

lavaMD ludsrad

streamcluster

particlefilte

rstencil

octreepart

wavegmean

Normalized EDP Normalized Energy

Figure 5.5. Normalized Total Energy/EDP for VIPS-G (baseline is VI_Hammer).

0

0.2

0.4

0.6

0.8

1

1.2

backprop bfs

gaussianhotspot

lavaMD ludsrad

streamcluster

particlefilte

rstencil

octreepart

wavegmean

No

rma

lize

d E

ne

rgy

Core MMU Caches NoC

Figure 5.6. Normalized per component energy for VIPS-G (baseline is VI_Hammer).

39

6. Related-Work

The decoupled access-execute paradigm was initially proposed by Smith [71]

in 1982. Unlike contemporary processors that are widely used today, the exe-

cution units of Smith’s DAE processor were unaware of address calculations;

instead they were only capable of performing arithmetic operations to the next

available operands, provided to them through FIFO queues. In such an ap-

proach stalls could occur when the queues became empty. For contemporary

processors various approaches have been proposed in the last decade to tackle

with memory stalls, such as: helper-threads (e.g., [32, 61, 81]), runahead(e.g., [33]), and slipstream execution (e.g., [27]). The key idea remains the

same across all these different approaches: to prefetch data ahead of time, be-

fore use by utilizing more resources (either more cores or hardware threads).

Employing helper-threads, however, to prefetch data and eliminate stalls in

single-threaded applications, provides a potential for performance improve-

ments, with the trade-off of increased energy consumption due to utilizing

more cores for the helper threads. DAE substantially differs from these ap-

proaches, because the access- and execute-phases run in the same thread.

DAE relies its efficiency on the access-phase, which is based on our ability

to generate a lightweight skeleton for the data prefetching. Similar code gen-

eration techniques have been used in various inspector-executor approaches

[1, 80, 83, 10, 62, 65]. The underlying insight behind those techniques, is to

generate an inspector-phase that runs ahead of time and instruments the code,

which later on runs optimized at the executor-phase based on the analysis of

the inspector. This technique has been widely used to overcome the limita-

tions of static analysis. Other compiler techniques have been proposed to op-

timize applications’ energy [79, 77, 78, 11, 66, 46] with the use of DVFS. Our

approach of Papers II [31] and III [40] combines best of both worlds by em-

ploying code transformation to optimize energy within the limits of applicable

DVFS of future processors (with low-latency per-core DVFS [38, 37]).

Alternative approaches use the operating system (OS) to profile and select

the optimal frequency either for an entire process, or for different phases of

a process[36, 72]. Such approaches do not involve code transformation and

they typically operate at a coarse granularity defined by the OS time-slice;

therefore, they provide a compromise between performance losses and energy

savings in codes with high diversity of memory-stall regions.

Software multi-versioning has widely been employed to reduce the over-

head of code instrumentation [2, 13, 15, 23], for checking program correct-

ness [19], for loop parallelization, for automatic, speculative optimizations

40

[12, 30, 49], to optimize the execution for different inputs [75, 82], or even de-

tect races [50]. Our work of Paper III [40] employs software multi-versioning

to find the right balance between the prefetching overheads and benefits for the

access-phase of DAE programs. Furthermore, it provides a knob to optimize

CPU energy efficiency when the CPU runs under different loads.

DAE, also has the potential to improve the energy efficiency of the CPU

cores on a heterogeneous architecture, especially when running highly-complex

codes as those expected to remain in the CPU even in the heterogeneous era

[3], but is not a good fit for the GPU programming model that is capable

of tolerating memory stalls. To simplify the programming model, state-of-

the-art approaches suggest the use of a unified virtual memory (UVM). For

example, the latest versions of NVIDIA CUDA [54] offer UVM. Academic

proposals such as HSC [58] and [70] recognize the need for an efficient UVM

implementation in hardware. Some approaches like [58] and [59] employ se-

quential memory consistency models, while other like [70, 20] adopt a release

consistency model. Hechtman et al. [21] shows that the consistency model has

a minimal impact on performance; however, it has a huge impact on energy

consumption and implementation complexity. Therefore, in Paper IV [41] we

adopt a release consistency model. Unlike state-of-the-art approaches for SC

[60] we minimize the GPU memory management unit overheads by employ-

ing virtual coherence similar to [35].

41

7. Conclusions

This thesis studies efficient execution paradigms for parallel heterogeneous

architectures. Our starting point is to improve the energy efficiency of con-

temporary cores such as those met in many-core and multi-core architectures

today, for which we propose the decoupled access-execute (DAE) paradigm.

DAE transforms the code in two coarse grain regions: an entirely memory-

bound region (called the access-phase) intended to prefetch data into the cache,

and another entirely compute-bound (called the execute-phase) that performs

the original computation without stalls. The coarse granularity of these phases

allows us to set an optimal frequency for each one separately; typically a low

frequency is used for the access-phase, while a high frequency is used for the

execute-phase. Therefore, DAE allows to optimally DVFS applications for

future systems with low-latency, per-core DVFS capabilities.

The decoupled access-execute paradigm has been extensively studied in

real hardware, yielding excellent results in different classes of applications.

Our experimentation shows great energy savings for predominantly memory-

bound applications with irregular control-flow, where the hardware capabili-

ties to hide memory slack are limited. In these applications, DAE significantly

improves memory level parallelism (MLP), and therefore, improves both per-

formance and energy efficiency. In our study we evaluate DAE in different ap-

plication categories such as task parallel and sequential, affine and non-affine,

and complex general purpose. Our evaluation shows that DAE has a potential

to increase the MLP and the energy efficiency in a wide range of applications.

To facilitate the programmer with an automated way to apply DAE trans-

formations we provide compiler passes to support both affine and non-affine

codes. Furthermore, we extend our infrastructure to allow for indirection-

measure control of prefetches. We integrate this approach with multi-version

DAE (SMVDAE) to allow for dynamic version selection. SMVDAE provides

us with a knob to throttle the statically-unknown overhead of the access-phase;

therefore, it employs the best access-phase only when it provides benefit over

the original (self-adapting and self-healing). Furthermore, SMVDAE has the

ability to adapt under different system loads. An SMVDAE system relies on

the compiler to statically generate different access versions, and a runtime

system to profile and dynamically select the best performing access version.

On average DAE provides excellent results for memory bound applica-

tions; not only does it improve energy and EDP but in most of the cases

it provides good speedup to memory-bound applications. Without employ-

ing multi-versioning, automatic transformations with the help of the compiler

42

achieved ≈ 20% energy reduction (for the average of the applications evalu-

ated in Papers I and II) and ≈ 20% lower EDP. Employing multi-versioning

(as in Paper III) exceeded our expectations by providing ≈ 10% performance

improvements and over 30% lower energy for the memory-bound applica-

tions, while for the average of the evaluated applications (SPEC-CPU 2006)

we achieved over 20% energy savings (25% EDP).

Finally, in this thesis we explore the possibility to improve the programmer

experience on heterogeneous systems. We propose a novel heterogeneous sys-

tem architecture, which facilitates the programmer with unified virtual mem-

ory. Our approach involves minimal modifications (compared to contempo-

rary systems), and introduces minimal overhead to maintain coherence. To de-

fine the synchronization scopes we employ the heterogeneous race-free (HRF)

memory consistency model. Our goal is to provide regional coherence (at page

granularity) with minimal hardware overhead. For that, we use a finely-tuned

version of the VIPS coherency protocol (VIPS-G variation), which is asym-

metrical and adaptive allowing us to exploit the full potential of heterogeneous

system, while at the same time preserving the latency sensitive nature of the

CPU cores. Furthermore, we optimize the GPU memory management unit

(MMU) by using virtual coherence, and the GPU caches with a more effi-

cient write-allocate policy. Our approach improves performance by ≈ 45% on

average and EDP by also ≈ 45%. This completes our perspective of a hetero-

geneous system where DAE is employed to improve the execution paradigm

of the sequential part of the application, while the proposed UVM architecture

allows for a more efficient execution across the entire heterogeneous system.

43

8. Svensk sammanfattning

Denna avhandling studerar effektiva exekceringsparadigmer för parallella he-

terogena arkitekturer. De främsta heterogena systemen idag består av en fle-

kärnig CPU och en mångkärnig GPU sammanslaget i samma paket. För CPU

är den mest etablerade tekniken för att minska energiförbrukningen dynamisk

spänning-frekvensskalning (DVFS). Det mesta av effektiviteten av DVFS för-

litar sig på den kvadratiska förhållandet mellan spänning, skalning, och ef-

fektförbrukning. I och med slutet av Dennard-skalningen i det senaste årtion-

det på grund av den exponentiella ökningen av läckeffekt vid låga spänningar,

behövs nya metoder för energihantering. Den underliggande insikten bakom

vårt arbete är observationen att prestanda inte alltid är linjärt relaterad till fre-

kvensen (t.ex. i applikationer begränsade av minnes-prestanda) och därför kan

vi exploatera denna potential i en programmeringsmodell. Dessutom, för att

utnyttja GPU behövs en betydande insats från programmeraren för att kopie-

ra data fram och tillbaka till enheten, samt att synkronisera, och att iscensätta

utförandet på ett sådant sätt att dataöverföringar ska överlappas med själva be-

räkningen. Detta kräver en implementation av ett effektivt enhetligt virtuellt

minne (UVM) för hela heterogena system.

Vår utgångspunkt är att förbättra energieffektiviteten i CPU-kärnor som

finns idag. För att åstadkomma detta föreslår vi en frikopplad access - exekve-ringsparadigm (DAE). DAE förvandlar koden till två grovkorniga regioner: en

helt minnesbegränsad region (kallad accessfas) som är avsedd till att förhämta

data i cache, och en annan helt beräkningsbegränsad (kallad exekveringsfas)

som utför den ursprungliga beräkningen utan stopp. Den grova granulariteten

av dessa faser tillåter oss att ställa in en optimal frekvens för varje fas för sig;

vanlightvis används en låg frekvens för accessfasen, medan en hög frekvens

används för exekveringsfasen. Därför tillåter DAE att utföra DVFS optimalt i

framtida system som har möjlighet till DVFS-operationer med låg latens, för

varje individuell kärna.

I denna avhandling har vi omfattande studerat den frikopplade access - ex-

ekveringsparadgimen med riktig hårdvara. Våra experiment visar att DAE ger

utmärkta resultat i olika klasser av applikationer. Vi uppnår stora energibespa-

ringar för övervägande minnesbundna applikationer med oregelbundna kon-

trollflöden, där hårdvarans förmåga att dölja fördröjningar orsakade av data-

hämtningar är begränsad. I dessa fall förbättrar DAE minnes-parallellismen

(MLP) avsevärt, och förbättrar därför både prestanda och energieffektivitet. I

vår studie utvärderar vi DAE i olika applikationer: uppgifts-parallella eller se-

kventiella, affina eller icke-affina och komplexa allmänt ändamål; DAE har en

potential att öka MLP och energieffektivitet över hela området.

44

För att underlätta för programmeraren med ett automatiserat sätt att applice-

ra DAE transformationer tillhandahåller vi kompilator-passe för att stödja både

affina och icke-affina koder. Dessutom utökar vi vår infrastruktur för att möj-

liggöra indirekthet måttkontroll av prefetches. Vi kallar detta tillvägagångs-

sätt multiversion DAE (SMVDAE). SMVDAE ger oss ett vred för att strypa

statiskt okända kostnader för överdriven förhämtning i accessfasn; Därför an-

vänds den bästa tillgänliga accessfasen endast när det ger fördel över origina-

let (självanpassande och självläkande). Dessutom har SMVDAE förmågan att

anpassa sig till olika systembelastningar. Ett SMVDAE system bygger på att

kompilatorn statiskt genererar olika access-versioner, och en runtime system

för att profilera och dynamiskt välja den bästa tillgängliga accessversionen.

I medel ger DAE utmärkta resultat för minnesbundna applikationer; inte

bara förbättras energiförbrukningen och energifördröjningsprodukten (EDP),

men i de flesta fall ges dessutom en god uppsnabbning till minnesbundna

tillämpningar. Utan att utnyttja multi-versionshantering uppnår automatiska

omvandlingar med hjälp av kompilatorn ≈ 20% minskning av energiförbruk-

ning (genomsnittet av applikationerna som utvärderades i artikel I och II) och

≈ 20% lägre EDP. Utnyttjande av multi-versionshantering (som i artikel III)

överträffade våra förväntningar genom att tillhandahålla ≈ 10% prestandaför-

bättringar och över 30% lägre energi för minnesbundna applikationer, medan

genomsnittet för de utvärderade applikationer (SPEC-CPU 2006) uppnådde

över 20% energibesparingar (25% EDP).

Slutligen, i denna avhandling utforskar vi möjligheten att förbättra pro-

grammerarens upplevelse av heterogena system. Vi föreslår en ny heterogen

systemarkitektur, vilket förser programmerare med enhetligt virtuellt minne,

med minimala modifieringar (jämfört med dagens system) och minimal over-

head för att upprätthålla minneskoherens. Vårt tillvägagångssätt utnyttjar den

heterogena race-fria (HRF) minnes-konsistes-modellen. Vårt mål är att leve-

rera regional minneskoherens (med en minnses-sida som region) med minimal

hårdvaruoverhead. För detta använder vi en finkornig version av VIPS kohe-

rensprotokollet (VIPS-G variation), som är asymmetrisk och adaptiv vilket gör

det möjligt att utnyttja den fulla potentialen av heterogent system och samti-

digt bevara den latenskänsliga naturen av CPU-kärnor. Dessutom, vi optimerar

minneshanteringsenheten (MMU) för GPU, och de privata GPU cachar med

virtuell koherens (vi använder ett VI protokoll inom GPU) och en mer effektiv

skriv-allokera politik för GPGPU datoranvändning. Vår ansats förbättrar pre-

standan med ≈ 45% i genomsnitt och även EDP med ≈ 45%. Detta fullbordar

vårt perspektiv av ett heterogent system där DAE används för att förbättra ex-

ekveringsparadigmen för den sekventiella delen av en applikation, medan den

föreslagna UVM arkitekturen möjliggör ett mer effektivt utförande över det

hela heterogena systemet.

45

9. Acknowledgments

Firstly, I would like to express my sincere gratitude to my advisor, Prof. Ste-

fanos Kaxiras for his continuous support on this PhD study; for his patience,

motivation, and immense knowledge. His guidance was crucial to help me

shape into an independent researcher; not only did he taught me how to find

interesting research problems, and how to approach them with a novel perspec-

tive, but he also taught me how to structure these ideas and convert them into

high quality scientific documents. I would also like to express deep gratitude

to Prof. Alexandra Jimborean for her overall contribution to the decoupled

access-execute paradigm, for bringing her compiler expertise into the project

and for all the knowledge transfer regarding LLVM and polyhedral optimiza-

tions. Of utter significance to Paper IV and to the work presented in this thesis

to support UVM in heterogeneous systems, was the contribution of Prof. Al-

berto Ros. Therefore, i would like to express my gratitude for all his help in the

way to understand the tradeoffs in coherence and the endless hours he spent,

helping me develop and debug this work. Finally, equal gratitude is expressed

to my co-advisors Professors Erik Hagersten and David Black-Schaffer, for the

long discussions, for continuously reviewing my work and providing valuable

feedback. Working in the UART group has provided me a unique opportunity

for me to improve my work, and the way of thinking and presenting it.

But without financial aid, none of this work would be possible. Therefore,

i would like to thank all the funders of this work: (i) the EU commission, for

offering a 3 year grant under the LPGPU (low-power GPU) project (FP7-ICT-

288653), (iii) the Swedish Research Council for co-funding this PhD, and (iii)

UPMARC for offering computing resources and support. Finally, I would like

to thank and congratulate Uppsala University for offering an employer friendly

and motivating environment, and for supporting this work. Great gratitude in

this respect is paid to Prof. Ivan Christoff, who as head of the Division of

Computer Systems put enormous effort to made the division a better place for

the employees, and one of the most highly competitive. Apart from that, he

made an example of kindness, politeness, strength and generosity; an example

that could shape us into better persons.

Finally, I would like to thank and express my gratitude to all the friends and

colleagues for their help and support during those years. More specifically, to

Vasileios Spiliopoulos for the long discussions about energy saving trends, and

for his effort to build the hardware infrastructure we used to measure power

on real machines. I would also like to thank my previous office-mates, Nikos

Nikoleris for the long philosophical discussions, and for encouraging me into

46

the hardest periods; and Muneeb Khan for sharing his experience and expertise

on software prefetching techniques, and for all the discussions we had about

life in general. As the list becomes longer I would like to thank and express my

gratitude to all the master-students that joined the decoupled access-execute

project for their contribution. And last but not least, I express endless gratitude

to my beloved wife Eirini, for being on my side and withstanding me all those

full of deadlines and difficulties years, and for supporting my efforts.

47

References

[1] Manuel Arenaz, Juan Touriño, and Ramón Doallo. An inspector-executor

algorithm for irregular assignment parallelization. In Parallel and DistributedProcessing and Applications, volume 3358 of Lecture Notes in ComputerScience, pages 4–15. 2005.

[2] Matthew Arnold and Barbara G Ryder. A framework for reducing the cost of

instrumented code. In Proceedings of the 22th ACM SIGPLAN Conference onProgramming Language Design and Implementation, PLDI ’01, pages

168–179, 2001.

[3] Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott B. Baden, and

Dean M. Tullsen. Redefining the Role of the CPU in the Era of CPU-GPU

Integration. IEEE Micro, 32(6):4–16, Nov 2012.

[4] Eduard Ayguadé, Rosa M Badia, Francisco D Igual, Jesús Labarta, Rafael

Mayo, and Enrique S Quintana-Ortí. An extension of the starss programming

model for platforms with multiple gpus. In Euro-Par 2009 Parallel Processing,

pages 851–862. Springer, 2009.

[5] Ali Bakhoda, George L. Yuan, Wilson W.L. Fung, Henry Wong, and Tor M.

Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In

Proceedings of the 2009 IEEE International Symposium on PerformanceAnalysis of Systems and Software, ISPASS ’09, pages 163–174, April 2009.

[6] Utpal K. Banerjee. Loop Transformations for Restructuring Compilers: TheFoundations. Kluwer Academic Publishers, 1993.

[7] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali

Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna,

Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay

Vaish, Mark D. Hill, and David A. Wood. The Gem5 Simulator. ACMSIGARCH Computer Architecture News, 39(2):1–7, August 2011.

[8] Anantha P Chandrakasan, Samuel Sheng, and Robert W Brodersen. Low-power

cmos digital design. IEEE Journal of Solid-State Circuits, 27(4):473–484, 1992.

[9] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer,

Sang-Ha Lee, and Kevin Skadron. Rodinia: A Benchmark Suite for

Heterogeneous Computing. In Proceedings of the 2009 InternationalSymposium on Workload Characterization, pages 44–54, October 2009.

[10] Ding Kai Chen, Josep Torrellas, and Pen Chung Yew. An efficient algorithm for

the run-time parallelization of doacross loops. In Proceedings of the 1994ACM/IEEE Conference on Supercomputing, SC ’94, pages 518–527, 1994.

[11] Juan Chen, Huizhan Yi, Xuejun Yang, and Liang Qian. Compile-Time Energy

Optimization for Parallel Applications in On-Chip Multiprocessors. In

Proceedings of the 6th International Conference on Computational Science,

ICCS ’06, pages 904–911, 2006.

48

[12] Xuan Chen and Shun Long. Adaptive multi-versioning for openmp

parallelization via machine learning. In Proceedings of the 15th InternationalConference on Parallel and Distributed Systems, ICPADS ’09, pages 907–912,

2009.

[13] T Chilimbi and M Hirzel. Dynamic Hot Data Stream Prefetching for

General-Purpose Programs. In Proceedings of the 23rd ACM SIGPLANConference on Programming Language Design and Implementation, PLDI ’02,

pages 199–209, 2002.

[14] R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc.

Design of ion-implanted mosfet’s with very small physical dimensions. IEEEJournal of Solid-State Circuits, 9(5):256–268, 1974.

[15] Bruno Dufour, Barbara G Ryder, and Gary Sevitsky. Blended analysis for

performance understanding of framework-based applications. In Proceedings ofthe 2007 International Symposium on Software Testing and Analysis,

ISSTA ’07, 2007.

[16] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation

of the Cilk-5 multithreaded language. ACM SIGPLAN Notices, 33(5):212–223,

1998.

[17] R. Gonzalez and M. Horowitz. Energy dissipation in general purpose

microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1277–1284, 1996.

[18] Per Hammarlund, Alberto J. Martinez, Atiq A. Bajwa, David L. Hill, Erik

Hallnor, Hong Jiang, Martin Dixon, Michael Derr, Mikal Hunsaker, Rajesh

Kumar, Randy B. Osborne, Ravi Rajwar, Ronak Singhal, Reynold D’Sa, Robert

Chappell, Shiv Kaushik, Srinivas Chennupaty, Stephan Jourdan, Steve Gunther,

Tom Piazza, and Ted Burton. Haswell: The fourth-generation intel core

processor. IEEE Micro, 34(2):6–20, 2014.

[19] Matthias Hauswirth and Trishul M. Chilimbi. Low-overhead memory leak

detection using adaptive statistical profiling. In Proceedings of the 11thInternational Conference on Architectural Support for Programming Languageand Operating Systems, ASPLOS ’04, pages 156–164, 2004.

[20] Blake A. Hechtman, Shuai Che, Derek R. Hower, Yingying Tian, Bradford M.

Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood.

QuickRelease: A Throughput-Oriented Approach to Release Consistency on

GPUs. In Proceedings of the 20th International Symposium onHigh-Performance Computer Architecture, HPCA ’14, pages 189–200, Feb

2014.

[21] Blake A. Hechtman and Daniel J. Sorin. Exploring Memory Consistency for

Massively-Threaded Throughput-Oriented Processors. In Proceedings of the40th International Symposium on Computer Architecture, ISCA ’13, pages

201–212, June 2013.

[22] John L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCHComputer Architecture News, 34(4):1–17, September 2006.

[23] Martin Hirzel and Trishul Chilimbi. Bursty tracing: A framework for

low-overhead temporal profiling. In 4th ACM Workshop on FeedbackDirectedand Dynamic Optimization, pages 117–126, 2001.

[24] M. Horowitz, T. Indermaur, and R. Gonzalez. Low-power digital design. In

IEEE Symposium on Low Power Electronics, pages 8–11, 1994.

49

[25] Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R.

Gaster, Mark D. Hill, Steven K. Reinhardt, and David A. Wood.

Heterogeneous-Race-Free Memory Models. In Proceedings of the 19thInternational Conference on Architectural Support for Programming Languageand Operating Systems, ASPLOS ’14, pages 427–440, March 2014.

[26] Heterogeneous system architecture (hsa) foundation.

�� .

[27] K.Z. Ibrahim, G.T. Byrd, and E. Rotenberg. Slipstream execution mode for

CMP-based multiprocessors. In Proceedings of the 9th InternationalSymposium on High-Performance Computer Architecture, HPCA ’03, pages

179–190, 2003.

[28] ITRS. DESIGN 2011 EDITION, 2011.

[29] Bruce Jacob. The memory system: you can’t avoid it, you can’t ignore it, you

can’t fake it. Synthesis Lectures on Computer Architecture, 4(1):1–77, 2009.

[30] Alexandra Jimborean, Philippe Clauss, Jean-François Dollinger, Vincent

Loechner, and Juan Manuel Martinez Caamaño. Dynamic and Speculative

Polyhedral Parallelization Using Compiler-Generated Skeletons. InternationalJournal of Parallel Programming, 42(4):529–545, August 2014.

[31] Alexandra Jimborean, Konstantinos Koukos, Vasileios Spiliopoulos, David

Black-Schaffer, and Stefanos Kaxiras. Fix the Code. Don’t Tweak the

Hardware: A New Compiler Approach to Voltage-Frequency Scaling. In

Proceedings of the 2014 IEEE / ACM International Symposium on CodeGeneration and Optimization, CGO ’14, pages 262–272, 2014.

[32] Md Kamruzzaman, Steven Swanson, and Dean M. Tullsen. Inter-core

prefetching for multicore processors using migrating helper threads. In

Proceedings of the 16th International Conference on Architectural Support forProgramming Language and Operating Systems, ASPLOS ’11, pages 393–404,

2011.

[33] M. Karlsson and E. Hagersten. Conserving Memory Bandwidth in Chip

Multiprocessors with Runahead Execution. In Proceedings of the 2007International Parallel and Distributed Processing Symposium, IPDPS ’07,

pages 1–10, march 2007.

[34] Stefanos Kaxiras and Margaret Martonosi. Computer architecture techniques

for power-efficiency. Synthesis Lectures on Computer Architecture, 3(1):1–207,

2008.

[35] Stefanos Kaxiras and Alberto Ros. A New Perspective for Efficient

Virtual-Cache Coherence. In Proceedings of the 40th International Symposiumon Computer Architecture, pages 535–546, June 2013.

[36] Georgios Keramidas, Vasileios Spiliopoulos, and Stefanos Kaxiras.

Interval-based models for run-time DVFS orchestration in superscalar

processors. In Proceedings of the 7th ACM International Conference onComputing Frontiers, CF ’10, pages 287–296, 2010.

[37] Wonyoung Kim, D. Brooks, and Gu-Yeon Wei. A Fully-Integrated 3-Level

DC-DC Converter for Nanosecond-Scale DVFS. IEEE Journal of Solid-StateCircuits, 47(1):206–219, jan. 2012.

[38] Wonyoung Kim, M.S. Gupta, Gu-Yeon Wei, and D. Brooks. System level

analysis of fast, per-core DVFS using on-chip switching regulators. In

50

Proceedings of the 14th International Symposium on High-PerformanceComputer Architecture, HPCA ’08, pages 123–134, feb. 2008.

[39] Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, and

Stefanos Kaxiras. Towards More Efficient Execution: A Decoupled

Access-execute Approach. In Proceedings of the 27th International Conferenceon Supercomputing, ICS ’13, pages 253–262, 2013.

[40] Konstantinos Koukos, Per Ekemark, Georgios Zacharopoulos, Vasileios

Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean. Multiversioned

decoupled access-execute: The key to energy-efficient compilation of

general-purpose programs. In Proceedings of the 25th International Conferenceon Compiler Construction, CC ’16, pages 121–131, 2016.

[41] Konstantinos Koukos, Alberto Ros, Erik Hagersten, and Stefanos Kaxiras.

Building heterogeneous unified virtual memories (uvms) without the overhead.

ACM Transactions on Architecture and Code Optimization, 13(1):1:1–1:22,

March 2016.

[42] George Kyriazis. Heterogeneous system architecture: A technical review. AMDFusion Developer Summit, 2012.

[43] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong

program analysis & transformation. In Proceedings of the 2004 IEEE / ACMInternational Symposium on Code Generation and Optimization, CGO ’04,

2004.

[44] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung

Kim, Tor M. Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling Energy

Optimizations in GPGPUs. In Proceedings of the 40th International Symposiumon Computer Architecture, ISCA ’13, pages 487–498, June 2013.

[45] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen,

and Norman P. Jouppi. McPAT: An Integrated Power, Area, and Timing

Modeling Framework for Multicore and Manycore Architectures. In

Proceedings of the 42nd IEEE/ACM International Symposium onMicroarchitecture, MICRO ’09, pages 469–480, Dec 2009.

[46] Xiang LingXiang, Huang JiangWei, Sheng Weihua, and Chen TianZhou. The

design and implementation of the dvs based dynamic compiler for power

reduction. In Proceedings of the 7th International Conference on AdvancedParallel Processing Technologies, volume 4847 of Lecture Notes in ComputerScience, pages 233–240. 2007.

[47] The llvm compiler infrastructure. ��.

[48] Vincent Loechner. Polylib: A library for manipulating parameterized polyhedra,

1999.

[49] Lianjie Luo, Yang Chen, Chengyong Wu, Shun Long, and Grigori Fursin.

Finding representative sets of optimizations for adaptive multiversioning

applications. In Proceedings of the 3rd International Workshop on Statisticaland machine learning approaches to architectures and compilation,

IWSMART ’09, 2009.

[50] Daniel Marino, Madanlal Musuvathi, and Satish Narayanasamy. LiteRace:

effective sampling for lightweight data-race detection. In Proceedings of the30th ACM SIGPLAN Conference on Programming Language Design andImplementation, PLDI ’09, pages 134–143, 2009.

51

[51] Vladimir Marjanovic, Jesús Labarta, Eduard Ayguadé, and Mateo Valero.

Overlapping communication and computation by using a hybrid MPI/SMPSs

approach. In Proceedings of the 24th International Conference onSupercomputing, ICS ’10, pages 5–16, 2010.

[52] Mark P Mills. The cloud begins with coal: Big data, big networks, big

infrastructure, and big power. Digital Power Group, 2013.

[53] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable

parallel programming with cuda. Queue, 6(2):40–53, March 2008.

[54] Nvidia. CUDA C Programming Guide, 2015.

[55] Chuck Pheatt. Intel® threading building blocks. Journal of ComputingSciences in Colleges, 23(4), April 2008.

[56] Judit Planas, Rosa M. Badia, Eduard Ayguadé, and Jesus Labarta. Hierarchical

Task-Based Programming With StarSs. International Journal of HighPerformance Computing Applications, 23:284–299, 2009.

[57] A library of polyhedral functions. �� .

[58] Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M.

Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood.

Heterogeneous System Coherence for Integrated CPU-GPU Systems. In

Proceedings of the 46th IEEE/ACM International Symposium onMicroarchitecture, MICRO ’13, pages 457–467, December 2013.

[59] Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood.

gem5-gpu: A Heterogeneous CPU-GPU Simulator. 14(1):34–36, Jan 2015.

[60] Jason Power, Mark D. Hill, and David A. Wood. Supporting x86-64 Address

Translation for 100s of GPU Lanes. In Proceedings of the 20th InternationalSymposium on High-Performance Computer Architecture, HPCA ’14, pages

568–578, Feb 2014.

[61] Carlos García Quiñones, Carlos Madriles, Jesús Sánchez, Pedro Marcuello,

Antonio González, and Dean M. Tullsen. Mitosis compiler: An infrastructure

for speculative threading based on pre-computation slices. In Proceedings of the26th ACM SIGPLAN Conference on Programming Language Design andImplementation, PLDI ’05, pages 269–279, 2005.

[62] Lawrence Rauchwerger, Nancy M. Amato, and David A. Padua. Run-time

methods for parallelizing partially parallel loops. In Proceedings of the 9thInternational Conference on Supercomputing, ICS ’95, pages 137–146, 1995.

[63] Alberto Ros, Mahdad Davari, and Stefanos Kaxiras. Hierarchical

Private/Shared Classification: The Key to Simple and Efficient Coherence for

Clustered Cache Hierarchies. In Proceedings of the 21st InternationalSymposium on High-Performance Computer Architecture, HPCA ’15, pages

186–197, Feb 2015.

[64] Alberto Ros and Stefanos Kaxiras. Complexity-Effective Multicore Coherence.

In Proceedings of the 21st International Conference on Parallel Architecturesand Compilation Techniques, pages 241–252, September 2012.

[65] J.H. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and

scheduling of loops. IEEE Transactions on Computers, 40(5):603–612, May

1991.

[66] H. Saputra, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. S. Hu, C-H. Hsu,

and U. Kremer. Energy-conscious compilation based on voltage scaling. In

52

Proceedings of the Joint Conference on Languages, Compilers and Tools forEmbedded Systems: Software and Compilers for Embedded Systems,

LCTES/SCOPES ’02, pages 2–11, 2002.

[67] Alexander Schrijver. Theory of linear and integer programming. John Wiley &

Sons, Inc., NY, USA, 1986.

[68] Andreas Sembrant, Erik Hagersten, and David Black-Shaffer. TLC: A Tag-less

Cache for Reducing Dynamic First Level Cache Energy. In Proceedings of the46th IEEE/ACM International Symposium on Microarchitecture, pages 49–61,

December 2013.

[69] Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons,

Michael A. Kozuch, and Todd C. Mowry. The Dirty-Block Index. In

Proceedings of the 41st International Symposium on Computer Architecture,

pages 157–168, June 2014.

[70] Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O’Connor,

and Tor M. Aamodt. Cache Coherence for GPU Architectures. In Proceedingsof the 19th International Symposium on High-Performance ComputerArchitecture, HPCA ’13, pages 578–590, Feb 2013.

[71] James E. Smith. Decoupled access/execute computer architectures. ACMSIGARCH Computer Architecture News, 10(3):112–119, April 1982.

[72] V. Spiliopoulos, S. Kaxiras, and G. Keramidas. Green governors: A framework

for Continuously Adaptive DVFS. In Proceedings of the 2011 InternationalGreen Computing Conference and Workshops, IGCC ’11, pages 1–8, 2011.

[73] John E. Stone, David Gohara, and Guochun Shi. Opencl: A parallel

programming standard for heterogeneous computing systems. IEEE Design andTest, 12(3):66–73, May 2010.

[74] John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen

Chang, Nasser Anssari, Geng Daniel Liu, and Wen-Mei W Hwu. Parboil: A

revised benchmark suite for scientific and commercial throughput computing.

Center for Reliable and High-Performance Computing, 2012.

[75] Kai Tian, Yunlian Jiang, Eddy Z. Zhang, and Xipeng Shen. An input-centric

paradigm for program dynamic optimizations. In Proceedings of the 2010 ACMConference on Object-Oriented Programming, Systems, Languages andApplications, OOPSLA ’10, pages 125–139, 2010.

[76] George Tzenakis, Konstantinos Kapelonis, Michail Alvanos, Konstantinos

Koukos, Dimitrios S. Nikolopoulos, and Angelos Bilas. Tagged Procedure Calls

(TPC): Efficient runtime support for task-based parallelism on the Cell

Processor. In Proceedings of the 2010 International Conference onHigh-Performance Embedded Architectures and Compilers, HiPEAC ’10, 2010.

[77] Qiang Wu, Margaret Martonosi, Douglas W. Clark, V. J. Reddi, Dan Connors,

Youfeng Wu, Jin Lee, and David Brooks. A dynamic compilation framework

for controlling microprocessor energy and performance. In Proceedings of the38th IEEE/ACM International Symposium on Microarchitecture, MICRO ’05,

pages 271–282, 2005.

[78] Qiang Wu, Margaret Martonosi, Douglas W. Clark, Vijay Janapa Reddi, Dan

Connors, Youfeng Wu, Jin Lee, and David Brooks. Dynamic-compiler-driven

control for microprocessor energy and performance. IEEE Micro,

26(1):119–129, 2006.

53

[79] Fen Xie, Margaret Martonosi, and Sharad Malik. Compile-time Dynamic

Voltage Scaling Settings: Opportunities and Limits. In Proceedings of the 24thACM SIGPLAN Conference on Programming Language Design andImplementation, PLDI ’03, pages 49–62, 2003.

[80] Daisuke Yokota, Shigeru Chiba, and Kozo Itano. A new optimization technique

for the inspector-executor method. In IASTED PDCS, pages 706–711, 2002.

[81] Weifeng Zhang, D.M. Tullsen, and B. Calder. Accelerating and adapting

precomputation threads for effcient prefetching. In Proceedings of the 13thInternational Symposium on High-Performance Computer Architecture,

HPCA ’07, pages 85–95, 2007.

[82] Mingzhou Zhou, Xipeng Shen, Yaoqing Gao, and Graham Yiu. Space-efficient

multi-versioning for input-adaptive feedback-driven program optimizations. In

Proceedings of the 2014 ACM Conference on Object-Oriented Programming,Systems, Languages and Applications, OOPSLA ’14, pages 763–776, 2014.

[83] Chuan-Qi Zhu and Pen-Chung Yew. A scheme to enforce data dependence on

large multiprocessor systems. IEEE Transactions on Software Engineering, 13,

1987.

54

Acta Universitatis UpsaliensisDigital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 1405

Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science andTechnology, Uppsala University, is usually a summary of anumber of papers. A few copies of the complete dissertationare kept at major Swedish research libraries, while thesummary alone is distributed internationally throughthe series Digital Comprehensive Summaries of UppsalaDissertations from the Faculty of Science and Technology.(Prior to January, 2005, the series was published under thetitle “Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology”.)

Distribution: publications.uu.seurn:nbn:se:uu:diva-300831

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2016

efficient execution paradigms for parallel heterogeneous …952645/... · 2016. 9. 2. · efficient...

Documents