

Developing Scientific Software for Low-power System-on-Chip

Processors: Optimising for Energy

Anish Varghese

A thesis submitted for the degree of Doctor of Philosophy (Computer Science)

The Australian National University

November 2019


© Anish Varghese 2019


Except where otherwise indicated, this thesis is my own original work.

Anish Varghese
12 November 2019


To Neetha, David, our future generations and my parents


Acknowledgments

I thank God Almighty for giving me the strength and grace to complete this thesis, especially during all the challenging moments.

There are many people whom I would like to thank for their support during the course of my PhD.

First I would like to thank my primary supervisor, Prof. Alistair Rendell, for his immense support, guidance and patience throughout this journey. This would not have been possible without his constant encouragement. I would like to express my sincere gratitude to my associate supervisor, Dr. Josh Milthorpe, for his substantial support and guidance over the course of the last two years. I also thank Dr. Eric McCreath for his inputs during this journey.

I would like to take this opportunity to express my gratitude to the Australian National University and the Australian Government for supporting me financially with scholarships. Part of this work was supported by the Australian Research Council Discovery Project (DP0987773), for which I express my gratitude.

Parts of this thesis were made possible through collaboration with others. Special thanks to my collaborators, Dr. Gaurav Mitra, Robert Edwards, Andrew Haigh and Luke Angove. I am thankful for the opportunity to work with them and I have learned a lot from each of them.

I would like to thank my colleagues in the Research School of Computer Science - Beau Johnston, Swapnil Mishra, Jeff Fisher, Kunshan Wang, Sara Hamouda and Brian Lee - for interesting and helpful discussions during this time. Thanks also to the other academics for their guidance from time to time.

I am grateful to the National Computational Infrastructure (NCI) for the opportunity to work part-time during the last year of my PhD. I would like to particularly thank my managers Dr. Roger Edberg, Dr. Muhammad Atif and Andrew Wellington for giving me the flexibility I needed to complete my thesis. I would like to thank my colleagues at NCI as well for their encouragement.

Last, but not least, I would like to express my gratitude to my parents and in-laws, and my family members for their support and prayers. Special thanks to my wife, Neetha, for her unconditional love, support, sacrifices and for lifting me up whenever I needed it. I am also grateful for the opportunity to see my son, David, grow up from a tiny tot into a bubbly toddler during the last year of this journey. He has been the source of our happiness, especially during challenging times. I look forward to spending more time with my family.


Abstract

Energy consumption has been identified as the major bottleneck in the push to increase the scale of current High Performance Computing (HPC) systems. Consequently there has been an increased effort to investigate the suitability of low-power hardware for HPC. Low-power system-on-chips (LPSoCs), which are widely used in a mobile and embedded context, typically integrate multicore Central Processing Units (CPUs) and accelerators on a single chip, offering high floating point capabilities while consuming low power.

While there are merits to using such low-power systems for scientific computing, there are a number of challenges in using them efficiently. This thesis considers three issues: i) development of applications which are able to use all the LPSoC processing elements effectively; ii) measurement, understanding and modelling of the energy usage of an application executing on such platforms; and iii) strategies for deciding the optimal partitioning of an application's workload between the different processing elements in order to minimise energy-to-solution. Each of these issues is investigated in the context of three applications - two core computational science kernels, namely matrix multiplication as an exemplar of dense linear algebra and stencil computation as an exemplar of grid-based numerical methods, and the complex Block Tridiagonal benchmark from the multizone NAS parallel benchmark suite.

To study the challenges associated with the development of scientific software for LPSoCs, two fundamentally different systems are considered: the Epiphany-IV Network-on-Chip (NoC) and the Tegra systems. The former was a Kickstarter project which aimed to design an LPSoC that could scale to over 4096 cores with a peak performance in excess of 5 trillion single-precision floating point operations per second (TFLOP/s) while operating at an energy efficiency of 70 GFLOP/s per Watt. By contrast, the latter is a product range from the multinational company NVIDIA that combines their popular Graphics Processing Unit (GPU) technology with a general purpose ARM processor in a mass market LPSoC. This thesis reports the implementation of both the matrix multiplication and stencil kernels on both systems, comparing their performance, energy usage and the programming challenges associated with developing code for these systems to those on conventional systems.
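The two design targets quoted above imply a whole-chip power envelope, which a line of arithmetic makes explicit (the 71 W figure below is derived from those two numbers, not stated in the source):

```python
# Sanity check of the Epiphany-IV design target: 5 TFLOP/s single precision
# at 70 GFLOP/s per Watt implies a power envelope of roughly 71 W.
peak_gflops = 5000        # 5 TFLOP/s expressed in GFLOP/s
gflops_per_watt = 70      # stated energy-efficiency target
power_watts = peak_gflops / gflops_per_watt
print(f"{power_watts:.1f} W")  # → 71.4 W
```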

In order to analyse the energy efficiency of applications running on an LPSoC, the ability to measure its energy usage is crucial. However, very few platforms have internal sensors which provide details of energy usage, and when they do, measurements obtained using such sensors are usually low-resolution and intrusive. This thesis presents a high-resolution, non-intrusive energy measurement framework along with an Application Programming Interface (API) which enables an application to obtain real-time measurement of its energy usage at the function level. Based on


these measurements a simple energy usage model is proposed to describe the energy usage as a function of how the workload is partitioned between the different computing devices¹. This model predicts the conditions under which energy minimisation occurs when using all available computing devices. This prediction is tested and demonstrated for the matrix multiplication and stencil kernels.
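The flavour of such a model can be sketched in a few lines. This is an illustrative toy model only, not the model developed in the thesis: a fraction f of the work goes to the CPU and the remainder to the GPU, the two devices compute concurrently, and energy is idle power over the whole run plus each device's active power over its busy time. All parameter names here are assumptions.

```python
# Toy energy-to-solution model for a CPU/GPU workload split (illustrative
# only; the rates and powers are hypothetical parameters, not measurements).
def energy(f, work, rate_cpu, rate_gpu, p_idle, p_cpu, p_gpu):
    t_cpu = f * work / rate_cpu          # CPU busy time
    t_gpu = (1 - f) * work / rate_gpu    # GPU busy time
    t_total = max(t_cpu, t_gpu)          # devices compute concurrently
    # idle power for the full run, plus active power while each device is busy
    return p_idle * t_total + p_cpu * t_cpu + p_gpu * t_gpu

def best_split(work, rate_cpu, rate_gpu, p_idle, p_cpu, p_gpu, steps=1000):
    """Grid search for the split fraction f that minimises energy."""
    return min((i / steps for i in range(steps + 1)),
               key=lambda f: energy(f, work, rate_cpu, rate_gpu,
                                    p_idle, p_cpu, p_gpu))
```

Under this toy model, when the idle term dominates the minimum sits where the two busy times balance (f = rate_cpu / (rate_cpu + rate_gpu)); when one device is far more energy-efficient the minimum collapses to giving it all the work, mirroring the all-to-GPU optima observed for several of the kernels.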

Given access to high resolution, real-time energy measurements and a model describing energy usage as a function of how an application is partitioned between the available computing devices, this thesis explores various strategies for runtime energy tuning. Different scenarios are considered: offline pre-tuning, tuning based on estimates gained from solving a small fraction of the complete problem, and tuning based on iteratively solving fractions of the entire problem a small number of times with the expectation that the final solution involves many repetitions of this. The applicability of these approaches to the model kernels is discussed and tested.
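The iterative scenario can be sketched as follows (a hypothetical interface; run_iters and the probing scheme are illustrative, not the framework's actual API): probe each candidate split for a few iterations, then run the remaining iterations at the cheapest split observed.

```python
def tune_then_run(run_iters, total_iters, candidates, probe_iters=1):
    """Probe each candidate split briefly, then run the rest at the best one.

    run_iters(split, n) executes n iterations at the given CPU/GPU split and
    returns the energy consumed (a hypothetical callback, for illustration).
    Returns (chosen split, total energy spent including probing).
    """
    spent = 0.0
    best_f, best_cost = None, float("inf")
    for f in candidates:
        cost = run_iters(f, probe_iters)  # tuning iterations still do real work
        spent += cost
        if cost < best_cost:
            best_f, best_cost = f, cost
    remaining = total_iters - probe_iters * len(candidates)
    spent += run_iters(best_f, remaining)
    return best_f, spent
```

This only pays off when total_iters is large relative to the probing cost, which is exactly the "many repetitions" expectation stated above.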

¹The development of the energy measurement framework and the energy usage model was undertaken in collaboration with others.


Contents

Acknowledgments

Abstract

1 Introduction
   1.1 Thesis Outline

2 Background and Related Work
   2.1 Low-power System-on-Chip Architecture
      2.1.1 Adapteva's Epiphany-IV
      2.1.2 NVIDIA Tegra K1
      2.1.3 NVIDIA Tegra X1
      2.1.4 NVIDIA Xavier
   2.2 Programming Models for LPSoCs
      2.2.1 Programming the Adapteva's Epiphany-IV coprocessor
         2.2.1.1 Programming Considerations
      2.2.2 Programming the NVIDIA Tegra SoCs
   2.3 Scientific applications
      2.3.1 Stencil computation
      2.3.2 Dense matrix multiplication (GEMM)
      2.3.3 The Block Tridiagonal (BT) solver
   2.4 Related work
      2.4.1 Using LPSoCs for Scientific Computing
      2.4.2 Use of CPU-Accelerator Work Distribution
      2.4.3 Measuring Energy Consumption
      2.4.4 Modelling and Auto Tuning for Energy
   2.5 Summary

3 Developing Applications for Adapteva Epiphany and NVIDIA Tegra LPSoCs
   3.1 Developing Applications for Adapteva's Epiphany coprocessor
      3.1.1 Evaluating the Epiphany chip's Memory Subsystem
         3.1.1.1 Measuring On-chip Memory Bandwidth and Latency
         3.1.1.2 Accessing External (Off-chip) Shared Memory
      3.1.2 Developing Stencil kernel for the Epiphany chip
         3.1.2.1 Floating-Point Performance of Optimised Stencil
      3.1.3 Developing GEMM kernel for Epiphany chip
         3.1.3.1 Tuned Single-core GEMM kernel
         3.1.3.2 On-chip Multi-core GEMM kernel
         3.1.3.3 Off-chip GEMM kernel
         3.1.3.4 Floating-Point Performance of GEMM kernel
      3.1.4 Feasibility of Work Partitioning between Host CPU and Epiphany
   3.2 Developing Applications for NVIDIA Tegra SoCs
      3.2.1 Developing Partitioned Stencil kernel for Tegra SoCs
         3.2.1.1 Performance Results and Analysis
         3.2.1.2 Critique
      3.2.2 Developing Partitioned GEMM kernel for Tegra SoCs
         3.2.2.1 Performance Results and Analysis
   3.3 Summary

4 Measuring and Modelling Energy Usage
   4.1 Developing an Energy Measurement Environment for LPSoC systems
      4.1.1 Hardware Design of Energy Measurement System
      4.1.2 Software Environment for Energy Measurement System
   4.2 Design of Energy Usage Model
   4.3 Experimental Evaluation of the Energy Usage Model
      4.3.1 Hardware Setup for Experiments
      4.3.2 Evaluation of Predicted Energy vs Modelled Energy
         4.3.2.1 Stencil
         4.3.2.2 GEMM
         4.3.2.3 Multizone Block Tridiagonal Solver
      4.3.3 Evaluation of Predicted Optimal Split Ratio
         4.3.3.1 Stencil
         4.3.3.2 GEMM
      4.3.4 Variation of Optimal Split with Problem Size
         4.3.4.1 Stencil
         4.3.4.2 DGEMM
   4.4 Critique and Limitations of the Energy Usage Model
   4.5 Summary

5 Developing a Runtime Tuning Framework
   5.1 Offline Pre-tuning
      5.1.1 Implementation of Pre-tuning Approach for Stencil
      5.1.2 Implementation of Pre-tuning Approach for GEMM
      5.1.3 Evaluation of Offline Pre-tuning Framework
         5.1.3.1 Evaluation of Pre-tuning Approach for Stencil
         5.1.3.2 Evaluation of Pre-tuning Approach for GEMM
   5.2 Dynamic Tuning
      5.2.1 Implementation of Dynamic Tuning Approach for Stencil
         5.2.1.1 Method 1 - Subset of Problem
         5.2.1.2 Method 2 - Subset of Iterations
         5.2.1.3 Method 3 - Progressive Refinement
      5.2.2 Implementation of Dynamic Tuning Approach for GEMM
         5.2.2.1 Method 1 - Subset of Problem
         5.2.2.2 Method 3 - Progressive Refinement
      5.2.3 Implementation of Dynamic Tuning Approach for NPB (MZ) BT Solver
         5.2.3.1 Method 2 - Subset of Iterations
      5.2.4 Evaluation of Dynamic Tuning Framework
         5.2.4.1 Evaluation of Dynamic Tuning Approach for Stencil
         5.2.4.2 Evaluation of Dynamic Tuning Approach for GEMM
         5.2.4.3 Evaluation of Dynamic Tuning Approach for NPB (MZ) BT Solver
   5.3 Critique of Runtime Tuning Framework
   5.4 Summary

6 Conclusions and Future Work
   6.1 Performance-energy Trade-offs and Energy Efficiency
   6.2 Future Work

A Experimental Hardware Platforms

B Developing Optimised Applications for the Epiphany-IV NoC
   B.1 Stencil
   B.2 Matrix Multiplication

C Energy Measurement Setup

D Frequency Scaling and Energy Usage
   D.1 CPU-bound Workload
   D.2 Memory-bound Workload

Bibliography


List of Figures

2.1 Adapteva Epiphany System
2.2 Tegra K1 SoC [NVIDIA, 2014c]
2.3 Jetson TK1 Dev kit [NVIDIA, 2014a]
2.4 Tegra K1 “GK20a” GPU [NVIDIA, 2014c]
2.5 Tegra X1 SoC [NVIDIA, 2015c]
2.6 Jetson TX1 Dev kit [NVIDIA, 2015a]
2.7 Jetson TX1 Maxwell GPU architecture [NVIDIA, 2015c]
2.8 Xavier SoC [NVIDIA, 2018c]
2.9 Jetson AGX Xavier Development kit [NVIDIA, 2018b]
2.10 Xavier CPU cluster [NVIDIA, 2018c]
2.11 Xavier Volta GPU [NVIDIA, 2018c]

3.1 Adapteva Epiphany coprocessor eMesh bandwidth: DMA transfer vs Direct on-chip writes - For large messages DMA transfer achieves around 2 GB/s.
3.2 Adapteva Epiphany coprocessor eMesh transfer time: DMA transfer vs Direct on-chip writes - For small transfers point-to-point direct writes are faster than DMA. DMA is preferable for large transfers.
3.3 Adapteva Epiphany: Core-wise utilisation of external memory link under contention - eCores located furthest from the eLink (row 0, col 7) are starved of memory access.
3.4 Stencil on Adapteva Epiphany: Communication of boundary data between neighbouring eCores - Each eCore synchronises with each of its four neighbouring eCores.
3.5 Stencil kernel on Adapteva Epiphany: Single core floating-point performance - Performance ranges from 0.97-1.14 GFLOP/s (81-95%) of peak.
3.6 Stencil kernel on Adapteva Epiphany: 64-core floating-point performance - Darker colours show performance of the stencil kernel including communication of boundary region. Lighter colours at the top of each bar show performance without communication of data.
3.7 Multicore matrix multiplication on Adapteva Epiphany: Assignment of blocks of A and B and data flow between eCores
3.8 Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix A - 1st iteration
3.9 Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix B - 1st iteration
3.10 Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix A - 2nd iteration
3.11 Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix B - 2nd iteration
3.12 Implementing stencil kernel on Tegra SoCs: Partitioning of stencil grid between CPU and GPU - The hollow dots represent the boundary region of the full grid. The dots highlighted in red represent the boundary region of the two partitions which are shared between the CPU and GPU.
3.13 Stencil kernel on TK1: Performance of CPU-only and GPU-only kernels with varying problem size - Number of columns nx = number of rows ny.
3.14 Stencil kernel on TX1: Performance of CPU-only and GPU-only kernels with varying problem size - Number of columns nx = number of rows ny.
3.15 Stencil kernel on Xavier: Performance of CPU-only and GPU-only kernels with varying problem size - Number of columns nx = number of rows ny.
3.16 Stencil kernel on TK1: Performance of partitioned kernel as fraction of work given to CPU is varied - Five problem sizes (nx = ny).
3.17 Stencil kernel on TX1: Performance of partitioned kernel as fraction of work given to CPU is varied - Five problem sizes (nx = ny).
3.18 Stencil kernel on Xavier: Performance of partitioned kernel as fraction of work given to CPU is varied - Five problem sizes (nx = ny).
3.19 GEMM on TK1: Performance of CPU-only and GPU-only kernels with varying problem size - m=k=n.
3.20 GEMM on TX1: Performance of CPU-only and GPU-only kernels with varying problem size - m=k=n.
3.21 GEMM on Xavier: Performance of CPU-only and GPU-only kernels with varying problem size - m=k=n.
3.22 GEMM on TK1: Performance of partitioned kernel as fraction of work given to CPU is varied - Six problem sizes (m=k=n).
3.23 GEMM on TX1: Performance of partitioned kernel as fraction of work given to CPU is varied - Six problem sizes (m=k=n).
3.24 GEMM on Xavier: Performance of partitioned kernel as fraction of work given to CPU is varied - Six problem sizes (m=k=n).

4.1 Custom energy measurement framework using µCurrent Gold
4.2 Visualizing real-time power consumption of Jetson Xavier board. The spikes in power correspond to the times when the system is under load.
4.3 TK1: Partitioned Stencil - nx=ny=8192. Top half shows measured and modelled energy-to-solution for double precision stencil on the TK1 while the bottom half shows measured and modelled energy-to-solution for single precision stencil. Energy is minimised when all work is given to GPU for both cases.
4.4 TX1: Partitioned Stencil - nx=ny=8192. Top half shows measured and modelled energy-to-solution for double precision stencil on the TX1 while the bottom half shows measured and modelled energy-to-solution for single precision stencil. Energy is minimised when all work is given to GPU for both cases.
4.5 Xavier: Partitioned Stencil - nx=ny=8192. Top half shows measured and modelled energy-to-solution for double precision stencil on the Xavier while the bottom half shows measured and modelled energy-to-solution for single precision stencil. Energy is minimised when all work is given to GPU for both cases.
4.6 TK1: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the TK1 while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for both cases.
4.7 TX1: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the TX1 while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for SGEMM while a split ratio of 60% minimises energy-to-solution for DGEMM.
4.8 Xavier: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the Xavier while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for SGEMM while a split ratio of 65% minimises energy-to-solution for DGEMM.
4.9 Intel + K20: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the Sandybridge + K20 system while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for both cases.
4.10 Intel + K80: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the Haswell + K80 system while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for both cases.
4.11 TK1: Partitioned NPB (MZ) BT - Class B. Figure shows measured and modelled energy-to-solution for class B problem size on the TK1 system. Model and measured values indicate that energy is minimised when all work is given to CPU.
4.12 TX1: Partitioned NPB (MZ) BT - Class B. Figure shows measured and modelled energy-to-solution for class B problem size on the TX1 system. Model and measured values indicate that energy is minimised when all work is given to CPU.
4.13 Xavier: Partitioned NPB (MZ) BT - Class B. Figure shows measured and modelled energy-to-solution for class B problem size on the Xavier system. Model and measured values indicate that energy is minimised when all work is given to CPU.
4.14 Double precision stencil on TX1 - Model prediction error for different problem sizes. The colormap shows the error in energy optimality EE (%).
4.15 DGEMM on TX1 - Model prediction error for square and non-square matrices. The work is partitioned by allocating a portion of the columns of matrix B (dimension n) to the CPU and the rest to the GPU.
4.16 Stencil on TX1 - How Fo varies with problem size. The colormap shows the variance of optimal split ratio Fo (%) when problem size is changed. Fo is observed to vary between 45% and 70% for most problem sizes.
4.17 DGEMM on TX1 - How Fo varies with problem size. The colormap shows the variance of optimal split ratio Fo (%) when problem size is changed. Fo is observed to vary between 35% and 70% for most problem sizes.
4.18 CPU Frequency Scaling: SGEMM and Stencil on TK1. Top half shows how performance and energy vary with CPU frequency and bottom half shows how power varies with CPU frequency. Energy for both SGEMM and Stencil is minimised at CPU frequency = 1.12 GHz.
4.19 GPU Frequency Scaling: SGEMM and Stencil on TK1. Top half shows how performance and energy vary with GPU frequency and bottom half shows how power varies with GPU frequency. Energy for both SGEMM and Stencil is minimised at GPU frequency = 612 MHz.

5.1 Double precision stencil pre-tuning results on TX1 - all test cases. The colormap shows the error in energy optimality EE (%). For larger problem sizes the error in energy optimality is seen to be less than 5%.
5.2 DGEMM pre-tuning results on TX1 - all test cases. The colormap shows the error in energy optimality EE (%). For larger problem sizes the error in energy optimality is seen to be much less than 5%.
5.3 Stencil (DP) dynamic tuning method 1 on TX1 - nx=1024. Subset of the problem used for tuning. Tuning size varied from 25% to 50%.
5.4 Stencil (DP) dynamic tuning method 1 on TX1 - nx=2048. Subset of the problem used for tuning. Tuning size varied from 25% to 50%.
5.5 Stencil (DP) dynamic tuning method 1 on TX1 - nx=8192, 16384. Subset of the problem used for tuning. Tuning size varied from 25% to 50%.
5.6 Stencil (DP) dynamic tuning method 2 on TX1 - nx=1024. Subset of iterations used for tuning. Tuning iterations varied from 1 to 5.
5.7 Stencil (DP) dynamic tuning method 2 on TX1 - nx=2048. Subset of iterations used for tuning. Tuning iterations varied from 1 to 5.
5.8 Stencil (DP) dynamic tuning method 2 on TX1 - nx=8192, 16384. Subset of iterations used for tuning. Tuning iterations varied from 1 to 5.
5.9 Stencil progressive refinement tuning - illustration on TX1. Note that this shows only the refinement process. 2 iterations are used initially to estimate Fpi.
5.10 DGEMM dynamic tuning method 1 on TX1: m = k = 2048. Subset of problem used for tuning. Tuning size varied from 5% to 15%.
5.11 DGEMM dynamic tuning method 1 on TX1: m = k = 4096. Subset of problem used for tuning. Tuning size varied from 5% to 15%.
5.12 DGEMM dynamic tuning method 1 on TX1: m = k = 8192. Subset of problem used for tuning. Tuning size varied from 5% to 15%.
5.13 DGEMM dynamic tuning energy optimality error on TX1 for tuning size = 5%. The colormap shows the error in energy optimality EE (%).
5.14 DGEMM dynamic tuning energy optimality error on TX1 for tuning size = 10%. The colormap shows the error in energy optimality EE (%).

C.1 Schematic of energy measurement framework using µCurrent Gold

D.1 CPU-bound workload on TK1 - Variation of active power draw as CPU frequency is varied.
D.2 CPU-bound workload on TK1 - Variation of performance and energy usage when CPU frequency is varied.
D.3 Memory-bound workload on TK1 - Variation of active power draw when CPU frequency is varied.
D.4 Memory-bound workload on TK1 - Variation of performance and energy usage when CPU frequency is varied.


List of Tables

2.1 Summary of chosen LPSoCs . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.1 Adapteva Epiphany coprocessor eMesh: Effect of node distance on transfer latency - Results show very little effect of distance. Latency is measured to be ≈ 6 clock cycles. . . . . . 37

3.2 Matrix multiplication on Adapteva Epiphany: Single core floating-point performance - Peak of ≈ 96% achieved. . . . . . 51

3.3 Matrix multiplication on Adapteva Epiphany: Multi-core on-chip floating-point performance - Peak of 85% achieved. . . . . . 52

3.4 Matrix multiplication on Adapteva Epiphany: Off-chip floating-point performance for larger matrices - Peak of ≈ 11% achieved. . . . . . 52

4.1 Double precision stencil on TX1 - Evaluation of energy model’s prediction of optimal split . . . . . 90

4.2 DGEMM on TX1 - Evaluation of energy model’s prediction of optimal split . . . . . 92

5.1 Double precision stencil pre-tuning results on TX1: Few problem sizes . . . . . 107

5.2 DGEMM pre-tuning results on TX1: Few problem sizes . . . . . 109

5.3 Stencil progressive refinement (method 3) tuning results on TX1 . . . . . 122

5.4 DGEMM progressive refinement (method 3) tuning results on TX1 . . . . . 127

5.5 NPB (MZ) BT solver dynamic tuning method 2 results on TX1 . . . . . 128

A.1 Hardware Platforms for Experiments . . . . . . . . . . . . . . . . . . . . 137


Chapter 1

Introduction

Scientific computing is the collection of tools, techniques, and theories required to solve mathematical models of problems in science and engineering on a computer. It involves the development and use of numerical methods and mathematical models to solve real-world problems efficiently on computers. Simulation and modeling are highly important, especially in cases where experiments are extremely costly or even impossible. Scientific computing generally requires knowledge of the underlying (physical) problem, the ability to formulate a mathematical model, stable and accurate numerical schemes, and an efficient implementation on high performance computers. Examples of scientific computing include climate simulation, financial modeling and aerodynamic simulation. These problem domains require high floating-point computation capability, typically provided by massively parallel systems referred to as High Performance Computing (HPC) systems. Efficient implementation is crucial on these systems and involves choosing, and making the best use of, the underlying hardware.

In recent years the number of systems in the “Top500” list of fastest supercomputers [Top500] containing heterogeneous processing elements has increased. These heterogeneous elements include traditional CPUs, Graphics Processing Units (GPUs) and other custom cores. Leading HPC systems today consist of many thousands of nodes containing multiple such heterogeneous elements. The push to further increase the scale of these systems presents considerable challenges such as reliability, programmability and energy efficiency. The DARPA Exascale Technology study [Bergman et al., 2008] outlined power as the major bottleneck to achieving exascale computing, suggesting a threshold of 20 megawatts as a working boundary for exascale systems. However, the best estimates with current technologies are a factor of between three and five times this amount [Dongarra et al., 2011; Lim et al., 2015]. Meeting such high energy demands is not economically feasible.

Consequently, in the last decade or so there has been an increased effort to investigate the suitability for HPC of low-power hardware traditionally used in a mobile context. In particular, low-power system-on-chips (LPSoC), which are widely used in mobile devices such as smartphones and tablets, are drawing much attention from the HPC community [Rajovic et al., 2014, 2013; Shalf et al., 2011]. LPSoCs generally contain heterogeneous compute units: a multi-core CPU alongside on-chip accelerators such as GPUs, Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs) or other custom cores. Such heterogeneous system-on-chips (SoCs) promise high floating-point capability at low power consumption, characteristics which suggest that they are ready for HPC [Rajovic et al., 2013]. It is therefore important to consider low-power SoCs and other novel architectures as alternative building blocks for future HPC systems.

It is interesting to note that the majority of the systems in the Top500 and the Green500 [Green500] lists use either GPUs or many-core accelerators. Summit [Hines, 2018] and Sierra [LLNL], the two fastest supercomputers as of November 2018, are also third and sixth respectively in the Green500 list. The Sunway TaihuLight [Fu et al., 2016], the third fastest supercomputer, is currently twenty-seventh in the Green500 list, having been in the top three when it was first released three years ago. While Summit and Sierra use NVIDIA Volta GPUs, which contain hundreds of energy-efficient compute cores [NVIDIA, e], the Sunway TaihuLight uses a custom homogeneous many-core processor, the Sunway SW26010, which consists of 260 low-power RISC cores.

Keeping this trend in mind, this thesis considers the use of Adapteva’s Epiphany-IV Network-on-chip (NoC) coprocessor, a low-power custom many-core platform which was developed as part of a Kickstarter project [Olofsson et al., 2014], and three generations of NVIDIA’s GPU-based SoCs, the Tegra K1, Tegra X1 and Xavier, for energy-efficient scientific computing. The Epiphany-IV NoC, containing 64 low-power RISC cores, is capable of energy efficiency of up to 50 GFLOP/s per watt [Olofsson et al., 2014] (where 1 GFLOP/s is one billion floating-point operations per second) and came with the promise of scaling to over 4096 cores in future versions. The Tegra SoCs, starting with the Tegra K1, heralded a new era in mobile computing by combining supercomputer-class GPUs and ARM CPUs on a single chip [NVIDIA, 2014c], with each new generation offering higher performance and better energy efficiency.

All these systems offer high floating-point capability and consume less than 30 W of power. While there are a number of merits to using such low-power SoCs for HPC, they present a few challenges, including:

Developing efficient applications: Careful attention must be given to the architecture of these platforms in order to develop efficient applications for them. The resource-constrained cores in many-core systems such as the Epiphany-IV coprocessor necessitate careful management of memory, and the use of algorithms which map well to all the cores on the chip, in order to extract high performance. The presence of different compute units, such as the CPU and GPU in the heterogeneous Tegra SoCs, presents a complex programming environment to developers. Traditional programming models cannot be relied upon alone, as they tend to focus on one or other of the processing elements rather than utilising the LPSoC as a whole; as a result, scientific applications have traditionally been developed to run on a single processing element. Using all the processing units simultaneously to speed up computation requires considerable effort from the programmer and knowledge of the complex memory hierarchy that exists between the different processing elements. This makes developing applications which can fully exploit all the processing capabilities of a chip a non-trivial challenge. Consequently, few applications are currently capable of utilising all the compute devices on a chip effectively.

Measurement of energy: The ability to measure the energy-to-solution of an application at function level is crucial to analysing its energy efficiency. However, most LPSoC platforms available on the market lack the internal capability to measure their own energy consumption. Some systems have internal current sensors, but obtaining energy measurements from them is usually intrusive and affects the workload being measured. A high-resolution, non-intrusive environment which allows an application to obtain real-time measurements of energy at a fine-grained level would be of great benefit.

Energy minimisation: While maximum theoretical performance is achieved by using all the processing elements on the chip, minimising energy to solution is not so straightforward. The power usage of the different devices varies between active and idle states. Consequently, the energy consumed by a running application varies according to how the application is partitioned for execution between the different devices. An energy-optimal application may run either entirely on one of the devices or be partitioned across them. The fraction of work allocated to each device must be carefully chosen, based on the characteristics of the devices, in order to minimise the energy consumed by the running application. Dynamically modifying an application executing on an LPSoC to minimise its energy consumption is a non-trivial challenge: energy measurements need to be collected and analysed at runtime to determine how to partition the application between the devices in order to minimise its energy to solution.

This thesis addresses the above challenges. In particular, the primary contributions of this work are:

Development of optimised applications for the Epiphany coprocessor: This work demonstrates how to develop highly optimised scientific applications for the resource-constrained many-core Epiphany architecture. In particular, optimised versions of two core computational science kernels are developed: matrix multiplication, as an exemplar of dense linear algebra, and stencil computation, as an exemplar of grid-based numerical methods. A novel buffering scheme for overlapping computation and communication on such low-memory platforms is also demonstrated.


Development of partitioned kernels for the heterogeneous Tegra LPSoCs: This work demonstrates how to develop applications which are capable of utilising the different compute devices on a heterogeneous LPSoC by seamlessly partitioning a computation across them. Partitioned versions of the matrix multiplication and stencil kernels are developed for the NVIDIA Tegra SoCs.

Development of a software API for a custom energy measurement framework: This work describes the development of the software component of a custom energy measurement framework which can be used to collect high-resolution energy measurements for an application executing on an LPSoC platform. Through this software API the framework allows a running application to obtain real-time energy measurements at the function level.

Evaluation of a simple energy usage model for heterogeneous LPSoCs: This work describes a model of the energy usage of an application executing across the different devices of an LPSoC. The model allows prediction of the optimal partitioning of an application in order to minimise its energy to solution. An analysis of the effect of partitioning on an application’s energy consumption, and of how this changes with problem size, is presented. The model is evaluated on a partitioned matrix multiplication kernel, a partitioned stencil kernel and a partitioned block tridiagonal solver from the multi-zone NAS parallel benchmark suite.

Dynamically optimising a running application for energy: This work describes the design and development of a proof-of-concept runtime tuning framework which uses real-time measurements and our energy usage model to decide how to partition an application between different compute devices in order to achieve energy optimality. Different runtime scenarios are identified and explored. The framework is evaluated on the three applications mentioned above, namely the matrix multiplication kernel, the stencil kernel and the block tridiagonal solver, and its applicability to other applications is discussed.

1.1 Thesis Outline

The rest of the thesis is structured as follows:

• Chapter 2 provides a detailed literature review covering all aspects of this thesis. Descriptions of the architectures of the different LPSoCs used in this thesis are also presented here.

• Chapter 3 presents the work done to develop efficient scientific application kernels for the different LPSoC systems, along with a discussion of how these applications can be mapped onto the different LPSoC platforms.


• Chapter 4 presents the design and development of our energy measurement framework, including the custom software API which enables an application to obtain energy measurements at run time. It also presents the work done to develop and evaluate an energy usage model which analytically predicts the optimal strategy for partitioning an application on a heterogeneous platform in order to minimise energy to solution.

• Chapter 5 presents the design and implementation of a proof-of-concept runtime tuning framework which a running program can use to dynamically allocate work to the different processing elements present in a platform in order to execute in an energy-optimal manner.

• Chapter 6 presents the conclusions derived from this work, summarising the contributions and discussing future work arising from it.


Chapter 2

Background and Related Work

This chapter provides background information about the different LPSoC platforms used in this work, their architectural features and the programming models they support. Details of the different scientific applications used in this thesis are also provided. Section 2.1 describes the architecture of the hardware platforms chosen for this work. Section 2.2 describes the programming models supported by these platforms. Section 2.3 gives an overview of the application used in this work for evaluating the tuning framework. Section 2.4 describes previous work related to all areas of this thesis.

2.1 Low-power System-on-Chip Architecture

Generally, a system-on-chip (SoC) includes multiple processing elements with different Instruction Set Architectures (ISAs), each of which is designed for specialized tasks. It also contains the memory subsystem, the network subsystem and other peripherals. Typically, an SoC contains a CPU, which runs the operating system, and an accelerator or special hardware designed for specific purposes such as image processing or rendering graphics. A low-power system-on-chip (LPSoC) is typically used in a mobile or embedded context where energy requirements are constrained.

We can classify LPSoCs broadly into different categories based on their architectural features, in particular based on the type of accelerator they contain. Most commonly these include:

• GPUs: e.g. NVIDIA’s Tegra family of SoCs, Samsung Exynos 5250

• DSPs: e.g. TI’s Keystone II

• FPGAs: e.g. Altera Cyclone

• Custom many-cores: e.g. Tilera’s Tile64, Adapteva’s Epiphany

This work considers low-power platforms from two of the categories identified above. From the many-core based LPSoCs, we choose Adapteva’s Epiphany-IV Co-processor [Adapteva, 2013]. From the GPU-based LPSoCs, we choose three generations of NVIDIA’s Tegra family of SoCs, namely the Tegra K1 [NVIDIA, 2014b], the Tegra X1 [NVIDIA, 2015b] and the newer Xavier [NVIDIA, 2018b]. Development kits are available for each of these SoCs. The platforms are summarised in Table 2.1.

These platforms were chosen based on the following factors:

• Market dominance: With over 100 billion processors produced as of 2017 [ARM Community], ARM is the most commonly used processor in commercially available low-power mobile devices. Hence we limit our problem space to systems with ARM CPUs.

• Relevance of architecture: As mentioned in Chapter 1, all the chosen systems have architectural similarities to processors currently used in HPC systems.

• Availability and cost: All four systems are commercially available and come with relatively inexpensive evaluation modules. The Xavier development kit currently costs USD 2500; the rest of the platforms cost less than USD 500 each.

• Power usage: All the systems have low power usage. While the Xavier can consume up to 30 W, the other systems typically consume a maximum of 20 W under full load.

• Performance: The chosen LPSoC systems are capable of high floating-point performance, ranging from 100 single-precision GFLOP/s for the Epiphany-IV chip to 1.4 single-precision TFLOP/s for the Xavier.

• Programmability: The systems are programmable using widely used programming models such as OpenMP and CUDA.

• Shared memory: On each system the physical memory is shared between the devices and is accessible from all the compute units.

System | CPU | Accelerator | Memory model | Programming model | Approx. cost (USD) | Theoretical performance (SP) | Theoretical bandwidth | Power (W)
Epiphany | ARMv7 | Custom 64-core chip | Shared physical memory | Custom SDK / OpenCL | 99 | 100 GFLOP/s | 2.4 GB/s | 2
Tegra K1 | ARMv7 | Kepler GPU (192 cores) | Shared physical memory | CUDA | 192 | 365 GFLOP/s | 15 GB/s | 15
Tegra X1 | ARMv8 | Maxwell GPU (256 cores) | Shared physical memory | CUDA | 499 | 512 GFLOP/s | 26 GB/s | 15
Xavier | ARMv8 | Volta GPU (512 cores) | Shared physical memory | CUDA | 2500 | 1400 GFLOP/s | 137 GB/s | 30

Table 2.1: Summary of chosen LPSoCs

The following subsections detail the architectural features of each of these platforms.


2.1.1 Adapteva’s Epiphany-IV

The Epiphany architecture is a low-power, multi-core, distributed shared-memory embedded system created by Adapteva [Adapteva, 2013]. The Epiphany-IV 64-core Network-on-chip (NoC) coprocessor contains 64 cores (referred to as eCores) organized in a 2D mesh, with future versions expected to house more than 4096 eCores [Olofsson et al., 2011]. The 64-core Epiphany-IV coprocessor was developed in 2011, while an earlier version, the 16-core Epiphany-III, was released earlier in the same year. The Epiphany-III was available commercially in the Parallella System-on-Module (SoM) board [Adapteva, 2012b], which combined the Epiphany chip with a host ARM processor housed in a Zynq system-on-chip. Though the Epiphany-IV was planned for future versions of the Parallella, it was not produced in volume due to lack of funding. Instead, an earlier development prototype was made available which uses an FPGA mezzanine “daughter” card (FMC) housing the Epiphany-IV, attached to a ZedBoard [ZedBoard]. For this work we used the ZedBoard evaluation module and the daughter card housing the Epiphany-IV 64-core (E64G401) chip.

Both the Parallella and the prototype ZedBoard consist of a Zynq SoC, shared memory and the Epiphany NoC coprocessor, as shown in Figure 2.1.

The Xilinx Zynq 7000-series SoC contains a dual-core ARM Cortex-A9 CPU, running at 800 MHz on the Parallella and 667 MHz on the ZedBoard, with standard on-chip peripherals such as USB 2.0, Ethernet, UART, MIO, AXI bus, GPIO, HDMI and JTAG. It also contains a Field Programmable Gate Array (FPGA) which is used to implement the “glue logic” and the eLink protocol required to interface with the Epiphany coprocessor. In addition, the FPGA implements the AXI master interface, the AXI slave interface and an HDMI controller.

The Epiphany NoC has a 2D array of eCores connected to each other by a mesh network-on-chip. Each eCore consists of a RISC CPU, 32 KB of local scratchpad memory organized as four banks of 8 KB, a Direct Memory Access (DMA) engine, and a network interface to an eMesh router. No cache is present. Each eMesh router provides three network communication channels: an on-chip write network, an off-chip write network and a read request network. The eCore CPU is super-scalar and can execute two floating-point operations and a 64-bit memory load/store operation in every clock cycle. Scratchpad memory can theoretically provide up to 32 bytes per clock cycle of bandwidth.

The eCore CPU has 64 accessible 32-bit registers, which can be used as single-precision floating-point values, 32-bit signed or unsigned integers, or memory pointers with various addressing modes.

The Parallella SoM has 1 GB of DDR3 RAM, while the ZedBoard has 512 MB. The DRAM is partitioned such that Linux, running on the ARM Cortex-A9 CPU, has its own private O/S memory, while a 32 MB chunk of memory, termed “Shared Memory”, is accessible by both the ARM and the Epiphany. Shared memory access from the Epiphany is handled by an eLink interface via the AXI bus and memory controller on the Zynq SoC. The Epiphany has a flat and unprotected memory map: each eCore can address its own local SRAM, the SRAM of other eCores, and the shared off-chip DRAM.

Figure 2.1: Adapteva Epiphany System

2.1.2 NVIDIA Tegra K1

The NVIDIA Jetson TK1 development board [NVIDIA, 2014a] features the Tegra K1 SoC [NVIDIA, 2014c] (referred to as TK1 in the rest of the text), which contains an ARM CPU and a GPU on chip. The CPU has four usable cores along with a battery-saving shadow core. The GPU is an NVIDIA Kepler GPU with 192 CUDA cores. The architecture of the TK1 SoC is shown in Figure 2.2, while the Jetson TK1 development board is shown in Figure 2.3.

Figure 2.2: Tegra K1 SoC [NVIDIA, 2014c]

Figure 2.3: Jetson TK1 Dev kit [NVIDIA, 2014a]

While the Tegra K1 comes in 32-bit and 64-bit versions, the 32-bit version is used here. It has a quad-core ARM v7 Cortex-A15 CPU which is 3-way superscalar and runs at clock speeds of up to 2.3 GHz. Each CPU core has a 32 kB L1 instruction cache and a 32 kB L1 data cache, and the four Cortex-A15 cores share a 2 MB 16-way set-associative L2 cache. The SoC also contains an extra “battery saver” A15 CPU core.

The architecture of the Kepler GPU used in the Tegra K1, named “GK20a”, is virtually identical to the Kepler GPU architecture used in high-end GPUs such as the NVIDIA Tesla K20, K40 and K80 series, but it also includes a number of optimisations for mobile use to conserve power while delivering high performance. It has one Streaming Multiprocessor (SM) containing 192 CUDA cores, compared to 15 SMs containing 2880 CUDA cores on a Tesla K40, and it consumes significantly less power than high-end Kepler GPUs: while desktop and server grade Kepler GPUs can be expected to consume a few hundred watts, the Tegra K1 Kepler GPU consumes a tiny fraction of that (less than 5 W [NVIDIA, 2014c]). The GPU has an L2 cache of 128 kB. The architecture of the Tegra K1 GPU is shown in Figure 2.4.

Figure 2.4: Tegra K1 “GK20a” GPU [NVIDIA, 2014c]

Being designed for mobile graphics, the TK1 SoC also has dual Image Stream Processing (ISP) cores supporting camera sensors and an advanced display engine for driving external monitors, as shown in Figure 2.2.

The Jetson TK1 development board comes with 2 GB of LPDDR3 memory, 16 GB of on-board storage and numerous peripherals and IO ports such as HDMI, USB 3.0, Ethernet and UART. The 2 GB of RAM is shared between the CPU and the GPU.

The Kepler GPU is rated at a theoretical peak performance of 365 single-precision GFLOP/s.


2.1.3 NVIDIA Tegra X1

The Tegra X1 System-on-Module contains the Tegra X1 SoC [NVIDIA, 2015c] (referred to as TX1 in the rest of the text) and was released as the successor to the Tegra K1, promising twice its performance and power efficiency. The NVIDIA Jetson TX1 Development Kit [NVIDIA, 2015a] is built around the Tegra X1 SoM. The SoC features a 64-bit CPU along with an on-chip GPU. The architecture of the TX1 SoC is shown in Figure 2.5, while the Jetson TX1 development board is shown in Figure 2.6.

The CPU contains four 64-bit ARMv8 Cortex-A57 cores and has a maximum clock frequency of 1.9 GHz. Four lower-performance Cortex-A53 cores are also included for lower-power operation in mobile devices, but they are not directly accessible to software and are only activated in low-power modes. The on-chip GPU is a Maxwell “GM20B” GPU containing 256 CUDA cores.

Figure 2.5: Tegra X1 SoC [NVIDIA, 2015c]

The four Cortex-A57 CPU cores on the Tegra X1 share a common 2 MB L2 cache, and each core has a 48 kB L1 instruction cache and a 32 kB L1 data cache.

The architecture of the Maxwell GPU is shown in Figure 2.7. The Maxwell GPU in the TX1 contains two SMs; each SM consists of fundamental compute cores called CUDA cores, texture units, and a PolyMorph engine. Each SM in the Kepler GPU architecture contains 192 CUDA cores, while each Maxwell SM includes 128 CUDA cores. However, a Maxwell CUDA core is a significant upgrade over a Kepler CUDA core, delivering almost forty percent higher performance. The Maxwell GPU has 256 kB of L2 cache.

Figure 2.6: Jetson TX1 Dev kit [NVIDIA, 2015a]

Figure 2.7: Jetson TX1 Maxwell GPU architecture [NVIDIA, 2015c]

The fundamental architecture of the Maxwell GPU used in the Tegra X1 is virtually identical to that of the high-end Maxwell GPU (GM204) used in GTX 980 desktop graphics cards, differing primarily in scale and memory architecture. A high-end Maxwell-based GTX 980 graphics card with a GM204 GPU includes a total of 2048 CUDA cores and 4 GB of frame buffer memory, consuming approximately 165 W of power, while the Maxwell GPU in the Tegra X1 consists of 256 CUDA cores, shares DRAM with the CPU, and consumes only a few watts.

The board has 4 GB of LPDDR4 memory, which is shared between the CPU and the GPU. Like the Jetson TK1 development kit, this board also contains 16 GB of on-board storage along with peripherals and IO ports such as HDMI, USB 3.0, Ethernet and UART.

The Maxwell GPU is rated at a maximum theoretical peak performance of 1 TFLOP/s (half precision) and 512 GFLOP/s (single precision).

2.1.4 NVIDIA Xavier

The NVIDIA AGX Xavier module contains the NVIDIA Xavier SoC [NVIDIA, 2018b], which is the latest addition to NVIDIA’s Tegra SoCs and is mainly targeted at Artificial Intelligence (AI) and Deep Learning (DL) workloads. The Jetson AGX Xavier Developer Kit is an evaluation board which features this module. The Xavier SoC features an 8-core CPU cluster and a Volta GPU on the same chip. The architecture of the Xavier SoC is shown in Figure 2.8 and the Jetson AGX Xavier development kit is shown in Figure 2.9.

Figure 2.8: Xavier SoC [NVIDIA, 2018c]

The CPU contains eight 64-bit ARMv8.2 custom “Carmel” cores, all of which are available to software, running at a maximum frequency of 2.27 GHz. The cluster consists of four duplexes (core pairs), each sharing a 2 MB cache, making up a total of 8 MB of L2 cache as shown in Figure 2.10. It also has a 4 MB L3 cache shared between all the cores.

The Xavier’s integrated Volta GPU (called “GV10B”), shown in Figure 2.11, provides 512 CUDA cores and 64 Tensor Cores. The GPU includes eight Volta Streaming Multiprocessors (SMs), with 64 CUDA cores and 8 Tensor Cores per SM. Each Volta SM includes a 128 kB L1 cache, eight times that of previous generations. The SMs share a 512 kB L2 cache which offers four times faster access than previous generations.


Figure 2.9: Jetson AGX Xavier Development kit [NVIDIA, 2018b]

Figure 2.10: Xavier CPU cluster [NVIDIA, 2018c]


Each Tensor Core can perform 64 FP16 fused multiply-add (FMA) operations, or 128 INT8 multiply-accumulates, per cycle. At the maximum GPU clock frequency of 1.37 GHz, this yields up to 11 TFLOP/s (FP16) or 22 TOP/s (INT8) [NVIDIA, 2018c]. The GPU is rated at 1.4 TFLOP/s (single precision).

Figure 2.11: Xavier Volta GPU [NVIDIA, 2018c]

In addition, the Xavier SoC contains two NVIDIA Deep Learning Accelerators (DLAs) for processing the convolutional neural networks used in object detection and recognition, and two Programmable Vision Accelerators, each with dual vector processors, for computer vision algorithms.

The Jetson AGX Xavier development kit has 16 GB of 256-bit LPDDR4x memory and 32 GB of on-board storage, and also provides various peripherals and IO ports such as HDMI, USB 3.0, Ethernet and UART.

2.2 Programming Models for LPSoCs

Programming models such as OpenCL and CUDA are widely used to offload work to compliant GPUs. While these models require some effort from the programmer to orchestrate data movement and schedule kernels for execution, accelerator models such as OpenACC [Lebacki et al., 2012] and the OpenMP accelerator model [Liao et al., 2013] offer support for offloading work to the attached accelerators/coprocessors using compiler directives/pragmas.

The programming models supported by the LPSoC hardware chosen for this work are detailed in the following subsections.



2.2.1 Programming the Adapteva Epiphany-IV coprocessor

The Epiphany chip is intended to operate as a coprocessor to a more general-purpose host CPU. It can be programmed using C, and has a bare-metal SDK [Adapteva, 2012b] that provides some basic programming primitives to facilitate writing parallelised C code for this architecture. Although recent developments enable the use of OpenMP [Agathos and Papadogiannakis, 2015], MPI [Richie et al., 2015] and OpenCL [Richie and B, 2016] for programming the Epiphany chip, the overheads introduced mean that performance is impacted. Therefore, to extract high performance from the Epiphany chip, using the bare-metal SDK is desirable. Some of the key features of the SDK are:

• Workgroup model: To program the eCores, workgroups are created by specifying the number of rows and columns of nodes and the location of the starting node of the group. The SDK provides functions to determine the ID and location of neighbouring eCores.

• Memory addressing: All eCores share the same address space and it is possible to read and write directly to the local memory of another eCore. The SDK provides functions to obtain the global address of a memory location in another eCore’s local memory, facilitating data transfer between the nodes.

• Communication between eCores: The SDK provides APIs to transfer blocks of data between nodes and to the shared memory. These transfers can be performed using either the CPU or the DMA engine. Two DMA channels are available in each node, supporting both non-blocking and blocking DMA transfers.

• Barriers: The SDK provides functions for setting synchronization points and barriers in the program.

• Hardware Mutex: Mutexes are available to ensure mutual exclusion while accessing shared resources. The workgroup defines a memory location on the chip as the mutex object. The SDK provides functions to enable the eCores to utilise the mutex object.

• Event Timers: Each eCore has two event timers that can be configured to independently monitor key events within the node, such as the clock counter, watchdog timing etc. These can be used to count the number of clock cycles that have elapsed during execution of a block of code.

The development environment requires (at least) two C programs to be written: one for the host CPU and one or more “kernels” for running on the eCore nodes. An application would typically perform all of its initialization and outer loops on the host CPU, with the innermost numerically intensive loops running as kernels on the Epiphany eCore nodes. The steps required to execute a program are:



1. The host program creates a workgroup by specifying the number of rows and columns required and the position of the start node in the group.

2. The host resets all the nodes and loads the device-side executable image into each eCore.

3. The host signals all the eCores in the workgroup to start execution.

4. The host communicates with each eCore either by accessing the core’s local memory or using the shared memory.

5. Once the execution is complete, the host is signalled. The host reads the result either directly from each eCore’s local memory or from the shared memory.
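These host-side steps can be sketched using the e-hal host library from the bare-metal SDK. This is a hedged illustration only: error handling is omitted, the kernel image name `e_task.elf`, the 2×2 workgroup geometry and the local-memory buffer address `0x2000` are assumptions agreed with a hypothetical device kernel.

```c
#include <e-hal.h>   /* Epiphany host-side library */

int main(void) {
    e_epiphany_t dev;
    float result[16];

    e_init(NULL);                          /* initialise the e-hal */
    e_reset_system();                      /* step 2 (part): reset all nodes */

    /* step 1: create a 2x2 workgroup starting at node (0,0) */
    e_open(&dev, 0, 0, 2, 2);

    /* step 2: load the device-side image; E_FALSE = do not start yet */
    e_load_group("e_task.elf", &dev, 0, 0, 2, 2, E_FALSE);

    e_start_group(&dev);                   /* step 3: signal eCores to start */

    /* steps 4-5: read results directly from eCore (0,0)'s local memory */
    e_read(&dev, 0, 0, 0x2000, result, sizeof(result));

    e_close(&dev);
    e_finalize();
    return 0;
}
```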

Memory model

The Epiphany follows a strong memory-order model for all read and write transactions to local memory [Adapteva, 2012a]. This means that all such transactions complete in program order. However, it follows a weak memory-order model for all transactions to non-local memory. While this implies that memory operations may not appear in program order, some guarantees are given which ensure that a) load operations complete before the returned data is used by a subsequent transaction, b) load operations using data previously written use the updated values, and c) store operations eventually propagate to their ultimate destination. Due to the weak memory-order model for access to the shared memory, the programmer may need to use runtime synchronization calls in code that requires strong ordering of load and store operations.

2.2.1.1 Programming Considerations

The Epiphany eCore architecture presents some interesting challenges to implementing high performance numerical codes. The main limitation is the relatively small 32 kB of local RAM per eCore, which must be divided between program code, data and stack. Although each eCore has access to the entire 32-bit address space, performance drops off when accessing non-local memory. Within each eCore, the supplied linker scripts allow the programmer to control which parts of the code and data are to reside in which specific bank of local memory and which parts are to be located in slower off-chip shared memory.

In its current form the Epiphany eCore does not include hardware support for integer multiply, floating-point divide or any double-precision floating-point operations. This design decision frees up silicon for other uses, e.g. for additional cores. The implication of this varies from application to application.

Maximum floating-point performance is achieved when each eCore is performing a stream of Fused Multiply-Add (FMADD) instructions with simultaneous 64-bit load or store operations in each clock cycle. At 600 MHz on a 64-core Epiphany this corresponds to a peak of 76.8 single-precision GFLOP/s. The ability of the compiler to optimise code and achieve this is another matter.

Overcoming Memory Limitations

As indicated previously, the local eCore memory is implemented as four banks of 8 kB each. Maximum performance can be obtained only when code is fetched from one bank while load/store and DMA operations are occurring to other banks.

This further restricts code size to 8 kB or 16 kB, or between 2k and 8k instructions (depending on the mix of 16-bit and 32-bit instruction words). The programmer needs to carefully allocate the use of these four local memory banks in order to achieve the best performance.

For example, the programmer could allocate one bank of local memory for code, two for data (“data 1” and “data 2”) and one for the stack and local variables. With such an arrangement the code can process data to/from “data 1” while using DMA to move data in/out of “data 2”. When the processing and DMA are complete, the code can then go on to process “data 2” while using DMA to move result data out of and new input data into “data 1”.

Adding further pressure on limited memory, branching (e.g. in loops) costs 3 cycles and should be avoided where possible by “unrolling” inner loops. However, unrolling loops comes at a cost in code size. With such small amounts of memory available for code, it is necessary to finely tune the degree to which loops are unrolled. Directives to the C compiler can be used to control the degree of loop unrolling.

Instructions can, however, be fetched from the local memory of other eCores. Thus a novel approach may be to locate smaller fragments of non-innermost loop code in unused portions of the local memory banks of eCores within a row. This code could then be executed, when required, with contention only between the eCores in that row. This would result in less contention for eMesh bandwidth than if all the eCores were executing code out of external shared memory.

Hardware/Software Operation

Codes for array processing often make use of product terms in array indices, for example to calculate row offsets. Without hardware support for integer multiplication it is desirable to iterate through array elements in a regular manner using incremented offsets. Similarly, floating-point divide operations should be removed from inner loops or minimised wherever possible. In both cases these are optimisations that can usually be carried out by a compiler.

2.2.2 Programming the NVIDIA Tegra SoCs

The ARM host CPUs in the NVIDIA Tegra SoCs can be easily programmed using popular programming models including OpenMP. The GPU present in the Tegra SoCs can be programmed using NVIDIA’s CUDA [Sanders and Kandrot, 2010] programming framework.

Typically a GPU is considered an accelerator that performs operations requested by CPU programs. CUDA programs use a set of C or C++ library routines to request GPU operations that are implemented by a combination of hardware and device-driver software.

The structure of a CUDA program is usually as follows:

1. Allocate GPU-local (device) memory for data

2. Copy data from host memory to GPU device memory

3. Launch the GPU “kernel” program

4. Copy output data from device memory back to host memory

5. Free device memory

When invoking a CUDA kernel, the number of GPU threads to use for execution is specified, along with how the threads are organized into groups (thread blocks). Parallelism is achieved through having multiple threads executing the kernel. Kernel launches are always asynchronous, requiring the invoking CPU process to explicitly wait for them to complete by calling cudaDeviceSynchronize().
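The five steps above map onto CUDA host code roughly as follows. This is an illustrative sketch, not a definitive implementation: the `scale` kernel, the launch geometry and the absence of error checking are all simplifications.

```cuda
// Example kernel: scale each element of a device array by a factor.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

void run(float *host, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));                       // 1. allocate device memory
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);                        // 2. copy host -> device
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);             // 3. launch kernel (async)
    cudaDeviceSynchronize();                                   //    wait for completion
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);                        // 4. copy device -> host
    cudaFree(dev);                                             // 5. free device memory
}
```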

CUDA operations pertaining to a given GPU are ordered by associating them with a stream. By default, there is a single stream for all programs that share a GPU, but multiple streams can be optionally created.

Memory access mechanisms

In traditional CUDA programs, the host program typically stores the data in main memory and data must be explicitly copied from CPU to GPU memory. Typically, malloc() is used to allocate memory on the host and cudaMalloc() is used to allocate memory on the GPU. cudaMemcpy() is used to copy the data between the CPU and GPU. After the GPU computation is over, the result is copied back from GPU to CPU memory. Such explicit copies of large amounts of data between CPU and GPU memory can be costly. This is more so for systems with discrete GPUs.

CUDA version 2.2 added a feature called “zero-copy” [NVIDIA, a]. This enables GPU threads to directly access host memory locations. The data is first allocated in host memory using the cudaHostAlloc() function call. A pointer to this object which is accessible from the GPU can then be obtained using cudaHostGetDevicePointer(). This means that programs can simply pass a pointer to the host memory location and avoid the need for explicit cudaMemcpy() calls, making it easier to write and maintain code. This requires mapped pinned (non-pageable) memory on the host. On systems with discrete GPUs, data movement still takes place whenever the GPU accesses an element from this region and is orchestrated automatically by the CUDA driver. On integrated GPUs, it avoids superfluous copies as the integrated GPU and CPU memory are physically the same. However, pinned memory is not cached on the CPU or GPU, which could cause performance degradation.

CUDA version 6.0 introduced another mechanism called Unified Memory Access (UMA) [NVIDIA, b]. Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory, such that a program may have fast or slow access to it depending on where it is being accessed from. With Unified Memory, data is moved between the CPU and GPU memory on demand and this is handled automatically by the CUDA driver. The function call cudaMallocManaged() returns a pointer which is accessible from both CPU and GPU. Unified memory is cached on both the GPU and the CPU.
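The two mechanisms can be contrasted in a short sketch. This is illustrative only: `kernel`, `blocks`, `threads` and `n` are placeholders, and error checking is omitted.

```cuda
// Zero-copy: pinned host allocation; the device pointer aliases host memory.
float *h_buf, *d_alias;
cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_alias, h_buf, 0);
kernel<<<blocks, threads>>>(d_alias, n);   // GPU accesses host memory in place

// Unified Memory: one pointer valid on both CPU and GPU; the driver
// migrates (or, on integrated GPUs, shares) the pages on demand.
float *u_buf;
cudaMallocManaged(&u_buf, n * sizeof(float));
kernel<<<blocks, threads>>>(u_buf, n);
cudaDeviceSynchronize();                   // sync before the CPU touches u_buf
```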

When there are multiple streams, kernels and copy operations from different streams can operate concurrently depending on the GPU hardware.

Concurrent access of shared memory by CPU and GPU kernel

One might think that with UMA and shared physical memory, implementing kernels for concurrent execution using the same shared memory would be trivial. However, NVIDIA’s documentation [NVIDIA, c] states that for integrated GPUs, concurrent access of the same unified memory region from both GPU and CPU is not yet supported due to cache coherency issues.

This is the case on all the Tegra SoCs. It is found that a segmentation fault is thrown when the CPU tries to access shared memory (allocated using cudaMallocManaged()) after a GPU kernel which accesses this memory starts executing. However, we learned that on the TK1 and TX1 systems it is possible to circumvent this issue by immediately unprotecting this shared region of memory using the mprotect() function call each time a GPU kernel starts executing.

However, on the Xavier system, this method of unprotecting memory still yields a segmentation fault. Therefore, double buffering strategies using duplicated copies of data would be required in order to get around this limitation.

2.3 Scientific applications

Scientific computing seeks to use advanced computing capabilities to understand and solve complex problems. It is typically the application of computer simulation and other forms of computation from numerical analysis and theoretical computer science to solve problems in various scientific disciplines. Scientists and engineers develop application codes that model the systems being studied and run these programs with various sets of input parameters. Such programs often model real-world changing conditions, such as weather, air flow around a plane, the motion of stars in a galaxy etc. In most cases, these models require massive amounts of calculations, usually floating-point.

Though the applications of computational science are very diverse, there are a number of algorithmic patterns that generally recur throughout them. Seven such floating-point algorithmic patterns, called the “Seven Dwarfs” [Colella, 2004], were identified by Phillip Colella in 2004 as being important for science and engineering for at least the next decade. A dwarf is an algorithmic method that captures a pattern of computation and communication. The Seven Dwarfs constitute equivalence classes, where membership in a class is defined by similarity in computation and data movement. The dwarfs are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications.

Researchers from Berkeley have since expanded the list of dwarfs to thirteen classes of applications [Asanovic et al., 2006]. They are: i) Dense Linear Algebra, ii) Sparse Linear Algebra, iii) Spectral Methods, iv) N-Body Methods, v) Structured Grids, vi) Unstructured Grids, vii) MapReduce, viii) Combinational Logic, ix) Graph Traversal, x) Dynamic Programming, xi) Backtrack and Branch-and-Bound, xii) Graphical Models, and xiii) Finite State Machines.

For this work, we focus on three applications: stencil computation, matrix multiplication (GEMM) and the block tridiagonal (BT) solver from the NAS Parallel Benchmark suite. The applications chosen are representative of a broader range of widely used scientific applications. GEMM and stencil are important computational science kernels and have been used in a number of previous works for evaluating auto-tuning techniques, while the BT solver (and the NAS parallel suite in general) has been widely used to benchmark the all-round performance of systems.

2.3.1 Stencil computation

Stencil computation is an important computation pattern that is widely used in scientific computing [Wittmann et al., 2010; Datta et al., 2009; Kamil et al., 2009] and can be considered to fall under the structured grid category of the Berkeley dwarfs. Stencil kernels apply regular operations on a grid, and are common to a wide range of applications including image processing, partial differential equation (PDE) solvers etc. Stencil computations are characterized by high memory traffic due to non-contiguous memory access patterns and low operational intensity. Although many stencil codes can be optimized to achieve high computational intensity via temporal blocking [Datta et al., 2009; Wittmann et al., 2010; Datta et al., 2008; Kamil et al., 2006], naive stencil computations are generally memory bandwidth bound.

The stencil chosen here is the five-point star-shaped stencil used to solve the homogeneous, anisotropic heat equation on a structured 2-dimensional rectangular grid. In this computation the value of the temperature at a particular grid point is updated based on the current values of the temperature at the four surrounding grid points in an iterative fashion, with the temperatures at the grid boundary kept constant. Using i and j to reference points on a 2D Cartesian grid in the x and y directions and T as the temperature, an update proceeds as follows:

T^{new}_{i,j} = w_1 * T^{prev}_{i,j+1} + w_2 * T^{prev}_{i,j} + w_3 * T^{prev}_{i,j-1} + w_4 * T^{prev}_{i+1,j} + w_5 * T^{prev}_{i-1,j}

Parallelisation is usually via domain decomposition, with data communication primarily between adjacent domains. The grid of values is stored and distributed equally among all the processing elements. The computation of each grid point depends on the values of its neighbouring grid points from the previous iteration.

Stencil computation may be partitioned across devices in different ways. The rectangular grid may be divided along one dimension, either row-wise or column-wise, where one processing element is responsible for one sub-domain while the other processing element is responsible for the other sub-domain. Each of the devices can simultaneously compute a portion of the final result. The grid may also be divided into 2-dimensional blocks. For simplicity, one-dimensional partitioning is considered here.

2.3.2 Dense matrix multiplication (GEMM)

Matrix multiplication is one of the most important and computationally intensive operations in scientific computing and falls under the dense linear algebra category of the Berkeley dwarfs. A majority of HPC applications depend on the Basic Linear Algebra Subprograms (BLAS). Achieving good performance scalability for fundamental BLAS operations is central to these applications. General Matrix Multiplication (GEMM) is a level 3 BLAS operation that is characterized by high computational density (when optimised using loop blocking and unrolling techniques) and is often used to benchmark and test the performance of HPC platforms since it provides a good basis for comparing their performance.

Since matrix multiplication (matmul) can be easily subdivided into parts that can be computed in parallel, there are several algorithms for parallel matrix multiplication [Van De Geijn and Watts, 1997; Cannon, 1969; Choi et al., 1994]. The overall problem can be divided into multiple independent sub-matrix products which can be solved simultaneously. Considering two operand matrices A and B of dimensions m × k and k × n respectively, and the product matrix C of dimension m × n, the matrix product C = A × B can be partitioned by columns in the manner

[C_1  C_2  ...  C_p] = A × [B_1  B_2  ...  B_p].

The problem is thus subdivided into p independent matrix products C_i = A × B_i, each of which can be computed in parallel. These sub-products can be allocated to the different compute devices present.



2.3.3 The Block Tridiagonal (BT) solver

The Numerical Aerodynamic Simulation (NAS) program, located at NASA Ames Research Center, focuses on computational fluid dynamics and related aeroscience disciplines. To measure objectively the performance of highly parallel computers and to compare their performance with that of conventional supercomputers, NAS developed the NAS Parallel Benchmark (NPB) suite [Bailey et al., 1991], comprising a set of benchmark applications derived from computational fluid dynamics codes. These benchmarks have been widely used to evaluate the performance of HPC platforms. Although other benchmark suites such as Rodinia [Che et al., 2009] and Mantevo [Heroux and Barrett, 2011] have recently gained popularity for studying the performance of heterogeneous systems, the NPB suite still remains extremely relevant in the HPC community [Okada et al., 2016; Marcondes et al., 2016; Ibrahim et al., 2018; Sundriyal and Sosonkina, 2018].

The Block Tridiagonal (BT) Solver is one of three simulated application benchmarks from the NAS Parallel Benchmark (NPB) suite. The BT solver represents the heart of the computationally-intensive building blocks of the CFD programs in most common use today for the numerical solution of three-dimensional Euler/Navier-Stokes equations using finite-volume, finite-difference discretization on structured grids. It computes the numerical solution for a synthetic system of five nonlinear PDEs and is a simplified version of the solvers used in many computational fluid dynamics (CFD) programs. It can be considered to encompass a number of computational patterns.

Since NPB does not feature the several levels of parallelism exhibited by many scientific problems, the NPB Multi-Zone (NPB-MZ) version was created to remedy this deficiency [der Wijngaart and Jin, 2003]. The BT solver from the multi-zone suite is considered here. Here, the solver operates on a structured discretization mesh that is a logical cube. In realistic applications, however, a single such mesh is often not sufficient to describe a complex domain, and multiple meshes or zones are used to cover it. The flow equations are solved independently in each zone, and after each iteration the zones exchange boundary values with their immediate neighbors with which they overlap.

We use the hybrid implementation developed by Dümmler and Rünger [2013] for execution on our heterogeneous platforms. This hybrid CPU + GPU implementation parallelises by allocating the different zones to the different available processing elements (CPU and GPU) in each timestep. The computation for these zones can then proceed as independent tasks on the different compute devices. Once the CPU and the GPU complete their kernel computations for a particular timestep, the boundary data is exchanged between them in order to start the computations for the next timestep. This necessitates a synchronization between the processing elements in each timestep.



2.4 Related work

This section presents a brief description of other related work with respect to the different areas explored in this thesis. Section 2.4.1 describes other work which explores the use of LPSoCs for scientific computing. Section 2.4.2 presents work which uses distributed CPU-accelerator computations to improve performance and energy efficiency. Section 2.4.3 explores other energy measurement techniques. Section 2.4.4 surveys previous work related to the modelling of energy and auto-tuning techniques for optimising energy usage.

2.4.1 Using LPSoCs for Scientific Computing

Rajovic et al. [2013] describe their experiences in designing and deploying HPC cluster prototypes using mobile SoCs and detail the problems encountered. Their performance and energy efficiency results suggested that such low-powered SoCs are HPC-ready. However, they identified a number of limitations such as the lack of high-bandwidth I/O interfaces, the lack of ECC in DRAM etc. These were, however, design decisions by the manufacturer, since these features were not required for mobile devices such as smartphones. Since they are critical for HPC systems, the authors conclude that adding these features would likely make such SoCs suitable for HPC.

Mont-Blanc [Rajovic et al., 2014] is a European project that explores the use of energy-efficient computing technologies for exascale supercomputers. Beginning in 2011, the project has focused on using ARM and GPU processors to help achieve the high levels of compute efficiency that will be required for these future systems. The prototype cluster built with ARM multicore chips achieves 120 MFLOP/s per Watt and is competitive with AMD Opteron 6128 and Intel Xeon X5660-based systems.

Previous work undertaken to evaluate the performance of the Tegra SoCs includes [Calore et al., 2015; Fu et al.; Fatica and Phillips], which run simulations on the Tegra K1 SoC. Otterness et al. [2017] evaluate the Tegra X1 SoC and conclude that under CUDA 8, using zero-copy or unified memory does not provide much performance benefit, but that this changes significantly with the CUDA version. Cavicchioli et al. [2018] analyze the conflicts due to parallel accesses to main memory by both the CPU cores and the integrated GPU on the Tegra K1 and X1 SoCs and show how CPU activity may increase integrated GPU task execution times.

Mitra [2017] evaluates the use of the TI Keystone II SoC for HPC workloads and concludes that it is unlikely to be part of future HPC systems since it is not as suitable for double-precision computations as conventional systems.

2.4.2 Use of CPU-Accelerator Work Distribution

The use of different heterogeneous components simultaneously to extract maximum performance is a widely researched topic. Ohshima et al. [2007] proposed a technique for the parallel processing of matrix multiplication using both CPUs and GPUs in a heterogeneous environment. Their results showed that the execution time of large matrix multiplications can be decreased to 40% when compared with the fastest execution times using only the CPU or only the GPU.

One of the main problems to be dealt with in a heterogeneous processing environment is the difference in performance between the different processing components, say CPU and GPU. Thus, balancing workloads as well as data transfers across CPUs and GPUs is critical in order to achieve good performance. Static partitioning techniques are used by Beaumont et al. [2001] for matrix multiplication in a CPU/GPU environment, using a-priori information about the application and the platform in order to decide how the workload should be partitioned.

Papadrakakis et al. [2011] describe a dynamic balancing approach to optimise performance, where the work is divided up using domain decomposition methods and each small unit of work is added to a queue of tasks. Both the CPU and GPU are fed with tasks from this queue in an asynchronous manner. Thus both the CPU and GPU are constantly busy with computations until the queue is emptied.

Lang and Rünger [2014] implemented a hybrid version of the preconditioned conjugate gradient method (CGM), while Dümmler and Rünger [2013] developed a hybrid CPU-GPU implementation of the multi-zone version of the NAS parallel benchmark suite.

The MAGMA library [Agullo et al., 2009] automatically allocates different BLAS calls in an application to either the CPU or GPU depending on their suitability. However, individual BLAS kernels are not split between the CPU and GPU.

Donfack et al. [2014] propose a heuristic-driven approach to divide LU factorization between CPU and GPU to optimise for energy and performance. This is done at the algorithm level by adjusting tile sizes, effectively varying the relative cost of different subroutines, based on the CPU’s and GPU’s theoretical peak performances.

OmpSs (OpenMP SuperScalar) [Duran et al., 2011] is a task-based programming model which exploits parallelism based on annotations using pragmas and supports execution on heterogeneous devices. StarPU [Augonnet et al., 2011] is a tasking API that allows developers to design applications for heterogeneous environments. It provides an automatic data transfer mechanism to communicate data between the CPU and the GPU and also provides a framework to schedule tasks for heterogeneous execution. However, both of these need special compiler support, which limits their usability for embedded, heterogeneous systems with limited resources.

2.4.3 Measuring Energy Consumption

Few LPSoC systems offer features which enable measurement and monitoring of their energy consumption. One such system is the Odroid-XU3 [Gensh et al., 2015], which has inbuilt current sensors and makes current and voltage measurements available via the /sys filesystem. The Jetson Xavier system provides this capability in a similar manner. However, in both cases, it is recommended to sample the values at a frequency of 1 second or more. Sampling at a higher rate causes the power consumption to increase.

In the absence of hardware support for measuring energy consumption, external devices are generally used. The Yokogawa WT230 power meter is widely used for measuring the power drawn from the AC line [Rajovic et al., 2014]. However, this costs over 3000 USD. The WattsUp Pro meter is a relatively inexpensive alternative which costs around 150 USD and is also used to measure the AC power drawn [Tiwari et al., 2012a].

Techniques used by Bedard et al. [2010] and Cao [2014] employ current sensors that sit between the system power supply and internal components. Measurements are sent through a USB interface to an external device. These can be reliably used to measure the power consumption of traditional computer systems. Measurements are collected and correlated with the application benchmarks by matching timestamps, and are not available in real time.

The measurement framework which is presented in our work was developed as a collaborative effort and uses the basic technique used by Cao [2014]. Calore et al. [2015] describe a framework which is similar to ours. However, a key difference between the other measurement frameworks and ours is that our software API makes real-time energy measurements available to the running application. It can also be used with other unified power collection APIs such as EML [Cabrera et al., 2015] and Energymon [Imes et al., 2016], which were developed to collect measurements from different sources.

For conventional HPC systems, the energy usage of Intel CPUs, starting from Sandy Bridge, can be obtained using the hardware feature called Running Average Power Limit (RAPL) [Intel, 2011], which uses activity counters and predefined weights to record accumulated energy in Model-Specific Registers (MSRs). The NVIDIA Management Library (NVML) provides APIs to measure the energy consumed by Tesla and Quadro GPUs [NVIDIA, 2012]. These are not available for embedded GPUs.
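On Linux, the RAPL counters are typically exposed through the powercap sysfs interface. The sketch below is an illustration under that assumption (the path and the helper names are ours; availability varies by CPU and kernel): it reads the accumulated package energy and computes the difference between two samples, allowing for one counter wraparound.

```c
#include <stdio.h>

/* Assumed powercap path for the package-level RAPL domain. */
#define RAPL_ENERGY_UJ "/sys/class/powercap/intel-rapl:0/energy_uj"

/* Read the accumulated energy counter in microjoules; -1 on failure. */
long long read_energy_uj(const char *path)
{
    long long uj = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

/* Energy consumed between two samples of a counter that wraps at
 * max_range_uj (reported in max_energy_range_uj alongside energy_uj). */
long long energy_delta_uj(long long before, long long after,
                          long long max_range_uj)
{
    return (after >= before) ? after - before
                             : max_range_uj - before + after;
}
```

Dividing the delta by the elapsed time gives average power; the same wraparound logic applies to any accumulating energy counter.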

2.4.4 Modelling and Auto Tuning for Energy

Previous research providing detailed energy models for heterogeneous systems includes work by Tiwari et al. [2015, 2012c], which introduces models for predicting component-level power characteristics of large-scale HPC systems.

The roofline energy model [Choi et al., 2013] provides high-level guidance for optimally balancing energy consumption and performance for a given application on a target platform. This model helps algorithm designers analyse the relationships between time, energy and power using the notion of energy-balance, which measures the ratio of flops to bytes per unit of energy. The Execution-Cache-Memory (ECM) performance model [Hager et al., 2016] provides insights into the relevant contributions of the hardware model to the performance of a given loop code, allowing detailed identification of optimisation opportunities.

These detailed models were designed for modelling the performance and energy usage of an algorithm on a processor. However, they do not address execution in a heterogeneous environment.

Komoda et al. [2013] developed a power-capping technique which coordinates DVFS and task mapping in a CPU-GPU heterogeneous system, using empirical models to predict performance and maximum power consumption given the frequencies of the CPU and GPU and the distribution of workload.

PEACH [Ge et al., 2014] provides an analytical performance and energy model which captures the performance and energy impact of computation distribution and energy-saving scheduling, in order to identify the optimal strategy for best performance or lowest energy consumption. However, there is little discussion of how to build a practical framework usable for real applications executing on a platform.

Our tuning framework uses a simple energy model for modelling execution in a heterogeneous environment, which is similar to that of Ge et al. [2014] and has been extended to use the measurable power states of LPSoC systems.

Auto-tuning is an established method for adapting the execution of an application to the underlying hardware in order to attain performance and energy-related goals [Lang et al., 2015]. Tuning is done either at compilation time or at runtime. Offline tuning approaches typically follow an empirical search strategy which identifies tunable parameters and evaluates the performance and/or energy for different values of those parameters. Dynamic auto-tuning approaches execute an instance of the application and use run-time information to dynamically refine and choose the best parameter setting for the application. Alternatively, auto-tuning can be realised as an online process that measures and adapts parameters during the execution of a long-running application [Karcher et al., 2009].

The well-known ATLAS library employs a paradigm known as Automated Empirical Optimisation of Software (AEOS) [Whaley et al., 2000]. This approach is used when there exist many different methods of performing the required operations, and it uses empirical timings to choose the best method for a given architecture. ATLAS investigates a large number of code variants and parameter settings for different operations and selects those showing the best performance. The optimal values for several parameters, such as block sizes and pre-fetch distances, are found empirically and chosen to obtain the best performance depending on the input problem size.

Similar to ATLAS, PHiPAC [Bilmes et al., 1997] uses an empirical search strategy which evaluates several levels of explicit loop unrolling, depths of software pipelining, cache blocking, etc. to find the best performing code variant. It uses heuristics to limit the search space for each parameter.

Datta et al. [2008] developed an auto-tuning environment for stencil codes similar to that exemplified by libraries like ATLAS and OSKI. Here, a Perl code generator produces multithreaded C code variants encompassing various stencil optimisations. The second component of the auto-tuner is an auto-tuning benchmark that searches the parameter space through a combination of explicit search for global maxima and heuristics for constraining the search space. At completion, the auto-tuner reports both peak performance and the optimal parameters. Kamil et al. [2009] employ a similar technique for auto-tuning stencil codes.

Similarly, several tools exist which optimise the energy efficiency of parallel applications. Gschwandtner et al. investigate the auto-tuning of parallel codes using the Insieme compiler to optimise execution time, resource usage and energy consumption. Here, an input code is loaded by the compiler, analysed and prepared to be tuned prior to execution. The optimiser conducts auto-tuning by iteratively evaluating sets of configurations for each code on the target system. The runtime system then dynamically selects the preferred code version to be executed. Its search engine evaluates points either on an equidistant grid or over a random set of values defined for each tunable parameter.

Active Harmony [Tiwari et al., 2012b] takes a search-based offline auto-tuning approach. A set of tunable parameters is identified for different potential performance bottlenecks in an application. Parameter configurations (admissible values for the tunable parameters) serve as points in the search space. The feedback-driven empirical auto-tuner monitors the application's performance and power consumption and adjusts the values of the tunable parameters in response. The feedback metric values associated with different parameter configurations are measured by running the target application on the target platform. The methodology is thus offline, because the tuning adjustments are made between successive full application runs based on the observed power consumption for code variants.

Shen [2015] describes an offline tuning approach for performance which considers a search space of different problem sizes of an application executing in a heterogeneous environment. Energy usage is not considered. Our static tuning approach follows a similar path to optimise for energy, evaluating how the performance and energy usage of an application vary with problem size.

The recent study by Endrei et al. [2018] describes an offline analytical tuning policy which can guide the identification of bottlenecks for CPU-based execution, considering parameters such as DVFS and thread scaling. Evaluation is done for varying problem sizes over a set of applications including stencil computation and matrix multiplication. Heterogeneous execution is not considered.

Pack & Cap [Cochran et al., 2011] presents a control technique designed to make optimal DVFS and thread packing decisions in order to maximise performance within a power budget. A multinomial logistic regression (MLR) classifier is trained to characterise the optimal thread packing and voltage and frequency settings as a function of user-defined peak power constraints. This is used to make online decisions.

The POET library [Imes et al., 2015] and JouleGuard [Hoffmann and Henry, 2015] enable applications to tune their own resource usage by choosing predefined system configurations of frequency and thread count at runtime, based on feedback.

Archon [Siehl and Zhao, 2017] follows an approach which is similar to our dynamic runtime tuning framework for deciding the optimal energy split for a hybrid matrix multiplication kernel. A simple energy model is then used to predict the optimal split to minimise energy. However, it only measures time and uses an estimation model to estimate the energy consumption. Evaluation is done for only one application: matrix multiplication.

2.5 Summary

This chapter presented background information on the different hardware platforms, their programming models and the different applications used in the rest of the thesis. Previous work related to the different issues addressed in this thesis was also presented. In the following chapters, each issue is explored in detail, starting with the strategies for developing efficient applications for the different hardware platforms.


Chapter 3

Developing Applications for Adapteva Epiphany and NVIDIA Tegra LPSoCs

This chapter presents the strategies used to develop efficient application software for the chosen hardware platforms, namely Adapteva's Epiphany co-processor and the NVIDIA Tegra SoCs. Appropriate strategies need to be chosen for each platform in order to map the problem to the underlying architecture.

The Epiphany-IV chip is a coprocessor which contains 64 RISC cores called eCores, arranged in a mesh network with a shared address space. The strategy for developing applications for such a platform is to implement optimised kernels for a single mesh node and use appropriate algorithms to parallelise over all the nodes in the mesh. With no support for any widely-used programming models at the time of release, the only way to program this chip was using an SDK provided by Adapteva. The work detailed in this chapter presents one of the first attempts to write efficient code for this architecture. Benchmarking was also done to evaluate the performance characteristics of the chip.

With very limited memory per eCore for storing both data and code, programming the Epiphany system presents significant challenges. Two highly optimised kernels, namely stencil and matrix multiplication, were developed for this architecture. These two kernels were published in the Parallella examples repository [Adapteva]. A novel, efficient double-buffering scheme was also implemented to overlap communication and computation, and can be used on such memory-constrained platforms.

The NVIDIA Tegra SoCs contain an ARM CPU and a GPU on the same chip with access to the same physical shared memory. The presence of multiple heterogeneous devices leads to a complex programming environment. Most applications and popular benchmark suites such as Rodinia [Che et al., 2009] are usually written to execute on either a CPU or a GPU, rather than on both simultaneously. Consequently, there are not many applications which are capable of effectively utilising all the compute devices in such heterogeneous environments. This chapter presents the work done to develop versions of the widely used stencil and matrix multiplication kernels that can be seamlessly partitioned across the multiple devices present in a heterogeneous SoC.

Portions of the work presented in this chapter have been published in the following papers:

• Anish Varghese, Robert Edwards, Gaurav Mitra, Alistair P. Rendell, Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor, The International Journal of High Performance Computing Applications (IJHPCA) [Varghese et al., 2015].

• Anish Varghese, Robert Edwards, Gaurav Mitra, Alistair P. Rendell, Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor, The 4th International Workshop on Accelerators and Hybrid Exascale Systems, International Parallel & Distributed Processing Symposium (AsHES Workshop, IPDPS) [Varghese et al., 2014].

• Gaurav Mitra, Andrew Haigh, Anish Varghese, Luke Angove, Alistair P. Rendell, Split Wisely: When work partitioning is energy-optimal on heterogeneous hardware, The 18th IEEE International Conference on High Performance Computing and Communications (HPCC) [Mitra et al., 2016].

3.1 Developing Applications for Adapteva's Epiphany coprocessor

In order to develop applications for the Epiphany coprocessor in an efficient manner, the characteristics of this relatively unknown chip need to be understood first. Hence, some microbenchmarks were written to assess the capabilities of the Epiphany chip in areas such as inter-core memory bandwidth, off-chip memory bandwidth and network utilisation. A prototype ZedBoard [ZedBoard] evaluation module, which contains an FPGA "daughter" card (FMC) housing the Epiphany-IV 64-core (E64G401) chip, is used for all experiments, as described in Appendix A.

3.1.1 Evaluating the Epiphany chip’s Memory Subsystem

This section describes the experiments which were performed to assess the Epiphany chip's capabilities with regard to on-chip and off-chip memory transfers.

3.1.1.1 Measuring On-chip Memory Bandwidth and Latency

The eMesh Network-on-Chip has a 2D mesh topology with only nearest-neighbour connections, as described in Section 2.1.1. To evaluate the cost of routing messages from one eCore to another, a micro-benchmark was written. In this benchmark, one eCore in the mesh writes data as a sequence of 32-bit transfers into the memory of another eCore. Once the transfers are complete, the source eCore writes to a specific location in the receiving eCore. The receiving eCore monitors this location, observes the change, and begins to write the data into the memory of the next eCore in the row. This process continues for all the mesh nodes, with the boundary nodes transferring the message to the next row. After all nodes have received and forwarded the data, the whole process is repeated a number of times. The total data transferred and total mean time are recorded. Two methods are used to transfer the data between two eCores: DMA and point-to-point writes. Pseudo-code for the benchmark with point-to-point write transfers is given in Listing 3.1.

Listing 3.1: Code for point-to-point transfer between cores of Adapteva Epiphany

int *flag = (int *)0x7000;
static float val[SIZE];

*flag = 0;
val_next_core_start = (float *)e_get_global_address(neighbour_row, neighbour_col, val);
flag_next_core = (int *)e_get_global_address(neighbour_row, neighbour_col, flag);
float *val_next0 = &val_next_core_start[0];
...
float *val_next19 = &val_next_core_start[19];

e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
time_e = e_ctimer_get(E_CTIMER_0);
for (loopcount = 1; loopcount <= LOOP; loopcount++) {
    /* Wait for the previous core to finish */
    while (*flag < loopcount);
    /* Write data into the neighbour's memory */
    *val_next0 = val[0];
    ...
    *val_next19 = val[19];
    /* Set the flag of the next core */
    *flag_next_core = (coreid != end_core) ? loopcount : loopcount + 1;
}
time_s = e_ctimer_get(E_CTIMER_0);
e_ctimer_stop(E_CTIMER_0);
clocks = time_e - time_s;

The bandwidths observed using the DMA and direct write methods, as a function of message length for transfers between adjacent eCores, are shown in Figure 3.1. It is clear that for all but very small messages it is better to use DMA rather than issuing individual write instructions. For large messages, DMA achieves transfer rates of around 2 GB/s. Theoretically, with a 32-bit single-word transfer per clock cycle, the DMA engine can provide a sustained data transfer rate of 2.4 GB/s at a clock speed of 600 MHz. With doubleword transfers it can provide a transfer rate of up to 4.8 GB/s.
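These theoretical figures follow directly from the transfer width multiplied by the clock frequency; a small helper (ours, for illustration only) makes the arithmetic explicit:

```c
/* Peak sustained transfer rate in bytes per second, assuming one
 * transfer of `width_bytes` per clock cycle at frequency `hz`. */
double peak_rate_bytes_per_s(double width_bytes, double hz)
{
    return width_bytes * hz;
}
```

A 32-bit (4-byte) word per cycle at 600 MHz gives 2.4 GB/s, and a 64-bit doubleword gives 4.8 GB/s, matching the figures above.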

Latency is important for small data transfers. Figure 3.2 shows the transfer time for small messages. For transfers of less than about 500 bytes, it is faster to write directly into the memory of an adjacent eCore than to use DMA transfers. Beyond 500 bytes, DMA is preferable.

In Table 3.1 we report the latency for an 80-byte message transferred from eCore 0,0 to one of the other cores in the 8 × 8 grid. The Manhattan distance of each transfer is given. This shows surprisingly little effect of distance, although all transfers are relatively slow in terms of clock cycles (≈ 6 clock cycles at 600 MHz).
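The Manhattan distance in Table 3.1 is the hop count between mesh coordinates; for illustration (the helper function is ours):

```c
#include <stdlib.h>

/* Hop-count (Manhattan) distance between mesh nodes (r1,c1) and (r2,c2). */
int manhattan(int r1, int c1, int r2, int c2)
{
    return abs(r1 - r2) + abs(c1 - c2);
}
```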


Figure 3.1: Adapteva Epiphany coprocessor eMesh bandwidth: DMA transfer vs direct on-chip writes - For large messages DMA transfer achieves around 2 GB/s. (Plot of bandwidth in MB/s against total message size in bytes, for DMA transfers and direct on-chip writes.)

Figure 3.2: Adapteva Epiphany coprocessor eMesh transfer time: DMA transfer vs direct on-chip writes - For small transfers point-to-point direct writes are faster than DMA; DMA is preferable for large transfers. (Plot of transfer time in µs against total message size in bytes.)


Node 1   Node 2   Manhattan Distance   Time per transfer (ns)
0,0      0,1      1                    11.12
0,0      1,0      1                    11.12
0,0      0,2      2                    11.14
0,0      1,1      2                    11.14
0,0      1,2      3                    11.19
0,0      3,0      3                    11.19
0,0      0,4      4                    11.38
0,0      1,3      4                    11.38
0,0      3,3      5                    11.62
0,0      4,4      6                    11.86
0,0      7,7      14                   12.57

Table 3.1: Adapteva Epiphany coprocessor eMesh: Effect of node distance on transfer latency - Results show very little effect of distance. Latency is measured to be ≈ 6 clock cycles.

3.1.1.2 Accessing External (Off-chip) Shared Memory

As mentioned in Section 2.1.1, the only way to get data into and out of the Epiphany chip is via the shared memory (unless external hardware is connected to the other eLink interfaces).

Some example code exists to showcase the performance of the memory system, but none shows the performance when multiple eCores attempt to write to the external shared memory over the single 8-bit wide 600 MHz (600 MB/s in each direction) eLink at the same time, or how these accesses may be impacted by normal ARM CPU memory accesses to the shared DRAM.

To evaluate the relative share of the external memory interface that is allocated to each eCore for off-chip data transfers, a micro-benchmark was written. In this benchmark, all 64 eCores in the grid continuously write blocks of 2 kB, as sequences of 4-byte stores, to the external memory. This is done for a specific period of time (two seconds) and the utilisation of the eLink by each mesh node is measured. The result is shown in Figure 3.3. The effects of starvation are clearly evident. The results show that location does matter when an eCore is attempting to write to the external shared memory, causing a strong load imbalance between the eCores. Nodes closer to column 7 and row 0 get the best write access to external DRAM. Nodes closer to column 7 always do better than eCores closer to row 0. With sufficient contention, many (all) eCores in rows 5 to 7 simply miss out on write slots. The maximum write throughput to external shared memory achieved was 150 MB/s.
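The per-core inner loop of this benchmark can be sketched as a portable host-side analogue (the block size is from the text; the function name and the use of `clock()` are our own assumptions, since on the eCores the hardware ctimers and the actual eLink destination are used instead):

```c
#include <string.h>
#include <time.h>

#define BLOCK 2048  /* 2 kB blocks, as in the benchmark */

/* Repeatedly write BLOCK-byte blocks into dst for roughly `seconds`,
 * returning the number of bytes written; each core's count divided by
 * the aggregate gives its share of the link. */
long long timed_write(char *dst, const char *src, double seconds)
{
    long long bytes = 0;
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < end) {
        memcpy(dst, src, BLOCK);
        bytes += BLOCK;
    }
    return bytes;
}
```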


Figure 3.3: Adapteva Epiphany: Core-wise utilisation of external memory link under contention - eCores located furthest from the eLink (row 0, col 7) are starved of memory access. (Heat map over the 8 × 8 core grid; per-core link utilisation ranges from 0.00 to 0.18.)

3.1.2 Developing Stencil kernel for the Epiphany chip

As described in Section 2.3.1, a 5-point star ("+") shaped stencil is developed for a 2D rectangular grid. We reference the five points as Top, Left, Centre, Right and Bottom (T, L, C, R, B). Using i and j to reference points on a 2D Cartesian grid in the x and y directions, and T as the temperature, an update proceeds as follows:

T_new[i,j] = w1 * T_prev[i,j+1] + w2 * T_prev[i,j] + w3 * T_prev[i,j-1]
           + w4 * T_prev[i+1,j] + w5 * T_prev[i-1,j]
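Assuming the row-major one-dimensional storage described in this section, the update can be written as a reference (unoptimised) C routine; this is an illustrative sketch, not the hand-tuned eCore kernel, with w[0..4] holding w1..w5:

```c
#include <stddef.h>

/* One application of the update formula at interior point (i,j) of a
 * row-major grid with nx points per row. */
float stencil_point(const float *t, size_t nx, size_t i, size_t j,
                    const float w[5])
{
    size_t c = i * nx + j;
    return w[0] * t[c + 1]     /* T_prev[i,j+1] */
         + w[1] * t[c]         /* T_prev[i,j]   */
         + w[2] * t[c - 1]     /* T_prev[i,j-1] */
         + w[3] * t[c + nx]    /* T_prev[i+1,j] */
         + w[4] * t[c - nx];   /* T_prev[i-1,j] */
}

/* Reference sweep over all interior points of an ny-by-nx grid. */
void stencil_sweep(const float *t_prev, float *t_new,
                   size_t ny, size_t nx, const float w[5])
{
    for (size_t i = 1; i + 1 < ny; i++)
        for (size_t j = 1; j + 1 < nx; j++)
            t_new[i * nx + j] = stencil_point(t_prev, nx, i, j, w);
}
```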

The stencil kernel is mapped to the Epiphany architecture using a 2-dimensional domain decomposition. The grid of temperatures is stored in a 1-dimensional array in row-major order and is distributed equally among all the eCores. The host transfers the corresponding grid portion to the local memory of each eCore directly, using the available API functions for data transfer. Once the grid is copied to local memory, each eCore computes the values of the current iteration for all the grid points assigned to it. This is followed by a communication phase.

Computation

Maximum floating-point performance on the Epiphany architecture can only be achieved using the FMADD (fused multiply-add) instruction, which effectively yields two Flops per cycle. This instruction multiplies two inputs from registers and accumulates the result into a third register, all in one instruction. It can be executed concurrently with certain other integer-unit instructions, such as loads and stores, in a super-scalar manner.


Communication

The computation is followed by a communication phase where the "edge" regions of each local grid are transferred to the "boundary" regions of each of the four neighbouring eCores. Thus each eCore receives data from each of its neighbours, as shown in Figure 3.4. As mentioned before, we maximise the use of registers by buffering rows of input data into registers and accumulating the results in registers before writing out the final result. We also enforce a design goal that each grid data point is loaded into a register just once. Using this strategy, data movement is optimised, enabling the result of the current iteration to be stored back in the same array without altering the algorithm's convergence properties. The transfers of boundary regions are hence started only after the neighbours have completed their computation phase. In each iteration, an eCore is synchronised with each of its four neighbouring eCores. Since the experiments in Section 3.1.1.1 showed that DMA is faster than direct writes for these transfer sizes, the transfers are performed using the DMA engine, which transfers data to each neighbour in a chain. 64-bit doubleword transfers are used for the top and bottom boundary rows, as they are stored in contiguous memory locations, while 32-bit single-word transfers are used for the left and right boundary columns. Listing 3.2 shows snippets of code illustrating how the communication and synchronisation are performed.

Figure 3.4: Stencil on Adapteva Epiphany: Communication of boundary data between neighbouring eCores - Each eCore synchronises with each of its four neighbouring eCores.

Listing 3.2: Adapteva Epiphany: Code for communication and synchronization between an eCore and its four neighbours

// Define DMA descriptors
start_descr0 = start_descr1 = 0x0000;
dma_config = E_DMA_ENABLE | E_DMA_MASTER;
config_row = dma_config | E_DMA_DWORD;
config_col = dma_config | E_DMA_WORD;

// BOTTOM
if (core_row != group_rows - 1) {
    dst_offset = 0;
    src_offset = (CORE_GRID_Y - 2) * CORE_GRID_X;
    e_dma_set_desc(E_DMA_0, config_row, start_descr0,
                   0x0008, 0x0008,
                   CORE_GRID_X >> 1, 0x0001,
                   0x0000, 0x0000,
                   (void *)(dptr + src_offset),
                   (void *)(t_neighbour[BOTTOM] + dst_offset), &dma_desc[3]);
    start_descr0 = &dma_desc[3];
}

// TOP
if (core_row != 0) {
    dst_offset = (CORE_GRID_Y - 1) * CORE_GRID_X;
    src_offset = CORE_GRID_X;
    if (start_descr0 != 0x0000) config_row |= E_DMA_CHAIN;
    e_dma_set_desc(E_DMA_0, config_row, start_descr0,
                   0x0008, 0x0008,
                   CORE_GRID_X >> 1, 0x0001,
                   0x0000, 0x0000,
                   (void *)(dptr + src_offset),
                   (void *)(t_neighbour[TOP] + dst_offset), &dma_desc[2]);
    start_descr0 = &dma_desc[2];
}

// RIGHT
if (core_col != (group_cols - 1)) {
    dst_offset = 0;
    src_offset = CORE_GRID_X - 2;
    e_dma_set_desc(E_DMA_1, config_col, start_descr1,
                   0x0000, 0x0000,
                   0x0001, CORE_GRID_X,
                   (CORE_GRID_X * sizeof(float)), (CORE_GRID_X * sizeof(float)),
                   (void *)(dptr + src_offset),
                   (void *)(t_neighbour[RIGHT] + dst_offset), &dma_desc[1]);
    start_descr1 = &dma_desc[1];
}

// LEFT
if (core_col != 0) {
    dst_offset = CORE_GRID_X - 1;
    src_offset = 1;
    if (start_descr1 != 0x0000) config_col |= E_DMA_CHAIN;
    e_dma_set_desc(E_DMA_1, config_col, start_descr1,
                   0x0000, 0x0000,
                   0x0001, CORE_GRID_X,
                   (CORE_GRID_X * sizeof(float)), (CORE_GRID_X * sizeof(float)),
                   (void *)(dptr + src_offset),
                   (void *)(t_neighbour[LEFT] + dst_offset), &dma_desc[0]);
    start_descr1 = &dma_desc[0];
}

// Synchronize and transfer data per iteration
iter++;
*(iter_neigh[LEFT])   = iter;
*(iter_neigh[RIGHT])  = iter;
*(iter_neigh[TOP])    = iter;
*(iter_neigh[BOTTOM]) = iter;
while (iter_array[TOP] < iter || iter_array[BOTTOM] < iter
       || iter_array[LEFT] < iter || iter_array[RIGHT] < iter);

// Start DMA
e_dma_start(start_descr0, E_DMA_0);
e_dma_start(start_descr1, E_DMA_1);
e_dma_wait(E_DMA_0);
e_dma_wait(E_DMA_1);
// End of DMA

*(t_iter_neigh[LEFT])   = iter;
*(t_iter_neigh[RIGHT])  = iter;
*(t_iter_neigh[TOP])    = iter;
*(t_iter_neigh[BOTTOM]) = iter;
while (t_iter_array[TOP] < iter || t_iter_array[BOTTOM] < iter
       || t_iter_array[LEFT] < iter || t_iter_array[RIGHT] < iter);

These steps are repeated for a number of iterations. After all iterations are completed, the host reads the corresponding portion of the computed grid from each eCore and writes out the final result.

Discussion

Our first implementation was written in C. However, the relatively immature compiler was only able to achieve a small fraction of peak performance (around 10%). This code was replaced with hand-tuned assembly code using the Epiphany instruction set. Grid sizes of 20 columns × X rows were used, where 20 was chosen based on the number of available registers and the latency of operations. Rows containing more than 20 elements are processed 20 at a time. The maximum number of rows, X, that can be processed on one eCore is determined by the number of elements per row and the available memory.

Experimentation showed that the register used for accumulating the result of the FMADD instruction cannot be used again as a floating-point unit (FPU) source or result register, or as the source of a store instruction, for at least 5 cycles, otherwise the execution pipeline stalls.

Attaining peak performance

To attain peak FPU performance for the 5-point stencil, FMADD instructions should comprise as high a proportion of the issued instructions as possible. Branching costs 3 cycles, with a further cycle or two for decrementing a counter register. Therefore, inner loops should be unrolled as much as possible within code memory size constraints.

We maximise the use of registers by buffering rows of input data into registers and accumulating the results in registers before writing out the final result. Our strategy is to buffer two rows of grid points while performing the five FMADD instructions per grid point. We use row lengths (stripes) of 20 points (a multiple of 5) and enforce a design goal that each grid data point is loaded into a register just once.


Five FMADD operations are performed on five consecutive T grid points, followed by five FMADDs on the respective L values. This is followed by the five C values, the five R values and finally the five B values. After completing a run of five grid points, the accumulated results need to be saved and the accumulators cleared. This takes 10 cycles.

To avoid stalling the FPU, we immediately start a second run of five grid points using a second set of five accumulators. This effectively double-buffers the result accumulators, using 10 registers in total (r8 - r12 for the first set, r15 - r19 for the other).

During the execution of these 5 x 5 FMADD instructions, we use the "spare" integer operation slots to replace the five Top (T) grid points with the five Bottom (B) grid points, while leaving the seven middle buffered values (five each of Left, Centre and Right, with overlap) alone. We also use these spare slots to save the accumulated results from the previous five grid points and to clear the next five accumulators.

Use of row stripes

As mentioned earlier, we use row stripes of 20 points. Buffering two rows of 20 data points, along with the "boundary" values at each end, requires a total of 44 registers.

Registers r20 - r41 are pre-loaded with grid data for the top boundary row (T) and registers r42 - r63 are pre-loaded with grid data for the middle row (L, C, R). As the FMADDs are performed on the five lots of T data buffered in the registers, and during the FMADDs of the L, C and R grid points, the T data in r20 - r41 is progressively replaced with the equivalent B data from the next row of grid data. These loads need to complete before the five final FMADDs on the B data are performed.

At the commencement of the next row, r20 - r41 now contain the middle data (L, C, R) and r42 - r63 contain the new T data. During the processing of the FMADDs for this row, r42 - r63 are progressively replaced with B data from the next row. At the completion of the second row, the registers are in the same order as at the start, that is, T data in r20 - r41 and L, C, R data in r42 - r63.

This constitutes one unrolled loop of 40 x 5 = 200 FMADD instructions and, ideally, the same number of cycles. The code for the loop is approximately 1300 bytes: 800 bytes for the 200 32-bit FMADD instructions and 480 bytes for the 120 32-bit integer instructions performing loads, stores and clears. There is also a 4 or 5 cycle loop penalty as a register is decremented and a conditional branch is made to the top of the loop.

Assembly code structure

Many attempts were made to implement the above operations in C. However, a number of issues were encountered. The main issue was that the C compiler was reluctant to allow all 64 registers (63 not including the Stack Pointer) to be used.


§3.1 Developing Applications for Adapteva’s Epiphany coprocessor 43

Hence there were a number of data move instructions in the resulting assembly code which blocked the dual-issuing of FPU and integer/data-movement instructions.

The main problem with writing the code in assembly language was allocation of the registers. Minor code changes could result in large rewrites of register usage, which inevitably makes the code prone to errors.

To avoid writing too much assembly code, two macros were written to perform each of the 5 × 5 FMADD runs. Calling them alternately while keeping the sequencing of register numbers correct greatly simplified the code. Each macro results in 25 FMADD instructions with 15 data movement instructions interleaved, for a total of 40 × 32-bit instructions executing in 25 clock cycles and performing 50 Flops. Stringing 4 pairs of these macros together results in 200 FMADDs, almost 1300 bytes of code and 400 Flops for two rows of the stripe. The decrement and branching at the end of a run of two rows of the stripe adds 4 or 5 clock cycles and so a 2 or 2.5% overhead over 200 clocks.

The main snippets of assembly code for implementing the high-performance stencil are given in Listing B.1.

Other problem sizes

Since the algorithm uses row stripes of 20, grids whose column sizes are multiples of 20 can be computed by processing multiple row stripes one after the other. The assembly code would need to be changed in order to process column sizes which are not multiples of 20. The overall grid size is limited by the available memory size on the eCores. Larger sizes can be processed by storing the grid in the shared memory and employing an algorithm which uses blocking of data.

Further optimisations

At the completion of each iteration, the boundary rows/columns of adjoining mesh nodes need to be updated with "edge" data from the new grid, while edge data from surrounding nodes needs to be copied to the boundary rows/columns of the local grid. To do this more efficiently for the "in-place" algorithm, the boundary rows and columns can be double-buffered. This would allow the transfer of boundary data to neighbouring mesh nodes to commence while those nodes may still be processing the current boundary data. Performance gains are likely to be modest, roughly the same as the difference between the results with and without communication discussed below.

3.1.2.1 Floating-Point Performance of Optimised Stencil

Here, we compare the floating-point performance of the stencil kernel for different configurations of grid sizes. The stencil is evaluated for 50 iterations. Using a row width of 20, as explained in Section 3.1.2, we run multiple stripes of 20 × X,


44 Developing Applications for Adapteva Epiphany and NVIDIA Tegra LPSoCs

where X is the number of rows, one after the other to test larger grid sizes. Three scenarios are considered: i) the performance on a single eCore as a function of grid size; ii) the performance using all 64 eCores when running the same problem on each eCore; iii) the performance when running one grid across all 64 eCores including communication of the boundary region between eCores.

[Figure: bar chart of single-core GFLOP/s against grid columns (20-80), grouped by 20, 40, 60 and 80 rows.]

Figure 3.5: Stencil kernel on Adapteva Epiphany: Single-core floating-point performance - Performance ranges from 0.97-1.14 GFLOP/s (81-95% of peak)

On a single eCore the performance ranges from 0.97-1.14 GFLOP/s, or between 81-95% of peak, as shown in Figure 3.5. For small sizes, grids with more rows than columns tend to perform slightly better than the same size grid with more columns than rows. This is due to the overhead involved in performing multiple stripes of computation when the column size is greater than 20 elements.

The performance of the code on all 64 eCores is shown in Figure 3.6. The darker colours show the performance of the stencil kernel including communication of the boundary region. The lighter colours at the top of each bar show the performance without communication of data.

As expected, when the computations are replicated across all 64 cores with no data communication, performance scales linearly, with a peak performance of 72.8 GFLOP/s for a stencil containing 80 rows and 20 columns as shown in Figure 3.6. When boundary data is transferred during each iteration, this performance drops to 63.6 GFLOP/s or 82.8% of peak. For correctness of the stencil computations, communication cannot be completely overlapped with computation, and this costs roughly 9 GFLOP/s. Due to the nature of 2D DMA block transfers, grids with more columns than rows show less performance drop than equivalent grids with more rows than columns.


[Figure: bar chart of aggregate 64-core GFLOP/s against grid columns (20-80), grouped by 20, 40, 60 and 80 rows, with and without boundary communication.]

Figure 3.6: Stencil kernel on Adapteva Epiphany: 64-core floating-point performance - Darker colours show performance of the stencil kernel including communication of the boundary region. Lighter colours at the top of each bar show performance without communication of data.

3.1.3 Developing GEMM kernel for Epiphany chip

There are several parallel algorithms for matrix multiplication (matmul or GEMM) on many-core systems. The algorithm used here is based on the approach by Yaniv Sapir [2012]. Our implementation operates at three levels:

• At the most basic level, matrix blocks that fit inside a single eCore's SRAM are multiplied. The requirement here is for a matrix multiply routine that is optimised for a single eCore both in terms of performance and memory usage.

• At the next level, if the matrices are too large to be stored on a single eCore, they are block distributed across the memory of multiple eCores. The algorithm proceeds by executing the kernel matrix multiply on each eCore for a given set of component blocks and then shuffling the blocks between eCores, repeating this process until the overall matrix multiplication is complete.

• At the top level, if the matrices are too large to be stored on the entire chip, a procedure analogous to the block-wise algorithm outlined in the previous step is used to orchestrate movement of portions of the different matrices between off-chip shared memory and distributed eCore memory.

Below we expand on each of the three levels. For the purpose of what follows we consider matrices A and B with dimensions (m × k) and (k × n) respectively that are multiplied together to form C with dimensions (m × n).


3.1.3.1 Tuned Single-core GEMM kernel

Here, matrices which fit inside a single eCore's SRAM (up to 32 × 32) are multiplied. As with the stencil code, initial attempts were made to develop the eCore matrix multiply kernel in C. This, however, achieved only 60% of peak performance. Therefore the innermost loop of the matrix multiply was replaced with hand-tuned assembly code.

The assembly code loads 4 elements of the first matrix (matrix A) into 4 registers (r11, r12, r14 and r15) at a time. In turn, each of these elements is multiplied with each element in the corresponding row of the second matrix (matrix B), with the intermediate results accumulated into 32 registers (r32-r63). In this process the rows of matrix B are loaded 8 elements at a time into registers r16-r23. Double-word loads are used, allowing these 8 elements to be loaded in 4 clock cycles.

A few elements of matrices A and B are pre-loaded; after each has been used, the next unprocessed element is loaded into the freed registers. This enables load instructions and FMADD instructions to be interleaved, although care must be taken to ensure there are at least 5 cycles between using the same register for a load and for a floating-point instruction in order to avoid stalling the execution pipeline.

Each row of matrix A is loaded only once. Each element in the row is multiplied with all the elements in the corresponding row of matrix B. For example, the first element in a row of matrix A is multiplied with all the elements in the first row of matrix B and the intermediate results are accumulated in the corresponding row of matrix C. The second element in a row of matrix A is multiplied with all the elements in the second row of matrix B, with the intermediate results being accumulated in the same row of matrix C. This means that for each row of matrix A, all the rows of matrix B need to be loaded from memory. Once all the elements in a row of matrix A are processed, the corresponding row of matrix C will have its final result. These values are now written out from the intermediate registers to memory using double-word store instructions and the registers are cleared for the next row of results.
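The row-oriented accumulation can be expressed in scalar C as follows. This is a sketch of the data flow only: the acc array stands in for registers r32-r63, and the interleaving of loads with FMADDs is not modelled.

    #include <assert.h>
    #include <stdio.h>

    #define NN 32

    /* Row-oriented matmul: each row of A is loaded once; element A[i][p]
       multiplies the whole row p of B, accumulating into row i of C.
       The row of C is held in "registers" (acc) until it is complete. */
    static void matmul_rows(float A[NN][NN], float B[NN][NN], float C[NN][NN])
    {
        for (int i = 0; i < NN; i++) {
            float acc[NN] = {0};              /* stands in for r32-r63       */
            for (int p = 0; p < NN; p++)      /* one element of A's row ...  */
                for (int j = 0; j < NN; j++)  /* ... times all of B's row p  */
                    acc[j] += A[i][p] * B[p][j];
            for (int j = 0; j < NN; j++)      /* write completed row of C    */
                C[i][j] = acc[j];
        }
    }

    int main(void)
    {
        static float A[NN][NN], B[NN][NN], C[NN][NN];
        /* A = 2*I, B[p][j] = j  =>  C[i][j] = 2*j for every row i */
        for (int i = 0; i < NN; i++) {
            A[i][i] = 2.0f;
            for (int j = 0; j < NN; j++) B[i][j] = (float)j;
        }
        matmul_rows(A, B, C);
        for (int i = 0; i < NN; i++)
            for (int j = 0; j < NN; j++)
                assert(C[i][j] == 2.0f * j);
        printf("32x32 row-oriented matmul OK\n");
        return 0;
    }

In the assembly kernel the inner j loop is the macro-expanded run of FMADDs, with B's row streamed through r16-r23 via double-word loads.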

Assembly code structure

As with the stencil, a macro was written to simplify the code. The macro is used to multiply an element of matrix A with all the elements in a row of matrix B. This involves 32 FMADD instructions and around 18 data movement instructions interleaved, resulting in 50 instructions executing 64 Flops in 32 cycles. For a 32 × 32 matmul, the macro is expanded 32 times for computing each row of product matrix C, resulting in around 6.5 KB of assembly code and 2048 Flops for each row of result. At the end of a row, the code loops around to compute another row of the result, incurring some overhead for the branch operation.

The main snippets of assembly code for implementing the high-performance matrix multiplication kernel for the Epiphany chip are given in Listing B.2.


Other problem sizes

The disadvantage of writing in assembly is that the code is not very flexible to changes to the sizes of the operand matrices, the 'm' dimension of matrix A being the only parameter which is configurable in the current code (as it is the loop count). The code is optimised for the case where the dimensions 'k' and 'n' are both equal to 32. To operate efficiently on different sizes of operand matrices, a few changes would need to be made to the assembly code, including the macros. Using this as a building block, larger matrix sizes can be operated on as described in Section 3.1.3.2.

Memory considerations

The operand matrices A and B, and the product matrix C are stored in the local memory of each eCore. Each eCore stores matrices of sizes up to 32 × 32, using a total of 12 kB for storing the three matrices. As the local SRAM in each eCore is organized as four banks of 8 kB, as described in Section 2.2.1.1, the matrices are placed in different data banks. The operand matrices A and B are stored in data bank 2 and the product matrix C is stored in the last data bank (bank 3). The entire code takes around 11 kB of storage and occupies the first data bank (bank 0) and portions of the second data bank (bank 1), with the stack being allocated in the bottom half of bank 1. The size of the code has to be kept in mind while allocating memory for the operand matrices. This is especially important for the multi-core GEMM kernel version as described below.

3.1.3.2 On-chip Multi-core GEMM kernel

Using the single-core version as a building block, we implement a multi-core version in order to operate on bigger matrices. With each eCore able to store operands of sizes 32 × 32, we can work on matrices of size 256 × 256 with all the data residing in the local memory of the 64 eCores.

Using capitals to refer to blocks of each matrix, expanding the matrix multiplication we obtain:

    C11 = A11 B11 + A12 B21 + A13 B31 + ...
    C12 = A11 B12 + A12 B22 + A13 B32 + ...
    ...
    C21 = A21 B11 + A22 B21 + A23 B31 + ...
    C22 = A21 B12 + A22 B22 + A23 B32 + ...
    ...                                            (3.1)

If each eCore is assigned a specific block of C, we can see from Equation 3.1 the blocks that are required by each eCore in order to complete the matrix product. In the implementation used here, for each matrix a row of blocks is mapped to a row of eCores. The multiplication proceeds using Cannon's algorithm, where blocks of A are progressively rotated around rows of eCores while blocks of B are rotated around columns of eCores. This process is illustrated in Figure 3.7.

Figure 3.7: Multicore matrix multiplication on Adapteva Epiphany: Assignment of blocks of A and B and data flow between eCores
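The block rotation of Cannon's algorithm can be sketched with scalar stand-ins for the 32 × 32 blocks. The initial skew is folded into the block assignment here; on the Epiphany each local product would be the single-core kernel and each rotation a transfer to the left/upper neighbouring eCore.

    #include <assert.h>
    #include <stdio.h>

    #define P 4   /* P x P mesh of cores; scalar "blocks" for brevity */

    /* Cannon's algorithm: after an initial skew, each step multiplies the
       locally held a and b, then rotates a left along rows and b up along
       columns.  After P steps every c[i][j] holds the full dot product. */
    static void cannon(double a0[P][P], double b0[P][P], double c[P][P])
    {
        double a[P][P], b[P][P];
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++) {
                a[i][j] = a0[i][(j + i) % P];   /* skew row i left by i */
                b[i][j] = b0[(i + j) % P][j];   /* skew col j up by j   */
                c[i][j] = 0.0;
            }
        for (int step = 0; step < P; step++) {
            double ta[P][P], tb[P][P];
            for (int i = 0; i < P; i++)
                for (int j = 0; j < P; j++) {
                    c[i][j] += a[i][j] * b[i][j];   /* local block product */
                    ta[i][j] = a[i][(j + 1) % P];   /* rotate a left       */
                    tb[i][j] = b[(i + 1) % P][j];   /* rotate b up         */
                }
            for (int i = 0; i < P; i++)
                for (int j = 0; j < P; j++) { a[i][j] = ta[i][j]; b[i][j] = tb[i][j]; }
        }
    }

    int main(void)
    {
        double a[P][P], b[P][P], c[P][P];
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++) {
                a[i][j] = i + 2 * j + 1;
                b[i][j] = 3 * i - j + 2;
            }
        cannon(a, b, c);
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++) {
                double ref = 0.0;
                for (int k = 0; k < P; k++) ref += a[i][k] * b[k][j];
                assert(c[i][j] == ref);
            }
        printf("Cannon rotation reproduces the matrix product\n");
        return 0;
    }
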

For block sizes less than 32 × 32, double buffering is used for each of the operand matrices A and B in order to overlap computation and communication, thereby improving performance. Once an eCore completes its block computation, it transfers its portion of the matrices A and B to the second buffers of the neighbouring eCores without waiting for their computation to finish.

For blocks of size 32 × 32 this is not possible. With each matrix requiring 4 KB of storage, storing the double buffers for the operand matrices and the product matrix C would require a total of 20 KB of storage. However, since the size of the entire code including assembly is just over 13 KB, this doesn't leave enough space for the double buffers and the stack. Hence an alternate "half-buffering" scheme was implemented.

In this scheme the matrix A is initially allocated in each eCore from 0x4000 to 0x4FFF and the matrix B from 0x5800 to 0x67FF (4 KB each), and the matrix C is allocated from 0x7000 to 0x7FFF. A buffer of 2 KB is allocated adjacent to each of these matrices, from 0x5000 to 0x57FF for matrix A and 0x6800 to 0x6FFF for matrix B. Once an eCore is ready to transmit its data, it starts transferring the lower 2 KB of the matrix A into the buffer for matrix A of the neighbouring eCore on the left side.


This is followed by a transfer of the lower 2 KB of matrix B to the buffer for matrix B of the neighbouring eCore above it, as shown in Figures 3.8 and 3.9.

Figure 3.8: Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix A - 1st iteration

Figure 3.9: Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix B - 1st iteration

Once all the eCores complete these transfers, they start transferring the upper halves of the matrices A and B, replacing the lower halves of the corresponding matrices of the neighbours. The pointers to these two matrices are also changed accordingly. In the following iteration, communication is performed in the reverse order, as illustrated in Figures 3.10 and 3.11. After changing the pointers to the two matrices again, the allocation of the matrices is identical to the initial one.
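The address map above can be captured as compile-time constants with its key invariants checked. This is a sketch; the 32 KB local SRAM extent follows from the four 8 kB banks described in Section 2.2.1.1.

    #include <assert.h>
    #include <stdio.h>

    /* Local-memory layout for the 32x32 half-buffering scheme
       (addresses from the text; sizes in bytes). */
    enum {
        A_BASE = 0x4000, A_SIZE = 0x1000,    /* matrix A, 4 KB   */
        A_HBUF = 0x5000, HBUF_SIZE = 0x800,  /* 2 KB half-buffer */
        B_BASE = 0x5800, B_SIZE = 0x1000,    /* matrix B, 4 KB   */
        B_HBUF = 0x6800,
        C_BASE = 0x7000, C_SIZE = 0x1000     /* matrix C, 4 KB   */
    };

    int main(void)
    {
        /* the regions tile data memory contiguously, without overlap */
        assert(A_BASE + A_SIZE == A_HBUF);
        assert(A_HBUF + HBUF_SIZE == B_BASE);
        assert(B_BASE + B_SIZE == B_HBUF);
        assert(B_HBUF + HBUF_SIZE == C_BASE);
        /* each half-transfer moves exactly half a 32x32 float matrix */
        assert(HBUF_SIZE == 32 * 32 * sizeof(float) / 2);
        /* the layout ends exactly at the top of the 32 KB local SRAM */
        assert(C_BASE + C_SIZE == 0x8000);
        printf("half-buffering memory map is consistent\n");
        return 0;
    }
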


Figure 3.10: Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix A - 2nd iteration

Figure 3.11: Multicore matrix multiplication on Adapteva Epiphany: Transfer of Matrix B - 2nd iteration


3.1.3.3 Off-chip GEMM kernel

For square matrices larger than 256 × 256 there is insufficient memory to perform on-chip matrix multiplication, and it becomes necessary to page blocks of the matrices from off-chip shared memory. Here we exploit an algorithm analogous to that used to move blocks of matrices A and B between eCores for the on-chip case. Namely, blocks of the product matrix C are computed in turn by paging in blocks of A and B from shared memory. Thus, in the 512 × 512 case, completing one 256 × 256 block of C requires two 256 × 256 blocks of both A and B to be read from shared memory.
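The paging structure can be sketched as a standard blocked GEMM, with the block multiply standing in for the on-chip multi-core kernel and plain array indexing standing in for the DMA paging (the block sizes are tiny, illustrative values):

    #include <assert.h>
    #include <stdio.h>

    #define NB 2   /* 2x2 grid of blocks, e.g. 512x512 as 256x256 blocks */
    #define BS 3   /* tiny block size so the sketch runs quickly          */

    /* Stand-in for the on-chip multi-core kernel: C += A * B on one block. */
    static void multiply_block(double a[BS][BS], double b[BS][BS],
                               double c[BS][BS])
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                for (int k = 0; k < BS; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* Each block C[I][J] is completed in turn by "paging in" the blocks
       A[I][K] and B[K][J] and accumulating their product. */
    static void gemm_offchip(double A[NB][NB][BS][BS],
                             double B[NB][NB][BS][BS],
                             double C[NB][NB][BS][BS])
    {
        for (int I = 0; I < NB; I++)
            for (int J = 0; J < NB; J++)
                for (int K = 0; K < NB; K++)
                    multiply_block(A[I][K], B[K][J], C[I][J]);
    }

    int main(void)
    {
        static double A[NB][NB][BS][BS], B[NB][NB][BS][BS], C[NB][NB][BS][BS];
        int n = NB * BS;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                A[i / BS][j / BS][i % BS][j % BS] = i + j;
                B[i / BS][j / BS][i % BS][j % BS] = i - j;
            }
        gemm_offchip(A, B, C);
        for (int i = 0; i < n; i++)          /* compare with the direct product */
            for (int j = 0; j < n; j++) {
                double ref = 0.0;
                for (int k = 0; k < n; k++) ref += (i + k) * (double)(k - j);
                assert(C[i / BS][j / BS][i % BS][j % BS] == ref);
            }
        printf("blocked off-chip GEMM matches the direct product\n");
        return 0;
    }
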

3.1.3.4 Floating-Point Performance of GEMM kernel

Here, we compare the floating-point performance of the matrix multiplication kernel as a function of the sizes of the operand matrices.

Single-core Floating-Point Performance

The results for a single eCore are shown in Table 3.2. The maximum size of matrices that are multiplied is 32 × 32, as mentioned earlier. On a single eCore the performance ranges from 0.85-1.15 GFLOP/s, or between 70-96% of peak.

    Dimensions   GFLOP/s   % Peak
    8 × 8        0.85      70.5
    16 × 16      1.07      89.5
    20 × 20      1.11      92.5
    24 × 24      1.12      93.4
    32 × 32      1.15      95.9

Table 3.2: Matrix multiplication on Adapteva Epiphany: Single-core floating-point performance - Peak of ≈ 96% achieved.

On-chip Multi-core Floating-Point Performance

Table 3.3 shows the floating-point performance of the on-chip multi-core version which was implemented as detailed in Section 3.1.3.2. For grid sizes which are able to fit in the local memory of the chip (up to 256 × 256), the performance is around 85%, including the data communication between pairs of eCores. (This does not include the time taken to transfer the initial operand matrices from the external shared memory to the chip.) The table shows the per-core dimensions of the product matrix C and the number of eCores used to perform the multiplication. With a per-core matrix size of 32 × 32, the overall matrix dimensions would be 64 × 64 when running on 2 × 2 eCores, 128 × 128 on 4 × 4 eCores and 256 × 256 on 8 × 8 eCores.


    Matrix C       4 × 4 eCores         8 × 8 eCores
    (per-core)     GFLOP/s   % Peak     GFLOP/s   % Peak
    8 × 8          5.1       26         20.3      26
    16 × 16        12.8      67         51.4      67
    20 × 20        14.4      75         57.6      75
    24 × 24        15.4      80         62.2      81
    32 × 32        16.3      85         65.3      85

Table 3.3: Matrix multiplication on Adapteva Epiphany: Multi-core on-chip floating-point performance - Peak of 85% achieved.

The on-chip matrix multiplication of two 256 × 256 matrices can be broken down into the computation of a 32 × 32 matrix product by each eCore and the transfer of the two operand matrices A and B, totalling 8 kB, to the neighbouring eCores in each iteration. Considering 1.15 GFLOP/s for the matrix product by a single eCore (from Table 3.2) and a 2 GB/s transfer rate between eCores (from results in Section 3.1.1.1), the maximum theoretical performance can be estimated to be roughly 68 GFLOP/s. From Table 3.3, the performance achieved by the code is around 65 GFLOP/s, which is very close to the estimate.
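The estimate can be reproduced directly from the numbers quoted above (a back-of-the-envelope model only):

    #include <assert.h>
    #include <stdio.h>

    /* Per iteration each eCore computes one 32x32 block product and
       ships 8 kB of operand data to its neighbours. */
    int main(void)
    {
        double flops  = 2.0 * 32 * 32 * 32;    /* 65536 flops per block */
        double t_comp = flops / 1.15e9;        /* at 1.15 GFLOP/s       */
        double t_comm = 8.0 * 1024 / 2.0e9;    /* 8 kB at 2 GB/s        */
        double agg    = 64.0 * flops / (t_comp + t_comm) / 1e9;

        printf("estimated aggregate: %.1f GFLOP/s\n", agg);
        assert(agg > 68.0 && agg < 69.5);      /* ~68 GFLOP/s */
        return 0;
    }
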

Off-chip Multi-core Floating-Point Performance

The performance drops for sizes larger than 256 × 256 due to the need for multiple transfers of blocks to and from the shared memory as the algorithm progresses, as discussed earlier. The results are shown in Table 3.4. A per-core matrix size of 32 × 32 is used to perform the multiplication of large matrices of sizes 512 × 512 and 1024 × 1024. To build the result for the large matrix size 1536 × 1536, a per-core size of 24 × 24 is used and hence the overall performance in GFLOP/s is a bit worse than in the other two cases. In all the cases, the off-chip memory transfer dominates the overall performance, with around 86-90% of the total time being spent on the block DMA transfers in and out of shared memory and 10-13% of the total time being spent on the computation.

    Matrix C       GFLOP/s   % Peak   % Computation   % Memory Transfers
    512 × 512      8.3       10.8     12.8            87.2
    1024 × 1024    8.6       11.1     13.1            86.9
    1536 × 1536    6.3       8.2      10.9            89.1

Table 3.4: Matrix multiplication on Adapteva Epiphany: Off-chip floating-point performance for larger matrices - Peak of ≈ 11% achieved.


We can use the results from the micro-benchmarks to create an analytical model for estimating the performance of the kernel. To analyse the performance of the off-chip matrix multiplication, we consider the multiplication of two 512 × 512 matrices. Each matrix can be considered as consisting of four blocks of 256 × 256 elements. Each iteration of the outermost loop in the algorithm involves transferring one block of matrix A and one block of matrix B from the shared memory to the chip and having all 64 eCores perform a parallel multiplication to produce an intermediate block result. The transfer of two blocks of 256 × 256 elements (512 KB) takes around 3.5 milliseconds at 150 MB/s (from results in Section 3.1.1.1). The computation of the block matrix product takes around 0.51 milliseconds at 65.32 GFLOP/s (from Table 3.3). Thus the ratio of computation to off-chip transfers is roughly 1:6.5. From the result in Table 3.4, the ratio of computation to off-chip data transfer is 1:6.8, which is very close to the estimate.
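The same numbers give the computation-to-transfer ratio directly (again a back-of-the-envelope model):

    #include <assert.h>
    #include <stdio.h>

    /* Per outer iteration of the 512x512 case: two 256x256 blocks
       (512 KB) are paged in at ~150 MB/s and one 256x256 block product
       is computed at ~65.3 GFLOP/s. */
    int main(void)
    {
        double t_xfer = 512.0 * 1024 / 150.0e6;          /* ~3.5 ms  */
        double t_comp = 2.0 * 256 * 256 * 256 / 65.3e9;  /* ~0.51 ms */
        double ratio  = t_xfer / t_comp;

        printf("transfer %.2f ms, compute %.2f ms, ratio 1:%.1f\n",
               t_xfer * 1e3, t_comp * 1e3, ratio);
        assert(ratio > 6.0 && ratio < 7.2);  /* close to the measured 1:6.8 */
        return 0;
    }
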

3.1.4 Feasibility of Work Partitioning between Host CPU and Epiphany

As seen in the results in the previous sections, the Epiphany co-processor is able to achieve high performance for problem sizes which fit in its local scratchpad memory. In order to operate on larger problem sizes, data has to be fetched from the 32 MB shared memory region, which is shared with the host CPU. However, due to the low shared memory bandwidth, this causes a massive degradation in overall performance (up to 90%) as shown in Table 3.4.

It is also noted that the host dual-core ARM Cortex-A9 CPU (running at 667 MHz) is capable of four single-precision (SP) FLOPs per cycle per core (using NEON SIMD instructions), for a peak performance of only 5.336 GFLOP/s (2 × 4 × 0.667) compared to Epiphany's peak of 76.8 GFLOP/s. While it is theoretically possible to use both the host CPU and the Epiphany co-processor for simultaneous execution, there is little performance benefit. Rather, the need to access shared memory to orchestrate data movement between the CPU and the Epiphany chip leads to a degradation in performance, as explained in Section 3.1.3.4. The same is seen in other works [Richie et al., 2015]. Moreover, a dedicated CPU thread is required to orchestrate this movement, which adds to the complexity.

Therefore, considering the significant complexities with little added benefit, it was considered not worth developing partitioned kernels for this architecture. Rather, the available versions of the Epiphany chip are best operated as a coprocessor to a more general-purpose host CPU, where an application code would typically perform all of its initialization and outer loops on the host CPU and offload the innermost numerically intensive loops as kernels to the Epiphany eCore nodes.


3.2 Developing Applications for NVIDIA Tegra SoCs

The Tegra SoCs feature the CPU and GPU on the same chip, which share all the available physical memory. This avoids the need for data movement and replication of data. Compared to the Epiphany, this makes these systems better suited to developing partitioned kernels which are capable of utilising both the CPU and the GPU for simultaneous computation of the same problem.

Available and supported programming models are used to parallelise the code for each device. For the CPU kernel, OpenMP is used for parallelisation, while for the GPU, CUDA is used. This section describes the development of partitioned Stencil and GEMM kernels for the Tegra SoCs.

3.2.1 Developing Partitioned Stencil kernel for Tegra SoCs

As described earlier, a 5-point star-shaped stencil computation is implemented on a 2D rectangular grid. The grid is stored row-wise in the main memory, which is accessible to both the CPU and GPU for simultaneous execution.

The focus here is to develop kernels for the CPU and the GPU in such a way that the computation can be seamlessly partitioned between the two devices. The best strategy would be to use freely available libraries containing kernels optimised for the CPU and GPU respectively. Since no suitable freely available libraries were found, custom kernels were developed using supported programming models for the CPU and GPU, with the design goal of partitioning the computation easily between the two devices.

A naive implementation of the 2D 5-point stencil, which uses Jacobi iteration, is chosen for simplicity. The overall strategy is to develop kernels for the CPU and GPU which perform the computations on the input grid for one iteration. The grid is partitioned into sub-domains which are computed by the GPU and the CPU simultaneously in each iteration. The calculation is not done in place, which means the algorithm alternates the source and target arrays after each iteration.

CPU kernel

A version of the kernel was implemented for the CPU. Parallelisation is achieved by dividing up the allocated sub-domain between all the CPU cores using OpenMP.

Compiler auto-vectorization options used for this kernel include -mfpu=neon -ftree-vectorize. Further attempts to improve performance, such as manual loop unrolling and loop tiling, did not yield any benefits and were discarded in favour of the optimisations applied by the chosen compiler options.
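A minimal OpenMP version of such a CPU kernel might look as follows. This is a sketch: the function name matches the one used in Listing 3.3, but the equal 1/5 weighting of the centre and its four neighbours is an illustrative assumption, not necessarily the coefficients used in the thesis code.

    #include <assert.h>
    #include <stdio.h>

    typedef float real;   /* "real" as in Listing 3.3; float assumed */

    /* One Jacobi sweep over an n_y-row, n_x-column sub-domain
       (interior points only); OpenMP divides the rows between cores. */
    static void kernel_compute_cpu(const real *t_old, real *t_new,
                                   int n_x, int n_y)
    {
        #pragma omp parallel for
        for (int i = 1; i < n_y - 1; i++)
            for (int j = 1; j < n_x - 1; j++)
                t_new[i * n_x + j] = 0.2f *
                    (t_old[(i - 1) * n_x + j] + t_old[(i + 1) * n_x + j] +
                     t_old[i * n_x + j - 1] + t_old[i * n_x + j] +
                     t_old[i * n_x + j + 1]);
    }

    int main(void)
    {
        enum { NX = 8, NY = 6 };
        real t_old[NX * NY], t_new[NX * NY];
        for (int i = 0; i < NX * NY; i++) t_old[i] = t_new[i] = 1.0f;
        kernel_compute_cpu(t_old, t_new, NX, NY);
        for (int i = 1; i < NY - 1; i++)  /* a uniform field is a fixed point */
            for (int j = 1; j < NX - 1; j++)
                assert(t_new[i * NX + j] > 0.999f && t_new[i * NX + j] < 1.001f);
        printf("CPU stencil kernel OK\n");
        return 0;
    }

Without -fopenmp the pragma is simply ignored and the sweep runs serially, which keeps the kernel usable on a single core.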


GPU kernel

A GPU version of the kernel was implemented. Blocks and threads are deployed following a 2D layout to map the decomposition of the computational domain to achieve parallelisation. Each thread within a block is responsible for the computation associated with a single grid point on each iteration. Adjacent blocks share data placed on boundaries.

Each thread reads its input element from global memory into the block's local shared memory prior to the actual computation. After the threads in the block work together to read the necessary values into the block's local shared memory, the computations are performed on these values. At the end of the computation, the output result is written back into global memory. This reduces the number of accesses to global memory, since the computation of each point requires reading five grid values.

We evaluated the performance effect of having each thread compute multiple grid points. However, there was no observable difference in performance. So for simplicity, each thread is responsible for the computation of only one grid point.

Work partitioning

As described before, stencil computation can be parallelised via domain decomposition, with data communication primarily between adjacent domains. The rectangular grid may be divided along one or two dimensions. For simplicity we choose to partition in one dimension, row-wise.

Here the grid is divided into two parts, where the computation associated with one part containing a number of rows (and all the columns) is allocated to the CPU and the computation associated with the other part is allocated to the GPU, as shown in Figure 3.12. Thus each of the devices can simultaneously compute a portion of the final result.

Since the computation associated with each element depends on the values of the neighbouring points from the previous iteration (timestep), the boundary values need to be communicated to each sub-domain before each sub-domain can proceed to the next iteration. Therefore, for correctness, this necessitates a synchronization between the CPU and GPU after the computation of the respective sub-domains for each iteration.

For the Tegra SoCs, since the whole grid is accessible to both the CPU and the GPU, there is no need for explicit data transfers between the CPU and the GPU. Instead, synchronization ensures that the computations for the particular iteration for the respective sub-domains have completed.

Since the calculations are not done in place, two arrays are initialized to store the grid values. The values of the previous iteration are stored in t_old while the computed values from the current iteration are stored in t_new. At the end of the iteration, the array pointers are swapped. Thus in the following iteration the correct


Figure 3.12: Implementing stencil kernel on Tegra SoCs: Partitioning of stencil grid between CPU and GPU - The hollow dots represent the boundary region of the full grid. The dots highlighted in red represent the boundary region of the two partitions which are shared between the CPU and GPU.

values from the previous iteration are used for computation.

The computation is partitioned based on an input parameter cpu_ratio. Considering a stencil grid of dimensions n_x columns × n_y rows, cpu_n_y rows are allocated to the CPU for computation (where cpu_n_y = cpu_ratio × n_y) and gpu_n_y rows are allocated to the GPU for computation (where gpu_n_y = n_y − cpu_n_y).
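The split can be sketched as follows; the truncation of cpu_ratio × n_y to an integer row count is an assumption, since the thesis does not state the rounding rule.

    #include <assert.h>
    #include <stdio.h>

    /* Row-wise split of an n_y-row grid according to cpu_ratio:
       cpu_n_y = cpu_ratio * n_y, gpu_n_y = n_y - cpu_n_y. */
    static void partition_rows(int n_y, double cpu_ratio,
                               int *cpu_n_y, int *gpu_n_y)
    {
        *cpu_n_y = (int)(cpu_ratio * n_y);   /* truncation assumed */
        *gpu_n_y = n_y - *cpu_n_y;
    }

    int main(void)
    {
        int cpu_n_y, gpu_n_y;
        partition_rows(1024, 0.25, &cpu_n_y, &gpu_n_y);
        assert(cpu_n_y == 256 && gpu_n_y == 768);
        partition_rows(1024, 0.0, &cpu_n_y, &gpu_n_y);   /* GPU-only run */
        assert(cpu_n_y == 0 && gpu_n_y == 1024);
        printf("every row is assigned to exactly one device\n");
        return 0;
    }
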

A piece of code illustrating the main components of the split stencil computation is shown in Listing 3.3.

Listing 3.3: Sample code for partitioned stencil computation on the Tegra systems

    //Allocate unified memory regions for the grid
    cudaMallocManaged((void**)&t_old, sizeof(real) * n_x * n_y);
    cudaMallocManaged((void**)&t_new, sizeof(real) * n_x * n_y);

    //Initialize the grid
    initialize_grid(n_x, n_y);

    iteration = 0;

    do {
        iteration++;

        //Start GPU kernel
        kernel_compute_gpu(t_old, t_new, n_x, gpu_n_y);

        //Unprotect unified memory regions for concurrent execution
        mprotect(t_old, sizeof(real) * n_x * n_y, PROT_READ | PROT_WRITE);
        mprotect(t_new, sizeof(real) * n_x * n_y, PROT_READ | PROT_WRITE);

        //Start CPU kernel
        kernel_compute_cpu(t_old+((gpu_n_y-1)*n_x), t_new+((gpu_n_y-1)*n_x),
                           n_x, cpu_n_y);

        //Wait for GPU to complete
        cudaDeviceSynchronize();

        //Swap array pointers
        real *temp = t_new; t_new = t_old; t_old = temp;

    } while (iteration < max_iteration);

Concurrent execution by CPU and GPU

Since the CPU and the GPU share the same physical memory, implementing concurrent execution by the CPU and the GPU should be possible using “unified memory”. However, this is not supported on the Tegra systems, as explained in Section 2.2.2. On the TK1 and TX1 systems, this can be circumvented by issuing mprotect() immediately after the GPU kernel is started, as shown in Listing 3.3. Thus, at the end of each iteration, there is no need for an explicit transfer of boundary data on these systems.

However, the same approach for concurrent access to a unified memory region by the CPU and GPU does not work on the Xavier system. To circumvent this problem, the sub-domains are stored in separate memory buffers which are passed to the respective kernels. At the end of each iteration, once computation is completed for the respective sub-domains, the boundary regions of the two sub-domains are exchanged.

3.2.1.1 Performance Results and Analysis

Three LPSoC platforms were used for the following experiments: the Tegra LPSoCs TK1, TX1 and Xavier, summarised in Table A.1.

Three experiments were performed to evaluate the performance of the application on each of the Tegra systems. In the first two experiments, the stencil kernel was run only on the CPU and only on the GPU respectively. In the third experiment, the partitioned stencil kernel was run while varying the fraction of work given to the CPU. In each experiment, the number of columns and rows of the grid is equal and the stencil kernel is executed for 50 iterations. The mean of 5 samples is reported for all experiments, and performance measurements have a margin of error, at a confidence level of 95%, of less than 3%.

The performance results for the CPU-only and GPU-only kernels are shown in figures 3.13, 3.14 and 3.15 for the TK1, TX1 and Xavier systems respectively. Overall, on all three systems the GPU is observed to be much faster than the CPU for this application kernel. The exception is the double-precision stencil on the Xavier system, where GPU and CPU performance are comparable.

The CPU performance is seen to generally decrease as the problem size is increased on the TK1 and TX1. For small sizes, which fit in the L1 cache, performance



Figure 3.13: Stencil kernel on TK1: Performance of CPU-only and GPU-only kernels with varying problem size (GFLOP/s vs. nx = ny). (a) Double Precision; (b) Single Precision.

Figure 3.14: Stencil kernel on TX1: Performance of CPU-only and GPU-only kernels with varying problem size (GFLOP/s vs. nx = ny). (a) Double Precision; (b) Single Precision.


§3.2 Developing Applications for NVIDIA Tegra SoCs 59

is improved. However, as the grid size is increased, the multiple memory accesses for each point hamper performance. The computation is memory-bound, as computing each grid point involves nine floating point operations and up to four memory accesses. From equation 3.1.2, the RHS elements T^prev_{i−1,j} and T^prev_{i,j} can be obtained from the L1 cache since they were loaded in previous iterations. T^prev_{i,j+1}

must be loaded from memory since it was not used before within the sweep. Whether T^prev_{i+1,j} and T^prev_{i,j−1} cause cache misses at a particular cache level depends on whether three successive rows of the array fit into that cache, which in turn depends on the problem size [Datta et al., 2009; Stengel et al., 2015]. In addition, there may be an extra access due to the write-allocate transfer on every store miss. Consequently, CPU performance is limited by the available memory bandwidth, which is less than the bandwidth available to the GPU, and also by contention for memory access between all the cores [Cavicchioli et al., 2018]. On the Xavier, the CPU benefits from the increased cache size: performance increases initially as the grid size is increased from 512 × 512 and quickly reaches a steady state where it is limited by memory bandwidth (the theoretical peak bandwidth for each of the experimental hardware platforms is given in Table 2.1). Single-precision performance is generally between 1.3 and 1.7 times faster than double-precision performance on all three systems.

Figure 3.15: Stencil kernel on Xavier: Performance of CPU-only and GPU-only kernels with varying problem size (GFLOP/s vs. nx = ny). (a) Double Precision; (b) Single Precision.

The GPU performance is seen to generally increase as the problem size is increased from a small size (512 × 512), as the kernel maps well to the parallel nature of the problem.



After a point, it becomes limited by memory bandwidth and performance reaches a steady state. However, the absolute performance is low compared to the peak performance of the GPU. For the TK1 and TX1, single-precision performance is around 1.7 times faster than double-precision performance, while on the Xavier, single-precision performance is around 3 times faster than double-precision.

The performance results for the partitioned stencil computation are shown in figures 3.16, 3.17 and 3.18 for the TK1, TX1 and Xavier systems respectively. The fraction of work given to the CPU is varied in increments of 5%.

Figure 3.16: Stencil kernel on TK1: Performance of partitioned kernel as fraction of work given to CPU is varied (GFLOP/s). Five problem sizes (nx = ny). (a) Double Precision; (b) Single Precision.

For very small sizes, the CPU kernel is seen to outperform the GPU kernel on all the systems, as the overhead of launching the GPU kernel becomes a factor. For larger sizes, the GPU outperforms the CPU for both single precision and double precision on all the systems. It is evident that the best strategy for improving performance for this kernel is to assign most of the work to the GPU and less to the CPU.

For the double-precision stencil on the Xavier system, the performance of the CPU and the GPU for larger grid sizes is more evenly matched than on the other systems, and as a result assigning around 40% of the workload to the CPU maximises performance on this system. For the other two systems, this ratio is closer to 20–25%. Under these conditions the load is balanced well between the CPU and the GPU.

The ratio of work given to the CPU which yields the best performance is seen to vary with problem size because the shape of the partitioned sub-domains influences



Figure 3.17: Stencil kernel on TX1: Performance of partitioned kernel as fraction of work given to CPU is varied (GFLOP/s). Five problem sizes (nx = ny). (a) Double Precision; (b) Single Precision.

Figure 3.18: Stencil kernel on Xavier: Performance of partitioned kernel as fraction of work given to CPU is varied (GFLOP/s). Five problem sizes (nx = ny). (a) Double Precision; (b) Single Precision.



performance. In figures 3.16, 3.17 and 3.18 it is observed that as more work is given to the CPU, performance reduces gradually up to a point at high CPU ratios (close to 100%). When the CPU ratio is increased further to 100%, an increase in performance is observed. This is because, when only a small amount of work is given to the GPU, the overhead of launching the GPU kernel results in lower performance than giving all the work to the CPU. This is more prominent for smaller problem sizes, as the CPU kernel outperforms the GPU kernel at these sizes. Overall, in all cases there is a performance benefit from partitioning the workload between the CPU and the GPU.

3.2.1.2 Critique

Due to the nature of the problem, there is dependence between sub-domains, while the computations within a sub-domain during an iteration can proceed independently. At the end of each iteration, each sub-domain needs the computed boundary values of neighbouring sub-domains in order to proceed to the next iteration. This necessitates synchronization between sub-domains at the end of each iteration.

The overall partitioning strategy requires that each device completes the computation for its allocated sub-domain in each iteration and then synchronizes. This means that the CPU and GPU need to synchronize at the end of each iteration, which is the main performance bottleneck. However, it is necessary for correctness of the results.

On the TK1 and TX1 systems, the GPU kernel's performance is further hampered by the number of blocks which can execute simultaneously. There is only 1 SM in the TK1 GPU and 2 SMs in the TX1 GPU, while the Xavier GPU has 8 SMs. With a block size of 32 × 32 threads and a maximum of 2048 threads per SM, only 2 blocks can execute simultaneously on the TK1 while 4 blocks can execute simultaneously on the TX1.

When the grid size is increased, the number of blocks required also increases, which impacts performance, especially since synchronization is required across all blocks at the end of each iteration. A further optimisation would be for adjacent blocks to synchronize among themselves when their respective computations finish. However, the focus here is on partitioning execution between the CPU and GPU, hence a device-level synchronization is necessary.

3.2.2 Developing Partitioned GEMM kernel for Tegra SoCs

(This kernel was initially developed as part of a collaborative effort [Mitra et al., 2016]. The present author was primarily responsible for the implementation on the Xavier LPSoC as well as the evaluation of the kernel on all the different platforms.)

Similar to the stencil, the focus here is to develop kernels for the CPU and the GPU in such a way that the computation can be seamlessly partitioned between the two devices. Since GEMM is a critical component of many applications, optimised



implementations are available in many freely available libraries. Therefore we use, to our knowledge, the best SGEMM and DGEMM implementations available for the CPU and GPU.

CPU kernel

We experimentally found that the BLIS library [Smith et al., 2014] produced the best performance on the ARM CPU architectures of our SoCs. It outperformed ATLAS [Whaley and Dongarra, 1998], a popular library that provides optimised BLAS routines for different CPU architectures.

Appropriate configurations were used to build the BLIS library for each of the systems: cortex-a15 on the TK1 and armv8a on the TX1 and Xavier systems. The best parallel performance over four cores was obtained by splitting the jr loop into four parts using the BLIS configuration BLIS_JR_NT = 4 [Smith et al., 2014].
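Assuming BLIS's standard configure-based build, the setup might look as follows; whether BLIS_JR_NT is applied at build time or through the environment depends on the BLIS version, so treat this as a sketch rather than the thesis's exact build recipe:

```shell
# Build BLIS with the per-system configuration target.
./configure cortex-a15   # TK1; use 'armv8a' for the TX1 and Xavier
make -j4

# Parallelise the jr loop across the four cores.
export BLIS_JR_NT=4
```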

GPU kernel

For the GPU, NVIDIA's vendor-supplied CuBLAS library [NVIDIA, 2008] was used. This is a highly optimised implementation of BLAS for NVIDIA's GPU architectures, built on top of the NVIDIA CUDA runtime, and it can achieve near-peak performance for BLAS routines. CuBLAS 6.5 was supported on the TK1 system, version 7 was available for the TX1, and the Xavier supports version 10.

Work partitioning

As described in Section 2.3.1, matrix multiplication can easily be subdivided into parts that can be computed in parallel.

The matrix product C = A × B, where A, B and C have dimensions m × k, k × n and m × n respectively, can be partitioned by columns in the manner

[C1 C2 . . . Cp] = A × [B1 B2 . . . Bp].

The product then consists of p independent matrix products Ci = A × Bi. Both input matrices are stored in column-major format here.

The computation is partitioned based on an input parameter cpu_ratio. The matrix B is divided column-wise so that n_cpu columns are allocated to the CPU for computation (where n_cpu = cpu_ratio ∗ n) and n_gpu columns of matrix B are allocated to the GPU for computation (where n_gpu = n − n_cpu).

In our implementation, the CPU thread issues work to the GPU asynchronously and then calls the routine that performs the CPU portion of the computation. Once the CPU's work is completed, it blocks until the GPU kernel has finished execution. The main components of our partitioned GEMM computation are shown in Listing 3.4.



Listing 3.4: Sample code for partitioned GEMM computation on the Tegra systems

//Allocate unified memory regions for the matrices
cudaMallocManaged((void**)&A, sizeof(real) * m * k);
cudaMallocManaged((void**)&B, sizeof(real) * k * n);
cudaMallocManaged((void**)&C, sizeof(real) * m * n);

//Start GPU kernel
start_gpu_kernel(A, B, C, m, k, n_gpu);

//Unprotect unified memory regions for concurrent execution
mprotect(A, sizeof(real) * m * k, PROT_READ | PROT_WRITE);
mprotect(B, sizeof(real) * k * n, PROT_READ | PROT_WRITE);
mprotect(C, sizeof(real) * m * n, PROT_READ | PROT_WRITE);

//Start CPU kernel
start_cpu_kernel(A, B + n_gpu*k, C + n_gpu*m, m, k, n_cpu);

//Wait for GPU to finish computation
cudaDeviceSynchronize();

On the TK1 and TX1, the appropriate regions of unified memory are unprotected using mprotect() after the GPU kernel is invoked, as shown in Listing 3.4.

Since this is not yet possible on the Xavier system, the implementation instead makes duplicate copies of all matrices so that the CPU and GPU work on separate regions of memory. At the end of the computation, the portion of the result computed by the GPU is copied to the corresponding position in the output matrix using a memcpy() call, since the CPU and GPU share the same physical memory.

3.2.2.1 Performance Results and Analysis

Similar to the stencil, experiments were performed to evaluate the performance of the CPU-only, GPU-only and partitioned versions of the kernel.

The ARM Cortex-A15 CPU on the TK1 can perform two double-precision (DP) FLOPs per cycle per core and therefore has a peak of 4.6 GFLOP/s (at 2.3 GHz) per core and 18.4 GFLOP/s for four cores [Rajovic et al., 2014]. The 64-bit ARM Cortex-A57 CPU on the TX1 can perform up to four DP FLOPs per cycle when using NEON [Dolbeau, 2015] instructions and therefore has a peak of 7.6 GFLOP/s (at 1.9 GHz) per core and 30.4 GFLOP/s for four cores.

The TK1's single-precision (SP) performance is higher when NEON instructions are utilised and is approximately four times the DP peak, i.e. 73.6 GFLOP/s. The TX1's A57 can do up to eight SP FLOPs per cycle [Dolbeau, 2015] and therefore has a peak of 15.2 GFLOP/s per core and 60.8 GFLOP/s for four cores. Few details have been disclosed so far about the performance capabilities of the Xavier's Carmel CPU.

The theoretical peak performance achievable by the TK1 GPU in SP and DP is 365 and 15 GFLOP/s respectively [NVIDIA, d]. An analysis of the behaviour of SGEMM on Kepler GPUs [Lai and Seznec, 2013] concluded that the upper bound compared to



peak performance of such a computation is between 57.6% and 82.5%. The theoretical peaks of the TX1 GPU in SP and DP are 512 and 16 GFLOP/s respectively. Due to a different dual-issuing mechanism in this architecture (compared to the Kepler GPU), it has been demonstrated that SGEMM computations can achieve at least 98% of peak performance [Grauer-Gray, 2014]. The Xavier's GPU is capable of 1.4 TFLOP/s in SP.

Figures 3.19, 3.20 and 3.21 show the performance results of SGEMM and DGEMM for the TK1, TX1 and Xavier systems respectively. For the TK1 and TX1, the CPU hits its peak performance at a problem size of around m = k = n = 2048 for both SP and DP. For the Xavier, the CPU hits its peak performance at m = k = n = 8192 for SP and at m = k = n = 4096 for DP.

Figure 3.19: GEMM on TK1: Performance of CPU-only and GPU-only kernels with varying problem size (GFLOP/s vs. m = k = n). (a) DGEMM; (b) SGEMM.

For the TK1 and TX1, the GPU performance peaks at m = k = n = 2048 for both SP and DP but worsens as matrix sizes are increased further. This likely highlights challenges in the way CuBLAS's internal routines divide work on these systems. On the Xavier, SP GPU performance generally increases with problem size and eventually peaks at m = k = n = 8192, while DP GPU performance peaks at around m = k = n = 1024 and remains steady as the problem size is increased further.

The TK1 CPU achieves a maximum performance of 13 GFLOP/s (71% of peak) for DGEMM and 32 GFLOP/s (43% of peak) for SGEMM, while the TK1 GPU is observed to achieve a maximum performance of 12 GFLOP/s (80% of peak) for DGEMM and



Figure 3.20: GEMM on TX1: Performance of CPU-only and GPU-only kernels with varying problem size (GFLOP/s vs. m = k = n). (a) DGEMM; (b) SGEMM.

Figure 3.21: GEMM on Xavier: Performance of CPU-only and GPU-only kernels with varying problem size (GFLOP/s vs. m = k = n). (a) DGEMM; (b) SGEMM.



220 GFLOP/s (61% of peak) for SGEMM. While the observed GPU performance is within the expected range of 57.6–82.5% of peak, the CPU performance relative to peak is better for DGEMM than for SGEMM.

The TX1 CPU achieves a maximum performance of 16 GFLOP/s (53% of peak) for DGEMM and 35 GFLOP/s (58% of peak) for SGEMM, while the TX1 GPU is observed to achieve 13 GFLOP/s (81% of peak) for DGEMM and 410 GFLOP/s (80% of peak) for SGEMM. Given the improved DP capabilities of the A57 compared to the A15, the performance achieved was only marginally better. This can be attributed to the relative immaturity of performance-tuned libraries available for aarch64 and the ARM Cortex-A57 compared to armhf and the ARM Cortex-A15. The TX1 GPU achieved nearly twice the SP performance of the TK1 GPU but declined marginally with respect to DP performance.

The Xavier's CPU achieves a maximum performance of 54 GFLOP/s for DGEMM and 161 GFLOP/s for SGEMM, while the GPU achieves a maximum performance of 28 GFLOP/s for DGEMM and 914 GFLOP/s (65% of peak) for SGEMM. The Xavier's CPU is close to four times faster than the TX1's CPU for DP and around five times faster for SP. The Xavier's GPU is just over two times faster than the TX1's GPU for both SP and DP.

Figures 3.22, 3.23 and 3.24 show the performance of the partitioned GEMM kernels as the fraction of work given to the CPU is varied for the TK1, TX1 and Xavier systems respectively. For SGEMM, the GPU far outperforms the CPU in all cases and there is very little benefit in giving work to the CPU. In fact, in many cases allocating even a small amount of work to the CPU results in load imbalance, and the GPU ends up waiting for the CPU's section to finish. This reduces the overall performance compared to running everything on the GPU.

DGEMM is more interesting, as the GPU and CPU performances are more closely matched and allocating around 50–65% of the work to the CPU results in an improvement in performance on the different systems. On the TK1, a maximum performance of 25 GFLOP/s is observed for matrix size m = k = n = 2048 when 50% of the work is allocated to the CPU, while on the TX1, for the same size, allocating 55% of the work to the CPU results in a maximum performance of 27 GFLOP/s. On the Xavier, a maximum performance of 74 GFLOP/s is observed for matrix size m = k = n = 4096 when 65% of the work is allocated to the CPU. These observed maxima are just below the sum of the performance obtained on each device in isolation for each system.

3.3 Summary

This chapter presented the work undertaken to develop application kernels for the Epiphany coprocessor and the Tegra SoCs. Due to the lack of optimised routines for the Epiphany coprocessor, highly optimised hand-tuned routines for stencil and GEMM were developed, which operate at around 85% of peak performance for on-chip



Figure 3.22: GEMM on TK1: Performance of partitioned kernel as fraction of work given to CPU is varied (GFLOP/s). Six problem sizes (m = k = n). (a) DGEMM; (b) SGEMM.

Figure 3.23: GEMM on TX1: Performance of partitioned kernel as fraction of work given to CPU is varied (GFLOP/s). Six problem sizes (m = k = n). (a) DGEMM; (b) SGEMM.



Figure 3.24: GEMM on Xavier: Performance of partitioned kernel as fraction of work given to CPU is varied (GFLOP/s). Six problem sizes (m = k = n). (a) DGEMM; (b) SGEMM.

computation. Although recent developments enable the use of OpenMP [Agathos and Papadogiannakis, 2015], MPI [Richie et al., 2015] and OpenCL [Richie and B, 2016] for writing parallel programs for the Epiphany chip, strategies such as those described in this chapter are needed to optimise absolute performance, which requires considerable effort from the programmer. As such, our contributed kernels remain the best-performing applications for this platform. A number of limitations were identified for this chip. Due to slow off-chip memory transfer rates, performance drops drastically for workloads which do not fit in the on-chip memory. This, coupled with the lack of native double-precision support, unfortunately renders the Epiphany-IV coprocessor mostly ineffective for realistic scientific workloads.

Strategies to partition work between the CPU and GPU in a heterogeneous platform such as NVIDIA's Tegra SoCs were also presented. Partitioned versions of the stencil and GEMM kernels were implemented with the overall aim of splitting the workload between the CPU and GPU, and the implementations were evaluated on all the test platforms. The methodology is general: the particular kernel versions implemented for the CPU and GPU can be replaced with any other optimised versions, where available. Results show that using both the CPU and GPU (or accelerators) simultaneously for a computation leads to a performance-optimal work partition where the load is balanced perfectly; this partition depends on the relative performance of the different processing elements.



The following chapters look at the energy aspects of partitioning a workload between the different compute devices on a chip and explore how energy to solution can be minimised for an application. Due to the limitations of the Epiphany chip, it was ascertained that this system is not suitable as an experimental heterogeneous platform for studying the effect of workload partitioning. Hence the focus for the rest of the thesis is on the NVIDIA Tegra SoCs.


Chapter 4

Measuring and Modelling Energy Usage

This chapter explores the energy efficiency aspects of computation on heterogeneous systems such as LPSoCs, investigating in particular how an application can be partitioned to execute on the different processing elements of a heterogeneous system in such a way that its energy to solution is minimised. In order to compute the energy efficiency of an application executing on hardware, the ability to measure the energy consumed by the device is the first step. However, most LPSoCs available in the market lack internal energy measurement capabilities. Therefore, we developed an energy measurement framework which is capable of making fine-grained, high-resolution energy measurements for LPSoC systems. The hardware design for this framework was developed as a collaborative effort [Mitra et al., 2016], while the software environment was primarily developed by the present author.

Partitioning work appropriately between the CPU and accelerators for simultaneous execution generally leads to higher performance, as shown in the previous chapter. However, a performance-optimal work partition may not always be energy-efficient, as this depends on a number of system characteristics such as the relative power draw and computational rates of the different devices. An application could run entirely on one of the devices or be partitioned across different devices; in some cases it may be more energy-efficient to execute the entire application on one device, in others to split it across the devices on the chip. Therefore it is important to investigate the energy-efficiency impact of partitioning work across different components and to understand which configuration gives the most benefit in terms of energy. An energy usage model was developed which can reasonably describe the energy consumed by an application that is partitioned to execute between the different components of a heterogeneous system. The energy model was developed as joint work [Mitra et al., 2016]. The present author was primarily responsible for the experimental evaluation of the model on the various hardware platforms.

This chapter is organised as follows: Section 4.1 describes the design and development of our custom energy measurement environment, including the hardware and software components. Section 4.2 presents the design of our simple energy usage model, which can be used to predict the optimal strategy for partitioning a workload between the different components of a heterogeneous system in order to minimise energy. Section 4.3 presents the evaluation of the energy usage model for the three example applications described before - Stencil, GEMM and the multizone block tridiagonal solver - on the LPSoC platforms. Section 4.4 presents a brief critique and the limitations of the model.

Portions of the work described in this chapter have been published in the following papers:

• Gaurav Mitra, Andrew Haigh, Anish Varghese, Luke Angove, Alistair P. Rendell. Split Wisely: When work partitioning is energy-optimal on heterogeneous hardware, The 18th IEEE International Conference on High Performance Computing and Communications (HPCC) [Mitra et al., 2016]. In this collaboration, the present author was primarily responsible for the development of the software environment for real-time energy measurement and the experimental evaluation of the energy usage model on the various hardware platforms.

• Anish Varghese, Joshua Milthorpe, Alistair P. Rendell. Performance and Energy Analysis of Scientific Workloads Executing on LPSoCs, The Power and Energy Aspects of Computing Workshop, International Conference on Parallel Processing and Applied Mathematics (PEAC workshop, PPAM) [Varghese et al., 2017].

4.1 Developing an Energy Measurement Environment for LPSoC systems

As mentioned earlier in Section 2.4.3, very few LPSoC systems offer features which enable measurement of their energy consumption. On systems which do offer these capabilities using internal sensors, only low-resolution measurements (1 second or more between samples) are effectively possible, since sampling at a higher rate interferes with the computation and causes the power consumption to increase further. Therefore, a custom energy measurement apparatus which enables non-intrusive fine-grained measurement of energy for such low-power SoC systems was developed. The methodology allows the measurement to have a resolution similar to that of code profiling tools, i.e. of the order of a few milliseconds.

Having the ability to obtain energy measurements for a running application at function level is crucial to understanding its energy efficiency. Therefore, a simple measurement API, which provides start and stop function calls to measure the energy of a section of code and returns the energy consumed, was developed for this measurement framework.


4.1.1 Hardware Design of Energy Measurement System

We follow techniques similar to prior work [Cao, 2014; Calore et al., 2015] to measure the DC current drawn by an LPSoC system. The current drawn by the LPSoC system is measured using a high-precision ammeter, the µCurrent Gold [Jones, 2010], placed in series, and an mbed LPC1768 micro-controller with a 0 - 3.3 V (12-bit) analog-to-digital converter (ADC). The µCurrent Gold is a current-to-voltage converter that uses a high-precision, low-value resistor and a precision two-stage amplifier. The µCurrent is used in the 1 mV/mA setting and has a precision of ±0.1%. The ADC has a resolution of 0.81±0.40 mV, which corresponds to a resolution of 0.81±0.40 mA. In terms of power, this translates to a resolution of 9.7±4.8 mW at 12 V or 15.3±7.6 mW at 19 V.
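The resolution figures above follow directly from the component specifications. The sketch below reproduces this arithmetic; it is illustrative only and not part of the thesis software, with the datasheet values taken from the text.

```python
# Measurement-resolution arithmetic (illustrative; values from the text).
ADC_RANGE_V = 3.3      # mbed LPC1768 ADC input range
ADC_BITS = 12          # 12-bit ADC -> 4096 levels
SENSE_V_PER_A = 1.0    # uCurrent Gold in its 1 mV/mA (= 1 V/A) setting

lsb_v = ADC_RANGE_V / 2**ADC_BITS   # one ADC step, in volts (~0.81 mV)
lsb_a = lsb_v / SENSE_V_PER_A       # equivalent current step, in amps

def power_resolution_w(supply_v: float) -> float:
    """Smallest distinguishable power step at a given DC supply voltage."""
    return lsb_a * supply_v

print(f"ADC step: {lsb_v * 1000:.2f} mV -> {lsb_a * 1000:.2f} mA")
print(f"Power resolution at 12 V: {power_resolution_w(12) * 1000:.1f} mW")
print(f"Power resolution at 19 V: {power_resolution_w(19) * 1000:.1f} mW")
```

This recovers the 0.81 mV, 9.7 mW (12 V) and 15.3 mW (19 V) figures quoted above; the ±0.40 mV term is the half-LSB quantisation uncertainty.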

The ADC is connected across the voltage output pins of the µCurrent Gold. The setup of the energy measurement environment is shown in Figure 4.1. The figure shows the measured device sending a start signal to the measuring device (via an external device) when the application reaches the section of code being measured, and a stop signal at the end of the section. The instantaneous measured current is sent via a serial link to an external recording computer at a rate of one sample every 10 ms (100 Hz). Although in theory there could be spikes between readings that are not captured, at this sampling rate it is highly unlikely that such spikes recur between readings often enough to bias the total. These readings are numerically integrated using the midpoint rule to calculate the energy consumed by the measured section of code. This allows us to measure and return the energy consumed by the application at function level at runtime.
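The numerical integration step can be sketched as follows. This is an illustrative reimplementation, not the framework's actual code; it assumes a fixed supply voltage and a constant 10 ms sampling interval, treating each current sample as the midpoint value of its interval.

```python
def energy_joules(current_samples_a, supply_v, dt_s=0.010):
    """Integrate instantaneous current samples into energy (midpoint rule).

    Each sample is treated as the midpoint value of its sampling interval,
    so the energy is simply sum(I_k * V * dt) over all samples.
    """
    return sum(i * supply_v * dt_s for i in current_samples_a)

# A device drawing a steady 1.5 A at 12 V for 1 s (100 samples at 100 Hz)
# consumes 1.5 A * 12 V * 1 s = 18 J.
samples = [1.5] * 100
print(energy_joules(samples, supply_v=12.0))  # -> approximately 18.0
```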

For verification, we also compared the measurements from this measurement system with those from an external power meter which displays the instantaneous power draw on its console, and the values were found to match closely. The µCurrent Gold is relatively inexpensive at 70 USD, while the mbed micro-controller costs 55 USD, bringing the total cost of our custom measurement framework to 125 USD.

Figure 4.1: Custom energy measurement framework using µCurrent Gold

An unmodified µCurrent Gold has a maximum output of 1.25 V, which means it is able to measure only up to 1.25 A. This sets the upper limit of its measurable power range to 15 W for systems such as the TK1, whose DC input is rated at 12 V, and around 24 W for systems such as the TX1 and Xavier, whose input is rated at 19 V. Preliminary experimentation with an external ammeter showed that this was not sufficient for some of the test platforms: the TK1 was observed to consume up to 20 W (1.67 A) while the Xavier was observed to consume up to 30 W (1.58 A) under heavy load. Therefore some alterations were made to the µCurrent in order to increase its measurable range.

From the schematics of the µCurrent Gold [Jones, 2010], it was determined that the upper limit of current measurement is a result of the power source of the µCurrent, which is a 3 V battery. To increase this limit, we replaced this internal battery with a 5 V battery pack. A floating 0 V (or virtual ground) is used in the circuit, with the virtual ground half-way between the battery's 0 V and +V. This effectively halves the maximum output of the battery. Therefore, with the new battery pack, measured at 4.8 V, the limit is increased from 1.25 V to 2.4 V.

Since only positive DC values are measured, we added a resistor to the µCurrent to lower the circuit's virtual ground. A second resistor was added in parallel with the resistor between the battery's 0 V terminal and the centre of the voltage divider, effectively lowering the virtual ground. These modifications increased the positive range of measurements from 2.4 V to approximately 4.0 V (4.0 A). However, since the ADC's limit is 3.3 V, the measurable range is now 0 - 3.3 A.

For systems such as the Jetson TK1, with DC input rated at 12 V, the power measurement range becomes 0 - 39.6 W (12 V × 3.3 A), while for systems such as the Jetson TX1 and Jetson Xavier, with input rated at 19 V, the range becomes 0 - 62.7 W (19 V × 3.3 A). Experimentation has shown this is more than enough for these systems even under heavy load. The overall schematic of the energy measurement setup is shown in Appendix C.
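The measurable-range figures can be checked against the observed worst-case draws. The sketch below is illustrative only; the 3.3 A ceiling comes from the ADC's 3.3 V input limit combined with the µCurrent's 1 mV/mA setting.

```python
# Arithmetic behind the measurable-range figures (illustrative only).
ADC_LIMIT_V = 3.3        # the mbed ADC clips at 3.3 V
SENSE_V_PER_A = 1.0      # 1 mV/mA = 1 V/A

max_current_a = ADC_LIMIT_V / SENSE_V_PER_A   # 3.3 A after the modifications

def max_power_w(supply_v: float) -> float:
    """Upper limit of the measurable power range at a given supply voltage."""
    return max_current_a * supply_v

# Observed worst-case draws from the text: TK1 up to 20 W, Xavier up to 30 W.
for name, supply_v, observed_w in [("TK1", 12, 20), ("Xavier", 19, 30)]:
    print(f"{name}: 0 - {max_power_w(supply_v):.1f} W "
          f"(observed up to {observed_w} W)")
```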

4.1.2 Software Environment for Energy Measurement System

In order to allow real-time measurement of energy at runtime, a measurement API was developed. This was done primarily by the present author and involved writing more than 1500 lines of code. It was designed to allow measurement of the energy for a section of code running on a particular hardware platform: the application under consideration sends a measurement start signal to indicate the start of the section and a stop signal at the end of the section. The energy consumption measured for this section is then returned to the application at runtime.

To ensure that the measurements are non-intrusive for the measured device, the measurements from the ADC are sent to an external recording computer, as mentioned earlier. In order to facilitate easy measurement of different LPSoC systems, a software environment was developed which runs as a service on the same external recording computer, which is connected to the same local network as the measured device.


This service performs the following functions: i) receive measurement start/stop signals from the measured device; ii) relay the start/stop signals to the measuring device (the mbed microcontroller); iii) collect the measured current from the ADC and record the data; iv) calculate the energy consumed and return it to the measured device.

The client-side functions that need to be called by the application are provided in a header file, which must be included in the application. The sequence of steps required to measure the energy consumption of an application running on a hardware platform is as follows:

• During the initialization phase, the application calls the client-side function power_record_init(). This sets up a socket connection to the measurement service on the external recording computer. The measurement service sets up the necessary buffers and gets ready to collect new measurement data.

• When the section of code to be measured is reached, the application calls the function power_record_start(). The client sends a single byte (the character 'r') to the measurement service using the connection which was set up in the initialization phase. The service in turn sends a signal to the mbed microcontroller over a serial link to start measuring the current. A single byte is then returned to the client as an acknowledgement to ensure synchronization. The measured section of code now starts to execute on the measured device. A separate thread on the server side collects the measured current from the ADC via serial IO.

• When the end of the section is reached, the application calls the function power_record_stop(). The client sends a single byte (the character 's') to the measurement service to indicate the end of measurement. The service in turn sends the stop signal to the microcontroller. The measurements received are numerically integrated using the midpoint rule to calculate the energy consumed by the measured section of code. The calculated energy, in J, is returned to the client side. The measured data is also written to a file on the external recording computer for future reference.

• The client-side function power_record_cleanup() is called before the application exits. This disconnects the client from the measurement service and cleans up resources on the server side.

The delay introduced by the network connection between the measured device and the external recording computer is the latency of the single-byte transfers, which is of the order of 100 µs. This is negligible compared to the 10 ms measurement interval.
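The client-service handshake described above can be sketched as follows. This is a hypothetical Python mirror of the C client API, not the thesis implementation; the single-byte 'r'/'s' signals and the acknowledgement byte follow the protocol in the text, while the acknowledgement character ('a') and the wire format of the returned energy value (an 8-byte network-order double) are assumptions made for illustration.

```python
import socket
import struct

def power_record_start(sock: socket.socket) -> None:
    """Signal the measurement service to start measuring and wait for its ack."""
    sock.sendall(b"r")          # single-byte start signal
    ack = sock.recv(1)          # ack byte ensures synchronization
    if ack != b"a":
        raise RuntimeError("measurement service did not acknowledge start")

def power_record_stop(sock: socket.socket) -> float:
    """Signal the service to stop measuring; return the section's energy in J."""
    sock.sendall(b"s")          # single-byte stop signal
    data = sock.recv(8)         # energy for the section (assumed 8-byte double)
    (energy_j,) = struct.unpack("!d", data)
    return energy_j
```

In use, the socket would be the connection established by power_record_init(); wrapping a code section between these two calls returns its measured energy at runtime.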

The advantage of this setup is that when a different system is to be measured, it just needs to be added to the same network and the µCurrent placed in series with its DC power source. All the control signals and data can then simply be transferred through the network connection rather than setting up the serial ports.

The measurement service also provides the ability for an external program to stream real-time measurement data in order to visually chart the real-time power consumption of a system. A sample of the real-time power measurement captured from the Jetson Xavier system is shown in Figure 4.2. The spikes in power correspond to the times when the system is under load.

Figure 4.2: Visualizing real-time power consumption of the Jetson Xavier board. The spikes in power correspond to the times when the system is under load.

The Jetson Xavier development kit provides the ability to measure energy consumption using internal sensors by reading values from files in the /sys filesystem. However, according to the thermal design documentation for this system [NVIDIA, 2018a], reading the samples too frequently is not recommended, since it utilises CPU resources, causes the power consumption to increase and interferes with the computation. The recommendation is to sample the values at intervals of 1 second or more, leading to very low-resolution, intrusive measurements. Therefore, we use our custom energy measurement framework to measure power consumption for this system as well.

4.2 Design of Energy Usage Model

Results from experiments in the previous chapter showed that partitioning the work appropriately between the CPU and accelerators for simultaneous execution leads to higher performance. However, a performance-optimal work partition may not always be energy-efficient, as this depends on a number of system characteristics.


Applications could run either entirely on one of the devices or be partitioned across different devices. The performance and energy consumed by a running application are observed to vary according to how the application is partitioned to execute between the different devices; this depends on factors such as the relative computational rates of the devices and their active and idle power usage. It may be more energy-efficient to execute the entire application on one of the devices in some cases, or to split it across the devices on the chip in other cases. Therefore it is important to investigate the energy-efficiency impact of partitioning work across different components and to understand the configuration which gives the most benefit in terms of energy. Making such a decision dynamically while the code is running, based on real-time measurement of energy, is a further non-trivial challenge which will be addressed in the following chapter.

Detailed energy models exist which are designed for modelling the performance and energy usage of an algorithm on a processor, as described in Section 2.4.4. However, they do not address execution in a heterogeneous environment. Previous work by Ge et al. [2014] provides an analytical performance and energy model which captures the performance and energy impact of distributing computation between different processing elements. We develop a simple model which is similar to this work and extend it to use the measurable power states of LPSoC systems, in order to reasonably describe the energy consumed by a running application based on how it is partitioned to execute between the different heterogeneous devices on the system.

The parameters for the model are measured system and application characteristics, such as the computational rate of each device and the power drawn by each device when executing the application and when idle. Using this data, the model aims to predict the best way to partition the workload between the devices so that the energy consumed by the application is minimised.

The following assumptions were made for the model:

• For simplicity, only a host CPU and an accelerator (GPU) are considered.

• Frequency scaling is disabled and each compute device operates at maximum frequency.

• The version of the application on each device fully utilises all the available processing cores and is optimised for that compute device.

• The overhead of partitioning the work between different devices is negligible.

• The cost of data movement is not considered in the model. If the system has physical memory shared between the devices which can be accessed simultaneously from both, there is no extra cost for data movement between them.

Consider two compute devices, say a CPU and a GPU, denoted by c and g respectively. The computational rate of the CPU when the application is executing exclusively on it is denoted as R_c (GFlop/s) and the power draw of the CPU in this situation is denoted as P_c^a (W). The computational rate of the GPU when the application is executing exclusively on it is denoted as R_g (GFlop/s) and the power draw of the GPU in this situation is denoted as P_g^a (W). The power draws when the devices are idle are denoted as P_c^i and P_g^i (W).

The total computational cost of executing the application under consideration is labelled N and the fraction of the work given to the CPU is denoted f. The time spent by the CPU to execute the work given to it is denoted T_ac (s), while the time spent by the GPU to execute the work given to it is denoted T_ag (s). The total time to solution is denoted T_s (s), where T_s = max[T_ac, T_ag].

The energy consumed by the CPU executing the fraction of work allocated to it is represented by Equation 4.1 and the energy consumed by the GPU executing its fraction is represented by Equation 4.2:

    E_c = (P_c^a - P_c^i) T_ac + P_c^i T_s    (4.1)

    E_g = (P_g^a - P_g^i) T_ag + P_g^i T_s    (4.2)

where the time spent by each device is represented by

    T_ac = N f / R_c    (4.3)

    T_ag = N (1 - f) / R_g    (4.4)

The total time to solution T_s is represented by

    T_s = max[ N f / R_c , N (1 - f) / R_g ]    (4.5)

Then the total energy consumed (in J) can be represented by Equation 4.6. The first term in square brackets on the right-hand side represents the energy consumed by the CPU while the second represents the energy consumed by the GPU:

    E(f) = [ (P_c^a - P_c^i) N f / R_c + P_c^i max[ N f / R_c , N (1 - f) / R_g ] ]
         + [ (P_g^a - P_g^i) N (1 - f) / R_g + P_g^i max[ N f / R_c , N (1 - f) / R_g ] ]    (4.6)

Since for many LPSoCs the individual power draws of the different devices cannot be isolated, we recast the equation in terms of measurable power states. We denote the power usage when both the CPU and the GPU are active as P_acg (W), and that when both are idle as P_icg (W). The power usage when the CPU is active and the GPU is idle is denoted as P_acig (W), while that when the CPU is idle and the GPU is active is denoted as P_icag (W). That is,

    P_acg = P_c^a + P_g^a    (4.7)

    P_icg = P_c^i + P_g^i    (4.8)

    P_acig = P_c^a + P_g^i    (4.9)

    P_icag = P_c^i + P_g^a    (4.10)

Then,

    P_c^a - P_c^i = P_acig - P_icg    (4.11)

    P_g^a - P_g^i = P_icag - P_icg    (4.12)

Therefore Equation 4.6 in terms of measurable power states is as follows:

    E(f) = [ (P_acig - P_icg) N f / R_c ] + [ (P_icag - P_icg) N (1 - f) / R_g ]
         + P_icg max[ N f / R_c , N (1 - f) / R_g ]    (4.13)

The energy-optimal split ratio is the value of f that minimises the total energy consumed. The minimum occurs at f = 0, at f = 1, or at an interior point f = f*. That is,

    f* = arg min_{f ∈ [0,1]} ( [ (P_acig - P_icg) N f / R_c ] + [ (P_icag - P_icg) N (1 - f) / R_g ]
         + P_icg max[ N f / R_c , N (1 - f) / R_g ] )    (4.14)

Thus the model is able to predict the energy-optimal work partition for an application executing across the different devices of an LPSoC system based on the following input parameters:

• R_c: Performance of the application executing entirely on the CPU

• R_g: Performance of the application executing entirely on the GPU

• P_acig: Power draw when the application is executing entirely on the CPU and the GPU is idle

• P_icag: Power draw when the application is executing entirely on the GPU and the CPU is idle

• P_icg: Power draw when both the CPU and GPU are idle

• N: Computational cost of the application
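Equations 4.13 and 4.14 translate directly into code from these parameters. The sketch below is an illustrative reimplementation, not the thesis software; it finds f* by evaluating E(f) over a discrete grid of split ratios, analogous to the 5% increments used in the experiments that follow.

```python
def energy(f, N, R_c, R_g, P_acig, P_icag, P_icg):
    """Total energy E(f) in joules (Equation 4.13)."""
    t_c = N * f / R_c          # CPU time for its fraction of the work
    t_g = N * (1 - f) / R_g    # GPU time for its fraction
    t_s = max(t_c, t_g)        # total time to solution
    return ((P_acig - P_icg) * t_c
            + (P_icag - P_icg) * t_g
            + P_icg * t_s)

def optimal_split(N, R_c, R_g, P_acig, P_icag, P_icg, step=0.05):
    """Energy-optimal CPU fraction f* (Equation 4.14), by grid search."""
    candidates = [i * step for i in range(int(1 / step) + 1)]
    return min(candidates,
               key=lambda f: energy(f, N, R_c, R_g, P_acig, P_icag, P_icg))

# Hypothetical parameters: a GPU four times faster than the CPU and with a
# modest active-power premium, so running GPU-only minimises energy.
print(optimal_split(100, 5, 20, 10, 12, 5))  # -> 0.0
```

With symmetric devices (equal rates and active powers) the same search returns f* = 0.5, since splitting evenly halves the time during which idle power is paid.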

4.3 Experimental Evaluation of the Energy Usage Model

We evaluated the accuracy of our energy model using three example applications - Stencil, GEMM and NPB (MZ) BT - which are partitioned to execute on both the CPU and GPU of the different test platforms. The metrics evaluated are the error in the predicted energy for each split ratio and the error in the predicted optimal split ratio. How the predicted split ratio varies with different problem sizes and applications is also shown.

The mean of 5 samples is reported for all experiments. The measurements are reported with a margin of error, at a confidence level of 95%, of less than 3% for performance, and for energy and power in most cases. In a handful of cases, the margin of error for energy is seen to be as high as 5%.
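The margin of error for such small samples can be sketched as below. This is an assumed methodology for illustration (a two-sided Student-t interval for the mean of 5 samples); the thesis does not spell out its exact procedure, and the sample values used are hypothetical.

```python
import statistics

def margin_of_error_pct(samples, t_crit=2.776):
    """Relative 95% margin of error for a small-sample mean.

    t_crit = 2.776 is the two-sided 95% Student-t critical value for
    n - 1 = 4 degrees of freedom (5 samples).
    """
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # std. error of mean
    return 100 * t_crit * sem / mean

# Five hypothetical energy readings (J) for one configuration:
print(margin_of_error_pct([61.2, 60.8, 61.5, 61.0, 60.9]))
```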

4.3.1 Hardware Setup for Experiments

Three LPSoC platforms and two Intel-based platforms, summarised in Table A.1, were used for the following experiments. The LPSoC systems are the Tegra K1, Tegra X1 and Tegra Xavier systems described in Section 2.1.

Two conventional Intel-based HPC systems with attached accelerators were also used for comparison. The first system contains dual-socket Xeon E5-2665 Sandy Bridge processors and a discrete NVIDIA Tesla K20m card. The second system contains dual-socket Xeon E5-2620 v3 Haswell processors and a discrete NVIDIA Tesla K80 card, which houses two GK210 GPUs. As these GPUs are effectively separate devices with physically separate memory, we use only one GPU in our experiments for a fair comparison. Power measurements are also reported for one GPU on this system.

4.3.2 Evaluation of Measured Energy vs Modelled Energy

Experiments were conducted to evaluate the model's accuracy in predicting the energy consumed by a running application based on how it is partitioned to execute between the CPU and the GPU. The evaluation covers the stencil, GEMM and multizone block tridiagonal solver applications on each test platform. For each application, the problem size is fixed and the fraction of work (split ratio) given to the CPU is varied from 0% to 100% in discrete increments of 5%. Energy measurements are taken for each of these work partitions, and the measured energy-to-solution values are compared to the values predicted by the model.

For each application, the model's input parameters R_c and P_acig are obtained by measuring the performance and power usage when all the work is given to the CPU (split ratio of 100%). Similarly, the input parameters R_g and P_icag are obtained by measuring the performance and power usage when all the work is given to the GPU (split ratio of 0%). P_icg is obtained by measuring the power usage when the device is completely idle. The CPU and accelerator frequencies were set to maximum, using the performance frequency-scaling governor for the CPU and the maximum frequency setting for the accelerator.


4.3.2.1 Stencil

The single-precision and double-precision partitioned stencil kernels were executed on the three Tegra systems. The problem size was fixed at nx = ny = 8192 for 50 iterations. Figure 4.3 shows the performance and energy consumed on the TK1 system, Figure 4.4 shows the results on the TX1 system and Figure 4.5 shows the results on the Xavier system.

The solid blue lines in the figures show the measured performance in GFLOP/s for each split ratio. The energy values in joules for each of these executions were measured using our framework described in Section 4.1 and are represented by red dotted lines. The energy usage values predicted using our energy usage model for each split ratio are shown using green dashed lines.

Figure 4.3: TK1: Partitioned Stencil - nx=ny=8192. The top half shows measured and modelled energy-to-solution for the double-precision stencil on the TK1, while the bottom half shows measured and modelled energy-to-solution for the single-precision stencil. Energy is minimised when all work is given to the GPU in both cases.

On all three systems, for both single precision and double precision, the energy model predicts that energy is minimised when all the work is allocated to the GPU, i.e. a split ratio of 0%. The measured energy values indicate the same. Overall, the measured energy values are slightly larger than the energy model's predicted values and show an almost constant deviation from the modelled values. This deviation can be attributed to energy consumed by other system components such as DRAM, since the energy model does not explicitly consider how the energy consumed by these components varies with problem size or the work split ratio (although their power usage is included in the power measurements captured at split ratios of 0% and 100%, which the model takes as input). On the Xavier system, the deviation is slightly higher due to the need for explicit boundary transfers at the end of each iteration, as explained in Section 3.2.1. However, the constant nature of this consumption means that it does not affect the model's eventual goal of predicting the energy-optimal work partition, and the model is observed to closely describe the trend in energy usage as the fraction of work given to the CPU is varied.

Figure 4.4: TX1: Partitioned Stencil - nx=ny=8192. The top half shows measured and modelled energy-to-solution for the double-precision stencil on the TX1, while the bottom half shows measured and modelled energy-to-solution for the single-precision stencil. Energy is minimised when all work is given to the GPU in both cases.

The Xavier system is seen to perform faster while consuming less energy than the TX1 and TK1, and is much more energy efficient for both double precision and single precision.

4.3.2.2 GEMM

Figure 4.6 shows the performance and energy usage for SGEMM and DGEMM for input dimensions of m = k = n = 4096 on the TK1 system. Figure 4.7 shows the same on the TX1 system and Figure 4.8 shows the results on the Xavier system.

The solid blue lines in the figures show the measured performance in GFlop/s for each split ratio. The energy values in joules for each of these executions are represented by red dotted lines, and the model's predicted energy usage values for each split ratio are shown using green dashed lines.

The energy-to-solution values for both SGEMM and DGEMM for each platformare mostly as expected by the energy usage model. For the TK1 system, the optimalwork partition for both SGEMM and DGEMM is 0% according to the measured values.The model’s prediction also shows the same. For DGEMM on TX1, the energy modelpredicts an energy-optimal split of 65% while the measured energy values indicate an

Page 105: Developing Scientific Software for Low-power System-on-Chip ...€¦ · Processors: Optimising for Energy Anish Varghese ... developing code for these systems to those on conventional

§4.3 Experimental Evaluation of the Energy Usage Model 83

22

24

26

28

30

32

GFL

OP/

s

DP GFLOP/s

2224262830323436

JOU

LES

DP Measured Energy DP Modelled Energy

40

45

50

55

60

65

GFL

OP/

s

SP GFLOP/s

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

10

12

14

16

18

20

Fraction of Work Given to CPU (%)

JOU

LES

SP Measured Energy SP Modelled Energy

Figure 4.5: Xavier: Partitioned Stencil - nx=ny=8192. Top half shows measured and modelledenergy-to-solution for double precision stencil on the Xavier while the bottom half showsmeasured and modelled energy-to-solution for single precision stencil. Energy is minimisedwhen all work is given to GPU for both cases

12

14

16

18

20

22

24

GFL

OP/

s

DGEMM GFLOP/s

707580859095100105

JOU

LES

DGEMM Measured Energy DGEMM Modelled Energy

50

100

150

200

GFL

OP/

s

SGEMM GFLOP/s

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 10051015202530354045

Fraction of Work Given to CPU (%)

JOU

LES

SGEMM Measured Energy SGEMM Modelled Energy

Figure 4.6: TK1: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelledenergy-to-solution for DGEMM on the TK1 while the bottom half shows measured andmodelled energy-to-solution for SGEMM. Energy is minimised when all work is given toGPU for both cases

Page 106: Developing Scientific Software for Low-power System-on-Chip ...€¦ · Processors: Optimising for Energy Anish Varghese ... developing code for these systems to those on conventional

84 Measuring and Modelling Energy Usage

1012141618202224

GFL

OP/

s

DGEMM GFLOP/s

95

100

105

110

115

120

125

JOU

LES

DGEMM Measured Energy DGEMM Modelled Energy

0

100

200

300

400

GFL

OP/

s

SGEMM GFLOP/s

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

5101520253035404550

Fraction of Work Given to CPU (%)

JOU

LES

SGEMM Measured Energy SGEMM Modelled Energy

Figure 4.7: TX1: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelledenergy-to-solution for DGEMM on the TX1 while the bottom half shows measured andmodelled energy-to-solution for SGEMM. Energy is minimised when all work is given toGPU for SGEMM while split ratio of 60% minimises energy-to-solution for DGEMM.

30

40

50

60

70

GFL

OP/

s

DGEMM GFLOP/s

444648505254565860626466

JOU

LES

DGEMM Measured Energy DGEMM Modelled Energy

200

400

600

800

GFL

OP/

s

SGEMM GFLOP/s

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

5

10

15

20

25

Fraction of Work Given to CPU (%)

JOU

LES

SGEMM Measured Energy SGEMM Modelled Energy

Figure 4.8: Xavier: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelledenergy-to-solution for DGEMM on the Xavier while the bottom half shows measured andmodelled energy-to-solution for SGEMM. Energy is minimised when all work is given toGPU for SGEMM while split ratio of 65% minimises energy-to-solution for DGEMM

Page 107: Developing Scientific Software for Low-power System-on-Chip ...€¦ · Processors: Optimising for Energy Anish Varghese ... developing code for these systems to those on conventional

§4.3 Experimental Evaluation of the Energy Usage Model 85

actual optimal split of 60%. The performance-optimal split is also seen to be the same. Since only increments of 5% were considered for this experiment, the predicted and actual optimal splits most likely lie between 60% and 65%.

For DGEMM on Xavier, the energy model predicts an energy-optimal split of 65% and the measured energy values indicate the same. Results for SGEMM on both the TX1 and Xavier indicated an energy-optimal split of 0%. The difference between the predicted and measured energy values is around 5% on the TK1 and TX1, while on the Xavier the deviation is slightly higher due to the need for an explicit memory transfer at the end.

For the purpose of comparison, GEMM was also executed on the two conventional Intel-based HPC systems described in Table A.1. The Energy Measurement Library (EML) [Cabrera et al., 2015] was used to measure the energy consumption of these two systems.

[Figure 4.9 plot omitted: DGEMM (top) and SGEMM (bottom) GFLOP/s and measured vs. modelled energy (joules) against fraction of work given to CPU (%)]

Figure 4.9: Intel + K20: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the Sandy-bridge + K20 system while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for both cases.

Figure 4.9 shows the performance and energy measurements for SGEMM and DGEMM on the Sandy-bridge based system, while Figure 4.10 shows the results for the same kernels on the Haswell based system. Since the CPU and GPU memory are physically separate, the cost of copying between CPU and GPU memory is an additional overhead which we have included in our results. On the Haswell-based system, variations can be seen in performance and energy usage in the first half of the graph, where more work is allocated to the GPU. This is attributed to unfavourable dimensions of the problem allocated to the GPU and load imbalance across the 15 SMs on the K80 GPU. Overall, allocating all the work to the GPU is more energy


[Figure 4.10 plot omitted: DGEMM (top) and SGEMM (bottom) GFLOP/s and measured vs. modelled energy (joules) against fraction of work given to CPU (%)]

Figure 4.10: Intel + K80: Partitioned GEMM - m=k=n=4096. Top half shows measured and modelled energy-to-solution for DGEMM on the Haswell + K80 system while the bottom half shows measured and modelled energy-to-solution for SGEMM. Energy is minimised when all work is given to GPU for both cases.

efficient on both systems. The measured energy values show an increasing deviation as more work is given to the CPU. Explicit memory copies made by the CUDA runtime to and from the GPU across the PCIe bus, which are not hidden by overlapped computation, are responsible for this deviation. For systems with discrete GPUs, this has an impact and can be factored into our model by proportionally reducing the effective computational rates Rc and Rg.
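As a rough illustration of this rate reduction (an illustrative helper, not part of the thesis's model code), an effective rate can be computed by charging the non-overlapped transfer time against the kernel's operation count. A 4096-cubed DGEMM performs about 2 x 4096^3, roughly 137 GFLOP:

```python
def effective_rate_gflops(flops, compute_time_s, transfer_time_s):
    """Effective GFLOP/s when non-overlapped host<->device copies
    extend the elapsed time beyond the pure compute time."""
    return flops / (compute_time_s + transfer_time_s) / 1e9

# Illustrative numbers: 0.3 s of PCIe copies on top of 1.0 s of compute
peak = effective_rate_gflops(2 * 4096**3, 1.0, 0.0)
with_copies = effective_rate_gflops(2 * 4096**3, 1.0, 0.3)
```

The reduced value would then be substituted for Rc or Rg in the model in place of the transfer-free rate.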

4.3.2.3 Multizone Block Tridiagonal Solver

A partitioned version of the multizone block tridiagonal (BT) benchmark from the NAS Parallel Multizone (NPB-MZ) benchmark suite was executed on the Tegra systems by varying the work given to the CPU. As described in Section 2.3.3, we use the hybrid implementation developed by Dümmler and Rünger [2013] as the starting point.

The code was modified to use unified memory on our heterogeneous platforms. This hybrid CPU + GPU implementation parallelises by allocating different zones to the CPU and GPU in each iteration. The computation for these zones can then proceed as independent tasks on the different compute devices. Once the CPU and the GPU complete their kernel computations for a particular time step, the boundary data is exchanged between them before starting the computations for the next time step. The code was modified to include an input parameter cpu_ratio, which controls how the zones are partitioned between the CPU and the GPU.
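The zone assignment can be sketched as follows (Python for illustration only; the actual benchmark is Fortran-based, and partition_zones and zone_sizes are hypothetical names). Zones are given to the CPU until approximately cpu_ratio percent of the total work is covered:

```python
def partition_zones(zone_sizes, cpu_ratio):
    """Split zone indices between CPU and GPU so the CPU receives
    approximately cpu_ratio percent of the total work (sum of sizes)."""
    total = sum(zone_sizes)
    target = total * cpu_ratio / 100.0
    cpu_zones, gpu_zones, acc = [], [], 0.0
    for i, size in enumerate(zone_sizes):
        if acc < target:          # CPU still below its share
            cpu_zones.append(i)
            acc += size
        else:                     # remainder goes to the GPU
            gpu_zones.append(i)
    return cpu_zones, gpu_zones

# e.g. four equal zones with cpu_ratio=50: first two zones on the CPU
cpu, gpu = partition_zones([100, 100, 100, 100], 50)
```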


All the available cores of the CPU were used for this experiment. Figure 4.11 shows the measured and modelled energy values for the class B problem size on the TK1 system when the split ratio given to the CPU is varied from 0% to 100%. Figure 4.12 shows the same for the TX1 system and Figure 4.13 shows the results for the Xavier system.

[Figure 4.11 plot omitted: BT GFLOP/s and measured vs. modelled energy (joules) against fraction of work given to CPU (%)]

Figure 4.11: TK1: Partitioned NPB (MZ) BT - Class B. Figure shows measured and modelled energy-to-solution for the class B problem size on the TK1 system. Model and measured values indicate that energy is minimised when all work is given to the CPU.

[Figure 4.12 plot omitted: BT GFLOP/s and measured vs. modelled energy (joules) against fraction of work given to CPU (%)]

Figure 4.12: TX1: Partitioned NPB (MZ) BT - Class B. Figure shows measured and modelled energy-to-solution for the class B problem size on the TX1 system. Model and measured values indicate that energy is minimised when all work is given to the CPU.

On all three systems, the measured energy is seen to be slightly less than the modelled energy in the first half (when more work is given to the GPU) and higher than the modelled energy in the second half (when more work is given to the CPU). The overall trend of energy consumption on all three systems is reasonably described by the model, and the difference between the modelled and measured energy is within around 10-15%. Both the modelled and measured energy values indicate that energy is minimised for this implementation of the application when all the work is given to the CPU on all three systems.


[Figure 4.13 plot omitted: BT GFLOP/s and measured vs. modelled energy (joules) against fraction of work given to CPU (%)]

Figure 4.13: Xavier: Partitioned NPB (MZ) BT - Class B. Figure shows measured and modelled energy-to-solution for the class B problem size on the Xavier system. Model and measured values indicate that energy is minimised when all work is given to the CPU.

4.3.3 Evaluation of Predicted Optimal Split Ratio

Experiments were performed to evaluate the accuracy of the predicted split ratio compared to the actual optimal split ratio. Here, the energy model is used to predict the optimal work split ratio at which energy is minimised for the partitioned GEMM and stencil kernel computations for different problem sizes.

In order to evaluate the model's prediction accuracy, each application kernel is executed for each problem size using split ratios in increments of 5% from 0 to 100, and the actual energy consumed is measured for each split ratio. From this, the actual optimal split ratio (Fo) and the energy consumed at this split (Eo) are determined for each problem size. The model's predicted optimal split ratio (Fp) is determined for each problem size using, as the model's input parameters, the measurements obtained when all the work for that problem size is given to each device. The actual energy consumed at this split ratio (Em) is then recorded. The deviation between Fp and Fo (devFp), in steps of 5%, is reported for each problem size along with the model's error in prediction of energy optimality, which is calculated as the percentage difference between Em and Eo, i.e., EE = (Em − Eo)/Eo × 100%.
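To make the bookkeeping concrete, the metrics above can be computed from a single sweep of measured energies (a sketch; energy_by_split is a hypothetical mapping from tested split ratio to measured joules):

```python
def evaluate_prediction(energy_by_split, Fp):
    """energy_by_split: {split_ratio_percent: measured_joules}, sampled
    in 5% steps. Fp: the model's predicted optimal split ratio."""
    Fo = min(energy_by_split, key=energy_by_split.get)  # actual optimum
    Eo = energy_by_split[Fo]                            # best measured energy
    Em = energy_by_split[Fp]                            # energy at predicted split
    dev_Fp = (Fp - Fo) // 5                             # deviation in 5% steps
    EE = (Em - Eo) / Eo * 100.0                         # % error in optimality
    return Fo, Eo, Em, dev_Fp, EE
```

For example, a sweep measuring 10.0 J at 60%, 9.0 J at 65% and 9.5 J at 70%, with a prediction of Fp = 70, gives Fo = 65, devFp = 1 and EE of roughly 5.6%.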

The mean of 5 samples is reported for all experiments, with a margin of error, at a confidence level of 95%, of less than 3% for performance, energy and power in most cases. However, in a few cases, variance of up to 5% in energy measurement was recorded when an experiment was repeated at a later time.

4.3.3.1 Stencil

As seen in the previous experiment with the stencil kernel, the optimal split ratio was always 0% when the CPU and GPU were set to their maximum frequencies on all of the test platforms. Since the objective of this experiment was to verify how Fp compares to Fo, it is desirable to have non-zero values for Fo.

With this in mind, a controlled experiment was undertaken for the double precision stencil on the TX1 system where the CPU frequency is set to the highest available CPU frequency and the GPU frequency is set to a low value of 230 MHz. This adjusts the relative difference in performance between the CPU and GPU, thereby ensuring more cases with non-zero values for Fo. As such, this setting serves to simulate hardware where such a balance exists between the CPU and GPU. This setting is used for all following experiments involving the stencil kernel on the TX1 system.

For brevity, the results for a few problem sizes are shown in Table 4.1, while Figure 4.14 shows the prediction error EE for all test problem sizes in this experiment. The table shows, for a number of problem sizes, the actual optimal split ratio Fo, the model's predicted split ratio Fp, the actual optimal energy consumption Eo, the energy consumed Em when executed at Fp, the deviation devFp between Fp and Fo (in steps of 5%) and the error in energy optimality EE.

It is observed that for small problem sizes (total grid size of 2048 × 2048 and smaller on the TX1), the error in prediction is higher (more than 10%). It is found that, on average, for such problem sizes the execution time is very low (less than around 0.5 s) and the energy consumed is less than about 5 J. In such cases the variance of measurements is higher and the measurements could not be reliably used in the model for accurate predictions. Hence the error is higher for these smaller problem sizes. However, it is noted that the absolute difference in energy is much less than 1 J in most of these cases. Discarding these cases, the prediction error is observed to be much lower, as shown in Table 4.1.

From Figure 4.14, the prediction error EE is below 10% for all cases except the smaller problem sizes described earlier.

4.3.3.2 GEMM

From previous experiments described in Section 4.3.2.2, DGEMM on the TX1 showed the presence of an energy-optimal split. Therefore, an experiment was conducted to evaluate the prediction accuracy of the model for DGEMM on the TX1.

For this experiment, both the CPU and GPU are set to their highest available frequency. The results for a few problem sizes are shown in Table 4.2 and Figure 4.15 shows the error in prediction for all the test problem sizes in this experiment. Both square and non-square matrices are chosen for this experiment.

From the results, on average, where the input size is larger than m × k × n = 1024 × 1024 × 1024, the prediction error is less than around 5% in all cases. For smaller sizes, the error is seen to be higher.

4.3.4 Variation of Optimal Split with Problem Size

From the previous experiments described in Section 4.3.3, it is evident that the optimal split ratio varies with both the application and the input problem size, since performance and power draw depend on both. Experiments were conducted to illustrate the variation of the optimal split ratio Fo with problem size for stencil and GEMM.


Table 4.1: Double precision stencil on TX1 - Evaluation of energy model's prediction of optimal split

nx      ny      Fo    Fp    Eo     Em     devFp   EE%
1024    512     100   70    0.6    0.8    -6      33.6
1024    1024    100   70    1.8    1.3    -6      11.6
1024    4096    65    65    3.7    3.6    0       0
1024    8192    65    70    6.8    7.1    1       5.1
1024    16384   65    65    12.9   12.9   0       0
1024    25600   65    65    20.3   20.3   0       0
2048    512     0     60    1.2    1.3    12      15.1
2048    4096    55    60    7.5    7.5    1       1.3
2048    8192    55    60    14.6   14.7   1       0.6
2048    16384   55    60    28.7   29.8   1       3.9
2048    25600   55    60    45.1   46.5   1       3.3
4096    1024    35    60    4.1    4.4    5       9.6
4096    4096    55    60    14.9   15.2   1       2.2
4096    8192    55    60    29.7   30.4   1       2.5
4096    16384   50    55    59.8   59.9   1       0.1
8192    2048    50    55    15.5   15.6   1       0.4
8192    3172    55    60    22.2   23.2   1       4.5
8192    4096    50    55    30.5   31.1   1       1.4
8192    8192    50    55    60.1   63.1   1       4.7
16384   512     0     0     7.9    7.9    0       0
16384   1024    0     0     15.5   15.5   0       0
16384   2048    0     0     31.3   31.3   0       0
16384   4096    0     0     62.9   62.9   0       0
25600   256     0     0     6.3    6.3    0       0
25600   512     0     0     12.6   12.6   0       0
25600   1024    0     0     25.1   25.1   0       0
25600   2048    0     0     49.4   49.4   0       0

Fo: Actual best cpu_ratio
Fp: Predicted best cpu_ratio
Em: Measured energy at Fp
Eo: Actual best energy
devFp: Deviation of Fp from Fo in steps of 5
EE%: Error in energy optimality = ((Em − Eo)/Eo) × 100


[Figure 4.14 colormap omitted: prediction error EE (%) over problem sizes nx by ny, scale 0-20]

Figure 4.14: Double precision stencil on TX1 - Model prediction error for different problem sizes. The colormap shows the error in energy optimality EE (%).


Table 4.2: DGEMM on TX1 - Evaluation of energy model’s prediction of optimal split

m      n      k      Fo    Fp    Em      Eo      devFp   EE%
256    1024   8192   45    50    2.90    2.79    1       3.7
256    2048   4096   45    50    2.87    2.75    1       4.3
256    2048   8192   45    50    5.70    5.45    1       4.4
256    8192   1024   45    50    2.79    2.68    1       4.0
256    8192   2048   40    50    5.57    5.40    2       3.0
256    8192   4096   45    50    11.1    10.7    1       3.7
1024   8192   512    55    55    5.20    5.20    0       0.0
1024   8192   1024   55    55    10.4    10.4    0       0.0
4096   1024   2048   55    60    11.2    11.0    1       1.3
4096   1024   4096   55    65    23.1    22.4    2       3.2
4096   1024   8192   55    65    47.2    44.6    2       5.7
4096   4096   8192   55    65    183.9   177.6   2       3.4
4096   8192   256    55    55    10.3    10.3    0       0.0
4096   8192   512    55    55    20.7    20.7    0       0.0
8192   256    1024   45    60    3.33    3.23    3       3.1
8192   256    2048   60    65    6.91    6.85    1       0.9
8192   1024   4096   70    70    50.3    50.3    0       0.0
8192   4096   512    55    60    21.4    21.0    1       1.7
8192   8192   2048   65    70    194.3   193.7   1       0.3
8192   8192   4096   70    70    409.4   409.4   0       0.0
8192   8192   8192   100   100   834.1   829.9   0       0.5

Fo: Actual best cpu_ratio
Fp: Predicted best cpu_ratio
Em: Measured energy at Fp
Eo: Actual best energy
devFp: Deviation of Fp from Fo in steps of 5
EE%: Error in energy optimality = ((Em − Eo)/Eo) × 100


[Figure 4.15 colormap omitted: prediction error EE (%) over matrix dimensions m, k and n, scale 0-20]

Figure 4.15: DGEMM on TX1 - Model prediction error for square and non-square matrices. The work is partitioned by allocating a portion of the columns of matrix B (dimension n) to the CPU and the rest to the GPU.


4.3.4.1 Stencil

For the stencil kernel, for the purposes of this controlled experiment, the GPU frequency is reduced to 230 MHz in order to adjust the relative difference in performance between the CPU and GPU, as mentioned earlier.

Figure 4.16 shows how the optimal split Fo varies with problem size for the double precision stencil on the TX1 platform under the above conditions. From the results, it is observed that Fo varies between 45% and 70% for most problem sizes.

For a few cases where nx is very low, Fo is close to or equal to 100%. When the problem size is increased, Fo gradually decreases and is close to 0% for a few cases where nx is very high. This may be explained by the penalty due to L1 cache misses when nx is increased.

In general, under these settings, the CPU is observed to be more energy efficient for this kernel for smaller problem sizes, while the GPU is observed to be more energy efficient for larger problem sizes.

[Figure 4.16 colormap omitted: optimal split ratio Fo (%) over problem sizes nx by ny, scale 0-100]

Figure 4.16: Stencil on TX1 - how Fo varies with problem size. The colormap shows the variation of the optimal split ratio Fo (%) as the problem size is changed. Fo is observed to vary between 45% and 70% for most problem sizes.


4.3.4.2 DGEMM

Figure 4.17 shows how the optimal split Fo varies with problem size for double precision GEMM on the TX1 platform. For all experiments involving DGEMM, the CPU and GPU are set to their highest frequency, and this results in Fo varying between 35% and 70% for most problem sizes.

It is observed that the performance and energy efficiency of DGEMM using cuBLAS on the TX1 deteriorates when the input problem size is increased beyond 4096 × 4096 × 4096, while the performance is very similar for problem sizes smaller than this. As a result, when the problem size is increased, the energy-optimal split leans more towards the CPU.

For smaller problem sizes, the GPU is seen to be more efficient and consequently the energy-optimal split leans towards the GPU in these cases.

[Figure 4.17 colormap omitted: optimal split ratio Fo (%) over matrix dimensions m, k and n, scale 0-100]

Figure 4.17: DGEMM on TX1 - how Fo varies with problem size. The colormap shows the variation of the optimal split ratio Fo (%) as the problem size is changed. Fo is observed to vary between 35% and 70% for most problem sizes.

4.4 Critique and Limitations of the Energy Usage Model

The energy usage model described in Section 4.2 predicts the energy-optimal way to partition an application based on system characteristics such as the relative performance and power consumption of the individual compute devices. The cost of data movement is not explicitly included in the model. For LPSoC systems with shared physical memory, there is no need for separate data transfers and this cost does not need to be considered. However, for systems with discrete accelerators, the model assumes input data is already in place, providing an idealised result. Non-overlapped memory transfers can be accounted for by modelling them as an effective reduction in the computational rate of the application.

Effect of Frequency Scaling on Energy Usage

It is also noted that the model assumes a fixed operating frequency for the different compute devices. However, it is possible in some cases to lower the overall energy consumed by a system by scaling the frequency of the processing elements. Scaling can be applied to different components such as the CPU, accelerator and memory subsystem, providing a number of possible configurations to consider.

Although detailed analysis of these choices is considered out of scope for this thesis, preliminary results were recorded on the effect of CPU and GPU frequency scaling on the energy consumption of an application, in the context of GEMM and stencil.

Figure 4.18 shows how performance, power and energy-to-solution are affected by CPU frequency scaling for SGEMM and the single-precision stencil running exclusively on the CPU. The problem size chosen for SGEMM is m=k=n=4096 and that for stencil is nx=ny=4096.

From the results on the TK1 system, it is seen that energy-to-solution for both applications is minimised when the CPU frequency is reduced to 1.12 GHz, although with a loss in performance of around 40%.

Similarly, Figure 4.19 shows how performance, power and energy-to-solution are affected by GPU frequency scaling for SGEMM and the single-precision stencil running exclusively on the GPU. From the results on the TK1 system, energy-to-solution for both applications is minimised when the GPU frequency is reduced to 612 MHz, with a loss in performance of around 10-20%.

Further experimentation with different combinations of CPU and GPU frequencies for SGEMM revealed that on this system, the energy-to-solution for partitioned SGEMM is minimised, with a trade-off of around 30% in performance, when the CPU frequency is set to 828 MHz, the GPU frequency is set to 612 MHz and a split ratio of 5% is used.

From Figures 4.18 and 4.19, the power consumption of both applications is seen to increase non-linearly, as a cubic function, with frequency. This is because the dynamic portion of power can be modelled as a cubic function of frequency [Ishihara and Yasuura, 1998; Rizvandi et al., 2012]. From these figures it is evident that performance and power characteristics for different applications vary to different degrees when the operating frequency of each processing element is varied. Determining this choice of frequencies for the different devices, along with the optimal work partition, is of interest. To tackle this problem, Ge et al. [2014] experimentally derive the relation between power and frequency by initially executing each application under


[Figure 4.18 plots omitted: SGEMM and stencil GFLOP/s, energy (joules) and power (W) against CPU frequency (MHz)]

Figure 4.18: CPU Frequency Scaling: SGEMM and Stencil on TK1. Top half shows how performance and energy vary with CPU frequency and bottom half shows how power varies with CPU frequency. Energy for both SGEMM and Stencil is minimised at CPU frequency = 1.12 GHz.

consideration for all the available frequencies on each device. However, this may not be possible in all cases.

Analytically determining this choice of frequencies for a general class of applications without the need for prior experimentation, especially when the decision has to be made at runtime for a new application and/or a new platform where prior experimentation is not possible, is of interest. Preliminary experimentation using microbenchmarks designed to stress the CPU and memory components showed that for CPU-intensive workloads power varies non-linearly (as a cubic function) with CPU frequency, while for memory-bound workloads power varies more linearly with CPU frequency (results in Appendix D). However, modelling how power consumption varies with operating frequency for a general workload is not straightforward. An approach to do this might be to use a regression model which analyses the relative mix of instructions of an application and estimates the contributions of the different micro-architectural components to the overall power consumption as frequency is varied.
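For the CPU-intensive case, the cubic dependence amounts to fitting P(f) = P_static + c·f³ to a few (frequency, power) samples. A minimal sketch, using synthetic data generated from an assumed cubic law rather than real measurements:

```python
def fit_cubic_power(freqs_ghz, powers_w):
    """Least-squares fit of P = p_static + c * f^3, treating f^3 as the
    single regressor (closed-form simple linear regression)."""
    x = [f ** 3 for f in freqs_ghz]
    n = len(x)
    mx = sum(x) / n
    my = sum(powers_w) / n
    c = sum((xi - mx) * (yi - my) for xi, yi in zip(x, powers_w)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - c * mx, c  # (p_static, c)

# Synthetic samples from P = 2.0 + 0.5 * f^3 (watts, f in GHz)
freqs = [0.5, 1.0, 1.5, 2.0]
samples = [2.0 + 0.5 * f ** 3 for f in freqs]
p_static, c = fit_cubic_power(freqs, samples)
```

With real measurements, a memory-bound workload would show a poor cubic fit and a better linear one, matching the microbenchmark observations above.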

Further analysis of this has been left for future work, and priority was given to building a framework with which a running application can obtain real-time energy measurements and apply the energy model to dynamically decide how to change its


[Figure 4.19 plots omitted: SGEMM and stencil GFLOP/s, energy (joules) and power (W) against GPU frequency (MHz)]

Figure 4.19: GPU Frequency Scaling: SGEMM and Stencil on TK1. Top half shows how performance and energy vary with GPU frequency and bottom half shows how power varies with GPU frequency. Energy for both SGEMM and Stencil is minimised at GPU frequency = 612 MHz.

behaviour in order to optimise its energy usage. This is addressed in the following chapter.

4.5 Summary

This chapter described the design of an energy measurement framework, along with a measurement API, which can be used to collect fine-grained energy measurements for an application executing on an LPSoC system. The chapter also introduced an energy usage model which can be used to predict the energy-optimal work partition for an application on a heterogeneous system.

The model was evaluated in the context of three applications on LPSoC systems. Validation for the widely used GEMM kernel was also provided on conventional HPC systems with attached GPUs. Different metrics for evaluating the energy usage model were described. The model was observed to be effective in predicting the optimal work partition for the applications, with an error of less than around 5% for large problem sizes.

Some limitations of the energy usage model were identified and discussed. The model assumes that the frequency of each device is fixed to the highest available frequency. Determining the configuration of frequencies of the different devices in order to minimise energy-to-solution is of interest and has been identified as future work. For systems with attached accelerators, explicit data transfers were not considered and can be modelled as reduced computational rates.

The energy measurement framework allows an application to obtain real-time measurements of the energy consumed at a function level, as simply as measuring time. A running application could use this information to change its behaviour at runtime. This enables tuning an application for optimal energy usage. The next chapter explores this in detail, specifically how an application might use real-time energy measurement at runtime, along with the energy usage model, to determine how best to execute in order to minimise energy-to-solution.
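The intended usage pattern resembles wrapping a code region with a timer. The sketch below is purely illustrative: EnergyMeter is a hypothetical stand-in (simulating a constant 5 W draw), not the framework's actual API.

```python
import time

class EnergyMeter:
    """Hypothetical stand-in for the measurement API: integrates an
    assumed constant power draw over the elapsed time of a region."""
    def __init__(self, power_watts=5.0):
        self.power_watts = power_watts
    def __enter__(self):
        self.t0 = time.perf_counter()
        return self
    def __exit__(self, *exc):
        elapsed = time.perf_counter() - self.t0
        self.joules = self.power_watts * elapsed  # E = P * t

with EnergyMeter() as meter:
    sum(range(100000))  # the code region being measured
```

A real implementation would read the platform's power sensors at the start and end of the region instead of assuming a constant draw.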


Chapter 5

Developing a Runtime Tuning Framework

Chapter 4 presented an energy model which can be used to predict the best way to partition a computation between the different processing elements in a heterogeneous system in order to minimise its energy-to-solution. A method of measuring the energy consumption of applications running on LPSoCs was also presented.

The work described in this chapter explores how the model can be practically applied to an application executing on a heterogeneous system, with the objective of minimising its energy consumption. The model depends on parameters such as the application's measured performance and power usage when executed individually on each compute device, and the idle power draw of each device, as described in Section 4.2. These parameters need to be determined at runtime without having to execute the entire application on each device individually. Hence we explore methodologies for reliably estimating these parameters by solving portions of the problem at runtime on each device and obtaining measurements. Other work-partitioning approaches, such as Donfack et al. [2014], propose a heuristic method to determine the optimal distribution based on the theoretical peak performance of each compute device. However, this does not take into account the variation in performance and energy characteristics with problem size.
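To show how such runtime-estimated parameters could feed a split decision, the sketch below sweeps candidate splits under a simplified stand-in for the model (the exact formulation is in Section 4.2): both devices are assumed to run their shares concurrently, each drawing active power while busy and idle power while waiting. All numbers and names are illustrative.

```python
def predict_optimal_split(Rc, Rg, Pc, Pg, Pidle_c, Pidle_g, step=5):
    """Pick the CPU fraction F (percent) minimising modelled energy.
    Rc, Rg: measured rates (work units/s) on CPU and GPU alone.
    Pc, Pg: active power (W); Pidle_c, Pidle_g: idle power (W)."""
    best_F, best_E = 0, float("inf")
    for F in range(0, 101, step):
        f = F / 100.0
        tc = f / Rc                 # time for the CPU's share of unit work
        tg = (1.0 - f) / Rg         # time for the GPU's share
        T = max(tc, tg)             # devices execute concurrently
        E = (Pc * tc + Pidle_c * (T - tc) +
             Pg * tg + Pidle_g * (T - tg))
        if E < best_E:
            best_F, best_E = F, E
    return best_F, best_E

# Illustrative: GPU 3x faster but hungrier; significant idle draw
F, E = predict_optimal_split(Rc=1.0, Rg=3.0, Pc=3.0, Pg=10.0,
                             Pidle_c=2.0, Pidle_g=2.0)
```

With these made-up inputs the minimum falls at an interior split (F = 25, where both devices finish together), illustrating how idle power pushes the optimum away from the endpoints.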

The primary objective of the work presented in this chapter is to design, implement and evaluate a proof-of-concept framework which is able to analyse energy measurements and decide how to partition the application under consideration to achieve energy optimality. The framework is designed as a set of functions which are invoked by the application.

In particular, this framework will collect energy and performance measurements from a running application and analyse them using the energy model described in Chapter 4. Using this data, the framework will execute the application in such a way as to attain energy optimality. The runtime will achieve this by choosing the optimal fraction of work to be allocated to each processing element.

In order to limit the scope of the runtime framework, a number of assumptions and limitations were taken into consideration. They are as follows:


• The platform under consideration has at least two different devices with different performance and energy characteristics. For simplicity, we consider only the CPU and a GPU for this analysis.

• The application kernel under consideration is developed in such a way that it can run on any of the devices and the workload can be partitioned seamlessly (i.e., via a simple parameter) across the devices.

• The energy and performance characteristics of the kernel on each device change as a function of problem size.

• The kernel runs exclusively on the hardware. If a device is not executing the kernel, it does nothing else but may still be consuming energy.

• The framework is designed to work only for kernels whose operations are periodic or repetitive. Kernels whose behaviour keeps changing over time cannot be captured properly.

While designing the runtime framework, two different scenarios are identified, and a different approach is considered for each, namely offline pre-tuning and dynamic tuning. Section 5.1 presents and evaluates the offline pre-tuning approach and Section 5.2 describes and evaluates the dynamic tuning approach.

5.1 Offline Pre-tuning

The offline pre-tuning approach involves an initial offline pre-tuning phase in which measurements are collected. This approach is typically suited to heavily used kernels which form building blocks for many scientific applications and are invoked a large number of times on a hardware platform. Pre-tuning is done once for each application on each platform. The initial cost of pre-tuning is anticipated to be outweighed by the benefit of energy savings over time.

As seen earlier, the performance and energy usage of an application executing on a heterogeneous platform depend on the characteristics of the different processing elements as well as the input problem size. The popular auto-tuning library ATLAS [Whaley et al., 2000] performs tuning at install time, where different code variants of the application and parameter settings are investigated on the hardware. The optimal values for these parameters are found empirically and chosen to obtain the best performance. The recent work by Shen [2015] describes an offline tuning approach for performance by profiling an application with multiple sample problem sizes.

We use a similar idea for our offline pre-tuning approach in order to optimise for energy. Given a partitioned application kernel and a hardware platform with multiple processing elements, the initial one-time tuning phase involves executing the application for different configurations of problem sizes on each of the processing elements individually. Performance and energy usage are measured and stored for each of these configurations. This data is used by the framework at runtime to determine the optimal runtime configuration for all subsequent executions of this application kernel on this platform. Depending on the input problem size, the framework looks up the recorded measurements for the closest tuning problem size and uses them as parameters for our energy model at runtime to predict the optimal partition.

This approach is expected to be useful for application kernels whose computations are repetitive or periodic. Here, we use the widely used stencil and matrix multiplication kernels, which are critical components of image processing and BLAS libraries respectively, as examples to demonstrate this approach.

5.1.1 Implementation of Pre-tuning Approach for Stencil

Consider a stencil computation with a grid size of nx columns × ny rows. Assume it is to be partitioned in one dimension, by rows (i.e., dimension ny). During the initial pre-tuning phase, the framework first executes the kernel completely on the CPU for a number of different problem sizes. Energy and performance measurements (averaged over 5 runs) are recorded for each problem size. The kernel is then executed completely on the GPU and measurements are recorded for each problem size.

This data is used to predict the energy-optimal work partitioning for all further invocations of the kernel at runtime, i.e. for a computation with an input problem size of nx × ny the following function is invoked:

float get_best_split(int nx, int ny, int iterations)

This function returns the best split_ratio to indicate how the work should be split between the CPU and GPU in order to minimise the energy to solution. In particular, split_ratio indicates the ratio of the rows ny which must be allocated to the CPU, with the remaining portion of ny to be allocated to the GPU.

The function returns the optimal split ratio by searching the collected pre-tuning measurements for this kernel. It finds the point, in the 2-D space of pre-tuned problem sizes, that is closest to the input point nx × ny in terms of Euclidean distance. For this, it is enough to consider each dimension in isolation and find the closest point to the input size in that dimension. In cases where the input point lies exactly in the middle of two tuning points in a dimension, the larger tuning point is considered the closest match. Using the recorded measurements for this problem size as inputs for the energy model, it computes the optimal split for this computation in order to minimise energy.
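As an illustration, the per-dimension nearest-point search described above could be sketched as follows. This is a minimal sketch, not the thesis code; the function name and the sorted tuning-size array are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: find the pre-tuned size closest to the
 * requested size in one dimension. 'sizes' is assumed sorted in
 * ascending order; on an exact halfway tie the larger tuning
 * point wins, as described in the text. */
int closest_tuning_size(const int *sizes, int count, int request)
{
    int best = sizes[0];
    for (int i = 1; i < count; i++) {
        /* '<=' lets the later (larger) point win exact ties. */
        if (abs(sizes[i] - request) <= abs(best - request))
            best = sizes[i];
    }
    return best;
}
```

The 2-D lookup (or 3-D, for GEMM) then simply applies this search once per dimension.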

The selection of tuning problem sizes is important to ensure good coverage and to obtain accurate predictions for any input problem size. However, this is non-trivial. Exhaustively covering all possible sizes in the problem domain is not practically possible. Having too many points in the search space is undesirable as it would increase the duration of the tuning phase to days.

Sizes must be chosen to cover the platform's memory hierarchy. This differs for each platform and application. One simple practical strategy is to use sizes which are powers of 2 in each dimension, ensuring that the total number of points is limited while at the same time covering problem sizes which fall into different levels of the memory hierarchy of a hardware platform.

5.1.2 Implementation of Pre-tuning Approach for GEMM

Consider a matrix multiplication C = A × B where A, B and C have dimensions m × k, k × n and m × n respectively, which is partitioned by splitting the columns of matrix B between the CPU and GPU.

Similar to stencil, the pre-tuning phase involves running the kernel a number of times on each of the devices individually, covering different configurations of problem sizes. After the pre-tuning phase, for an actual computation with an input problem size of m × k × n that is to be split at runtime, the following runtime function (similar to stencil) is invoked:

float get_best_split(int m, int k, int n)

Similar to stencil, this function searches the pre-tuned data to find the point closest to the input problem dimensions in terms of Euclidean distance. For this, each dimension is considered in isolation and the closest match is found, as for stencil. Using the measurement data recorded for this point, it computes the best split_ratio to indicate how the work should be split between the CPU and GPU in order to minimise the energy to solution. In particular, split_ratio indicates the ratio of n which must be allocated to the CPU, with the remaining portion of n to be allocated to the GPU.
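Applying the returned split_ratio to GEMM amounts to dividing the n columns of B (and C) between the two devices, which can be sketched as below. The helper name is invented for illustration.

```c
#include <assert.h>

/* Sketch: divide the n columns of B (and C) between the devices
 * according to split_ratio (percentage of n given to the CPU). */
void partition_columns(int n, int split_ratio, int *cpu_cols, int *gpu_cols)
{
    *cpu_cols = n * split_ratio / 100;
    *gpu_cols = n - *cpu_cols;
}
```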

In order to limit the number of problem sizes for tuning and to cover the platform's different memory hierarchy levels, the simple strategy of choosing sizes that are powers of 2 can be followed for each of the dimensions m, k and n.

5.1.3 Evaluation of Offline Pre-tuning Framework

For each application kernel, pre-tuning is done by running each chosen pre-tuning problem size for split_ratio = 100% (completely on CPU) and split_ratio = 0% (completely on GPU), generating the input data for the framework. For evaluation, the framework is used to make predictions for different test problem sizes within the same range as the pre-tuned problem sizes.

To evaluate the accuracy of the prediction, for each problem size the predicted split ratio (Fp) is computed using the pre-tuned data as described earlier. Energy usage when the computation for this problem size is split using this predicted ratio is measured and recorded as Em. The actual optimal split ratio (Fo) can be found empirically by testing all possible values. However, since running each problem size for all possible split ratios from 0 to 100% is an extremely time-consuming process, the choice was made to limit it to steps of 5% for this evaluation. The prediction of Fp was also limited to steps of 5%. Energy usage when the actual optimal split ratio is used for this problem size is measured and recorded as Eo. The error in prediction of energy optimality, EE = (Em − Eo)/Eo in %, and the deviation of Fp from Fo in steps of 5%, termed devFp, are reported for each test case. We also report the energy savings obtained with this approach compared to the alternative of running the kernel completely on the CPU (CS%) or completely on the GPU (GS%) in the absence of any prediction framework.
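For concreteness, the two prediction-quality metrics just defined might be computed as follows. The helper names are illustrative, not the thesis code.

```c
#include <assert.h>

/* EE: error in prediction of energy optimality, in percent,
 * where Em is the energy at the predicted split and Eo the
 * energy at the empirically best split. */
double energy_error_pct(double Em, double Eo)
{
    return (Em - Eo) / Eo * 100.0;
}

/* devFp: deviation of the predicted split ratio Fp from the
 * actual optimum Fo, counted in 5% steps (both are multiples
 * of 5 in this evaluation). */
int split_deviation_steps(int Fp, int Fo)
{
    return (Fp - Fo) / 5;
}
```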

All the values reported in the following experiments are averaged over at least 5 runs. The margin of error for energy, at the 95% confidence level, is as high as 5% in some cases for smaller problem sizes where the execution time is less than 0.1 s. For longer-running problem sizes, the margin of error for energy is observed to be less than 0.5%.

5.1.3.1 Evaluation of Pre-tuning Approach for Stencil

An experiment was conducted to evaluate the prediction accuracy of the pre-tuning approach for double precision stencil on the TX1 system. The GPU frequency was reduced to 230 MHz for all experiments involving double precision stencil on TX1 in order to adjust the relative difference in performance between the CPU and GPU, as described in Section 4.3.3.1. To choose the problem sizes for pre-tuning, a simple strategy was initially followed by selecting powers of 2 starting with 512 in each dimension, as described in Section 5.1.1. The sizes are 512, 1024, 2048, 4096, 8192 and 16384. However, since the gap between subsequent tuning data points increases as problem size increases, this doesn't provide adequate coverage for larger dimensions. More problem sizes need to be included to adequately cover all the sizes within the range and obtain accurate predictions.

With this in mind, a simple heuristic is applied to close the gap for larger sizes. Between the sizes 4096 and 16384, tuning points are chosen in increments of 2048, i.e. in each dimension the following 10 sizes are chosen: 512, 1024, 2048, 4096, 6144, 8192, 10240, 12288, 14336, 16384. This makes up a total of 10² = 100 different configurations of problem sizes and gives good coverage of the problem sizes in the range 512 to 16384.
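The size-selection heuristic above (powers of two up to a knee point, then fixed increments up to the limit) can be expressed as a short routine. This is a sketch under the stated assumptions; the function name is invented for illustration.

```c
#include <assert.h>

/* Sketch: build the per-dimension tuning sizes -- powers of two
 * from 'start' up to 'knee', then fixed increments of 'step' up
 * to 'limit'. For start=512, knee=4096, step=2048, limit=16384
 * this yields the 10 stencil sizes listed in the text; for
 * start=256, knee=2048, step=1024, limit=8192 it yields the 10
 * GEMM sizes of Section 5.1.3.2. Returns the number of sizes. */
int make_tuning_sizes(int sizes[], int start, int knee, int step, int limit)
{
    int n = 0;
    for (int s = start; s <= knee; s *= 2)
        sizes[n++] = s;                 /* e.g. 512, 1024, 2048, 4096 */
    for (int s = knee + step; s <= limit; s += step)
        sizes[n++] = s;                 /* e.g. 6144, 8192, ..., 16384 */
    return n;
}
```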

Different combinations of problem sizes within this range are chosen for evaluation. Problems with an overall grid size of 2048 × 2048 or less are not considered in this experiment since for such cases the execution phase of the kernel is seen to be less than 0.5 s and the measured energy consumption is less than ≈ 5 J. In such scenarios, the energy measurement samples are not reliable enough and useful predictions can't be made. For brevity, the prediction results for a few test problem sizes within this range are shown in Table 5.1 along with the metrics described earlier, while Figure 5.1 shows the error in energy optimality for all the test cases. Overall, the error in prediction EE is observed to be less than 5% in all cases.

Consider two test cases from Table 5.1. For the test case 4096 × 7168 the closest data point in the pre-tuned data is 4096 × 8192, according to the logic described in Section 5.1.1. Similarly, for the case 5120 × 5120 the closest data point is 6144 × 6144. The framework searches the pre-tuned data, finds these points for each of the test problem sizes and fetches the measurements recorded for these data points. It then applies them in the energy model to predict the best split_ratio.

From Table 5.1, the energy savings obtained when using this pre-tuning framework compared to running only on the CPU are seen to be at least ≈ 10% and as high as ≈ 38% in some cases, while those compared to running only on the GPU can be as high as ≈ 19%. This means that, in the absence of a tuning framework, choosing the wrong device to execute the kernel may result in much higher energy consumption than the optimal case.

Figure 5.1: Double precision stencil pre-tuning results on TX1 - all test cases (axes nx and ny, 2048 to 16384). The colormap shows the error in energy optimality EE (%). For larger problem sizes the error in energy optimality is seen to be less than 5%.


Table 5.1: Double precision stencil pre-tuning results on TX1: few problem sizes

nx     ny     Fp   Fo   devFp  EE%  CS%   GS%
512    5120   70   65   1      1.4  8.2   11.2
512    7168   70   65   1      3.1  6.3   9.4
512    9216   70   65   1      3.9  7.0   12.1
1024   3172   65   60   1      2.6  10.2  12.9
1024   9216   65   65   0      0    15.9  14.4
1024   10240  65   60   1      0.8  15.6  15.2
1024   12288  65   65   0      0    16.4  19.1
2048   5120   60   60   0      0    22.1  11.9
2048   7168   60   55   1      0.1  20.1  11.3
2048   9216   60   60   0      0    19.0  9.2
2048   10240  60   60   0      0    19.8  10.8
2048   12288  60   60   0      0    19.2  11.0
4096   3172   60   55   1      1.3  21.9  8.1
4096   7168   60   60   0      0    23.7  11.8
4096   9216   60   50   2      2.3  23.9  6.4
5120   3172   60   60   0      0    20.3  12.4
5120   5120   55   55   0      0    20.7  12.4
5120   6144   55   60   -1     0.1  20.3  12.6
5120   9216   55   60   -1     1.3  22.2  12.2
8192   3172   55   55   0      0    25.9  9.1
8192   7168   45   50   -1     1.7  29.8  3.7
8192   9216   55   50   1      3.5  30.1  2.1
8192   10240  55   50   1      1.0  30.6  2.0
10240  3172   55   55   0      0    24.7  7.6
10240  5120   55   50   1      3.1  27.8  1.1
10240  7168   50   50   0      0    33.3  4.6
12288  3172   55   50   1      2.7  27.3  2.6
12288  5120   55   45   2      1.9  30.4  2.4
16384  3172   0    0    0      0    33.3  0
16384  5120   0    0    0      0    37.6  0

Fp: Predicted split_ratio
Fo: Actual best split_ratio
devFp: Steps of deviation of Fp from Fo (in increments of 5%)
EE%: Energy optimality error wrt optimal split
CS%: Energy savings wrt running only on CPU
GS%: Energy savings wrt running only on GPU


From the results, it is seen that for all larger test cases, where the execution phase of the kernel is at least 0.5 s and the energy consumption is greater than ≈ 5 J, the error in energy optimality is less than 5%. As such, the framework is seen to be beneficial only for larger problem sizes.

The overheads of the pre-tuning runtime framework are measured to be negligible with respect to the execution of the kernel. The time cost of using this approach is less than 0.02% of the kernel's execution time in all cases, while the energy cost is less than 0.0006% of the kernel's execution energy in all cases. The offline pre-tuning phase with the chosen tuning sizes was observed to take around 8 to 10 hours to complete on the TX1 system.

5.1.3.2 Evaluation of Pre-tuning Approach for GEMM

A similar experiment was conducted to evaluate the prediction accuracy of the pre-tuning approach for double precision GEMM on the TX1 platform. Initially, powers of 2 starting with 256 were chosen as pre-tuning sizes for each of the dimensions m, k and n: 256, 512, 1024, 2048, 4096, 8192. However, the gap between subsequent data points in three-dimensional space increases as problem size increases. Therefore a similar heuristic was used to close this gap for larger dimensions. Between 2048 and 8192, tuning points are chosen in increments of 1024, i.e. in each dimension the following 10 sizes are chosen: 256, 512, 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192. This makes up a total of 10³ = 1000 different configurations of problem sizes to cover problem sizes in the range 256 to 8192 in each dimension.

Different combinations of problem sizes within this range are chosen for evaluation. For brevity, the prediction results for a few test problem sizes within this range are shown in Table 5.2 along with the metrics described earlier, while Figure 5.2 shows the error in energy optimality for all the test cases. Overall, the error in prediction EE is observed to be less than 6% in all cases.

From Table 5.2, the energy savings obtained when using this pre-tuning framework compared to running only on the GPU are seen to be at least ≈ 6% and as high as ≈ 40% in some cases, while those compared to running only on the CPU can be as high as ≈ 19%, meaning that in the absence of a tuning framework, choosing the wrong device to execute the kernel results in much higher energy consumption.

Figure 5.2 shows the error in energy optimality for all the test cases. In general, for cases where the energy consumption is greater than ≈ 5 J, the error in energy optimality is seen to be much less than 5%. As such, the framework is seen to be beneficial only for such larger problem sizes (overall size of approximately 1500 × 1500 × 1500 or more).

As with stencil, the overheads of the runtime framework are measured to be negligible with respect to the execution of the kernel. The time overhead is less than 0.01% of the kernel's execution time in all cases, while the energy overhead is less than 0.002% of the kernel's execution energy in all cases. The offline pre-tuning phase with the chosen tuning sizes was observed to take around 16 hours to complete on TX1.


Table 5.2: DGEMM pre-tuning results on TX1: few problem sizes

m      n      k      Fp   Fo   devFp  EE%  CS%   GS%
600    600    600    50   50   0      0    18.0  6.3
850    2000   1200   55   55   0      0    17.3  8.4
850    2000   7168   55   55   0      0    19.1  9.3
850    6000   1200   55   55   0      0    19.4  10.4
3072   3072   3072   60   55   1      2.3  14.9  23.6
3600   768    1200   55   55   0      0    17.1  13.8
3600   768    3072   60   55   1      1.9  15.3  24.3
3600   768    7168   60   55   1      2.0  13.4  27.5
4000   4000   4000   65   55   2      3.2  10.0  28.0
4800   4800   4800   65   65   0      0    4.2   26.0
6125   6125   6125   65   60   1      0.4  5.1   19.8
7500   2000   1200   65   60   1      1.2  9.4   15.9
7500   6000   3072   70   65   1      0.6  3.9   24.2
7168   7168   7168   70   65   1      0.5  2.0   36.6
7500   2000   7168   70   70   0      0    2.4   28.4
7500   6000   7168   70   70   0      0    2.3   27.2
7500   7500   7500   70   65   1      0.6  0.9   26.8
8000   8000   8000   100  100  0      0    0     40.3

Fp: Predicted split_ratio
Fo: Actual best split_ratio
devFp: Deviation of Fp from Fo in increments of 5%
EE%: Energy optimality error wrt optimal split
CS%: Energy savings wrt running only on CPU
GS%: Energy savings wrt running only on GPU


Figure 5.2: DGEMM pre-tuning results on TX1 - all test cases (axes m, k and n, 1000 to 8000). The colormap shows the error in energy optimality EE (%). For larger problem sizes the error in energy optimality is seen to be much less than 5%.


5.2 Dynamic Tuning

In the dynamic tuning approach, there is no a priori information (except for the idle power of the system) and the framework tunes by executing a portion of the overall workload on one of the devices first and subsequently the same portion on the other device. Measurements are collected during these runs. The tuning framework then determines the optimal configuration by feeding this data into the energy model and dynamically applying it to the remaining computation.

This approach is suitable for large, long-running problems where the cost of the initial tuning is amortized over the execution of the entire application using the optimal strategy identified.

Theoretically, tuning on the CPU and GPU could proceed simultaneously if energy measurements could be obtained from both devices at the same time. Since none of our experimental platforms supports this, this work focuses on performing tuning on the CPU and GPU one after the other.

The work done by Siehl and Zhao [2017] is closest to our dynamic runtime tuning framework. Measurements are taken for a portion of the problem at runtime and fed into a simple energy model to predict the optimal split to minimise energy. However, it only measures time and uses an estimation model to estimate the energy consumption. Our framework follows a similar approach, where tuning is performed at runtime for a portion of the overall problem and the prediction of the optimal split ratio is based on real-time energy measurements. In addition, we identify three different methods to achieve dynamic tuning. They are:

• Method 1 - Subset of problem: This method is generally applicable to problems which apply repetitive computations to the input data. Tuning is done by executing the computation associated with a fraction of the problem size on both CPU and GPU.

• Method 2 - Subset of iterations: This is applicable to problems such as iterative solvers which involve multiple iterations of computations on the input data. Tuning is done by executing the computations associated with the whole problem size for a fraction of the total iterations on both CPU and GPU.

• Method 3 - Progressive refinement: This is also applicable to problems involving iterative computations. Tuning is done progressively and refined with each iteration until an optimal split ratio is found.

These methods are described below in the context of the example application kernels, stencil and matrix multiplication.

5.2.1 Implementation of Dynamic Tuning Approach for Stencil

Here, the following functions are invoked to perform the initial tuning phase.


void init_tuning_cpu(int nx, int ny, float* cpu_power, float* cpu_frate)

void init_tuning_gpu(int nx, int ny, float* gpu_power, float* gpu_frate)

These functions, invoked at the beginning, perform the tuning step for the CPU and GPU and collect performance and energy measurements. The measured performance (in GFLOP/s) and power (in Watts) are returned. After the tuning step is completed, the function below is invoked to determine the optimal split ratio to use in order to minimise energy.

float get_best_split(int nx, int ny, int iterations,
                     float cpu_power, float cpu_frate,
                     float gpu_power, float gpu_frate)

All three dynamic tuning methodologies identified earlier are applicable to the stencil computation. Implementation details of each methodology for this kernel are described below.

5.2.1.1 Method 1 - Subset of Problem

Here, a subset of the stencil grid is executed for 1 iteration during the tuning step at runtime. Suppose x% (less than or equal to 50%) of dimension ny is chosen for tuning; this portion is run for 1 iteration exclusively on the GPU first. The next x% is run for 1 iteration exclusively on the CPU. Measurements are collected for these subsets and applied in the model to predict the optimal split ratio. The remaining iterations are run using this predicted split ratio. If x is 50%, the tuning phase covers the first iteration. If x is less than 50%, the remaining computation for the first iteration also needs to be completed using the predicted split ratio.
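The row ranges used in this tuning step might be derived as below. This is a sketch under the stated assumptions; the struct and function names are invented for illustration.

```c
#include <assert.h>

typedef struct { int gpu_lo, gpu_hi, cpu_lo, cpu_hi; } tune_slices;

/* Sketch of Method 1's tuning slices for a stencil grid of ny rows:
 * the first x% of rows run exclusively on the GPU, the next x% on
 * the CPU (x <= 50). Row ranges are half-open [lo, hi). */
tune_slices method1_slices(int ny, int x_pct)
{
    tune_slices t;
    int chunk = ny * x_pct / 100;
    t.gpu_lo = 0;
    t.gpu_hi = chunk;
    t.cpu_lo = chunk;
    t.cpu_hi = 2 * chunk;
    return t;
}
```

With x = 50 the two slices together cover the whole first iteration; for smaller x, the rows from cpu_hi to ny still have to be completed for iteration 1 using the predicted split.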

5.2.1.2 Method 2 - Subset of Iterations

Here, the full grid is executed for a subset of the total iterations, say x% of the maximum number of iterations, during the tuning step at runtime. The first x% of iterations are run exclusively on the GPU. The subsequent x% of iterations are run exclusively on the CPU. Measurements are collected for these subsets and are applied to the energy model in order to predict the optimal split ratio. The remaining iterations are run using this predicted split ratio.

5.2.1.3 Method 3 - Progressive Refinement

The progressive refinement approach starts with an initial estimate for Fp and aims to find the split ratio close to the initial estimate (in steps of 5%) which results in the lowest energy. This ensures that even if the model's initial prediction is away from the optimal value, the framework progressively finds the Fp for which energy usage is lowest. This is done by incrementing (or decrementing) Fp progressively for each subsequent iteration, comparing the energy usage to that of the previous iteration and choosing the value of Fp that resulted in the lowest energy usage. Tuning stops when the energy usage starts increasing or does not change beyond a minimum threshold.

To implement this for the stencil kernel, a number of iterations (say x) is chosen for tuning. Tuning is done by running x iterations on the GPU first. The subsequent x iterations are run on the CPU. Measurements are collected and applied in the model to obtain the initial split ratio Fpi. After the initial split ratio (Fpi) is obtained, the refinement step proceeds as follows:

1. Run x iterations using the split ratio Fp = Fpi.

2. If Fpi != 0, run the next x iterations using the split ratio Fp = Fpi − 5 and measure the energy consumed, Ei.

3. If Fpi == 0 or Ei > Ei−x, fix Fp = Fpi and stepsize = +5.

4. If Ei < Ei−x, fix stepsize = −5.

5. Run the next x iterations using Fp = Fp + stepsize.

6. If Ei < Ei−x, repeat steps 5 and 6 until Ei starts increasing or until there is no further change in energy (within a minimum threshold).
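The steps above can be sketched as a small search loop. Here run_x_iters stands in for executing x iterations at a given split ratio and returning the energy they consumed; all names, and the synthetic energy curve used to exercise the loop, are illustrative assumptions rather than the thesis implementation.

```c
#include <assert.h>

/* Sketch of the progressive-refinement loop (steps 1-6 above).
 * 'run_x_iters' runs x iterations at the given split ratio and
 * returns the energy consumed; 'threshold' is the minimum change
 * in energy considered meaningful. */
int refine_split(int Fp_init, double (*run_x_iters)(int), double threshold)
{
    int Fp = Fp_init;
    double E_prev = run_x_iters(Fp);            /* step 1 */
    int step = +5;                              /* step 3 default */

    if (Fp_init != 0) {                         /* step 2: probe downwards */
        double E = run_x_iters(Fp_init - 5);
        if (E < E_prev) {                       /* step 4 */
            step = -5;
            Fp = Fp_init - 5;
            E_prev = E;
        }
    }
    for (;;) {                                  /* steps 5 and 6 */
        int next = Fp + step;
        if (next < 0 || next > 100)
            break;
        double E = run_x_iters(next);
        if (E >= E_prev || E_prev - E < threshold)
            break;                              /* energy rose or stagnated */
        Fp = next;
        E_prev = E;
    }
    return Fp;                                  /* split with lowest energy seen */
}

/* Synthetic convex energy curve with its minimum at a 60% split,
 * used only to exercise the loop. */
double example_energy(int split)
{
    return (double)(split - 60) * (split - 60) + 10.0;
}
```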

5.2.2 Implementation of Dynamic Tuning Approach for GEMM

Similar to stencil, the following functions are invoked to perform the initial tuning phase.

void init_tuning_cpu(int m, int k, int n, float* cpu_power, float* cpu_frate)

void init_tuning_gpu(int m, int k, int n, float* gpu_power, float* gpu_frate)

These functions perform the tuning step for the CPU and GPU and return the performance and power consumed. After the tuning step is completed, the function below is invoked to determine the optimal split ratio to use in order to minimise energy.

float get_best_split(int m, int k, int n, float cpu_power, float cpu_frate,
                     float gpu_power, float gpu_frate)

Of the three dynamic tuning methodologies identified earlier, only two are applicable to the GEMM kernel: Method 1 - Subset of problem and Method 3 - Progressive refinement.


5.2.2.1 Method 1 - Subset of Problem

This method is implemented for GEMM by executing a subset of the overall workload on both CPU and GPU, one after the other. For a matrix multiplication C = A × B where A, B and C have dimensions m × k, k × n and m × n respectively, which is partitioned by splitting the columns of matrix B between the CPU and GPU, a small percentage x% (much less than 50%) of the dimension n is chosen as the tuning size. The chosen subset of the workload is executed during the tuning step at runtime. This is run exclusively on the GPU first. The next x% of the problem is run exclusively on the CPU. Measurements are collected for these subsets and applied in the model to predict the optimal split ratio. The remaining computation is executed using this predicted split ratio.

5.2.2.2 Method 3 - Progressive Refinement

To implement this method for GEMM, a small percentage (x%) of the workload is chosen as the tuning size, similar to Method 1. The chosen fraction of the workload is executed exclusively on the GPU first. The next x% of the problem is run exclusively on the CPU. Measurements are collected and applied in the model to obtain the initial split ratio Fpi. After the initial split ratio (Fpi) is obtained, the refinement step proceeds similarly to the process for stencil described in Section 5.2.1.3. Here the fraction x% of the problem is executed in each step and the split ratio Fp is adjusted based on the measured energy.

5.2.3 Implementation of Dynamic Tuning Approach for NPB (MZ) BT Solver

The following functions are invoked to perform the initial tuning phase.

void init_tuning_cpu(char Class, float* cpu_power, float* cpu_frate)

void init_tuning_gpu(char Class, float* gpu_power, float* gpu_frate)

These functions, invoked at the beginning, perform the tuning step for the CPU and GPU, during which all the zones are allocated to each device and performance and energy measurements are collected for each device. After the tuning step is completed, the function below is invoked to determine the optimal split ratio to use in order to minimise energy.

float get_best_split(char Class, int iterations,
                     float cpu_power, float cpu_frate,
                     float gpu_power, float gpu_frate)

Since this application involves iterative computation, Method 2 - Subset of iterations is the most suitable for this application.


5.2.3.1 Method 2 - Subset of Iterations

Similar to the implementation for stencil, during the tuning step the input grid is executed for a subset of the total iterations on the GPU and the CPU individually at runtime. Measurements are collected for these subsets and are applied to the energy model in order to predict the optimal split ratio. The remaining iterations are run using this predicted split ratio.

5.2.4 Evaluation of Dynamic Tuning Framework

The evaluations of the different dynamic tuning methods are presented in this section for the stencil and GEMM kernels. For each of the methods, the predicted split ratio Fp, its deviation from the actual optimal split ratio (devFp) and the error in energy optimality EE are measured and reported for each test case, as done for the offline pre-tuning approach (described in Section 5.1.3).

5.2.4.1 Evaluation of Dynamic Tuning Approach for Stencil

As mentioned earlier, all three tuning methods were implemented for the stencil kernel. Each of them is evaluated here for double precision stencil computation on the TX1 system for a number of test cases spanning a range of sizes from 512 to 51200 in each dimension.

For each test case, the predicted split ratio after the tuning phase (Fp) is reported along with its deviation from the optimal (devFp). After the computation completes, the total energy consumed (including the tuning phase) is measured and recorded as Et. To measure the quality of the predicted optimal split ratio, the original problem is executed again without any tuning step, using the predicted split ratio Fp for the entire problem size, and the energy consumed is recorded as Em. The actual optimal energy Eo is measured by empirically testing all split ratios. The error in energy optimality (EE) is computed by comparing Eo against Em. The extra energy incurred due to the tuning phase, termed the energy overhead, is the percentage difference between Et and Em, i.e. energy overhead = (Et − Em)/Em × 100.
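Under these definitions, the reported quantities can be computed as follows (a sketch; the variable names are mine, not the thesis code):

```python
def tuning_metrics(e_t, e_m, e_o, f_p, f_o, step=5):
    """Metrics reported per test case: devFp (deviation of the predicted
    split Fp from the optimal Fo, in `step`% increments), EE (error in
    energy optimality, %) and the tuning energy overhead (%)."""
    dev_fp = round((f_p - f_o) / step)   # e.g. Fp=65, Fo=60 -> devFp = 1
    ee = (e_m - e_o) / e_o * 100         # Em vs empirically optimal Eo
    overhead = (e_t - e_m) / e_m * 100   # (Et - Em)/Em * 100
    return dev_fp, ee, overhead
```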

Method 1 - Subset of Problem

An experiment was conducted to evaluate the prediction accuracy of this method. The fraction of the problem used for tuning, termed the tuning size, is varied in order to observe how the accuracy of the tuning is affected. The tuning sizes chosen are 25%, 30%, 40% and 50% of the dimension ny.
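As a sketch of Method 1's partitioning (a hypothetical helper, not the thesis code), the ny rows of the grid might be divided as:

```python
def split_rows(ny, tuning_frac, fp):
    """Divide the ny rows of the grid for Method 1: the first
    `tuning_frac` of the rows is used for the tuning run, and the
    remainder is split between GPU and CPU according to the predicted
    split ratio fp (%)."""
    tune_rows = int(ny * tuning_frac)
    gpu_rows = round((ny - tune_rows) * fp / 100)
    return {
        "tuning": (0, tune_rows),                  # run on each device in turn
        "gpu": (tune_rows, tune_rows + gpu_rows),  # GPU share of the rest
        "cpu": (tune_rows + gpu_rows, ny),         # CPU share of the rest
    }
```

For example, with ny = 8192, a 50% tuning size and a predicted split of 65%, the tuning run covers rows 0–4095 and the remaining rows are split 65/35 between GPU and CPU.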

Figure 5.3 shows the results on the TX1 system for test problem sizes where nx = 1024, Figure 5.4 shows the results for nx = 2048, and Figure 5.5 shows the results for nx = 8192 and nx = 16384. Each figure shows how the predicted split ratio Fp and EE vary as the tuning size is varied.


116 Developing a Runtime Tuning Framework

The total number of iterations for the stencil is fixed at 50 for all problem sizes. Note that in Figure 5.5, when nx = 16384, the results are the same for ny = 1024, ny = 2048 and ny = 4096, so a single line is displayed for these three cases.

[Figure 5.3: Stencil (DP) dynamic tuning method 1 on TX1 - nx=1024. Subset of the problem used for tuning; tuning size varied from 25% to 50%. Two panels plot Fp (%) and Energy Optimality Error (%) against tuning size (% of ny) for ny = 4096, 8192, 16384, 25600 and 51200 (Fo = 65 in all cases).]

From the results, the prediction accuracy of Fp generally improves, and consequently the error in energy optimality EE reduces, as the tuning size is increased. This is expected, since tuning benefits from capturing more of the problem characteristics. The effect is especially noticeable for smaller problem sizes. From Figure 5.3, in cases where nx = 1024 and ny <= 8192, choosing 25% or 30% for tuning results in much higher error, which improves when the tuning size is increased to 50%. Similar behaviour is seen for nx = 2048, ny = 4096 in Figure 5.4 and for nx = 8192, ny = 1024 in Figure 5.5. This is because the execution time when using the smaller tuning sizes for these problem sizes is very low (<0.03 s), so the energy measurements are not reliable enough to make useful predictions. The energy usage in such cases is much less than 1 J and it is not worth using the framework here.

For all the test cases, as the problem size is increased, choosing a tuning size of 50% gives an Fp generally close to Fo and an EE generally of ≈8% or less. Thus choosing a higher tuning size is beneficial to capture accurate information about the problem and increase prediction accuracy.

For all the test cases, the energy overhead due to tuning is observed to be around 3% to 5%, and for each test problem size it is fairly constant (within 1%) as the tuning size is varied.


[Figure 5.4: Stencil (DP) dynamic tuning method 1 on TX1 - nx=2048. Subset of the problem used for tuning; tuning size varied from 25% to 50%. Two panels plot Fp (%) and Energy Optimality Error (%) against tuning size (% of ny) for ny = 4096, 8192, 16384, 25600 and 51200 (Fo = 55 in all cases).]

[Figure 5.5: Stencil (DP) dynamic tuning method 1 on TX1 - nx=8192, 16384. Subset of the problem used for tuning; tuning size varied from 25% to 50%. Two panels plot Fp (%) and Energy Optimality Error (%) against tuning size (% of ny) for nx = 8192 with ny = 1024, 2048, 4096, 8192 (Fo = 45–50) and nx = 16384 with ny = 1024, 2048, 4096 (Fo = 0).]


When the number of iterations of the stencil is increased, the overhead is expected to become negligible compared to the overall execution, since tuning is done only in the first iteration.

Method 2 - Subset of Iterations

An experiment was conducted to evaluate the prediction accuracy of this method. Since the tuning step is run sequentially on the CPU and the GPU, the number of iterations chosen for tuning needs to be kept low to avoid a high energy overhead due to tuning. For this experiment, the tuning sizes chosen are 1, 2, 3 and 5 iterations each for the CPU and GPU. With the total number of iterations fixed at 50, this corresponds to 2%, 4%, 6% and 10% of the total number of iterations on each processing element.

Similar to Method 1, Figure 5.6 shows the results for test problem sizes where nx = 1024, Figure 5.7 shows the results for nx = 2048, and Figure 5.8 shows the results for nx = 8192 and nx = 16384. Each figure shows how the predicted split ratio Fp and the error in energy optimality EE vary with the number of tuning iterations. Note that in Figure 5.8, when nx = 16384, the results are the same for ny = 1024, ny = 2048 and ny = 4096, so a single line is displayed for these three cases.

[Figure 5.6: Stencil (DP) dynamic tuning method 2 on TX1 - nx=1024. Subset of iterations used for tuning; tuning iterations varied from 1 to 5. Two panels plot Fp (%) and Energy Optimality Error (%) against the number of tuning iterations for ny = 4096, 8192, 16384, 25600 and 51200 (Fo = 65 in all cases).]

From the results, the error in prediction EE is seen to be fairly constant across the different tuning sizes (within 1–2%). For smaller problem sizes, the error is higher when the number of iterations is reduced.


[Figure 5.7: Stencil (DP) dynamic tuning method 2 on TX1 - nx=2048. Subset of iterations used for tuning; tuning iterations varied from 1 to 5. Two panels plot Fp (%) and Energy Optimality Error (%) against the number of tuning iterations for ny = 4096, 8192, 16384, 25600 and 51200 (Fo = 55 in all cases).]

[Figure 5.8: Stencil (DP) dynamic tuning method 2 on TX1 - nx=8192, 16384. Subset of iterations used for tuning; tuning iterations varied from 1 to 5. Two panels plot Fp (%) and Energy Optimality Error (%) against the number of tuning iterations for nx = 8192 with ny = 1024, 2048, 4096, 8192 (Fo = 45–50) and nx = 16384 with ny = 1024, 2048, 4096 (Fo = 0).]


This is attributed to unreliable measurements when the execution time is very low (<0.03 s). In general, for smaller problem sizes, choosing a higher number of tuning iterations is beneficial to capture the problem characteristics more accurately, while for larger problem sizes 1 or 2 iterations during the tuning step are enough to obtain reliable predictions.

In general, for larger problem sizes, Fp is close to Fo and EE is around ≈5% or less across the different tuning sizes. For smaller sizes, when only 1 iteration is chosen for tuning, the error is quite high (>10%) and reduces as the number of tuning iterations is increased.

As expected, the energy overhead increases proportionally with the number of iterations chosen for the tuning phase, since the overhead in this approach is essentially the cost of executing those iterations sequentially on each device. For all problem sizes, with the total number of iterations fixed at 50, the energy overhead is less than ≈3% when tuning uses 1 iteration each on the CPU and GPU, increasing to around 10% to 12% when using 5 iterations each. When the total number of iterations of the computation is increased, this overhead is expected to reduce and become negligible compared to the overall execution.

Overall, the accuracy of this method is better than that of Method 1, since the entire problem size is used for one or more iterations in the tuning phase. However, the energy overhead is higher if more than one iteration is used for tuning.

Method 3 - Progressive Refinement

Figure 5.9 illustrates how the progressive tuning method proceeds with each tuning iteration for three test cases: Fp starts at the initial value Fpi, is refined with each iteration, and finally converges to a locally optimal value. In the case where nx = 4096 and ny = 8192, the local optimum found is not equal to the actual optimum Fo. This is because the difference in measured energy usage between these two values of Fp is less than the threshold of 0.5% (chosen experimentally), so steady state is assumed to have been reached. In the other two cases, the measured energy usage starts increasing when Fp is increased beyond Fo and the actual optimal value is found.
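The refinement loop can be sketched as follows. This is a simplification under stated assumptions: `energy_of(f)` stands in for running one iteration at split f and measuring its energy, and only one search direction is kept once it improves.

```python
def progressive_refine(energy_of, f_init, step=5, threshold=0.005):
    """Refine the split ratio from the initial estimate Fpi: keep moving
    Fp in `step`% increments while the measured energy improves by more
    than `threshold` (0.5%); otherwise assume steady state is reached."""
    f, e = f_init, energy_of(f_init)
    for direction in (step, -step):
        while 0 <= f + direction <= 100:
            e_next = energy_of(f + direction)
            if e_next >= e * (1 - threshold):  # improvement below 0.5%
                break                          # assume steady state
            f, e = f + direction, e_next
        if f != f_init:   # this direction improved; stop searching
            break
    return f
```

With a convex energy curve the loop walks toward the minimum from either side of it; if the curve is nearly flat around the optimum, it may stop one step early, as observed for the 4096 × 8192 case.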

An experiment was conducted to evaluate the accuracy of this method for the stencil kernel. Since the previous experiments in Section 5.2.4.1 for Methods 1 and 2 indicated that dynamic tuning is mainly suitable for large problem sizes, only larger test problem sizes are chosen for evaluation, and 1 iteration is executed in each refinement step. A total of 50 iterations of the stencil computation is performed for each test case.

For brevity, the results for a few test problem sizes are shown in Table 5.3. In addition to reporting devFp, EE and the energy overhead, it also reports the number of iterations needed in the refinement step for Fp to converge to a local optimum.


[Figure 5.9: Stencil progressive refinement tuning - illustration on TX1. Fp (%) is plotted against the i-th tuning iteration of the refinement process for three test cases (4096 × 8192, 1024 × 25600 and 25600 × 1024), with the corresponding Fo marked for each case. Note that this shows only the refinement process; 2 iterations are used initially to estimate Fpi.]

From the results, devFp is mostly 0, with a few exceptions where it is -1 or 1. EE is around 0 to 2% for all test cases. The energy overhead varies depending on the problem size and the number of iterations taken for Fp to converge, and is expected to reduce and become negligible when the total number of iterations of the computation is large.

Overall, the accuracy of this method is better than that of the other two methods, since it is able to correct mistakes in the initial estimate. However, the implementation is somewhat more involved, and if the initial estimate Fpi is far from Fo, more iterations may be required to converge to the optimal value, incurring a higher energy overhead.

5.2.4.2 Evaluation of Dynamic Tuning Approach for GEMM

As mentioned earlier, two of the three tuning methods were found to be applicable to GEMM and were implemented for it. They are evaluated here for a set of test cases spanning a range of sizes from 768 to 8192 in each of the three dimensions.

To evaluate dynamic tuning for GEMM, the same methodology used for stencil (described in Section 5.2.4.1) is followed. The predicted split ratio Fp, its deviation from the actual optimal split ratio devFp and the error in energy optimality EE are measured and recorded, along with the energy overhead incurred by the tuning phase.

Method 1 - Subset of Problem

An experiment was conducted to evaluate how the choice of tuning size affects the accuracy of this method for this kernel. The tuning sizes chosen are 5%, 8%, 10% and 15% of the dimension n (the columns of matrix B).


Table 5.3: Stencil progressive refinement (method 3) tuning results on TX1

nx      ny      Fp   Fo   devFp   EE(%)   Energy Overhead (%)   Iters
512     51200   65   60    1      0.2     2.9                   3
512     102400  65   65    0      0       1.7                   3
1024    16384   60   65   -1      1.2     4.1                   3
1024    25600   65   65    0      0       0.9                   3
2048    8192    55   55    0      0       7.0                   3
2048    16384   55   55    0      0       2.8                   3
2048    25600   55   55    0      0       2.9                   2
4096    4096    55   55    0      0       3.9                   3
4096    8192    50   55   -1      0.02    2.9                   2
4096    16384   55   50    1      0.2     1.7                   2
8192    2048    50   50    0      0       1.9                   3
8192    4096    45   50   -1      0.4     3.0                   3
8192    8192    45   45    0      0       1.7                   3
16384   1024    0    0     0      0       4.9                   2
16384   2048    0    0     0      0       0.6                   2
16384   4096    0    0     0      0       0.9                   2
25600   1024    0    0     0      0       1.2                   2
25600   2048    0    0     0      0       1.9                   2
51200   512     0    0     0      0       0.8                   2
102400  512     0    0     0      0       0.01                  2

Fo: optimal split ratio. Fp: predicted split ratio. devFp: deviation of Fp from Fo in increments of 5%. EE%: error in energy optimality. Energy overhead %: extra energy cost due to tuning. Iters: number of iterations for Fp to converge (excluding the initial 2 iterations used to estimate Fpi).


The results, including the predicted split ratio and the prediction error, for DGEMM on the TX1 system are shown for a few test cases in Figures 5.10, 5.11 and 5.12, while the prediction error for all test cases with tuning sizes of 5% and 10% is shown in Figures 5.13 and 5.14 respectively. Figure 5.10 shows the results for test problem sizes where m = k = 2048, Figure 5.11 for m = k = 4096, and Figure 5.12 for m = k = 8192. Each figure shows how the predicted split ratio (Fp%) and the energy optimality error (EE%) vary as the tuning size is varied.

[Figure 5.10: DGEMM dynamic tuning method 1 on TX1: m = k = 2048. Subset of the problem used for tuning; tuning size varied from 5% to 15%. Two panels plot Fp (%) and Energy Optimality Error (%) against tuning size (% of n) for n = 768, 1024, 2048, 3600, 4096, 5000, 6000 and 8192 (Fo = 45–55 depending on n).]

The prediction of Fp generally improves as the tuning size is increased, especially for smaller problem sizes. This is expected, since tuning benefits from capturing more of the problem. For the problem sizes with n = 768, 1024 and m = k = 2048 in Figure 5.10, the error rate EE is quite high (much greater than 10%) for all the chosen tuning sizes. For n = 1024 and m = k = 4096 in Figure 5.11, EE is high for tuning sizes up to 10% and decreases when the tuning size is increased to 15%. The same is seen in Figure 5.12 for n = 768, 1024 and m = k = 8192. This indicates that for smaller problem dimensions a larger tuning size must be used, so that the subset of the overall problem chosen for tuning captures the problem characteristics more accurately.

Generally, as the problem dimensions are increased, the prediction accuracy is much better, with an error rate below 6% once the tuning size reaches 10% or more.


[Figure 5.11: DGEMM dynamic tuning method 1 on TX1: m = k = 4096. Subset of the problem used for tuning; tuning size varied from 5% to 15%. Two panels plot Fp (%) and Energy Optimality Error (%) against tuning size (% of n) for n = 768, 1024, 2048, 3600, 4096, 5000, 6000 and 8192 (Fo = 55 in all cases).]

[Figure 5.12: DGEMM dynamic tuning method 1 on TX1: m = k = 8192. Subset of the problem used for tuning; tuning size varied from 5% to 15%. Two panels plot Fp (%) and Energy Optimality Error (%) against tuning size (% of n) for n = 768 (Fo = 65) and n = 1024, 2048, 3600, 4096, 5000, 6000, 8192 (Fo = 70).]


The energy overhead due to tuning increases slightly as the tuning size is varied: it is around ≈3% when a tuning size of 5% is used and rises to around ≈10% when the tuning size is increased to 15%. This is expected, since the tuning step essentially involves executing the chosen subset exclusively on the CPU and then on the GPU.

[Figure 5.13: DGEMM dynamic tuning energy optimality error on TX1 for tuning size = 5%. A 3D plot over the dimensions m, k and n (each 1,000–8,000); the colormap shows the error in energy optimality EE (%) on a scale of 0 to 20.]

Figure 5.13 shows the energy optimality error for all test cases when a tuning size of 5% is used, and Figure 5.14 shows the error with a tuning size of 10%. There are more cases with an error rate above 5% in Figure 5.13 than in Figure 5.14, mainly when the dimension n is small. The figures illustrate how the error generally improves when the tuning size is increased.

Method 3 - Progressive Refinement

An experiment was conducted to evaluate the accuracy of this method for DGEMM on the TX1. Since the results of the previous experiments for Method 1 (Section 5.2.4.2) indicated that dynamic tuning is mainly suitable for large problem sizes, larger test problem sizes are chosen for evaluation.


[Figure 5.14: DGEMM dynamic tuning energy optimality error on TX1 for tuning size = 10%. A 3D plot over the dimensions m, k and n (each 1,000–8,000); the colormap shows the error in energy optimality EE (%) on a scale of 0 to 20.]


A tuning size of 10% is chosen for this method, as the previous results showed that 5% is not enough for sampling in most cases.

The results for a few square and non-square problem sizes are shown in Table 5.4. The table reports devFp, EE and the energy overhead, along with the number of iterations needed in the refinement step for Fp to converge to a local optimum.

Table 5.4: DGEMM progressive refinement (method 3) tuning results on TX1

m      n      k      Fp   Fo   devFp   EE(%)   Energy Overhead (%)   Iters
2048   2048   2048   50   55   -1      3.9     13.1                  2
2048   3600   2048   50   50    0      0       16.2                  3
2048   4096   2048   50   55   -1      3.5     6.5                   3
2048   5000   2048   50   55   -1      0.6     11.5                  3
2048   6000   2048   55   55    0      0       11.3                  4
2048   8192   2048   55   55    0      0       8.6                   4
4096   2048   4096   50   55   -1      3.1     12.9                  3
4096   3600   4096   55   55    0      0       12.3                  3
4096   4096   4096   55   55    0      0       11.3                  3
4096   5000   4096   55   55    0      0       6.1                   2
4096   6000   4096   55   55    0      0       8.6                   2
4096   8192   4096   55   55    0      0       7.1                   3
8192   2048   8192   65   70   -1      1.2     11.9                  3
8192   3600   8192   70   70    0      0       11.6                  4
8192   4096   8192   70   70    0      0       10.2                  4
8192   5000   8192   70   70    0      0       8.8                   3
8192   6000   8192   70   70    0      0       8.2                   3
8192   8192   8192   70   70    0      0       6.7                   3

Fo: optimal split ratio. Fp: predicted split ratio. devFp: deviation of Fp from Fo in increments of 5%. EE%: error in energy optimality. Energy overhead %: extra energy cost due to tuning. Iters: number of iterations for Fp to converge (excluding the initial 2 iterations used to estimate Fpi).

From the results, devFp is mostly 0, with a few exceptions where it is -1. EE is less than 4% in all test cases. The energy overhead varies depending on the problem size and the number of tuning steps (iterations) taken for Fp to converge. For smaller problem sizes the tuning overhead is high (>10%), and it generally decreases for larger problem sizes.

Overall, the accuracy of this method is better than that of Method 1, since it is able to correct mistakes in the initial estimate.


However, the energy overhead of this method for the GEMM kernel is higher in most test cases compared to Method 1, especially when many tuning steps are required for Fp to converge.

5.2.4.3 Evaluation of Dynamic Tuning Approach for NPB (MZ) BT Solver

As mentioned earlier, the “subset of iterations” method of dynamic tuning was implemented for the multizone NPB BT solver.

Method 2 - Subset of Iterations

An experiment was conducted to evaluate the prediction accuracy of this method for this application on the TX1. The problem sizes chosen are class W, class A and class B, considering the available memory. 1 iteration each is used for tuning on the CPU and the GPU for each of the three problem classes. The total number of iterations is set to 200, which is the default for these problem classes.

For the purposes of evaluating this method's accuracy in predicting the optimal split, a controlled experiment was undertaken in which 1 thread is used for the CPU execution of the application. This adjusts the relative performance of the CPU and GPU for this application, ensuring that Fo is not simply 100% for each problem class.

The same methodology used for the other applications is followed for evaluation. The predicted split ratio Fp, its deviation from the actual optimal split ratio devFp and the error in energy optimality EE are measured and recorded, along with the energy overhead incurred by the tuning phase.

The results for each of the three problem classes are shown in Table 5.5. For each class, the framework predicts the optimal split ratio with 100% accuracy. The energy overhead due to the tuning phase increases slightly with problem size, as it involves sequentially executing the computations for 1 iteration on each device individually. Overall, the overhead is less than 2.8% in all cases.

Table 5.5: NPB (MZ) BT solver dynamic tuning method 2 results on TX1

Class   Grid size        Zones   Fp    Fo    devFp   EE(%)   Energy Overhead (%)
W       64 x 64 x 8      16      100   100   0       0       1.8
A       128 x 128 x 16   16      50    50    0       0       2.5
B       304 x 208 x 17   64      50    50    0       0       2.8

Fo: optimal split ratio. Fp: predicted split ratio. devFp: deviation of Fp from Fo in increments of 5%. EE%: error in energy optimality. Energy overhead %: extra energy cost due to tuning.


5.3 Critique of Runtime Tuning Framework

Although both tuning approaches, namely offline pre-tuning and dynamic tuning, were implemented and evaluated for two kernels, the approaches are general and could be extended to other similar applications which consist of repetitive computations over a period of time.

The pre-tuning approach requires a priori information collected from a one-time offline tuning phase and is expected to help achieve energy optimality for applications which will be used extensively on a platform, such as BLAS routines and stencil kernels, which are building blocks for many scientific applications. However, the pre-tuning phase potentially requires a large number of pre-tuning problem sizes to provide good coverage of the problem space and could be very time consuming, depending on the application.

The currently implemented pre-tuning framework makes predictions based on parameters obtained from discrete data points during pre-tuning, so the choice of pre-tuning problem sizes affects the tuning accuracy. The pre-tuning sizes need to be chosen such that the different levels of a platform's memory hierarchy are covered, and this varies with the application. An alternative to fetching parameters for an unknown problem size from discrete tuning data points is to use curve fitting over a reduced set of pre-tuned data points: the energy model parameters for the input problem size can then be determined from the fitted curve and used to predict the optimal split ratio.
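A sketch of this curve-fitting alternative, with invented sample values: the pre-tuned sizes and per-element energy figures below are illustrative only, not measured data.

```python
import numpy as np

# Pre-tuned data points: problem size vs one model parameter (here an
# invented energy-per-element figure in nJ); values are illustrative only.
sizes = np.array([1024, 2048, 4096, 8192, 16384])
energy_per_elem = np.array([4.1, 3.6, 3.3, 3.15, 3.1])

# Fit a low-order polynomial in log2(size) to the pre-tuned points...
coeffs = np.polyfit(np.log2(sizes), energy_per_elem, deg=2)

def predict(n):
    # ...and interpolate the model parameter for an unseen problem size.
    return float(np.polyval(coeffs, np.log2(n)))
```

The interpolated parameter for, say, n = 3000 then feeds the energy model in place of the nearest discrete pre-tuned point.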

The dynamic tuning approach requires no a priori information about the application or the platform and gathers the necessary parameters at runtime. It is beneficial mainly for long-running applications where a portion of the overall problem can be used for tuning. Three methods were identified for performing dynamic tuning. The first involves tuning on a portion of the problem size and is generally applicable to most repetitive computation kernels, including BLAS routines, image processing kernels and partial differential equation (PDE) solvers. The second is applicable to iterative solvers, such as image processing kernels and PDE solvers, and involves tuning by solving the entire problem for a fraction of the overall iterations. The third involves repeatedly tuning on a subset of the overall problem and is applicable to most repetitive computation kernels, such as the applications mentioned above. The prediction accuracy of the second method is better than that of the first, since the entire input problem is used for at least one tuning iteration, although its overhead is higher if more iterations are used to obtain samples. The third method shows the best accuracy of the three because of its ability to correct mistakes in the initial prediction. The overhead of all the methods is expected to be negligible for long-running computations.

It must be noted that the framework is designed only for applications whose computational and memory access patterns are regular and repetitive. In such cases, assuming the runtime tuning framework can sample the repetitive pattern appropriately, it is expected to provide fair predictions. However, for applications such as sorting algorithms and other kernels which involve irregular memory access or multiple levels of memory indirection, the framework may not be able to model the pattern accurately, especially if the memory access patterns keep changing at runtime.

5.4 Summary

This chapter presented the design and development of a proof-of-concept runtime framework which analyses energy measurements at runtime and determines how to partition an application between the different devices of a heterogeneous platform in order to minimise energy to solution.

Two different approaches were considered. Although the methodology of the framework is general, the approaches were demonstrated for two critical application kernels, namely stencil and matrix multiplication, and evaluated over a range of problem sizes for each kernel. The results show that sufficiently large problems benefit from using this framework, enabling them to be executed such that the energy consumed is within ≈5% of the optimal energy consumption. Future work aims to demonstrate and evaluate the framework for other classes of applications.


Chapter 6

Conclusions and Future Work

This work assesses the suitability for scientific computing of the Adapteva Epiphany-IV NoC and the Tegra family of LPSoC processors, which are designed for use in a mobile context. Three main issues were considered, namely i) developing efficient applications for these systems, ii) measuring the energy usage of such low-power systems, and iii) minimising energy to solution for an application executing on heterogeneous systems.

Part of the work undertaken for this thesis constituted the first in-depth study of the capabilities of the Epiphany-IV NoC, demonstrating how high-performance scientific applications can be written for the architecture. The challenges involved in programming the Epiphany-IV NoC were described in Chapter 3. Considerable effort was required to achieve the best performance for the GEMM and stencil applications on this architecture. Performance on a single core was maximised by writing hand-tuned assembly code which carefully utilises all the registers and the limited memory. Cannon's algorithm was used to implement multicore GEMM, cycling operand data through the cores and avoiding multiple loads of operand data from DRAM. Due to memory limitations, a novel “half-buffering” scheme was employed to orchestrate data transfer.

The ease of writing parallel programs for the Epiphany platform has been improved by recent developments which add support for OpenMP, MPI and OpenCL. However, significant programmer effort is still required to extract the best performance from each core. A number of limitations were identified for this chip, including limited on-chip memory, the lack of native double-precision support and limited off-chip memory bandwidth. Due to slow off-chip memory transfer rates, performance drops drastically for workloads which do not fit in the on-chip memory, effectively crippling performance for realistic scientific workloads.

In 2016 the Epiphany-V [Olofsson, 2012] was announced, containing 1024 cores, double-precision floating-point support, 64 kB of memory per core and improved power efficiency compared to the Epiphany-IV. However, it never went into production, and development of the Epiphany architecture subsequently stopped. Many of the strategies used for this platform are relevant to other many-core architectures such as the SW26010 many-core processor used in the Sunway TaihuLight [Fu et al., 2016]. The Epiphany architecture has similarities with the design of the "computing processing element" (CPE) cluster of the SW26010 processor, which is organised as an 8 × 8 mesh of limited 64-bit RISC cores. The SW26010 was designed with a unique hardware feature, "Register Level Communication" (RLC), which enables sharing of register data among all the CPEs via the mesh network, necessitating the design of specific algorithms to achieve peak performance [Lin et al., 2017]. In comparison, the Epiphany architecture has a scratchpad memory which is shared between all the cores in the mesh network. In general, architecture-specific optimizations are invariably required to obtain near-peak performance on such many-core processors.

Strategies to partition work between the on-chip compute devices in a heterogeneous platform such as NVIDIA's Tegra SoCs were presented. Partitioned versions of the stencil and GEMM kernels were implemented in which the workload is split between the CPU and GPU. The shared memory architecture of the Tegra systems enables the CPU and the GPU to access the same physical memory simultaneously, making them well suited to concurrent execution. Results showed that using both the CPU and GPU (or accelerators) simultaneously for a computation leads to a performance-optimal work partition where the load is balanced, and that this partition depends on the relative performance of each compute device. The methodologies for simultaneous execution by the CPU and the GPU are general and can be applied to other heterogeneous platforms, including other LPSoCs and conventional systems with discrete accelerators which support concurrent access to memory. On systems where concurrent access by different devices is not supported, double buffering can be used to achieve the same effect.
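The balance point can be written down directly: if the per-device throughputs on a kernel are known, splitting work in proportion to throughput makes both devices finish together, which minimises the concurrent runtime. A minimal sketch with illustrative rates (not measured values):

```python
def balanced_partition(rate_cpu, rate_gpu):
    """Fraction of the workload to give the CPU so that both devices
    finish together: work is split in proportion to throughput."""
    return rate_cpu / (rate_cpu + rate_gpu)

def time_to_solution(work, alpha, rate_cpu, rate_gpu):
    """Concurrent execution: runtime is set by the slower device."""
    return max(alpha * work / rate_cpu, (1 - alpha) * work / rate_gpu)
```

For example, with a CPU rate of 100 and a GPU rate of 300 work units per second, the balanced split gives the CPU one quarter of the work, and any other split only lengthens the runtime.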

Using the Tegra LPSoCs as test heterogeneous platforms, the effect of work partitioning on energy usage was studied in Chapter 4. To measure the energy usage of an application executing on these systems, a high-resolution, non-intrusive measurement framework was developed along with an easy-to-use measurement API. This framework enables a running program to obtain its energy usage at the function level and can be used on other similar low-power platforms.
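The measurement idea can be sketched as follows. This is a simplified model of the framework rather than its implementation: read_power is a hypothetical stand-in for the external sensor readout, and a Python background thread approximates the high-resolution sampling of the real hardware setup:

```python
import threading
import time

class EnergyMeter:
    """Sketch of a function-level energy measurement API: a background
    thread samples instantaneous power (W) and integrates it over time
    to give energy (J) for the enclosed region of code."""

    def __init__(self, read_power, period=0.001):
        self.read_power = read_power  # hypothetical sensor readout
        self.period = period
        self.energy = 0.0

    def __enter__(self):
        self._stop = threading.Event()

        def sample():
            last = time.perf_counter()
            while not self._stop.is_set():
                time.sleep(self.period)
                now = time.perf_counter()
                # rectangle-rule integration of power over elapsed time
                self.energy += self.read_power() * (now - last)
                last = now

        self._t = threading.Thread(target=sample)
        self._t.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._t.join()
```

Wrapping a function call in the meter then yields that function's energy usage, which is how function-level attribution is obtained.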

Chapter 4 also described an energy usage model suited to LPSoCs that is able to predict the optimal partitioning of an application between the different compute devices in order to optimise its energy usage. The inputs to the model are the application's measured performance and power usage when executed individually on each compute device, and the idle power draw of each device. Evaluation was done for three important applications: two core computational science kernels, namely matrix multiplication and stencil computation, and the complex block tridiagonal benchmark from the multi-zone NAS parallel benchmark suite. Results from the three Tegra systems, namely the Tegra K1, Tegra X1 and Tegra Xavier, were provided. For each application, the model was observed to be effective in predicting the optimal work partition for large problem sizes.
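The flavour of such a model can be sketched as follows (a simplified rendition of the idea, not the exact equations of Chapter 4): each device draws its active power while computing its share and its idle power while waiting for the other device to finish, and the predicted optimum is the split that minimises the total.

```python
def energy_to_solution(work, alpha, rate, power, idle):
    """Estimated energy for giving fraction alpha of `work` to device 0.

    rate, power and idle are two-element tuples (device 0, device 1) of
    throughput, active power and idle power.  A device that finishes
    early draws only its idle power while waiting for the other.
    """
    t = (alpha * work / rate[0], (1 - alpha) * work / rate[1])
    total = max(t)
    return (power[0] * t[0] + idle[0] * (total - t[0]) +
            power[1] * t[1] + idle[1] * (total - t[1]))

def predict_optimal_alpha(work, rate, power, idle, steps=100):
    """Grid search for the energy-minimal work partition."""
    return min((i / steps for i in range(steps + 1)),
               key=lambda a: energy_to_solution(work, a, rate, power, idle))
```

The inputs mirror those named in the text: per-device performance and power measured in isolation, plus idle power draw.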

Chapter 5 explored how runtime energy tuning can be practically achieved for a running application using real-time energy measurements and the energy model.



Two different approaches were considered: offline pre-tuning and dynamic tuning. The pre-tuning approach is based on collecting information a priori from a one-time offline tuning phase. This is beneficial for applications which are used extensively over time on a given platform, such as linear algebra routines and stencil kernels, which are building blocks for many scientific applications. The dynamic tuning approach requires no a-priori information about the application or the platform and gathers the necessary parameters at runtime. It is beneficial mainly for long-running applications where a portion of the overall problem may be used for tuning. This approach can generally be applied to repetitive computation kernels, including BLAS routines, image processing libraries and partial differential equation (PDE) solvers.

Three different methods were identified for performing dynamic tuning: i) tuning based on estimates gained from solving a portion of the problem size, ii) tuning based on solving the entire problem size for a fraction of the total number of steps (for iterative problems), and iii) tuning based on progressively solving portions of the entire problem a small number of times, with the expectation that the final solution involves many repetitions of this. The suitability of the different approaches for the three applications mentioned earlier was evaluated and discussed. For long-running computations the tuning approaches are seen to be generally effective in predicting the energy-optimal partition, with an error of around 5-7%. It was noted that the framework is designed only for applications whose computational and memory access patterns are regular and repetitive, such that the computational pattern can be reliably sampled. It may not be useful for computations where the memory access patterns change continuously at runtime.
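Method (i) can be sketched as follows; kernel_cpu and kernel_gpu are hypothetical stand-ins for launching the real computation on each device, and the sample fraction is illustrative:

```python
import time

def measure_rate(kernel, sample_work):
    """Time a small sample of the problem on one device and return its
    throughput in work units per second."""
    start = time.perf_counter()
    kernel(sample_work)
    return sample_work / (time.perf_counter() - start)

def tune_partition(kernel_cpu, kernel_gpu, total_work, sample_frac=0.01):
    """Method (i): estimate per-device rates from a small sample of the
    problem, then derive the balanced split for the remaining work.
    kernel_cpu / kernel_gpu are hypothetical stand-ins for the real
    device kernels."""
    sample = max(1, int(total_work * sample_frac))
    rate_cpu = measure_rate(kernel_cpu, sample)
    rate_gpu = measure_rate(kernel_gpu, sample)
    return rate_cpu / (rate_cpu + rate_gpu)
```

Methods (ii) and (iii) differ only in what is sampled: a fraction of the iterations, or repeated small portions of the full problem.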

6.1 Performance-energy Trade-offs and Energy Efficiency

From the experimental results, a few trade-offs can be observed with respect to performance and energy-to-solution. Using both the CPU and accelerator in a heterogeneous platform typically maximises absolute performance; however, it may not always be energy-optimal to use both. From the results for DGEMM on 4K matrices, on the TX1 and Xavier the performance-optimal work partitioning is seen to minimise the energy-to-solution as well, while this is not the case for the TK1. For the same computation on the TK1, although it is energy-optimal to allocate all the work to the GPU, performance is increased by almost 2× by allocating 50% of the work to each of the CPU and the GPU, with only a very small increase in energy (1.1×). While this is an acceptable trade-off in most cases, in environments where energy cost is the most critical factor it would be desirable to minimise energy-to-solution altogether.

The energy efficiency target for an exascale system with a power budget of 20 MW was estimated to be 5-10 pJ/FLOP (in double precision) by Dubé [2011] (1 pJ = 10^-12 J). From the experimental results for GEMM using the Tegra LPSoCs, the best energy efficiency observed for single precision was 20 pJ/FLOP, using only the GPU on the Xavier system, while efficiencies of 56 pJ/FLOP and 28 pJ/FLOP were observed for the TK1 and TX1 respectively. In double precision, the Xavier achieved the best efficiency of 343 pJ/FLOP when splitting the workload between the CPU and the GPU; the TK1 and TX1 achieved 485 pJ/FLOP and 613 pJ/FLOP respectively. Thus the Xavier system is observed to be more energy efficient than the TK1 and TX1.
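For reference, the pJ/FLOP figures follow directly from a measured energy-to-solution and the kernel's operation count (the numbers below are illustrative, not the measured runs):

```python
def pj_per_flop(energy_joules, gflop_total):
    """Convert a measured energy-to-solution into pJ/FLOP.
    1 J = 1e12 pJ; gflop_total is the total operation count in GFLOP."""
    return energy_joules * 1e12 / (gflop_total * 1e9)

# A GEMM on 4096 x 4096 matrices performs 2 * 4096**3 flops,
# i.e. about 137.4 GFLOP.
gemm_4k_gflop = 2 * 4096**3 / 1e9
```

Dividing a run's joules by its flop count in this way is how the per-platform efficiencies above are obtained.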

Allocating all the work to the GPU was always more energy-efficient on the conventional HPC hardware. The Sandy system containing a K20 GPU achieved 89 pJ/FLOP in single precision and 221 pJ/FLOP in double precision. Improved efficiency is observed in the Haswell system containing a K80 GPU, with readings of 66 pJ/FLOP in single precision and 189 pJ/FLOP in double precision. Further improvement is seen in the state-of-the-art NVIDIA V100 GPU, with Jack Dongarra [2018] reporting an energy efficiency of 33 pJ/FLOP in single precision and 72 pJ/FLOP in double precision.

From these results it is observed that while the single-precision efficiency of the Xavier is marginally better than that of the V100 GPU, the V100 far outperforms the Tegra SoCs in terms of double-precision energy efficiency. It can be concluded that while the efficiency of single-precision (and half-precision) computations on contemporary LPSoC processors is comparable to conventional hardware, they currently have limited effectiveness for double-precision workloads. This is mainly because the design of contemporary LPSoC processors is driven by mobile computing rather than high performance computing. However, this limitation is not necessarily a barrier to adopting them for HPC, as these systems can be very efficient for application areas like deep learning, which make use of half-precision computations, and for mixed-precision scientific computations, which are now gaining traction [Sorna et al., 2018; Jack Dongarra, 2018; Zhang et al., 2019].

6.2 Future Work

This thesis used the energy-to-solution metric to calculate energy efficiency. While this is useful in environments where the cost of absolute energy usage is the most critical factor, it would be beneficial to consider other metrics, such as energy-delay product (EDP) and performance per total cost of operation (TCO), which consider performance degradation along with energy usage, for more rigorous analysis.
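Under EDP, E × T, runtime is weighted as heavily as energy, so the TK1 trade-off noted in Section 6.1 (roughly 2× performance for 1.1× energy) scores better than the energy-minimal GPU-only split. A minimal sketch in relative units:

```python
def edp(energy, runtime):
    """Energy-delay product: penalises slowdowns as well as energy use."""
    return energy * runtime

# Relative units: GPU-only split on TK1 = (energy 1.0, runtime 1.0);
# the 50/50 split costs ~1.1x the energy at ~half the runtime.
assert edp(1.1, 0.5) < edp(1.0, 1.0)
```

The choice of metric thus changes which partition is "optimal", which is why such alternatives merit a separate analysis.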

This work presented an energy measurement framework along with a software API which enables an application running on low-power systems to obtain real-time energy measurements at the function level. It would be beneficial to integrate this into existing unified measurement frameworks such as EnergyMon [Imes et al., 2016], which provides a portable interface for collecting energy measurements from diverse sources.

The energy model presented in this work is used to predict the optimal partitioning of a workload in order to minimise energy consumption, but it does not consider dynamic scaling of voltage and frequency. It is observed that the overall energy consumed by a system may be further minimised by scaling the frequency of the compute devices. Scaling can be applied to different components such as the CPU, accelerator and memory subsystem, providing a number of possible configurations to consider. Dynamically determining the configuration of frequencies of the different devices in order to minimise energy-to-solution for any given application would be of benefit and has been identified as future work.

Dynamically determining how to utilize the different available compute devices in order to optimise the energy usage of a running application was explored using a proof-of-concept runtime framework. Three applications were used for evaluation of the framework. Future work will aim to extend this by applying the tuning framework to other complex HPC applications as well. Also of interest is exploring other low-power multicore processor architectures such as ARM big.LITTLE SoCs, which use a combination of big (fast) and little (slow) cores.

It would be interesting to integrate the tuning methodologies explored here into programming frameworks like OpenCL, which offer a "write once, run anywhere" paradigm (on supported hardware), to enable automatic energy-optimal distribution of a workload. Since the release of the OpenMP accelerator model, offloading of execution to accelerators is supported in OpenMP using the target directive. Our proof-of-concept runtime framework could serve as a template for extending existing work-sharing constructs in OpenMP to dynamically distribute the workload between different compute devices by tuning for energy at runtime.




Appendix A

Experimental Hardware Platforms

Four LPSoC platforms and two Intel-based platforms were used for various experiments. They are summarised in Table A.1.

Platform   CPU              Cores  Max Freq  RAM     Accelerator  Acc Cores  Acc Max Freq  Acc RAM          Acc SDK    Linux Kernel
Epiphany   Cortex-A9        2      667 MHz   512 MB  Epiphany-IV  64         600 MHz       Shared 32 MB     eSDK 5.13  3.12.0 armhf
TK1        Cortex-A15       4      2.3 GHz   2 GB    GK20A        192        852 MHz       Shared host mem  CUDA 6.5   3.10.40 armhf
TX1        Cortex-A57       4      1.9 GHz   4 GB    GM20B        256        998 MHz       Shared host mem  CUDA 7.0   3.10.67 aarch64
Xavier     Carmel           8      2.2 GHz   16 GB   GV10B        512        1.3 GHz       Shared host mem  CUDA 10.0  4.9.108 aarch64
SANDY      Xeon E5-2665     2×8    2.4 GHz   128 GB  K20m         2496       706 MHz       5 GB             CUDA 7.0   3.13.0
HASWELL    Xeon E5-2620 v3  2×6    2.4 GHz   128 GB  K80          2496       875 MHz       12 GB            CUDA 7.5   3.16.0

Table A.1: Hardware Platforms for Experiments

The LPSoC systems are the Epiphany-IV NoC, Tegra K1, Tegra X1 and the Tegra Xavier systems described in Section 2.1. To evaluate the Epiphany system, we used a prototype ZedBoard [ZedBoard] evaluation module which contains an FPGA "daughter" card (FMC) housing the Epiphany-IV 64-core (E64G401) chip. The host processor is a dual-core ARM Cortex-A9 CPU housed in a Zynq system-on-chip. The board contains 512 MB of memory, out of which a 32 MB region is shared with the Epiphany chip.

To evaluate the Tegra K1 LPSoC, we used the NVIDIA Jetson TK1 development board [NVIDIA, 2014a]. It contains the Tegra K1 LPSoC housing a quad-core ARM Cortex-A15 CPU and an NVIDIA Kepler GPU with 192 CUDA cores. The board contains 2 GB of memory which is shared between the CPU and GPU.

To evaluate the Tegra X1 LPSoC, we used the NVIDIA Jetson TX1 development board [NVIDIA, 2015a]. It contains the Tegra X1 LPSoC featuring a 64-bit quad-core ARM Cortex-A57 CPU along with an on-chip NVIDIA Maxwell GPU with 256 CUDA cores. The board contains 4 GB of memory which is shared between the CPU and GPU.

To evaluate the Xavier LPSoC, we used the Jetson AGX Xavier development kit [NVIDIA, 2018b]. It contains the Xavier LPSoC featuring a 64-bit octa-core custom Carmel CPU along with an on-chip NVIDIA Volta GPU with 512 CUDA cores. The board contains 16 GB of memory which is shared between the CPU and GPU.

Two conventional Intel-based HPC systems with attached accelerators were also used. The first system (SANDY) contains a dual-socket 16-core Xeon E5-2665 Sandy Bridge processor and a discrete NVIDIA Tesla K20m card. The second system (HASWELL) contains a dual-socket 12-core Xeon E5-2620 v3 Haswell processor and a discrete NVIDIA Tesla K80 card which houses two GK210 GPUs. Both systems contain 128 GB of host memory. The K20 GPU in the SANDY system contains 5 GB of GPU memory, while the K80 GPU in the HASWELL system contains 12 GB of GPU memory.


Appendix B

Developing Optimised Applications for the Epiphany-IV NoC

The main snippets of assembly code for implementing the high performance stencil and matrix multiplication application kernels for the Epiphany chip are given in the following sections. These have been published in the Parallella examples repository (https://github.com/parallella/parallella-examples).

B.1 Stencil

Listing B.1: Assembly code for high performance stencil kernel

    .set stride, 62

; Macro to perform calculations for 5 grid points in a row
.macro dogrid0 a0,a1,a2,a3,a4,b0,b1,b2,b3,b4,b5,b6,o0,o1,o2,o3,o4,i0,i1,i2,i3,i4
    fmadd r15,r\a0,r3
    str   r8,[r0,#\o0]
    fmadd r16,r\a1,r3
    str   r9,[r0,#\o1]
    fmadd r17,r\a2,r3
    str   r10,[r0,#\o2]
    fmadd r18,r\a3,r3
    str   r11,[r0,#\o3]
    fmadd r19,r\a4,r3
    str   r14,[r0,#\o4]
    fmadd r15,r\b0,r4
    fmadd r16,r\b1,r4
    fmadd r17,r\b2,r4
    fmadd r18,r\b3,r4
    fmadd r19,r\b4,r4
    fmadd r15,r\b1,r5
    ldr   r\a0,[r0,#\i0 + stride]
    fmadd r16,r\b2,r5
    ldr   r\a1,[r0,#\i1 + stride]
    fmadd r17,r\b3,r5
    ldr   r\a2,[r0,#\i2 + stride]
    fmadd r18,r\b4,r5
    ldr   r\a3,[r0,#\i3 + stride]
    fmadd r19,r\b5,r5
    ldr   r\a4,[r0,#\i4 + stride]
    fmadd r15,r\b2,r6
    eor   r8,r8,r8
    fmadd r16,r\b3,r6
    eor   r9,r9,r9
    fmadd r17,r\b4,r6
    eor   r10,r10,r10
    fmadd r18,r\b5,r6
    eor   r11,r11,r11
    fmadd r19,r\b6,r6
    eor   r14,r14,r14
    fmadd r15,r\a0,r7
    fmadd r16,r\a1,r7
    fmadd r17,r\a2,r7
    fmadd r18,r\a3,r7
    fmadd r19,r\a4,r7
.endm

; Macro to perform calculations for 5 grid points in a row
.macro dogrid1 a0,a1,a2,a3,a4,b0,b1,b2,b3,b4,b5,b6,o0,o1,o2,o3,o4,i0,i1,i2,i3,i4
    fmadd r8,r\a0,r3
    str   r15,[r0,#\o0]
    fmadd r9,r\a1,r3
    str   r16,[r0,#\o1]
    fmadd r10,r\a2,r3
    str   r17,[r0,#\o2]
    fmadd r11,r\a3,r3
    str   r18,[r0,#\o3]
    fmadd r14,r\a4,r3
    str   r19,[r0,#\o4]
    fmadd r8,r\b0,r4
    fmadd r9,r\b1,r4
    fmadd r10,r\b2,r4
    fmadd r11,r\b3,r4
    fmadd r14,r\b4,r4
    fmadd r8,r\b1,r5
    ldr   r\a0,[r0,#\i0 + stride]
    fmadd r9,r\b2,r5
    ldr   r\a1,[r0,#\i1 + stride]
    fmadd r10,r\b3,r5
    ldr   r\a2,[r0,#\i2 + stride]
    fmadd r11,r\b4,r5
    ldr   r\a3,[r0,#\i3 + stride]
    fmadd r14,r\b5,r5
    ldr   r\a4,[r0,#\i4 + stride]
    fmadd r8,r\b2,r6
    eor   r15,r15,r15
    fmadd r9,r\b3,r6
    eor   r16,r16,r16
    fmadd r10,r\b4,r6
    eor   r17,r17,r17
    fmadd r11,r\b5,r6
    eor   r18,r18,r18
    fmadd r14,r\b6,r6
    eor   r19,r19,r19
    fmadd r8,r\a0,r7
    fmadd r9,r\a1,r7
    fmadd r10,r\a2,r7
    fmadd r11,r\a3,r7
    fmadd r14,r\a4,r7
.endm

; Start of stencil compute function

; load the coefficients
    ldr r14,[r3,#5]
    ldr r7,[r3,#4]
    ldr r6,[r3,#3]
    ldr r5,[r3,#2]
    ldr r4,[r3,#1]
    ldr r3,[r3,#0]
    nop

; preload the first two rows
    ldrd r20,[r0,#0]
    ldrd r22,[r0,#1]
    ldrd r24,[r0,#2]
    ldrd r26,[r0,#3]
    ldrd r28,[r0,#4]
    ldrd r30,[r0,#5]
    ldrd r32,[r0,#6]
    ldrd r34,[r0,#7]
    ldrd r36,[r0,#8]
    ldrd r38,[r0,#9]
    ldrd r40,[r0,#10]
; row 2
    add  r0,r0,#(stride * 4)
    ldrd r42,[r0,#0]
    ldrd r44,[r0,#1]
    ldrd r46,[r0,#2]
    ldrd r48,[r0,#3]
    ldrd r50,[r0,#4]
    ldrd r52,[r0,#5]
    ldrd r54,[r0,#6]
    ldrd r56,[r0,#7]
    ldrd r58,[r0,#8]
    ldrd r60,[r0,#9]
    ldrd r62,[r0,#10]
    add  r1,r1,#(stride * 4)
; clear first 5 results
    eor r15,r15,r15
    eor r16,r16,r16
    eor r17,r17,r17
    eor r18,r18,r18
    eor r19,r19,r19
    ldr r42,[r1,#0]          ; reload "old" left-hand value
    str r62,[r1],#stride     ; save right-hand value to left

; Perform calculations for all rows (2 rows of 20 grid points at a time)

; do first 5 points
    dogrid0 21,22,23,24,25,42,43,44,45,46,47,48, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5
.Lb:
    dogrid1 26,27,28,29,30,47,48,49,50,51,52,53, 1, 2, 3, 4, 5, 6, 7, 8, 9,10
    ldr r20,[r1,#0]          ; load in "old" left-hand value
    dogrid0 31,32,33,34,35,52,53,54,55,56,57,58, 6, 7, 8, 9,10,11,12,13,14,15
    ldr r41,[r0,#21 + stride]
    dogrid1 36,37,38,39,40,57,58,59,60,61,62,63,11,12,13,14,15,16,17,18,19,20
    str r40,[r1],#stride     ; save right-hand value to left
; 2nd row
    dogrid0 43,44,45,46,47,20,21,22,23,24,25,26,16,17,18,19,20,1+stride,2+stride,3+stride,4+stride,5+stride
    add r0,r0,#(stride * 4)
    dogrid1 48,49,50,51,52,25,26,27,28,29,30,31, 1, 2, 3, 4, 5, 6, 7, 8, 9,10
    ldr r42,[r1,#0]          ; load in "old" left-hand value
    dogrid0 53,54,55,56,57,30,31,32,33,34,35,36, 6, 7, 8, 9,10,11,12,13,14,15
    ldr r63,[r0,#21 + stride]
    dogrid1 58,59,60,61,62,35,36,37,38,39,40,41,11,12,13,14,15,16,17,18,19,20
    str r62,[r1],#stride     ; save right-hand value to left
; 1st row
    dogrid0 21,22,23,24,25,42,43,44,45,46,47,48,16,17,18,19,20,1+stride,2+stride,3+stride,4+stride,5+stride
    add r0,r0,#(stride * 4)
    sub r2,r2,#1
    nop
    bne .Lb

B.2 Matrix Multiplication

Listing B.2: Assembly code for high performance matmul

    .set stride, 32

; macro to multiply an element of matrix A with all the elements in a row of matrix B
.macro doMult areg,index,aprev,incr
    fmadd r32,r\areg,r16
    ldrd  r22,[r1,#\index + 3]
    fmadd r33,r\areg,r17
    ldr   r\aprev,[r0,#\incr]
    fmadd r34,r\areg,r18
    fmadd r35,r\areg,r19
    fmadd r36,r\areg,r20
    fmadd r37,r\areg,r21
    ldrd  r16,[r1,#\index + 4]
    fmadd r38,r\areg,r22
    ldrd  r18,[r1,#\index + 5]
    fmadd r39,r\areg,r23
    ldrd  r20,[r1,#\index + 6]
    fmadd r40,r\areg,r16
    ldrd  r22,[r1,#\index + 7]
    fmadd r41,r\areg,r17
    fmadd r42,r\areg,r18
    fmadd r43,r\areg,r19
    fmadd r44,r\areg,r20
    fmadd r45,r\areg,r21
    ldrd  r16,[r1,#\index + 8]
    fmadd r46,r\areg,r22
    ldrd  r18,[r1,#\index + 9]
    fmadd r47,r\areg,r23
    ldrd  r20,[r1,#\index + 10]
    fmadd r48,r\areg,r16
    ldrd  r22,[r1,#\index + 11]
    fmadd r49,r\areg,r17
    fmadd r50,r\areg,r18
    fmadd r51,r\areg,r19
    fmadd r52,r\areg,r20
    fmadd r53,r\areg,r21
    ldrd  r16,[r1,#\index + 12]
    fmadd r54,r\areg,r22
    ldrd  r18,[r1,#\index + 13]
    fmadd r55,r\areg,r23
    ldrd  r20,[r1,#\index + 14]
    fmadd r56,r\areg,r16
    ldrd  r22,[r1,#\index + 15]
    fmadd r57,r\areg,r17
    fmadd r58,r\areg,r18
    fmadd r59,r\areg,r19
    fmadd r60,r\areg,r20
    fmadd r61,r\areg,r21
    ldrd  r16,[r1,#\index + 16]
    fmadd r62,r\areg,r22
    ldrd  r18,[r1,#\index + 17]
    fmadd r63,r\areg,r23
    ldrd  r20,[r1,#\index + 18]
.endm

; macro to multiply an element of matrix A (last row) with all the elements in a row of matrix B
.macro doMultend areg,index,aprev,incr
    fmadd r32,r\areg,r16
    ldrd  r22,[r1,#\index + 3]
    fmadd r33,r\areg,r17
    ldr   r\aprev,[r0,#\incr]
    fmadd r34,r\areg,r18
    fmadd r35,r\areg,r19
    fmadd r36,r\areg,r20
    fmadd r37,r\areg,r21
    ldrd  r16,[r1,#\index + 4]
    fmadd r38,r\areg,r22
    ldrd  r18,[r1,#\index + 5]
    fmadd r39,r\areg,r23
    ldrd  r20,[r1,#\index + 6]
    fmadd r40,r\areg,r16
    ldrd  r22,[r1,#\index + 7]
    fmadd r41,r\areg,r17
    fmadd r42,r\areg,r18
    fmadd r43,r\areg,r19
    fmadd r44,r\areg,r20
    fmadd r45,r\areg,r21
    ldrd  r16,[r1,#\index + 8]
    fmadd r46,r\areg,r22
    ldrd  r18,[r1,#\index + 9]
    fmadd r47,r\areg,r23
    ldrd  r20,[r1,#\index + 10]
    fmadd r48,r\areg,r16
    ldrd  r22,[r1,#\index + 11]
    fmadd r49,r\areg,r17
    fmadd r50,r\areg,r18
    fmadd r51,r\areg,r19
    fmadd r52,r\areg,r20
    fmadd r53,r\areg,r21
    ldrd  r16,[r1,#\index + 12]
    fmadd r54,r\areg,r22
    ldrd  r18,[r1,#\index + 13]
    fmadd r55,r\areg,r23
    ldrd  r20,[r1,#\index + 14]
    fmadd r56,r\areg,r16
    ldrd  r22,[r1,#\index + 15]
    fmadd r57,r\areg,r17
    fmadd r58,r\areg,r18
    fmadd r59,r\areg,r19
    fmadd r60,r\areg,r20
    fmadd r61,r\areg,r21
    ; Point back to first row
    ldrd  r16,[r1,#0]
    fmadd r62,r\areg,r22
    ldrd  r18,[r1,#1]
    fmadd r63,r\areg,r23
    ldrd  r20,[r1,#2]
.endm

; macro to multiply an element of matrix A with all the elements in a row of matrix B
.macro doMultincr areg,index,aprev,incr
    fmadd r32,r\areg,r16
    ldrd  r22,[r1,#\index + 3]
    fmadd r33,r\areg,r17
    ldr   r\aprev,[r0,#\incr]
    fmadd r34,r\areg,r18
    ; Move r0 to point to next row of Matrix A
    add   r0,r0,#(stride * 4)
    fmadd r35,r\areg,r19
    fmadd r36,r\areg,r20
    fmadd r37,r\areg,r21
    ldrd  r16,[r1,#\index + 4]
    fmadd r38,r\areg,r22
    ldrd  r18,[r1,#\index + 5]
    fmadd r39,r\areg,r23
    ldrd  r20,[r1,#\index + 6]
    fmadd r40,r\areg,r16
    ldrd  r22,[r1,#\index + 7]
    fmadd r41,r\areg,r17
    fmadd r42,r\areg,r18
    fmadd r43,r\areg,r19
    fmadd r44,r\areg,r20
    fmadd r45,r\areg,r21
    ldrd  r16,[r1,#\index + 8]
    fmadd r46,r\areg,r22
    ldrd  r18,[r1,#\index + 9]
    fmadd r47,r\areg,r23
    ldrd  r20,[r1,#\index + 10]
    fmadd r48,r\areg,r16
    ldrd  r22,[r1,#\index + 11]
    fmadd r49,r\areg,r17
    fmadd r50,r\areg,r18
    fmadd r51,r\areg,r19
    fmadd r52,r\areg,r20
    fmadd r53,r\areg,r21
    ldrd  r16,[r1,#\index + 12]
    fmadd r54,r\areg,r22
    ldrd  r18,[r1,#\index + 13]
    fmadd r55,r\areg,r23
    ldrd  r20,[r1,#\index + 14]
    fmadd r56,r\areg,r16
    ldrd  r22,[r1,#\index + 15]
    fmadd r57,r\areg,r17
    fmadd r58,r\areg,r18
    fmadd r59,r\areg,r19
    fmadd r60,r\areg,r20
    fmadd r61,r\areg,r21
    ldrd  r16,[r1,#\index + 16]
    fmadd r62,r\areg,r22
    ldrd  r18,[r1,#\index + 17]
    fmadd r63,r\areg,r23
    ldrd  r20,[r1,#\index + 18]
.endm

; Start of matmul compute function

; preload the first row of A and the first row of B
; Matrix A
    ldr r11,[r0,#0]
    ldr r12,[r0,#1]
    ldr r14,[r0,#2]
    nop
    nop

; Matrix B
    ldrd r16,[r1,#0]
    ldrd r18,[r1,#1]
    ldrd r20,[r1,#2]

; Matrix C - Reading first row of intermediate result Matrix C
    ldrd r32,[r2,#0]
    ldrd r34,[r2,#1]
    ldrd r36,[r2,#2]
    ldrd r38,[r2,#3]
    ldrd r40,[r2,#4]
    ldrd r42,[r2,#5]
    ldrd r44,[r2,#6]
    ldrd r46,[r2,#7]
    ldrd r48,[r2,#8]
    ldrd r50,[r2,#9]
    ldrd r52,[r2,#10]
    ldrd r54,[r2,#11]
    ldrd r56,[r2,#12]
    ldrd r58,[r2,#13]
    ldrd r60,[r2,#14]
    ldrd r62,[r2,#15]

.Lb:
; Start accumulating
    doMult 11,0,15,3
    doMult 12,1 * stride/2,11,4
    doMult 14,2 * stride/2,12,5
    doMult 15,3 * stride/2,14,6
    doMult 11,4 * stride/2,15,7
    doMult 12,5 * stride/2,11,8
    doMult 14,6 * stride/2,12,9
    doMult 15,7 * stride/2,14,10
    doMult 11,8 * stride/2,15,11
    doMult 12,9 * stride/2,11,12
    doMult 14,10 * stride/2,12,13
    doMult 15,11 * stride/2,14,14
    doMult 11,12 * stride/2,15,15
    doMult 12,13 * stride/2,11,16
    doMult 14,14 * stride/2,12,17
    doMult 15,15 * stride/2,14,18
    doMult 11,16 * stride/2,15,19
    doMult 12,17 * stride/2,11,20
    doMult 14,18 * stride/2,12,21
    doMult 15,19 * stride/2,14,22
    doMult 11,20 * stride/2,15,23
    doMult 12,21 * stride/2,11,24
    doMult 14,22 * stride/2,12,25
    doMult 15,23 * stride/2,14,26
    doMult 11,24 * stride/2,15,27
    doMult 12,25 * stride/2,11,28
    doMult 14,26 * stride/2,12,29
    doMult 15,27 * stride/2,14,30
; Start buffering next row of Matrix A
    doMultincr 11,28 * stride/2,15,31
    doMult 12,29 * stride/2,11,0
    doMult 14,30 * stride/2,12,1
    doMultend 15,31 * stride/2,14,2

; Start writing out results into C and read next row of intermediate result C
    strd r32,[r2,#0]
    ldrd r32,[r2,#16 + 0]
    strd r34,[r2,#1]
    ldrd r34,[r2,#16 + 1]
    strd r36,[r2,#2]
    ldrd r36,[r2,#16 + 2]
    strd r38,[r2,#3]
    ldrd r38,[r2,#16 + 3]
    strd r40,[r2,#4]
    ldrd r40,[r2,#16 + 4]
    strd r42,[r2,#5]
    ldrd r42,[r2,#16 + 5]
    strd r44,[r2,#6]
    ldrd r44,[r2,#16 + 6]
    strd r46,[r2,#7]
    ldrd r46,[r2,#16 + 7]
    strd r48,[r2,#8]
    ldrd r48,[r2,#16 + 8]
    strd r50,[r2,#9]
    ldrd r50,[r2,#16 + 9]
    strd r52,[r2,#10]
    ldrd r52,[r2,#16 + 10]
    strd r54,[r2,#11]
    ldrd r54,[r2,#16 + 11]
    strd r56,[r2,#12]
    ldrd r56,[r2,#16 + 12]
    strd r58,[r2,#13]
    ldrd r58,[r2,#16 + 13]
    strd r60,[r2,#14]
    ldrd r60,[r2,#16 + 14]
    strd r62,[r2,#15]
    ldrd r62,[r2,#16 + 15]
; Point to next row of Matrix C
    add r2,r2,#(stride * 4)

; Now loop back
    sub r3,r3,#1
    nop
    bne .Lb


Appendix C

Energy Measurement Setup

The schematic of the energy measurement setup using the µCurrent Gold is shown in Figure C.1.

Figure C.1: Schematic of energy measurement framework using µCurrent Gold





Appendix D

Frequency Scaling and Energy Usage

To characterize the power requirements of the CPU and memory components, we created two synthetic microbenchmarks: a "CPU-bound" code with a high proportion of floating-point and integer instructions that operate on data in (on-chip) cache, and a "memory-bound" code designed to be highly cache-inefficient, such that a high proportion of time and energy is devoted to off-chip memory access.

The CPU-bound code evaluates a polynomial function for every element of an array. The computation has a high flop intensity of 36 flops per memory access. The size of the vectors is chosen so that the array can be stored entirely in L1 cache. This ensures that there is no off-chip memory access during computation and that the microbenchmark is thus bound by CPU performance.

The memory-bound code copies 32-bit values from one array to another. To ensure that copying each element involves an explicit memory access, after transferring one element of the array, the next element to be transferred is located at an offset such that it will not be present in the cache, thereby necessitating an off-chip memory access.
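The two microbenchmarks can be sketched in outline. This is a portable Python rendition of the idea only; the measured codes were compiled, with array sizes and strides chosen against the actual cache geometry, and the constants here are illustrative:

```python
def cpu_bound(data, coeffs):
    """Evaluate a polynomial at every element of `data`: many flops per
    memory access, with `data` sized to fit in L1 cache.  Horner's rule
    gives one multiply-accumulate per coefficient."""
    out = []
    for x in data:
        acc = 0.0
        for c in coeffs:
            acc = acc * x + c
        out.append(acc)
    return out

def memory_bound(src, dst, line_elems):
    """Copy src to dst visiting elements one cache line apart, so that
    (in the compiled version) consecutive accesses miss the cache and
    force off-chip memory traffic."""
    n = len(src)
    for start in range(line_elems):
        for i in range(start, n, line_elems):
            dst[i] = src[i]
    return dst
```

In the real CPU-bound code the polynomial degree is chosen so the loop body performs 36 flops per element; line_elems would be the number of 32-bit values in a cache line.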

D.1 CPU-bound Workload

In the first experiment, we measured performance and power consumption for the CPU-bound workload while varying the CPU frequency. The mean of 5 samples is reported for all experiments, with a margin of error, at a confidence level of 95%, of less than 0.5% for performance and 1% for energy and power. Power consumption was observed to increase non-linearly with CPU frequency, as shown in Figure D.1. The variation in performance and energy-to-solution with frequency is shown in Figure D.2.

To identify the relation between power and frequency, a non-linear curve fit was performed on the measured data. The result describes a cubic relation. The relations are shown in Equation D.1 for TK1 and Equation D.2 for TX1.




[Plot omitted: idle and active power (W) against CPU frequency (MHz).]
Figure D.1: CPU-bound workload on TK1 - Variation of active power draw as CPU frequency is varied.

[Plot omitted: performance (MFLOP/s) and energy-to-solution (J) against CPU frequency (MHz).]
Figure D.2: CPU-bound workload on TK1 - Variation of performance and energy usage when CPU frequency is varied.


P = 0.31 f^3 + 1.66 f + 1.95    (D.1)

P = 0.25 f^3 + 1.81 f + 2.66    (D.2)

where P is the power consumption in watts, and f is the CPU frequency in GHz.
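The fit itself can be reproduced with an ordinary polynomial least-squares fit. The sketch below uses synthetic samples generated from the reported TK1 model in Equation D.1 in place of the actual measurements, which are not tabulated here; with real data one would fit the measured frequency/power pairs.

```python
import numpy as np

# Recover the cubic power model by least-squares polynomial fitting.
# The (f, P) samples are synthesised from the reported TK1 relation
# P = 0.31 f^3 + 1.66 f + 1.95 (f in GHz), so the fit recovers exactly
# those coefficients; real measurements would include noise.
f = np.linspace(0.1, 2.3, 23)          # CPU frequencies in GHz
P = 0.31 * f**3 + 1.66 * f + 1.95      # power samples in watts

coeffs = np.polyfit(f, P, deg=3)       # [cubic, quadratic, linear, constant]
```

On noise-free samples the quadratic coefficient comes back as essentially zero, matching the form of Equations D.1 and D.2, which contain no quadratic term.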

D.2 Memory-bound Workload

A similar experiment was conducted for the memory-bound workload. The observed increase in power consumption with increasing CPU frequency is shown in Figure D.3. The variations in performance and energy-to-solution with frequency are shown in Figure D.4.

Figure D.3: Memory-bound workload on TK1 - Variation of active power draw when CPU frequency is varied. (Plot: power in W versus CPU frequency in MHz; idle power and active power shown.)

Figure D.4: Memory-bound workload on TK1 - Variation of performance and energy usage when CPU frequency is varied. (Plots: memory bandwidth in MB/s and energy in J versus CPU frequency in MHz.)

The relations obtained from curve-fitting are shown in Equation D.3 for TK1 and Equation D.4 for TX1. The coefficient for the cubic term is negligible, resulting in a linear relation between power and CPU frequency for the memory-bound workload.


P = 1.74 f + 1.95    (D.3)

P = 1.72 f + 2.66    (D.4)
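One illustrative consequence of the two fitted models, under assumptions not made in the text (that a CPU-bound run's time scales as 1/f while a memory-bound run's time is roughly frequency-independent), is that energy-to-solution behaves quite differently in the two cases. This is a sketch of how the fitted relations can be used, not a result stated here.

```python
import numpy as np

# Energy E = P(f) * t(f) for the two fitted TK1 models, assuming
# t = 1/f for the CPU-bound workload (runtime inversely proportional to
# frequency) and t = 1 (constant) for the memory-bound workload. Under
# these assumptions the cubic model has an interior energy optimum,
# while the linear model's energy rises monotonically with frequency.
f = np.linspace(0.2, 2.3, 10000)              # GHz
E_cpu = (0.31 * f**3 + 1.66 * f + 1.95) / f   # Equation D.1, t = 1/f
E_mem = (1.74 * f + 1.95) * 1.0               # Equation D.3, t = 1

f_opt_cpu = f[np.argmin(E_cpu)]               # interior optimum, ~1.46 GHz
f_opt_mem = f[np.argmin(E_mem)]               # lowest frequency sampled
```

Setting dE/df = 0 for the CPU-bound case gives 0.62 f = 1.95 / f^2, i.e. f = (1.95/0.62)^(1/3), which the grid search above approximates.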

