# Bounds on the Energy Consumption of Computational Kernels


Bounds on the Energy Consumption of Computational Kernels

Andrew Gearhart

Electrical Engineering and Computer Sciences University of California at Berkeley

Technical Report No. UCB/EECS-2014-175 http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-175.html

October 23, 2014

Copyright © 2014, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.

Bounds on the Energy Consumption of Computational Kernels

by

Andrew Scott Gearhart

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

and the Designated Emphasis

in

Computational Science and Engineering

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor James W. Demmel, Chair

Professor Katherine A. Yelick

Professor Tarek I. Zohdi

Fall 2014

Bounds on the Energy Consumption of Computational Kernels

Copyright 2014 by

Andrew Scott Gearhart


Abstract

Bounds on the Energy Consumption of Computational Kernels

by

Andrew Scott Gearhart

Doctor of Philosophy in Computer Science and the Designated Emphasis

in Computational Science and Engineering

University of California, Berkeley

Professor James W. Demmel, Chair

As computing devices evolve with successive technology generations, many machines target either the mobile or the high-performance computing/datacenter environment. In both of these form factors, energy consumption is often the limiting factor on hardware and software efficiency. On mobile devices, limitations in battery technology may reduce possible hardware capability due to a tight energy budget. On the other hand, large machines such as datacenters and supercomputers have budgets directly related to energy consumption, and small improvements in energy efficiency can significantly reduce operating costs. Such challenges have motivated research on the impact of applications, operating systems, and runtime systems on energy consumption. Until recently, little consideration was given to the potential energy efficiency of algorithms themselves.

A dominant idea within the high-performance computing (HPC) community is that applications can be decomposed into a set of key computational problems, called kernels. Via automatic performance tuning and new algorithms for many kernels, researchers have successfully demonstrated performance improvements on a wide variety of machines. Motivated by the dominant and still-growing cost (in time and energy) of moving data, algorithmic improvements have been attained by proving lower bounds on the data movement required to solve a computational problem, and then developing communication-optimal algorithms that attain these bounds.

This thesis extends previous research on communication bounds and computational kernels by presenting bounds on the energy consumption of a large class of algorithms. These bounds apply to sequential, distributed parallel, and heterogeneous machine models, and we detail methods to further extend these models to larger classes of machines. We argue that the energy consumption of computational kernels is usually predictable and can be modeled via linear models with a handful of terms. Thus, these energy models (and the accompanying bounds) may apply to many HPC applications when used in composition.


Given energy bounds, we analyze the implications of such results under additional constraints, such as an upper bound on runtime, and also suggest directions for future research that may aid the development of a hardware/software co-tuning process. Further, we present a new model of energy efficiency, Cityscape, that allows hardware designers to quickly target areas for improvement in hardware attributes. We believe that combining our bounds with other models of energy consumption may provide a useful method for such co-tuning, i.e., to enable algorithm and hardware architects to develop provably energy-optimal algorithms on customized hardware platforms.
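As an illustration of the claim that kernel energy can be captured by a linear model with a handful of terms, the following is a minimal sketch of fitting such a model by least squares. The three terms used here (arithmetic operations, words moved, and runtime), the units, the coefficient values, and the function names `fit_linear_energy_model` and `predict_energy` are assumptions made for this example; the abstract does not state the thesis's actual model terms, and the data below is synthetic rather than measured.

```python
import random


def fit_linear_energy_model(counts, energy):
    """Least-squares fit of energy[i] ~ sum_j coeffs[j] * counts[i][j].

    Solves the normal equations (A^T A) x = A^T b by Gaussian elimination
    with partial pivoting. This is adequate for a handful of well-scaled,
    linearly independent terms; it is not a robust general-purpose solver.
    """
    k = len(counts[0])
    # Augmented normal-equation matrix [A^T A | A^T b], size k x (k + 1).
    m = [
        [sum(row[i] * row[j] for row in counts) for j in range(k)]
        + [sum(row[i] * e for row, e in zip(counts, energy))]
        for i in range(k)
    ]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (m[i][k] - tail) / m[i][i]
    return coeffs


def predict_energy(coeffs, counts_row):
    """Evaluate the linear model for one run: dot(coeffs, counts_row)."""
    return sum(c * x for c, x in zip(coeffs, counts_row))


# Synthetic, noise-free "measurements" (hypothetical values and units):
# columns are Gflops executed, Mwords moved, and seconds of runtime;
# coefficients are joules per Gflop, joules per Mword, and watts.
rng = random.Random(0)
true_coeffs = [2.0, 0.05, 10.0]
runs = [
    [rng.uniform(1, 100), rng.uniform(10, 1000), rng.uniform(1, 60)]
    for _ in range(20)
]
joules = [predict_energy(true_coeffs, r) for r in runs]

fitted = fit_linear_energy_model(runs, joules)
```

Because the synthetic energies are generated from the same linear form, `fitted` recovers `true_coeffs` up to rounding error. With real measurements the fit would only be approximate, and the choice of terms would need to be validated against the machine in question.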


Now this is not the end. It is not even the beginning of the end.

But it is, perhaps, the end of the beginning.

- Sir Winston Churchill, 1942


Contents

List of Figures

List of Tables

1 Introduction
  1.1 Communication Now Dominates Performance Costs
  1.2 Energy Efficiency at the Algorithm Level
  1.3 Thesis Goals and Contributions
  1.4 Thesis Organization

2 Energy Consumption and Computing
  2.1 Power vs. Energy
  2.2 Phase-based Execution of Applications
  2.3 Key Consumers of Energy on Desktops and Server Nodes
    Energy Consumption in CMOS Logic
    Other Hardware Components
  2.4 Network Energy Consumption on Distributed Parallel Machines

3 Machine Models for Runtime and Energy
  3.1 Problems, Algorithms, and Implementations
  3.2 Machine Models
    Sequential Machine Model (S)
    Distributed Parallel Machine Model 1 (DP1)
    Model Compositions and Distributed Parallel Model 2 (DP2)
    Heterogeneous Machine Model (H)
  3.3 Problems of Particular Focus
    Matrix-vector multiplication
    Matrix-matrix Multiplication
    O(n^2) n-body problem
  3.4 Model Validation
    Performance Counter Measurement
    Measuring Power and Energy
    Sequential Model (S)
      Fitting the Model via Least Squares
    Distributed Parallel Models
    Heterogeneous Model
  3.5 Parameter Estimation for Machines and Implementations and Related Work

4 Bounds on Communication, Runtime and Energy for Specific Algorithms
  4.1 Communication Lower Bounds for Sequential and Distributed Parallel Machines
    Lower Bounds on the DP Models that Include Link Contention
  4.2 Energy Lower Bounds for Specific Algorithms
    O(n^3) Classical Matrix Multiplication
    Strassen and Strassen-like Matrix Multiplication
    Matrix-vector multiplication
    O(n^2) n-body problem
  4.3 Bounds on Heterogeneous Machines
    Input/Output Dominated Lower Bounds
    Loomis-Whitney Dominated Lower Bound
  4.4 Optimal Heterogeneous Algorithms
    Heterogeneous Matrix-Vector Multiplication
    Heterogeneous O(n^3) Matrix-Matrix Multiplication

5 Bounds on Communication, Runtime and Energy for Programs that Access Arrays
  5.1 Bounds on Programs that Reference Arrays
    Sequential Model
    Distributed Parallel Model 1
    Distributed Parallel Model 2