Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
TRANSCRIPT
1
Machine-Learning-based Performance Heuristics
for Runtime CPU/GPU Selection
Akihiro Hayashi (Rice University), Kazuaki Ishizaki (IBM Research - Tokyo),
Gita Koblents (IBM Canada), Vivek Sarkar (Rice University)
ACM International Conference on Principles and Practices of Programming on the Java Platform: virtual machines, languages, and tools (PPPJ’15)
2
Background: Java 8 Parallel Streams API
Explicit parallelism with lambda expressions:
IntStream.range(0, N).parallel().forEach(i -> <lambda>)
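The slide's snippet can be expanded into a minimal runnable example (the class and method names are mine, not from the talk):

```java
import java.util.stream.IntStream;

public class ParallelDouble {
    // Runs the loop body in parallel across the fork/join worker threads;
    // the JIT compiler described in this talk may instead offload it to a GPU.
    static double[] run(int n) {
        double[] out = new double[n];
        IntStream.range(0, n).parallel().forEach(i -> out[i] = i * 2.0);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(4)[3]); // prints 6.0
    }
}
```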
3
Background: Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for preserving portability while enabling the compiler to perform parallel-aware optimizations and code generation.
[Figure: Java 8 programs (SW) use the Java 8 Parallel Stream API to target multi-core CPUs, many-core GPUs, and FPGAs (HW).]
4
Background: JIT Compilation for GPU Execution
IBM Java 8 Compiler, built on top of the production version of the IBM Java 8 runtime environment.
[Figure: method A is interpreted on the JVM for its first N invocations; at the (N+1)th invocation, the JIT generates native code for multi-core CPUs and many-core GPUs.]
5
Background: The compilation flow of the IBM Java 8 Compiler
[Figure: Java bytecode goes through bytecode translation into our IR, followed by analysis and optimizations (existing optimizations plus parallel streams identification in the IR). From the IR for parallel streams, our new modules for GPU perform feature extraction and NVVM IR generation; libnvvm (a JIT compiler module by NVIDIA) compiles the NVVM IR to PTX, and the PTX2binary module produces NVIDIA GPU native code, supported by GPU runtime helpers. Target machine code generation produces PowerPC native code for the CPU path.]
Optimizations for GPUs:
- To improve performance: read-only cache utilization, buffer alignment, and data transfer optimizations
- To support Java's language features: loop versioning for eliminating redundant exception checking on GPUs, and virtual method invocation support with de-virtualization and loop versioning
6
Motivation: Runtime CPU/GPU Selection
Selecting the faster hardware device is a challenging problem.
[Figure: at the (N+1)th invocation of method A, native code generation must target either multi-core CPUs or many-core GPUs. Problem: which one is faster?]
7
Related Work: Linear regression
Regression-based cost estimation [1,2] is specific to an application.
[Figure: execution time (msec) of App 1) BlackScholes and App 2) Vector Addition on CPU and GPU; the faster device differs between the two applications.]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ'09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
8
Open Question: Is an accurate cost model required?
Constructing an accurate cost model would be excessive: considerable effort would be needed to update performance models for future generations of hardware.
Which one is faster, multi-core CPUs or many-core GPUs?
Our answer: machine-learning-based performance heuristics.
9
Our Approach: ML-based Performance Heuristics
A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs).
[Figure: during training runs with the JIT compiler, each application/data-set pair (App A with data 1, App A with data 2, App B with data 3) goes through feature extraction in the Java runtime on CPU and GPU, producing features 1-3; offline model construction feeds these features to LIBSVM to build the prediction model.]
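At run time, once the model has been constructed offline, device selection reduces to a boolean query over the extracted features. A minimal sketch, with a hypothetical Predicate standing in for the trained SVM:

```java
import java.util.function.Predicate;

public class DeviceSelector {
    enum Device { CPU, GPU }

    // The trained binary model acts as a predicate over the feature vector
    // (true = CPU predicted faster); here it is a hypothetical stand-in,
    // not the JIT compiler's actual interface.
    static Device select(double[] features, Predicate<double[]> cpuFaster) {
        return cpuFaster.test(features) ? Device.CPU : Device.GPU;
    }
}
```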
10
Features of program that may affect performance:
- Loop range (parallel loop size)
- The dynamic number of instructions: memory accesses, arithmetic operations, math methods, branch instructions, other instructions
11
Features of program that may affect performance (cont'd):
- The dynamic number of array accesses: coalesced access a[i] (aligned access), offset access a[i+c], stride access a[c*i], other access a[b[i]]
- Data transfer size: H2D transfer size and D2H transfer size
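As a sketch, the features above could be flattened into the numeric vector handed to the SVM; the class and field names here are illustrative, not the compiler's own:

```java
public class FeatureVector {
    // Parallel loop size
    long range;
    // Dynamic instruction counts by kind
    long memory, arithmetic, math, branch, other;
    // Dynamic array-access counts by pattern: a[i], a[i+c], a[c*i], a[b[i]]
    long coalesced, offset, stride, otherAccess;
    // Host-to-device / device-to-host transfer sizes in bytes
    long h2dBytes, d2hBytes;

    // Flattens the twelve features into the numeric array given to LIBSVM.
    double[] toArray() {
        return new double[] { range, memory, arithmetic, math, branch, other,
                              coalesced, offset, stride, otherAccess,
                              h2dBytes, d2hBytes };
    }
}
```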
12
An example of feature vector:
{
  "title"    : "MRIQ.runGPULambda()V",
  "lineNo"   : 114,
  "features" : {
    "range" : 32768,
    "IRs"   : { "Memory": 89128, "Arithmetic": 61447, "Math": 6144,
                "Branch": 3074, "Other": 58384 },
    "Array Accesses" : { "Coalesced": 9218, "Offset": 0, "Stride": 0,
                         "Other": 12288 },
    "H2D Transfer" : [131088,131088,131088,12304,12304,12304,12304,16,16],
    "D2H Transfer" : [131072,131072,0,0,0,0,0,0,0,0]
  }
}
13
Applications

Application            Source     Field                Max Size         Data Type
BlackScholes           -          Finance              4,194,304        double
Crypt                  JGF        Cryptography         Size C (N=50M)   byte
SpMM                   JGF        Numerical Computing  Size C (N=500K)  double
MRIQ                   Parboil    Medical              Large (64^3)     float
Gemm                   Polybench  Numerical Computing  2K x 2K          int
Gesummv                Polybench  Numerical Computing  2K x 2K          int
Doitgen                Polybench  Numerical Computing  256x256x256      int
Jacobi-1D              Polybench  Numerical Computing  N=4M, T=1        int
Matrix Multiplication  -          -                    2K x 2K          double
Matrix Transpose       -          -                    2K x 2K          double
VecAdd                 -          -                    4M               double
14
Platform
CPU: IBM POWER8 @ 3.69GHz, 20 cores, 8 SMT threads per core (= up to 160 threads), 256 GB of RAM
GPU: NVIDIA Tesla K40m, 12GB of global memory
15
Prediction Model Construction
Obtained 291 samples by running 11 applications with different data sets. The choice is either GPU or 160 worker threads on CPU.
[Figure: the same training flow as before: feature extraction during training runs with the JIT compiler for each application/data-set pair (App A with data 1 and data 2, App B with data 3), followed by offline model construction of the prediction model with LIBSVM 3.2.]
16
Speedups and the accuracy with the max data size: 160 worker threads vs. GPU
[Figure: bar chart of speedup relative to sequential Java (log scale, higher is better) for BlackScholes, Crypt, SpMM, MRIQ, Gemm, Gesummv, Doitgen, Jacobi-1D, MM, MT, and VA, comparing 160 worker threads (fork/join) against GPU; GPU speedups reach up to 1164.8x. The model's prediction is correct for every application except BlackScholes.]
17
How to evaluate prediction models: TRUE/FALSE POSITIVE/NEGATIVE
4 types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU):
- TRUE POSITIVE: correctly predicted that CPU is faster
- TRUE NEGATIVE: correctly predicted that GPU is faster
- FALSE POSITIVE: predicted CPU is faster, but GPU is actually faster
- FALSE NEGATIVE: predicted GPU is faster, but CPU is actually faster
18
How to evaluate prediction models: Accuracy, Precision, and Recall metrics
- Accuracy: the percentage of selections predicted correctly: (TP + TN) / (TP + TN + FP + FN)
- Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster.
  Precision CPU: TP / (TP + FP), i.e. # of samples correctly predicted CPU is faster / total # of samples predicted that CPU is faster
  Precision GPU: TN / (TN + FN)
- Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster.
  Recall CPU: TP / (TP + FN), i.e. # of samples correctly predicted CPU is faster / total # of samples labeled that CPU is actually faster
  Recall GPU: TN / (TN + FP)
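These definitions translate directly into code; a minimal sketch (class and method names are mine):

```java
// Metrics over the binary confusion matrix, with POSITIVE = CPU and
// NEGATIVE = GPU as in the talk's definitions.
public class Metrics {
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    static double precisionCpu(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recallCpu(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double precisionGpu(int tn, int fn) { return (double) tn / (tn + fn); }
    static double recallGpu(int tn, int fp)    { return (double) tn / (tn + fp); }
}
```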
19
How to evaluate prediction models: 5-fold cross-validation
Overfitting problem: the prediction model may be tailored to the eleven applications if training data = testing data.
To avoid the overfitting problem, calculate the accuracy of the prediction model using 5-fold cross-validation.
[Figure: the 291 training samples are split into Subsets 1-5. In each round, one subset is held out for TESTING while a prediction model is trained on the other four (e.g. trained on Subsets 2-5 and tested on Subset 1 for accuracy X%, precision Y%, ...; then trained on Subsets 1, 3-5 and tested on Subset 2 for accuracy P%, precision Q%, ...).]
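The cross-validation procedure described above can be sketched as follows, assuming a generic trainer interface standing in for LIBSVM:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class CrossValidation {
    // Hypothetical trainer: builds a binary predictor (true = CPU faster)
    // from labeled feature vectors; LIBSVM plays this role in the talk.
    interface Trainer {
        Predicate<double[]> train(List<double[]> xs, List<Boolean> ys);
    }

    // k-fold cross-validation: each sample is held out exactly once and
    // scored by a model trained on the remaining folds.
    static double accuracy(List<double[]> xs, List<Boolean> ys, Trainer t, int k) {
        int n = xs.size(), correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<double[]> trX = new ArrayList<>(); List<Boolean> trY = new ArrayList<>();
            List<double[]> teX = new ArrayList<>(); List<Boolean> teY = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == fold) { teX.add(xs.get(i)); teY.add(ys.get(i)); }
                else               { trX.add(xs.get(i)); trY.add(ys.get(i)); }
            }
            Predicate<double[]> model = t.train(trX, trY);
            for (int i = 0; i < teX.size(); i++)
                if (model.test(teX.get(i)) == teY.get(i)) correct++;
        }
        return (double) correct / n;  // accuracy over all held-out folds
    }
}
```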
20
Accuracies with cross-validation: 160 worker threads or GPU
[Figure: accuracy (%) over all 291 samples, higher is better, for cumulative feature sets:
Range 79.0%, +=nIRs 97.6%, +=dIRs 99.0%, +=Array 99.0%, ALL (+=DT) 97.2%.]
21
Precisions and Recalls with cross-validation (higher is better)

          Precision CPU   Recall CPU   Precision GPU   Recall GPU
Range     79.0%           100%         0%              0%
+=nIRs    97.8%           99.1%        96.5%           91.8%
+=dIRs    98.7%           100%         100%            95.0%
+=Arrays  98.7%           100%         100%            95.0%
ALL       96.7%           100%         100%            86.9%

All prediction models except Range rarely make a bad decision.
22
Discussion
Based on the results with 291 samples, the feature set (range, # of detailed instructions, # of array accesses) shows the best accuracy. DT does not contribute to improving the accuracy, since the DT optimizations do not make GPUs faster. Loop range, # of arithmetic operations, and # of coalesced accesses affect the decision most.
Pros and cons:
(+) Future generations of hardware can be supported easily by re-running applications: just add more training data and rebuild the prediction model.
(-) It takes time to collect training data.
23
Related Work: Java + GPU

               Lang   JIT  GPU Kernel           Device Selection
JCUDA          Java   -    CUDA                 GPU only
Lime           Lime   ✔    Override map/reduce  Static
Firepile       Scala  ✔    reduce               Static
JaBEE          Java   ✔    Override run         GPU only
Aparapi        Java   ✔    map                  Static
Hadoop-CL      Java   ✔    Override map/reduce  Static
RootBeer       Java   ✔    Override run         Not described
HJ-OpenCL      HJ     -    forall               Static
PPPJ09 (auto)  Java   ✔    For-loop             Dynamic with Regression
Our Work       Java   ✔    Parallel Stream      Dynamic with Machine Learning

None of these approaches considers Java 8 Parallel Stream APIs together with dynamic device selection driven by machine learning.
24
Conclusions
Machine-learning-based performance heuristics reach up to 99% accuracy and are a promising way to build performance heuristics.
Future work: exploration of program features (e.g. CFG); selection of the best configuration among 1 worker, 2 workers, ..., 160 workers, and GPU; parallelizing the training phase.
For more details on GPU code generation, see "Compiling and Optimizing Java 8 Programs for GPU Execution", PACT'15, October 2015.
25
Acknowledgements
Special thanks to IBM CAS, Marcel Mitran (IBM Canada), Jimmy Kwa (IBM Canada), the Habanero Group at Rice, and Yoichi Matsuyama (CMU).