Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
TRANSCRIPT
1
Machine-Learning-based Performance Heuristics
for Runtime CPU/GPU Selection
Akihiro Hayashi (Rice University), Kazuaki Ishizaki (IBM Research - Tokyo),
Gita Koblents (IBM Canada), Vivek Sarkar (Rice University)
ACM International Conference on Principles and Practices of Programming on the Java Platform: virtual machines, languages, and tools (PPPJ’15)
2
Background: Java 8 Parallel Streams API
Explicit parallelism with lambda expressions:
IntStream.range(0, N).parallel().forEach(i -> <lambda>)
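The slide's snippet can be expanded into a minimal runnable example (the class and method names are mine, not from the talk):

```java
import java.util.stream.IntStream;

public class ParallelDouble {
    // Runs the loop body in parallel across the fork/join worker threads;
    // the JIT compiler described in this talk may instead offload it to a GPU.
    static double[] run(int n) {
        double[] out = new double[n];
        IntStream.range(0, n).parallel().forEach(i -> out[i] = i * 2.0);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(4)[3]); // prints 6.0
    }
}
```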
3
Background: Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for preserving portability while enabling the compiler to perform parallel-aware optimizations and code generation.
[Figure: Java 8 programs (SW) use the Java 8 Parallel Stream API to target multi-core CPUs, many-core GPUs, and FPGAs (HW).]
4
Background: JIT Compilation for GPU Execution
IBM Java 8 Compiler, built on top of the production version of the IBM Java 8 runtime environment.
[Figure: method A is interpreted on the JVM for its first N invocations; at the (N+1)th invocation, the JIT generates native code for multi-core CPUs and many-core GPUs.]
5
Background: The compilation flow of the IBM Java 8 Compiler
[Figure: Java bytecode goes through bytecode translation into our IR, followed by analysis and optimizations (existing optimizations plus parallel streams identification in the IR). From the IR for parallel streams, our new modules for GPU perform feature extraction and NVVM IR generation; libnvvm (a JIT compiler module by NVIDIA) compiles the NVVM IR to PTX, and the PTX2binary module produces NVIDIA GPU native code, supported by GPU runtime helpers. Target machine code generation produces PowerPC native code for the CPU path.]
Optimizations for GPUs:
- To improve performance: read-only cache utilization, buffer alignment, and data transfer optimizations
- To support Java's language features: loop versioning for eliminating redundant exception checking on GPUs, and virtual method invocation support with de-virtualization and loop versioning
6
Motivation: Runtime CPU/GPU Selection
Selecting the faster hardware device is a challenging problem.
[Figure: at the (N+1)th invocation of method A, native code generation must target either multi-core CPUs or many-core GPUs. Problem: which one is faster?]
7
Related Work: Linear regression
Regression-based cost estimation [1,2] is specific to an application.
[Figure: execution time (msec) of App 1) BlackScholes and App 2) Vector Addition on CPU and GPU; the faster device differs between the two applications.]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ'09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
8
Open Question: Is an accurate cost model required?
Constructing an accurate cost model would be excessive: considerable effort would be needed to update performance models for future generations of hardware.
Which one is faster, multi-core CPUs or many-core GPUs?
Our answer: machine-learning-based performance heuristics.
9
Our Approach: ML-based Performance Heuristics
A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs).
[Figure: during training runs with the JIT compiler, each application/data-set pair (App A with data 1, App A with data 2, App B with data 3) goes through feature extraction in the Java runtime on CPU and GPU, producing features 1-3; offline model construction feeds these features to LIBSVM to build the prediction model.]
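At run time, once the model has been constructed offline, device selection reduces to a boolean query over the extracted features. A minimal sketch, with a hypothetical Predicate standing in for the trained SVM:

```java
import java.util.function.Predicate;

public class DeviceSelector {
    enum Device { CPU, GPU }

    // The trained binary model acts as a predicate over the feature vector
    // (true = CPU predicted faster); here it is a hypothetical stand-in,
    // not the JIT compiler's actual interface.
    static Device select(double[] features, Predicate<double[]> cpuFaster) {
        return cpuFaster.test(features) ? Device.CPU : Device.GPU;
    }
}
```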
10
Features of program that may affect performance:
- Loop range (parallel loop size)
- The dynamic number of instructions: memory accesses, arithmetic operations, math methods, branch instructions, other instructions
11
Features of program that may affect performance (cont'd):
- The dynamic number of array accesses: coalesced access a[i] (aligned access), offset access a[i+c], stride access a[c*i], other access a[b[i]]
- Data transfer size: H2D transfer size and D2H transfer size
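As a sketch, the features above could be flattened into the numeric vector handed to the SVM; the class and field names here are illustrative, not the compiler's own:

```java
public class FeatureVector {
    // Parallel loop size
    long range;
    // Dynamic instruction counts by kind
    long memory, arithmetic, math, branch, other;
    // Dynamic array-access counts by pattern: a[i], a[i+c], a[c*i], a[b[i]]
    long coalesced, offset, stride, otherAccess;
    // Host-to-device / device-to-host transfer sizes in bytes
    long h2dBytes, d2hBytes;

    // Flattens the twelve features into the numeric array given to LIBSVM.
    double[] toArray() {
        return new double[] { range, memory, arithmetic, math, branch, other,
                              coalesced, offset, stride, otherAccess,
                              h2dBytes, d2hBytes };
    }
}
```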
12
An example of feature vector:
{
  "title"    : "MRIQ.runGPULambda()V",
  "lineNo"   : 114,
  "features" : {
    "range" : 32768,
    "IRs"   : { "Memory": 89128, "Arithmetic": 61447, "Math": 6144,
                "Branch": 3074, "Other": 58384 },
    "Array Accesses" : { "Coalesced": 9218, "Offset": 0, "Stride": 0,
                         "Other": 12288 },
    "H2D Transfer" : [131088,131088,131088,12304,12304,12304,12304,16,16],
    "D2H Transfer" : [131072,131072,0,0,0,0,0,0,0,0]
  }
}
13
Applications

Application            Source     Field                Max Size         Data Type
BlackScholes           -          Finance              4,194,304        double
Crypt                  JGF        Cryptography         Size C (N=50M)   byte
SpMM                   JGF        Numerical Computing  Size C (N=500K)  double
MRIQ                   Parboil    Medical              Large (64^3)     float
Gemm                   Polybench  Numerical Computing  2K x 2K          int
Gesummv                Polybench  Numerical Computing  2K x 2K          int
Doitgen                Polybench  Numerical Computing  256x256x256      int
Jacobi-1D              Polybench  Numerical Computing  N=4M, T=1        int
Matrix Multiplication  -          -                    2K x 2K          double
Matrix Transpose       -          -                    2K x 2K          double
VecAdd                 -          -                    4M               double
14
Platform
CPU: IBM POWER8 @ 3.69GHz, 20 cores, 8 SMT threads per core (= up to 160 threads), 256 GB of RAM
GPU: NVIDIA Tesla K40m, 12GB of global memory
15
Prediction Model Construction
Obtained 291 samples by running 11 applications with different data sets. The choice is either GPU or 160 worker threads on CPU.
[Figure: the same training flow as before: feature extraction during training runs with the JIT compiler for each application/data-set pair (App A with data 1 and data 2, App B with data 3), followed by offline model construction of the prediction model with LIBSVM 3.2.]
16
Speedups and the accuracy with the max data size: 160 worker threads vs. GPU
[Figure: bar chart of speedup relative to sequential Java (log scale, higher is better) for BlackScholes, Crypt, SpMM, MRIQ, Gemm, Gesummv, Doitgen, Jacobi-1D, MM, MT, and VA, comparing 160 worker threads (fork/join) against GPU; GPU speedups reach up to 1164.8x. The model's prediction is correct for every application except BlackScholes.]
17
How to evaluate prediction models: TRUE/FALSE POSITIVE/NEGATIVE
4 types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU):
- TRUE POSITIVE: correctly predicted that CPU is faster
- TRUE NEGATIVE: correctly predicted that GPU is faster
- FALSE POSITIVE: predicted CPU is faster, but GPU is actually faster
- FALSE NEGATIVE: predicted GPU is faster, but CPU is actually faster
18
How to evaluate prediction models: Accuracy, Precision, and Recall metrics
- Accuracy: the percentage of selections predicted correctly: (TP + TN) / (TP + TN + FP + FN)
- Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster.
  Precision CPU: TP / (TP + FP), i.e. # of samples correctly predicted CPU is faster / total # of samples predicted that CPU is faster
  Precision GPU: TN / (TN + FN)
- Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster.
  Recall CPU: TP / (TP + FN), i.e. # of samples correctly predicted CPU is faster / total # of samples labeled that CPU is actually faster
  Recall GPU: TN / (TN + FP)
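These definitions translate directly into code; a minimal sketch (class and method names are mine):

```java
// Metrics over the binary confusion matrix, with POSITIVE = CPU and
// NEGATIVE = GPU as in the talk's definitions.
public class Metrics {
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    static double precisionCpu(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recallCpu(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double precisionGpu(int tn, int fn) { return (double) tn / (tn + fn); }
    static double recallGpu(int tn, int fp)    { return (double) tn / (tn + fp); }
}
```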
19
How to evaluate prediction models: 5-fold cross-validation
Overfitting problem: the prediction model may be tailored to the eleven applications if training data = testing data.
To avoid the overfitting problem, calculate the accuracy of the prediction model using 5-fold cross-validation.
[Figure: the 291 training samples are split into Subsets 1-5. In each round, one subset is held out for TESTING while a prediction model is trained on the other four (e.g. trained on Subsets 2-5 and tested on Subset 1 for accuracy X%, precision Y%, ...; then trained on Subsets 1, 3-5 and tested on Subset 2 for accuracy P%, precision Q%, ...).]
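The cross-validation procedure described above can be sketched as follows, assuming a generic trainer interface standing in for LIBSVM:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class CrossValidation {
    // Hypothetical trainer: builds a binary predictor (true = CPU faster)
    // from labeled feature vectors; LIBSVM plays this role in the talk.
    interface Trainer {
        Predicate<double[]> train(List<double[]> xs, List<Boolean> ys);
    }

    // k-fold cross-validation: each sample is held out exactly once and
    // scored by a model trained on the remaining folds.
    static double accuracy(List<double[]> xs, List<Boolean> ys, Trainer t, int k) {
        int n = xs.size(), correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<double[]> trX = new ArrayList<>(); List<Boolean> trY = new ArrayList<>();
            List<double[]> teX = new ArrayList<>(); List<Boolean> teY = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == fold) { teX.add(xs.get(i)); teY.add(ys.get(i)); }
                else               { trX.add(xs.get(i)); trY.add(ys.get(i)); }
            }
            Predicate<double[]> model = t.train(trX, trY);
            for (int i = 0; i < teX.size(); i++)
                if (model.test(teX.get(i)) == teY.get(i)) correct++;
        }
        return (double) correct / n;  // accuracy over all held-out folds
    }
}
```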
20
Accuracies with cross-validation: 160 worker threads or GPU
[Figure: accuracy (%) over all 291 samples, higher is better, for cumulative feature sets:
Range 79.0%, +=nIRs 97.6%, +=dIRs 99.0%, +=Array 99.0%, ALL (+=DT) 97.2%.]
21
Precisions and Recalls with cross-validation (higher is better)

          Precision CPU   Recall CPU   Precision GPU   Recall GPU
Range     79.0%           100%         0%              0%
+=nIRs    97.8%           99.1%        96.5%           91.8%
+=dIRs    98.7%           100%         100%            95.0%
+=Arrays  98.7%           100%         100%            95.0%
ALL       96.7%           100%         100%            86.9%

All prediction models except Range rarely make a bad decision.
22
Discussion
Based on the results with 291 samples, the feature set (range, # of detailed instructions, # of array accesses) shows the best accuracy. DT does not contribute to improving the accuracy, since the DT optimizations do not make GPUs faster. Loop range, # of arithmetic operations, and # of coalesced accesses affect the decision most.
Pros and cons:
(+) Future generations of hardware can be supported easily by re-running applications: just add more training data and rebuild the prediction model.
(-) It takes time to collect training data.
23
Related Work: Java + GPU

               Lang   JIT  GPU Kernel           Device Selection
JCUDA          Java   -    CUDA                 GPU only
Lime           Lime   ✔    Override map/reduce  Static
Firepile       Scala  ✔    reduce               Static
JaBEE          Java   ✔    Override run         GPU only
Aparapi        Java   ✔    map                  Static
Hadoop-CL      Java   ✔    Override map/reduce  Static
RootBeer       Java   ✔    Override run         Not described
HJ-OpenCL      HJ     -    forall               Static
PPPJ09 (auto)  Java   ✔    For-loop             Dynamic with Regression
Our Work       Java   ✔    Parallel Stream      Dynamic with Machine Learning

None of these approaches considers Java 8 Parallel Stream APIs together with dynamic device selection driven by machine learning.
24
Conclusions
Machine-learning-based performance heuristics reach up to 99% accuracy and are a promising way to build performance heuristics.
Future work: exploration of program features (e.g. CFG); selection of the best configuration among 1 worker, 2 workers, ..., 160 workers, and GPU; parallelizing the training phase.
For more details on GPU code generation, see "Compiling and Optimizing Java 8 Programs for GPU Execution", PACT'15, October 2015.
25
Acknowledgements
Special thanks to IBM CAS, Marcel Mitran (IBM Canada), Jimmy Kwa (IBM Canada), the Habanero Group at Rice, and Yoichi Matsuyama (CMU).