TRANSCRIPT
MACHINE LEARNING BASED COMPILER OPTIMIZATION
Slides have been copied and adapted from 2011 SIParCS by William Petzke
Arash Ashari
Self-Tuning Compilers
Selecting a good set of compiler optimization flags is a challenging task. Why?
Microprocessor technology changes and improves rapidly, as do the static and dynamic profiles of programs
In some cases performance can be improved beyond the maximum optimization level, i.e. -O3 (gcc) or -fast (pgi)
Self-tuning or adaptive methods can be used to optimize compiler performance through flag selection
Objective: generate a combination of compiler flags that minimizes run time, compile time, code size, power consumption, etc.
Example (faster than -O3): LU kernel
gcc -O2 -fsched-stalled-insns=10 -finline-limit=1197 -falign-labels=1 -fno-branch-count-reg -fbranch-target-load-optimize -fcrossjumping -fno-cx-limited-range -fno-defer-pop -fdelete-null-pointer-checks -fno-early-inlining -fforce-addr -fno-function-cse -fno-gcse-sm -fif-conversion -fno-if-conversion2 -fno-modulo-sched -fno-move-loop-invariants -fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-peephole2 -freorder-blocks -fno-rerun-cse-after-loop -freschedule-modulo-scheduled-loops -fno-sched-spec-load-dangerous -fno-sched-spec-load -fno-sched-spec -fno-sched2-use-superblocks -fschedule-insns2 -fsignaling-nans -fno-single-precision-constant -fno-strength-reduce -ftree-ccp -fno-tree-ch -ftree-dce -fno-tree-fre -ftree-loop-im -ftree-loop-optimize -ftree-lrs -fno-tree-sra -ftree-vect-loop-version -fno-tree-vectorize -funroll-loops -fno-unswitch-loops
Run time: 3.9
gcc -O3 run time: 4.9, a 20% improvement!
NP-Complete
Many program optimization problems are NP-complete
Implications:
Assumed intractable: exponential-time algorithms would be required to provide global optimization
Heuristics or approximation algorithms must be used
Iterative compilation: time consuming!
cTuning Framework
cTuning is a large set of tools enabling adaptive tuning for GCC
Open source
An interesting project with many features, but complex, with stability issues
Collective Mind Framework: http://ctuning.org
Iterative Compilation
Over 2^80 possible flag combinations for GCC
In cTuning a random search algorithm is used to find good compiler optimization flag combinations:
Each GCC flag is included in a combination with probability 0.5
Run n > 1000 iterations per program
The flag combination with the best performance is kept [FKMC11]
Iterative compilation provides speedup over GCC –O3 but is slow
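The random search described above can be sketched in Python. This is a minimal sketch: the flag subset and the measure callback are illustrative placeholders, not cTuning's actual implementation, which must compile and run each program to measure it.

```python
import random

# Illustrative subset of GCC flags; the real search space spans hundreds
# of -f options (over 2^80 combinations).
FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer",
         "-fgcse", "-fcrossjumping", "-fpeephole2"]

def random_combination(flags, p=0.5):
    """Include each flag independently with probability p."""
    return [f for f in flags if random.random() < p]

def random_search(flags, measure, n_iter=1000):
    """Keep the flag combination with the lowest measured run time.

    measure(combo) is assumed to compile and run the program with the
    given flags and return its run time.
    """
    best_combo, best_time = [], float("inf")
    for _ in range(n_iter):
        combo = random_combination(flags)
        run_time = measure(combo)
        if run_time < best_time:
            best_combo, best_time = combo, run_time
    return best_combo, best_time
```

With n_iter > 1000 per program this is effective but slow, which is exactly the cost that the machine learning approach below tries to avoid.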
Machine Learning
Learning from experience (exploitation)
The goal of using ML is to predict good optimization flags for a new (unseen) program based on previously seen programs, while reducing the time required by iterative compilation [FKMC11]
Classification of an unseen program is performed by a learning algorithm that compares the features of the new program to others in the training database and selects the flags from the best match
Collective Tuning
Supervised Learning
A training set is provided to the learning algorithm that includes inputs and known outputs (targets)
The algorithm generalizes a function from the inputs and targets in the training set
The algorithm uses that function to determine the output for previously unseen inputs
[MAR09]
Collective Tuning
Program features and transformations (flags) collected from iterative compilation of many programs form the training set
Example features:
ft20 = conditional branch count
ft21 = assignment instruction count
ft24 = instruction count
56 features are collected in total (we use 55)
Stored in a collective database (the training set)
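A training-set entry pairs a program's feature vector with the best flags found for it by iterative compilation. The sketch below is illustrative only: the program name, feature values, and flag string are hypothetical, not real collective-database data.

```python
# Each training example pairs a program's extracted features (ft20, ft21,
# ft24, ... of the 55 used) with the best flag combination found by
# iterative compilation. All names and values below are hypothetical.
training_db = {
    "example_program": {
        "features": {"ft20": 12, "ft21": 340, "ft24": 1105},
        "best_flags": "-O2 -funroll-loops -fno-tree-vectorize",
    },
}

def feature_vector(entry, names=("ft20", "ft21", "ft24")):
    """Flatten the named features into a fixed-order numeric vector."""
    return [entry["features"][n] for n in names]
```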
Training
Adapted from: http://www.ctuning.org/wiki/index.php/CTools:CTuningCC
Classification
Adapted from: http://www.ctuning.org/wiki/index.php/CTools:CTuningCC
KNN
The K-Nearest Neighbor (KNN) algorithm
Classification is based on a majority vote of the k nearest neighbors
Algorithm:
Calculate the Euclidean distance between the previously unseen feature vector and the features of every training example, and find the k nearest neighbors
Assign the output of the majority class among the k nearest neighbors
cTuning optimizations are classified based on the nearest neighbor (k = 1)
[SEG07] [MAR09]
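The two-step algorithm above can be sketched in Python (a minimal sketch; in this work the actual implementation is a bash script over the training databases):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=1):
    """Majority vote among the k training examples nearest to query.

    training is a list of (feature_vector, label) pairs; with k = 1 this
    reduces to nearest-neighbor classification, as used by cTuning.
    """
    nearest = sorted(training, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Here a label would be the flag combination recorded for the best-matching training program.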
2D KNN Classification Example
Assume the square and triangle classes represent two other programs
k = 1: the green query point is assigned to the red triangle class
k = 3: triangles win, so the query point is assigned to the red triangle class
k = 5: squares win, so the query point is assigned to the blue square class
Image Credit: Antti Ajanki http://en.wikipedia.org/wiki/File:KnnClassification.svg Creative Commons Creative Commons License Deed Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
2D Nearest Neighbor Decision Surface
k = 1 (nearest neighbor): a Voronoi diagram
All query points falling within a polygon are closer to the training example shown in that cell than to any other
A query point is classified with the label of the training point for its cell
[MIT97] Image Credit: Mysid (SVG), Cyp(original) http://en.wikipedia.org/wiki/File:Coloured_Voronoi_2D.svg Creative Commons Creative Commons License Deed Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
Extension of cTuning
A new key-value based database format was developed using bash scripts, due to stability issues with the cTuning database
Two training databases were created from newly gathered data:
Fastest runs of the cBench programs with the GCC and PGI compilers [cTuning]
Fastest runs of the polybench programs with the GCC and PGI compilers [http://www.cse.ohio-state.edu/~pouchet/software/polybench/]
More Extension of cTuning
The KNN algorithm, with feature normalization, was implemented in a bash script to work with the databases
Scripts to parse, create the database, and perform leave-one-out cross validation
Modifications to the ctuning-cc compiler (GCC, PGI and MPI) and integration with the scripts
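Feature normalization keeps large-magnitude features (e.g., total instruction count) from dominating the Euclidean distance. The slides do not specify the scheme used; min-max scaling, sketched below in Python rather than bash, is one common choice.

```python
def minmax_normalize(vectors):
    """Scale each feature column to [0, 1] across the training set.

    Returns the normalized vectors plus the scaling function, so a new
    query vector can be normalized with the training set's min/max.
    """
    mins = [min(col) for col in zip(*vectors)]
    maxs = [max(col) for col in zip(*vectors)]

    def scale(v):
        # Constant columns (hi == lo) are mapped to 0.0.
        return [(x - lo) / (hi - lo) if hi > lo else 0.0
                for x, lo, hi in zip(v, mins, maxs)]

    return [scale(v) for v in vectors], scale
```

Returning the scaling function matters: an unseen program's feature vector must be normalized with the training set's ranges, not its own.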
Research Conclusions
For the training programs, iterative compilation reduced run time for all compilers (gcc, pgi, mpif90)
Leave-one-out cross validation was used: leave out one program for testing and use the remaining N-1 programs for training
Speedup from flags predicted by KNN differs from that of iterative compilation (not always beneficial)
Real validation: Shallow Water, Eulag, HD3D
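The leave-one-out procedure described above can be sketched generically, where train_and_predict is a placeholder for training KNN on the N-1 programs and predicting flags for the held-out one:

```python
def leave_one_out(examples, train_and_predict):
    """For each example, train on the remaining N-1 and evaluate on it.

    train_and_predict(training, held_out) is assumed to return whatever
    result should be recorded for the held-out example, e.g. the speedup
    obtained with the predicted flags.
    """
    results = []
    for i, held_out in enumerate(examples):
        training = examples[:i] + examples[i + 1:]
        results.append(train_and_predict(training, held_out))
    return results
```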
Shallow Water
The shallow water equations are a set of motion equations that describe how a fluid flow evolves in response to the rotational and gravitational accelerations of the earth, forming waves.
              Default-flag      Improvement %
Compiler      run time          IC        KNN       10-KNN    All
gfortran      8.1               2.7 %     -4 %      -3.3 %    0.6 %
pgfortran     5.2               3 %       -36 %     1.9 %     2.7 %
Eulag
EULAG is a numerical solver for all-scale geophysical flows. The underlying anelastic equations are solved in either an EULerian (flux form) or a LAGrangian (advective form) framework.
                                           Default-flag    Improvement %
Compiler     Program                       run time        IC        KNN       10-KNN    All
gfortran     Cartinterpolv2                4.1             2.6 %     1 %       1.8 %     1.5 %
             boxstruct43n                  0.15            4.5 %     2.2 %     3 %       3 %
             prizmaticlimitedjan12-blue    8               0.8 %     0 %       0.2 %     0 %
             pi318                         34.2            5.8 %     2 %       2 %       4.3 %
             Eulag                         46.5            4.6 %     1.5 %     1.9 %     3.8 %
pgfortran    Cartinterpolv2                1.5             0.9 %     0.3 %     0.3 %     0.7 %
             boxstruct43n                  0.04            13.1 %    4.9 %     4.9 %     10.3 %
             prizmaticlimitedjan12-blue    0.04            11.5 %    -99 %     0 %       0 %
             pi318                         28.8            6.5 %     -21 %     -18 %     2.3 %
             Eulag                         30.4            6.25 %    -28.5 %   -15 %     2.2 %
HD3D
HD3D is a pseudospectral three-dimensional periodic hydrodynamic / magnetohydrodynamic / Hall-MHD turbulence model.
Modules: fftp, fftp_mod, fftp3D, pseudospec3D_mod, pseudospec3D_hd, hd3D.para
              Default-flag      Improvement %
Compiler      run time          IC        KNN       10-KNN    All
mpif90        43                0.1 %     -4 %      -2.8 %    -2 %
Future Research Opportunities
Exploration of the feature space in an attempt to:
prune the feature space
extract better features or combinations of features
Use different machine learning algorithms / similarity measures
References
[FKMC11] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, et al. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. International Journal of Parallel Programming, 39:296-327, 2011.
[MAR09] Stephen Marsland. Machine Learning An Algorithmic Perspective. Chapman & Hall/CRC, Boca Raton, FL, 2009.
[MIT97] Tom M. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[SEG07] Toby Segaran. Programming Collective Intelligence. O’Reilly, Sebastopol, CA, 2007.