MACHINE LEARNING BASED COMPILER OPTIMIZATION

Slides have been copied and adapted from 2011 SIParCS by William Petzke; Arash Ashari



Self-Tuning Compilers

Selecting a good set of compiler optimization flags is a challenging task. Why?

Microprocessor technology changes and improves rapidly, and programs vary in their static and dynamic profiles

In some cases performance can be improved beyond the maximum optimization level, e.g. -O3 (gcc) or -fast (pgi)

Self-tuning or adaptive methods can be used to optimize flag selection and thus compiler performance

Objective: generate a combination of compiler flags that minimizes run time, compile time, code size, power consumption, etc.

Example (faster than -O3)

LU kernel, run time 3.9 with:

gcc -O2 -fsched-stalled-insns=10 -finline-limit=1197 -falign-labels=1 -fno-branch-count-reg -fbranch-target-load-optimize -fcrossjumping -fno-cx-limited-range -fno-defer-pop -fdelete-null-pointer-checks -fno-early-inlining -fforce-addr -fno-function-cse -fno-gcse-sm -fif-conversion -fno-if-conversion2 -fno-modulo-sched -fno-move-loop-invariants -fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-peephole2 -freorder-blocks -fno-rerun-cse-after-loop -freschedule-modulo-scheduled-loops -fno-sched-spec-load-dangerous -fno-sched-spec-load -fno-sched-spec -fno-sched2-use-superblocks -fschedule-insns2 -fsignaling-nans -fno-single-precision-constant -fno-strength-reduce -ftree-ccp -fno-tree-ch -ftree-dce -fno-tree-fre -ftree-loop-im -ftree-loop-optimize -ftree-lrs -fno-tree-sra -ftree-vect-loop-version -fno-tree-vectorize -funroll-loops -fno-unswitch-loops

gcc -O3: run time 4.9. The tuned flags give a 20% improvement!

NP-Complete

Many program optimization problems are NP-Complete

Implications:
- Assumed intractable
- Exponential algorithms would be required to guarantee a global optimum
- Heuristics or approximation algorithms must be used
- Iterative compilation: time consuming!

cTuning Framework

cTuning is a large set of tools enabling adaptive tuning for GCC

- Open source
- Interesting project with many features
- Complex, with stability issues
- Collective Mind Framework: http://ctuning.org

Iterative Compilation

Over 2^80 possible flag combinations for GCC

In cTuning a random search algorithm is used to find good compiler optimization flag combinations:
- Each GCC flag is included in a combination with probability 0.5
- Run n > 1000 iterations per program
- The flag combination with the best performance is kept [FKMC11]

Iterative compilation provides speedup over GCC –O3 but is slow
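The random search described above can be sketched as follows. This is a minimal illustration, not cTuning's actual implementation (which is built from bash scripts); the flag list, file names, and timing method here are hypothetical.

```python
import random
import subprocess
import time

# Illustrative subset of GCC flags; the real search space covers ~2^80 combinations.
FLAGS = ["-funroll-loops", "-fno-if-conversion", "-ftree-vectorize",
         "-fomit-frame-pointer", "-fschedule-insns2"]

def random_combo(flags, p=0.5):
    """Include each flag independently with probability p (cTuning uses p = 0.5)."""
    return [f for f in flags if random.random() < p]

def search(source, n=1000):
    """Compile and run n random combinations; keep the fastest one."""
    best_combo, best_time = None, float("inf")
    for _ in range(n):
        combo = random_combo(FLAGS)
        subprocess.run(["gcc", "-O2", *combo, source, "-o", "a.out"], check=True)
        start = time.perf_counter()
        subprocess.run(["./a.out"], check=True)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_combo, best_time = combo, elapsed
    return best_combo, best_time
```

With n > 1000 compile-and-run cycles per program, the cost of this loop is exactly why iterative compilation is slow.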

Machine Learning

Learning from experience (exploitation)

The goal of using ML is to predict good optimization flags for a new (unseen) program, based on previously seen programs, while reducing the time required by iterative compilation [FKMC11]

Classification of an unseen program is performed by a learning algorithm that compares the features of the new program to those of others in the training database and selects the flags from the best match

Collective Tuning

Supervised Learning

- A training set is provided to the learning algorithm that includes inputs and known outputs or targets
- The algorithm generalizes a function from the inputs and targets in the training set
- The algorithm uses this function to determine the output for previously unseen inputs

[MAR09]

Collective Tuning

Program features and transformations (flags) collected from iterative compilation of many programs form the training set

Example features:
- ft20 = conditional branch count
- ft21 = assignment instruction count
- ft24 = instruction count

56 features are collected in total (we use 55), stored in a collective database (the training set)
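As a toy illustration of static feature extraction, the sketch below counts the three example features over a simplified instruction list. The feature names mirror the cTuning/MILEPOST ones mentioned above, but the instruction format here is entirely hypothetical; the real features are extracted from the compiler's intermediate representation.

```python
def extract_features(instructions):
    """Count toy static features over a simplified instruction list.

    ft20 = conditional branch count, ft21 = assignment instruction count,
    ft24 = total instruction count (names follow the slide; format is made up).
    """
    return {
        "ft20": sum(1 for i in instructions if i.startswith("br_cond")),
        "ft21": sum(1 for i in instructions if i.startswith("assign")),
        "ft24": len(instructions),
    }
```

A program's full feature vector (55 such counts) is what gets stored in the collective database alongside its best-known flag combination.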

Static Features

Training

Adapted from: http://www.ctuning.org/wiki/index.php/CTools:CTuningCC

Classification

Adapted from: http://www.ctuning.org/wiki/index.php/CTools:CTuningCC

KNN

K-Nearest Neighbor (KNN) algorithm

Classification is based on a majority vote of the k nearest neighbors

Algorithm:
- Calculate the Euclidean distance between the previously unseen feature vector and the features of every training example, and find the k nearest neighbors
- Assign the output of the majority class among the k nearest neighbors

cTuning classifies optimizations based on the single nearest neighbor (k = 1)

[SEG07] [MAR09]
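The two steps above fit in a few lines of code. This is a generic KNN sketch, not cTuning's implementation; the labels here stand in for flag combinations.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, training, k=1):
    """training: list of (feature_vector, label) pairs.

    Sort examples by distance to the query, take the k nearest,
    and return the majority label (cTuning uses k = 1).
    """
    neighbors = sorted(training, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With k = 1 this degenerates to "copy the flags of the single most similar training program", which is exactly the nearest-neighbor classification used here.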

2D KNN Classification Example

Assume the square and triangle classes represent two other programs

- k = 1: the green query point is assigned to the red triangle class
- k = 3: triangles win, so the query point is assigned to the red triangle class
- k = 5: squares win, so the query point is assigned to the blue square class

Image Credit: Antti Ajanki, http://en.wikipedia.org/wiki/File:KnnClassification.svg, Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)

2D Nearest Neighbor Decision Surface

- k = 1 (nearest neighbor) yields a Voronoi diagram
- All query points falling within a polygon are closer to the training example shown in that cell than to any other
- A query point is classified with the label of the training point for its cell

[MIT97] Image Credit: Mysid (SVG), Cyp (original), http://en.wikipedia.org/wiki/File:Coloured_Voronoi_2D.svg, Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)

Extension of cTuning

A new key/value-based database format was developed (as bash scripts) due to stability issues with the cTuning database

Two training databases were created from the newly gathered data:

- Fastest runs of cBench programs with the GCC and PGI compilers [cTuning]
- Fastest runs of polybench programs with the GCC and PGI compilers [http://www.cse.ohio-state.edu/~pouchet/software/polybench/]

More Extensions of cTuning

A KNN algorithm with feature normalization, implemented as a bash script, works with the databases

Scripts to parse, create the database, and perform leave-one-out cross validation

Modifications to the ctuning-cc compiler (GCC, PGI and MPI) and integration with the scripts
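Feature normalization matters because raw static features live on very different scales (an instruction count can dwarf a branch count, skewing Euclidean distances). A minimal min-max normalization sketch, in Python rather than the bash used in the project:

```python
def normalize(vectors):
    """Min-max normalize each feature column to [0, 1].

    Without this, a large-magnitude feature (e.g. total instruction count)
    would dominate the Euclidean distance used by KNN.
    """
    cols = list(zip(*vectors))          # transpose: one tuple per feature
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h != l else 0.0
              for v, l, h in zip(vec, lo, hi))
        for vec in vectors
    ]
```

A new query vector must be scaled with the same per-feature minima and maxima as the training set before distances are computed.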

Research Conclusions

For the training programs, iterative compilation reduced runtime for all compilers (gcc, pgi, mpif90)

Leave-one-out cross validation was used: leave out one program for testing and use the remaining N-1 programs for training

Speedup from the flags predicted by KNN differs from that of iterative compilation (IC), and is not always beneficial

Real validation: Shallow Water, Eulag, HD3D
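Leave-one-out cross validation as described above can be sketched generically. The structure of `programs` and the `predict` callback are illustrative assumptions, not the project's actual script interface.

```python
def leave_one_out(programs, predict):
    """For each program, train on the remaining N-1 and predict the held-out one.

    programs: dict mapping name -> (feature_vector, best_known_flags)
    predict:  classifier taking (query_features, training_pairs), e.g. a KNN
    Returns name -> (predicted_flags, actual_best_flags) for comparison.
    """
    results = {}
    for name in programs:
        training = [pair for other, pair in programs.items() if other != name]
        features, actual_flags = programs[name]
        results[name] = (predict(features, training), actual_flags)
    return results
```

Comparing predicted against actual best flags per program is how the speedup tables below were generated.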

Speedup of IC and KNN over Default cBench and polyBench - GCC

Speedup of IC and KNN over Default cBench and polyBench - PGCC

Shallow Water

The shallow water equations are a set of equations of motion that describe how a fluid flow evolves in response to rotational and gravitational accelerations of the earth, forming waves.

Improvement % over default-flag run time:

Compiler     Default run time    IC       KNN      10-KNN    All
gfortran     8.1                 2.7 %    -4 %     -3.3 %    0.6 %
pgfortran    5.2                 3 %      -36 %    1.9 %     2.7 %

Eulag

EULAG is a numerical solver for all-scale geophysical flows. The underlying anelastic equations are either solved in an EULerian (flux form), or a LAGrangian (advective form) framework.

Improvement % over default-flag run time:

Program                       Default run time    IC        KNN       10-KNN    All

gfortran
Cartinterpolv2                4.1                 2.6 %     1 %       1.8 %     1.5 %
boxstruct43n                  0.15                4.5 %     2.2 %     3 %       3 %
prizmaticlimitedjan12-blue    8                   0.8 %     0 %       0.2 %     0 %
pi318                         34.2                5.8 %     2 %       2 %       4.3 %
Eulag                         46.5                4.6 %     1.5 %     1.9 %     3.8 %

pgfortran
Cartinterpolv2                1.5                 0.9 %     0.3 %     0.3 %     0.7 %
boxstruct43n                  0.04                13.1 %    4.9 %     4.9 %     10.3 %
prizmaticlimitedjan12-blue    0.04                11.5 %    -99 %     0 %       0 %
pi318                         28.8                6.5 %     -21 %     -18 %     2.3 %
Eulag                         30.4                6.25 %    -28.5 %   -15 %     2.2 %

HD3D

HD3D is a pseudospectral three-dimensional periodic hydrodynamic / magnetohydrodynamic / Hall-MHD turbulence model. Components: fftp, fftp_mod, fftp3D, pseudospec3D_mod, pseudospec3D_hd, hd3D.para

Improvement % over default-flag run time:

Compiler    Default run time    IC       KNN     10-KNN    All
mpif90      43                  0.1 %    -4 %    -2.8 %    -2 %

Future Research Opportunities

Exploration of the feature space in an attempt to:
- prune the feature space
- extract better features or combinations of features

Using different machine learning algorithms / similarity measures

Acknowledgment

- All of you
- Consulting Services Group (CISL/CSG)
- Davide Del Vento
- SIParCS

References

[FKMC11] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, et al. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. International Journal of Parallel Programming, 39:296-327, 2011.

[MAR09] Stephen Marsland. Machine Learning An Algorithmic Perspective. Chapman & Hall/CRC, Boca Raton, FL, 2009.

[MIT97] Tom M. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[SEG07] Toby Segaran. Programming Collective Intelligence. O’Reilly, Sebastopol, CA, 2007.

Questions