Towards a More Principled Compiler: Progressive Backend Compiler Optimization
David Koes
8/28/2006
School of Computer Science

Performance Gains Due to Compiler (gcc)
[Chart: SPEC2000 performance improvement (y-axis, 0% to 50%) over time (x-axis, Nov-95 to Nov-09)]
2.8GHz Pentium 4, 1GB RAM, -O3 …

The Future of Compiler Optimization
[Chart: SPEC2000 performance improvement (y-axis, 0% to 50%) over time (x-axis, Nov-95 to Nov-09)]
Is this possible?
Yes! 10-30% improvement just from reordering compiler phases (http://www.cs.rice.edu/~keith/Adapt/)
How do we exploit the existing optimization potential?
Need a more principled compiler

Compiler Code Size Improvement
[Chart: code size improvement (y-axis, 0% to 25%) over time (x-axis, Nov-95 to Nov-05)]

A Principled Compiler
A compiler that
– knows right from wrong (less optimal from more optimal)
– follows a rigorous procedure to get the desired output

Today’s Compiler
[Diagram: a pipeline of target-independent and target-dependent phases, with a machine description as input and an optimized program as output. Phases shown: const prop, loop unroll, GVN, strength reduct, SCCP, code motion, copy prop, inlining, DCE, PRE, peephole, insn select, reg alloc, insn sched, branch opt, …]
Problems
– some phases not internally optimal
• purely heuristic solution
– machine description mostly ignored
– lack of integration between phases

Ideal Compiler
[Diagram: phases grouped together with the machine description, producing an optimized program. Phases shown: copy prop, loop unroll, DCE, PRE, const prop, code motion, inline, GVN, strength reduct, peephole, CSE, SCCP, reg alloc, branch opt, insn select, …]
– each phase locally optimal
– makes full use of machine description
– tight integration between phases
Absolutely no idea how to do this or if it’s even possible

Towards a More Principled Compiler
[Diagram: phases grouped together with the machine description, producing an optimized program, with insn select and reg alloc shown apart from the rest. Other phases shown: copy prop, loop unroll, DCE, PRE, const prop, code motion, inline, GVN, strength reduct, peephole, CSE, SCCP, branch opt, …]
– each phase locally optimal
– makes full use of machine description
– tight integration between phases

Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline

Register Allocation Problem
…
v = 1
w = v + 3
x = w + v
u = v
t = u + x
print(x);
print(w);
print(t);
print(u);
…
[Diagram: a register allocator maps an unbounded number of program variables onto a limited number of processor registers (eax, ebx, ecx, edx, esi, edi, ebp, esp) plus slow memory]
– spill code optimization
– memory operands
– register preferences
– rematerialization
– live range splitting
Related Work
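A minimal sketch of why the example block puts pressure on registers, assuming each variable is live from its definition to its last use. The encoding of the block and the helper names `live_ranges` and `max_pressure` are illustrative and do not come from the slides.

```python
# Straight-line block from the slide as (defined variable, used variables).
BLOCK = [
    ("v", []),          # v = 1
    ("w", ["v"]),       # w = v + 3
    ("x", ["w", "v"]),  # x = w + v
    ("u", ["v"]),       # u = v
    ("t", ["u", "x"]),  # t = u + x
    (None, ["x"]),      # print(x)
    (None, ["w"]),      # print(w)
    (None, ["t"]),      # print(t)
    (None, ["u"]),      # print(u)
]


def live_ranges(block):
    """Map each variable to (index of its definition, index of its last use)."""
    ranges = {}
    for i, (dst, uses) in enumerate(block):
        for var in uses:
            ranges[var] = (ranges[var][0], i)
        if dst is not None:
            ranges[dst] = (i, i)
    return ranges


def max_pressure(ranges, n_insns):
    """Largest number of variables live just after any instruction."""
    return max(
        sum(1 for start, end in ranges.values() if start <= i < end)
        for i in range(n_insns)
    )


RANGES = live_ranges(BLOCK)
PRESSURE = max_pressure(RANGES, len(BLOCK))
print(RANGES)
print(PRESSURE)
```

Whenever the pressure exceeds the number of usable registers, some value has to be spilled, rematerialized, or accessed as a memory operand, which is where the concerns listed on the slide come in.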

Register Allocation Previous Work
Method | Expressive | Fast | Optimal
Linear Scan
Graph Coloring
Integer Linear Programming
Partitioned Boolean Quadratic Programming
[Table: the per-method marks in the Expressive, Fast, and Optimal columns did not survive in this transcript]

Instruction Selection Problem
[Diagram: an instruction selector turns an IR representation into assembly (Assem) by finding a minimum cost tiling]
movl (p),t1
leal (x,t1),t2
leal 1(y),t3
leal (t2,t3),r
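Minimum cost tiling can be sketched with dynamic programming over an expression tree. The tile set, the unit costs, and the tuple encoding of the IR below are assumptions made for illustration; the sketch handles trees only, not DAGs, and is not the tiler used in this work.

```python
from functools import lru_cache

WILD = "_"  # wildcard: matches any subtree, which must then be tiled itself

# (pattern, cost, assembly template); unit costs are hypothetical.
TILES = [
    (("reg",), 0, None),                      # value already in a register
    (("const",), 1, "movl $c,r"),
    (("load", WILD), 1, "movl (a),r"),
    (("add", WILD, WILD), 1, "leal (a,b),r"),
    (("add", WILD, ("const",)), 1, "leal c(a),r"),
]


def match(pattern, tree):
    """Return the subtrees bound to wildcards, or None if there is no match."""
    if pattern == WILD:
        return [tree]
    if pattern[0] != tree[0]:
        return None
    if pattern[0] in ("reg", "const"):
        return []  # leaf kinds match regardless of their payload
    if len(pattern) != len(tree):
        return None
    bound = []
    for sub_pattern, sub_tree in zip(pattern[1:], tree[1:]):
        sub = match(sub_pattern, sub_tree)
        if sub is None:
            return None
        bound.extend(sub)
    return bound


@lru_cache(maxsize=None)
def tile(tree):
    """Return (minimum cost, instructions in emission order) for a tree."""
    best = None
    for pattern, cost, asm in TILES:
        bound = match(pattern, tree)
        if bound is None:
            continue
        total, code = cost, ()
        for sub in bound:
            sub_cost, sub_code = tile(sub)
            total += sub_cost
            code += sub_code
        if asm is not None:
            code += (asm,)
        if best is None or total < best[0]:
            best = (total, code)
    if best is None:
        raise ValueError(f"no tile matches {tree!r}")
    return best


# r = (x + *p) + (y + 1)
TREE = ("add",
        ("add", ("reg", "x"), ("load", ("reg", "p"))),
        ("add", ("reg", "y"), ("const", 1)))
COST, CODE = tile(TREE)
print(COST, CODE)
```

With these assumed tiles the cheapest cover uses four instructions, the same shapes as the `movl`/`leal` sequence on the slide; the `leal c(a),r` tile wins for `y + 1` because it absorbs the constant.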

Instruction Selection Previous Work
Method | DAG Tiling | Register Allocation Aware | Fast | Optimal
Dynamic Programming
Binate Covering
Peephole Based Instruction Selection
AVIV Code Generator
Exhaustive Search
[Table: the per-method marks in the four property columns did not survive in this transcript]

Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline

A More Principled Register Allocator
– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture
– optimal solutions
[Diagram: reg alloc connected to the machine description]
Completed Work

Multi-commodity Network Flow: An Expressive Model
Given network (directed graph) with
– cost and capacity on each edge
– sources & sinks for multiple commodities
Find lowest cost flow of commodities
NP-complete for integer flows
Example: edges have unit capacity
[Diagram: commodities a and b routed from their sources to their sinks over edges of cost 0 and 1]
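For reference, the problem on this slide is commonly written as the integer program below. The notation is an assumption added here rather than taken from the slides: $x^k_{ij}$ is the flow of commodity $k$ on edge $(i,j)$, $c^k_{ij}$ is its cost, $u_{ij}$ is the edge capacity, and $b^k_i$ is $+1$ at the source of $k$, $-1$ at its sink, and $0$ elsewhere.

```latex
\begin{align*}
\min_{x}\quad & \sum_{k}\sum_{(i,j)\in E} c^{k}_{ij}\,x^{k}_{ij} \\
\text{s.t.}\quad
& \sum_{j:(i,j)\in E} x^{k}_{ij} - \sum_{j:(j,i)\in E} x^{k}_{ji} = b^{k}_{i}
  && \forall i \in V,\ \forall k \\
& \sum_{k} x^{k}_{ij} \le u_{ij} && \forall (i,j)\in E \\
& x^{k}_{ij} \in \{0,1\} && \forall (i,j)\in E,\ \forall k
\end{align*}
```

The second constraint is the one that couples the commodities: without it each commodity could be routed independently along its own cheapest path.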

Register Allocation as a MCNF
Register allocation → MCNF
Variables → Commodities
Variable Definition → Source
Variable Last Use → Sink
Allocation Classes (Reg/Mem/Const) → Nodes
Register Limits → Node Capacities
Spill Costs → Edge Costs
Allocation → Flow
[Diagram: flow network for a variable a, with layers of allocation-class nodes r0, r1, mem, and the constant 1]

Example
Source Code
int example(int a, int b)
{
  int d = 1;
  int c = a - b;
  return c+d;
}
Pre-alloc Assembly
MOVE 1 -> d
SUB a,b -> c
ADD c,d -> c
MOVE c -> r0
[Diagram labels: insn pref cost, mem access cost, load cost]

Control Flow
MCNF can only represent straight-line code
– need to link together networks from basic blocks
Extend MCNF model with merge and split nodes to implement boundary constraints.
[Diagram: variable a held in %eax or in mem on the edges between basic blocks]
Details in proposal document, along with modeling persistence of values in memory.

A Better Register Allocator
– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
• NP-hard, so use progressive solution technique
[Diagram: reg alloc connected to the machine description]

Progressive Solution Technique
Quickly find a good allocation
Then progressively find better allocations
– until optimal allocation found
– or time limit is reached
[Chart: allocation quality (y-axis) versus compile time (x-axis)]
Technique: Lagrangian relaxation directed allocators
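The control loop behind a progressive solver can be sketched as follows. This is a generic anytime loop under stated assumptions, not the allocator from this work: `initial`, `improve`, and `lower_bound` are hypothetical callables standing in for the initial heuristic, one improvement iteration, and a bound that proves optimality.

```python
import time


def progressive_solve(initial, improve, lower_bound, time_limit_s):
    """Return the best (cost, solution) found before optimality or timeout."""
    best_cost, best = initial()  # quickly find a good solution
    deadline = time.monotonic() + time_limit_s
    iteration = 0
    # Stop once the cost meets the lower bound (proven optimal) or time runs out.
    while best_cost > lower_bound() and time.monotonic() < deadline:
        iteration += 1
        cost, candidate = improve(iteration)
        if cost < best_cost:
            best_cost, best = cost, candidate
    return best_cost, best


# Toy stand-ins: each iteration proposes a candidate with a known cost.
CANDIDATE_COSTS = [9, 7, 8, 5]


def toy_initial():
    return 10, "heuristic"


def toy_improve(i):
    return CANDIDATE_COSTS[i - 1], f"iter{i}"


def toy_bound():
    return 5


RESULT = progressive_solve(toy_initial, toy_improve, toy_bound, time_limit_s=60)
NO_TIME = progressive_solve(toy_initial, toy_improve, toy_bound, time_limit_s=0)
print(RESULT, NO_TIME)
```

With a zero time budget the loop never runs and the initial heuristic solution is returned unchanged, which is the property that lets compile time be traded for allocation quality.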

Lagrangian Relaxation: Intuition
Relaxes the hard constraints
– only have to solve single commodity flow
Combines easy subproblems using a Lagrangian multiplier (price)
– an additional price on each edge
– a price on each split/merge node
Example: edges have unit capacity
[Diagram: commodities a and b compete for an edge of cost 0 next to an edge of cost 1; adding a price of 1 to the cheaper edge gives costs 0+1 and 1]
With price, solution to single commodity flow can be solution to multicommodity flow
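One common way to write this relaxation is shown below. The symbols are assumptions added here: $x^k_{ij}$ is the flow of commodity $k$ on edge $(i,j)$, $c^k_{ij}$ its cost, $u_{ij}$ the edge capacity, and $w_{ij} \ge 0$ the price on the edge. Only edge prices are shown; the prices on split/merge nodes are omitted.

```latex
\begin{align*}
L(w) \;=\; \min_{x}\quad
  & \sum_{k}\sum_{(i,j)\in E} \bigl(c^{k}_{ij} + w_{ij}\bigr)\,x^{k}_{ij}
    \;-\; \sum_{(i,j)\in E} w_{ij}\,u_{ij} \\
\text{s.t.}\quad
  & \text{each commodity } k \text{ flows from its source to its sink},
    \quad x^{k}_{ij} \in \{0,1\}
\end{align*}
```

Because the capacity constraint now appears only through the prices, the minimization splits into one shortest-path problem per commodity, and $L(w)$ is a lower bound on the optimal cost for any $w \ge 0$. In the slide's example, a price of 1 on the cost-0 edge makes both edges cost 1, so $L(w) = 2 \cdot 1 - 1 = 1$, which equals the cost of sending one commodity over each edge.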

Solution Procedure
Compute prices with iterative subgradient optimization
– guaranteed to converge to optimal prices
– optimal for linear relaxation
At each iteration, construct a feasible integer solution using current prices
– iterative allocator (in document)
– simultaneous allocator
– trace-based simultaneous allocator
[Diagram: the a/b example over successive iterations, with the priced edge shown as 0+1+1 and then 0+1, next to the edge of cost 1]
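The procedure can be sketched on the two-edge example from the slides. This is a toy under stated assumptions: each commodity uses exactly one of two parallel edges, the step size is 1/t, and a simple greedy pass stands in for the allocators that build a feasible solution. The names `subgradient` and `greedy_feasible` are illustrative.

```python
def greedy_feasible(costs, capacity, commodities, prices):
    """Give each commodity the cheapest priced edge that still has capacity."""
    remaining = dict(capacity)
    assignment = {}
    for k in commodities:
        edge = min((e for e in costs if remaining[e] > 0),
                   key=lambda e: costs[e] + prices[e])
        assignment[k] = edge
        remaining[edge] -= 1
    return assignment


def subgradient(costs, capacity, commodities, iterations=50):
    """Return (final prices, best feasible assignment, its cost)."""
    prices = {e: 0.0 for e in costs}
    best_cost, best = float("inf"), None
    for t in range(1, iterations + 1):
        # Relaxed problem: every commodity independently takes the
        # cheapest edge under the current prices, ignoring capacity.
        flow = {e: 0 for e in costs}
        for _ in commodities:
            flow[min(costs, key=lambda e: costs[e] + prices[e])] += 1
        # Feasible integer solution built from the current prices.
        assignment = greedy_feasible(costs, capacity, commodities, prices)
        cost = sum(costs[e] for e in assignment.values())
        if cost < best_cost:
            best_cost, best = cost, assignment
        # Subgradient step: raise prices on overused edges and lower
        # them (never below zero) on underused ones.
        step = 1.0 / t
        for e in costs:
            prices[e] = max(0.0, prices[e] + step * (flow[e] - capacity[e]))
    return prices, best, best_cost


COSTS = {"e0": 0, "e1": 1}     # edge costs from the slide's example
CAPACITY = {"e0": 1, "e1": 1}  # unit capacity
PRICES, BEST, BEST_COST = subgradient(COSTS, CAPACITY, ["a", "b"])
print(PRICES, BEST, BEST_COST)
```

In this toy the relaxed solution keeps sending both commodities over the same edge while the price difference between the edges settles near 1, so the feasible solution has to come from the separate construction step.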

Simultaneous Allocator
[Animated diagram: allocation network with rejected choices crossed out (X); the running "Current cost" label shows -1, -3, -2]
Edges to/from memory cost 3

Trace-Based Allocation
Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace presents difficulty (addressed in proposal document)

Evaluation
Implemented in gcc 3.4.4 targeting x86
Optimize for code size
– perfect static evaluation
– important metric in its own right
MediaBench, MiBench, Spec95, Spec2000
– over 10,000 functions

Progressiveness
[Chart: squareEncrypt]

Progressiveness
[Chart: quicksort]

Code Size
[Chart: average code size improvement over graph allocator (y-axis, 0% to 8%) for initial heuristics only, 10 iterations, 100 iterations, 1000 iterations, and the default allocator]
Progressive!

Optimality
[Chart: percent of functions (y-axis, 0% to 100%) after 1, 10, 100, and 1000 iterations, broken down by distance from optimal: optimal, <=1%, <=5%, <=10%, <=25%, and >25% from optimal]
Proven optimality

Compile Time Slowdown :-(
9.2x slower

A Better Register Allocator
– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
• approach optimality using progressive solution technique: Lagrangian directed allocators
[Diagram: reg alloc connected to the machine description]

Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline

A Better Better Register Allocator
Solver Improvements
– Improve initial solution
– Improve quality as prices converge
– Hope to prove approximation bounds
Model Improvements
– Improve accuracy of model
– Model simplification
– Represent uniform register sets efficiently
Proposed Work

Model Simplification
Summarize overly expressive sections of the model
Conservative simplification: does not change optimal value
Aggressive simplification: explore tradeoff between model complexity and optimality

Instruction Selection Interaction
[Diagram: alternative instructions that perform same operation]
Which instruction is best depends on the register allocator, so let register allocator decide

Register Allocation Aware Instruction SElection (RA2ISE)
Instruction selection not finalized until register allocation
IR tiled with Register Allocation Aware Tiles (RAATs)
A RAAT represents several instruction sequences
– different costs
– a sequence for every possible register allocation

RA2ISE
[Diagram: IR → tiling → RAAT → model creation → register allocation → cwtl %eax]

Implementing RA2ISE
Add side-constraints to Global MCNF model
– implement inter-variable preferences and constraints
• “if x allocated to r1 and y allocated to r2, then save three bytes”
• “x and y must be allocated to the same register”
Implement x86 RAATs
– RAAT tables created manually
– GMCNF RAAT representation automatically generated from RAAT table with minimum use of side constraints
Algorithms for tiling RAATs
– leverage existing algorithms
– exploit feedback between passes

Tiling RAATs
[Diagram: numbered IR nodes are tiled, the tiling is register allocated (eax, edx, mem), and the allocation result is fed back to tiling]

Evaluation
Implement in production quality compiler (gcc)
Evaluate code size and simple code speed metric
Evaluate on three different architectures
– x86 (8 registers)
– 68k/ColdFire (16 registers)
– PPC (32 registers)

Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline

Contributions
RA2ISE
– register allocation aware tiles (RAATs) explicitly encode effect of register allocation on instruction sequence
– algorithms for tiling RAATs
– expressive model of register allocation that operates on RAATs and explicitly represents all important components of register allocation
– progressive solver for this model that can quickly find decent solution and approaches optimality as more time is allowed for compilation
Comprehensive evaluation of RA2ISE

Thesis Statement
RA2ISE is a principled and effective system for performing instruction selection and register allocation.

One Step Towards a More Principled Compiler
[Diagram: phases grouped together with the machine description, producing an optimized program, with insn select and reg alloc shown apart from the rest. Other phases shown: copy prop, loop unroll, DCE, PRE, const prop, code motion, inline, GVN, strength reduct, peephole, CSE, SCCP, branch opt, …]

Timeline
Fall 2006
– add simple speed metric option to model
– begin model simplification work
– improve model accuracy and solver performance
Winter 2006
– finish model simplification work
– add side-constraints to model
– implement existing gcc tiles as RAATs
– improve model accuracy and solver performance
Spring 2007
– finish implementation of side-constraints and gcc RAATs
– begin work on RA2ISE infrastructure
– create gcc-independent set of RAATs for x86
– improve model accuracy and solver performance
Summer 2007
– finish work on RA2ISE
– investigate and develop tiling algorithms
– improve model accuracy and solver performance
Fall 2007
– add 68k/ColdFire and PowerPC targets
– investigate uniform register set simplifications
– improve model accuracy and solver performance
Winter 2007
– begin writing thesis
– work on improving compile time performance
Spring 2008
– finish writing thesis
Andrew Richard Koes

Questions?

Processor Performance
[Chart: SPEC2000 performance improvement (y-axis, 0% to 7000%) over time (x-axis, Nov-95 to May-06). Series: performance w/o compiler improvements, double every 24 months, double every 18 months]

Instruction Selection & Register Allocation
[Diagram: insn select and reg alloc connected to the machine description]
– fully utilize machine description
– locally optimal
– tight integration between phases

Costs of Register Allocation
Spilling to/from memory
  movl 8(%ebp), %edx
Direct memory access
  addl 8(%ebp), %eax
Moving between registers
  movl %edx,%ecx
Rematerialization of constant value
  movl $3,%eax
Register usage preferences
  imul %edx,%eax vs. imul %edx,%ecx

Iterative Heuristic Allocator
Allocate each variable in a heuristic priority order
Find shortest path in each block
– avoid edges that make remaining problem infeasible
Process blocks in topological order
– allocation at block entry fixed by previous blocks
Intuition:
– shortest path is minimum cost allocation for a variable
– allocate most significant variables first
Limitation:
– greedy: can’t undo poor decisions
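The shortest-path step for a single variable can be sketched as a layered dynamic program over the program points of one block. The cost model is a simplification assumed here: moves to or from memory cost 3 (as stated on the slides), a register-to-register move costs 1, and being in memory costs `mem_access_cost` at every point. Locations already taken by earlier variables are passed in as `occupied`; memory is treated as unbounded.

```python
INF = float("inf")
LOCATIONS = ("r0", "r1", "mem")


def move_cost(src, dst):
    """Cost of changing location between two program points."""
    if src == dst:
        return 0
    if "mem" in (src, dst):
        return 3  # slides: edges to/from memory cost 3
    return 1      # assumed cost of a register-to-register move


def cheapest_allocation(occupied, mem_access_cost=1):
    """Shortest path for one variable through a block.

    occupied[i] is the set of locations already taken at program point i.
    Returns (cost, [location at each point]).
    """
    def point_cost(i, loc):
        if loc in occupied[i]:
            return INF  # blocked by a previously allocated variable
        return mem_access_cost if loc == "mem" else 0

    # best[loc] = (cost of the cheapest path ending in loc, that path)
    best = {loc: (point_cost(0, loc), [loc]) for loc in LOCATIONS}
    for i in range(1, len(occupied)):
        nxt = {}
        for loc in LOCATIONS:
            prev = min(LOCATIONS,
                       key=lambda p: best[p][0] + move_cost(p, loc))
            cost = best[prev][0] + move_cost(prev, loc) + point_cost(i, loc)
            nxt[loc] = (cost, best[prev][1] + [loc])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# r1 is taken throughout and r0 is taken at the middle point.
OCCUPIED = [{"r1"}, {"r0", "r1"}, {"r1"}]
print(cheapest_allocation(OCCUPIED))
```

In this example, keeping the variable in memory for the whole block (cost 3) beats spilling around the busy point and reloading (cost 7). Because earlier variables only appear as fixed `occupied` sets, the sketch also shows the greedy limitation named on the slide: a later variable cannot undo their choices.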

Iterative Heuristic Allocator
Allocation order: a, b, c, d
Cost: a = 0, b = 4, c = 0, d = -2
Total: 2
Edges to/from memory cost 3

Simultaneous Allocator
Scan each block
– maintain an allocation of all live variables
– at variable definition find cheapest allocation
• allocation with shortest path to variable’s sink or block exit
• allowed to evict (reallocate) already allocated variable
– eviction cost: shortest path to edge from current allocation to new allocation in this block
– cost of eviction added to shortest path cost
Intuition:
– minimizing cost for all variables at once
Limitation:
– path computations limited to single block
– future blocks do not change previous block allocations

Trace-Based Allocation
Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace
• update only blocks that are necessary (easy-update)
• update all affected blocks (full-update)
[Diagrams: easy-update, full-update]

Accuracy of the Model
Global MCNF model correctly predicts costs of register allocation within 2% for 72.5% of functions compiled
[Chart: percent of functions (y-axis, 0% to 60%) by percent difference between predicted and actual size, in bins of 0%, 0-1%, 1-2%, 2-3%, 3-5%, 5-10%, and >10% on each side: larger than predicted (under-predicted) and smaller than predicted (over-predicted)]

Compile Time Slowdown :-(
10x slower
[Chart: factor slower than graph allocation (y-axis, 0 to 50) per benchmark, 099.go through susan, plus geo. mean. Series: Initialization, Initial Iterative Allocation, Initial Simultaneous Allocation, One Iteration]

Code Size Improvement
[Chart: code size improvement over graph allocator (y-axis, 0% to 8%) for initial heuristics only, 10 iterations, 100 iterations, 1000 iterations, and the default allocator. Series: without traces, with easy traces, with full traces]

Code Size Improvement
[Chart: code size improvement over graph allocator (y-axis, -5% to 20%) per benchmark, 099.go through susan, plus average. Series: initial heuristics only, 10 iterations, 100 iterations, 1000 iterations]

Code Size Improvement
[Chart: code size improvement over default allocator (y-axis, -6% to 12%) per benchmark, 099.go through susan, plus average. Series: initial heuristic only, 10 iterations, 100 iterations, 1000 iterations]

Code Performance

Integrating Register Allocation and Instruction Selection
int foo(int a, short b) { return a*4+b; }
Alternative 1:
4 movl 4(%esp),%eax
3 sall $2,%eax
4 addl 8(%esp),%eax
1 cwtl
1 ret
Alternative 2:
5 movswl 8(%esp),%edx
4 movl 4(%esp),%eax
3 leal (%edx,%eax,4),%eax
1 ret
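A small check of the two sequences, assuming the leading numbers are encoded instruction sizes in bytes (the slide does not label them, but the values are consistent with x86 encodings). The helper names are illustrative.

```python
import re

# (assumed size in bytes, instruction) for each alternative on the slide.
SEQ_1 = [
    (4, "movl 4(%esp),%eax"),
    (3, "sall $2,%eax"),
    (4, "addl 8(%esp),%eax"),
    (1, "cwtl"),
    (1, "ret"),
]
SEQ_2 = [
    (5, "movswl 8(%esp),%edx"),
    (4, "movl 4(%esp),%eax"),
    (3, "leal (%edx,%eax,4),%eax"),
    (1, "ret"),
]


def total_size(seq):
    """Sum of the per-instruction sizes."""
    return sum(size for size, _ in seq)


def registers_named(seq):
    """Registers written out in the operands, stack pointer excluded."""
    names = set()
    for _, insn in seq:
        names.update(re.findall(r"%(\w+)", insn))
    return names - {"esp"}


for label, seq in (("alternative 1", SEQ_1), ("alternative 2", SEQ_2)):
    print(label, total_size(seq), sorted(registers_named(seq)))
```

Under that reading both alternatives total 13 bytes, but the second one names an extra register (%edx), so which is preferable depends on how much register pressure surrounds the code.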

Another RAAT