towards a more principled compiler: progressive backend compiler optimization

School of Computer Science

Towards a More Principled Compiler:

Progressive Backend Compiler Optimization

David Koes

8/28/2006

Performance Gains Due to Compiler (gcc)

[Chart: SPEC2000 Performance Improvement, 0% to 50%, Nov-95 through Nov-09. 2.8GHz Pentium 4, 1GB RAM, -O3 …]


The Future of Compiler Optimization

[Chart: same axes, 0% to 50%, Nov-95 through Nov-09]

Is this possible?

Yes! 10-30% improvement just from reordering compiler phases
http://www.cs.rice.edu/~keith/Adapt/

How do we exploit the existing optimization potential?

Need a more principled compiler


Compiler code size improvement

[Chart: code size improvement, 0% to 25%, Nov-95 through Nov-05]


A Principled Compiler

A compiler that
– knows right from wrong (less optimal from more optimal)
– follows a rigorous procedure to get the desired output


Today’s Compiler

[Diagram: a sequence of target-independent and target-dependent phases (const prop, loop unroll, GVN, strength reduct, SCCP, code motion, copy prop, inlining, DCE, PRE, peephole, …, insn select, reg alloc, insn sched, branch opt) producing an optimized program, with the machine description as an input]

Problems
– some phases not internally optimal
• purely heuristic solution
– machine description mostly ignored
– lack of integration between phases


Ideal Compiler

[Diagram: the phases (copy prop, loop unroll, DCE, PRE, const prop, code motion, inline, GVN, strength reduct, peephole, CSE, SCCP, reg alloc, branch opt, insn select, …), the machine description, and the optimized program]

– each phase locally optimal
– makes full use of machine description
– tight integration between phases

Absolutely no idea how to do this or if it’s even possible


Towards a More Principled Compiler

[Diagram: the phases (copy prop, loop unroll, DCE, PRE, const prop, code motion, inline, GVN, strength reduct, peephole, CSE, SCCP, branch opt, …), the machine description, and the optimized program, with insn select and reg alloc shown separately]

– each phase locally optimal
– makes full use of machine description
– tight integration between phases


Outline

I. Motivation

II. Related Work

III. Completed Work

IV. Proposed Work

V. Contributions & Timeline


Register Allocation Problem

…

v = 1

w = v + 3

x = w + v

u = v

t = u + x

print(x);

print(w);

print(t);

print(u);

…
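As a rough illustration of why even this short snippet stresses a small register file, the following Python sketch runs a backward liveness scan over the straight-line code above and counts how many variables are live at once. The tuple encoding of the statements and the names `live_sets` and `max_pressure` are assumptions made for this sketch, and it assumes nothing is live after the last statement shown.

```python
# Each statement is (defined variable or None, list of used variables).
# This encoding of the example is an assumption made for illustration.
PROGRAM = [
    ("v", []),          # v = 1
    ("w", ["v"]),       # w = v + 3
    ("x", ["w", "v"]),  # x = w + v
    ("u", ["v"]),       # u = v
    ("t", ["u", "x"]),  # t = u + x
    (None, ["x"]),      # print(x)
    (None, ["w"]),      # print(w)
    (None, ["t"]),      # print(t)
    (None, ["u"]),      # print(u)
]


def live_sets(program):
    """Backward scan: return the set of variables live just before each statement."""
    live = set()
    result = [None] * len(program)
    for i in range(len(program) - 1, -1, -1):
        defined, used = program[i]
        live.discard(defined)   # a definition ends the live range going backward
        live.update(used)       # a use makes the variable live before this point
        result[i] = set(live)
    return result


def max_pressure(program):
    """Largest number of simultaneously live variables."""
    return max(len(s) for s in live_sets(program))


print(max_pressure(PROGRAM))
```

Under these assumptions the peak occurs just before `print(x)`, where `x`, `w`, `t`, and `u` are all still needed.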

[Diagram: register allocator]

unbounded number of program variables

limited number of processor registers + slow memory
(eax, ebx, ecx, edx, esi, edi, ebp, esp)

– spill code optimization
– memory operands
– register preferences
– rematerialization
– live range splitting

Related Work


Register Allocation Previous Work

Columns: Method, Expressive, Fast, Optimal
– Linear Scan
– Graph Coloring
– Integer Linear Programming
– Partitioned Boolean Quadratic Programming / /

[Table: cell marks not preserved]


Instruction Selection Problem

[Diagram: IR → instruction selector → Assembly]

IR Representation

minimum cost tiling?

movl (p),t1
leal (x,t1),t2
leal 1(y),t3
leal (t2,t3),r



Instruction Selection Previous Work

Columns: Method, DAG Tiling, Register Allocation Aware, Fast, Optimal
– Dynamic Programming
– Binate Covering
– Peephole Based Instruction Selection
– AVIV Code Generator
– Exhaustive Search

[Table: cell marks not preserved]



Outline

I. Motivation

II. Related Work

III. Completed Work

IV. Proposed Work



A More Principled Register Allocator

– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture
– optimal solutions

[Diagram: reg alloc, machine description]

Completed Work


Multi-commodity Network Flow: An Expressive Model

Given network (directed graph) with
– cost and capacity on each edge
– sources & sinks for multiple commodities

Find lowest cost flow of commodities

NP-complete for integer flows

Example: edges have unit capacity

[Figure: commodities a and b; two edges with costs 0 and 1]
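A minimal Python sketch of the unit-capacity example may help: two commodities, `a` and `b`, each need one of two parallel edges with costs 0 and 1. The edge names, the path lists, and the function `min_cost_integer_flow` are hypothetical, and the exhaustive search stands in for a real solver only because the instance is tiny; it is not the method used in this work.

```python
from itertools import product

# Hypothetical encoding of the two-edge example: both commodities cross
# from the top of the network to the bottom over one of these edges.
EDGES = {
    "cheap": {"cost": 0, "capacity": 1},
    "dear": {"cost": 1, "capacity": 1},
}

# Candidate paths (lists of edge names) for each commodity.
PATHS = {
    "a": [["cheap"], ["dear"]],
    "b": [["cheap"], ["dear"]],
}


def min_cost_integer_flow(edges, paths):
    """Try one path per commodity; keep the cheapest choice that respects capacity."""
    names = sorted(paths)
    best = None
    for choice in product(*(paths[n] for n in names)):
        load = {e: 0 for e in edges}
        for path in choice:
            for e in path:
                load[e] += 1
        if any(load[e] > edges[e]["capacity"] for e in edges):
            continue  # two commodities cannot share a unit-capacity edge
        cost = sum(edges[e]["cost"] * load[e] for e in edges)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best


print(min_cost_integer_flow(EDGES, PATHS))
```

Only one commodity can take the free edge, so the other pays 1. The search enumerates every combination of paths, which is only viable at this scale and is consistent with the integer problem being NP-complete.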



Register Allocation as a MCNF

Variables → Commodities
Variable Definition → Source
Variable Last Use → Sink
Nodes → Allocation Classes (Reg/Mem/Const)
Register Limits → Node Capacities
Spill Costs → Edge Costs
Allocation → Flow

[Figure: example network for variable a with nodes r0, r1, mem, 1]



Example

Source Code
int example(int a, int b)
{
  int d = 1;
  int c = a - b;
  return c+d;
}

Pre-alloc Assembly
MOVE 1 -> d
SUB a,b -> c
ADD c,d -> c
MOVE c -> r0

[Figure labels: insn pref cost, mem access cost, load cost]



Control Flow

MCNF can only represent straight-line code
– need to link together networks from basic blocks

Extend MCNF model with merge and split nodes to implement boundary constraints.

[Figure: boundary constraints on variable a, allocated to %eax or mem]

Details in proposal document, along with modeling persistence of values in memory.



A Better Register Allocator

– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
• NP-hard, so use progressive solution technique

[Diagram: reg alloc, machine description]



Progressive Solution Technique

Quickly find a good allocation

Then progressively find better allocations
– until optimal allocation found
– or time limit is reached

[Chart: Allocation Quality vs. Compile Time]

Technique: Lagrangian relaxation directed allocators
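The control structure of a progressive solver can be sketched in a few lines of Python. Everything here is a stand-in: `improvements` represents the stream of candidate allocation costs a real solver would produce, `lower_bound` represents a proven bound on the optimal cost, and the function name `progressive_solve` is hypothetical.

```python
import time


def progressive_solve(initial_cost, improvements, lower_bound, time_limit_s):
    """Return the best cost found, stopping at proven optimality or the time limit.

    `improvements` is any iterable of candidate costs produced over time and
    `lower_bound` is a proven bound on the optimal cost; both are assumptions
    standing in for what a real solver would supply.
    """
    deadline = time.monotonic() + time_limit_s
    best = initial_cost                  # quickly found initial allocation
    for cost in improvements:
        if best == lower_bound or time.monotonic() >= deadline:
            break                        # optimal, or out of compile time
        best = min(best, cost)           # never get worse than before
    return best


print(progressive_solve(10, [9, 12, 7, 5, 4], lower_bound=5, time_limit_s=60.0))
```

The key property is that the result is usable at any point: stopping early returns the best allocation seen so far rather than nothing.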



Lagrangian Relaxation: Intuition

Relaxes the hard constraints
– only have to solve single commodity flow

Combines easy subproblems using a Lagrangian multiplier (price)
– an additional price on each edge
– a price on each split/merge node

Example: edges have unit capacity

[Figure: commodities a and b; edge costs 0 and 1, then 0+1 and 1 once a price is added]

With price, solution to single commodity flow can be solution to multicommodity flow



Solution Procedure

Compute prices with iterative subgradient optimization
– guaranteed to converge to optimal prices
– optimal for linear relaxation

At each iteration, construct a feasible integer solution using current prices
– iterative allocator (in document)
– simultaneous allocator
– trace-based simultaneous allocator

[Figure: commodities a and b; edge costs 0+1+1 and 1, then 0+1 and 1]
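The price update can be sketched on the two-edge, unit-capacity example, where commodities `a` and `b` each need one of two edges with costs 0 and 1. This is a minimal illustration, not the allocator's implementation: the edge names, the `1/k` step size, and the function names are assumptions, and the sketch only tracks the Lagrangian lower bound without constructing a feasible integer solution.

```python
COSTS = {"cheap": 0.0, "dear": 1.0}   # edge costs from the two-edge example
CAPACITY = {"cheap": 1, "dear": 1}    # unit capacities
COMMODITIES = ["a", "b"]


def solve_relaxed(prices):
    """With capacities relaxed, each commodity takes the lowest cost-plus-price edge."""
    pick = min(COSTS, key=lambda e: COSTS[e] + prices[e])
    return {c: pick for c in COMMODITIES}


def lagrangian_bound(prices, choice):
    """Lower bound on the optimal cost implied by the current prices."""
    priced = sum(COSTS[e] + prices[e] for e in choice.values())
    return priced - sum(prices[e] * CAPACITY[e] for e in COSTS)


def subgradient(iterations):
    """Iteratively adjust prices; step size 1/k is an assumed, standard choice."""
    prices = {e: 0.0 for e in COSTS}
    best_bound = float("-inf")
    for k in range(1, iterations + 1):
        choice = solve_relaxed(prices)
        best_bound = max(best_bound, lagrangian_bound(prices, choice))
        for e in COSTS:
            load = sum(1 for picked in choice.values() if picked == e)
            # Raise the price of over-used edges, lower it on under-used ones.
            prices[e] = max(0.0, prices[e] + (load - CAPACITY[e]) / k)
    return prices, best_bound


print(subgradient(1))
print(subgradient(50)[1])
```

After one step the free edge carries a price of 1, matching the "0+1" label in the figure: both edges now cost the same, so a single-commodity solution can also be a valid multicommodity solution. A further step overshoots to "0+1+1"-style prices before the shrinking step size settles them.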



Simultaneous Allocator

[Animation: current cost -1, -3, -2]

Edges to/from memory cost 3



Trace-Based Allocation

Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace presents difficulty (addressed in proposal document)



Evaluation

Implemented in gcc 3.4.4 targeting x86

Optimize for code size
– perfect static evaluation
– important metric in its own right

MediaBench, MiBench, Spec95, Spec2000
– over 10,000 functions



Progressiveness

[Chart: squareEncrypt]

Progressiveness

[Chart: quicksort]



Code Size

[Chart: average code size improvement over graph allocator, 0% to 8%, for initial heuristics only, 10 iterations, 100 iterations, 1000 iterations, and default allocator]

Progressive!



Optimality

[Chart: percent of functions (0% to 100%) after 1, 10, 100, and 1000 iterations, by distance from optimal: optimal, <=1%, <=5%, <=10%, <=25%, >25%]

Proven optimality



Compile Time Slowdown :-(

9.2x slower





– locally optimal
• approach optimality using progressive solution technique: Lagrangian directed allocators

[Diagram: reg alloc, machine description]



Outline

I. Motivation

II. Related Work

III. Completed Work

IV. Proposed Work



A Better Better Register Allocator

Solver Improvements
– Improve initial solution
– Improve quality as prices converge
– Hope to prove approximation bounds

Model Improvements
– Improve accuracy of model
– Model simplification
– Represent uniform register sets efficiently

Proposed Work


Model Simplification

Summarize overly expressive sections of the model

Conservative simplification
– does not change optimal value

Aggressive simplification
– explore tradeoff between model complexity and optimality



Instruction Selection Interaction

[Figure: instruction sequences that perform same operation]

Which instruction is best depends on the register allocator, so let register allocator decide



Register Allocation Aware Instruction SElection (RA2ISE)

Instruction selection not finalized until register allocation

IR tiled with Register Allocation Aware Tiles (RAATs)

A RAAT represents several instruction sequences
– different costs
– a sequence for every possible register allocation
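One way to picture a RAAT is as a lookup from register allocation to instruction sequence and cost. The Python sketch below is a hypothetical illustration for a single operation, a 16-to-32-bit sign extension on x86, and is not taken from the RAAT tables of this work. The names `Sequence`, `LOW16`, and `sign_extend_tile` are assumptions; the byte counts are the standard 32-bit encodings (`cwtl` is one byte, register-to-register `movswl` is three).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sequence:
    asm: str
    size_bytes: int


# 16-bit sub-register names for the registers used in this sketch.
LOW16 = {"eax": "ax", "ebx": "bx", "ecx": "cx", "edx": "dx"}


def sign_extend_tile(src_reg, dst_reg):
    """Hypothetical RAAT: pick the sequence implied by the chosen allocation."""
    if src_reg == "eax" and dst_reg == "eax":
        # Special one-byte form, only available when everything lives in eax.
        return Sequence("cwtl", 1)
    # General form: opcode 0F BF plus a ModRM byte, three bytes in total.
    return Sequence(f"movswl %{LOW16[src_reg]}, %{dst_reg}", 3)


print(sign_extend_tile("eax", "eax"))
print(sign_extend_tile("edx", "eax"))
```

Because the cheaper sequence exists only for one allocation, the choice of instruction cannot be made well before the allocator has decided where the value lives.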



RA2ISE

[Diagram: IR → tiling → RAAT → model creation → register allocation → cwtl %eax]



Implementing RA2ISE

Add side-constraints to Global MCNF model
– implement inter-variable preferences and constraints
• “if x allocated to r1 and y allocated to r2, then save three bytes”
• “x and y must be allocated to the same register”

Implement x86 RAATs
– RAAT tables created manually
– GMCNF RAAT representation automatically generated from RAAT table with minimum use of side constraints

Algorithms for tiling RAATs
– leverage existing algorithms
– exploit feedback between passes



Tiling RAATs

[Diagram: tiling and register allocate (eax, edx, mem) steps linked by feedback; numeric node labels omitted]



Evaluation

Implement in production quality compiler (gcc)

Evaluate code size and simple code speed metric

Evaluate on three different architectures
– x86 (8 registers)
– 68k/ColdFire (16 registers)
– PPC (32 registers)



Outline

I. Motivation

II. Related Work

III. Completed Work

IV. Proposed Work



Contributions

RA2ISE
– register allocation aware tiles (RAATs) explicitly encode effect of register allocation on instruction sequence
– algorithms for tiling RAATs
– expressive model of register allocation that operates on RAATs and explicitly represents all important components of register allocation
– progressive solver for this model that can quickly find decent solution and approaches optimality as more time is allowed for compilation

Comprehensive evaluation of RA2ISE


Thesis Statement

RA2ISE is a principled and effective system for performing instruction selection and register allocation.


One Step Towards a More Principled Compiler

[Diagram: the phases (copy prop, loop unroll, DCE, PRE, const prop, code motion, inline, GVN, strength reduct, peephole, CSE, SCCP, branch opt, …), the machine description, and the optimized program, with insn select and reg alloc shown separately]


Timeline

Fall 2006
– add simple speed metric option to model
– begin model simplification work
– improve model accuracy and solver performance

Winter 2006
– finish model simplification work
– add side-constraints to model
– implement existing gcc tiles as RAATs
– improve model accuracy and solver performance

Spring 2007
– finish implementation of side-constraints and gcc RAATs
– begin work on RA2ISE infrastructure
– create gcc-independent set of RAATs for x86
– improve model accuracy and solver performance

Summer 2007
– finish work on RA2ISE
– investigate and develop tiling algorithms
– improve model accuracy and solver performance

Fall 2007
– add 68k/ColdFire and PowerPC targets
– investigate uniform register set simplifications
– improve model accuracy and solver performance

Winter 2007
– begin writing thesis
– work on improving compile time performance

Spring 2008
– finish writing thesis


Andrew Richard Koes


Questions?


Processor Performance

[Chart: performance w/o compiler improvements, 0% to 7000%, Nov-95 through May-06, compared with doubling every 24 months and doubling every 18 months]


Instruction Selection & Register Allocation

[Diagram: insn select, reg alloc, machine description]

– fully utilize machine description
– locally optimal
– tight integration between phases


Costs of Register Allocation

Spilling to/from memory
movl 8(%ebp), %edx

Direct memory access
addl 8(%ebp), %eax

Moving between registers
movl %edx,%ecx

Rematerialization of constant value
movl $3,%eax

Register usage preferences
imul %edx,%eax
vs.
imul %edx,%ecx


Iterative Heuristic Allocator

Allocate each variable in a heuristic priority order

Find shortest path in each block
– avoid edges that make remaining problem infeasible

Process blocks in topological order
– allocation at block entry fixed by previous blocks

Intuition:
– shortest path is minimum cost allocation for a variable
– allocate most significant variables first

Limitation:
– greedy: can’t undo poor decisions
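The greedy, one-variable-at-a-time idea can be sketched for a single basic block as a shortest path through a layered graph of locations. This Python sketch is a simplification under stated assumptions, not the allocator from this work: it uses two registers plus unlimited memory, an assumed register move cost of 1 and memory residence cost of 1, and the slides' cost of 3 for edges to or from memory. Because memory is unlimited here, the step that avoids edges leading to infeasibility is not modeled. All names are hypothetical.

```python
LOCATIONS = ["r0", "r1", "mem"]
MEM_EDGE_COST = 3   # edges to/from memory cost 3, as in the slides' example
REG_MOVE_COST = 1   # assumed cost of a register-to-register move
MEM_USE_COST = 1    # assumed cost of residing in memory at a program point


def move_cost(src, dst):
    if src == dst:
        return 0
    return MEM_EDGE_COST if "mem" in (src, dst) else REG_MOVE_COST


def shortest_path(points, occupied):
    """Cheapest (cost, locations) for one variable across its live points."""
    def free(point, loc):
        return loc == "mem" or (point, loc) not in occupied

    def stay(loc):
        return MEM_USE_COST if loc == "mem" else 0

    best = {loc: (stay(loc), [loc]) for loc in LOCATIONS if free(points[0], loc)}
    for point in points[1:]:
        step = {}
        for loc in LOCATIONS:
            if free(point, loc):
                cost, path = min(
                    (c + move_cost(prev, loc), p) for prev, (c, p) in best.items()
                )
                step[loc] = (cost + stay(loc), path + [loc])
        best = step
    return min(best.values())


def iterative_allocate(live_points, order):
    """Allocate variables one at a time in priority order; earlier choices are final."""
    occupied, result = set(), {}
    for var in order:
        cost, path = shortest_path(live_points[var], occupied)
        occupied.update(
            (p, loc) for p, loc in zip(live_points[var], path) if loc != "mem"
        )
        result[var] = (cost, path)
    return result


live = {"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}
print(iterative_allocate(live, ["a", "b", "c"]))
```

With three variables live over the same points, the first two in priority order claim the registers and the third is left in memory, which illustrates the limitation above: once a register is taken, a later variable cannot displace its owner.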


Iterative Heuristic Allocator

Allocation order: a, b, c, d

Cost:
a: 0
b: 4
c: 0
d: -2
Total: 2

Edges to/from memory cost 3


Simultaneous Allocator

Scan each block
– maintain an allocation of all live variables
– at variable definition find cheapest allocation
• allocation with shortest path to variable’s sink or block exit
• allowed to evict (reallocate) already allocated variable
– eviction cost: shortest path to edge from current allocation to new allocation in this block
– cost of eviction added to shortest path cost

Intuition:
– minimizing cost for all variables at once

Limitation:
– path computations limited to single block
– future blocks do not change previous block allocations


Trace-Based Allocation

Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace
• update only blocks that are necessary (easy-update)
• update all affected blocks (full-update)

[Figures: easy-update, full-update]


Accuracy of the Model

Global MCNF model correctly predicts costs of register allocation within 2% for 72.5% of functions compiled

[Chart: percent of functions (0% to 60%) by percent difference between predicted and actual size, from >10% larger than predicted (under-predicted) to >10% smaller than predicted (over-predicted)]


Compile Time Slowdown :-(

10x slower

[Chart: factor slower than graph allocation (0 to 50) per benchmark, 099.go through susan, plus geo. mean; bars broken down into Initialization, Initial Iterative Allocation, Initial Simultaneous Allocation, One Iteration]


Code size improvement

[Chart: code size improvement over graph allocator (0% to 8%) for initial heuristics only, 10 iterations, 100 iterations, 1000 iterations, and default allocator; series: without traces, with easy traces, with full traces]


Code Size Improvement

[Chart: code size improvement over graph allocator (-5% to 20%) per benchmark, 099.go through susan, plus average; series: initial heuristics only, 10 iterations, 100 iterations, 1000 iterations]


Code Size Improvement

[Chart: code size improvement over default allocator (-6% to 12%) per benchmark, 099.go through susan, plus average; series: initial heuristic only, 10 iterations, 100 iterations, 1000 iterations]


Code Performance


Integrating Register Allocation and Instruction Selection

int foo(int a, short b) { return a*4+b; }

(leading number: instruction size in bytes)

4 movl 4(%esp),%eax
3 sall $2,%eax
4 addl 8(%esp),%eax
1 cwtl
1 ret

5 movswl 8(%esp),%edx
4 movl 4(%esp),%eax
3 leal (%edx,%eax,4),%eax
1 ret


Another RAAT
