compiler design

University of Amsterdam

CSA: Computer Systems Architecture

Introduction to Compiler Design: optimization and backend issues

Andy Pimentel

Computer Systems Architecture [email protected]

Introduction to Compiler Design – A. Pimentel – p. 1/98


Compilers: Organization Revisited

Source -> Frontend -> IR -> Optimizer -> IR -> Backend -> Machine code

Optimizer
- Independent part of the compiler
- Different optimizations possible
- IR to IR translation



Intermediate Representation (IR)

- Flow graph
  - Nodes are basic blocks
  - Basic blocks are single entry and single exit
  - Edges represent control flow
- Abstract machine code
  - Including the notion of functions and procedures
- Symbol table(s) keep track of scope and binding information about names



Partitioning into basic blocks

1. Determine the leaders, which are:
   - The first statement
   - Any statement that is the target of a jump
   - Any statement that immediately follows a jump

2. For each leader, its basic block consists of the leader and all statements up to but not including the next leader



Partitioning into basic blocks (cont’d)

BB1:
     1  prod = 0
     2  i = 1

BB2:
     3  t1 = 4*i
     4  t2 = a[t1]
     5  t3 = 4*i
     6  t4 = b[t3]
     7  t5 = t2*t4
     8  t6 = prod+t5
     9  prod = t6
    10  t7 = i+1
    11  i = t7
    12  if i < 21 goto 3
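The leader rules can be tried out on the example above. The following is a minimal sketch, assuming statements are plain strings numbered from 1 and that a jump is any statement containing `goto N`; the helper names `find_leaders` and `basic_blocks` are illustrative and do not come from the slides.

```python
import re

def find_leaders(code):
    """Return the sorted 1-based numbers of the statements that start a block."""
    leaders = {1}                                  # rule 1: the first statement
    for number, stmt in enumerate(code, start=1):
        match = re.search(r"goto (\d+)", stmt)
        if match:
            leaders.add(int(match.group(1)))       # rule 2: target of a jump
            if number < len(code):
                leaders.add(number + 1)            # rule 3: statement after a jump
    return sorted(leaders)

def basic_blocks(code):
    """Each block runs from a leader up to, but not including, the next leader."""
    leaders = find_leaders(code)
    ends = leaders[1:] + [len(code) + 1]
    return [list(range(start, end)) for start, end in zip(leaders, ends)]

code = [
    "prod = 0", "i = 1", "t1 = 4 * i", "t2 = a[t1]", "t3 = 4 * i",
    "t4 = b[t3]", "t5 = t2 * t4", "t6 = prod + t5", "prod = t6",
    "t7 = i + 1", "i = t7", "if i < 21 goto 3",
]
blocks = basic_blocks(code)
print(blocks)
```

For this input the two blocks correspond to BB1 (statements 1 and 2) and BB2 (statements 3 to 12).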



Intermediate Representation (cont’d)

Structure within a basic block:

- Abstract Syntax Tree (AST)
  - Leaves are labeled by variable names or constants
  - Interior nodes are labeled by an operator
- Directed Acyclic Graph (DAG)
- C-like
- 3-address statements (like we have already seen)



Directed Acyclic Graph

- Like ASTs:
  - Leaves are labeled by variable names or constants
  - Interior nodes are labeled by an operator
- Nodes can have variable names attached that contain the value of that expression
- Common subexpressions are represented by multiple edges to the same expression



DAG creation

Suppose the following three address statements:

1. x = y op z

2. x = op y

3. x = y

A conditional such as if (i <= 20) ... will be treated like case 1 with x undefined



DAG creation (cont’d)

- If node(y) is undefined, create a leaf labeled y; same for z if applicable
- Find node n labeled op with children node(y) and node(z) if applicable. When not found, create node n. In case 3 let n be node(y)
- Make node(x) point to n and update the attached identifiers for x
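The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: statements are encoded as tuples `(x, op, y, z)` (with `z` set to `None` for unary operators, `op` set to `None` for copies, and `x` set to `None` for a conditional), array accesses are treated as an ordinary binary operator `[]`, and the name `build_dag` is hypothetical.

```python
def build_dag(statements):
    """Build a DAG for one basic block; returns (nodes, attached identifiers)."""
    nodes = []      # nodes[k] = (label, children); leaves have children ()
    current = {}    # node(v): variable name -> index of the node holding its value
    index = {}      # (op, children) -> node index, used to find an existing node n
    attached = {}   # node index -> identifiers attached to that node

    def node_of(name):
        if name not in current:              # node(name) undefined: create a leaf
            nodes.append((name, ()))
            current[name] = len(nodes) - 1
        return current[name]

    for x, op, y, z in statements:
        kids = tuple(node_of(v) for v in (y, z) if v is not None)
        if op is None:                       # case 3 (x = y): n is node(y)
            n = kids[0]
        else:                                # cases 1 and 2: find or create n
            key = (op, kids)
            if key not in index:
                nodes.append(key)
                index[key] = len(nodes) - 1
            n = index[key]
        if x is not None:
            old = current.get(x)
            if old is not None and x in attached.get(old, []):
                attached[old].remove(x)      # x no longer names its old node
            attached.setdefault(n, []).append(x)
            current[x] = n                   # make node(x) point to n
    return nodes, attached

block = [
    ("t1", "*", "4", "i"), ("t2", "[]", "a", "t1"), ("t3", "*", "4", "i"),
    ("t4", "[]", "b", "t3"), ("t5", "*", "t2", "t4"), ("t6", "+", "prod", "t5"),
    ("prod", None, "t6", None), ("t7", "+", "i", "1"), ("i", None, "t7", None),
    (None, "<=", "i", "20"),
]
nodes, attached = build_dag(block)
print(sorted(attached.values()))
```

Because `4 * i` is looked up before a new node is created, t1 and t3 end up attached to the same node, which is how the common subexpression is detected. The sketch ignores the extra care that assignments to array elements and pointers would need.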



DAG example

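     1  t1 = 4 * i
     2  t2 = a[t1]
     3  t3 = 4 * i
     4  t4 = b[t3]
     5  t5 = t2 * t4
     6  t6 = prod + t5
     7  prod = t6
     8  t7 = i + 1
     9  i = t7
    10  if (i <= 20) goto 1

[Figure: DAG for this block. Leaves: a, b, 4, i, 1, 20 and prod. Interior nodes: * with attached names t1, t3; [ ] with t2; [ ] with t4; * with t5; + with t6, prod; + with t7, i; and <= comparing the node for t7, i with 20.]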



Local optimizations

- On basic blocks in the intermediate representation
  - Machine independent optimizations
- As a post code-generation step (often called peephole optimization)
  - On a small "instruction window" (often a basic block)
  - Includes machine specific optimizations



Transformations on basic blocks

Examples

- Function-preserving transformations
  - Common subexpression elimination
  - Constant folding
  - Copy propagation
  - Dead-code elimination
  - Temporary variable renaming
  - Interchange of independent statements



Transformations on basic blocks (cont’d)

- Algebraic transformations
- Machine dependent eliminations/transformations
  - Removal of redundant loads/stores
  - Use of machine idioms



Common subexpression elimination

- If the same expression is computed more than once it is called a common subexpression
- If the result of the expression is stored, we don't have to recompute it
- Moving to a DAG as IR, common subexpressions are automatically detected!

    x = a + b            x = a + b
    ...          =>      ...
    y = a + b            y = x



Constant folding

- Compute constant expressions at compile time
- May require some emulation support

    x = 3 + 5            x = 8
    ...          =>      ...
    y = x * 2            y = 16
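A minimal sketch of constant folding within one basic block follows. It assumes statements are tuples `(dest, left, op, right)` whose operands are either Python integers or variable names, and it also propagates the folded constants to later statements of the block, which the example above needs in order to turn the second statement into y = 16. The function name `fold_constants` and the operator table are illustrative only.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(block):
    """Evaluate statements whose operands are all known constants."""
    known = {}       # variables currently known to hold a constant value
    result = []
    for dest, left, op, right in block:
        left = known.get(left, left)        # substitute known constants
        right = known.get(right, right)
        if isinstance(left, int) and isinstance(right, int):
            value = OPS[op](left, right)    # computed at "compile time"
            known[dest] = value
            result.append((dest, value))
        else:
            known.pop(dest, None)           # dest no longer holds a known constant
            result.append((dest, left, op, right))
    return result

folded = fold_constants([("x", 3, "+", 5), ("y", "x", "*", 2), ("z", "y", "+", "w")])
print(folded)
```

The third statement is not folded because `w` is unknown, but its operand `y` is still replaced by the constant 16. A real compiler must evaluate with the arithmetic of the target machine, which is where the emulation support comes in.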



Copy propagation

- Propagate original values when copied
- Target for dead-code elimination

    x = y                x = y
    ...          =>      ...
    z = x * 2            z = y * 2



Dead-code elimination

- A variable x is dead at a statement if it is not used after that statement
- An assignment x = y + z where x is dead can be safely eliminated
- Requires live-variable analysis (discussed later on)



Temporary variable renaming

    t1 = a + b           t1 = a + b
    t2 = t1 * 2          t2 = t1 * 2
    ...          =>      ...
    t1 = d - e           t3 = d - e
    c = t1 + 1           c = t3 + 1

- If each statement that defines a temporary defines a new temporary, then the basic block is in normal-form
- Makes some optimizations at BB level a lot simpler (e.g. common subexpression elimination, copy propagation, etc.)



Algebraic transformations

- There are many possible algebraic transformations
- Usually only the common ones are implemented

    x = x + 0
    x = x * 1
    x = x * 2    =>    x = x << 1
    x = x^2      =>    x = x * x



Machine dependent eliminations/transformations

- Removal of redundant loads/stores

      1  mov R0, a
      2  mov a, R0      // can be removed

- Removal of redundant jumps, for example

      1  beq ...,$Lx             bne ...,$Ly
      2  j $Ly           =>      $Lx: ...
      3  $Lx: ...

- Use of machine idioms, e.g.,
  - Auto increment/decrement addressing modes
  - SIMD instructions
- Etc., etc. (see practical assignment)



Other sources of optimizations

- Global optimizations
  - Global common subexpression elimination
  - Global constant folding
  - Global copy propagation, etc.
- Loop optimizations

They all need some dataflow analysis on the flow graph



Loop optimizations

Code motion
- Decrease the amount of code inside the loop
- Take a loop-invariant expression and place it before the loop

    while (i <= limit - 2)    =>    t = limit - 2
                                    while (i <= t)



Loop optimizations (cont’d)

Induction variable elimination
- Variables that are locked to the iteration of the loop are called induction variables
- Example: in for (i = 0; i < 10; i++) i is an induction variable
- Loops can contain more than one induction variable, for example, hidden in an array lookup computation
- Often, we can eliminate these extra induction variables




Strength reduction
- Strength reduction is the replacement of expensive operations by cheaper ones (algebraic transformation)
- Its use is not limited to loops but can be helpful for induction variable elimination

    i = i + 1                    i = i + 1
    t1 = i * 4           =>      t1 = t1 + 4
    t2 = a[t1]                   t2 = a[t1]
    if (i < 10) goto top         if (i < 10) goto top




Induction variable elimination (2)
- Note that in the previous strength reduction we have to initialize t1 before the loop
- After such strength reductions we can eliminate an induction variable

    i = i + 1                    t1 = t1 + 4
    t1 = t1 + 4          =>      t2 = a[t1]
    t2 = a[t1]                   if (t1 < 40) goto top
    if (i < 10) goto top
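To check that the three versions of the loop touch the same array elements, the sketch below mimics each of them in Python. It assumes, purely for illustration, that i starts at 0 before the loop, that t1 is initialized to 4 * i before the loop, and that the array a has at least 41 elements; the function names are hypothetical.

```python
def original(a):
    i, seen = 0, []
    while True:                      # top:
        i = i + 1
        t1 = i * 4                   # multiplication in every iteration
        seen.append(a[t1])           # t2 = a[t1]
        if not i < 10:
            break
    return seen

def strength_reduced(a):
    i, t1, seen = 0, 0, []           # t1 initialized before the loop (4 * i)
    while True:
        i = i + 1
        t1 = t1 + 4                  # addition replaces the multiplication
        seen.append(a[t1])
        if not i < 10:
            break
    return seen

def induction_variable_eliminated(a):
    t1, seen = 0, []
    while True:
        t1 = t1 + 4
        seen.append(a[t1])
        if not t1 < 40:              # test on t1 replaces the test on i
            break
    return seen

a = list(range(100, 141))            # 41 elements, indices 0..40
print(original(a))
```

All three functions return the elements at indices 4, 8, ..., 40, so the transformations preserve the behaviour of the loop for this input.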



Finding loops in flow graphs

Dominator relation
- Node A dominates node B if all paths to node B go through node A
- A node always dominates itself

We can construct a tree using this relation: the Dominator tree



Dominator tree example

[Figure: a flow graph with nodes 1 to 10 (left) and its dominator tree (right)]



Natural loops

- A loop has a single entry point, the header, which dominates the loop
- There must be a path back to the header
- Loops can be found by searching for edges whose heads dominate their tails, called the backedges
- Given a backedge n -> d, the natural loop is d plus the nodes that can reach n without going through d



Finding natural loop of n -> d

    procedure insert(m) {
        if (not m ∈ loop) {
            loop = loop ∪ {m}
            push(m)
        }
    }

    stack = ∅
    loop = {d}
    insert(n)
    while (stack ≠ ∅) {
        m = pop()
        for (p ∈ pred(m)) insert(p)
    }



Natural loops (cont’d)

- When two backedges go to the same header node, we may join the resulting loops
- When we consider two natural loops, they are either completely disjoint or one is nested inside the other
- The nested loop is called an inner loop
- A program spends most of its time inside loops, so loops are a target for optimizations. This especially holds for inner loops!



Our example revisited

[Figure: the flow graph with nodes 1 to 10 from the dominator tree example]

Natural loops:

1. backedge 10 -> 7: {7,8,10} (the inner loop)
2. backedge 7 -> 4: {4,5,6,7,8,10}
3. backedges 4 -> 3 and 8 -> 3: {3,4,5,6,7,8,10}
4. backedge 9 -> 1: the entire flow graph
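The dominator relation, the backedge test and the stack-based natural-loop procedure can be combined in a short sketch. The edge list below is an assumption: the figure is not available in text form, so the successor table reconstructs a ten-node flow graph that is consistent with the four natural loops listed above. The function names are illustrative, and the dominator computation assumes every node is reachable from the entry.

```python
def dominators(succ, entry):
    """Iteratively compute dom[n], the set of nodes that dominate n."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}    # start from "everything dominates n"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom, pred

def natural_loop(pred, n, d):
    """Natural loop of the backedge n -> d: d plus all nodes reaching n without d."""
    loop, stack = {d}, []

    def insert(m):
        if m not in loop:
            loop.add(m)
            stack.append(m)

    insert(n)
    while stack:
        m = stack.pop()
        for p in pred[m]:
            insert(p)
    return loop

# Assumed edges (not given in the text), chosen to match the loops listed above.
succ = {1: [2, 3], 2: [3], 3: [4], 4: [3, 5, 6], 5: [7], 6: [7],
        7: [4, 8], 8: [3, 9, 10], 9: [1], 10: [7]}
dom, pred = dominators(succ, entry=1)
# An edge n -> d is a backedge when its head d dominates its tail n.
back_edges = sorted((n, d) for n in succ for d in succ[n] if d in dom[n])
loops = {edge: natural_loop(pred, *edge) for edge in back_edges}
for edge in back_edges:
    print(edge, sorted(loops[edge]))
```

With these edges the sketch finds the backedges 4 -> 3, 7 -> 4, 8 -> 3, 9 -> 1 and 10 -> 7 and reproduces the loops of the list above.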



Reducible flow graphs

- A flow graph is reducible when the edges can be partitioned into forward edges and backedges
- The forward edges must form an acyclic graph in which every node can be reached from the initial node
- Exclusive use of structured control-flow statements such as if-then-else, while and break produces reducible control-flow
- Irreducible control-flow can create loops that cannot be optimized



Reducible flow graphs (cont’d)

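- Irreducible control-flow graphs can always be made reducible
- This usually involves some duplication of code

[Figure: an irreducible flow graph with nodes a, b and c (left) and a reducible version in which c is duplicated as c' (right)]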



Dataflow analysis

- Dataflow analysis is needed for global code optimization, e.g.: Is a variable live on exit from a block? Does a definition reach a certain point in the code?
- Dataflow equations are used to collect dataflow information
  - A typical dataflow equation has the form $out[S] = gen[S] \cup (in[S] - kill[S])$
- The notion of generation and killing depends on the dataflow analysis problem to be solved
- Let's first consider Reaching Definitions analysis for structured programs



Reaching definitions

- A definition of a variable x is a statement that assigns or may assign a value to x
- An assignment to x is an unambiguous definition of x
- An ambiguous definition of x can be an assignment through a pointer or a function call where x is passed by reference



Reaching definitions (cont’d)

When x is defined, we say the definition is generated

An unambiguous definition of x kills all other definitions of x

When all definitions of x are the same at a certain point, we can use this information to do some optimizations

Example: all definitions of x define x to be 1. Now, by performing constant folding, we can do strength reduction if x is used in z = y * x



Dataflow analysis for reaching definitions

- During dataflow analysis we have to examine every path that can be taken to see which definitions reach a point in the code
- Sometimes a certain path will never be taken, even if it is part of the flow graph
- Since it is undecidable whether a path can be taken, we simply examine all paths
- This won't cause false assumptions to be made for the code: it is a conservative simplification
  - It merely causes optimizations not to be performed



The building blocks

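Single statement $S$, d: a=b+c (where $D_a$ is the set of all definitions of a)

    $gen[S] = \{d\}$
    $kill[S] = D_a - \{d\}$
    $out[S] = gen[S] \cup (in[S] - kill[S])$

Sequence $S$ = $S1$ ; $S2$

    $gen[S] = gen[S2] \cup (gen[S1] - kill[S2])$
    $kill[S] = kill[S2] \cup (kill[S1] - gen[S2])$
    $in[S1] = in[S]$
    $in[S2] = out[S1]$
    $out[S] = out[S2]$

Branch $S$ with alternatives $S1$ and $S2$

    $gen[S] = gen[S1] \cup gen[S2]$
    $kill[S] = kill[S1] \cap kill[S2]$
    $in[S1] = in[S2] = in[S]$
    $out[S] = out[S1] \cup out[S2]$

Loop $S$ with body $S1$

    $gen[S] = gen[S1]$
    $kill[S] = kill[S1]$
    $in[S1] = in[S] \cup gen[S1]$
    $out[S] = out[S1]$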



Dealing with loops

- The in-set to the code inside the loop is the in-set of the loop plus the out-set of the loop: $in[S1] = in[S] \cup out[S1]$
- The out-set of the loop is the out-set of the code inside: $out[S] = out[S1]$
- Fortunately, we can also compute $out[S1]$ in terms of $in[S1]$: $out[S1] = gen[S1] \cup (in[S1] - kill[S1])$



Dealing with loops (cont’d)

$I = in[S1]$, $O = out[S1]$, $J = in[S]$, $G = gen[S1]$ and $K = kill[S1]$

    $I = J \cup O$
    $O = G \cup (I - K)$

Assume $O = \emptyset$, then

    $I_1 = J$
    $O_1 = G \cup (I_1 - K) = G \cup (J - K)$
    $I_2 = J \cup O_1 = J \cup G \cup (J - K) = J \cup G$
    $O_2 = G \cup (I_2 - K) = G \cup ((J \cup G) - K) = G \cup (J - K)$

$O_1 = O_2$, so $in[S1] = in[S] \cup gen[S1]$ and $out[S] = out[S1]$



Reaching definitions example

    d1  i = m - 1
    d2  j = n
    d3  a = u1
        do
    d4      i = i + 1
    d5      j = j - 1
            if (e1)
    d6          a = u2
            else
    d7          i = u3
        while (e2)

[Figure: syntax tree of the program; each node is annotated with its gen and kill sets as bit vectors over d1..d7]

    Node                        gen        kill
    d1                          100 0000   000 1001
    d2                          010 0000   000 0100
    d3                          001 0000   000 0010
    d1 ; d2                     110 0000   000 1101
    d1 ; d2 ; d3                111 0000   000 1111
    d4                          000 1000   100 0001
    d5                          000 0100   010 0000
    d4 ; d5                     000 1100   110 0001
    d6                          000 0010   001 0000
    d7                          000 0001   100 1000
    if (e1) d6 else d7          000 0011   000 0000
    d4 ; d5 ; if                000 1111   110 0000
    do ... while (e2)           000 1111   110 0000
    whole program               001 1111   110 0000

In reality, dataflow analysis is often performed at the granularity of basic blocks rather than statements



Iterative solutions

- Programs in general need not be made up out of structured control-flow statements
- We can do dataflow analysis on these programs using an iterative algorithm
- The equations (at basic block level) for reaching definitions are:

    $in[B] = \bigcup_{P \in pred(B)} out[P]$
    $out[B] = gen[B] \cup (in[B] - kill[B])$



Iterative algorithm for reaching definitions

    for (each block B) out[B] = gen[B]

    do {
        change = false
        for (each block B) {
            in[B] = ∪ out[P] over all P ∈ pred(B)
            oldout = out[B]
            out[B] = gen[B] ∪ (in[B] - kill[B])
            if (out[B] ≠ oldout) change = true
        }
    } while (change)



Reaching definitions: an example

[Figure: flow graph with the basic blocks B1, B2, B3 and B4]

    B1:  d1: i = m - 1        gen[B1] = {d1,d2,d3}
         d2: j = n            kill[B1] = {d4,d5,d6,d7}
         d3: a = u1

    B2:  d4: i = i + 1        gen[B2] = {d4,d5}
         d5: j = j - 1        kill[B2] = {d1,d2,d7}

    B3:  d6: a = u2           gen[B3] = {d6}
                              kill[B3] = {d3}

    B4:  d7: i = u3           gen[B4] = {d7}
                              kill[B4] = {d1,d4}

    Block   Initial               Pass 1                Pass 2
            in[B]      out[B]     in[B]      out[B]     in[B]      out[B]
    B1      000 0000   111 0000   000 0000   111 0000   000 0000   111 0000
    B2      000 0000   000 1100   111 0011   001 1110   111 1111   001 1110
    B3      000 0000   000 0010   001 1110   000 1110   001 1110   000 1110
    B4      000 0000   000 0001   001 1110   001 0111   001 1110   001 0111
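The iterative algorithm can be run on this example with a few lines of Python. The gen and kill sets are taken from the slide; the predecessor lists are an assumption, since the edges of the figure are not available in text form. They were chosen so that the computation reproduces the table above (B2 is entered from B1 and from both B3 and B4). The names `reaching_definitions` and `bits` are illustrative.

```python
def reaching_definitions(blocks, pred, gen, kill):
    """Iterate in[B] and out[B] until no out-set changes."""
    out = {b: set(gen[b]) for b in blocks}        # initial guess: out[B] = gen[B]
    inn = {b: set() for b in blocks}
    change = True
    while change:
        change = False
        for b in blocks:
            inn[b] = set().union(*(out[p] for p in pred[b]))   # union over preds
            new_out = gen[b] | (inn[b] - kill[b])
            if new_out != out[b]:
                out[b], change = new_out, True
    return inn, out

def bits(defs):
    """Render a set of definitions as a bit vector over d1..d7."""
    return "".join("1" if f"d{k}" in defs else "0" for k in range(1, 8))

blocks = ["B1", "B2", "B3", "B4"]
pred = {"B1": [], "B2": ["B1", "B3", "B4"], "B3": ["B2"], "B4": ["B2"]}  # assumed edges
gen = {"B1": {"d1", "d2", "d3"}, "B2": {"d4", "d5"}, "B3": {"d6"}, "B4": {"d7"}}
kill = {"B1": {"d4", "d5", "d6", "d7"}, "B2": {"d1", "d2", "d7"},
        "B3": {"d3"}, "B4": {"d1", "d4"}}

inn, out = reaching_definitions(blocks, pred, gen, kill)
for b in blocks:
    print(b, bits(inn[b]), bits(out[b]))
```

The final sets agree with the Pass 2 columns of the table, for example in[B2] = 111 1111 and out[B4] = 001 0111.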



Available expressions

- An expression e is available at a point p if every path from the initial node to p evaluates e, and the variables used by e are not changed after the last evaluations
- An available expression e is killed if one of the variables used by e is assigned to
- An available expression e is generated if it is evaluated
- Note that if an expression e is assigned to a variable used by e, this expression will not be generated



Available expressions (cont’d)

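Available expressions are mainly used to find common subexpressions

[Figure: two flow graphs over the blocks B1, B2 and B3. Left: B1 contains t1 = 4 * i, the contents of B2 are unknown (?), and B3 contains t2 = 4 * i. Right: B1 contains t1 = 4 * i, B2 contains i = ... and t0 = 4 * i, and B3 contains t2 = 4 * i.]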



Available expressions (cont’d)

Dataflow equations:

    $out[B] = e\_gen[B] \cup (in[B] - e\_kill[B])$
    $in[B] = \bigcap_{P \in pred(B)} out[P]$ for B not initial
    $in[B1] = \emptyset$ where B1 is the initial block

The confluence operator is intersection instead of union!



Liveness analysis

- A variable is live at a certain point in the code if it holds a value that may be needed in the future
- Solve backwards:
  - Find a use of a variable
  - This variable is live between statements that have the found use as next statement
  - Recurse until you find a definition of the variable



Dataflow for liveness

- Using the sets $use[B]$ and $def[B]$
  - $def[B]$ is the set of variables assigned values in B prior to any use of that variable in B
  - $use[B]$ is the set of variables whose values may be used in B prior to any definition of the variable
- A variable comes live into a block (in $in[B]$) if it is either used before redefinition, or it is live coming out of the block and is not redefined in the block
- A variable comes live out of a block (in $out[B]$) if and only if it is live coming into one of its successors



Dataflow equations for liveness

    $in[B] = use[B] \cup (out[B] - def[B])$
    $out[B] = \bigcup_{S \in succ(B)} in[S]$

Note the relation with the reaching-definitions equations: the roles of in and out are interchanged
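A small sketch of the backward iteration for these equations follows. The three-block program used here is hypothetical (it is not from the slides): B1 initializes i and s, B2 is a loop body computing s = s + i; i = i + 1 and testing i < n, and B3 uses s. The function name `liveness` is illustrative.

```python
def liveness(blocks, succ, use, defs):
    """Solve in[B] = use[B] | (out[B] - def[B]), out[B] = union of in[S]."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(blocks):               # backward problem: reverse order
            live_out[b] = set().union(*(live_in[s] for s in succ[b]))
            new_in = use[b] | (live_out[b] - defs[b])
            if new_in != live_in[b]:
                live_in[b], changed = new_in, True
    return live_in, live_out

# Hypothetical example:
#   B1: i = 0; s = 0
#   B2: s = s + i; i = i + 1; if i < n goto B2
#   B3: return s
blocks = ["B1", "B2", "B3"]
succ = {"B1": ["B2"], "B2": ["B2", "B3"], "B3": []}
use = {"B1": set(), "B2": {"s", "i", "n"}, "B3": {"s"}}
defs = {"B1": {"i", "s"}, "B2": set(), "B3": set()}   # s and i are used in B2 before being assigned

live_in, live_out = liveness(blocks, succ, use, defs)
for b in blocks:
    print(b, sorted(live_in[b]), sorted(live_out[b]))
```

Only n is live on entry to B1, because i and s are defined there before any use, while all three variables are live around the loop in B2.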



Algorithms for global optimizations

Global common subexpression elimination
- First calculate the sets of available expressions
- For every statement s of the form x = y + z where y + z is available do the following
  - Search backwards in the graph for the evaluations of y + z
  - Create a new variable u
  - Replace statements w = y + z by u = y + z; w = u
  - Replace statement s by x = u



Copy propagation

Suppose a copy statement s of the form x = y is encountered. We may now substitute a use of x by a use of y if
- Statement s is the only definition of x reaching the use
- On every path from statement s to the use, there are no assignments to y



Copy propagation (cont’d)

- To find the set of copy statements we can use, we define a new dataflow problem
- An occurrence of a copy statement generates this statement
- An assignment to x or y kills the copy statement x = y
- Dataflow equations:

    $out[B] = c\_gen[B] \cup (in[B] - c\_kill[B])$
    $in[B] = \bigcap_{P \in pred(B)} out[P]$ for B not initial
    $in[B1] = \emptyset$ where B1 is the initial block



Copy propagation (cont’d)

For each copy statement s: x = y do
- Determine the uses of x reached by this definition of x
- Determine if for each of those uses this is the only definition reaching it ($\Rightarrow s \in in[B_{use}]$)
- If so, remove s and replace the uses of x by uses of y



Detection of loop-invariant computations

1. Mark invariant those statements whose operands are constant or have reaching definitions outside the loop

2. Repeat step 3 until no new statements are marked invariant

3. Mark invariant those statements whose operands either are constant, have reaching definitions outside the loop, or have one reaching definition that is marked invariant



Code motion

1. Create a pre-header for the loop

2. Find loop-invariant statements

3. For each statement s defining x found in step 2, check that
   (a) it is in a block that dominates all exits of the loop
   (b) x is not defined elsewhere in the loop
   (c) all uses of x in the loop can only be reached from this statement s

4. Move the statements that satisfy these conditions to the pre-header



Code motion (cont’d)

[Figure: three variants of a flow graph illustrating Condition (a), Condition (b) and Condition (c). The blocks are B1 (i = 1), B2 (if u < v goto B3), B3 (i = 2; u = u + 1), B4 (v = v - 1; if v <= 20 goto B5) and B5 (j = i). The later variants add the statements i = 3 and k = i (in B4).]



Detection of induction variables

- A basic induction variable i is a variable that only has assignments of the form $i = i \pm c$
- Associated with each induction variable j is a triple $(i, c, d)$ where i is a basic induction variable and c and d are constants such that $j = c \cdot i + d$
- In this case j belongs to the family of i
- The basic induction variable i belongs to its own family, with the associated triple $(i, 1, 0)$



Detection of induction variables (cont’d)

Find all basic induction variables in the loop

- Find variables k with a single assignment in the loop with one of the following forms: $k = j \cdot b$, $k = b \cdot j$, $k = j / b$, $k = j \pm b$, $k = b \pm j$, where b is a constant and j is an induction variable
- If j is not basic and in the family of i then there must be
  - No assignment of i between the assignment of j and k
  - No definition of j outside the loop that reaches k



Strength reduction for induction variables

Consider each basic induction variable i in turn. For each variable j in the family of i with triple $(i, c, d)$:
- Create a new variable s
- Replace the assignment to j by j = s
- Immediately after each assignment $i = i \pm n$ append $s = s \pm c \cdot n$
- Place s in the family of i with triple $(i, c, d)$
- Initialize s in the preheader: $s = c \cdot i + d$



Strength reduction for induction variables (cont’d)

[Figure: flow graph with blocks B1 to B5 before and after strength reduction]

Before strength reduction:

    B1:  i = m - 1
         t1 = 4 * n
         v = a[t1]

    B2:  i = i + 1
         t2 = 4 * i
         t3 = a[t2]
         if t3 < v goto B2

    B3:  if i < n goto B5

After strength reduction:

    B1:  i = m - 1
         t1 = 4 * n
         v = a[t1]
         s2 = 4 * i

    B2:  i = i + 1
         s2 = s2 + 4
         t2 = s2
         t3 = a[t2]
         if t3 < v goto B2

    B3:  if i < n goto B5



Elimination of induction variables

- Consider each basic induction variable i only used to compute other induction variables and tests
- Take some j in i's family such that c and d from the triple $(i, c, d)$ are simple
- Rewrite tests if (i relop x) to $r = c \cdot x + d$; if (j relop r)
- Delete assignments to i from the loop
- Do some copy propagation to eliminate j = s assignments formed during strength reduction



Alias Analysis

- Aliases, e.g. caused by pointers, make dataflow analysis more complex (uncertainty regarding what is defined and used: x = *p might use any variable)
- Use dataflow analysis to determine what a pointer might point to
  - $in[B]$ contains for each pointer p the set of variables to which p could point at the beginning of block B
  - Elements of $in[B]$ are pairs $(p, a)$ where p is a pointer and a a variable, meaning that p might point to a
  - $out[B]$ is defined similarly for the end of B



Alias Analysis (cont’d)

- Define a function $trans_B$ such that $trans_B(in[B]) = out[B]$
- $trans_B$ is composed of $trans_s$, for each stmt s of block B
  - If s is $p = \&a$ or $p = \&a \pm c$ in case a is an array, then
    $trans_s(S) = (S - \{(p, b) \mid \text{any variable } b\}) \cup \{(p, a)\}$
  - If s is $p = q \pm c$ for pointer q and nonzero integer c, then
    $trans_s(S) = (S - \{(p, b) \mid \text{any variable } b\}) \cup \{(p, b) \mid (q, b) \in S \text{ and } b \text{ is an array variable}\}$
  - If s is $p = q$, then
    $trans_s(S) = (S - \{(p, b) \mid \text{any variable } b\}) \cup \{(p, b) \mid (q, b) \in S\}$




  - If s assigns to pointer p any other expression, then
    $trans_s(S) = S - \{(p, b) \mid \text{any variable } b\}$
  - If s is not an assignment to a pointer, then $trans_s(S) = S$

Dataflow equations for alias analysis:

    $out[B] = trans_B(in[B])$
    $in[B] = \bigcup_{P \in pred(B)} out[P]$

where $trans_B(S) = trans_{s_k}(trans_{s_{k-1}}(\cdots(trans_{s_1}(S))\cdots))$




How to use the alias dataflow information? Examples:
- In reaching definitions analysis (to determine gen and kill)
  - statement *p = a generates a definition of every variable b such that p could point to b
  - *p = a kills a definition of b only if b is not an array and is the only variable p could possibly point to (to be conservative)
- In liveness analysis (to determine def and use)
  - *p = a uses p and a. It defines b only if b is the unique variable that p might point to (to be conservative)
  - a = *p defines a, and represents the use of p and a use of any variable that p could point to



Code generation

Instruction selection
- Was a problem in the CISC era (e.g., lots of addressing modes)
- RISC instructions mean simpler instruction selection
- However, new instruction sets introduce new, complicated instructions (e.g., multimedia instruction sets)



Instruction selection methods

- Tree-based methods (IR is a tree)
  - Maximal Munch
  - Dynamic programming
  - Tree grammars
    - Input tree treated as string using prefix notation
    - Rewrite string using an LR parser and generate instructions as side effect of rewriting rules
- If the DAG is not a tree, then it can be partitioned into multiple trees



Tree pattern based selection

Every target instruction is represented by a tree pattern

Such a tree pattern often has an associated cost

Instruction selection is done by tiling the IR tree with the instruction tree patterns

There may be many different ways an IR tree can be tiled, depending on the instruction set



Tree pattern based selection (cont’d)

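[Figure: example IR tree built from move, mem, + and * nodes over the leaves temp 1, temp 2, const a, const b, const c and const d]

In the Trees column, e stands for an arbitrary subtree.

    Name     Effect                Trees                                                       Cycles
    (none)   r_i                   temp                                                        0
    ADD      r_i <- r_j + r_k      +                                                           1
    MUL      r_i <- r_j * r_k      *                                                           1
    ADDI     r_i <- r_j + c        +(e, const), +(const, e), const                             1
    LOAD     r_i <- M[r_j + c]     mem(+(e, const)), mem(+(const, e)), mem(const), mem(e)      3
    STORE    M[r_j + c] <- r_i     move(mem(+(e, const)), e), move(mem(+(const, e)), e),
                                   move(mem(const), e), move(mem(e), e)                        3
    MOVEM    M[r_j] <- M[r_i]      move(mem(e), mem(e))                                        6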



Optimal and optimum tilings

The cost of a tiling is the sum of the costs of the tree patterns

An optimal tiling is one where no two adjacent tiles can be combined into a single tile of lower cost

An optimum tiling is a tiling with lowest possible cost

An optimum tiling is also optimal, but not vice-versa



Maximal Munch

- Maximal Munch is an algorithm for optimal tiling
  - Start at the root of the tree
  - Find the largest pattern that fits
  - Cover the root node plus the other nodes in the pattern; the instruction corresponding to the tile is generated
  - Do the same for the resulting subtrees
- Maximal Munch generates the instructions in reverse order!
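A compact sketch of Maximal Munch on expression trees is given below. It is illustrative only and rests on several assumptions: IR trees are nested tuples such as `("MEM", addr)`, `("+", l, r)`, `("CONST", 8)` and `("TEMP", "fp")`; only the ADD, MUL, ADDI and LOAD patterns are handled; and a register `r0` that always holds zero is assumed for the bare-constant patterns. The function name `maximal_munch` is hypothetical. The recursion picks the tile at the root first but emits the instructions of the subtrees before the instruction of the root, which deals with the reverse-order issue.

```python
import itertools

def maximal_munch(tree):
    """Tile an expression tree top-down; return (instructions, result register)."""
    code = []
    counter = itertools.count(1)

    def emit(template, *operands):
        dest = f"r{next(counter)}"            # fresh symbolic register
        code.append(template.format(dest, *operands))
        return dest

    def split_offset(addr):
        """Largest address pattern first: e + CONST, CONST + e, CONST, plain e."""
        if addr[0] == "+" and addr[2][0] == "CONST":
            return addr[1], addr[2][1]
        if addr[0] == "+" and addr[1][0] == "CONST":
            return addr[2], addr[1][1]
        if addr[0] == "CONST":
            return None, addr[1]
        return addr, 0

    def munch(t):
        kind = t[0]
        if kind == "TEMP":                    # value already lives in a register
            return t[1]
        if kind == "CONST":                   # assumed: r0 always contains 0
            return emit("ADDI {} <- r0 + {}", t[1])
        if kind == "MEM":
            base, offset = split_offset(t[1])
            base_reg = "r0" if base is None else munch(base)
            return emit("LOAD {} <- M[{} + {}]", base_reg, offset)
        left, right = t[1], t[2]
        if kind == "+" and right[0] == "CONST":
            return emit("ADDI {} <- {} + {}", munch(left), right[1])
        if kind == "+" and left[0] == "CONST":
            return emit("ADDI {} <- {} + {}", munch(right), left[1])
        name = {"+": "ADD", "*": "MUL"}[kind]
        return emit(name + " {} <- {} " + kind + " {}", munch(left), munch(right))

    result = munch(tree)
    return code, result

tree = ("+", ("MEM", ("+", ("TEMP", "fp"), ("CONST", 8))),
             ("*", ("TEMP", "t1"), ("CONST", 4)))
code, result = maximal_munch(tree)
print("\n".join(code))
```

The memory access is covered by a single LOAD tile that absorbs the address addition and its constant, while the multiplication needs a separate ADDI to materialize the constant 4.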



Dynamic programming

- Dynamic programming is a technique for finding optimum solutions
  - Bottom up approach
  - For each node n the costs of all children are found recursively. Then the minimum cost for node n is determined.
- After cost assignment of the entire tree, instruction emission follows:
  - Emission(node n): for each leaf l_i of the tile selected at node n, perform Emission(l_i). Then emit the instruction matched at node n



Register allocation... a graph coloring problem

- First do instruction selection assuming an infinite number of symbolic registers
- Build an interference graph
  - Each node is a symbolic register
  - Two nodes are connected when they are live at the same time
- Color the interference graph
  - Connected nodes cannot have the same color
  - Minimize the number of colors (maximum is the number of actual registers)



Coloring by simplification

Simplify interference graph G using heuristic method (K-coloring a graph is NP-complete)
- Find a node m with less than K neighbors
- Remove node m and its edges from G, resulting in G'. Store m on a stack
- Color the graph G'
- Graph G can be colored since m has less than K neighbors



Coloring by simplification (cont’d)

- Spill
  - If a node with less than K neighbors cannot be found in G
  - Mark a node n to be spilled, remove n and its edges from G (and stack n) and continue simplification
- Select
  - Assign colors by popping the stack
  - Arriving at a spill node, check whether it can be colored. If not:
    - The variable represented by this node will reside in memory (i.e. is spilled to memory)
    - Actual spill code is inserted in the program
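The simplify, spill and select phases can be sketched as follows. This is a simplified illustration, not a full allocator: the interference graph is a dictionary mapping each node to the set of its neighbors, the spill candidate is simply a node of highest degree (an assumption; real allocators use cost estimates), no spill code is generated, and coalescing is left out. The function name `color_graph` and the two small graphs are hypothetical.

```python
def color_graph(graph, k):
    """graph: dict node -> set of neighbours. Returns (coloring, spilled)."""
    work = {n: set(adj) for n, adj in graph.items()}    # mutable copy of G
    stack = []
    while work:
        # Simplify: look for a node with fewer than k neighbours.
        node = next((n for n in sorted(work) if len(work[n]) < k), None)
        if node is None:
            # Spill: mark a candidate (here the highest degree) and continue.
            node = max(sorted(work), key=lambda n: len(work[n]))
        for m in work.pop(node):                        # remove node and its edges
            work[m].discard(node)
        stack.append(node)
    coloring, spilled = {}, []
    while stack:                                        # Select: pop and color
        node = stack.pop()
        used = {coloring[m] for m in graph[node] if m in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[node] = free[0]
        else:
            spilled.append(node)                        # would reside in memory
    return coloring, spilled

# Two triangles sharing the edge a-c: colorable with 3 colors.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
coloring, spilled = color_graph(g, 3)
print(coloring, spilled)

# A complete graph on 4 nodes cannot be colored with 3 colors: one node spills.
k4 = {n: set("abcd") - {n} for n in "abcd"}
coloring4, spilled4 = color_graph(k4, 3)
print(coloring4, spilled4)
```

In the second graph no node has fewer than three neighbors, so one node is marked as a spill candidate and indeed cannot be colored during select.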



Coalescing

- If there is no interference edge between the source and destination of a move, the move is redundant
- Removing the move and joining the nodes is called coalescing
- Coalescing increases the degree of a node
- A graph that was K colorable before coalescing might not be afterwards



Sketch of the algorithm with coalescing

Label move-related nodes in interference graph

- While interference graph is nonempty
  - Simplify, using non-move-related nodes
  - Coalesce move-related nodes using conservative coalescing
    - Coalesce only when the resulting node has less than K neighbors with a significant degree
  - No simplifications/coalescings: "freeze" a move-related node of a low degree, i.e. do not consider its moves for coalescing anymore
  - Spill
- Select



Register allocation: an example

    Live in: k, j
        g = mem[j+12]
        h = k - 1
        f = g * h
        e = mem[j+8]
        m = mem[j+16]
        b = mem[f]
        c = e + 8
        d = c
        k = m + 4
        j = b
        goto d
    Live out: d, k, j

[Figure: interference graph over the nodes b, c, d, e, f, g, h, j, k and m]

- Assume a 4-coloring (K = 4)
- Simplify by removing and stacking nodes with < 4 neighbors (g, h, k, f, e, m)



Register allocation: an example (cont’d)

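After removing and stacking the nodes g, h, k, f, e, m:

[Figure: the interference graph after simplification, with the remaining nodes d, c, j and b, and the graph after coalescing, with the merged nodes j&b and d&c]

Coalesce now and simplify again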




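4 registers available: R0 R1 R2 R3

Stacked elements: d&c, j&b, m, e, f, k, g, h

[Figure: the interference graph over the nodes b, c, d, e, f, g, h, j, k and m, shown before and after colors are assigned to the first popped nodes]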




4 registers available: R0 R1 R2 R3

Stacked elements: m, e, f, k, g, h

[Figure: the interference graph with further nodes colored as the stack is popped]

Etc., etc.

No spills are required and both moves were optimized away



Instruction scheduling

- Increase ILP (e.g., by avoiding pipeline hazards)
  - Essential for VLIW processors
- Scheduling at basic block level: list scheduling
  - System resources represented by a matrix Resources × Time
  - Position in matrix is true or false, indicating whether the resource is in use at that time
  - Instructions represented by matrices Resources × Instruction duration
  - Using dependency analysis, the schedule is made by fitting instructions as tightly as possible



List scheduling (cont’d)

- Finding an optimal schedule is an NP-complete problem => use heuristics, e.g. at an operation conflict schedule the most time-critical first
- For a VLIW processor, the maximum instruction duration is used for scheduling => painful for memory loads!
- Basic blocks usually are small (5 operations on the average) => benefit of scheduling limited => Trace Scheduling



Trace scheduling

- Schedule instructions over code sections larger than basic blocks, so-called traces
- A trace is a series of basic blocks that does not extend beyond loop boundaries
- Apply list scheduling to the whole trace
- Scheduling code inside a trace can move code beyond basic block boundaries => compensate for this by adding code to the off-trace edges



Trace scheduling (cont’d)

[Figure: flow graph with the basic blocks BB1, BB2, BB3 and BB4]




[Figure: three cases of moving an operation within a trace made of Op A, Op B, Op C and a Branch, with in-trace and off-trace basic blocks. (a) Operation to be moved below the Branch in the trace: copied code is placed in the off-trace basic block. (b) Operation to be moved above the Branch: moved code is only allowed if it has no side effects in the off-trace code. (c) Operation to be moved before Op A: copied code is placed in the off-trace basic block.]




Trace selection

- Because of the code copies, the trace that is most often executed has to be scheduled first
- A longer trace brings more opportunities for ILP (loop unrolling!)
- Use heuristics about how often a basic block is executed and which paths to and from a block have the most chance of being taken (e.g. inner loops) or use profiling (input dependent)



Other methods to increase ILP

Loop unrolling
- Technique for increasing the amount of code available inside a loop: make several copies of the loop body
- Reduces loop control overhead and increases ILP (more instructions to schedule)
- When using trace scheduling this results in longer traces and thus more opportunities for better schedules
- In general, the more copies, the better the job the scheduler can do, but the additional gain becomes minimal



Loop unrolling (cont’d)

Example

    for (i = 0; i < 100; i++)
        a[i] = a[i] + b[i];

becomes

    for (i = 0; i < 100; i += 4) {
        a[i] = a[i] + b[i];
        a[i+1] = a[i+1] + b[i+1];
        a[i+2] = a[i+2] + b[i+2];
        a[i+3] = a[i+3] + b[i+3];
    }



Software pipelining

- Also a technique for using the parallelism available in several loop iterations
- Software pipelining simulates a hardware pipeline, hence its name

[Figure: iterations 0 to 4 overlapped in time; one software pipelined iteration combines parts of several original iterations]

There are three phases: Prologue, Steady state and Epilogue



Software pipelining (cont’d)

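    Loop:  LD     F0,0(R1)       \
           ADDD   F4,F0,F2        |  Body
           SD     0(R1),F4       /
           SBGEZ  R1, Loop          Loop control

[Figure: software-pipelined schedule over the time steps T0, T1, T2, ..., Tn, Tn+1, Tn+2. Prologue: the first LD and ADDD operations are started. Steady state (Loop): each step issues SD, ADDD and LD from successive iterations, followed by SBGEZ Loop. Epilogue: the remaining ADDD and SD operations are completed.]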



Modulo scheduling

- Scheduling multiple loop iterations using software pipelining can create false dependencies between variables used in different iterations
- Renaming the variables used in different iterations is called modulo scheduling
- When using n variables for representing the same variable, the steady state of the loop has to be unrolled n times



Compiler optimizations for cache performance

Merging arrays (better spatial locality)

    int val[SIZE];        =>      struct merge {
    int key[SIZE];                    int val, key; };
                                  struct merge m_array[SIZE]

Loop interchange

Loop fusion and fission

Blocking (better temporal locality)



Loop interchange

- Exchanging of nested loops to change the memory footprint
- Better spatial locality

    for (i = 0; i < 50; i++)
        for (j = 0; j < 100; j++)
            a[j][i] = b[j][i] * c[j][i];

becomes

    for (j = 0; j < 100; j++)
        for (i = 0; i < 50; i++)
            a[j][i] = b[j][i] * c[j][i];



Loop fusion

- Fuse multiple loops together
  - Less loop control
  - Bigger basic blocks (scheduling)
  - Possibly better temporal locality

    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    for (j = 0; j < n; j++)
        d[j] = a[j] * e[j];

becomes

    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] * e[i];
    }



Loop fission

- Split a loop with independent statements into multiple loops
  - Enables other transformations (e.g. vectorization)
  - Results in smaller cache footprint (better temporal locality)

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
        d[i] = e[i] * f[i];
    }

becomes

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
    for (i = 0; i < n; i++) {
        d[i] = e[i] * f[i];
    }



Blocking

Perform computations on sub-matrices (blocks), e.g. when multiple matrices are accessed both row by row and column by column

Matrix multiplication x = y*z

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++) {
                r = r + y[i][k]*z[k][j];
            };
            x[i][j] = r;
        };

[Figure: access patterns of the matrices X, Y and Z for the loop above and for the blocked version; shading distinguishes elements not touched, older accesses and recent accesses]
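The slide does not spell out the blocked loop nest, so the sketch below shows one common way to block the multiplication, with a block size `b` as an assumed parameter. It is written in Python only to check that the blocked version computes the same product as the straightforward one; the cache benefit itself only shows up in a compiled language with large matrices. The function names are illustrative.

```python
def matmul_naive(y, z):
    n = len(y)
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = 0
            for k in range(n):
                r += y[i][k] * z[k][j]
            x[i][j] = r
    return x

def matmul_blocked(y, z, b):
    """Same product, computed on b x b sub-blocks of z to improve reuse."""
    n = len(y)
    x = [[0] * n for _ in range(n)]
    for jj in range(0, n, b):                  # block of columns of z and x
        for kk in range(0, n, b):              # block of rows of z
            for i in range(n):
                for j in range(jj, min(jj + b, n)):
                    r = 0
                    for k in range(kk, min(kk + b, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r               # accumulate partial sums per block
    return x

n = 5
y = [[i + 2 * k for k in range(n)] for i in range(n)]
z = [[k * j - 1 for j in range(n)] for k in range(n)]
print(matmul_blocked(y, z, 2) == matmul_naive(y, z))
```

The inner loops now work on one b x b block of z at a time, so those elements are reused for every row i before the computation moves on to the next block.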

