
CS502: Compiler Design

Code Generation

Manas Thakur

Fall 2020


So near yet so far!

Front end:

    Character stream
    -> Lexical Analyzer -> Token stream
    -> Syntax Analyzer -> Syntax tree
    -> Semantic Analyzer -> Syntax tree
    -> Intermediate Code Generator -> Intermediate representation

Back end:

    -> Machine-Independent Code Optimizer -> Intermediate representation
    -> Code Generator -> Target machine code
    -> Machine-Dependent Code Optimizer -> Target machine code

Symbol Table (used by all phases)


Recall from Lecture 2

    t1 = id3 * 32.0
    id1 = id2 + t1

    -> Code Generator ->

    LD  R1, id3
    MUL R1, R1, #32.0
    LD  R2, id2
    ADD R1, R1, R2
    ST  id1, R1


Roles of Code Generator

● Convert IR to target program.

– Bring it down!

● Using the primitives available on the target machine.

– Usually a form of assembly.

● Requirement: Preserve the semantics of the source program.

– In terms of the observable behaviour.

● Expectation: Target code is of high quality.

– Execution time, space, energy, and so on.

● Expectation2: Code generator itself should be efficient.

– Is it so hard an expectation for an IR generator?


Code Generation Demo


Code generation in reality

● The problem of generating an optimal target program is undecidable.

– Recall we had said most problems in the front-end are simple, and most in the back-end are complex?

● Several subproblems are NP-hard or NP-complete.

● Need to depend upon:

– Approximation algorithms

– Heuristics

– Conservative estimates


Essential tasks during code generation

● Instruction selection

– Map low-level IR to actual machine (or assembly) instructions

– Not necessarily 1-1 mapping

– Several things vary with the architecture:

    ● Instruction set

    ● Addressing modes

● Register allocation

– Low-level IR assumes unlimited registers

– Map to actual resources provided by the hardware

– Goal: Make the best use of registers

Also, Instruction Scheduling


Where are registers?

Levels, from the CPU outwards (farther away, larger, slower):

                      Pentium 4 (3.2 GHz)   Core 2 Duo           Athlon 64
    Registers         1 cycle               1 cycle              1 cycle
    Level-1 cache     2 cycles (16 KB)      3 cycles (64 KB)     3 cycles (128 KB)
    Level-2 cache     19 cycles (2 MB)      14 cycles (2 MB)     13 cycles (1 MB)
    Main memory       204 cycles            180 cycles           125 cycles
    Virtual memory    millions of cycles    millions of cycles   millions of cycles


Architecture and registers

● RISC

– Many registers, 3AC, simple addressing modes

– Examples: IBM PowerPC, Oracle SPARC, ARM (mobiles, tablets)

● CISC

– Few registers, 2AC, Variety of addressing modes, several register classes, variable-length instructions, instructions with side-effects

– Examples: Intel x86, AMD Athlon

● Stack machine

– Push/Pop, stack-top uses registers

– Example: JVM


Register allocation

● Involves

– Allocation: which variables to be put into registers

– Assignment: which register to use for a variable

● Finding an optimal assignment of registers to variables is an NP-complete problem.

● Architectural conventions complicate matters:

– Combination of registers used for double-precision arithmetic

– Result must be stored into accumulator

– Some registers reserved for special purposes


Live ranges

● A variable is live if its current value may be used in future.

– Two variables that are live at the same time cannot use the same register.

– They interfere with each other.

● Conversely, if two variables do not interfere, then they can use the same register.

● Need to determine the durations in which variables are live.

    [S1]     a = 0
    [S2] L1: b = a + 1
    [S3]     c = c + b
    [S4]     a = b * 2
    [S5]     if a < N goto L1
    [S6]     return c

Live ranges:

    a: {S1-S2, S4-S5}
    b: {S2-S4}
    c: {S3-S6}
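The durations can be computed mechanically with a backward dataflow pass. The following is a minimal sketch, assuming each statement is described by hand as (defs, uses, successors) and that N is a constant; the table `PROGRAM` and the function `liveness` are illustrative names, not part of the lecture. Because c is read at S3 before it is written anywhere in this fragment, the sketch reports c as live from S1 onward, which is wider than the S3-S6 range listed above.

```python
# Hypothetical encoding of the example: label -> (defs, uses, successors).
PROGRAM = {
    "S1": ({"a"}, set(),      ["S2"]),
    "S2": ({"b"}, {"a"},      ["S3"]),
    "S3": ({"c"}, {"c", "b"}, ["S4"]),
    "S4": ({"a"}, {"b"},      ["S5"]),
    "S5": (set(), {"a"},      ["S2", "S6"]),  # conditional jump back to L1
    "S6": (set(), {"c"},      []),
}


def liveness(program):
    """Iterate live_in/live_out to a fixed point (backward analysis)."""
    live_in = {s: set() for s in program}
    live_out = {s: set() for s in program}
    changed = True
    while changed:
        changed = False
        for s in reversed(list(program)):
            defs, uses, succs = program[s]
            # live_out is the union of live_in over all successors.
            out = set().union(*(live_in[t] for t in succs))
            # A variable is live on entry if used here, or live after
            # the statement without being redefined by it.
            inn = uses | (out - defs)
            if out != live_out[s] or inn != live_in[s]:
                live_out[s], live_in[s] = out, inn
                changed = True
    return live_in, live_out


live_in, live_out = liveness(PROGRAM)
for s in PROGRAM:
    print(s, "live-out:", sorted(live_out[s]))
```

Visiting statements in reverse order is only a speed-up; the fixed point is the same for any visiting order.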


Interference graphs

● Represent program variables/temporaries as nodes.

● If the live ranges of variables u and v overlap, then draw an edge between u and v.

● An edge (u,v) indicates that variables u and v interfere, and hence cannot be mapped to the same register.

● Tomorrow: Can we color graphs to perform register allocation?

    [S1]     a = 0
    [S2] L1: b = a + 1
    [S3]     c = c + b
    [S4]     a = b * 2
    [S5]     if a < N goto L1
    [S6]     return c

[Figure: interference graph with nodes a, b, c and edges a-c and b-c]
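One common way to build the graph is to add, at every definition of a variable, an edge to each other variable that is live just after that definition. The sketch below assumes the live-after sets for S1 to S4 have already been obtained from a liveness analysis of the example; the list `FACTS` and the function `build_interference` are illustrative names.

```python
# Assumed input: (variable defined, variables live right after) per statement.
FACTS = [
    ("a", {"a", "c"}),  # S1: a = 0
    ("b", {"b", "c"}),  # S2: b = a + 1
    ("c", {"b", "c"}),  # S3: c = c + b
    ("a", {"a", "c"}),  # S4: a = b * 2
]


def build_interference(facts):
    """Return the interference graph as a dict: node -> set of neighbours."""
    graph = {}
    for defined, live_after in facts:
        graph.setdefault(defined, set())
        for other in live_after:
            graph.setdefault(other, set())
            if other != defined:
                # The new value of `defined` coexists with `other`.
                graph[defined].add(other)
                graph[other].add(defined)
    return graph


graph = build_interference(FACTS)
print({n: sorted(adj) for n, adj in sorted(graph.items())})
```

Here a and b never receive an edge: a is not live after S2, where b is defined, and b is not live after S4, where a is redefined.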


Code Generation (Cont.)



Register allocation using graph coloring

● Idea:

– If we can color the interference graph using K colors, then we can allocate the variables to K registers.

– Two nodes that interfere with each other must use different colors.

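[Figure: the interference graph over a, b, c, with a and b colored as Register R1 and c colored as Register R2]

Before allocation:

        a = 0
    L1: b = a + 1
        c = c + b
        a = b * 2
        if a < N goto L1
        return c

After allocation:

        R1 = 0
    L1: R1 = R1 + 1
        R2 = R2 + R1
        R1 = R1 * 2
        if R1 < N goto L1
        return R2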


Graph coloring

● Key questions:

– Can we efficiently find a K-coloring of the graph?

    ● Bad news: Graph coloring is an NP-complete problem.

– Can we efficiently find the optimal coloring of the graph (i.e., using the least number of colors)?

    ● We don’t necessarily need the perfect coloring.

    ● Compute an approximation with heuristics.

– What do we do when there aren’t enough colors (registers) to color the graph?

    ● Temporarily move that variable to memory (slow, but what else!).

    ● Called spilling.

    ● Need to add instructions to “store” and (later) “load” the spilled variable.


Graph coloring: a simplistic approach

repeat
    repeat
        Remove a node n and all its edges from G, such that the degree of n is less than K
        Push n onto a stack
    until G has no node with degree less than K
    // G is now either empty or all its nodes have degree ≥ K
    if G is not empty then
        Take a node m and all its edges out of G, and mark m for spilling
    endif
until G is empty
Take one node at a time from stack and assign a non-conflicting color


Need for spill

[Figure: two interference graphs, each over nodes a, b, c, d, e]

● Is this graph 2-colorable?

● What about this one?


Kempe’s heuristic (1879!) to reduce spilling

repeat
    repeat
        Remove a node n and all its edges from G, such that the degree of n is less than K
        Push n onto a stack
    until G has no node with degree less than K
    // G is now either empty or all its nodes have degree ≥ K
    if G is not empty then
        Take one node m out of G
        Push m onto the stack
    endif
until G is empty
Take one node at a time from stack and assign a non-conflicting color if possible, else spill
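The two procedures differ only in what happens to the node m: the simplistic version marks it for spilling immediately, while this one pushes it and decides at coloring time. A minimal sketch of both, selected by a flag, is given below. It assumes the graph is a dict mapping each node to its set of neighbours, and it picks m as a highest-degree node, which is one possible choice that the pseudocode leaves open. The 4-cycle `SQUARE` is a made-up test graph, not one of the graphs from the slides.

```python
def color_graph(graph, k, optimistic=True):
    """Return (colors, spilled) for a K-coloring attempt on graph."""
    work = {n: set(adj) for n, adj in graph.items()}  # mutable copy of G
    stack, spilled = [], set()

    def remove(n):
        # Take n and all its edges out of the working graph.
        for m in work.pop(n):
            work[m].discard(n)

    while work:
        low = [n for n in sorted(work) if len(work[n]) < k]
        if low:
            remove(low[0])
            stack.append(low[0])
        else:
            # Every remaining node has degree >= K.
            m = max(sorted(work), key=lambda n: len(work[n]))
            remove(m)
            if optimistic:
                stack.append(m)   # defer the decision to coloring time
            else:
                spilled.add(m)    # simplistic approach: spill right away

    colors = {}
    while stack:
        n = stack.pop()
        used = {colors[m] for m in graph[n] if m in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            spilled.add(n)        # no non-conflicting color is left
    return colors, spilled


# Hypothetical example: a 4-cycle, where every node has degree 2.
SQUARE = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(color_graph(SQUARE, 2, optimistic=False))
print(color_graph(SQUARE, 2, optimistic=True))
```

With K = 2 the simplistic variant spills one node of the 4-cycle, whereas the deferred variant finds that the two neighbours of the pushed node received the same color and so nothing is spilled.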


NFS Revisited!

[Figure: the same two interference graphs over nodes a, b, c, d, e]

● First graph: No need to spill now.

● Second graph: Don’t have a choice; need to spill.


Coalescing

● If there is a copy statement x = y such that there is no conflict between x and y:

– We can use the same register for both x and y

– Merge the graph nodes for x and y into one node

● Good because:

– Reduces the number of registers and removes move instructions

● Bad because:

– Increases the number of neighbours of the merged node, which may cause more spilling
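The trade-off can be seen by merging two nodes in a small graph. The sketch below assumes the same dict-of-sets graph representation as a typical interference graph; the function `coalesce` and the four-node example are hypothetical.

```python
def coalesce(graph, x, y):
    """Merge node y into node x and return the new graph."""
    if y in graph[x]:
        raise ValueError("x and y interfere, so they cannot be coalesced")
    merged = {n: set(adj) for n, adj in graph.items() if n != y}
    for n in graph[y]:
        # Every neighbour of y becomes a neighbour of the merged node x.
        merged[n].discard(y)
        merged[n].add(x)
        merged[x].add(n)
    return merged


# Hypothetical graph for a copy x = y: x interferes with p, y with q.
before = {"x": {"p"}, "y": {"q"}, "p": {"x"}, "q": {"y"}}
after = coalesce(before, "x", "y")
print(len(before["x"]), "->", len(after["x"]))
```

The merged node inherits the neighbours of both x and y, so its degree grows from 1 to 2 here; in a larger graph this is what can push a node to degree K or more.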


Register Allocation (Cont.)

● Register allocation is expensive

– Many algorithms use heuristics for graph coloring

– Still it may take time quadratic in the number of live ranges

● Online/JIT compilers need to generate code quickly

– Sacrifice efficient register allocation for compilation speed

● Linear scan register allocation

– Massimiliano Poletto and Vivek Sarkar (ACM TOPLAS 1999)

– Idea: Make one pass over the list of variables

– Spill variables with longest lifetimes – those that would tie up a register for the longest time


LSRA (Cont.)

● Compute live intervals

– A live interval for a variable is a range [i,j], such that

    ● The variable is not live before instruction i

    ● The variable is not live after instruction j

– Overlapping intervals imply interference

● Given R registers and N overlapping intervals

– R intervals allocated to registers

– N-R intervals spilled

● Key: Choosing the right intervals to spill


Linear scan algorithm

● Sort live intervals

– In order of increasing start points

– Quickly find the next live interval in order

● Maintain a sorted list of active intervals

– In order of increasing end points

– Quickly find expired intervals

● At each step, update active as follows:

– Add the next interval from the sorted list

– Remove any expired intervals (those whose end points are earlier than the start point of the new interval)


Linear scan algorithm (Cont.)

● Restriction:

– Never allow active to have more than R elements

● Spill scenario:

– active has R elements; new interval doesn’t cause any existing intervals to expire

● Heuristic:

– Spill the interval that ends last (furthest from current position)

    ● Greedy algorithm


LSRA example (2 registers)

● Step 1: active = {a}

● Step 2: active = {a, b}

● Step 3: active = {a, b, c} ==> spill c ==> active = {a, b}

● Step 4: a and b expire; active = {d}

● Step 5: active = {e, d}

[Figure: live intervals of variables a, b, c, d, e over program points 1 to 5]

Two registers, one spill
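The steps can be reproduced with a short sketch of linear scan. The interval endpoints in `INTERVALS` are assumptions chosen so that the trace matches Steps 1 to 5 (the figure's exact endpoints are not available); `linear_scan` and the register names are illustrative.

```python
def linear_scan(intervals, num_regs):
    """intervals: name -> (start, end). Returns (register map, spilled set)."""
    free = [f"R{i + 1}" for i in range(num_regs)]
    active = []                     # (end, name), kept sorted by end point
    reg, spilled = {}, set()
    # Visit intervals in order of increasing start point.
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that end before the new one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(reg[old])
        if len(active) == num_regs:
            last_end, last = active[-1]
            if last_end > end:
                # An active interval ends later: spill it, reuse its register.
                reg[name] = reg.pop(last)
                spilled.add(last)
                active[-1] = (end, name)
                active.sort()
            else:
                spilled.add(name)   # the new interval itself ends last
        else:
            reg[name] = free.pop(0)
            active.append((end, name))
            active.sort()
    return reg, spilled


# Assumed intervals, consistent with the step-by-step trace above.
INTERVALS = {"a": (1, 3), "b": (2, 3), "c": (3, 5), "d": (4, 5), "e": (5, 5)}
reg, spilled = linear_scan(INTERVALS, 2)
print(reg, spilled)
```

At c the active list is full and c ends last, so c is spilled; a and b expire when d starts, which frees both registers for d and e.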


LSRA: A few comments

● Significantly faster RA than graph coloring

● Efficacy of RA not as good:

– Holes in live ranges not taken into account

    ● Graph coloring can take care by maintaining different graphs at different points

– A variable once spilled remains spilled forever

    ● Improvements exist; e.g., Traub et al.’s second-chance binpacking

● The choice in most fast JIT compilers

● Next class: Instruction selection





Instruction Selection

● Low-level IR different from machine ISA

– Why?

– Allow different back ends

● Differences between IR and ISA?

– IR: simple, uniform set of operations

– ISA: many specialized instructions

● Often a single assembly instruction does the work of several operations in the IR

– May vary between CISC and RISC architectures


Instruction selection: Easy solution

● Map each IR operation to a single instruction

● May need to include memory operations:

    x = y + z;

    mov y, r1
    mov z, r2
    add r2, r1
    mov r1, x

● Problem: Inefficient use of ISA.

● Example: What if the ISA contains an add instruction with one or more memory operands in the addressing mode?


Wacky x86 Idioms

● What does this do?

    xor %eax, %eax

● Why not use this?

    mov $0, %eax

● Answer: Immediate operands are encoded in the instruction, making it bigger and therefore costly to fetch and execute.


More wacky x86 Idioms

● What does this do?

    xor %ebx, %eax
    xor %eax, %ebx
    xor %ebx, %eax

● Swap the values of %eax and %ebx.

● Why do it this way?

– No need for extra register!
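To see why three XORs swap two values, the sequence can be mirrored on plain integers. This is only a sketch of the arithmetic: `xor_swap` is an illustrative name, and each line corresponds to one instruction in AT&T operand order (source first, destination second).

```python
def xor_swap(eax, ebx):
    """Mirror the three-instruction sequence on two integers."""
    eax ^= ebx  # xor %ebx, %eax : eax now holds (a ^ b)
    ebx ^= eax  # xor %eax, %ebx : ebx = b ^ (a ^ b) = a
    eax ^= ebx  # xor %ebx, %eax : eax = (a ^ b) ^ a = b
    return eax, ebx


print(xor_swap(3, 5))
```

The trick relies on x ^ x = 0 and on XOR being associative and commutative, so no temporary is needed.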


Architectural differences in the ISA

● RISC:

– Arithmetic operations require registers

– Explicit loads and stores

● CISC:

– Complex instructions

– Arithmetic operations may refer to memory

– Often only one memory operand per instruction

RISC:

    ld  8(r0), r1
    add r2, r1

CISC:

    add 8(%esp), %eax


Selecting instructions with minimal cost

● Difficulties:

– How to find which patterns are the best?

– The translation may not be one-to-one

– The translation may not be even linear

● Idea: Back to tree representation

– Convert computation into a tree

– Match parts of a tree


Instruction Selection by Tree Rewriting

● Form:

    replacement <- template { action }

– replacement: a single node; template: a tree; action: a code fragment

● Hey, this looks like a syntax-directed translation scheme!

● Set of tree-rewriting rules called tree-translation scheme.

● Example: a tree with root + and children Ri and Rj is replaced by the single node Ri:

    Ri <- + (Ri, Rj)    { ADD Ri, Ri, Rj }


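Sample Tree-Rewriting Rules

[Figure: tree templates for sample rewriting rules, rooted at = and +]


Sample Tree-Rewriting Rules (Cont.)

[Figure: tree templates for further sample rewriting rules, rooted at +]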


Code Generation by Tree Tiling

● Consider the following tree for a[i] = b + 1:

[Figure: tree for a[i] = b + 1, with = at the root]

Let’s generate code for this.


Next class

● How do we actually perform tree matching?

● What do we do if more than one template matches at a time?

● ... some more relevant stuff ...

● And finally:





Tree Matching by Parsing!

● Idea:

– Represent the input tree using its prefix representation

– Represent tree templates as RHS of grammar rules

– Let pattern matching be done by a parser

– Convert tree translation to syntax-directed translation

[Figure: tree for a[i] = b + 1, with = at the root]

= ind + + Ca RSP ind + Ci RSP + Mb C1
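The prefix string above follows from a preorder walk of the tree. A minimal sketch, assuming the tree is encoded as nested tuples (operator first, then children) with leaves as plain strings; the tree shape in `TREE` is read back from the prefix string itself, given that = and + are binary and ind is unary, and `prefix` is an illustrative helper name.

```python
def prefix(tree):
    """Return the prefix (preorder) representation of a tuple-encoded tree."""
    if isinstance(tree, str):      # leaf: constant, memory or register
        return tree
    op, *children = tree
    return " ".join([op] + [prefix(child) for child in children])


# a[i] = b + 1: store into the location (Ca + RSP) + contents of (Ci + RSP).
TREE = ("=",
        ("ind", ("+", ("+", "Ca", "RSP"), ("ind", ("+", "Ci", "RSP")))),
        ("+", "Mb", "C1"))

print(prefix(TREE))
```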


SDT for Tree Translation

Ri -> Ca { LD Ri, #a }

Ri -> Mx { LD Ri, x }

M -> = Mx Ri { ST x, Ri }

M -> = ind Ri Rj { ST *Ri, Rj }

Ri -> ind + Ca Rj { LD Ri, a(Rj) }

Ri -> + Ri ind + Ca Rj { ADD Ri, Ri, a(Rj) }

Ri -> + Ri Rj { ADD Ri, Ri, Rj }

Ri -> + Ri C1 { INC Ri }

R -> sp

M -> m


SDT for TT: A Few Remarks

● Benefits:

– Efficient and well understood

– Easy to target a new machine

● Challenges:

– Fixed left-to-right evaluation order

– The machine description grammar and parser may become large

– Special care to ensure we don’t get stuck into an infinite code-generation loop (single symbols on RHS)

Good news: Code-generator generators exist!


Including Cost

● Algorithm:

– For each node, find minimum cost tiling for that node and the subtrees below.

● Key:

– Once we have a minimum cost subtree rooted at node n, we can find the minimum cost tiling for n by trying out all possible tiles matching n.

● Use dynamic programming!


Dynamic Programming

● Idea:

– For problems with optimal substructure

– Compute optimal solutions to sub-problems

– Combine into an overall optimal solution

● How does this help?

– Use memoization:

    ● Save previously computed solutions to sub-problems

– Sub-problems recur many times
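Applied to tiling, this means computing the cheapest tiling of each subtree once and reusing it for every tile that leaves that subtree as a hole. The sketch below is one possible realization under stated assumptions: the tile set `TILES` and its unit costs are invented for illustration (they resemble, but are not, the rules from the SDT slide), trees are nested tuples, "R" in a pattern marks a subtree that must first be computed into a register, and "_" matches any leaf payload.

```python
from functools import lru_cache

# (instruction or None, pattern, assumed cost); all costs are hypothetical.
TILES = [
    (None,          ("reg", "_"),                        0),  # already in a register
    ("LD R, #c",    ("const", "_"),                      1),
    ("LD R, x",     ("mem", "_"),                        1),
    ("LD R, *R",    ("ind", "R"),                        1),
    ("LD R, a(R)",  ("ind", ("+", ("const", "_"), "R")), 1),
    ("ADD R, R, R", ("+", "R", "R"),                     1),
    ("INC R",       ("+", "R", ("const", 1)),            1),
]


def match(pattern, tree):
    """Return the subtrees bound to "R" holes, or None if no match."""
    if pattern == "R":
        return [tree]
    if pattern == "_":
        return []
    if not isinstance(pattern, tuple):
        return [] if pattern == tree else None
    if not isinstance(tree, tuple) or len(tree) != len(pattern):
        return None
    holes = []
    for p, t in zip(pattern, tree):
        sub = match(p, t)
        if sub is None:
            return None
        holes += sub
    return holes


@lru_cache(maxsize=None)           # memoization: each subtree is solved once
def best(tree):
    """Return (minimum cost, instructions in emission order) for tree."""
    options = []
    for instr, pattern, cost in TILES:
        holes = match(pattern, tree)
        if holes is None:
            continue
        code = []
        for sub in holes:          # optimal substructure: reuse sub-solutions
            sub_cost, sub_code = best(sub)
            cost += sub_cost
            code.extend(sub_code)
        if instr is not None:
            code.append(instr)
        options.append((cost, tuple(code)))
    return min(options)


print(best(("+", ("mem", "b"), ("const", 1))))
print(best(("ind", ("+", ("const", "i"), ("reg", "sp")))))
```

For b + 1 the INC tile beats loading the constant and adding, and for the indexed load the single large tile beats three small ones; with different assumed costs the choices could flip.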


Modern Processors

● Execution time not the sum of tile times

● Instruction order matters:

– Pipelining: parts of different instructions overlap

– Bad ordering stalls the pipeline; e.g. many operations of one type

– Superscalar: some operations executed in parallel

● Cost is an approximation

● Instruction scheduling helps


The Beginning of (Even) More Interesting Things


Code Generation & Optimization: An Intro

● Goal: Generate optimized code

● Metrics:

– Code size

    ● Memory requirements

    ● Say opcodes take 1 byte, each operand another byte

– Number of registers

    ● Along with constraints on their usage

– Estimated cost

    ● How fast is the code?

    ● Say instructions with single operand take 1 cycle, with two operands take 2 cycles, and those involving memory take 4 cycles

– Sometimes

    ● power, energy, platform (in)dependence, ...



● A simple one-to-one mapping:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

    MOV R1 0
    STA a R1

    MOV R1 1
    LDA R2 a
    ADD R2 R1
    STA b R2

    LDA R1 b
    LDA R2 c
    ADD R2 R1
    STA c R2

    MOV R1 2
    LDA R2 b
    MUL R2 R1
    STA a R2

Cost:

    Registers: 2
    Space: 42 bytes
    Time: 44 cycles

Can we do better?
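The space and time figures follow from the cost assumptions stated earlier (1 byte per opcode and per operand; 1 cycle for one operand, 2 cycles for two, 4 cycles when memory is involved). A small sketch that recomputes them is given below; it assumes that operands named R followed by digits are registers, that bare numbers are immediates, and that anything else is a memory location. The helper names and the constants `NAIVE` and `FINAL` are illustrative.

```python
import re


def is_register(operand):
    return re.fullmatch(r"R\d+", operand) is not None


def is_immediate(operand):
    return operand.lstrip("-").isdigit()


def cost(program):
    """Return (bytes, cycles) for a newline-separated instruction listing."""
    size = cycles = 0
    for line in program.strip().splitlines():
        opcode, *operands = line.split()
        size += 1 + len(operands)              # opcode byte + one per operand
        touches_memory = any(
            not is_register(o) and not is_immediate(o) for o in operands
        )
        if touches_memory:
            cycles += 4
        elif len(operands) == 2:
            cycles += 2
        else:
            cycles += 1
    return size, cycles


NAIVE = """
MOV R1 0
STA a R1
MOV R1 1
LDA R2 a
ADD R2 R1
STA b R2
LDA R1 b
LDA R2 c
ADD R2 R1
STA c R2
MOV R1 2
LDA R2 b
MUL R2 R1
STA a R2
"""

# The shortest version reached at the end of this lecture.
FINAL = """
STI a 2
LDA R1 c
INC R1
STA c R1
"""

print(cost(NAIVE), cost(FINAL))
```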



● Better register usage:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

    MOV R1 0
    STA a R1

    MOV R2 1
    ADD R1 R2
    STA b R1

    LDA R2 c
    ADD R2 R1
    STA c R2

    MOV R2 2
    MUL R1 R2
    STA a R1

Can we do better?



● Remove redundant store to a:


    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

Before:

    MOV R1 0
    STA a R1

After:

    MOV R1 0

Can we do better?



● Select specialized instructions:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

    CLR R1
    INC R1
    STA b R1
    LDA R2 c
    ADD R2 R1
    STA c R2
    SHL R1
    STA a R1

Can we do better?



● Propagate constant 0:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

Before:

    CLR R1
    INC R1
    STA b R1
    LDA R2 c
    ADD R2 R1
    STA c R2
    SHL R1
    STA a R1

After:

    MOV R1 1
    STA b R1
    LDA R2 c
    ADD R2 R1
    STA c R2
    SHL R1
    STA a R1

Can we do better?



● Assuming b is not used in the future, propagate constant 1:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

Before:

    MOV R1 1
    STA b R1
    LDA R2 c
    ADD R2 R1
    STA c R2
    SHL R1
    STA a R1

After:

    LDA R1 c
    INC R1
    STA c R1
    MOV R1 2
    STA a R1


Can we do better?



● Assuming the availability of a store-immediate instruction:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

Before:

    LDA R1 c
    INC R1
    STA c R1
    MOV R1 2
    STA a R1

After:

    LDA R1 c
    INC R1
    STA c R1
    STI a 2

● Assuming no other special instruction and no new knowledge about past or future instructions:

– 2 PCs to someone who can improve this even further!

– Hint (-1): Think about the execution at architectural level.




● Reordering stores to optimize memory access:

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2

Before:

    LDA R1 c
    INC R1
    STA c R1
    STI a 2

After:

    STI a 2
    LDA R1 c
    INC R1
    STA c R1

● Food for thought:

– Impact of multithreaded programs on possible/allowed reorderings.


Where are we headed next week?

Unoptimized code (begins with MOV R1 0; STA a R1):

    Cost:
        Registers: 2
        Space: 42 bytes
        Time: 44 cycles

-> Code Optimization ->

    STI a 2
    LDA R1 c
    INC R1
    STA c R1

    Cost:
        Registers: 1
        Space: 11 bytes
        Time: 13 cycles
        + Reordered

A3 is on!
