CS502: Compiler Design
Code Generation
Manas Thakur
Fall 2020
So near yet so far!
Character stream
  -> Lexical Analyzer
Token stream
  -> Syntax Analyzer
Syntax tree
  -> Semantic Analyzer
Syntax tree
  -> Intermediate Code Generator
Intermediate representation
  -> Machine-Independent Code Optimizer
Intermediate representation
  -> Code Generator
Target machine code
  -> Machine-Dependent Code Optimizer
Target machine code
[Side labels in the figure: Symbol Table; Front end; Back end]
Recall from Lecture 2
t1 = id3 * 32.0
id1 = id2 + t1
  -> Code Generator ->
LD R1, id3
MUL R1, R1, #32.0
LD R2, id2
ADD R1, R1, R2
ST id1, R1
Roles of Code Generator
● Convert IR to target program.
– Bring it down!
● Using the primitives available on the target machine.
– Usually a form of assembly.
● Requirement: Preserve the semantics of the source program.
– In terms of the observable behaviour.
● Expectation: Target code is of high quality.
– Execution time, space, energy, and so on.
● Expectation2: Code generator itself should be efficient.
– Is it so hard an expectation for an IR generator?
Code Generation Demo
Code generation in reality
● The problem of generating an optimal target program is undecidable.
– Recall we had said most problems in the front-end are simple, and most in the back-end are complex?
● Several subproblems are NP-hard or NP-complete.
● Need to depend upon:
– Approximation algorithms
– Heuristics
– Conservative estimates
Essential tasks during code generation
● Instruction selection
– Map low-level IR to actual machine (or assembly) instructions
– Not necessarily 1-1 mapping
– Several things vary with the architecture:
  ● Instruction set
  ● Addressing modes
● Register allocation
– Low-level IR assumes unlimited registers
– Map to actual resources provided by the hardware
– Goal: Make the best use of registers
Also, Instruction Scheduling
Where are registers?
Moving away from the CPU, each level is farther away, larger, and slower:

Level            Pentium 4 3.2 GHz    Core 2 Duo           Athlon 64
Registers        1 cycle              1 cycle              1 cycle
Level-1 cache    2 cycles (16 KB)     3 cycles (64 KB)     3 cycles (128 KB)
Level-2 cache    19 cycles (2 MB)     14 cycles (2 MB)     13 cycles (1 MB)
Main memory      204 cycles           180 cycles           125 cycles
Virtual memory   millions of cycles   millions of cycles   millions of cycles
Architecture and registers
● RISC
– Many registers, 3AC, simple addressing modes
– Examples: IBM PowerPC, Oracle SPARC, ARM (mobiles, tablets)
● CISC
– Few registers, 2AC, Variety of addressing modes, several register classes, variable-length instructions, instructions with side-effects
– Examples: Intel x86, AMD Athlon
● Stack machine
– Push/Pop, stack-top uses registers
– Example: JVM
Register allocation
● Involves
– Allocation: which variables to be put into registers
– Assignment: which register to use for a variable
● Finding an optimal assignment of registers to variables is an NP-complete problem.
● Architectural conventions complicate matters:
– Combination of registers used for double-precision arithmetic
– Result must be stored into accumulator
– Some registers reserved for special purposes
Live ranges
● A variable is live if its current value may be used in future.
– Two variables that are live at the same time cannot use the same register.
– They interfere with each other.
● Conversely, if two variables do not interfere, then they can use the same register.
● Need to determine the durations in which variables are live.
[S1] a = 0
L1: [S2] b = a + 1
[S3] c = c + b
[S4] a = b * 2
[S5] if a < N goto L1
[S6] return c
a: {S1-S2, S4-S5}
b: {S2-S4}
c: {S3-S6}
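One way to work out where variables are live is a backward dataflow pass over the statements. The following is a minimal Python sketch, not the lecture's own method: the `PROGRAM` table is a hand encoding of statements S1 to S6, the function name `liveness` is made up for illustration, and N is assumed to be a constant that needs no tracking.

```python
# Each entry: statement label -> (defined vars, used vars, successor labels).
# Assumption: N is a constant, so it is not tracked as a variable.
PROGRAM = {
    "S1": ({"a"}, set(), ["S2"]),
    "S2": ({"b"}, {"a"}, ["S3"]),
    "S3": ({"c"}, {"c", "b"}, ["S4"]),
    "S4": ({"a"}, {"b"}, ["S5"]),
    "S5": (set(), {"a"}, ["S2", "S6"]),   # branch back to L1 or fall through
    "S6": (set(), {"c"}, []),
}


def liveness(program):
    """Iterate in = use | (out - def), out = union of successors' in, to a fixpoint."""
    live_in = {s: set() for s in program}
    live_out = {s: set() for s in program}
    changed = True
    while changed:
        changed = False
        for stmt in reversed(list(program)):
            defs, uses, succs = program[stmt]
            out = set().union(*(live_in[s] for s in succs))
            inn = uses | (out - defs)
            if out != live_out[stmt] or inn != live_in[stmt]:
                live_out[stmt], live_in[stmt] = out, inn
                changed = True
    return live_in, live_out


live_in, live_out = liveness(PROGRAM)
for stmt in PROGRAM:
    print(stmt, "live-out:", sorted(live_out[stmt]))
```

For this encoding, the pass reports b live out of S2 and S3, and a live out of S1, S4, and S5 (the jump back to L1 keeps a alive across the loop).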
Interference graphs
● Represent program variables/temporaries as nodes.
● If the live ranges of variables u and v overlap, then draw an edge between u and v.
● An edge (u,v) indicates that variables u and v interfere, and hence cannot be mapped to the same register.
● Tomorrow: Can we color graphs to perform register allocation?
[S1] a = 0
L1: [S2] b = a + 1
[S3] c = c + b
[S4] a = b * 2
[S5] if a < N goto L1
[S6] return c
[Figure: interference graph with nodes a, b, c]
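As a rough illustration of how such a graph can be built, the sketch below adds an edge between each defined variable and every other variable that is live right after the definition. The live-out sets in `DEFS_AND_LIVE_OUT` are an assumption, worked out by hand for the running example, and `build_interference` is a hypothetical helper name.

```python
# (variable defined, variables live immediately after the statement).
# Assumption: these live-out sets were derived by hand for statements S1..S4.
DEFS_AND_LIVE_OUT = [
    ("a", {"a", "c"}),  # S1: a = 0
    ("b", {"b", "c"}),  # S2: b = a + 1
    ("c", {"b", "c"}),  # S3: c = c + b
    ("a", {"a", "c"}),  # S4: a = b * 2
]


def build_interference(defs_and_live_out):
    """Return the set of undirected edges, each stored as a frozenset {u, v}."""
    edges = set()
    for defined, live in defs_and_live_out:
        for other in live - {defined}:
            edges.add(frozenset((defined, other)))
    return edges


edges = build_interference(DEFS_AND_LIVE_OUT)
print(sorted(sorted(e) for e in edges))
```

Under these assumptions a and b never end up joined by an edge, so nothing prevents them from sharing a register, while c interferes with both.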
Code Generation (Cont.)
Register allocation using graph coloring
● Idea:
– If we can color the interference graph using K colors, then we can allocate the variables to K registers.
– Two nodes that interfere with each other must use different colors.
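[Figure: interference graph over a, b, c; a and b are colored as Register R1, c as Register R2]
Before allocation:
a = 0
L1: b = a + 1
c = c + b
a = b * 2
if a < N goto L1
return c
After allocation:
R1 = 0
L1: R1 = R1 + 1
R2 = R2 + R1
R1 = R1 * 2
if R1 < N goto L1
return R2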
Manas Thakur CS502: Compiler Design 15
Graph coloring
● Key questions:
– Can we efficiently find a K-coloring of the graph?
  ● Bad news: Graph coloring is an NP-complete problem.
– Can we efficiently find the optimal coloring of the graph (i.e., using the least number of colors)?
  ● We don’t necessarily need the perfect coloring.
  ● Compute an approximation with heuristics.
– What do we do when there aren’t enough colors (registers) to color the graph?
  ● Temporarily move that variable to memory (slow, but what else!).
  ● Called spilling.
  ● Need to add instructions to “store” and (later) “load” the spilled variable.
Manas Thakur CS502: Compiler Design 16
Graph coloring: a simplistic approach
repeat
    repeat
        Remove a node n and all its edges from G, such that the degree of n is less than K
        Push n onto a stack
    until G has no node with degree less than K
    // G is now either empty or all its nodes have degree ≥K
    if G is not empty then
        Take a node m and all its edges out of G, and mark m for spilling
    endif
until G is empty
Take one node at a time from stack and assign a non-conflicting color
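The pseudocode above can be sketched in Python roughly as follows. This is an illustrative version under stated assumptions: the graph is a dict from node to neighbour set, colors are the integers 0..K-1, the spill candidate is simply the first remaining node, and the two nested loops are flattened into one. The names `color_simple` and `RING` are made up for the example.

```python
def color_simple(graph, k):
    """graph: dict node -> set of neighbours. Returns (colors, spilled)."""
    work = {n: set(adj) for n, adj in graph.items()}  # mutable working copy
    stack, spilled = [], []
    while work:
        low = [n for n in work if len(work[n]) < k]
        if low:
            node = low[0]
            stack.append(node)            # safe to color later
        else:
            node = next(iter(work))       # arbitrary spill candidate (assumption)
            spilled.append(node)
        for nb in work.pop(node):         # remove the node and its edges
            work[nb].discard(node)
    colors = {}
    while stack:
        node = stack.pop()
        used = {colors[nb] for nb in graph[node] if nb in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors, spilled


# The lecture's graph (edges a-c and b-c are assumed from the final allocation).
LECTURE = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
# A hypothetical 4-cycle: every node has degree 2.
RING = {"w": {"x", "z"}, "x": {"w", "y"}, "y": {"x", "z"}, "z": {"y", "w"}}

print(color_simple(LECTURE, 2))
print(color_simple(RING, 2))
```

With K = 2 the lecture's graph colors without spilling, but the 4-cycle gets a node marked for spilling even though a cycle of four nodes is 2-colorable. That is the weakness of marking spills eagerly.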
Manas Thakur CS502: Compiler Design 17
Need for spill
[Figure: interference graph over nodes a, b, c, d, e]
Is this graph 2-colorable?
[Figure: a second interference graph over nodes a, b, c, d, e]
What about this one?
Manas Thakur CS502: Compiler Design 18
Kempe’s heuristic (1879!) to reduce spilling
repeat
    repeat
        Remove a node n and all its edges from G, such that the degree of n is less than K
        Push n onto a stack
    until G has no node with degree less than K
    // G is now either empty or all its nodes have degree ≥K
    if G is not empty then
        Take one node m out of G
        Push m onto the stack
    endif
until G is empty
Take one node at a time from stack and assign a non-conflicting color if possible, else spill
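A small Python sketch of this optimistic variant is given below, under the same illustrative assumptions as before (dict-of-sets graph, integer colors, first remaining node chosen when no low-degree node exists). `color_optimistic`, `RING`, and `TRIANGLE` are hypothetical names for the example.

```python
def color_optimistic(graph, k):
    """Push every node, even high-degree ones; decide on spills only when coloring."""
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while work:
        low = [n for n in work if len(work[n]) < k]
        node = low[0] if low else next(iter(work))  # push even if degree >= k
        stack.append(node)
        for nb in work.pop(node):
            work[nb].discard(node)
    colors, spilled = {}, []
    while stack:
        node = stack.pop()
        used = {colors[nb] for nb in graph[node] if nb in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[node] = free[0]
        else:
            spilled.append(node)      # no color left: this node really must spill
    return colors, spilled


RING = {"w": {"x", "z"}, "x": {"w", "y"}, "y": {"x", "z"}, "z": {"y", "w"}}
TRIANGLE = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

print(color_optimistic(RING, 2))      # both neighbours of w share a color, so no spill
print(color_optimistic(TRIANGLE, 2))  # three mutually interfering nodes: one spill
```

The 4-cycle shows the benefit: a node of degree K can still be colored when some of its neighbours end up sharing a color. The triangle shows the case where spilling cannot be avoided with two colors.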
Manas Thakur CS502: Compiler Design 19
NFS Revisited!
● No need to spill now.
● Don’t have a choice; need to spill.
[Figure: two interference graphs over nodes a, b, c, d, e, one for each bullet above]
Manas Thakur CS502: Compiler Design 20
Coalescing
● If there is a copy statement x = y such that there is no conflict between x and y:
– We can use the same register for both x and y
– Merge the graph nodes for x and y into one node
● Good because:
– Reduces the number of registers and removes move instructions
● Bad because:
– Increases the number of neighbours of the merged node, which may cause more spilling
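The merge step can be sketched on a dict-of-sets interference graph as follows. This is only an illustration: `coalesce` and the four-node `GRAPH` are hypothetical, and the sketch assumes the graph is symmetric with no self-loops.

```python
def coalesce(graph, x, y):
    """Merge node y into node x and return a new graph; refuse if they interfere."""
    if y in graph[x]:
        raise ValueError(f"{x} and {y} interfere; cannot coalesce")
    merged = {n: set(adj) for n, adj in graph.items() if n != y}
    merged[x] |= graph[y]              # x inherits y's neighbours
    for nb in graph[y]:
        merged[nb].discard(y)          # neighbours now point at x instead of y
        merged[nb].add(x)
    return merged


# Hypothetical graph for a copy x = y: x interferes with p, y interferes with q.
GRAPH = {"x": {"p"}, "y": {"q"}, "p": {"x"}, "q": {"y"}}
merged = coalesce(GRAPH, "x", "y")
print(merged)
```

Before the merge x and y each have one neighbour; afterwards the combined node has two, which is the increase in degree that can lead to more spilling.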
Manas Thakur CS502: Compiler Design 21
Register Allocation (Cont.)
● Register allocation is expensive
– Many algorithms use heuristics for graph coloring
– Still it may take time quadratic in the number of live ranges
● Online/JIT compilers need to generate code quickly
– Sacrifice efficient register allocation for compilation speed
● Linear scan register allocation
– Massimiliano Poletto and Vivek Sarkar (ACM TOPLAS 1999)
– Idea: Make one pass over the list of variables
– Spill variables with longest lifetimes – those that would tie up a register for the longest time
Manas Thakur CS502: Compiler Design 22
LSRA (Cont.)
● Compute live intervals
– A live interval for a variable is a range [i,j], such that
  ● The variable is not live before instruction i
  ● The variable is not live after instruction j
– Overlapping intervals imply interference
● Given R registers and N overlapping intervals
– R intervals allocated to registers
– N-R intervals spilled
● Key: Choosing the right intervals to spill
Manas Thakur CS502: Compiler Design 23
Linear scan algorithm● Sort live intervals
– In order of increasing start points
– Quickly find the next live interval in order
● Maintain a sorted list of active intervals
– In order of increasing end points
– Quickly find expired intervals
● At each step, update active as follows:
– Add the next interval from the sorted list
– Remove any expired intervals (those whose end points are earlier than the start point of the new interval)
Manas Thakur CS502: Compiler Design 24
Linear scan algorithm (Cont.)
● Restriction:
– Never allow active to have more than R elements
● Spill scenario:
– active has R elements; new interval doesn’t cause any existing intervals to expire
● Heuristic:
– Spill the interval that ends last (furthest from current position)
  ● Greedy algorithm
LSRA example (2 registers)
● Step 1: active = {a}
● Step 2: active = {a, b}
● Step 3: active = {a, b, c} ==> spill c ==> active = {a, b}
● Step 4: a and b expire; active = {d}
● Step 5: active = {e, d}
[Figure: live intervals of variables a, b, c, d, e plotted against steps 1 to 5]
Two registers, one spill
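The linear scan steps can be sketched in Python as below. This is a simplified illustration, not the published algorithm verbatim: the interval end points in `EXAMPLE` are invented so that they reproduce the five steps listed above (the original figure's exact numbers are not available here), and `linear_scan` and the register names are hypothetical.

```python
def linear_scan(intervals, num_regs):
    """intervals: dict name -> (start, end). Returns dict name -> register or 'spill'."""
    def end(v):
        return intervals[v][1]

    free = [f"R{i + 1}" for i in range(num_regs)]
    active = []   # variables currently in registers, sorted by increasing end point
    where = {}
    for name in sorted(intervals, key=lambda v: intervals[v][0]):
        start = intervals[name][0]
        # Expire intervals that ended before the new interval starts.
        while active and end(active[0]) < start:
            free.append(where[active.pop(0)])
        if len(active) == num_regs:
            victim = active[-1]                  # active interval that ends last
            if end(victim) > end(name):
                where[name] = where[victim]      # take over the victim's register
                where[victim] = "spill"
                active[-1] = name
            else:
                where[name] = "spill"            # the new interval itself ends last
                continue
        else:
            where[name] = free.pop(0)
            active.append(name)
        active.sort(key=end)
    return where


# Assumed end points, chosen to match the steps of the example slide.
EXAMPLE = {"a": (1, 4), "b": (2, 5), "c": (3, 9), "d": (6, 10), "e": (7, 8)}
print(linear_scan(EXAMPLE, 2))
```

With these assumed intervals and two registers, c is the only spilled variable, and d and e reuse the registers freed when a and b expire.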
Manas Thakur CS502: Compiler Design 26
LSRA: A few comments
● Significantly faster RA than graph coloring
● Efficacy of RA not as good:
– Holes in live ranges not taken into account
  ● Graph coloring can take care by maintaining different graphs at different points
– A variable once spilled remains spilled forever
  ● Improvements exist; e.g., Traub et al.’s second-chance binpacking
● The choice in most fast JIT compilers
● Next class: Instruction selection
Code Generation (Cont.)
Instruction Selection
● Low-level IR different from machine ISA
– Why?
– Allow different back ends
● Differences between IR and ISA?
– IR: simple, uniform set of operations
– ISA: many specialized instructions
● Often a single assembly instruction does the work of several operations in the IR
– May vary between CISC and RISC architectures
Instruction selection: Easy solution
● Map each IR operation to a single instruction
● May need to include memory operations:
x = y + z;
mov y, r1
mov z, r2
add r2, r1
mov r1, x
● Problem: Inefficient use of ISA.
● Example: What if the ISA contains an add instruction with one or more memory operands in the addressing mode?
Wacky x86 Idioms
● What does this do?
xor %eax, %eax
● Why not use this?
mov $0, %eax
● Answer: Immediate operands are encoded in the instruction, making it bigger and therefore costly to fetch and execute.
More wacky x86 Idioms
● What does this do?
xor %ebx, %eax
xor %eax, %ebx
xor %ebx, %eax
● Swap the values of %eax and %ebx.
● Why do it this way?
– No need for extra register!
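To see why three XORs swap two values, the sequence can be replayed on plain integers. The sketch below is only a model of the idiom in Python; the function name `xor_swap` is made up, and each line mirrors one instruction in AT&T operand order (source first, destination second).

```python
def xor_swap(eax, ebx):
    """Replay the three-instruction idiom on two integers."""
    eax ^= ebx   # xor %ebx, %eax : eax now holds old_eax ^ old_ebx
    ebx ^= eax   # xor %eax, %ebx : ebx now holds old_eax
    eax ^= ebx   # xor %ebx, %eax : eax now holds old_ebx
    return eax, ebx


print(xor_swap(5, 9))
```

The second step works because x ^ (y ^ x) equals y, so each register recovers the other's original value without a temporary.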
Architectural differences in the ISA
● RISC:
– Arithmetic operations require registers
– Explicit loads and stores
ld 8(r0), r1
add r2, r1
● CISC:
– Complex instructions
– Arithmetic operations may refer to memory
– Often only one memory operand per instruction
add 8(%esp), %eax
Selecting instructions with minimal cost
● Difficulties:
– How to find which patterns are the best?
– The translation may not be one-to-one
– The translation may not be even linear
● Idea: Back to tree representation
– Convert computation into a tree
– Match parts of a tree
Instruction Selection by Tree Rewriting
● Form:
replacement <- template {action}
– replacement: a single node
– template: a tree
– action: a code fragment
● Hey, this looks like a syntax-directed translation scheme!
● A set of tree-rewriting rules is called a tree-translation scheme.
● Example:
Ri <- +(Ri, Rj) { ADD Ri, Ri, Rj }
Sample Tree-Rewriting Rules
[Figure: tree-rewriting rules whose templates have = or + at the root; the tree diagrams are not recoverable from the transcript]
Sample Tree-Rewriting Rules (Cont.)
[Figure: further tree-rewriting rules whose templates have + at the root; the tree diagrams are not recoverable from the transcript]
Code Generation by Tree Tiling
● Consider the following tree for a[i] = b + 1:
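[Figure: tree with = at the root; in prefix form, as given later in the lecture: = ind + + Ca RSP ind + Ci RSP + Mb C1]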
Let’s generate code for this.
Next class
● How do we actually perform tree matching?
● What do we do if more than one template matches at a time?
● ... some more relevant stuff ...
● And finally:
Code Generation (Cont.)
Tree Matching by Parsing!
● Idea:
– Represent the input tree using its prefix representation
– Represent tree templates as RHS of grammar rules
– Let pattern matching be done by a parser
– Convert tree translation to syntax-directed translation
[Figure: the tree for a[i] = b + 1; its prefix representation is:]
= ind + + Ca RSP ind + Ci RSP + Mb C1
Manas Thakur CS502: Compiler Design 41
SDT for Tree Translation
Ri -> Ca { LD Ri, #a }
Ri -> Mx { LD Ri, x }
M -> = Mx Ri { ST x, Ri }
M -> = ind Ri Rj { ST *Ri, Rj }
Ri -> ind + Ca Rj { LD Ri, a(Rj) }
Ri -> + Ri ind + Ca Rj { ADD Ri, Ri, a(Rj) }
Ri -> + Ri Rj { ADD Ri, Ri, Rj }
Ri -> + Ri C1 { INC Ri }
R -> sp
M -> m
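To make the rules above concrete, here is a small Python sketch that applies them to the tree for a[i] = b + 1. It is not the parser-based matching the slides describe: it is a hand-written recursive matcher that tries the larger templates first, and the tuple encoding of trees, the `generate` function, and the register numbering are all assumptions made for illustration.

```python
from itertools import count


def is_indexed(n):
    """True for ind(+(C a, X)), i.e. a memory operand of the form a(Rj)."""
    return n[0] == "ind" and n[1][0] == "+" and n[1][1][0] == "C"


def generate(stmt):
    code, fresh = [], count()

    def reg(n):
        """Emit code that brings subtree n into a register; return its name."""
        op = n[0]
        if op == "R":                        # already a register, e.g. SP
            return n[1]
        if op == "C":                        # Ri <- Ca
            r = f"R{next(fresh)}"
            code.append(f"LD {r}, #{n[1]}")
            return r
        if op == "M":                        # Ri <- Mx
            r = f"R{next(fresh)}"
            code.append(f"LD {r}, {n[1]}")
            return r
        if op == "ind" and is_indexed(n):    # Ri <- ind + Ca Rj
            base = reg(n[1][2])
            r = f"R{next(fresh)}"
            code.append(f"LD {r}, {n[1][1][1]}({base})")
            return r
        if op == "+":
            left, right = n[1], n[2]
            r = reg(left)   # assumes the left operand is not a dedicated register
            if right == ("C", 1):            # Ri <- + Ri C1
                code.append(f"INC {r}")
            elif is_indexed(right):          # Ri <- + Ri ind + Ca Rj
                base = reg(right[1][2])
                code.append(f"ADD {r}, {r}, {right[1][1][1]}({base})")
            else:                            # Ri <- + Ri Rj
                other = reg(right)
                code.append(f"ADD {r}, {r}, {other}")
            return r
        raise ValueError(f"no rule matches {n!r}")

    _, target, value = stmt
    if target[0] == "M":                     # M <- = Mx Ri
        src = reg(value)
        code.append(f"ST {target[1]}, {src}")
    elif target[0] == "ind":                 # M <- = ind Ri Rj
        addr = reg(target[1])
        src = reg(value)
        code.append(f"ST *{addr}, {src}")
    else:
        raise ValueError("unsupported assignment target")
    return code


# = ind + + Ca RSP ind + Ci RSP + Mb C1
TREE = ("=",
        ("ind", ("+", ("+", ("C", "a"), ("R", "SP")),
                      ("ind", ("+", ("C", "i"), ("R", "SP"))))),
        ("+", ("M", "b"), ("C", 1)))

print("\n".join(generate(TREE)))
```

Checking the specialised templates (INC, indexed ADD) before the general ADD is what lets one instruction cover several tree nodes. The sketch makes that choice greedily and does not compare costs.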
Manas Thakur CS502: Compiler Design 42
SDT for TT: A Few Remarks
● Benefits:
– Efficient and well understood
– Easy to target a new machine
● Challenges:
– Fixed left-to-right evaluation order
– The machine description grammar and parser may become large
– Special care is needed to ensure we don’t get stuck in an infinite code-generation loop (rules with a single symbol on the RHS)
Good news: Code-generator generators exist!
Manas Thakur CS502: Compiler Design 43
Including Cost
● Algorithm:
– For each node, find minimum cost tiling for that node and the subtrees below.
● Key:
– Once we have minimum-cost tilings for the subtrees below node n, we can find the minimum-cost tiling for n by trying out all possible tiles matching n.
● Use dynamic programming!
Manas Thakur CS502: Compiler Design 44
Dynamic Programming
● Idea:
– For problems with optimal substructure
– Compute optimal solutions to sub-problems
– Combine into an overall optimal solution
● How does this help?
– Use memoization:
  ● Save previously computed solutions to sub-problems
– Sub-problems recur many times
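A minimal Python sketch of minimum-cost tiling with memoization follows. It is an illustration under assumptions that are not in the lecture: trees are nested tuples, every tile costs one instruction, only a handful of tiles (load, ADD, INC, indexed load, indexed ADD) are modelled, and `min_cost` is a hypothetical name.

```python
from functools import lru_cache


def is_indexed(n):
    """True for ind(+(C a, X)), i.e. a memory operand of the form a(Rj)."""
    return n[0] == "ind" and n[1][0] == "+" and n[1][1][0] == "C"


@lru_cache(maxsize=None)          # memoization: each distinct subtree is solved once
def min_cost(node):
    """Fewest instructions needed to bring `node` into a register (unit tile cost)."""
    op = node[0]
    if op == "R":                 # value already sits in a register
        return 0
    if op in ("C", "M"):          # one load
        return 1
    candidates = []
    if op == "+":
        left, right = node[1], node[2]
        candidates.append(1 + min_cost(left) + min_cost(right))         # ADD Ri, Ri, Rj
        if right == ("C", 1):
            candidates.append(1 + min_cost(left))                       # INC Ri
        if is_indexed(right):
            candidates.append(1 + min_cost(left) + min_cost(right[1][2]))  # ADD Ri, Ri, a(Rj)
    elif op == "ind" and is_indexed(node):
        candidates.append(1 + min_cost(node[1][2]))                     # LD Ri, a(Rj)
    if not candidates:
        raise ValueError(f"no tile matches {node!r}")
    return min(candidates)


VALUE = ("+", ("M", "b"), ("C", 1))
ADDRESS = ("+", ("+", ("C", "a"), ("R", "SP")),
                ("ind", ("+", ("C", "i"), ("R", "SP"))))

print(min_cost(VALUE), min_cost(ADDRESS))
```

For b + 1 the INC tile gives cost 2, against 3 with the plain ADD. For the address expression the indexed ADD gives 3, against 4 when the indexed operand is loaded separately.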
Manas Thakur CS502: Compiler Design 45
Modern Processors
● Execution time not the sum of tile times
● Instruction order matters:
– Pipelining: parts of different instructions overlap
– Bad ordering stalls the pipeline; e.g. many operations of one type
– Superscalar: some operations executed in parallel
● Cost is an approximation
● Instruction scheduling helps
Manas Thakur CS502: Compiler Design 46
The Beginning of (Even) More Interesting Things
Manas Thakur CS502: Compiler Design 47
Code Generation & Optimization: An Intro
● Goal: Generate optimized code
● Metrics:
– Code size
  ● Memory requirements
  ● Say opcodes take 1 byte, each operand another byte
– Number of registers
  ● Along with constraints on their usage
– Estimated cost
  ● How fast is the code?
  ● Say instructions with single operand take 1 cycle, with two operands take 2 cycles, and those involving memory take 4 cycles
– Sometimes
  ● power, energy, platform (in)dependence, ...
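The size and time rules above are easy to turn into a small calculator. The Python sketch below is illustrative: it assumes LDA, STA, and STI are the instructions that involve memory, takes instructions as whitespace-separated strings, and uses the hypothetical names `cost`, `NAIVE`, and `TUNED`. `NAIVE` is the straightforward translation of a = 0; b = a + 1; c = c + b; a = b * 2 used in this lecture.

```python
MEMORY_OPS = {"LDA", "STA", "STI"}   # assumption: the instructions that touch memory


def cost(program):
    """program: list of strings such as 'MOV R1 0'. Returns (bytes, cycles)."""
    size = cycles = 0
    for line in program:
        opcode, *operands = line.split()
        size += 1 + len(operands)            # 1 byte opcode + 1 byte per operand
        # Memory instructions take 4 cycles; others take 1 cycle per operand.
        cycles += 4 if opcode in MEMORY_OPS else len(operands)
    return size, cycles


NAIVE = [
    "MOV R1 0", "STA a R1",
    "MOV R1 1", "LDA R2 a", "ADD R2 R1", "STA b R2",
    "LDA R1 b", "LDA R2 c", "ADD R2 R1", "STA c R2",
    "MOV R1 2", "LDA R2 b", "MUL R2 R1", "STA a R2",
]
TUNED = ["LDA R1 c", "INC R1", "STA c R1", "STI a 2"]

print(cost(NAIVE))   # (42, 44)
print(cost(TUNED))   # (11, 13)
```

The two results match the space and time figures quoted in the lecture for the unoptimized and the fully optimized sequence.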
Manas Thakur CS502: Compiler Design 48
Code Generation & Optimization: An Intro
● A simple one-to-one mapping:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

MOV R1 0
STA a R1

MOV R1 1
LDA R2 a
ADD R2 R1
STA b R2

LDA R1 b
LDA R2 c
ADD R2 R1
STA c R2

MOV R1 2
LDA R2 b
MUL R2 R1
STA a R2

Cost: Registers: 2, Space: 42 bytes, Time: 44 cycles
Can we do better?
Manas Thakur CS502: Compiler Design 49
Code Generation & Optimization: An Intro
● Better register usage:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
MOV R1 0
STA a R1
MOV R1 1
LDA R2 a
ADD R2 R1
STA b R2
LDA R1 b
LDA R2 c
ADD R2 R1
STA c R2
MOV R1 2
LDA R2 b
MUL R2 R1
STA a R2
Cost: Registers: 2, Space: 42 bytes, Time: 44 cycles

After:
MOV R1 0
STA a R1
MOV R2 1
ADD R1 R2
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
MOV R2 2
MUL R1 R2
STA a R1
Cost: Registers: 2, Space: 33 bytes, Time: 32 cycles
Can we do better?
Manas Thakur CS502: Compiler Design 50
Code Generation & Optimization: An Intro
● Remove redundant store to a:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
MOV R1 0
STA a R1
MOV R2 1
ADD R1 R2
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
MOV R2 2
MUL R1 R2
STA a R1
Cost: Registers: 2, Space: 33 bytes, Time: 32 cycles

After:
MOV R1 0
MOV R2 1
ADD R1 R2
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
MOV R2 2
MUL R1 R2
STA a R1
Cost: Registers: 2, Space: 30 bytes, Time: 28 cycles
Can we do better?
Manas Thakur CS502: Compiler Design 51
Code Generation & Optimization: An Intro
● Select specialized instructions:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
MOV R1 0
MOV R2 1
ADD R1 R2
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
MOV R2 2
MUL R1 R2
STA a R1
Cost: Registers: 2, Space: 30 bytes, Time: 28 cycles

After:
CLR R1
INC R1
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
SHL R1
STA a R1
Cost: Registers: 2, Space: 21 bytes, Time: 21 cycles
Can we do better?
Manas Thakur CS502: Compiler Design 52
Code Generation & Optimization: An Intro
● Propagate constant 0:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
CLR R1
INC R1
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
SHL R1
STA a R1
Cost: Registers: 2, Space: 21 bytes, Time: 21 cycles

After:
MOV R1 1
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
SHL R1
STA a R1
Cost: Registers: 2, Space: 20 bytes, Time: 21 cycles
Can we do better?
Manas Thakur CS502: Compiler Design 53
Code Generation & Optimization: An Intro
● Assuming b is not used in future, propagate constant 1:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
MOV R1 1
STA b R1
LDA R2 c
ADD R2 R1
STA c R2
SHL R1
STA a R1
Cost: Registers: 2, Space: 20 bytes, Time: 21 cycles

After:
LDA R1 c
INC R1
STA c R1
MOV R1 2
STA a R1
Cost: Registers: 1, Space: 14 bytes, Time: 15 cycles
Can we do better?
Manas Thakur CS502: Compiler Design 54
Code Generation & Optimization: An Intro
● Assuming the availability of a store-immediate instruction:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
LDA R1 c
INC R1
STA c R1
MOV R1 2
STA a R1
Cost: Registers: 1, Space: 14 bytes, Time: 15 cycles

After:
LDA R1 c
INC R1
STA c R1
STI a 2
Cost: Registers: 1, Space: 11 bytes, Time: 13 cycles

● Assuming no other special instruction and no new knowledge about past or future instructions:
2 PCs to someone who can improve this even further!
Hint (-1): Think about the execution at architectural level.
Manas Thakur CS502: Compiler Design 55
Code Generation & Optimization: An Intro
● Reordering stores to optimize memory access:
1: a = 0
2: b = a + 1
3: c = c + b
4: a = b * 2

Before:
LDA R1 c
INC R1
STA c R1
STI a 2

After:
STI a 2
LDA R1 c
INC R1
STA c R1

● Food for thought:
– Impact of multithreaded programs on possible/allowed reorderings.
Manas Thakur CS502: Compiler Design 56
Where are we headed next week?
Before:
MOV R1 0
STA a R1
MOV R1 1
LDA R2 a
ADD R2 R1
STA b R2
LDA R1 b
LDA R2 c
ADD R2 R1
STA c R2
MOV R1 2
LDA R2 b
MUL R2 R1
STA a R2
Cost: Registers: 2, Space: 42 bytes, Time: 44 cycles

  -> Code Optimization ->

After:
STI a 2
LDA R1 c
INC R1
STA c R1
Cost: Registers: 1, Space: 11 bytes, Time: 13 cycles, plus the reordering

A3 is on!