School of Computer Science
Instruction Selection
15-745 Optimizing Compilers, Spring 2007
Compiler Backend

[Figure: the backend pipeline. IR feeds the instruction selector, which emits Assem; the register allocator (producing a TempMap) and the instruction scheduler then each rewrite the Assem.]
Intermediate Representations
Front end - produces an intermediate representation (IR)
Middle end - transforms the IR into an equivalent IR that runs more efficiently
Back end - transforms the IR into native code

IR encodes the compiler's knowledge of the program
Middle end usually consists of several passes
[Figure: Source Code → Front End → IR → Middle End → IR → Back End → Target Code]
Intermediate Representations
Decisions in IR design affect the speed and efficiency of the compiler
Some important IR properties
– Ease of generation
– Ease of manipulation
– Procedure size
– Freedom of expression
– Level of abstraction
The importance of different properties varies between compilers
– Selecting an appropriate IR for a compiler is critical
Types of Intermediate Representations
Structural
– Graphically oriented
– Heavily used in source-to-source translators
– Tend to be large
– Examples: trees, DAGs
Linear
– Pseudo-code for an abstract machine
– Level of abstraction varies
– Simple, compact data structures
– Easier to rearrange
– Examples: 3 address code, stack machine code
Hybrid
– Combination of graphs and linear code
– Example: control-flow graph
Level of Abstraction
The level of detail exposed in an IR influences the profitability and feasibility of different optimizations.
Two different representations of an array reference A[i,j]:

High-level AST (good for memory disambiguation):
subscript(A, i, j)

Low-level linear code (good for address calculation):
loadI 1       => r1
sub   rj, r1  => r2
loadI 10      => r3
mult  r2, r3  => r4
sub   ri, r1  => r5
add   r4, r5  => r6
loadI @A      => r7
add   r7, r6  => r8
load  r8      => rAij
Level of Abstraction
Structural IRs are usually considered high-level
Linear IRs are usually considered low-level
Not necessarily true:
– Low-level AST: [Figure: a tree that spells out the full address arithmetic, load(@A + ((j - 1) * 10 + (i - 1)))]
– High-level linear code: loadArray A,i,j
Abstract Syntax Tree (Structural IR)
An abstract syntax tree is the procedure's parse tree with the nodes for most non-terminals removed
Example: x - 2 * y becomes the tree -(x, *(2, y))
When is an AST a good IR?
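As a concrete illustration (a minimal Python sketch; the node classes are my own, not from the lecture), the AST for x - 2 * y can be built from a few small node types:

    # Minimal AST sketch for the expression x - 2 * y (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Num:
        value: int

    @dataclass
    class Var:
        name: str

    @dataclass
    class BinOp:
        op: str            # '-', '*', ...
        left: object
        right: object

    # x - 2 * y  ==>  BinOp('-', Var('x'), BinOp('*', Num(2), Var('y')))
    ast = BinOp('-', Var('x'), BinOp('*', Num(2), Var('y')))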
Directed Acyclic Graph (Structural IR)
A directed acyclic graph (DAG) is an AST with a unique node for each value
– Makes sharing explicit
– Encodes redundancy
Example:
z ← x - 2 * y
w ← x / 2
[Figure: one DAG for both assignments, in which the nodes for x and for the constant 2 are shared.]
Same expression twice means that the compiler might arrange to evaluate it just once!
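One common way to get the "unique node for each value" property is hash-consing: keep a table keyed by (operator, operands) and reuse an existing node whenever the same key reappears. A minimal Python sketch (the node/table names are assumptions, not from the lecture):

    # Build a DAG by reusing a node whenever the same (op, operands) pair recurs.
    _nodes = {}

    def node(op, *args):
        key = (op, args)                  # args are already-interned child nodes
        if key not in _nodes:
            _nodes[key] = ('node', op, args)
        return _nodes[key]

    x, y, two = node('var', 'x'), node('var', 'y'), node('const', 2)
    z = node('-', x, node('*', two, y))   # z <- x - 2 * y
    w = node('/', x, two)                 # w <- x / 2
    assert node('var', 'x') is x          # x (and the constant 2) have one shared node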
Pegasus IR (Structural IR)
Pegasus: Predicated Explicit GAted Simple Uniform SSA
Structural IR used in CASH (Compiler for Application-Specific Hardware)
Stack Machine Code (Linear IR)
Originally used for stack-based computers, now Java
Example: x - 2 * y becomes
push x
push 2
push y
multiply
subtract
Advantages
– Compact form
– Introduced names are implicit, not explicit (implicit names take up no space, where explicit ones do!)
– Simple to generate and execute code
Useful where code is transmitted over slow communication links (the net)
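A minimal sketch of how such code executes (Python; the opcode names mirror the slide, the interpreter itself is illustrative):

    # Tiny stack-machine interpreter for the sequence above.
    def run(code, env):
        stack = []
        for op, *arg in code:
            if op == 'push':
                v = arg[0]
                stack.append(env[v] if isinstance(v, str) else v)
            elif op == 'multiply':
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == 'subtract':
                b, a = stack.pop(), stack.pop()
                stack.append(a - b)
        return stack.pop()

    # x - 2 * y with x = 10, y = 3  ->  4
    print(run([('push', 'x'), ('push', 2), ('push', 'y'),
               ('multiply',), ('subtract',)], {'x': 10, 'y': 3}))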
Three Address Code (Linear IR)
Three address code has statements of the form:
x ← y op z
with 1 operator (op) and, at most, 3 names (x, y, and z)
Example: z ← x - 2 * y becomes
t ← 2 * y
z ← x - t
Advantages:
– Resembles many machines (RISC)
– Compact form
Two Address Code (Linear IR)
Allows statements of the form
x ← x op y
with 1 operator (op) and, at most, 2 names (x and y)
Example: z ← x - 2 * y becomes
t1 ← 2
t2 ← load y
t2 ← t2 * t1
z ← load x
z ← z - t2
– Can be very compact
– Destructive operations make reuse hard
– Good model for machines with destructive ops (x86)
Control-flow Graph (Hybrid IR)
Models the transfer of control in the procedure
Nodes in the graph are basic blocks: maximal-length sequences of straight-line code
– Straight-line code
– Either linear or structural representation
Edges in the graph represent control flow
Example: if (x = y) branches to either {a ← 2; b ← 5} or {a ← 3; b ← 4}, and both blocks flow into {c ← a * b}
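A minimal sketch of that example as a CFG data structure (Python; the block names and encoding are illustrative, not from the lecture):

    # Basic blocks hold straight-line code; the edge map carries control flow.
    blocks = {
        'entry': ['if (x = y) goto "then" else goto "else"'],
        'then':  ['a <- 2', 'b <- 5'],
        'else':  ['a <- 3', 'b <- 4'],
        'join':  ['c <- a * b'],
    }
    successors = {
        'entry': ['then', 'else'],
        'then':  ['join'],
        'else':  ['join'],
        'join':  [],
    }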
Using Multiple Representations
Repeatedly lower the level of the intermediate representation
– Each intermediate representation is suited towards certain optimizations
Example: the Open64 compiler
– WHIRL intermediate format
  • Consists of 5 different IRs that are progressively more detailed
– gcc
  • but not explicit about it :-(
[Figure: Source Code → Front End → IR 1 → Middle End → IR 2 → Middle End → IR 3 → Back End → Target Code]
Instruction Selection
[Figure: IR → instruction selector → Assem; the selector is driven by IR → Assem templates]
Instruction selection example
Suppose we have
MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))
We can generate the x86 code…
movl (,%s,c), %r
…if c = 1, 2, or 4; otherwise…
imull $c, %s, %r
movl (%r), %r
Selection dependencies
MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))
We can see that the selection of instructions can depend on the constants
The context of an IR expression can also affect the choice of instruction
Consider:
Example, cont'd
For
MEM(BINOP(TIMES, TEMP s, CONST c))
we might sometimes want to generate
testl $0, (,%s,c)
je    L
What context might cause us to do this?
Instruction selection as tree matching
In order to take context into account, instruction selectors often use pattern-matching on IR trees
– each pattern specifies what instructions to select
Sample tree-matching rules

IR pattern                         code                       cost
BINOP(PLUS,i,j)                    leal (i,j),r               1
BINOP(TIMES,i,j)                   movl j,r; imull i,r        2
BINOP(PLUS,i,CONST c)              leal c(i),r                1
MEM(BINOP(PLUS,i,CONST c))         movl c(i),r                1
MOVE(MEM(BINOP(PLUS,i,j)),k)       movl k,(i,j)               1
BINOP(TIMES,i,CONST c) *           leal (,i,c),r              1
BINOP(TIMES,i,CONST c)             movl c,r; imull i,r        2
MEM(i)                             movl (i),r                 1
MOVE(MEM(i),j)                     movl j,(i)                 1
MOVE(MEM(i),MEM(j))                movl (j),t; movl t,(i)     2
…
* only if c is 1, 2, or 4
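To make the later algorithms concrete, here is one way such rules might be written down as data (a minimal Python sketch; the tuple encoding, wildcard convention, and cost values are my own illustration, not the format of any real code-generator generator):

    # IR trees as nested tuples, e.g. ('MOVE', ('MEM', ('BINOP', 'PLUS', i, j)), k).
    # Each rule is (pattern, cost); '_' in a pattern matches any subtree.
    RULES = [
        (('TEMP', '_'),                                       0),  # already in a register
        (('CONST', '_'),                                      1),
        (('BINOP', 'PLUS', '_', '_'),                         1),  # leal (i,j),r
        (('BINOP', 'TIMES', '_', '_'),                        2),  # movl j,r; imull i,r
        (('BINOP', 'PLUS', '_', ('CONST', '_')),              1),  # leal c(i),r
        (('MEM', ('BINOP', 'PLUS', '_', ('CONST', '_'))),     1),  # movl c(i),r
        (('MOVE', ('MEM', ('BINOP', 'PLUS', '_', '_')), '_'), 1),  # movl k,(i,j)
        (('MEM', '_'),                                        1),  # movl (i),r
        (('MOVE', ('MEM', '_'), '_'),                         1),  # movl j,(i)
    ]

    def matches(pattern, tree):
        """Does this tile pattern match at the root of the given IR tree?"""
        if pattern == '_':
            return True
        if not isinstance(pattern, tuple):
            return pattern == tree           # a tag such as 'MEM' or 'PLUS'
        return (isinstance(tree, tuple) and len(pattern) == len(tree)
                and all(matches(p, t) for p, t in zip(pattern, tree)))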
Tiling an IR tree, v.1

a[x] = *y;    (assume a is a formal parameter passed on the stack)

[Figure: the IR tree MOVE(MEM(+(MEM(+(EBP, CONST a)), *(x, CONST 4))), MEM(y)), covered by small tiles that produce temporaries r1–r5.]

leal $a(%ebp),r1
movl (r1),r2
leal (,x,4),r3
leal (r2,r3),r4
movl (y),r5
movl r5,(r4)
Tiling an IR tree, v.2

a[x] = *y;

[Figure: the same IR tree covered by fewer, larger tiles, producing only r1–r3.]

movl $a(%ebp),r1
leal (,x,4),r2
movl (y),r3
movl r3,(r1,r2)
Tiling choices
In general, for any given tree, many tilings are possible
– each resulting in a different instruction sequence
We can ensure pattern coverage by covering, at a minimum, all atomic IR trees
The best tiling?
We want the "lowest cost" tiling
– usually, the shortest sequence
– but can also take into account cost/delay of each instruction
Optimum tiling
– lowest-cost tiling
Locally optimal tiling
– no two adjacent tiles can be combined into one tile of lower cost
Locally optimal tilings
Locally optimal tiling is easy
– A simple greedy algorithm works extremely well in practice:
– Maximal munch (sketched in code below)
  • start at root
  • use "biggest" match (in # of nodes)
  • use cost to break ties
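A minimal sketch of maximal munch (Python; it reuses the RULES list and matches function from the rules-as-data sketch above, and instead of emitting real x86 it simply reports which tile pattern was chosen at each step):

    def size(pattern):
        # How many IR nodes a pattern covers (each tuple in the pattern is one node).
        if not isinstance(pattern, tuple):
            return 0
        return 1 + sum(size(p) for p in pattern)

    def uncovered(pattern, tree):
        # The subtrees left exposed by the wildcards of the chosen tile.
        if pattern == '_':
            yield tree
        elif isinstance(pattern, tuple):
            for p, t in zip(pattern, tree):
                yield from uncovered(p, t)

    def maximal_munch(tree, emit):
        if not isinstance(tree, tuple):            # a bare leaf such as a constant value
            return
        candidates = [(p, cost) for p, cost in RULES if matches(p, tree)]
        if not candidates:
            raise ValueError(f'no tile matches {tree!r}')
        # "biggest" match first (number of nodes covered), cost breaks ties
        pattern, _ = max(candidates, key=lambda pc: (size(pc[0]), -pc[1]))
        for sub in uncovered(pattern, tree):       # then tile the exposed subtrees
            maximal_munch(sub, emit)
        emit(pattern)                              # children emitted before the parent

    # Example: MOVE(MEM(BINOP(PLUS, TEMP fp, CONST a)), TEMP t)
    tree = ('MOVE', ('MEM', ('BINOP', 'PLUS', ('TEMP', 'fp'), ('CONST', 'a'))),
            ('TEMP', 't'))
    maximal_munch(tree, lambda tile: print(tile))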
Maximal munch

[Figure: the a[x] = *y IR tree again, with the MOVE patterns that match at its root highlighted.]

Choose the largest pattern with lowest cost, i.e., the "maximal munch":

IR pattern                         code                       cost
MOVE(MEM(BINOP(PLUS,i,j)),k)       movl k,(i,j)               1
MOVE(MEM(i),j)                     movl j,(i)                 1
MOVE(MEM(i),MEM(j))                movl (j),t; movl t,(i)     2
Maximal munch
Maximal munch does not necessarily produce the optimum selection of instructions
But:
– it is easy to implement
– it tends to work well for current instruction-set architectures
Maximal munch is not optimum
Consider what happens, for example, if two of our rules are changed as follows:

IR pattern                         code                                cost
MEM(BINOP(PLUS,i,CONST c))         movl c,r; addl i,r; movl (r),r      3
MOVE(MEM(BINOP(PLUS,i,j)),k)       movl j,r; addl i,r; movl k,(r)      3
Sample tree-matching rules

Rule #   IR pattern                         cost
0        TEMP t                             0
1        CONST c                            1
2        BINOP(PLUS,i,j)                    1
3        BINOP(TIMES,i,j)                   2
4        BINOP(PLUS,i,CONST c)              1
5        MEM(BINOP(PLUS,i,CONST c))         3
6        MOVE(MEM(BINOP(PLUS,i,j)),k)       3
7        BINOP(TIMES,i,CONST c) *           1
8        BINOP(TIMES,i,CONST c)             2
9        MEM(i)                             1
10       MOVE(MEM(i),j)                     1
11       MOVE(MEM(i),MEM(j))                2
* only if c is 1, 2, or 4
Tiling an IR tree, new rules

a[x] = *y;

[Figure: the same IR tree tiled by maximal munch under the changed rules; the tiling now produces r1–r4 and eight instructions.]

movl $a,r1
addl %ebp,r1
movl (r1),r1
leal (,x,4),r2
movl (y),r3
movl r2,r4
addl r1,r4
movl r3,(r4)
Optimum selection
To achieve optimum instruction selection, we must use a more complex algorithm
– dynamic programming
In contrast to maximal munch, the trees are matched bottom-up
Dynamic programming
The idea is fairly simple
– Working bottom up…
– Given the optimum tilings of all subtrees, generate the optimum tiling of the current tree
  • consider all tiles for the root of the current tree
  • sum cost of best subtree tiles and each tile
  • choose tile with minimum total cost
A second pass generates the code using the results from the bottom-up pass (a sketch follows)
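A minimal sketch of both passes (Python; it assumes the RULES/matches definitions from the rules-as-data sketch and the size/uncovered helpers from the maximal-munch sketch, and the memoization detail is my own):

    def best_tiling(tree, memo=None):
        """Pass 1 (bottom-up): return (total_cost, pattern) for the optimum tiling."""
        if memo is None:
            memo = {}
        if not isinstance(tree, tuple):               # bare leaf, nothing to cover
            return (0, None)
        if tree in memo:
            return memo[tree]
        best = None
        for pattern, cost in RULES:
            if not matches(pattern, tree):
                continue
            total = cost + sum(best_tiling(sub, memo)[0]
                               for sub in uncovered(pattern, tree))
            if best is None or total < best[0]:
                best = (total, pattern)
        if best is None:
            raise ValueError(f'no tile matches {tree!r}')
        memo[tree] = best
        return best

    def emit_optimum(tree, emit):
        """Pass 2: walk the tree again and emit the chosen tiles, children first."""
        if not isinstance(tree, tuple):
            return
        _, pattern = best_tiling(tree)                # recomputed here for simplicity
        for sub in uncovered(pattern, tree):
            emit_optimum(sub, emit)
        emit(pattern)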
Bottom-up CG, pass 1

[Figure: the IR tree MOVE(MEM(+(MEM(+(%EBP, CONST a)), CONST 4)), MEM(+(%EBP, CONST 99))). Each node is annotated (r, c), meaning rule r gives total cost c for the subtree rooted there; the bottom-up pass computes these annotations, ending with (10,6),(11,6) at the root.]
Bottom-up CG, pass 2

[Figure: the same annotated tree. The second pass starts from the root's best (rule, cost) annotation and walks back down, selecting the rule to apply at each node and emitting its code.]
Bottom-up code generation
How does the running time compare to maximal munch?
Memory usage?
Tools
A lot of tools have been developed to automatically generate instruction selectors
– TWIG [Aho, Ganapathi, Tjiang, '86]
– BURG [Fraser, Henry, Proebsting, '92]
– BEG [Emmelmann, Schroer, Landwehr, '89]
– …
These generate bottom-up instruction selectors from tree-matching rule specifications
Code-generator generators
Twig, Burg, and Beg use the dynamic programming approach
– A lot of work has gone into making these cgg's highly efficient
There are also grammar-based cgg's that use LR(k) parsing to perform the tree-matching
Tiling a DAG
How would you tile

[Figure: a DAG in which the address &x is shared by two computations, &x+4 (loaded into t0) and &x+8 (loaded into t1).]

One tiling:
mov &x -> t3
add t3,4 -> t4
load (t4) -> t0
add t3,8 -> t5
load (t5) -> t1

Another:
mov x+4 -> t0
mov x+8 -> t1
Peephole Matching
Basic idea
Compiler can discover local improvements locally
– Look at a small set of adjacent operations
– Move a "peephole" over code & search for improvement
Classic example was store followed by load

Original code:
storeAI r1 ⇒ r0,8
loadAI r0,8 ⇒ r15

Improved code:
storeAI r1 ⇒ r0,8
i2i r1 ⇒ r15
Peephole Matching, cont'd
Simple algebraic identities

Original code:
addI r2,0 ⇒ r7
mult r4,r7 ⇒ r10

Improved code:
mult r4,r2 ⇒ r10
Peephole Matching, cont'd
Jump to a jump

Original code:
jumpI → L10
L10: jumpI → L11

Improved code:
L10: jumpI → L11
Peephole Matching
Implementing it
Early systems used limited set of hand-coded patterns
Window size ensured quick processing
Modern peephole instruction selectors (Davidson) break the problem into three tasks
Apply symbolic interpretation & simplification systematically

[Figure: IR → Expander (IR→LLIR) → LLIR → Simplifier (LLIR→LLIR) → LLIR → Matcher (LLIR→ASM) → ASM]
Peephole Matching
Expander
Turns IR code into a low-level IR (LLIR) such as RTL
Operation-by-operation, template-driven rewriting
LLIR form includes all direct effects (e.g., setting cc)
Significant, albeit constant, expansion of size
Peephole Matching
Simplifier
Looks at LLIR through a window and rewrites it
Uses forward substitution, algebraic simplification, local constant propagation, and dead-effect elimination
Performs local optimization within the window
This is the heart of the peephole system
– Benefit of peephole optimization shows up in this step
Peephole Matching
Matcher
Compares simplified LLIR against a library of patterns
Picks low-cost pattern that captures effects
Must preserve LLIR effects, may add new ones (e.g., set cc)
Generates the assembly code output
Example

Original IR Code:
OP     Arg1   Arg2   Result
mult   2      y      t1
sub    x      t1     w

Expand ⇓

LLIR Code:
r10 ← 2
r11 ← @y
r12 ← r0 + r11
r13 ← MEM(r12)
r14 ← r10 x r13
r15 ← @x
r16 ← r0 + r15
r17 ← MEM(r16)
r18 ← r17 - r14
r19 ← @w
r20 ← r0 + r19
MEM(r20) ← r18
Example, cont'd

Starting from that LLIR code:

Simplify ⇓

LLIR Code:
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
MEM(r0 + @w) ← r18
Example, cont'd

LLIR Code:
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
MEM(r0 + @w) ← r18

Match ⇓

ILOC Code:
loadAI r0,@y ⇒ r13
multI r13,2 ⇒ r14
loadAI r0,@x ⇒ r17
sub r17,r14 ⇒ r18
storeAI r18 ⇒ r0,@w
Steps of the Simplifier

A 3-operation window slides over the expanded LLIR code; at each step the simplifier rewrites the window's contents and then rolls the window forward:

r10 ← 2
r11 ← @y
r12 ← r0 + r11
   ⇓
r10 ← 2
r12 ← r0 + @y
r13 ← MEM(r12)
   ⇓
r10 ← 2
r13 ← MEM(r0 + @y)
r14 ← r10 x r13
   ⇓
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r15 ← @x
   ⇓   (r13 ← MEM(r0 + @y) is the 1st op to roll out of the window)
r14 ← 2 x r13
r15 ← @x
r16 ← r0 + r15
   ⇓
r14 ← 2 x r13
r16 ← r0 + @x
r17 ← MEM(r16)
   ⇓
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
   ⇓
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
r19 ← @w
   ⇓
r18 ← r17 - r14
r19 ← @w
r20 ← r0 + r19
   ⇓
r18 ← r17 - r14
r20 ← r0 + @w
MEM(r20) ← r18
   ⇓
r18 ← r17 - r14
MEM(r0 + @w) ← r18

The result is the simplified LLIR code:
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
MEM(r0 + @w) ← r18
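A minimal sketch of the core transformation just stepped through, forward substitution of a "simple" definition into its single nearby use plus dead-definition removal, applied until nothing changes (Python; the LLIR encoding, the simple() policy, and all names are my own illustration):

    # LLIR ops are (dest, expr); a store has dest None and expr ('store', addr, value).
    def uses(expr, reg):
        return expr == reg or (isinstance(expr, tuple)
                               and any(uses(e, reg) for e in expr))

    def substitute(expr, reg, value):
        if expr == reg:
            return value
        if isinstance(expr, tuple):
            return tuple(substitute(e, reg, value) for e in expr)
        return expr

    def simple(expr):
        # Fold only constants, symbolic addresses, and reg+symbol sums, so that
        # every remaining op still corresponds to a machine instruction.
        if isinstance(expr, int) or (isinstance(expr, str) and expr.startswith('@')):
            return True
        return (isinstance(expr, tuple) and expr[0] == 'add'
                and all(isinstance(e, str) for e in expr[1:]))

    def simplify(code, window=3):
        code, changed = list(code), True
        while changed:
            changed, i = False, 0
            while i < len(code):
                dst, expr = code[i]
                later, nearby = code[i + 1:], code[i + 1:i + window]
                if (isinstance(dst, str) and simple(expr)
                        and sum(uses(e, dst) for _, e in later) == 1   # exactly one use
                        and not any(d == dst for d, _ in later)        # never redefined
                        and any(uses(e, dst) for _, e in nearby)):     # use is in window
                    j = i + 1 + next(k for k, (_, e) in enumerate(nearby) if uses(e, dst))
                    dj, ej = code[j]
                    code[j] = (dj, substitute(ej, dst, expr))
                    del code[i]                                        # dead definition
                    changed = True
                else:
                    i += 1
        return code

    llir = [('r10', 2), ('r11', '@y'), ('r12', ('add', 'r0', 'r11')),
            ('r13', ('mem', 'r12')), ('r14', ('mul', 'r10', 'r13')),
            ('r15', '@x'), ('r16', ('add', 'r0', 'r15')),
            ('r17', ('mem', 'r16')), ('r18', ('sub', 'r17', 'r14')),
            ('r19', '@w'), ('r20', ('add', 'r0', 'r19')),
            (None, ('store', 'r20', 'r18'))]
    for op in simplify(llir):
        print(op)        # prints the five simplified operations listed above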
Making It All Work
Details
LLIR is largely machine independent (RTL)
Target machine described as LLIR → ASM patterns
Actual pattern matching
– Use a hand-coded pattern matcher (gcc)
– Turn patterns into grammar & use an LR parser (VPO)
Several important compilers use this technology
It seems to produce good portable instruction selectors
Key strength appears to be late low-level optimization
SIMD Instructions
We've assumed each tile is connected
What about SIMD instructions?
Ex. TI C62x add2

[Figure: add2 performs two 16-bit additions packed into one 32-bit register (subword parallelism), i.e. a single instruction covering two separate + operations.]
Automatic Vectorization

Loop level:
for (i = 0; i < 1024; i++) {
  C[i] = A[i]*B[i];
}
becomes
for (i = 0; i < 1024; i += 4) {
  C[i:i+3] = A[i:i+3]*B[i:i+3];
}

Basic block level:
x = a+b;
y = c+d;
becomes
(x,y) = (a,c)+(b,d);

In both cases we are looking for independent, isomorphic operations