School of Computer Science
Instruction Selection
15-745 Optimizing Compilers, Spring 2007
Compiler Backend

[Figure: the backend pipeline. IR feeds the instruction selector, which emits Assem; the register allocator (producing a TempMap) and the instruction scheduler then each rewrite the Assem.]
Intermediate Representations
Front end - produces an intermediate representation (IR)
Middle end - transforms the IR into an equivalent IR that runs more efficiently
Back end - transforms the IR into native code

IR encodes the compiler's knowledge of the program
Middle end usually consists of several passes
[Figure: Source Code → Front End → IR → Middle End → IR → Back End → Target Code]
Intermediate Representations
Decisions in IR design affect the speed and efficiency of the compiler
Some important IR properties
– Ease of generation
– Ease of manipulation
– Procedure size
– Freedom of expression
– Level of abstraction
The importance of different properties varies between compilers
– Selecting an appropriate IR for a compiler is critical
Types of Intermediate Representations
Structural
– Graphically oriented
– Heavily used in source-to-source translators
– Tend to be large
– Examples: trees, DAGs
Linear
– Pseudo-code for an abstract machine
– Level of abstraction varies
– Simple, compact data structures
– Easier to rearrange
– Examples: 3 address code, stack machine code
Hybrid
– Combination of graphs and linear code
– Example: control-flow graph
Level of Abstraction
The level of detail exposed in an IR influences the profitability and feasibility of different optimizations.
Two different representations of an array reference A[i,j]:

High-level AST (good for memory disambiguation):
subscript(A, i, j)

Low-level linear code (good for address calculation):
loadI 1       => r1
sub   rj, r1  => r2
loadI 10      => r3
mult  r2, r3  => r4
sub   ri, r1  => r5
add   r4, r5  => r6
loadI @A      => r7
add   r7, r6  => r8
load  r8      => rAij
Level of Abstraction
Structural IRs are usually considered high-level
Linear IRs are usually considered low-level
Not necessarily true:
– Low-level AST: [Figure: a tree that spells out the full address arithmetic, load(@A + ((j - 1) * 10 + (i - 1)))]
– High-level linear code: loadArray A,i,j
Abstract Syntax Tree (Structural IR)
An abstract syntax tree is the procedure's parse tree with the nodes for most non-terminals removed
Example: x - 2 * y becomes the tree -(x, *(2, y))
When is an AST a good IR?
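As a concrete illustration (a minimal Python sketch; the node classes are my own, not from the lecture), the AST for x - 2 * y can be built from a few small node types:

    # Minimal AST sketch for the expression x - 2 * y (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Num:
        value: int

    @dataclass
    class Var:
        name: str

    @dataclass
    class BinOp:
        op: str            # '-', '*', ...
        left: object
        right: object

    # x - 2 * y  ==>  BinOp('-', Var('x'), BinOp('*', Num(2), Var('y')))
    ast = BinOp('-', Var('x'), BinOp('*', Num(2), Var('y')))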
Directed Acyclic Graph (Structural IR)
A directed acyclic graph (DAG) is an AST with a unique node for each value
– Makes sharing explicit
– Encodes redundancy
Example:
z ← x - 2 * y
w ← x / 2
[Figure: one DAG for both assignments, in which the nodes for x and for the constant 2 are shared.]
Same expression twice means that the compiler might arrange to evaluate it just once!
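One common way to get the "unique node for each value" property is hash-consing: keep a table keyed by (operator, operands) and reuse an existing node whenever the same key reappears. A minimal Python sketch (the node/table names are assumptions, not from the lecture):

    # Build a DAG by reusing a node whenever the same (op, operands) pair recurs.
    _nodes = {}

    def node(op, *args):
        key = (op, args)                  # args are already-interned child nodes
        if key not in _nodes:
            _nodes[key] = ('node', op, args)
        return _nodes[key]

    x, y, two = node('var', 'x'), node('var', 'y'), node('const', 2)
    z = node('-', x, node('*', two, y))   # z <- x - 2 * y
    w = node('/', x, two)                 # w <- x / 2
    assert node('var', 'x') is x          # x (and the constant 2) have one shared node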
Pegasus IR (Structural IR)
Pegasus: Predicated Explicit GAted Simple Uniform SSA
Structural IR used in CASH (Compiler for Application-Specific Hardware)
Stack Machine Code (Linear IR)
Originally used for stack-based computers, now Java
Example: x - 2 * y becomes
push x
push 2
push y
multiply
subtract
Advantages
– Compact form
– Introduced names are implicit, not explicit (implicit names take up no space, where explicit ones do!)
– Simple to generate and execute code
Useful where code is transmitted over slow communication links (the net)
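A minimal sketch of how such code executes (Python; the opcode names mirror the slide, the interpreter itself is illustrative):

    # Tiny stack-machine interpreter for the sequence above.
    def run(code, env):
        stack = []
        for op, *arg in code:
            if op == 'push':
                v = arg[0]
                stack.append(env[v] if isinstance(v, str) else v)
            elif op == 'multiply':
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == 'subtract':
                b, a = stack.pop(), stack.pop()
                stack.append(a - b)
        return stack.pop()

    # x - 2 * y with x = 10, y = 3  ->  4
    print(run([('push', 'x'), ('push', 2), ('push', 'y'),
               ('multiply',), ('subtract',)], {'x': 10, 'y': 3}))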
Three Address Code (Linear IR)
Three address code has statements of the form:
x ← y op z
with 1 operator (op) and, at most, 3 names (x, y, and z)
Example: z ← x - 2 * y becomes
t ← 2 * y
z ← x - t
Advantages:
– Resembles many machines (RISC)
– Compact form
Two Address Code (Linear IR)
Allows statements of the form
x ← x op y
with 1 operator (op) and, at most, 2 names (x and y)
Example: z ← x - 2 * y becomes
t1 ← 2
t2 ← load y
t2 ← t2 * t1
z ← load x
z ← z - t2
– Can be very compact
– Destructive operations make reuse hard
– Good model for machines with destructive ops (x86)
Control-flow Graph (Hybrid IR)
Models the transfer of control in the procedure
Nodes in the graph are basic blocks: maximal-length sequences of straight-line code
– Straight-line code
– Either linear or structural representation
Edges in the graph represent control flow
Example: if (x = y) branches to either {a ← 2; b ← 5} or {a ← 3; b ← 4}, and both blocks flow into {c ← a * b}
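A minimal sketch of that example as a CFG data structure (Python; the block names and encoding are illustrative, not from the lecture):

    # Basic blocks hold straight-line code; the edge map carries control flow.
    blocks = {
        'entry': ['if (x = y) goto "then" else goto "else"'],
        'then':  ['a <- 2', 'b <- 5'],
        'else':  ['a <- 3', 'b <- 4'],
        'join':  ['c <- a * b'],
    }
    successors = {
        'entry': ['then', 'else'],
        'then':  ['join'],
        'else':  ['join'],
        'join':  [],
    }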
Using Multiple Representations
Repeatedly lower the level of the intermediate representation
– Each intermediate representation is suited towards certain optimizations
Example: the Open64 compiler
– WHIRL intermediate format
  • Consists of 5 different IRs that are progressively more detailed
– gcc
  • but not explicit about it :-(
[Figure: Source Code → Front End → IR 1 → Middle End → IR 2 → Middle End → IR 3 → Back End → Target Code]
Instruction Selection
[Figure: IR → instruction selector → Assem; the selector is driven by IR → Assem templates]
Instruction selection example
Suppose we have
MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))
We can generate the x86 code…
movl (,%s,c), %r
…if c = 1, 2, or 4; otherwise…
imull $c, %s, %r
movl (%r), %r
Selection dependencies
MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))
We can see that the selection of instructions can depend on the constants
The context of an IR expression can also affect the choice of instruction
Consider:
Example, cont'd
For
MEM(BINOP(TIMES, TEMP s, CONST c))
we might sometimes want to generate
testl $0, (,%s,c)
je    L
What context might cause us to do this?
Instruction selection as tree matching
In order to take context into account, instruction selectors often use pattern-matching on IR trees
– each pattern specifies what instructions to select
Sample tree-matching rules

IR pattern                         code                       cost
BINOP(PLUS,i,j)                    leal (i,j),r               1
BINOP(TIMES,i,j)                   movl j,r; imull i,r        2
BINOP(PLUS,i,CONST c)              leal c(i),r                1
MEM(BINOP(PLUS,i,CONST c))         movl c(i),r                1
MOVE(MEM(BINOP(PLUS,i,j)),k)       movl k,(i,j)               1
BINOP(TIMES,i,CONST c) *           leal (,i,c),r              1
BINOP(TIMES,i,CONST c)             movl c,r; imull i,r        2
MEM(i)                             movl (i),r                 1
MOVE(MEM(i),j)                     movl j,(i)                 1
MOVE(MEM(i),MEM(j))                movl (j),t; movl t,(i)     2
…
* only if c is 1, 2, or 4
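To make the later algorithms concrete, here is one way such rules might be written down as data (a minimal Python sketch; the tuple encoding, wildcard convention, and cost values are my own illustration, not the format of any real code-generator generator):

    # IR trees as nested tuples, e.g. ('MOVE', ('MEM', ('BINOP', 'PLUS', i, j)), k).
    # Each rule is (pattern, cost); '_' in a pattern matches any subtree.
    RULES = [
        (('TEMP', '_'),                                       0),  # already in a register
        (('CONST', '_'),                                      1),
        (('BINOP', 'PLUS', '_', '_'),                         1),  # leal (i,j),r
        (('BINOP', 'TIMES', '_', '_'),                        2),  # movl j,r; imull i,r
        (('BINOP', 'PLUS', '_', ('CONST', '_')),              1),  # leal c(i),r
        (('MEM', ('BINOP', 'PLUS', '_', ('CONST', '_'))),     1),  # movl c(i),r
        (('MOVE', ('MEM', ('BINOP', 'PLUS', '_', '_')), '_'), 1),  # movl k,(i,j)
        (('MEM', '_'),                                        1),  # movl (i),r
        (('MOVE', ('MEM', '_'), '_'),                         1),  # movl j,(i)
    ]

    def matches(pattern, tree):
        """Does this tile pattern match at the root of the given IR tree?"""
        if pattern == '_':
            return True
        if not isinstance(pattern, tuple):
            return pattern == tree           # a tag such as 'MEM' or 'PLUS'
        return (isinstance(tree, tuple) and len(pattern) == len(tree)
                and all(matches(p, t) for p, t in zip(pattern, tree)))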
Tiling an IR tree, v.1

a[x] = *y;    (assume a is a formal parameter passed on the stack)

[Figure: the IR tree MOVE(MEM(+(MEM(+(EBP, CONST a)), *(x, CONST 4))), MEM(y)), covered by small tiles that produce temporaries r1–r5.]

leal $a(%ebp),r1
movl (r1),r2
leal (,x,4),r3
leal (r2,r3),r4
movl (y),r5
movl r5,(r4)
Tiling an IR tree, v.2

a[x] = *y;

[Figure: the same IR tree covered by fewer, larger tiles, producing only r1–r3.]

movl $a(%ebp),r1
leal (,x,4),r2
movl (y),r3
movl r3,(r1,r2)
Tiling choices
In general, for any given tree, many tilings are possible
– each resulting in a different instruction sequence
We can ensure pattern coverage by covering, at a minimum, all atomic IR trees
The best tiling?
We want the "lowest cost" tiling
– usually, the shortest sequence
– but can also take into account cost/delay of each instruction
Optimum tiling
– lowest-cost tiling
Locally optimal tiling
– no two adjacent tiles can be combined into one tile of lower cost
Locally optimal tilings
Locally optimal tiling is easy
– A simple greedy algorithm works extremely well in practice:
– Maximal munch (sketched in code below)
  • start at root
  • use "biggest" match (in # of nodes)
  • use cost to break ties
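A minimal sketch of maximal munch (Python; it reuses the RULES list and matches function from the rules-as-data sketch above, and instead of emitting real x86 it simply reports which tile pattern was chosen at each step):

    def size(pattern):
        # How many IR nodes a pattern covers (each tuple in the pattern is one node).
        if not isinstance(pattern, tuple):
            return 0
        return 1 + sum(size(p) for p in pattern)

    def uncovered(pattern, tree):
        # The subtrees left exposed by the wildcards of the chosen tile.
        if pattern == '_':
            yield tree
        elif isinstance(pattern, tuple):
            for p, t in zip(pattern, tree):
                yield from uncovered(p, t)

    def maximal_munch(tree, emit):
        if not isinstance(tree, tuple):            # a bare leaf such as a constant value
            return
        candidates = [(p, cost) for p, cost in RULES if matches(p, tree)]
        if not candidates:
            raise ValueError(f'no tile matches {tree!r}')
        # "biggest" match first (number of nodes covered), cost breaks ties
        pattern, _ = max(candidates, key=lambda pc: (size(pc[0]), -pc[1]))
        for sub in uncovered(pattern, tree):       # then tile the exposed subtrees
            maximal_munch(sub, emit)
        emit(pattern)                              # children emitted before the parent

    # Example: MOVE(MEM(BINOP(PLUS, TEMP fp, CONST a)), TEMP t)
    tree = ('MOVE', ('MEM', ('BINOP', 'PLUS', ('TEMP', 'fp'), ('CONST', 'a'))),
            ('TEMP', 't'))
    maximal_munch(tree, lambda tile: print(tile))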
Maximal munch

[Figure: the a[x] = *y IR tree again, with the MOVE patterns that match at its root highlighted.]

Choose the largest pattern with lowest cost, i.e., the "maximal munch":

IR pattern                         code                       cost
MOVE(MEM(BINOP(PLUS,i,j)),k)       movl k,(i,j)               1
MOVE(MEM(i),j)                     movl j,(i)                 1
MOVE(MEM(i),MEM(j))                movl (j),t; movl t,(i)     2
Maximal munch
Maximal munch does not necessarily produce the optimum selection of instructions
But:
– it is easy to implement
– it tends to work well for current instruction-set architectures
Maximal munch is not optimum
Consider what happens, for example, if two of our rules are changed as follows:

IR pattern                         code                                cost
MEM(BINOP(PLUS,i,CONST c))         movl c,r; addl i,r; movl (r),r      3
MOVE(MEM(BINOP(PLUS,i,j)),k)       movl j,r; addl i,r; movl k,(r)      3
Sample tree-matching rules

Rule #   IR pattern                         cost
0        TEMP t                             0
1        CONST c                            1
2        BINOP(PLUS,i,j)                    1
3        BINOP(TIMES,i,j)                   2
4        BINOP(PLUS,i,CONST c)              1
5        MEM(BINOP(PLUS,i,CONST c))         3
6        MOVE(MEM(BINOP(PLUS,i,j)),k)       3
7        BINOP(TIMES,i,CONST c) *           1
8        BINOP(TIMES,i,CONST c)             2
9        MEM(i)                             1
10       MOVE(MEM(i),j)                     1
11       MOVE(MEM(i),MEM(j))                2
* only if c is 1, 2, or 4
Tiling an IR tree, new rules

a[x] = *y;

[Figure: the same IR tree tiled by maximal munch under the changed rules; the tiling now produces r1–r4 and eight instructions.]

movl $a,r1
addl %ebp,r1
movl (r1),r1
leal (,x,4),r2
movl (y),r3
movl r2,r4
addl r1,r4
movl r3,(r4)
Optimum selection
To achieve optimum instruction selection, we must use a more complex algorithm
– dynamic programming
In contrast to maximal munch, the trees are matched bottom-up
Dynamic programming
The idea is fairly simple
– Working bottom up…
– Given the optimum tilings of all subtrees, generate the optimum tiling of the current tree
  • consider all tiles for the root of the current tree
  • sum cost of best subtree tiles and each tile
  • choose tile with minimum total cost
A second pass generates the code using the results from the bottom-up pass (a sketch follows)
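A minimal sketch of both passes (Python; it assumes the RULES/matches definitions from the rules-as-data sketch and the size/uncovered helpers from the maximal-munch sketch, and the memoization detail is my own):

    def best_tiling(tree, memo=None):
        """Pass 1 (bottom-up): return (total_cost, pattern) for the optimum tiling."""
        if memo is None:
            memo = {}
        if not isinstance(tree, tuple):               # bare leaf, nothing to cover
            return (0, None)
        if tree in memo:
            return memo[tree]
        best = None
        for pattern, cost in RULES:
            if not matches(pattern, tree):
                continue
            total = cost + sum(best_tiling(sub, memo)[0]
                               for sub in uncovered(pattern, tree))
            if best is None or total < best[0]:
                best = (total, pattern)
        if best is None:
            raise ValueError(f'no tile matches {tree!r}')
        memo[tree] = best
        return best

    def emit_optimum(tree, emit):
        """Pass 2: walk the tree again and emit the chosen tiles, children first."""
        if not isinstance(tree, tuple):
            return
        _, pattern = best_tiling(tree)                # recomputed here for simplicity
        for sub in uncovered(pattern, tree):
            emit_optimum(sub, emit)
        emit(pattern)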
Bottom-up CG, pass 1

[Figure: the IR tree MOVE(MEM(+(MEM(+(%EBP, CONST a)), CONST 4)), MEM(+(%EBP, CONST 99))). Each node is annotated (r, c), meaning rule r gives total cost c for the subtree rooted there; the bottom-up pass computes these annotations, ending with (10,6),(11,6) at the root.]
Bottom-up CG, pass 2

[Figure: the same annotated tree. The second pass starts from the root's best (rule, cost) annotation and walks back down, selecting the rule to apply at each node and emitting its code.]
Bottom-up code generation
How does the running time compare to maximal munch?
Memory usage?
Tools
A lot of tools have been developed to automatically generate instruction selectors
– TWIG [Aho, Ganapathi, Tjiang, '86]
– BURG [Fraser, Henry, Proebsting, '92]
– BEG [Emmelmann, Schroer, Landwehr, '89]
– …
These generate bottom-up instruction selectors from tree-matching rule specifications
Code-generator generators
Twig, Burg, and Beg use the dynamic programming approach
– A lot of work has gone into making these cgg's highly efficient
There are also grammar-based cgg's that use LR(k) parsing to perform the tree-matching
Tiling a DAG
How would you tile

[Figure: a DAG in which the address &x is shared by two computations, &x+4 (loaded into t0) and &x+8 (loaded into t1).]

One tiling:
mov &x -> t3
add t3,4 -> t4
load (t4) -> t0
add t3,8 -> t5
load (t5) -> t1

Another:
mov x+4 -> t0
mov x+8 -> t1
Peephole Matching
Basic idea
Compiler can discover local improvements locally
– Look at a small set of adjacent operations
– Move a "peephole" over code & search for improvement
Classic example was store followed by load

Original code:
storeAI r1 ⇒ r0,8
loadAI r0,8 ⇒ r15

Improved code:
storeAI r1 ⇒ r0,8
i2i r1 ⇒ r15
Peephole Matching, cont'd
Simple algebraic identities

Original code:
addI r2,0 ⇒ r7
mult r4,r7 ⇒ r10

Improved code:
mult r4,r2 ⇒ r10
Peephole Matching, cont'd
Jump to a jump

Original code:
jumpI → L10
L10: jumpI → L11

Improved code:
L10: jumpI → L11
Peephole Matching
Implementing it
Early systems used limited set of hand-coded patterns
Window size ensured quick processing
Modern peephole instruction selectors (Davidson) break the problem into three tasks
Apply symbolic interpretation & simplification systematically

[Figure: IR → Expander (IR→LLIR) → LLIR → Simplifier (LLIR→LLIR) → LLIR → Matcher (LLIR→ASM) → ASM]
Peephole Matching
Expander
Turns IR code into a low-level IR (LLIR) such as RTL
Operation-by-operation, template-driven rewriting
LLIR form includes all direct effects (e.g., setting cc)
Significant, albeit constant, expansion of size
Peephole Matching
Simplifier
Looks at LLIR through a window and rewrites it
Uses forward substitution, algebraic simplification, local constant propagation, and dead-effect elimination
Performs local optimization within the window
This is the heart of the peephole system
– Benefit of peephole optimization shows up in this step
Peephole Matching
Matcher
Compares simplified LLIR against a library of patterns
Picks low-cost pattern that captures effects
Must preserve LLIR effects, may add new ones (e.g., set cc)
Generates the assembly code output
Example

Original IR Code:
OP     Arg1   Arg2   Result
mult   2      y      t1
sub    x      t1     w

Expand ⇓

LLIR Code:
r10 ← 2
r11 ← @y
r12 ← r0 + r11
r13 ← MEM(r12)
r14 ← r10 x r13
r15 ← @x
r16 ← r0 + r15
r17 ← MEM(r16)
r18 ← r17 - r14
r19 ← @w
r20 ← r0 + r19
MEM(r20) ← r18
Example, cont'd

Starting from that LLIR code:

Simplify ⇓

LLIR Code:
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
MEM(r0 + @w) ← r18
Example, cont'd

LLIR Code:
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
MEM(r0 + @w) ← r18

Match ⇓

ILOC Code:
loadAI r0,@y ⇒ r13
multI r13,2 ⇒ r14
loadAI r0,@x ⇒ r17
sub r17,r14 ⇒ r18
storeAI r18 ⇒ r0,@w
Steps of the Simplifier

A 3-operation window slides over the expanded LLIR code; at each step the simplifier rewrites the window's contents and then rolls the window forward:

r10 ← 2
r11 ← @y
r12 ← r0 + r11
   ⇓
r10 ← 2
r12 ← r0 + @y
r13 ← MEM(r12)
   ⇓
r10 ← 2
r13 ← MEM(r0 + @y)
r14 ← r10 x r13
   ⇓
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r15 ← @x
   ⇓   (r13 ← MEM(r0 + @y) is the 1st op to roll out of the window)
r14 ← 2 x r13
r15 ← @x
r16 ← r0 + r15
   ⇓
r14 ← 2 x r13
r16 ← r0 + @x
r17 ← MEM(r16)
   ⇓
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
   ⇓
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
r19 ← @w
   ⇓
r18 ← r17 - r14
r19 ← @w
r20 ← r0 + r19
   ⇓
r18 ← r17 - r14
r20 ← r0 + @w
MEM(r20) ← r18
   ⇓
r18 ← r17 - r14
MEM(r0 + @w) ← r18

The result is the simplified LLIR code:
r13 ← MEM(r0 + @y)
r14 ← 2 x r13
r17 ← MEM(r0 + @x)
r18 ← r17 - r14
MEM(r0 + @w) ← r18
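A minimal sketch of the core transformation just stepped through, forward substitution of a "simple" definition into its single nearby use plus dead-definition removal, applied until nothing changes (Python; the LLIR encoding, the simple() policy, and all names are my own illustration):

    # LLIR ops are (dest, expr); a store has dest None and expr ('store', addr, value).
    def uses(expr, reg):
        return expr == reg or (isinstance(expr, tuple)
                               and any(uses(e, reg) for e in expr))

    def substitute(expr, reg, value):
        if expr == reg:
            return value
        if isinstance(expr, tuple):
            return tuple(substitute(e, reg, value) for e in expr)
        return expr

    def simple(expr):
        # Fold only constants, symbolic addresses, and reg+symbol sums, so that
        # every remaining op still corresponds to a machine instruction.
        if isinstance(expr, int) or (isinstance(expr, str) and expr.startswith('@')):
            return True
        return (isinstance(expr, tuple) and expr[0] == 'add'
                and all(isinstance(e, str) for e in expr[1:]))

    def simplify(code, window=3):
        code, changed = list(code), True
        while changed:
            changed, i = False, 0
            while i < len(code):
                dst, expr = code[i]
                later, nearby = code[i + 1:], code[i + 1:i + window]
                if (isinstance(dst, str) and simple(expr)
                        and sum(uses(e, dst) for _, e in later) == 1   # exactly one use
                        and not any(d == dst for d, _ in later)        # never redefined
                        and any(uses(e, dst) for _, e in nearby)):     # use is in window
                    j = i + 1 + next(k for k, (_, e) in enumerate(nearby) if uses(e, dst))
                    dj, ej = code[j]
                    code[j] = (dj, substitute(ej, dst, expr))
                    del code[i]                                        # dead definition
                    changed = True
                else:
                    i += 1
        return code

    llir = [('r10', 2), ('r11', '@y'), ('r12', ('add', 'r0', 'r11')),
            ('r13', ('mem', 'r12')), ('r14', ('mul', 'r10', 'r13')),
            ('r15', '@x'), ('r16', ('add', 'r0', 'r15')),
            ('r17', ('mem', 'r16')), ('r18', ('sub', 'r17', 'r14')),
            ('r19', '@w'), ('r20', ('add', 'r0', 'r19')),
            (None, ('store', 'r20', 'r18'))]
    for op in simplify(llir):
        print(op)        # prints the five simplified operations listed above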
Making It All Work
Details
LLIR is largely machine independent (RTL)
Target machine described as LLIR → ASM patterns
Actual pattern matching
– Use a hand-coded pattern matcher (gcc)
– Turn patterns into grammar & use an LR parser (VPO)
Several important compilers use this technology
It seems to produce good portable instruction selectors
Key strength appears to be late low-level optimization
SIMD Instructions
We've assumed each tile is connected
What about SIMD instructions?
Ex. TI C62x add2

[Figure: add2 performs two 16-bit additions packed into one 32-bit register (subword parallelism), i.e. a single instruction covering two separate + operations.]
Automatic Vectorization

Loop level:
for (i = 0; i < 1024; i++) {
  C[i] = A[i]*B[i];
}
becomes
for (i = 0; i < 1024; i += 4) {
  C[i:i+3] = A[i:i+3]*B[i:i+3];
}

Basic block level:
x = a+b;
y = c+d;
becomes
(x,y) = (a,c)+(b,d);

In both cases we are looking for independent, isomorphic operations