peephole optimization final pass over generated code: examine a few consecutive instructions: 2 to 4...


Post on 20-Dec-2015


Peephole Optimization

Final pass over generated code:

examine a few consecutive instructions: 2 to 4

See if an obvious replacement is possible: store/load pairs

    MOV %eax => mema
    MOV mema => %eax

Can eliminate the second instruction without needing any global knowledge of mema

Use algebraic identities

Special-case individual instructions
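The store/load case can be sketched in a few lines of Python. The tuple encoding ("MOV", source, destination) and the function name drop_redundant_loads are assumptions made for illustration; the sketch also assumes that no label sits between the two instructions and that the memory location is not volatile.

```python
# Hypothetical encoding: ("MOV", source, destination); registers start with "%".
def drop_redundant_loads(code):
    """Drop a load that immediately follows a store of the same register
    to the same memory location."""
    out = []
    for ins in code:
        if out:
            prev = out[-1]
            # Store "MOV reg => mem" followed by load "MOV mem => reg".
            if (prev[0] == "MOV" and ins[0] == "MOV"
                    and prev[1] == ins[2] and prev[2] == ins[1]
                    and prev[1].startswith("%")):
                continue  # the register already holds the value
        out.append(ins)
    return out

code = [("MOV", "%eax", "mema"), ("MOV", "mema", "%eax"), ("ADD", "%ebx", "%eax")]
optimized = drop_redundant_loads(code)
```

Only the previous instruction is inspected, which is why no global knowledge of mema is needed.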


Algebraic identities

worth recognizing single instructions with a constant operand:

A * 2 = A + A

A * 1 = A

A * 0 = 0

A / 1 = A

More delicate with floating point: for example, A * 0 = 0 does not hold when A is a NaN or an infinity
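As a rough illustration, the four identities can be written as a rewrite on a single (operator, left, right) triple. The triple encoding and the name simplify are hypothetical, and the sketch assumes integer operands only.

```python
def simplify(op, a, b):
    """Apply the integer identities to one instruction with a constant
    right operand; return a simpler triple or a bare operand."""
    if op == "*" and b == 2:
        return ("+", a, a)   # A * 2 = A + A
    if op == "*" and b == 1:
        return a             # A * 1 = A
    if op == "*" and b == 0:
        return 0             # A * 0 = 0
    if op == "/" and b == 1:
        return a             # A / 1 = A
    return (op, a, b)        # no identity applies
```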


Is this ever helpful?

Why would anyone write X * 1?

Why bother to correct such obvious junk code?

In fact one might write

    #define MAX_TASKS 1
    ...
    a = b * MAX_TASKS;

Also, seemingly redundant code can be produced by other optimizations. This is an important effect.


Replace Multiply by Shift

A := A * 4;

Can be replaced by a 2-bit left shift (signed or unsigned)

But must worry about overflow if the language checks for it

A := A / 4;

If unsigned, can be replaced by a 2-bit right shift

For signed values, arithmetic shift right is a well-known problem: it rounds toward minus infinity, while division usually truncates toward zero

The language may allow it anyway (traditional C)


Addition chains for multiplication

If multiply is very slow (or on a machine with no multiply instruction, like the original SPARC), decomposing a constant operand into a sum of powers of two can be effective:

    X * 125 = X * 128 - X * 4 + X

Two shifts, one subtract and one add, which may be faster than one multiply

Note similarity with efficient exponentiation method
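One way to find such a decomposition mechanically is the non-adjacent form, a signed-digit binary representation. This is a technique swapped in for illustration, not something the slides prescribe, and the names shift_add_terms and multiply_by_constant are hypothetical.

```python
def shift_add_terms(c):
    """Decompose a positive constant into signed powers of two
    (non-adjacent form). Returns (sign, shift) pairs, lowest shift first."""
    terms = []
    shift = 0
    while c != 0:
        if c & 1:
            digit = 2 - (c & 3)   # +1 if c % 4 == 1, -1 if c % 4 == 3
            terms.append((digit, shift))
            c -= digit
        c >>= 1
        shift += 1
    return terms

def multiply_by_constant(x, c):
    """Multiply x by c using only shifts, adds and subtracts."""
    return sum(sign * (x << shift) for sign, shift in shift_add_terms(c))

terms_125 = shift_add_terms(125)   # x - (x << 2) + (x << 7)
```

For 125 this yields the same three terms as the slide: +X, -X * 4 and +X * 128.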


The Right Shift problem

Arithmetic Right shift:

shift right and use sign bit to fill most significant bits

-5 111111...1111111011

SAR 111111...1111111101

which is -3, not -2

in most languages -5/2 = -2

Prior to C99, the rounding direction was implementation-defined if either operand was negative: the quotient could be truncated toward zero or rounded toward minus infinity
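The mismatch is easy to reproduce in Python, whose >> on integers is an arithmetic shift. The sketch below also shows a common fix, which is an addition to the slides: bias a negative dividend by 2**k - 1 before shifting. All function names are hypothetical.

```python
def sar(x, k):
    """Arithmetic right shift: Python's >> fills with the sign bit."""
    return x >> k

def div_trunc(x, d):
    """Integer division truncating toward zero, for d > 0."""
    q = abs(x) // d
    return -q if x < 0 else q

def div_pow2_by_shift(x, k):
    """Truncating division by 2**k done with a shift: negative dividends
    are biased by 2**k - 1 first so the shift rounds toward zero."""
    if x < 0:
        x += (1 << k) - 1
    return x >> k

shifted = sar(-5, 1)                # -3, rounds toward minus infinity
divided = div_trunc(-5, 2)          # -2, truncates toward zero
fixed = div_pow2_by_shift(-5, 1)    # -2, matches the division
```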


Folding Jumps to Jumps

A jump to an unconditional jump can copy the target address:

      JNE lab1
      ...
lab1: JMP lab2

Can be replaced by:

      JNE lab2

As a result, lab1 may become dead (unreferenced)
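A minimal sketch of this folding, assuming a hypothetical encoding in which labels are ("label", name) items and jumps are (opcode, target) pairs. The function name fold_jumps and the JUMPS set are illustrative only.

```python
JUMPS = {"JMP", "JNE"}   # opcodes whose operand is a label (illustrative subset)

def fold_jumps(code):
    """Retarget every jump whose target label is immediately followed
    by an unconditional JMP."""
    # Map each label to the instruction that follows it.
    follows = {}
    for i, ins in enumerate(code):
        if ins[0] == "label" and i + 1 < len(code):
            follows[ins[1]] = code[i + 1]
    out = []
    for ins in code:
        if ins[0] in JUMPS:
            target = ins[1]
            seen = {target}              # guards against cycles of jumps
            while follows.get(target, ("",))[0] == "JMP":
                nxt = follows[target][1]
                if nxt in seen:
                    break
                seen.add(nxt)
                target = nxt
            ins = (ins[0], target)
        out.append(ins)
    return out

code = [("JNE", "lab1"), ("ADD", "%ebx", "%eax"),
        ("label", "lab1"), ("JMP", "lab2"),
        ("label", "lab2"), ("RET",)]
folded = fold_jumps(code)
```

After folding, nothing refers to lab1 any more, so a later pass can delete the label.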


Jump to Return

A jump to a return can be replaced by a return:

      JMP lab1
      ...
lab1: RET

Can be replaced by:

      RET

lab1 may become dead code


Tail Recursion Elimination

A subprogram is tail-recursive if the last computation is a call to itself:

function last (lis : list_type) return list_type is
begin
   if lis.next = null then return lis;
   else return last (lis.next);
   end if;
end;

Recursive call can be replaced with

   lis := lis.next;
   goto start; -- added label
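The same transformation, sketched in Python under the assumption of a hypothetical Node class. Python has no goto, so the top of a while loop plays the role of the added label.

```python
class Node:
    """Hypothetical singly linked list cell."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def last_recursive(lis):
    if lis.next is None:
        return lis
    return last_recursive(lis.next)   # tail call: nothing left to do afterwards

def last_iterative(lis):
    while True:                       # "start" label
        if lis.next is None:
            return lis
        lis = lis.next                # lis := lis.next
        # looping back to the top stands in for "goto start"

# Build the list 0 -> 1 -> ... -> 4999.
head = None
for v in range(4999, -1, -1):
    head = Node(v, head)
tail = last_iterative(head)
```

The iterative version uses constant stack space on the 5000-element list, whereas the recursive one needs one frame per element.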


Advantages of tail recursion elimination

saves time: an assignment and jump is faster than a call with one parameter

saves stack space: converts linear stack usage to constant usage.

In languages with no loops, this may be a required optimization: specified in Scheme standard.


Tail-recursion elimination at the Instruction Level

Consider the sequence on the x86:

    CALL func
    RET

CALL pushes the return point on the stack, RET in the body of func removes it, RET in the caller returns

Can generate instead:

    JMP func

Now RET in func returns to the original caller, because there is a single return address on the stack


Peephole optimization in the REALIA COBOL compiler

Full compiler for Standard COBOL, targeted to the IBM PC.

Now distributed by Computer Associates

Runs in 150K bytes, but must be able to handle very large programs that run on mainframes

No global optimization possible: multiple linear passes over code, no global data structures, no flow graph.

Multiple peephole optimizations; the compiler iterates until the code is stable. Each pass scans the code backwards to minimize address recomputations


Typical COBOL code: control structures and perform blocks.

Process-Balance.
    if Balance is negative then
        perform Send-Bill
    else
        perform Record-Credit
    end-if.
Send-Bill.
    ...
Record-Credit.
    ...


Simple Assembly: perform equivalent to call

Pb: cmp balance, 0
    jnl L1
    call Sb
    jmp L2
L1: call Rc
L2: ret
Sb: ...
    ret
Rc: ...
    ret


Fold jump to return statement

Pb: cmp balance, 0
    jnl L1
    call Sb
    jmp L2       -- jump to return
L1: call Rc
L2: ret
Sb: ...
    ret
Rc: ...
    ret


Corresponding Assembly

Pb: cmp balance, 0
    jnl L1       -- jump to unconditional jump
    call Sb
    ret          -- folded
L1: jmp Rc       -- will become useless
L2: ret
Sb: ...
    ret
Rc: ...
    ret


Code following a jump is unreachable

Pb: cmp balance, 0
    jnl Rc       -- folded
    jmp Sb
    ret          -- unreachable
L1: jmp Rc       -- unreachable
Sb: ...
    ret
Rc: ...
    ret


Jump to following instruction is a noop

Pb: cmp balance, 0
    jnl Rc
    jmp Sb       -- jump to next instruction
Sb: ...
    ret
Rc: ...
    ret


Final code

Pb: cmp balance, 0
    jnl Rc
Sb: ...
    ret
Rc: ...
    ret

Final code as efficient as inlining.

All transformations are local. Each optimization may yield further optimization opportunities

Iterate till no further change
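The iterate-until-stable idea can be replayed on the Process-Balance example with a small Python sketch. The tuple encoding, the ("body", ...) placeholders for the elided paragraph bodies, the helper names and the entry_labels parameter are all assumptions made for illustration. This is not how the REALIA compiler is implemented (it scans backwards, for one).

```python
JUMPS = {"jmp", "jnl"}   # instructions whose operand is a label
ENDS = {"jmp", "ret"}    # control never falls through these

def next_real(code, i):
    """Index of the first non-label item at or after i, or None."""
    while i < len(code) and code[i][0] == "label":
        i += 1
    return i if i < len(code) else None

def target_index(code, name):
    """Index of the first real instruction after label `name`."""
    return next_real(code, code.index(("label", name)))

def rewrite(code):
    """Rewrites that replace one instruction by another."""
    out = list(code)
    for i, ins in enumerate(out):
        if ins[0] in JUMPS:
            t = target_index(out, ins[1])
            if t is None:
                continue
            if out[t][0] == "jmp" and out[t][1] != ins[1]:
                out[i] = (ins[0], out[t][1])   # jump to unconditional jump
            elif ins[0] == "jmp" and out[t][0] == "ret":
                out[i] = ("ret",)              # jump to return
        elif ins[0] == "call":
            n = next_real(out, i + 1)
            if n is not None and out[n][0] == "ret":
                out[i] = ("jmp", ins[1])       # call followed by return
    return out

def prune(code, entry_labels):
    """Delete dead labels, unreachable code and jumps to the next instruction."""
    used = set(entry_labels) | {ins[1] for ins in code
                                if ins[0] in JUMPS or ins[0] == "call"}
    out, reachable = [], True
    for i, ins in enumerate(code):
        if ins[0] == "label":
            if ins[1] not in used:
                continue                       # dead label
            reachable = True
        elif not reachable:
            continue                           # code after jmp/ret
        elif (ins[0] == "jmp"
              and target_index(code, ins[1]) == next_real(code, i + 1)):
            continue                           # jump to following instruction
        out.append(ins)
        if ins[0] in ENDS:
            reachable = False
    return out

def optimize(code, entry_labels):
    """Repeat both passes until the code stops changing."""
    while True:
        new = prune(rewrite(code), entry_labels)
        if new == code:
            return code
        code = new

program = [
    ("label", "Pb"), ("cmp", "balance", "0"), ("jnl", "L1"),
    ("call", "Sb"), ("jmp", "L2"),
    ("label", "L1"), ("call", "Rc"),
    ("label", "L2"), ("ret",),
    ("label", "Sb"), ("body", "Sb"), ("ret",),
    ("label", "Rc"), ("body", "Rc"), ("ret",),
]
final = optimize(program, entry_labels={"Pb", "Sb", "Rc"})
```

Under these assumptions, the loop settles after three changing passes and leaves only the compare, the jnl to Rc and the two paragraph bodies. The paragraph labels are listed in entry_labels so they survive even when no jump refers to them.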


Arcane tricks

Consider a typical maximum computation:

if A >= B then
   C := A;
else
   C := B;
end if;

For simplicity assume all unsigned, and all in registers


Eliminating max jump on x86

Simple-minded asm code:

    CMP A, B
    JNAE L1
    MOV A => C
    JMP L2
L1: MOV B => C
L2:

One jump in either case


Computing max without jumps on X86

Architecture-specific trick: use subtract with borrow instruction and carry flag

    CMP A, B         ; CF = 1 if B > A, CF = 0 if A >= B
    SBB %eax, %eax   ; all 1's if B > A, all 0's if A >= B
    MOV %eax => C
    NOT C            ; all 0's if B > A, all 1's if A >= B
    AND B => %eax    ; B if B > A, 0 if A >= B
    AND A => C       ; 0 if B > A, A if A >= B
    OR %eax => C     ; B if B > A, A if A >= B

More instructions, but NO JUMPS
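The mask arithmetic can be checked with a small Python simulation, assuming 32-bit unsigned values. The function name max_branchless and the variable names are hypothetical. Python has no carry flag, so the comparison result is modelled as an integer; on the x86 the CMP instruction sets it with no branch.

```python
MASK = 0xFFFFFFFF   # model 32-bit registers

def max_branchless(a, b):
    """Simulate the SBB-based max of two unsigned 32-bit values."""
    cf = int(a < b)          # CMP A, B: carry set when B > A
    eax = (0 - cf) & MASK    # SBB %eax, %eax: eax - eax - CF
    c = ~eax & MASK          # MOV %eax => C ; NOT C
    eax &= b                 # AND B => %eax
    c &= a                   # AND A => C
    c |= eax                 # OR %eax => C
    return c

result = max_branchless(3, 7)
```

Exactly one of the two masks is all ones, so the final OR selects either A or B.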

Superoptimizer: exhaustive search of instruction patterns to uncover similar tricks