peephole optimization final pass over generated code: examine a few consecutive instructions: 2 to 4...

Peephole Optimization

Final pass over generated code:

examine a few consecutive instructions: 2 to 4

See if an obvious replacement is possible: store/load pairs

MOV %eax => memaMOV mema => %eax

Can eliminate the second instruction without needing any global knowledge of mema

Use algebraic identities

Special-case individual instructions

Algebraic identities

worth recognizing single instructions with a constant operand:

A * 2 = A + A

A * 1 = A

A * 0 = 0

A / 1 = A

More delicate with floating-point

Is this ever helpful?

Why would anyone write X * 1?

Why bother to correct such obvious junk code?

In fact one might write

#define MAX_TASKS 1...a = b * MAX_TASKS;

Also, seemingly redundant code can be produced by other

optimizations. This is an important effect.

Replace Multiply by Shift

A := A * 4; Can be replaced by 2-bit left shift (signed/unsigned)

But must worry about overflow if language does

A := A / 4; If unsigned, can replace with shift right

But shift right arithmetic is a well-known problem

Language may allow it anyway (traditional C)

Addition chains for multiplication

If multiply is very slow (or on a machine with no multiply instruction like the original SPARC), decomposing a constant operand into sum of powers of two can be effective: X * 125 = x * 128 - x*4 + x two shifts, one subtract and one add, which may

be faster than one multiply Note similarity with efficient exponentiation method

The Right Shift problem

Arithmetic Right shift:

shift right and use sign bit to fill most significant bits

-5 111111...1111111011

SAR 111111...1111111101

which is -3, not -2

in most languages -5/2 = -2

Prior to C99, implementations were allowed to truncate towards or away from zero if either operand was negative

Folding Jumps to Jumps

A jump to an unconditional jump can copy the target address

JNE lab1 ...lab1 JMP lab2

Can be replaced by JNE lab2 As a result, lab1 may become dead (unreferenced)

Jump to Return

A jump to a return can be replaced by a return JMP lab1

... lab1 RET

Can be replaced by RET lab1 may become dead code

Tail Recursion Elimination

A subprogram is tail-recursive if the last computation is a call to itself:

function last (lis : list_type) return lis_type is

begin

if lis.next = null then return lis;

else return last (lis.next);

end;

Recursive call can be replaced with

lis := lis.next;

goto start; -- added label

Advantages of tail recursion elimination

saves time: an assignment and jump is faster than a call with one parameter

saves stack space: converts linear stack usage to constant usage.

In languages with no loops, this may be a required optimization: specified in Scheme standard.

Tail-recursion elimination at the Instruction Level

Consider the sequence on the x86: CALL func

RET CALL pushes return point on stack, RET in body

of func removes it, RET in caller returns Can generate instead:

JMP func Now RET in func returns to original caller,

because single return address on stack

Peephole optimization in the REALIA COBOL compiler

Full compiler for Standard COBOL, targeted to the IBM PC.

Now distributed by Computer Associates

Runs in 150K bytes, but must be able to handle very large programs that run on mainframes

No global optimization possible: multiple linear passes over code, no global data structures, no flow graph.

Multiple peephole optimizations, compiler iterates until code is stable. Each pass scan code backwards to minimize address recomputations

Typical COBOL code: control structures and perform blocks.

Process-Balance. if Balance is negative then perform Send-Bill else perform Record-Credit end-if.Send-Bill. ...Record-Credit. ...

Simple Assembly: perform equivalent to call

Pb: cmp balance, 0 jnl L1 call Sb jmp L2L1: call RcL2: retSb: ... retRc: ... ret

Fold jump to return statement

Pb: cmp balance, 0 jnl L1 call Sb jmp L2 -- jump to return

L1: call Rc L2: retSb: ... retRc: ... ret

Corresponding Assembly

Pb: cmp balance, 0 jnl L1 -- jump to unconditional jump call Sb ret -- folded L1: jmp Rc -- will become useless L2: ret Sb: ... retRc: ... ret

code following a jump is unreachable

Pb: cmp balance, 0 jnl Rc -- folded jmp Sb ret -- unreachable L1: jmp Rc -- unreachable Sb: ... retRc: ... ret

Jump to following instruction is a noop

Pb: cmp balance, 0 jnl Rc jmp Sb -- jump to next instruction Sb: ... retRc: ... ret

Final code

Pb: cmp balance, 0 jnl RcSb: ... retRc: ... ret

Final code as efficient as inlining. All transformations are local. Each optimization

may yield further optimization opportunities Iterate till no further change

Arcane tricks

Consider typical maximum computation if A >= B then

C := A;else C := B;end if;

For simplicity assume all unsigned, and all in registers

Eliminating max jump on x86

Simple-minded asm code CMP A, B

JNAE L1 MOV A=>C JMP L2L1: MOV B=>CL2:

One jump in either case

Computing max without jumps on X86

Architecture-specific trick: use subtract with borrow instruction and carry flag

CMP A, B ; CF=1 if B > A, CF = 0 if A >= B SBB %eax,%eax ; all 1's if B > A, all 0's if A >= B MOV %eax, C NOT C ; all 0's if B > A, all 1's if A >= B AND B=>%eax ; B if B>A, 0 if A>=B AND A=>C ; 0 if B >A, A if A>=B OR %eax=>C ; B if B>A, A if A>=B

More instructions, but NO JUMPS

Supercompiler: exhaustive search of instruction patterns to uncover similar tricks

peephole optimization final pass over generated code: examine a few consecutive instructions: 2 to 4...

Documents

stack slide

label slide

negative slide

floatingpoint slide

dead code slide

return jmp lab1

type return lis

dead unreferenced slide