IBM Toronto Lab

CASCON 2006 2006-10-16 © 2006 IBM Corporation

Avoiding Live Lock when Patching Code in Real-Time Execution Environments

Mark Stoodley

Real-Time Java Compiler Development

IBM Toronto Lab

Outline

Code patching background

The live lock problem

Two ways to avoid live lock

Two examples

1. Resolving a static field reference

2. Updating target of virtual invocation cache

Code patching

JIT compilers generate code designed to be modified during execution

– Resolution for classes, fields, methods

– Fill in virtual/interface invocation caches

– Lazy call target update after (re)compilation

– Fixups for virtual guards

Typically performed by application threads via runtime helper

– Another thread may execute during modification

Code patching is Hard

Multiple threads executing while patching

Processors not designed to support it well

– Undocumented coherence requirements/loopholes

– Not designed to be fast

Prevent execution of inconsistent instructions

Strongly influenced by instruction set

– Atomic writes: how much can you change at once

Code patching is Hard

Goal is quality code after patching

Interacts with lots of other complex things

– Exception handling, stack walking

– Class loading and resolution rules

– Implementation induced complexities

Result is usually a complex dance

– Careful design and layout of generated code

– Careful orchestration of steps

Example: Static field resolution (Intel x86)

inc dword ptr[0h]

ff 05 00 00 00 00

call Lsnippet

db 00h

Lsnippet:

push 024h ; cp index

push 08564ach ; const pool

call unresolvedStaticGlue

db 0ff05h

Make sure first execution goes to snippet: generate 5B call instead of 6B inc

e8 d4 02 00 00 00

Generate 5B call, but make space for 6B inc

call Lsnippet

db 00h

Lsnippet:




db 0ff05h

After resolving static address, glue prepares to patch 6B instruction

Resolves field to 088aa5ach

Need to patch 6 bytes atomically: Three step process

e8 d4 02 00 00 00

jmp -2

db 00000002h

Lsnippet:

push 024h ; cpIndex



db 0ff05h

Step 1: protect against multiple threads by patching self-loop

2-byte self loop (JMP -2)

2 bytes cannot cross patching boundary (8B on AMD64)

eb fe 02 00 00 00

After patching fence, these 4 bytes can now be patched with static address

Patching fence = mfence, clflush, mfence


Lsnippet:

push 024h ; cpIndex



db 0ff05h


jmp -2

db 088aa5ach

Step 2: write resolved static address in 4-byte field protected by self-loop

2-byte self loop (JMP -2)

eb fe ac a5 8a 08

Write resolved static address (nonatomic)

Then, patching fence to ensure all threads see address before self loop is removed


Lsnippet:

push 024h ; cpIndex



db 0ff05h


inc dword ptr[088aa5ach]

Step 3: remove the self-loop and restore the original instruction bytes

ff 05 ac a5 8a 08

Benefits:

thread-safe

final code quality good

BUT:

uses self-loop
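
The three steps can be replayed on a byte buffer. This is a minimal sketch in Python, used here only as a simulation language: it performs no real code patching and issues no fences, the helper name `patch_with_self_loop` is hypothetical, and the byte values are the ones shown on the slides.

```python
# Byte values come from the slides; the helper name is illustrative only.
ORIGINAL = bytes.fromhex("e8d402000000")   # call Lsnippet + 1 padding byte
SELF_LOOP = bytes.fromhex("ebfe")          # jmp -2
INC_OPCODE = bytes.fromhex("ff05")         # inc dword ptr [imm32]
RESOLVED = 0x088AA5AC                      # resolved static field address


def patch_with_self_loop(site, address):
    """Apply the three patching steps and record the site after each one."""
    states = [site.hex(" ")]
    # Step 1: one 2-byte write installs the self-loop that guards bytes 2..5.
    site[0:2] = SELF_LOOP
    states.append(site.hex(" "))
    # Step 2: the 4-byte address is written behind the guard (little-endian).
    site[2:6] = address.to_bytes(4, "little")
    states.append(site.hex(" "))
    # Step 3: one 2-byte write replaces the self-loop with the real opcode.
    site[0:2] = INC_OPCODE
    states.append(site.hex(" "))
    return states


states = patch_with_self_loop(bytearray(ORIGINAL), RESOLVED)
for step, state in enumerate(states):
    print(f"after step {step}: {state}")
```

The printed states match the byte dumps on the slides: `e8 d4 02 00 00 00`, `eb fe 02 00 00 00`, `eb fe ac a5 8a 08`, `ff 05 ac a5 8a 08`.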

Busy-wait loops can be BAD

Employed for safety in code patching

FIFO scheduling in Real-Time OS can result in live lock

T1 (priority 10)                  T2 (priority 20)
Resolve field ref
Patch self-loop over instr
Preempted by T2                   T2 wakes up
                                  Tries to execute same field ref
                                  T2 stuck in self-loop
T1 can never remove self-loop: live lock
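
The timeline can be reproduced with a toy scheduler. This is a sketch under stated assumptions: a single processor, strict priority scheduling in which the highest-priority runnable thread always runs, and a bounded tick count (`max_ticks`) standing in for "forever". The function name `simulate` and the tick model are hypothetical and not part of any real-time OS API.

```python
def simulate(t2_wakes_at, max_ticks=1000):
    """Strict priority scheduling with two threads; T2 outranks T1.

    max_ticks is only a stand-in for "forever" so the simulation terminates.
    """
    self_loop_present = False
    t1_steps_done = 0      # T1 needs 3 steps: add loop, write address, remove loop
    t2_done = False
    for tick in range(max_ticks):
        if tick >= t2_wakes_at and not t2_done:
            # T2 is runnable, so the scheduler never picks T1.
            if self_loop_present:
                continue   # T2 executes jmp -2: it spins without ever blocking
            t2_done = True
        elif t1_steps_done < 3:
            t1_steps_done += 1
            # The loop exists after step 1 and is removed by step 3.
            self_loop_present = t1_steps_done < 3
        if t2_done and t1_steps_done == 3:
            return "both threads finished"
    return "live lock"


print(simulate(t2_wakes_at=1))   # T2 wakes while the self-loop is in place
print(simulate(t2_wakes_at=3))   # T2 wakes after T1 has removed the self-loop
```

The first call reports `live lock` and the second `both threads finished`: the outcome depends only on whether the higher-priority thread arrives while the self-loop is installed.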

Busy-wait-free code patching: no live lock

Two basic approaches

1. All threads do idempotent patch: let them all do it

• Cache line ping-pong effect may be slow(er) but correct

2. Only one thread must patch: construct backup path

• Direct threads that arrive while patching to backup path

• Slower but correct execution

Sometimes lowers resulting code quality

Example: Static field resolution, no live lock

inc dword ptr[0h]

Lsnippet:




ff 05 00 00 00 00

inc dword ptr[0h]

Lsnippet:




ff 05 00 00 00 00

e8 d4 02 00 00 call Lsnippet

5B call generated explicitly ahead of the instruction to be resolved

inc dword ptr[088aa5ach]

Lsnippet:




ff 05 ac a5 8a 08


After resolving static address, glue patches the memory ref instruction


Note: any threads that reach the glue ALL patch the memory ref instruction

BUT all threads will patch same value, so no races
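
One way to see why identical writes cannot race is to enumerate interleavings. The sketch below assumes, pessimistically, that each of two threads writes the 4-byte address one byte at a time; the slides do not state the write width. It checks that whenever a thread finishes its own writes (and would return from the glue to execute the instruction), the field is complete. The function name is hypothetical.

```python
from itertools import combinations

ADDRESS = (0x088AA5AC).to_bytes(4, "little")   # resolved address from the slides


def check_idempotent_patch():
    """Try every interleaving of two threads that each write 4 address bytes."""
    checked = 0
    # Choosing which 4 of the 8 write slots belong to thread A fixes the order.
    for a_slots in combinations(range(8), 4):
        field = bytearray(4)                   # unresolved field: 00 00 00 00
        next_byte = {"A": 0, "B": 0}
        for slot in range(8):
            thread = "A" if slot in a_slots else "B"
            i = next_byte[thread]
            field[i] = ADDRESS[i]              # both threads write the same value
            next_byte[thread] = i + 1
            if next_byte[thread] == 4:
                # This thread is about to leave the glue and run the
                # instruction, so the address it reads must be complete.
                assert bytes(field) == ADDRESS
        checked += 1
    return checked


print(check_idempotent_patch(), "interleavings checked")
```

The field may be partially written while another thread is still in the glue, but no thread executes the instruction before completing its own copy of the write, so it never observes a torn address.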

Lsnippet:




ff 05 ac a5 8a 08


Now need to get rid of call to snippet, since ref has been resolved

Patch a 5-byte NOP over the call:

lea eax, ds:[eax]

BUT: can’t do it atomically in one shot, need 3 steps again

NOTE that any thread can now safely execute the memory reference instruction because it’s been patched

Resolves field to 088aa5ach

Lsnippet:




ff 05 ac a5 8a 08

eb 03 02 00 00 jmp +3

db 000002h

Step 1: patch short jump JMP +3 to memory ref instruction (lock cmpxchg)


Lsnippet:




ff 05 ac a5 8a 08

eb 03 44 20 00 jmp +3

db 002044h

Step 2: patch last three bytes of 5-byte NOP instruction


Lsnippet:

push 024h ; cpIndex



ff 05 ac a5 8a 08

3e 8d 44 20 00 lea eax, ds:[eax]

Step 3: patch first 2 bytes of 5 byte NOP over the JMP +3

Benefits:

thread-safe

no live lock because no busy-waits

BUT:

5-byte NOP residue

hot code size increase

Resolves field to 088aa5ach
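
The NOP sequence can be checked state by state. This sketch lists the four byte states from the slides and classifies what a thread arriving at the 5-byte site would do in each. The classifier `first_action` is a hypothetical, deliberately tiny stand-in for an instruction decoder and recognizes only the three encodings that appear here.

```python
NOP5 = bytes.fromhex("3e8d442000")   # lea eax, ds:[eax]

STATES = [
    bytes.fromhex("e8d4020000"),     # original: call Lsnippet
    bytes.fromhex("eb03020000"),     # step 1: jmp +3 over the last 3 bytes
    bytes.fromhex("eb03442000"),     # step 2: last 3 bytes of the NOP written
    NOP5,                            # step 3: first 2 bytes of the NOP written
]


def first_action(site):
    """Classify what a thread arriving at the 5-byte site would execute."""
    if site[0] == 0xE8:
        return "call snippet"
    if site[0:2] == b"\xeb\x03":
        return "jump to inc"         # skips the 3 bytes being rewritten
    if site == NOP5:
        return "nop, fall through to inc"
    return "INVALID"


def changed(before, after):
    """Byte offsets that differ between two consecutive states."""
    return [i for i in range(5) if before[i] != after[i]]


actions = [first_action(s) for s in STATES]
changes = [changed(a, b) for a, b in zip(STATES, STATES[1:])]
print(actions)
print(changes)
```

Every state leads a thread either into the snippet or onto the already-patched `inc`, and each step touches only the 2-byte head or the 3-byte tail of the site, never both.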

Example 2: Virtual invocation cache

Virtual invocation o.foo()

– Target method depends on class of receiver object o

– Full virtual dispatch uses lookup in o’s class virtual function table

• Expensive: indirection from object’s class

For performance, use virtual invocation cache

– if (receiver class is C) call C.foo(); else call o.foo();
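
The guarded call can be sketched at a high level. The classes `C` and `D` and the factory `make_call_site` below are hypothetical illustrations; a JIT emits the equivalent test as machine code rather than as a closure.

```python
class C:
    def foo(self):
        return "C.foo"


class D(C):
    def foo(self):
        return "D.foo"


def make_call_site(cached_class):
    """Build a call site for o.foo() with a one-entry invocation cache."""
    cached_target = cached_class.foo       # direct target, looked up once

    def call_site(o):
        if type(o) is cached_class:
            return cached_target(o)        # cache hit: direct call
        return type(o).foo(o)              # miss: full lookup through o's class

    return call_site


site = make_call_site(C)
print(site(C()))   # receiver class matches the cache
print(site(D()))   # receiver class differs, full dispatch
```

The hit path costs one compare and a direct call; the miss path stays correct for any receiver class.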

cmp ebx, <CLASS C>            81 f9 CC CC CC CC
jne FullDispatchSnip          0f 85 FD FD FD FD
call <C.foo() entry>          e8 TT TT TT TT
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

0xCCCCCCCC, 0x0f85, and 0xFDFDFDFD must be patched atomically to initialize cache

cmp ebx, <CLASS C>            81 f9 CC CC CC CC
jne FullDispatchSnip          0f 85 FD FD FD FD
call <C.foo() entry>          e8 Ti Ti Ti Ti
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

If target Ti not compiled when cache initialized, then patch new target Tc over Ti after C.foo() is compiled (actually next time called)

cmp ebx, <CLASS C>            81 f9 CC CC CC CC
jne FullDispatchSnip          0f 85 FD FD FD FD
call <C.foo() entry>          e8 Ti Ti Ti Ti
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

This cache cannot be placed so that none of these fields cross 8B patching boundary

If target not compiled yet, target written into cache is address of glue function

Glue function looks at target: compiled yet?

– If not compiled, transition to interpreter

– If compiled, patch compiled target into cache

Problem: can’t write entire target atomically

– Can atomically write first 2 bytes of call instruction

– Fancy footwork to avoid writing full target atomically

cmp ebx, <CLASS C>            81 f9 CC CC CC CC
jne FullDispatchSnip          0f 85 FD FD FD FD
call TgTgTgTg                 e8 Tg Tg Tg Tg
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

Patching boundary can fall before 0f or after 85: same as if call didn’t need patching
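
Both placement claims can be checked by brute force. The field offsets below are an assumption read off the byte listing (6-byte `cmp` with a 4-byte immediate, 6-byte `jne` with a 4-byte displacement, 5-byte `call` with a 4-byte target); the function names are hypothetical.

```python
# (offset from start of the cmp instruction, length in bytes)
FIELDS = {
    "class pointer (CC CC CC CC)": (2, 4),
    "jne opcode (0f 85)": (6, 2),
    "jne displacement (FD FD FD FD)": (8, 4),
    "call target (T T T T)": (13, 4),
}


def crosses(start, length, boundary=8):
    """True if bytes [start, start+length) straddle a patching boundary."""
    return start // boundary != (start + length - 1) // boundary


def safe_alignments(fields):
    """Start addresses (mod 8) at which no listed field crosses a boundary."""
    return [base for base in range(8)
            if not any(crosses(base + off, n) for off, n in fields.values())]


all_fields = safe_alignments(FIELDS)
without_call = safe_alignments(
    {name: f for name, f in FIELDS.items() if not name.startswith("call")})
print(all_fields)      # placements that keep every field inside one 8B window
print(without_call)    # placements if the call target need not be atomic
```

Under these assumptions the first list is empty, which agrees with the claim that no placement protects all four fields. The second list is `[0, 2]`: at offset 0 the boundary falls right after `85`, and at offset 2 it falls right before `0f`.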

cmp ebx, <CLASS C>            81 f9 CC CC CC CC
jne FullDispatchSnip          0f 85 FD FD FD FD
call TgTgTgTg                 e8 Tg Tg Tg Tg
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

Patching has several steps so cannot allow multiple threads to proceed: establish backup path to full dispatch

cmp ebx, 0ffffffffh           81 f9 ff ff ff ff
jne FullDispatchSnip          0f 85 FD FD FD FD
call TgTgTgTg                 e8 Tg Tg Tg Tg
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

First, clear out class pointer: effectively converts ‘jne’ into ‘jmp’

(atomic compare and exchange)

Patching Fence

cmp ebx, 0ffffffffh           81 f9 ff ff ff ff
jne FullDispatchSnip          0f 85 FD FD FD FD
jmp -14                       eb f2
db TgTgTg                     Tg Tg Tg
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

Next, protect last 3 bytes of call instruction with JMP -14 (back to compare instruction)

Patching Fence

cmp ebx, 0ffffffffh           81 f9 ff ff ff ff
jne FullDispatchSnip          0f 85 FD FD FD FD
jmp -14                       eb f2
db TcTcTc                     Tc Tc Tc
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

Now we can patch the last three bytes of the call with the new target Tc

Patching Fence

cmp ebx, 0ffffffffh           81 f9 ff ff ff ff
jne FullDispatchSnip          0f 85 FD FD FD FD
call TcTcTcTc                 e8 Tc Tc Tc Tc
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

Remove the JMP -14 by putting the call instruction back

cmp ebx, <CLASS C>            81 f9 CC CC CC CC
jne FullDispatchSnip          0f 85 FD FD FD FD
call TcTcTcTc                 e8 Tc Tc Tc Tc
Continue:

FullDispatchSnip:
mov ecx, [ebx-<VFT slot>]
call ecx
jmp Continue

Finally, put the true class pointer back into the compare instruction
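
The five-step update can be replayed on a 17-byte buffer to confirm that no thread ever calls a half-written target. This is a sketch with made-up values: the class pointer `CLASS_C`, the targets `TG` and `TC`, and the `jne` displacement are hypothetical, and the two `run_from_*` functions model only a thread entering at the compare and a thread that has already passed the `jne`.

```python
CLASS_C, TG, TC = 0x00C0FFEE, 0x0A0B0C0D, 0x01020304   # hypothetical values


def build(class_ptr, target):
    """cmp imm32 (6B) + jne rel32 (6B) + call rel32 (5B) = 17 bytes."""
    return bytearray(b"\x81\xf9" + class_ptr.to_bytes(4, "little")
                     + b"\x0f\x85" + (0x10).to_bytes(4, "little")
                     + b"\xe8" + target.to_bytes(4, "little"))


def run_from_cmp(site, receiver_class):
    """A thread entering the cache at the compare instruction."""
    if int.from_bytes(site[2:6], "little") != receiver_class:
        return "full dispatch"
    return run_from_call(site, receiver_class)


def run_from_call(site, receiver_class):
    """A thread that already passed the jne and is about to run the call."""
    if site[12] == 0xEB:                     # jmp -14: back to the compare
        return run_from_cmp(site, receiver_class)
    return f"call {int.from_bytes(site[13:17], 'little'):#x}"


def update_steps(site, class_ptr, new_target):
    """Yield a label after each of the five writes described on the slides."""
    new = new_target.to_bytes(4, "little")
    site[2:6] = b"\xff\xff\xff\xff"
    yield "clear class pointer"
    site[12:14] = b"\xeb\xf2"
    yield "jmp -14 over first 2 bytes of call"
    site[14:17] = new[1:4]
    yield "write last 3 target bytes"
    site[12:14] = b"\xe8" + new[0:1]
    yield "restore call opcode and first target byte"
    site[2:6] = class_ptr.to_bytes(4, "little")
    yield "restore class pointer"


site = build(CLASS_C, TG)
allowed = {"full dispatch", f"call {TG:#x}", f"call {TC:#x}"}
trace = []
for label in update_steps(site, CLASS_C, TC):
    outcomes = (run_from_cmp(site, CLASS_C), run_from_call(site, CLASS_C))
    assert set(outcomes) <= allowed          # never a torn target
    trace.append((label, outcomes))
    print(f"{label}: {outcomes}")
```

While the class pointer is cleared, the `jmp -14` is not a busy-wait: a thread sent back to the compare fails it and leaves through the full-dispatch backup path.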

Summary

Modern JITs generate code that can patch itself via runtime helpers

– Helpers are complex, hand-written assembler

– Interactions with class loading, stack walking

– Busy-wait loops employed to prevent thread races

Real-Time operating systems use FIFO scheduling

– Busy-wait loops can result in live lock

Summary

Avoid live lock with two techniques:

1. If same value to be written, let all threads write it

2. If only one thread can write, establish backup path first for all threads but one to use

Two examples

– Unresolved static field reference

– Updating virtual invocation cache target when it has been (re)compiled

Questions?

Mark Stoodley

IBM Toronto Lab

[email protected]
