Day 2: Building Process Virtualization Systems
Kim Hazelwood
ACACES Summer School
July 2009
ACACES 2009 – Process Virtualization
Course Outline
• Day 1 – What is Process Virtualization?
• Day 2 – Building Process Virtualization Systems
• Day 3 – Using Process Virtualization Systems
• Day 4 – Symbiotic Optimization
JIT-Based Process Virtualization
[Figure: JIT loop with the stages Application, Transform, Code Cache, Execute, Profile]
What are the Challenges?
Performance!
Solutions:
• Code caches – only transform code once
• Trace selection – focus on hot paths
• Branch linking – only perform cache lookup once
• Indirect branch hash tables / chaining
• Memory “management”
Correctness – self-modifying code, munmaps, multithreading
Transparency – context switching, eflags
What is the Overhead?
The latest Pin overhead numbers …
[Figure: bar chart of Pin run time relative to native (y-axis 100% to 200%) for perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, hmmer]
Sources of Overhead
Internal
• Compiling code & exit stubs (region detection, region formation, code generation)
• Managing code (eviction, linking)
• Managing directories and performing lookups
• Maintaining consistency (SMC, DLLs)
External
• User-inserted instrumentation
Improving Performance: Code Caches
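[Figure: code cache flowchart. Start with a hash table lookup of the branch target address. Hit: run from the code cache. Miss: Counter++ and ask “Code is hot?”. No: interpret. Yes: region formation & optimization, then ask “Room in code cache?”. No: evict code (delete) first. Then insert the region and update the hash table. Exit stubs lead back to the lookup.]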
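The lookup flow on this slide can be sketched as a small simulation. This is an illustration under stated assumptions, not how Pin or any real system is implemented: the class name `CodeCacheSim`, the hotness threshold, the capacity counted in regions, and the "flush everything" eviction policy are all hypothetical, and translation is reduced to remembering the address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical model of the flowchart; all names and policies are assumptions.
enum class Action { ExecutedFromCache, Interpreted, TranslatedAndInserted };

class CodeCacheSim {
public:
    CodeCacheSim(unsigned hotThreshold, std::size_t capacity)
        : hotThreshold_(hotThreshold), capacity_(capacity), flushes_(0) {
        assert(capacity_ > 0);
    }

    // Handle one branch to `target`, an address in the original program.
    Action Dispatch(std::uint64_t target) {
        if (cached_.count(target)) return Action::ExecutedFromCache;  // hit
        unsigned count = ++counters_[target];                         // miss: Counter++
        if (count < hotThreshold_) return Action::Interpreted;        // not hot yet
        if (cached_.size() >= capacity_) {                            // no room left
            cached_.clear();                                          // evict (full flush)
            ++flushes_;
        }
        cached_.insert(target);   // insert region and update the lookup table
        counters_.erase(target);  // the counter is no longer needed
        return Action::TranslatedAndInserted;
    }

    std::size_t Flushes() const { return flushes_; }
    std::size_t CachedRegions() const { return cached_.size(); }

private:
    unsigned hotThreshold_;
    std::size_t capacity_;
    std::size_t flushes_;
    std::unordered_map<std::uint64_t, unsigned> counters_;  // per-target hotness
    std::unordered_set<std::uint64_t> cached_;              // stands in for the hash table
};
```

With a threshold of 2, the first visit to an address is interpreted, the second triggers translation, and every later visit hits in the cache until a flush removes it.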
Software-Managed Code Caches
• Store transformed code at run time to amortize overhead of process VMs
• Contain a (potentially altered) copy of application code
[Figure: JIT loop with the stages Application, Transform, Code Cache, Execute, Profile]
Code Cache Contents
Every application instruction executed is stored in the code cache (at least)
Code Regions
• Altered copies of application code
• Basic blocks and/or traces
Exit stubs
• Swap application ↔ VM state
• Return control to the process VM
Code Regions
Basic Blocks
Traces
[Figure: basic block “BBL A” consisting of Inst1, Inst2, Inst3, Branch B; a CFG over blocks A, B, C, D; the layout of A, B, C, D in memory; and a trace built from those blocks]
Exit Stubs
One exit stub exists for every exit from every trace or basic block
Functionality
Prepare for context switch
Return control to VM dispatch
Details
Each exit stub ≈ 3 instructions
[Figure: a trace of blocks A, B, D followed by its exit stubs, “Exit to C” and “Exit to E”]
Performance: Trace Selection
• Interprocedural path
• Single entry, multiple exit
[Figure: CFG with blocks A through I, including a call and a return. Layout in memory: A B C D I, G H, E F. Layout in code cache: the trace (superblock) A B D E G H I followed by the exit stubs “Exit to C” and “Exit to F”.]
Performance: Cache Linking
[Figure: Trace #1 with exits #1a and #1b, Trace #2, Trace #3, and the dispatcher]
Linking Traces
Proactive linking
Lazy linking
[Figure: the CFG (blocks A through I with a call and a return) and three traces in the code cache, A B D E G H I, C D E G H I, and F H I, with exit stubs labeled “Exit to A”, “Exit to C”, and “Exit to F”]
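Lazy linking can be sketched as follows: every exit stub starts out pointing at the dispatcher, and the first time an exit is taken to a target that is already cached, the stub is patched so later executions bypass the dispatcher. This is a hypothetical model for illustration only; `LinkingSim`, `ExitStub`, and the lookup counter are assumed names, and real systems patch branch instructions rather than integer indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ExitStub {
    std::uint64_t target;  // original-program address this exit leads to
    int linkedTrace;       // index of the linked trace, or -1 while unlinked
};

struct Trace {
    std::uint64_t start;
    std::vector<ExitStub> exits;
};

// Hypothetical model of lazy linking between traces in a code cache.
class LinkingSim {
public:
    LinkingSim() : dispatchLookups_(0) {}

    // Insert a trace; all of its exits initially go through the dispatcher.
    int AddTrace(std::uint64_t start, const std::vector<std::uint64_t>& exitTargets) {
        Trace t{start, {}};
        for (std::uint64_t target : exitTargets) t.exits.push_back(ExitStub{target, -1});
        traces_.push_back(t);
        int index = static_cast<int>(traces_.size()) - 1;
        directory_[start] = index;
        return index;
    }

    // Take an exit. Returns the next trace index, or -1 if the target is not cached.
    int TakeExit(int trace, std::size_t exit) {
        assert(trace >= 0 && static_cast<std::size_t>(trace) < traces_.size());
        ExitStub& stub = traces_[trace].exits.at(exit);
        if (stub.linkedTrace >= 0) return stub.linkedTrace;  // linked: no dispatcher
        ++dispatchLookups_;                                  // unlinked: cache lookup
        auto it = directory_.find(stub.target);
        if (it == directory_.end()) return -1;               // target not translated yet
        stub.linkedTrace = it->second;                       // patch the stub (lazy link)
        return it->second;
    }

    std::size_t DispatchLookups() const { return dispatchLookups_; }

private:
    std::vector<Trace> traces_;
    std::unordered_map<std::uint64_t, int> directory_;  // start address -> trace index
    std::size_t dispatchLookups_;
};
```

A proactive variant would instead patch every matching stub at the moment a new trace is inserted, paying the linking cost up front even for exits that are never taken.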
Are Links Highly Beneficial?
Benchmark   With Linking   Without Linking   Slowdown
gzip        230 sec        7951 sec          3357%
vpr         333 sec        2474 sec          643%
gcc         206 sec        3284 sec          1494%
mcf         368 sec        2014 sec          447%
crafty      215 sec        3547 sec          1550%
parser      350 sec        6795 sec          1841%
perlbmk     336 sec        6945 sec          1967%
gap         195 sec        4231 sec          2070%
vortex      382 sec        4655 sec          1119%
bzip2       287 sec        4294 sec          1396%
twolf       658 sec        6490 sec          886%
Code Cache Visualization
Challenge: Rewriting Instructions
• We must regularly rewrite branches
• No atomic branch write on x86
• Pin uses a neat trick*:
[Figure: three stages of the rewrite: the “old” 5-byte branch; a 2-byte self branch followed by n-2 bytes of the “new” branch; the “new” 5-byte branch]
* Sundaresan et al. 2006
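The three stages of the trick can be sketched on a plain byte array. This is a simulation only: it does not patch live code and uses no real atomic stores. It assumes the x86 encodings `E9 rel32` for the 5-byte jump and `EB FE` for a 2-byte jump to itself, and it assumes that a 2-byte store to the branch site can be made atomic (for example, because it does not straddle a cache line). The helper names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Branch = std::array<std::uint8_t, 5>;

// Encode "jmp rel32": opcode 0xE9 followed by a little-endian displacement.
Branch EncodeJmpRel32(std::int32_t rel) {
    std::uint32_t u = static_cast<std::uint32_t>(rel);
    return Branch{{0xE9,
                   static_cast<std::uint8_t>(u),
                   static_cast<std::uint8_t>(u >> 8),
                   static_cast<std::uint8_t>(u >> 16),
                   static_cast<std::uint8_t>(u >> 24)}};
}

// True if the site currently starts with "jmp -2", a branch to itself.
bool IsSelfBranch(const Branch& site) {
    return site[0] == 0xEB && site[1] == 0xFE;
}

// Rewrite `site` into `newBranch` in three steps and return the byte image
// after each step, so the intermediate states can be inspected.
std::vector<Branch> RewriteBranch(Branch& site, const Branch& newBranch) {
    std::vector<Branch> states;
    // Step 1: a single 2-byte store installs the self branch. A thread that
    // arrives now spins on these two bytes and never decodes the tail.
    site[0] = 0xEB;
    site[1] = 0xFE;
    states.push_back(site);
    // Step 2: write the remaining n-2 bytes of the new branch at leisure.
    for (std::size_t i = 2; i < site.size(); ++i) site[i] = newBranch[i];
    states.push_back(site);
    // Step 3: a single 2-byte store replaces the self branch with the first
    // two bytes of the new branch, releasing any spinning thread.
    site[0] = newBranch[0];
    site[1] = newBranch[1];
    states.push_back(site);
    assert(site == newBranch);
    return states;
}
```

At every point a thread fetching from the site sees the old branch, the self branch, or the complete new branch, never a mix of old and new displacement bytes.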
Challenge: Achieving Transparency
Pretend as though the original program is executing
Original Code:
0x1000 call 0x4000
Translated Code:
0x7000 push 0x1006
0x7006 jmp 0x8000
Code cache address mapping (SPC → TPC):
0x1000 → 0x7000 “caller”
0x4000 → 0x8000 “callee”
Push 0x1006 on stack, then jump to 0x4000
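The call translation on this slide can be sketched as a lookup in the SPC-to-TPC map: the value pushed on the stack stays an original-program address, while the jump goes to the code cache copy of the callee. The types and function names below are hypothetical, and the sketch takes the return address as an input rather than decoding instruction lengths.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// SPC = source (original) program counter, TPC = translated (code cache) PC.
class AddressMap {
public:
    void Add(std::uint64_t spc, std::uint64_t tpc) { map_[spc] = tpc; }

    // Returns true and sets `tpc` if `spc` has a translation in the cache.
    bool Lookup(std::uint64_t spc, std::uint64_t& tpc) const {
        auto it = map_.find(spc);
        if (it == map_.end()) return false;
        tpc = it->second;
        return true;
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> map_;
};

struct TranslatedCall {
    std::uint64_t pushedValue;  // what the application sees on its stack (an SPC)
    std::uint64_t jumpTarget;   // where execution really continues (a TPC)
};

// Translate "call calleeSpc" whose original return address is `returnSpc`
// into the pair "push returnSpc; jmp TPC(calleeSpc)".
TranslatedCall TranslateCall(const AddressMap& map,
                             std::uint64_t returnSpc,
                             std::uint64_t calleeSpc) {
    std::uint64_t calleeTpc = 0;
    bool found = map.Lookup(calleeSpc, calleeTpc);
    assert(found && "this sketch assumes the callee is already translated");
    (void)found;
    return TranslatedCall{returnSpc, calleeTpc};
}
```

Because the stack holds 0x1006 rather than a code cache address, the application can inspect its return address and see what it would see natively; the cost is that the eventual return must be mapped from SPC back to a TPC at run time.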
Challenge: Self-Modifying Code
The problem
Code cache must detect SMC and invalidate corresponding cached traces
Solutions
Many proposed … but without HW support, they are very expensive!
• Changing page protection
• Memory diff prior to execution
• On ARM, there is an explicit instruction for SMC!
Self-Modifying Code Handler
(Written by Alex Skaletsky)
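#include <cstring>
#include "pin.H"

int main (int argc, char **argv) {
    PIN_Init(argc, argv);
    // Run InsertSmcCheck every time Pin compiles a trace
    TRACE_AddInstrumentFunction(InsertSmcCheck, 0);
    PIN_StartProgram();  // Never returns
    return 0;
}

void InsertSmcCheck (TRACE trace, VOID *v) {
    . . .
    // Snapshot the trace's original bytes at instrumentation time
    memcpy(traceCopyAddr, traceAddr, traceSize);
    // Before each execution of the trace, compare memory against the snapshot
    TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                     IARG_PTR, traceAddr,
                     IARG_PTR, traceCopyAddr,
                     IARG_UINT32, traceSize,
                     IARG_CONTEXT,
                     IARG_END);
}

void DoSmcCheck (VOID *traceAddr, VOID *traceCopyAddr,
                 USIZE traceSize, CONTEXT *ctxP) {
    // If the code changed since it was cached, drop the stale trace and
    // restart execution so the modified code is recompiled
    if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) {
        CODECACHE_InvalidateTrace((ADDRINT)traceAddr);
        PIN_ExecuteAt(ctxP);
    }
}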
Challenge: Parallel Applications
[Figure: Pin architecture with threads T1 and T2. Components: JIT compiler, syscall emulator, signal emulator, dispatcher, code cache, and the Pin tool (instrumentation code, call-back handlers, analysis code). The components are marked as either serialized or parallel.]
Challenge: Code Cache Consistency
Cached code must be removed for a variety of reasons:
• Dynamically unloaded code
• Ephemeral/adaptive instrumentation
• Self-modifying code
• Bounded code caches
[Figure: JIT loop with the stages EXE, Transform, Code Cache, Execute, Profile]
Motivating a Bounded Code Cache
[Figure: “The Perl Benchmark”. Performance relative to native (y-axis 100% to 400%) for input1, input2, input3, and total, with an unlimited code cache and with 2.5 MB, 2.0 MB, 1.5 MB, and 1.0 MB code caches]
Flushing the Code Cache
• Option 1: All threads have a private code cache (oops, doesn’t scale)
• Option 2: Shared code cache across threads
• If one thread flushes the code cache, other threads may resume in stale memory
[Figure: bar chart of trace memory increase (y-axis 0% to 600%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, ammp]
Naïve Flush
Wait for all threads to return to the VM (that is, to leave the code cache)
Could wait indefinitely!
[Figure: timeline of Thread1, Thread2, and Thread3 moving between the VM and the code cache (CC1 before the flush, CC2 after it), with VM stalls during the flush delay]
Generational Flush
Allow threads to continue to make progress in a separate area of the code cache
[Figure: timeline of Thread1, Thread2, and Thread3 moving between the VM and the code cache (CC1, then CC2) without stalling]
Requires a high water mark
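One way to picture a generational flush is as bookkeeping over cache generations: a flush opens a new generation immediately, and an old generation is reclaimed only once the last thread still executing in it returns to the VM. The class below is a hypothetical single-threaded model of that bookkeeping (no locking, no real memory), with names chosen for this example. It also suggests why a high water mark is needed: the old generation's memory stays allocated until it drains, so the flush has to start before the cache is completely full.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical bookkeeping model of a generational code cache flush.
class GenerationalCache {
public:
    GenerationalCache() : current_(0) { occupants_[current_] = 0; }

    // A thread leaves the VM and starts executing in the newest generation.
    void EnterCache(int thread) {
        assert(threadGen_.count(thread) == 0);
        threadGen_[thread] = current_;
        ++occupants_[current_];
    }

    // A thread returns to the VM. If it was the last thread inside a retired
    // generation, that generation's memory can now be reclaimed.
    void ExitCache(int thread) {
        auto it = threadGen_.find(thread);
        assert(it != threadGen_.end());
        int gen = it->second;
        threadGen_.erase(it);
        if (--occupants_[gen] == 0 && gen != current_) occupants_.erase(gen);
    }

    // Flush without waiting: retire the current generation and open a new one.
    void Flush() {
        if (occupants_[current_] == 0) occupants_.erase(current_);  // empty: free now
        ++current_;
        occupants_[current_] = 0;
    }

    // Generations whose memory is still allocated (current plus undrained old ones).
    std::size_t LiveGenerations() const { return occupants_.size(); }
    int CurrentGeneration() const { return current_; }

private:
    int current_;
    std::map<int, int> occupants_;  // generation -> threads executing inside it
    std::map<int, int> threadGen_;  // thread -> generation it is executing in
};
```

In contrast to the naïve flush, no thread ever stalls here; a thread blocked inside old code only delays reclamation of that one generation.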
Build-Your-Own Cache Replacement
int main(int argc, char **argv) {
    PIN_Init(argc, argv);
    // Ask Pin to call FlushOnFull whenever the code cache fills up
    CODECACHE_CacheIsFull(FlushOnFull);
    PIN_StartProgram();  // Never returns
    return 0;
}

void FlushOnFull() {
    CODECACHE_FlushCache();
    cout << "SWOOSH!" << endl;
}
% pin -cache_size 40960 -t flusher -- /bin/ls
SWOOSH!
SWOOSH!
Eviction Granularities
• Entire Cache
• One Cache Block
• One Trace
• Address Range
A Graphical Front-End
Memory Scalability of the Code Cache
Ensuring scalability also requires carefully configuring the code stored in the cache
Trace Lengths
• First basic block is non-speculative, others are speculative
• Longer traces = fewer entries in the lookup table, but more unexecuted code
• Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code
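The trade-off in these bullets can be made concrete with a deliberately simplified count. Assume a hot path of `blocks` basic blocks, each ending in a conditional branch, cut into traces of at most `maxBlocksPerTrace` blocks. Each block contributes one off-trace exit, and the last block of every trace contributes a second one, since both of its successors leave the trace. The function names and the model itself are assumptions for illustration; the model ignores the unexecuted speculative code that longer traces drag in.

```cpp
#include <cassert>
#include <cstddef>

// Number of traces (and therefore lookup table entries) needed for the path.
std::size_t TraceCount(std::size_t blocks, std::size_t maxBlocksPerTrace) {
    assert(maxBlocksPerTrace > 0);
    return (blocks + maxBlocksPerTrace - 1) / maxBlocksPerTrace;  // ceiling division
}

// Number of exit stubs: one off-trace exit per block, plus one extra exit
// for the block that ends each trace.
std::size_t ExitStubCount(std::size_t blocks, std::size_t maxBlocksPerTrace) {
    return blocks + TraceCount(blocks, maxBlocksPerTrace);
}
```

For a 32-block path under this model, a limit of 1 block per trace gives 32 traces and 64 exit stubs, while a limit of 8 gives 4 traces and 36 exit stubs.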
Effect of Trace Length on Trace Count
[Figure: total trace count (y-axis 0 to 16000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, ammp, with traces limited to 1, 2, 4, 8, 16, and 32 basic blocks]
Effect of Trace Length on Memory
[Figure: stacked bars of code cache footprint in KB (y-axis 0 to 10500) versus basic blocks per trace (1, 2, 4, 8, 16, 32), broken down into lookup table, links, exit stubs, and traces]
Sources of Overhead
Internal
• Compiling code & exit stubs (region detection, region formation, code generation)
• Managing code (eviction, linking)
• Managing directories and performing lookups
• Maintaining consistency (SMC, DLLs)
External
• User-inserted instrumentation
Adding Instrumentation
[Figure: run time relative to native (y-axis 100% to 800%) for perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, hmmer, comparing Pin and Pin+icount]
“Normal Pin” Execution Flow
Instrumentation is interleaved with application
[Figure: timelines of the uninstrumented application, the instrumented application, and the “Pinned” application, with the added time split into Pin overhead and instrumentation overhead]
“SuperPin” Execution Flow
SuperPin creates instrumented slices
[Figure: timelines of the uninstrumented application and the SuperPinned application, with instrumented slices running alongside]
Issues and Design Decisions
Creating slices
• How/when to start a slice
• How/when to end a slice
System calls
Merging results
[Figure: SuperPin execution timeline over time on CPU1 through CPU4. The original application forks the instrumented application slices S1+ through S6+. Event labels on the slices: “fork”, “record sigrN, sleep” (N = 2 to 6), “resume”, “detect sigrN”, and “detect exit”.]
Performance – icount1
% pin -t icount1 -- <benchmark>
[Figure: run time relative to native (y-axis 0% to 3000%) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, perlbmk, sixtrack, swim, twolf, vortex, vpr, wupwise, and the average, comparing Pin and SuperPin]
What Did We Learn Today?
Building Process VMs is only half the battle
Robustness, correctness, performance are paramount
Lots of “tricks” are in play
• Code caches, trace selection, etc.
Knowing about these tricks is beneficial
• Lots of research opportunities
• Understanding the inner workings often helps you write better tools
Want More Info?
• Read the seminal Dynamo paper
• See the more recent papers by the Pin, DynamoRIO, Valgrind teams
• Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT