Day 2: Building Process Virtualization Systems

Kim Hazelwood
ACACES Summer School
July 2009

ACACES 2009 – Process Virtualization

Course Outline

• Day 1 – What is Process Virtualization?

• Day 2 – Building Process Virtualization Systems

• Day 3 – Using Process Virtualization Systems

• Day 4 – Symbiotic Optimization


JIT-Based Process Virtualization

[Figure: the Application passes through a loop of Transform, Code Cache, Execute, and Profile stages]


What are the Challenges?

Performance!

Solutions:
• Code caches – only transform code once
• Trace selection – focus on hot paths
• Branch linking – only perform cache lookup once
• Indirect branch hash tables / chaining
• Memory “management”

Correctness – self-modifying code, munmaps, multithreading

Transparency – context switching, eflags


What is the Overhead?

The latest Pin overhead numbers …

[Figure: bar chart of Pin run time relative to native (y-axis 100% to 200%) for perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, and hmmer]


Sources of Overhead

Internal

• Compiling code & exit stubs (region detection, region formation, code generation)

• Managing code (eviction, linking)

• Managing directories and performing lookups

• Maintaining consistency (SMC, DLLs)

External

• User-inserted instrumentation


Improving Performance: Code Caches

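[Flowchart]

• Start: interpret, then look up the branch target address in the hash table
• Hit: execute from the code cache; an exit stub leads back to the hash table lookup
• Miss: Counter++, then ask: Code is hot?
• No: keep interpreting
• Yes: region formation & optimization, then ask: Room in code cache?
• No: evict code (delete) first
• Yes: insert the region and update the hash table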
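The lookup, count, insert, and evict steps in the flowchart can be sketched as a small model. This is only an illustration under stated assumptions: `CodeCacheModel`, its method names, the hot threshold, and the FIFO eviction policy are all hypothetical (the slide does not name a policy), and "compiling" a region is reduced to recording its address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>

// Hypothetical model of the flowchart: addresses are plain integers and a
// cached region is represented only by its starting address.
class CodeCacheModel {
public:
    enum class Action { ExecutedFromCache, Interpreted, Compiled };

    // capacity: number of regions the cache can hold (assumed >= 1).
    // hotThreshold: executions needed before a target counts as hot.
    CodeCacheModel(std::size_t capacity, unsigned hotThreshold)
        : capacity_(capacity), hotThreshold_(hotThreshold) {}

    // Called once for every branch target that reaches the dispatcher.
    Action dispatch(std::uint64_t target) {
        // Hash table lookup: a hit means the region runs from the cache.
        if (cached_.count(target) != 0) return Action::ExecutedFromCache;

        // Miss: interpret and bump the counter; stay here until hot.
        if (++counters_[target] < hotThreshold_) return Action::Interpreted;

        // Hot: make room if needed (FIFO eviction is an assumption).
        while (!order_.empty() && order_.size() >= capacity_) {
            cached_.erase(order_.front());
            order_.pop_front();
            ++evictions_;
        }

        // Insert the new region and update the lookup table.
        order_.push_back(target);
        cached_.insert(target);
        counters_.erase(target);
        return Action::Compiled;
    }

    bool isCached(std::uint64_t target) const { return cached_.count(target) != 0; }
    unsigned evictions() const { return evictions_; }

private:
    std::size_t capacity_;
    unsigned hotThreshold_;
    unsigned evictions_ = 0;
    std::unordered_map<std::uint64_t, unsigned> counters_;  // per-target counters
    std::unordered_set<std::uint64_t> cached_;              // lookup table
    std::deque<std::uint64_t> order_;                       // insertion order
};
```

With a threshold of 3, the first two visits to a target are interpreted, the third compiles it, and later visits hit in the cache.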


Software-Managed Code Caches

• Store transformed code at run time to amortize overhead of process VMs
• Contain a (potentially altered) copy of application code

[Figure: the Application passes through a loop of Transform, Code Cache, Execute, and Profile stages]


Code Cache Contents

Every application instruction executed is stored in the code cache (at least)

Code Regions

• Altered copies of application code

• Basic blocks and/or traces

Exit stubs

• Swap application ↔ VM state

• Return control to the process VM


Code Regions

Basic Blocks

BBL A: Inst1, Inst2, Inst3, Branch B

Traces

[Figure: blocks A, B, C, and D shown as a CFG, as laid out in memory (A, B, C, D), and as a trace built from A, B, C, and two copies of D]


Exit Stubs

One exit stub exists for every exit from every trace or basic block

Functionality

Prepare for context switch

Return control to VM dispatch

Details

Each exit stub ≈ 3 instructions

[Figure: a trace of blocks A, B, D followed by its exit stubs, “Exit to C” and “Exit to E”]


Performance: Trace Selection

• Interprocedural path
• Single entry, multiple exit

[Figure: a CFG of blocks A through I that includes a call and a return. Layout in memory: A B C D I, G H, E F. Layout in code cache: the trace (superblock) A B D E G H I followed by exit stubs “Exit to C” and “Exit to F”.]


Performance: Cache Linking

[Figure: Trace #1 with exit stubs Exit #1a and Exit #1b, together with Dispatch, Trace #2, and Trace #3]


Linking Traces

Proactive linking vs. lazy linking

[Figure: the CFG of blocks A through I (with a call and a return) and three traces in the code cache, ABDEGHI, CDEGHI, and FHI, whose exit stubs lead to C, F, and A]
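As a rough illustration of the lazy variant, the sketch below patches an exit stub the first time it is taken and its target trace is found, so later traversals skip the dispatcher. The reading of "lazy" as "link on first use" is the usual one but is an assumption here, and `ExitStub`, `Linker`, and the integer trace ids are hypothetical names, not Pin data structures. A proactive linker would instead patch stubs at the moment a trace is inserted.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical exit stub: starts unlinked and always goes through dispatch.
struct ExitStub {
    std::uint64_t target;  // original-program address this exit leads to
    int linkedTrace;       // id of the linked trace, or -1 if unlinked
    explicit ExitStub(std::uint64_t t) : target(t), linkedTrace(-1) {}
};

class Linker {
public:
    // Register a trace that begins at the given original-program address.
    void addTrace(std::uint64_t start, int traceId) { traces_[start] = traceId; }

    // Follow an exit. Returns the trace id reached, or -1 if the target has
    // not been translated yet (the VM would interpret or compile it).
    int follow(ExitStub& stub) {
        if (stub.linkedTrace >= 0) return stub.linkedTrace;  // direct jump
        ++dispatches_;                                       // costly lookup
        auto it = traces_.find(stub.target);
        if (it == traces_.end()) return -1;
        stub.linkedTrace = it->second;                       // patch the stub
        return it->second;
    }

    unsigned dispatches() const { return dispatches_; }

private:
    std::unordered_map<std::uint64_t, int> traces_;
    unsigned dispatches_ = 0;
};
```

Following the same stub twice costs one dispatcher lookup, not two, which is the saving that linking is meant to provide.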


Are Links Highly Beneficial?

Benchmark   With Linking   Without Linking   Slowdown
gzip        230 sec        7951 sec          3357%
vpr         333 sec        2474 sec          643%
gcc         206 sec        3284 sec          1494%
mcf         368 sec        2014 sec          447%
crafty      215 sec        3547 sec          1550%
parser      350 sec        6795 sec          1841%
perlbmk     336 sec        6945 sec          1967%
gap         195 sec        4231 sec          2070%
vortex      382 sec        4655 sec          1119%
bzip2       287 sec        4294 sec          1396%
twolf       658 sec        6490 sec          886%
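The slowdown column matches the percentage increase in run time, (without − with) / with, rounded to the nearest percent. A small sketch to reproduce it; the function name `slowdownPercent` is made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Percentage increase in run time when linking is disabled, rounded to the
// nearest whole percent. Both arguments are run times in seconds.
long slowdownPercent(double withLinkingSec, double withoutLinkingSec) {
    return std::lround((withoutLinkingSec - withLinkingSec) / withLinkingSec * 100.0);
}
```

For gzip, (7951 − 230) / 230 ≈ 33.57, so the run without linking takes about 3357% longer.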


Code Cache Visualization


Challenge: Rewriting Instructions

• We must regularly rewrite branches
• No atomic branch write on x86
• Pin uses a neat trick*:

[Figure: the “old” 5-byte branch is first overwritten with a 2-byte self branch followed by n-2 bytes of the “new” branch, and then becomes the “new” 5-byte branch]

* Sundaresan et al. 2006
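The three states in the figure can be simulated on an ordinary byte buffer. This is a sketch, not Pin's implementation: it assumes the branch is an x86 `jmp rel32` (opcode `E9`) and that the self branch is `EB FE` (a short jump to itself). A real patcher would also need each 2-byte store to be a single atomic write to executable memory, which plain `memcpy` does not guarantee. `encodeJmp` and `patchBranch` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Branch = std::array<std::uint8_t, 5>;  // E9 followed by a 32-bit offset

// Encode "jmp rel32". Assumes a little-endian host, as on x86.
Branch encodeJmp(std::int32_t rel32) {
    Branch b{};
    b[0] = 0xE9;
    std::memcpy(&b[1], &rel32, sizeof rel32);
    return b;
}

// Rewrite the 5-byte branch at `site` in three steps. `observe` is called
// after every step so a caller can inspect each intermediate state.
template <typename Observer>
void patchBranch(std::uint8_t* site, const Branch& newBranch, Observer observe) {
    // Step 1: park other threads with a 2-byte self branch (EB FE).
    const std::uint8_t selfBranch[2] = {0xEB, 0xFE};
    std::memcpy(site, selfBranch, 2);
    observe(site);

    // Step 2: write the last n-2 bytes of the new branch behind the parking spot.
    std::memcpy(site + 2, newBranch.data() + 2, newBranch.size() - 2);
    observe(site);

    // Step 3: replace the self branch with the first 2 bytes of the new branch.
    std::memcpy(site, newBranch.data(), 2);
    observe(site);
}
```

At every observed state the site either begins with the self branch or holds the complete new branch, so a thread arriving mid-update never decodes a half-written instruction.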


Challenge: Achieving Transparency

Pretend as though the original program is executing

Original Code:
0x1000 call 0x4000

Code cache address mapping:
SPC      TPC
0x1000   0x7000   “caller”
0x4000   0x8000   “callee”

Translated Code:
0x7000 push 0x1006
0x7006 jmp 0x8000

Push 0x1006 (the original return address) on the stack, then jump to the translated copy of 0x4000
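The translation of the call can be sketched as a lookup in the SPC-to-TPC map plus one addition. The names `TranslatedCall` and `translateCall` are hypothetical, and the call length is passed in as a parameter; the value 6 used below is chosen only because it reproduces the slide's return address of 0x1006.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Result of translating "call callee": what to push and where to jump.
struct TranslatedCall {
    std::uint64_t pushedReturnAddress;  // original-program address (SPC)
    std::uint64_t jumpTarget;           // code cache address (TPC)
};

// The pushed return address stays in the original address space so the
// application never sees code cache addresses; only the jump is redirected.
// Throws std::out_of_range if the callee has no translation yet.
TranslatedCall translateCall(
        const std::unordered_map<std::uint64_t, std::uint64_t>& spcToTpc,
        std::uint64_t callSite, std::uint64_t callLength, std::uint64_t callee) {
    return {callSite + callLength, spcToTpc.at(callee)};
}
```

With the mapping from the slide, translating the call at 0x1000 yields a push of 0x1006 and a jump to 0x8000.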


Challenge: Self-Modifying Code

The problem

Code cache must detect SMC and invalidate corresponding cached traces

Solutions

Many proposed … but without HW support, they are very expensive!
• Changing page protection
• Memory diff prior to execution
• On ARM, there is an explicit instruction for SMC!


Self-Modifying Code Handler

(Written by Alex Skaletsky)

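#include "pin.H"

// Analysis routine: runs before each trace and compares the application's
// current code bytes against the copy saved at instrumentation time.
void DoSmcCheck(VOID *traceAddr, VOID *traceCopyAddr,
                USIZE traceSize, CONTEXT *ctxP) {
    if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) {
        // The code changed: drop the stale trace and re-execute from here.
        CODECACHE_InvalidateTrace((ADDRINT)traceAddr);
        PIN_ExecuteAt(ctxP);
    }
}

// Instrumentation routine: saves a copy of the trace and inserts the check.
void InsertSmcCheck(TRACE trace, VOID *v) {
    . . .
    memcpy(traceCopyAddr, traceAddr, traceSize);
    TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                     IARG_PTR, traceAddr,
                     IARG_PTR, traceCopyAddr,
                     IARG_UINT32, traceSize,
                     IARG_CONTEXT,
                     IARG_END);
}

int main(int argc, char **argv) {
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(InsertSmcCheck, 0);
    PIN_StartProgram(); // Never returns
    return 0;
}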


Challenge: Parallel Applications

[Figure: Pin (JIT compiler, syscall emulator, signal emulator, dispatcher, code cache) and the Pin Tool (instrumentation code, call-back handlers, analysis code), marked as serialized or parallel for threads T1 and T2]


Challenge: Code Cache Consistency

Cached code must be removed for a variety of reasons:
• Dynamically unloaded code
• Ephemeral/adaptive instrumentation
• Self-modifying code
• Bounded code caches

[Figure: the EXE passes through a loop of Transform, Code Cache, Execute, and Profile stages]


Motivating a Bounded Code Cache

The Perl Benchmark

[Figure: performance relative to native (y-axis 100% to 400%) for input1, input2, input3, and total, with an unlimited code cache and with 2.5 MB, 2.0 MB, 1.5 MB, and 1.0 MB code caches]


Flushing the Code Cache

• Option 1: All threads have a private code cache (oops, doesn’t scale)

• Option 2: Shared code cache across threads

• If one thread flushes the code cache, other threads may resume in stale memory

[Figure: trace memory increase (y-axis 0% to 600%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp]


Naïve Flush

Wait for all threads to return to the code cache

Could wait indefinitely!

[Figure: timeline of Thread1, Thread2, and Thread3 moving between the VM and code cache CC1; threads stall in the VM (the flush delay) before continuing in CC2]


Generational Flush

Allow threads to continue to make progress in a separate area of the code cache

[Figure: timeline of Thread1, Thread2, and Thread3; each thread moves from CC1 through the VM into CC2 with no VM stall]

Requires a high water mark
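One way to picture a generational flush is to count how many threads are still executing in each generation of the cache. A flush opens a new generation immediately, and an old generation is reclaimed only when its last thread leaves. The sketch below is a single-threaded bookkeeping model with hypothetical names (`GenerationalCache`, `enter`, `leave`, `flush`); a real implementation would need synchronization, and would presumably trigger the flush at the high water mark so the new generation has room while the old one drains.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical bookkeeping for a generational flush.
class GenerationalCache {
public:
    // A thread enters the code cache; it always enters the newest generation.
    int enter() {
        ++residents_[current_];
        return current_;
    }

    // A thread returns to the VM from generation `gen`. An old generation
    // is reclaimed as soon as its last resident leaves.
    void leave(int gen) {
        auto it = residents_.find(gen);
        if (it == residents_.end()) return;
        if (--it->second == 0 && gen != current_) residents_.erase(it);
    }

    // Flush without waiting: open a new generation right away. The old one
    // is reclaimed now if empty, otherwise when its threads drain.
    void flush() {
        auto it = residents_.find(current_);
        if (it != residents_.end() && it->second == 0) residents_.erase(it);
        ++current_;
    }

    // The current generation plus any old generations still in use.
    std::size_t liveGenerations() const {
        std::size_t n = 1;
        for (const auto& kv : residents_) {
            if (kv.first != current_) ++n;
        }
        return n;
    }

private:
    int current_ = 0;
    std::map<int, int> residents_;  // generation -> threads executing in it
};
```

Unlike the naïve flush, no thread waits: a thread that triggers the flush proceeds in the new generation while slower threads finish in the old one.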


Build-Your-Own Cache Replacement

#include "pin.H"
#include <iostream>

// Called when the code cache is full: flush the entire cache.
void FlushOnFull() {
    CODECACHE_FlushCache();
    std::cout << "SWOOSH!" << std::endl;
}

int main(int argc, char **argv) {
    PIN_Init(argc, argv);
    CODECACHE_CacheIsFull(FlushOnFull);
    PIN_StartProgram(); // Never returns
    return 0;
}

% pin -cache_size 40960 -t flusher -- /bin/ls
SWOOSH!
SWOOSH!

Eviction Granularities
• Entire Cache
• One Cache Block
• One Trace
• Address Range


A Graphical Front-End


Memory Scalability of the Code Cache

Ensuring scalability also requires carefully configuring the code stored in the cache

Trace Lengths

• First basic block is non-speculative, others are speculative

• Longer traces = fewer entries in the lookup table, but more unexecuted code

• Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code


Effect of Trace Length on Trace Count

[Figure: total trace count (y-axis 0 to 16000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp, with traces of 1, 2, 4, 8, 16, and 32 basic blocks]


Effect of Trace Length on Memory

[Figure: code cache footprint in KB (y-axis 0 to 10500) versus basic blocks per trace (1, 2, 4, 8, 16, 32), broken down into lookup table, links, exit stubs, and traces]


Sources of Overhead

Internal

• Compiling code & exit stubs (region detection, region formation, code generation)

• Managing code (eviction, linking)

• Managing directories and performing lookups

• Maintaining consistency (SMC, DLLs)

External

• User-inserted instrumentation


Adding Instrumentation

[Figure: run time relative to native (y-axis 100% to 800%) for perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, and hmmer, comparing Pin and Pin+icount]


“Normal Pin” Execution Flow

Instrumentation is interleaved with application

[Figure: timelines labeled Uninstrumented Application, Instrumented Application, and “Pinned” Application, with time attributed to Pin Overhead and Instrumentation Overhead]


“SuperPin” Execution Flow

SuperPin creates instrumented slices

[Figure: timelines labeled Uninstrumented Application and SuperPinned Application, with the Instrumented Slices running alongside]


Issues and Design Decisions

Creating slices
• How/when to start a slice
• How/when to end a slice

System calls

Merging results


[Figure: Execution Timeline across CPU1 to CPU4 over time. The original application runs slices S1 to S6 and forks instrumented application slices S1+ to S6+. Slices record a signature (sigr2 to sigr6) and sleep, detect the next signature or exit, and resume.]


Performance – icount1

% pin -t icount1 -- <benchmark>

[Figure: bar chart (y-axis 0% to 3000%) comparing Pin and SuperPin for ammp, applu, apsi, art, bzip2, crafty, eon, equake, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, perlbmk, sixtrack, swim, twolf, vortex, vpr, wupwise, and AVG]


What Did We Learn Today?

Building Process VMs is only half the battle

Robustness, correctness, performance are paramount

Lots of “tricks” are in play

• Code caches, trace selection, etc.

Knowing about these tricks is beneficial

• Lots of research opportunities

• Understanding the inner workings often helps you write better tools



Want More Info?

• Read the seminal Dynamo paper

• See the more recent papers by the Pin, DynamoRIO, Valgrind teams

• Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT
