Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems

Kim Hazelwood, Greg Lueck, Robert Cohn

2 Hazelwood – ISMM 2009

Dynamic Binary Instrumentation

counter++;
sub $0xff, %edx
counter++;
cmp %esi, %edx
counter++;
jle <L1>
counter++;
mov $0x1, %edi
counter++;
add $0x10, %eax

Inserts or modifies arbitrary instructions in executing binaries, e.g., to count instructions
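A minimal sketch of the instruction-count idea on this slide, written as a toy Python model rather than against the Pin API. It assumes a "program" is just a list of assembly strings; the names instrument, docount, and run are illustrative and not taken from the talk.

```python
# Toy model of instruction-count instrumentation (not the Pin API).
# Assumption: a "program" is a list of assembly strings, and
# instrumenting it means placing a counter update before each one.

counter = 0

def docount():
    """Analysis routine: runs each time an instrumented instruction executes."""
    global counter
    counter += 1

def instrument(instructions):
    """Instrumentation routine: pair each instruction with a call to docount."""
    instrumented = []
    for ins in instructions:
        instrumented.append(docount)  # the inserted "counter++;"
        instrumented.append(ins)      # the original instruction, unchanged
    return instrumented

def run(instrumented):
    """Invoke inserted callbacks; original instructions are only recorded here."""
    executed = []
    for item in instrumented:
        if callable(item):
            item()
        else:
            executed.append(item)
    return executed

program = [
    "sub $0xff, %edx",
    "cmp %esi, %edx",
    "jle <L1>",
    "mov $0x1, %edi",
    "add $0x10, %eax",
]

executed = run(instrument(program))
print("Count", counter)
```

The original instructions pass through untouched; only the inserted calls change the counter.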

Instruction Count Output

$ /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out

$ pin -t inscount.so -- /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
Count 422838

How Does it Work?

Generates and caches modified copies of instructions

Modified (cached) instructions are executed in lieu of original instructions

[Diagram with boxes labeled EXE, Transform, Code Cache, Execute, and Profile]
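A minimal sketch of the generate-and-cache loop described on this slide, as a toy Python model and not Pin's implementation. It assumes the executable maps an address to an (instruction, next address) pair and that "transforming" an instruction just tags it; exe, transform, and execute are hypothetical names.

```python
# Toy model of a translate-and-cache loop (not Pin's implementation).
# Assumption: the "EXE" maps an address to (instruction, next address),
# and "transform" just tags the instruction as a modified copy.

exe = {
    0x10: ("sub $0xff, %edx", 0x14),
    0x14: ("cmp %esi, %edx", 0x18),
    0x18: ("mov $0x1, %edi", None),  # None ends the toy program
}

code_cache = {}    # original address -> modified copy
translations = 0   # how many times the transformer ran

def transform(addr):
    """Generate a modified copy of the instruction at addr."""
    global translations
    translations += 1
    ins, next_addr = exe[addr]
    return ("instrumented: " + ins, next_addr)

def execute(start):
    """Run from start, executing cached copies in place of the originals."""
    trace = []
    addr = start
    while addr is not None:
        if addr not in code_cache:        # miss: transform and cache
            code_cache[addr] = transform(addr)
        ins, addr = code_cache[addr]      # hit: reuse the cached copy
        trace.append(ins)
    return trace

first = execute(0x10)
second = execute(0x10)  # served entirely from the code cache
print(translations)
```

The second run triggers no further transformations, which is the point of caching the modified copies.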

Why “Dynamic” Instrumentation?

Robustness!

• No need to recompile or relink

• Discover code at runtime

• Handle dynamically-generated code

• Attach to running processes

The Code Discovery Problem on x86

[Diagram: a code region laid out as Instr 1, Instr 2, Instr 3, JumpReg, DATA, Instr 5, Instr 6, Uncond Branch, PADDING, Instr 8]

• JumpReg: indirect jump to ??

• DATA: data interspersed with code

• PADDING: pad for alignment

Intel Pin

• A dynamic binary instrumentation system

• Easy-to-use instrumentation interface

• Supports multiple platforms
  – Four ISAs: IA32, Intel64, IPF, ARM
  – Four OSes: Linux, Windows, FreeBSD, MacOS

• Popular and well supported
  – 32,000+ downloads
  – 400+ citations
  – 500+ mailing list subscribers

Research Applications

• Gather profile information about applications

• Compare programs generated by competing compilers

• Generate a select stream of live information for event-driven simulation

• Add security features

• Emulate new hardware

• Anything and everything multicore

The Problem with Modern Tools

• Many research tools do not support multithreaded guest applications

• Providing support for MT apps is mostly straightforward

• Providing scalable support can be tricky!

Issues that Arise

• Gaining control of executing threads

• Determining what should be private vs. shared between threads

• Code cache maintenance and consistency

• Concurrent instruction writes

• Providing/handling thread-local storage

• Handling indirect branches

• Handling signals / system calls

The Pin Architecture

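[Diagram: Pin contains a JIT Compiler, Syscall Emulator, Signal Emulator, Dispatcher, and Code Cache. The Pin Tool supplies Instrumentation Code, Call-Back Handlers, and Analysis Code. A legend marks which components run threads T1 and T2 serialized and which run them in parallel.]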

Code Cache Consistency

Cached code must be removed for a variety of reasons:

• Dynamically unloaded code

• Ephemeral/adaptive instrumentation

• Self-modifying code

• Bounded code caches

[Diagram with boxes labeled EXE, Transform, Code Cache, Execute, and Profile]
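A minimal sketch of two of the removal cases listed above: invalidating translations of unloaded code, and flushing a bounded cache when it fills. This is a single-threaded toy model under assumed names (BoundedCodeCache, invalidate_range), not Pin's code cache interface, and the flush-everything-when-full policy is only one possible choice.

```python
# Toy bounded code cache (hypothetical class, not Pin's interface).
# Assumption: each translation has a known size in bytes, and the cache
# is flushed in full when a new translation would not fit.

class BoundedCodeCache:
    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = {}   # original address -> (translated code, size)
        self.flushes = 0

    def flush(self):
        """Discard every cached translation."""
        self.entries.clear()
        self.used_bytes = 0
        self.flushes += 1

    def insert(self, addr, code, size):
        """Cache a translation, flushing first if the bound would be exceeded."""
        if size > self.capacity_bytes:
            raise ValueError("translation larger than the whole cache")
        old = self.entries.pop(addr, None)
        if old is not None:
            self.used_bytes -= old[1]
        if self.used_bytes + size > self.capacity_bytes:
            self.flush()
        self.entries[addr] = (code, size)
        self.used_bytes += size

    def invalidate_range(self, start, end):
        """Remove translations of original code in [start, end), e.g. after an unload."""
        stale = [a for a in self.entries if start <= a < end]
        for a in stale:
            self.used_bytes -= self.entries[a][1]
            del self.entries[a]
        return len(stale)

cache = BoundedCodeCache(100)
cache.insert(0x1000, "t1", 40)
cache.insert(0x2000, "t2", 40)
removed = cache.invalidate_range(0x2000, 0x3000)  # code at 0x2000 was unloaded
cache.insert(0x3000, "t3", 40)
cache.insert(0x4000, "t4", 40)                    # does not fit: triggers a flush
print(removed, cache.flushes, sorted(cache.entries))
```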

Motivating a Bounded Code Cache

[Chart: The Perl Benchmark. Performance relative to native (100% to 400%) for input1, input2, input3, and total, comparing an unlimited code cache with 2.5 MB, 2.0 MB, 1.5 MB, and 1.0 MB code caches.]

Flushing the Code Cache

• Option 1: All threads have a private code cache (oops, doesn’t scale)

• Option 2: Shared code cache across threads

• If one thread flushes the code cache, other threads may resume in stale memory

[Chart: Trace memory increase (0% to 600%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp.]

Naïve Flush

Wait for all threads to return from the code cache to the VM

Could wait indefinitely!

[Timeline diagram: Thread1, Thread2, and Thread3 over time. Two threads go VM, CC1, VM stall, CC2; the third goes VM, CC1, VM, CC2. The stall period is labeled “Flush Delay”.]

Generational Flush

Allow threads to continue to make progress in a separate area of the code cache

[Timeline diagram: Thread1, Thread2, and Thread3 over time. Each thread goes VM, CC1, VM, CC2 with no stall.]

Requires a high water mark
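A minimal sketch of the generational idea, as a single-threaded Python bookkeeping model and not Pin's implementation. It assumes a flush opens a new generation, threads move to the current generation the next time they enter the code cache from the VM, and an old generation is freed once no thread is still in it. The GenerationalCache class and the choice to trigger the flush when the current generation reaches the high water mark are assumptions made for illustration.

```python
# Toy model of a generational flush (hypothetical, not Pin's implementation).
# Assumption: a flush opens a new generation instead of stalling threads, and
# an old generation is freed only after the last thread has left it.

class GenerationalCache:
    def __init__(self, high_water_mark):
        self.high_water_mark = high_water_mark
        self.current = 0        # generation that receives new code
        self.sizes = {0: 0}     # generation -> bytes in use
        self.threads = {}       # thread id -> generation it is running in

    def _reclaim(self, gen):
        """Free an old generation once no thread is executing in it."""
        if gen != self.current and gen not in self.threads.values():
            self.sizes.pop(gen, None)

    def enter(self, tid):
        """A thread enters the code cache from the VM, using the current generation."""
        old = self.threads.get(tid)
        self.threads[tid] = self.current
        if old is not None and old != self.current:
            self._reclaim(old)

    def flush(self):
        """Start a new generation; the old one lingers until its threads leave."""
        old = self.current
        self.current += 1
        self.sizes[self.current] = 0
        self._reclaim(old)

    def add_code(self, size):
        """Add new code and flush once the current generation hits the mark."""
        self.sizes[self.current] += size
        if self.sizes[self.current] >= self.high_water_mark:
            self.flush()

cache = GenerationalCache(high_water_mark=80)
cache.enter("T1")
cache.enter("T2")
cache.add_code(80)                 # reaches the mark: generation 1 opens
after_flush = sorted(cache.sizes)  # generation 0 is kept, T1 and T2 still use it
cache.enter("T1")                  # T1 moves on without waiting for T2
cache.add_code(10)
cache.enter("T2")                  # last thread leaves generation 0: it is freed
print(after_flush, sorted(cache.sizes))
```

No thread ever waits on another here, which is the contrast with the naïve flush.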

Memory Scalability of the Code Cache

Ensuring scalability also requires carefully configuring the code stored in the cache

Trace Lengths

• First basic block is non-speculative, others are speculative

• Longer traces = fewer entries in the lookup table, but more unexecuted code

• Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code
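The trade-off in the bullets above can be illustrated with a deliberately simple counting model. It assumes every basic block ends in a conditional branch, that each block inside a trace needs one exit stub for its off-trace path, and that the last block of a trace needs two; the function trace_stats and the block count of 64 are hypothetical. It does not model the unexecuted speculative code that longer traces add.

```python
import math

# Toy counting model (an assumption for illustration, not measured data).
# Every basic block is assumed to end in a conditional branch.

def trace_stats(num_blocks, blocks_per_trace):
    """Count lookup-table entries and exit stubs for a given trace length."""
    traces = math.ceil(num_blocks / blocks_per_trace)
    # One off-trace stub per block, plus a second stub at the end of each trace.
    exit_stubs = num_blocks + traces
    return {"lookup_entries": traces, "exit_stubs": exit_stubs}

for k in (1, 2, 4, 8):
    print(k, trace_stats(64, k))
```

Under these assumptions, one-block traces need both more lookup entries and more exit stubs than eight-block traces for the same code.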

Effect of Trace Length on Trace Count

[Chart: Total trace count (0 to 16,000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp, with traces of 1, 2, 4, 8, 16, and 32 basic blocks.]

Effect of Trace Length on Memory

[Chart: Code cache footprint in KB (0 to 10,500) versus basic blocks per trace (1, 2, 4, 8, 16, 32), broken down into lookup table, links, exit stubs, and traces.]

Rewriting Instructions

• Pin must regularly rewrite branches

• No atomic branch write on x86

• We use a neat trick*:

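[Diagram: the “old” 5-byte branch is overwritten in stages. A 2-byte self branch is written over its first two bytes, then the remaining n-2 bytes of the “new” branch are written, and finally the self branch is replaced to complete the “new” 5-byte branch.]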

* Sundaresan et al. 2006
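A minimal sketch of the staged rewrite, simulated on a Python bytearray. It assumes each two-byte store is performed atomically by the hardware, which the simulation cannot demonstrate; the helper names jmp_rel32 and rewrite_branch are illustrative. The encodings used are the standard x86 ones: EB FE is a short jump to itself, and E9 followed by a 32-bit displacement is a 5-byte near jump.

```python
# Simulation of the staged branch rewrite on a byte buffer.
# Assumption: each 2-byte store below is atomic on the real machine.

SELF_BRANCH = bytes([0xEB, 0xFE])  # x86 "jmp rel8" with displacement -2: jumps to itself

def jmp_rel32(displacement):
    """Encode a 5-byte x86 near jump: opcode E9 plus a little-endian rel32."""
    return bytes([0xE9]) + displacement.to_bytes(4, "little", signed=True)

def rewrite_branch(code, offset, new_branch):
    """Overwrite the branch at code[offset:] in three steps, yielding each state."""
    n = len(new_branch)
    code[offset:offset + 2] = SELF_BRANCH         # 1. threads arriving here spin in place
    yield bytes(code[offset:offset + n])
    code[offset + 2:offset + n] = new_branch[2:]  # 2. write the last n-2 bytes
    yield bytes(code[offset:offset + n])
    code[offset:offset + 2] = new_branch[:2]      # 3. replace the self branch
    yield bytes(code[offset:offset + n])

code = bytearray(jmp_rel32(0x100))   # the "old" branch
new_branch = jmp_rel32(0x200)        # the "new" branch
states = list(rewrite_branch(code, 0, new_branch))
for state in states:
    print(state.hex())
```

At every intermediate state the first two bytes are either the self branch or the start of the complete new branch, so a thread never decodes a half-written target.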

Performance Results

We use the SPEC OMP 2001 benchmarks

• OMP_NUM_THREADS environment variable

We compare

• Native performance and scalability

• Pin (no Pintool) performance scalability

• Pin (lightweight Pintool) scalability
  – InsCount Pintool: counts instructions at BB granularity

• Pin (middleweight Pintool) scalability
  – MemTrace Pintool: records memory addresses

• Pin (heavyweight Pintool) scalability
  – CMP$im: collects memory addresses and applies a software model of the CMP cache
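A minimal sketch of counting at basic-block granularity, as in the lightweight InsCount configuration: one addition per executed block instead of one per instruction. This is a Python toy, not the InsCount Pintool; the block sizes and execution paths are made up, and keeping one private counter per thread is an assumption made here to avoid contention on a shared counter.

```python
import threading

# Hypothetical basic blocks: name -> number of instructions in the block.
BLOCK_SIZES = {"bb0": 4, "bb1": 7, "bb2": 2}

def run_thread(path, counts, slot):
    """Analysis code for one thread: one addition per executed basic block."""
    local = 0
    for bb in path:
        local += BLOCK_SIZES[bb]  # add the whole block's instruction count at once
    counts[slot] = local          # each thread writes only its own slot

# Hypothetical per-thread execution paths.
paths = [["bb0", "bb1", "bb1", "bb2"], ["bb0", "bb2"]]
counts = [0] * len(paths)         # one private counter per thread, no lock needed

threads = [
    threading.Thread(target=run_thread, args=(path, counts, i))
    for i, path in enumerate(paths)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(counts)
print(counts, total)
```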

Native Scalability of SPEC OMP 2001

[Chart: Speedup (0X to 8X) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]

Performance Scalability (No Instrumentation)

[Chart: Runtime relative to native (0% to 160%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]

Performance Scalability (Lightweight Instrumentation)

[Chart: Runtime relative to native (0% to 160%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]

Performance Scalability (Middleweight Instrumentation)

[Chart: Runtime relative to native (0% to 500%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]

Performance Scalability (Heavyweight Instrumentation)

[Chart: Runtime relative to native (0X to 500X) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]

Memory Scalability

[Chart: Code cache size in KB (0 to 7,000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]

Summary

• Dynamic instrumentation tools are useful

• In the multicore era, we must provide support for MT application analysis and simulation

• Providing MT support in Pin was easy

• Making it robust and scalable was not easy

http://www.pintool.org