Chimera: Hybrid Program Analysis for Determinism. Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy, University of Michigan, Ann Arbor

Upload: gregory-barber

Post on 12-Jan-2016

- 1 -

Dongyoon Lee, Peter Chen,

Jason Flinn, Satish Narayanasamy

University of Michigan, Ann Arbor

Chimera: Hybrid Program Analysis for Determinism

* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

- 2 -

Deterministic Replay

Goal: record and reproduce multithreaded execution
• Debugging concurrency bugs
• Offline heavyweight dynamic analysis
• Forensics and intrusion detection
• … and many more uses

Problem
• Multithreaded record-and-replay is too slow (>2x) or requires custom hardware

- 3 -

Multithreaded Record-and-Replay is Slow

[Diagram: Thread 1, Thread 2, and Thread 3 perform Write, Write, and Read accesses to shared memory]
• Log shared memory dependencies
• Checkpoint memory and register state
• Log non-deterministic program input (interrupts, I/O values, DMA, etc.)
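The input-logging part of record-and-replay can be sketched as follows. This is a minimal illustration that assumes a single-threaded program and a user-space wrapper. Chimera logs input in a modified Linux kernel, so the names `rr_input`, `input_log_t`, and the stand-in sources `clock_a`/`clock_b` are hypothetical. In record mode the wrapper calls the real source and appends the result. In replay mode it returns the logged values in the same order, whatever the source would return now.

```c
#include <assert.h>

/* Hypothetical user-space record/replay wrapper for one non-deterministic
   input source in a single-threaded program. Chimera records input in a
   modified Linux kernel instead. */
#define INPUT_CAP 64

typedef long (*input_source_t)(void);

typedef struct {
    int replaying;            /* 0 = record, 1 = replay */
    long values[INPUT_CAP];   /* logged input values, in program order */
    int len;                  /* number of logged values */
    int cursor;               /* next value to return during replay */
} input_log_t;

/* Record: call the real source and log its result.
   Replay: ignore the source and return the next logged value. */
static long rr_input(input_log_t *lg, input_source_t source) {
    if (lg->replaying) {
        assert(lg->cursor < lg->len);
        return lg->values[lg->cursor++];
    }
    long v = source();
    assert(lg->len < INPUT_CAP);
    lg->values[lg->len++] = v;
    return v;
}

/* Stand-in "non-deterministic" sources for the demo: each call differs. */
static long tick;
static long clock_a(void) { return ++tick; }
static long clock_b(void) { return 1000 + ++tick; }
```

In this example, two values are recorded from `clock_a` (1 and 2). During replay, calls that pass `clock_b` still return 1 and 2, because the wrapper reads the log instead of calling the source.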

- 4 -

Replay for Data-Race-Free Programs is Cheap

Data-race-free programs
• Shared memory accesses are well ordered by synchronization ops.
• Recording the happens-before order of sync. ops. is sufficient

Problem: Programs with data races

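[Diagram: threads T1, T2, and T3 interleave memory operations (X=0, Y=0, X=1, Y=1, Y=2, Z=1, X=2, Z=2) with synchronization operations (Lock(l), Unlock(l), Signal(c), Wait(c)); arrows contrast the order of memory operations with the order of synchronization operations]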
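For a data-race-free program, recording the happens-before order of synchronization operations is enough. The C sketch below uses hypothetical names (`rr_lock_t`, `rr_acquire`, `rr_release`) and is not Chimera's modified pthread library. During recording it logs which thread acquired a mutex, in order. During replay it makes each thread wait for its logged turn. In a race-free program every shared access sits inside some critical section, so reproducing the acquisition order reproduces the execution.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical record/replay wrapper around one mutex; not Chimera's
   modified pthread library. */
#define LOG_CAP 256

typedef struct {
    pthread_mutex_t m;    /* the program's original mutex */
    pthread_mutex_t meta; /* protects cursor during replay */
    pthread_cond_t turn;  /* broadcast whenever cursor advances */
    int replaying;        /* 0 = record, 1 = replay */
    int log[LOG_CAP];     /* thread ids in acquisition order */
    int len;              /* number of logged acquisitions */
    int cursor;           /* next log entry to honour in replay */
} rr_lock_t;

static void rr_acquire(rr_lock_t *l, int tid) {
    if (l->replaying) {
        /* Wait until the recorded order says it is this thread's turn. */
        pthread_mutex_lock(&l->meta);
        while (l->cursor < l->len && l->log[l->cursor] != tid)
            pthread_cond_wait(&l->turn, &l->meta);
        pthread_mutex_unlock(&l->meta);
        pthread_mutex_lock(&l->m);
    } else {
        pthread_mutex_lock(&l->m);
        assert(l->len < LOG_CAP);
        l->log[l->len++] = tid;   /* safe: we hold l->m */
    }
}

static void rr_release(rr_lock_t *l) {
    pthread_mutex_unlock(&l->m);
    if (l->replaying) {
        /* Hand the turn to the next thread in the log. */
        pthread_mutex_lock(&l->meta);
        l->cursor++;
        pthread_cond_broadcast(&l->turn);
        pthread_mutex_unlock(&l->meta);
    }
}

static rr_lock_t lk = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
                        PTHREAD_COND_INITIALIZER, 0, {0}, 0, 0 };
static int observed[LOG_CAP], n_observed;  /* order seen inside the lock */

static void *worker(void *arg) {
    int tid = (int)(long)arg;
    for (int i = 0; i < 3; i++) {
        rr_acquire(&lk, tid);
        observed[n_observed++] = tid;      /* race-free: under lk.m */
        rr_release(&lk);
    }
    return NULL;
}

static void run_threads(void) {
    pthread_t th[4];
    n_observed = 0;
    for (long t = 0; t < 4; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    for (int t = 0; t < 4; t++)
        pthread_join(th[t], NULL);
}

/* Record one run, replay it, and return 1 if the replay reproduced the
   recorded acquisition order exactly. */
static int record_then_replay(void) {
    run_threads();            /* record: lk.log receives the real order */
    lk.replaying = 1;
    lk.cursor = 0;
    run_threads();            /* replay: threads follow lk.log */
    if (n_observed != lk.len)
        return 0;
    for (int i = 0; i < lk.len; i++)
        if (observed[i] != lk.log[i])
            return 0;
    return 1;
}
```

`record_then_replay()` runs four threads that each take the lock three times. It then replays that run and checks the order observed inside the critical section against the log.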

- 5 -

Our Contribution: A Hybrid Analysis

[Diagram: Chimera turns a potentially racy program P into a data-race-free program P’ through three analyses]

1) Sound static data race analysis
  • Add synchronizations for potential data races
  • Problem: too many false positives

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

- 6 -

Roadmap

• Motivation

• Chimera Analysis

1) Static data race analysis

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 7 -

Roadmap

• Motivation

• Chimera Analysis

1) Static data race analysis

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 8 -

Static Data Race Analysis

• Find potential data races using a sound static data race detector, RELAY [Voung et al., FSE’07]

• Protect all potential data races using weak-locks
  − A new time-out lock that may be preempted (discussed later)

• Record and replay the happens-before order of weak-locks

- 9 -

Protect Potential Races using Weak-locks

Callouts: the writes to X (X = 0 in foo, X = 1 in bar) and the writes to Y[tid][i] are reported as potential racy pairs. Z gets no race report, so static analysis lets Chimera avoid instrumenting the access to Z.

void foo() {
    X = 0;
    for (i = ...) {
        Y[tid][i] = 0;
    }
}

void bar() {
    X = 1;
    for (i = …) {
        Y[tid][i] = 1;
        Z = 1;
    }
}
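Below is a sketch of how instruction-level weak-lock instrumentation of `foo` and `bar` might look. It assumes `weak_lock`/`weak_unlock` are hypothetical helpers built on plain pthread mutexes, and it leaves out the time-out behavior of real weak-locks. Each potential racy pair gets its own weak-lock, and the unreported access to `Z` stays uninstrumented. Every acquisition would become one entry in the recorded order, which is why per-instruction weak-locks are expensive.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical instrumented output: one weak-lock per potential racy pair
   (WL_X for X, WL_Y for Y). weak_lock/weak_unlock are stand-ins built on
   pthread mutexes; real weak-locks can also time out, omitted here. */
#define N 4
enum { WL_X, WL_Y, NUM_WL };

static pthread_mutex_t wl[NUM_WL] = { PTHREAD_MUTEX_INITIALIZER,
                                      PTHREAD_MUTEX_INITIALIZER };
static pthread_mutex_t log_mu = PTHREAD_MUTEX_INITIALIZER;
static int acquisitions;   /* stands in for the length of the order log */

static void weak_lock(int id) {
    assert(id >= 0 && id < NUM_WL);
    pthread_mutex_lock(&wl[id]);
    pthread_mutex_lock(&log_mu);
    acquisitions++;        /* a recorder would append (id, thread) here */
    pthread_mutex_unlock(&log_mu);
}

static void weak_unlock(int id) { pthread_mutex_unlock(&wl[id]); }

static int X, Z, Y[2][N];

static void foo(int tid) {
    weak_lock(WL_X); X = 0; weak_unlock(WL_X);
    for (int i = 0; i < N; i++) {
        weak_lock(WL_Y); Y[tid][i] = 0; weak_unlock(WL_Y);
    }
}

static void bar(int tid) {
    weak_lock(WL_X); X = 1; weak_unlock(WL_X);
    for (int i = 0; i < N; i++) {
        weak_lock(WL_Y); Y[tid][i] = 1; weak_unlock(WL_Y);
        Z = 1;             /* no race report: left uninstrumented */
    }
}

static void *run_foo(void *arg) { foo(0); return arg; }
static void *run_bar(void *arg) { bar(1); return arg; }

/* Runs foo and bar on two threads; returns the number of weak-lock
   acquisitions, i.e. how many order-log entries would be recorded. */
static int run_instrumented(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run_foo, NULL);
    pthread_create(&t2, NULL, run_bar, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return acquisitions;
}
```

With `N = 4`, each function takes one weak-lock for `X` and one per loop iteration for `Y`, so `run_instrumented()` returns 2 × (N + 1) = 10. The `Y` pair is locked even though the two threads write different rows. That is the false race attributed to conservative pointer analysis.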

- 10 -

Sources of False Positives in RELAY

• A sound data-race detector reports too many false data races
  − 53x overhead

• Source 1: Non-mutex synchronizations are ignored
  − Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
  − May report a false data race between memory instructions that can never execute concurrently

• Source 2: Conservative pointer analysis
  − Overestimates the variables accessed by a memory instruction
  − May report a false data race between memory instructions that can never access the same location
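To see why Source 1 arises, consider a simplified, Eraser-style lockset check over a hand-written trace. RELAY itself is a static, summary-based analysis. This dynamic version uses hypothetical names (`event_t`, `lockset_reports_race`) and omits Eraser's ownership states, so it only illustrates the principle. The candidate lockset is intersected with the locks held at each access. Barriers contribute nothing, so two accesses ordered only by a barrier end up with an empty lockset and are reported as a race.

```c
#include <assert.h>

/* Simplified Eraser-style lockset check over a hand-written trace of
   accesses to ONE shared variable. Eraser's ownership states are omitted. */
enum { EV_ACCESS, EV_BARRIER };

typedef struct {
    int kind;      /* EV_ACCESS or EV_BARRIER */
    int tid;       /* thread performing the event */
    unsigned held; /* bitmask of mutexes held at an access */
} event_t;

/* Returns 1 if the check reports a race: the variable is touched by more
   than one thread and no single mutex was held at every access. */
static int lockset_reports_race(const event_t *trace, int n) {
    assert(n >= 0);
    unsigned candidates = ~0u;   /* candidate lockset starts as "all locks" */
    int first_tid = -1, shared = 0;
    for (int i = 0; i < n; i++) {
        if (trace[i].kind == EV_BARRIER)
            continue;            /* the ordering a barrier gives is invisible */
        if (first_tid < 0)
            first_tid = trace[i].tid;
        else if (trace[i].tid != first_tid)
            shared = 1;
        candidates &= trace[i].held;
    }
    return shared && candidates == 0;
}

/* T1 writes X, both threads pass a barrier, then T2 writes X.
   No mutex is held, so the lockset empties: a false race. */
static const event_t barrier_trace[] = {
    { EV_ACCESS, 1, 0x0 }, { EV_BARRIER, 1, 0x0 },
    { EV_BARRIER, 2, 0x0 }, { EV_ACCESS, 2, 0x0 },
};

/* Both writes hold mutex m0 (bit 0x1): the lockset stays {m0}. */
static const event_t locked_trace[] = {
    { EV_ACCESS, 1, 0x1 }, { EV_ACCESS, 2, 0x1 },
};
```

`lockset_reports_race(barrier_trace, 4)` reports a race even though the barrier orders the two writes. `locked_trace` does not, because mutex m0 guards both accesses.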

- 11 -

Roadmap

• Motivation

• Chimera Analysis

1) Static data race analysis

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 12 -

Profiling Non-concurrent Code Regions

Problem
• Lockset-based analysis ignores non-mutex synchronization ops.

Solution
• Profile non-concurrent code regions (e.g., functions)
• Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
• Parallelism is preserved unless mis-profiled

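[Diagram: T1 runs foo() and then reaches a BARRIER; T2 passes the same BARRIER and then runs bar(). The barrier orders the two functions, so the race reported between them is a false race]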

- 13 -

Function-Level Weak-Locks

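If the profiler says foo() and bar() are not likely to run concurrently, protect each function with a function-level weak-lock instead of locking each potential racy instruction.

[Diagram: foo() and bar(), separated by a BARRIER, with the false race between them]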

void foo() {
    X = 0;
    for (i = …) {
        Y[tid][i] = 0;
    }
}

void bar() {
    X = 1;
    for (i = …) {
        Y[tid][i] = 1;
        Z = 1;
    }
}
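One way to decide whether `foo()` and `bar()` can share a function-level weak-lock is to check a profile for overlapping activations. The sketch assumes a hypothetical profiler that emits one `activation_t` record (function, thread, entry and exit timestamp) per call. The record format and `never_concurrent` are illustrative, not Chimera's profiler. If no activation of one function ever overlaps an activation of the other on a different thread, a single weak-lock can cover both.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical profiler record: one entry per function activation. */
typedef struct {
    int func;    /* function id, e.g. 0 = foo, 1 = bar */
    int tid;     /* thread that ran the activation */
    long start;  /* entry timestamp */
    long end;    /* exit timestamp, end >= start */
} activation_t;

/* Activations on different threads overlap if their intervals intersect. */
static bool overlaps(const activation_t *a, const activation_t *b) {
    return a->tid != b->tid && a->start <= b->end && b->start <= a->end;
}

/* True if no activation of f ever overlapped an activation of g, so one
   function-level weak-lock could protect both without losing parallelism
   (on this profile). */
static bool never_concurrent(const activation_t *p, int n, int f, int g) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (p[i].func == f && p[j].func == g && overlaps(&p[i], &p[j]))
                return false;
    return true;
}

/* Profile 1: a barrier separates foo on T1 (0..10) from bar on T2 (12..20). */
static const activation_t barrier_profile[] = { { 0, 1, 0, 10 }, { 1, 2, 12, 20 } };
/* Profile 2: foo and bar overlap in time on different threads. */
static const activation_t overlap_profile[] = { { 0, 1, 0, 10 }, { 1, 2, 5, 15 } };
```

Profiling depends on the input, so a pair can be mis-profiled. Weak-locks only serialize code, so a mis-profile costs parallelism rather than replay correctness. This matches the claim that parallelism is preserved unless mis-profiled.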

- 14 -

Roadmap

• Motivation

• Chimera Analysis

1) Static data race analysis

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 15 -

Imprecision in Conservative Pointer Analysis

[Diagram: threads T1 and T2 with a BARRIER; foo() in T1 and bar() in T2 may run concurrently]

- 16 -

Imprecision in Conservative Pointer Analysis

• RELAY uses Steensgaard’s and Andersen’s pointer analyses
  − Flow-insensitive and context-insensitive (FICI) analysis
  − Naming heap objects is conservative

• Overestimates the variables accessed by a memory instruction

void foo() {
    …
    for (i = 0 to N) {
        Y[tid][i] = 0;
        …
    }
}

void bar() {
    …
    for (i = 0 to N) {
        Y[tid][i] = 1;
        …
    }
}

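[Diagram: the array Y[][], with Thread 1 and Thread 2 each writing its own row Y[tid][…]; the pointer analysis cannot tell the rows apart, so the two writes are reported as a potential racy pair, which is a false race]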

- 17 -

Symbolic Bounds Analysis

Our Solution
• Derive symbolic lower and upper bounds on the addresses that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]

• Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression

• Parallelism is preserved if the bounds are precise enough

void foo() {
    …
    for (i = 0 to N) {
        Y[tid][i] = 0;
    }
    …
}

Symbolic bounds analysis → Bounds: &Y[tid][0] to &Y[tid][N]

- 18 -

Loop-level Weak-locks

Symbolic bounds: &Y[tid][0] to &Y[tid][N]

[Diagram: each loop in foo() and bar() is wrapped in one loop-level weak-lock acquire/release pair over the range (&Y[tid][0], &Y[tid][N])]

void foo() {
    X = 0;
    for (i = 0 to N) {
        Y[tid][i] = 0;
    }
}

void bar() {
    X = 1;
    for (i = 0 to N) {
        Y[tid][i] = 1;
        Z = 1;
    }
}
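A loop-level weak-lock can be thought of as a lock over an address range. The sketch below keeps a small table of held ranges and blocks a request only if its range overlaps one already held. The functions `range_lock`/`range_unlock` are hypothetical, and order logging is omitted. The sketch reads the slide's bounds `&Y[tid][0]` to `&Y[tid][N]` as a half-open range for a loop over `i < N`, so loops that work on different rows of `Y` do not block each other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical loop-level weak-lock over an address range [lo, hi).
   Requests conflict only if their ranges overlap. Order logging omitted. */
#define MAX_HELD 8
#define N 4

static pthread_mutex_t tbl_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tbl_cv = PTHREAD_COND_INITIALIZER;
static uintptr_t held_lo[MAX_HELD], held_hi[MAX_HELD];
static int held_used[MAX_HELD];

/* Caller must hold tbl_mu (or be single-threaded). */
static int conflicts(uintptr_t lo, uintptr_t hi) {
    for (int i = 0; i < MAX_HELD; i++)
        if (held_used[i] && lo < held_hi[i] && held_lo[i] < hi)
            return 1;
    return 0;
}

/* Blocks until [lo, hi) overlaps no held range and a slot is free. */
static int range_lock(uintptr_t lo, uintptr_t hi) {
    int slot = 0;
    pthread_mutex_lock(&tbl_mu);
    for (;;) {
        if (!conflicts(lo, hi)) {
            for (slot = 0; slot < MAX_HELD && held_used[slot]; slot++)
                ;
            if (slot < MAX_HELD)
                break;
        }
        pthread_cond_wait(&tbl_cv, &tbl_mu);
    }
    held_lo[slot] = lo;
    held_hi[slot] = hi;
    held_used[slot] = 1;
    pthread_mutex_unlock(&tbl_mu);
    return slot;
}

static void range_unlock(int slot) {
    pthread_mutex_lock(&tbl_mu);
    assert(held_used[slot]);
    held_used[slot] = 0;
    pthread_cond_broadcast(&tbl_cv);
    pthread_mutex_unlock(&tbl_mu);
}

static int Z, Y[2][N];

/* One weak-lock per loop instead of one per iteration. */
static void foo_loop(int tid) {
    int slot = range_lock((uintptr_t)&Y[tid][0], (uintptr_t)&Y[tid][N]);
    for (int i = 0; i < N; i++)
        Y[tid][i] = 0;
    range_unlock(slot);
}

static void bar_loop(int tid) {
    int slot = range_lock((uintptr_t)&Y[tid][0], (uintptr_t)&Y[tid][N]);
    for (int i = 0; i < N; i++) {
        Y[tid][i] = 1;
        Z = 1;   /* no race report: Z is outside the locked range */
    }
    range_unlock(slot);
}
```

While row 0 is held, a request for row 1 does not conflict, because the two half-open ranges only touch at `&Y[0][N] == &Y[1][0]`. A request inside row 0 must wait. Compared with the instruction-level version, each loop now logs one acquisition instead of `N`.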

- 19 -

Imprecise Symbolic Bounds

Sources
• Bounds depend on a value computed inside the code region
• Bounds depend on arithmetic operations not supported by the analysis
  − e.g., modulo operations, logical AND/OR, etc.

Choosing the optimal granularity
• If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism

void qux() {
    …
    for (i = 0 to N) {
        prev = Z[prev];
    }
    …
}

Symbolic bounds analysis → Bounds: -INF to +INF
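The granularity rule above can be written as a small decision function. `bounds_t`, `choose_granularity`, and the `BODY_THRESHOLD` value are assumptions made for illustration; the slides give no concrete threshold. When a loop's bounds collapse to -INF to +INF (as in `qux`), a loop-level weak-lock would cover every address. For long loop bodies, the sketch therefore falls back to basic-block-level weak-locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of symbolic bounds analysis for one racy loop. */
typedef struct {
    bool precise;   /* false when bounds degrade to -INF .. +INF */
    long lo, hi;    /* address bounds, meaningful only when precise */
} bounds_t;

typedef enum { LOCK_LOOP, LOCK_BB } granularity_t;

/* Assumed tuning knob (instructions in the loop body); not from Chimera. */
#define BODY_THRESHOLD 16

/* Keep one loop-level weak-lock unless the bounds are imprecise AND the
   body is long, in which case serializing the whole loop would cost more
   parallelism than per-basic-block weak-locks cost in overhead. */
static granularity_t choose_granularity(bounds_t b, int body_instrs) {
    assert(body_instrs >= 0);
    if (!b.precise && body_instrs >= BODY_THRESHOLD)
        return LOCK_BB;
    return LOCK_LOOP;
}
```

When the body is short, one loop-level weak-lock around the whole loop can still cost less than many per-block acquisitions, even if the address range is unbounded.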

- 20 -

Roadmap

• Motivation

• Chimera Analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 21 -

Deadlock due to Weak-locks

No deadlocks among weak-locks
• Ordered by granularity: function-level > loop-level > instruction-level

Deadlock between weak-locks and original sync. ops. is possible

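[Diagram: T1 blocks in wait(cv) while holding a weak-lock; T2 needs that weak-lock before it can reach signal(cv), so the threads deadlock until the weak-lock times out]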

- 22 -

Weak-lock Time-out

A weak-lock may time out
• On a time-out, invoke a special system call to handle it

Weak-lock guarantee
• Only one thread holds a given weak-lock at any given time
• Mutual exclusion may be compromised, but this is sufficient for replay

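[Diagram: T1 holds the weak-lock while blocked in wait(cv); T2's weak-lock request times out, T2 becomes the current owner, reaches signal(cv), and wakes T1. The order in which weak-lock ownership changes hands is logged]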
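Here is a user-space sketch of a time-out weak-lock, assuming an `owner` field protected by an internal mutex. Chimera handles a time-out with a special system call. In this sketch, preemption is simulated by reassigning `owner`, and all names (`weak_lock_t`, `wl_acquire`, `wl_release`, `deadlock_demo`) are hypothetical. The demo reproduces the deadlock scenario: T1 holds the weak-lock across `wait(cv)`, T2's acquire times out and takes over, and T2 can then `signal(cv)`. Only one thread owns the weak-lock at a time. Mutual exclusion is lost while T1 is still inside its region, and the logged grant order is what replay would enforce.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical time-out weak-lock; preemption is simulated in user space
   by reassigning owner (Chimera uses a special system call instead). */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    int owner;        /* tid of the current owner, -1 if free */
    int log[16];      /* grant order, which replay would enforce */
    int len;
    int preemptions;
} weak_lock_t;

static void wl_acquire(weak_lock_t *w, int tid, long timeout_ms) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);   /* timedwait takes absolute time */
    ts.tv_sec += timeout_ms / 1000;
    ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }

    pthread_mutex_lock(&w->mu);
    while (w->owner != -1) {
        int rc = pthread_cond_timedwait(&w->cv, &w->mu, &ts);
        if (rc == ETIMEDOUT && w->owner != -1) {
            w->preemptions++;             /* time-out: take over ownership */
            break;
        }
    }
    w->owner = tid;                       /* exactly one owner at a time */
    assert(w->len < 16);
    w->log[w->len++] = tid;
    pthread_mutex_unlock(&w->mu);
}

static void wl_release(weak_lock_t *w, int tid) {
    pthread_mutex_lock(&w->mu);
    if (w->owner == tid) {                /* a preempted thread no longer owns it */
        w->owner = -1;
        pthread_cond_broadcast(&w->cv);
    }
    pthread_mutex_unlock(&w->mu);
}

static weak_lock_t wl = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                          -1, {0}, 0, 0 };
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* original sync */
static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
static int signalled;

static void *t2_body(void *arg) {
    wl_acquire(&wl, 2, 50);     /* T1 holds it while waiting: times out */
    pthread_mutex_lock(&m);
    signalled = 1;
    pthread_cond_signal(&c);    /* signal(cv) */
    pthread_mutex_unlock(&m);
    wl_release(&wl, 2);
    return arg;
}

/* The calling thread plays T1 and holds the weak-lock across wait(cv);
   without the time-out, T2 could never reach signal(cv). */
static void deadlock_demo(void) {
    pthread_t t2;
    wl_acquire(&wl, 1, 50);
    pthread_create(&t2, NULL, t2_body, NULL);
    pthread_mutex_lock(&m);
    while (!signalled)
        pthread_cond_wait(&c, &m);      /* wait(cv) */
    pthread_mutex_unlock(&m);
    wl_release(&wl, 1);                 /* no-op: T1 was preempted */
    pthread_join(t2, NULL);
}
```

After `deadlock_demo()` returns, the grant log is `[1, 2]` with one preemption. A replay would hand the weak-lock to T1 and then T2 in that order.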

- 23 -

Roadmap

• Motivation

• Chimera Analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 24 -

Implementation

Source-to-source Instrumentation
• Implemented in OCaml using CIL as a front end

Static analysis
• Data race detection: RELAY [Voung et al., FSE’07]
  − Includes all library source code for soundness (uClibc’s libc, libm, etc.)
• Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
  − Intra-procedural analysis for racy loops only

Runtime system
• Modified Linux kernel to record/replay program input
• Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

- 25 -

Evaluation Setup

Test Environment
• 2.66 GHz 8-core Xeon processor with 4 GB of RAM
• Different sets of inputs for profiling and for performance evaluation
• Average of five trials with 4 worker threads
• 2, 4, and 8 threads for scalability results

Benchmarks
• Desktop applications
  − aget, pfscan, and pbzip2
• Server programs
  − knot and apache
• SPLASH-2 suite
  − ocean, water-nsq, fft, and radix

- 26 -

Record and Replay Performance

[Figure: normalized performance overhead of record and replay (y-axis 0 to 2.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average]

• Recording: 39% overhead on average
• Replay: similar to recording; much lower for I/O-intensive programs

Chart callouts: 2.4% slowdown, 86% slowdown, 39% average

- 27 -

Effectiveness of Coarse-grained Weak-locks

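[Figure: normalized recording overhead (log scale, 1 to 100) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average, under five weak-lock configurations: instr; instr + func; instr + loop; instr + loop + func; instr + bb + loop + func. Off-scale bars are labeled 135, 251, and >100; callout: 53x]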

- 28 -

Effectiveness of Coarse-grained Weak-locks

[Figure: the same normalized recording overhead chart as on slide 27 (five weak-lock configurations; off-scale bars labeled 135, 251, and >100)]

• Coarse-grained weak-locks reduce the cost of instrumentation

- 29 -

Effectiveness of Coarse-grained Weak-locks

[Figure: the same normalized recording overhead chart as on slide 27 (five weak-lock configurations; off-scale bars labeled 135, 251, and >100)]

• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)

- 30 -

Effectiveness of Coarse-grained Weak-locks

[Figure: the same normalized recording overhead chart as on slide 27 (five weak-lock configurations; off-scale bars labeled 135, 251, and >100)]

• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)

- 31 -

Effectiveness of Coarse-grained Weak-locks

[Figure: the same normalized recording overhead chart as on slide 27 (five weak-lock configurations; off-scale bars labeled 135, 251, and >100)]

• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)

Chart callout: 1.39x (the 39% average recording overhead)

- 32 -

Breakdown of Recording Overhead

[Figure: normalized recording overhead (y-axis 1 to 2.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix, broken down by source of overhead]

• Weak-lock overhead = contention (waiting) cost + logging cost

Legend: func locks; loop locks; instr/bb locks; sync op & system log

- 33 -

Breakdown of Recording Overhead

[Figure: normalized recording overhead (y-axis 1 to 2.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix, with weak-lock costs split into waiting and logging]

Legend: func wait; loop wait; instr/bb wait; sync op & system log; func log; loop log; instr/bb log

• Weak-lock overhead = contention (waiting) cost + logging cost
• High loop-lock contention
• High instr/bb-lock contention

- 34 -

Scalability

[Figure: normalized recording overhead (y-axis 0 to 3.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average, with 2, 4, and 8 threads (2p, 4p, 8p)]

• Scientific applications scale worse due to imprecise symbolic bounds analysis

- 35 -

Conclusion

Goal: Software-only deterministic multiprocessor replay systems

Chimera Analysis
• Static data race analysis
  − Find and protect potential data races with weak-locks
  − Instruction/basic-block-level weak-locks
• Profiling non-concurrent code regions
  − Address the inadequacy of lockset-based algorithms
  − Function-level weak-locks
• Symbolic bounds analysis
  − Address the imprecision of conservative pointer analysis
  − Loop-level weak-locks

Low Recording Overhead
• 39% recording overhead for 4 worker threads

- 36 -

Thank you