Parallelizing Data Race Detection
Benjamin Wester, Facebook
David Devecsery, Peter Chen, Jason Flinn, Satish Narayanasamy, University of Michigan
Data Races
Benjamin Wester - ASPLOS '13 2
[Figure: two threads, t1 and t2, with time flowing downward]
t1: lock(l); x = 1; unlock(l)
t2: x = 3; lock(l); x = 2; unlock(l)
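To make the example concrete, the sketch below enumerates every schedule of the two threads that respects the lock and looks for conflicting writes from different threads that can execute back to back. It is only an illustration: the names `T1`, `T2`, `interleavings`, `respects_lock`, and `racy_pairs` are hypothetical, and brute-force enumeration is not how the detectors in this talk work.

```python
from itertools import combinations

# Per-thread operation lists from the slide; "write" carries the value stored to x.
T1 = [("lock", "l"), ("write", 1), ("unlock", "l")]
T2 = [("write", 3), ("lock", "l"), ("write", 2), ("unlock", "l")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        chosen = set(slots)
        ia, ib = iter(a), iter(b)
        yield [("t1", next(ia)) if i in chosen else ("t2", next(ib))
               for i in range(n)]


def respects_lock(schedule):
    """Return True if no thread acquires the lock while another holds it."""
    holder = None
    for tid, (op, _) in schedule:
        if op == "lock":
            if holder is not None:
                return False
            holder = tid
        elif op == "unlock":
            holder = None
    return True


def racy_pairs():
    """Collect pairs of written values whose writes can be adjacent across threads."""
    pairs = set()
    for schedule in interleavings(T1, T2):
        if not respects_lock(schedule):
            continue
        for (ta, (opa, va)), (tb, (opb, vb)) in zip(schedule, schedule[1:]):
            if ta != tb and opa == "write" and opb == "write":
                pairs.add(frozenset((va, vb)))
    return pairs


print(sorted(sorted(p) for p in racy_pairs()))  # prints [[1, 3]]
```

Only the unprotected write `x = 3` can run directly next to `x = 1`; the writes `x = 1` and `x = 2` are always separated by an unlock and a lock of `l`.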
Race Detection
[Figure: timeline comparing normal run time with run time under a race detector (instrumentation + analysis)]
Detection is slow! 8x–200x
• Managed vs. binary
• Granularity
• Completeness
• Debug info gathered
Matches app concurrency
Race Detection
[Figure: timeline of normal run time, run time with a race detector, and run time with a parallel race detector]
How to use parallel hardware?
• Scale program?
  – Performance tied to app
• Parallel algorithm?
  – Lots of fine-grained dependencies
• Uniparallelism
  – Converts execution into parallel pipeline
  – Works for instrumentation
[Wallace CGO’07, Nightingale ASPLOS’08]
Uniparallelism
[Figure: timeline showing epoch-parallel execution and epoch-sequential execution]
• Scales well: Epochs are independent execution units
• More efficient: Synchronization
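The pipeline idea can be sketched as follows, under heavy simplification: a fast in-order pass records the program state at the start of each epoch, and a slow, instrumented pass then re-executes each epoch independently from its checkpoint. All names here (`run_epoch`, `sequential_checkpoints`, `analyze_epoch`, `uniparallel`) are hypothetical, the "program" is a list of pure functions on an integer, and Python threads stand in for the separate cores a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def run_epoch(state, ops):
    """Execute one epoch's operations without any instrumentation."""
    for op in ops:
        state = op(state)
    return state


def sequential_checkpoints(initial, epochs):
    """Fast pass: run the epochs in order, recording the state at each epoch start."""
    checkpoints = []
    state = initial
    for ops in epochs:
        checkpoints.append(state)
        state = run_epoch(state, ops)
    return checkpoints, state


def analyze_epoch(start_state, ops):
    """Slow pass for one epoch: re-execute from its checkpoint and log every state."""
    log = []
    state = start_state
    for op in ops:
        state = op(state)
        log.append(state)
    return log


def uniparallel(initial, epochs, workers=4):
    """Checkpoint in order, then analyze all epochs independently of one another."""
    checkpoints, final = sequential_checkpoints(initial, epochs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each epoch needs only its own checkpoint, so epochs can run concurrently.
        logs = list(pool.map(analyze_epoch, checkpoints, epochs))
    # Concatenate the per-epoch logs in epoch order.
    return [s for log in logs for s in log], final


def inc(v):
    return v + 1


def double(v):
    return v * 2


trace, final = uniparallel(1, [[inc, double], [inc], [double, inc]])
print(trace, final)  # prints [2, 4, 5, 10, 11] 11
```

The per-epoch logs are independent here because the logged "analysis" needs no state from earlier epochs.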
Add Detector…
• Scales well: Epochs are independent execution units
• More efficient: Synchronization & lock elision
Race detection depends on state!
[Figure: timeline with two points marked "Here!"]
Architecture
[Figure: three phases: Epoch-sequential, Epoch-parallel, Commit]
Moving Work
Identify meaningful fast subset of analysis state
– Low frequency, lightweight, self-contained
– Run in Epoch-Sequential phase
– Predicted and usable in parallel phase
Symbolic execution
– Replace missing state with symbols
– Defer some computation to commit phase
– Commit: replace symbols with concrete values
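A minimal sketch of the symbolic deferral, assuming the missing state is a variable's lockset at the start of an epoch. The parallel worker treats that initial lockset as a symbol and records only the intersections applied to it; the commit step substitutes the concrete value. The class `SymbolicLockset` and its methods are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class SymbolicLockset:
    """Stands for (initial ∩ constraint), where `initial` is unknown until commit.

    constraint=None means no intersection has been applied yet.
    """
    constraint: Optional[FrozenSet[str]] = None

    def intersect(self, held: Iterable[str]) -> "SymbolicLockset":
        """Record an intersection without knowing the initial lockset."""
        held = frozenset(held)
        if self.constraint is None:
            return SymbolicLockset(held)
        return SymbolicLockset(self.constraint & held)

    def commit(self, initial: Iterable[str]) -> FrozenSet[str]:
        """Replace the symbol with its concrete value and finish the computation."""
        initial = frozenset(initial)
        if self.constraint is None:
            return initial
        return initial & self.constraint


# Epoch-parallel phase: the starting lockset of x is not yet known.
ls_x = SymbolicLockset().intersect({"L", "M"}).intersect({"M", "N"})

# Commit phase: the previous epoch supplies the concrete starting lockset.
print(sorted(ls_x.commit({"L", "M"})))  # prints ['M']
```

Because set intersection is associative, folding the epoch's intersections first and applying the initial value last gives the same result as processing the accesses in order.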
Architecture
• Epoch-sequential: Instrument and predict fast subset of analysis state
• Epoch-parallel: Full instrumentation, symbolic execution
• Commit: Perform deferred work on concrete values
Parallel FastTrack
State: Vector clocks (e.g. ⟨0, 1, 0⟩) for each:
– Thread
– Lock
– Variable ×2: last read(s), last write
[Slide annotations: Fast, Slow]
Example computation: Check(read_x ≤ thread_i); write_x = thread_i
Transitive reduction: Use knowledge of happens-before relationship
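The state and the example check above can be sketched with a small sequential happens-before detector. This is a simplification, not the parallel algorithm from the talk: it keeps full vector clocks for every variable (FastTrack uses a more compact representation), it tracks only the most recent write, and the class `VectorClockDetector` and its method names are hypothetical.

```python
from collections import defaultdict


class VectorClockDetector:
    """Sequential happens-before race detector using full vector clocks."""

    def __init__(self, num_threads):
        n = num_threads
        # Each thread starts with its own component set to 1.
        self.thread = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
        self.lock = defaultdict(lambda: [0] * n)        # clock at last release
        self.last_read = defaultdict(lambda: [0] * n)   # per-thread read times
        self.last_write = defaultdict(lambda: [0] * n)  # clock of last write
        self.races = []

    @staticmethod
    def leq(a, b):
        """Pointwise comparison: a happened before (or equals) b."""
        return all(x <= y for x, y in zip(a, b))

    def acquire(self, t, l):
        # Join the lock's clock into the acquiring thread's clock.
        self.thread[t] = [max(a, b) for a, b in zip(self.thread[t], self.lock[l])]

    def release(self, t, l):
        # Publish the thread's clock on the lock, then advance the thread.
        self.lock[l] = list(self.thread[t])
        self.thread[t][t] += 1

    def read(self, t, x):
        c = self.thread[t]
        if not self.leq(self.last_write[x], c):
            self.races.append(("write-read", t, x))
        self.last_read[x][t] = c[t]

    def write(self, t, x):
        c = self.thread[t]
        if not self.leq(self.last_write[x], c):
            self.races.append(("write-write", t, x))
        if not self.leq(self.last_read[x], c):   # Check(read_x <= thread_i)
            self.races.append(("read-write", t, x))
        self.last_write[x] = list(c)             # write_x = thread_i


# One schedule of the two-thread example from the Data Races slide.
T1, T2 = 0, 1
det = VectorClockDetector(num_threads=2)
det.acquire(T1, "l"); det.write(T1, "x"); det.release(T1, "l")
det.write(T2, "x")                                # unprotected x = 3
det.acquire(T2, "l"); det.write(T2, "x"); det.release(T2, "l")
print(det.races)  # prints [('write-write', 1, 'x')]
```

The unprotected write by t2 is flagged because t2 has not yet synchronized with t1's release of `l`; the later write under the lock is ordered after both earlier writes and passes the check.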
Parallel Eraser
State:
– Set of locks held by thread (lockset)
– Lockset for each variable
– Position in state machine for each variable
[Slide annotations: Fast, Slow]
Example computation: If state_x == SHARED then ls_x = ls_x ∩ {L, M}
Lockset factorization: Factor out common behavior
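The three pieces of state listed above can be sketched as a small sequential lockset checker. It is a simplified illustration rather than the parallel algorithm from the talk: the first access of any kind moves a variable to the exclusive state, each variable is reported at most once, and the names `State`, `LocksetChecker`, and `access` are hypothetical.

```python
from enum import Enum


class State(Enum):
    VIRGIN = 0
    EXCLUSIVE = 1
    SHARED = 2
    SHARED_MODIFIED = 3


class LocksetChecker:
    """Sequential Eraser-style lockset checker (simplified)."""

    def __init__(self):
        self.held = {}      # thread -> set of locks currently held
        self.state = {}     # variable -> State
        self.owner = {}     # variable -> first accessing thread
        self.lockset = {}   # variable -> candidate locks (absent = all locks)
        self.warnings = []

    def acquire(self, t, l):
        self.held.setdefault(t, set()).add(l)

    def release(self, t, l):
        self.held[t].discard(l)

    def access(self, t, x, is_write):
        st = self.state.get(x, State.VIRGIN)
        if st is State.VIRGIN:
            # Simplification: any first access makes x exclusive to thread t.
            self.state[x] = State.EXCLUSIVE
            self.owner[x] = t
            return
        if st is State.EXCLUSIVE:
            if t == self.owner[x]:
                return
            st = State.SHARED_MODIFIED if is_write else State.SHARED
        elif st is State.SHARED and is_write:
            st = State.SHARED_MODIFIED
        self.state[x] = st
        # Refine the candidate lockset: ls_x = ls_x ∩ locks held by t.
        held = frozenset(self.held.get(t, ()))
        prev = self.lockset.get(x)
        ls = held if prev is None else prev & held
        self.lockset[x] = ls
        if st is State.SHARED_MODIFIED and not ls and x not in self.warnings:
            self.warnings.append(x)


# The two-thread example from the Data Races slide.
chk = LocksetChecker()
chk.acquire("t1", "l"); chk.access("t1", "x", True); chk.release("t1", "l")
chk.access("t2", "x", True)                       # unprotected x = 3
chk.acquire("t2", "l"); chk.access("t2", "x", True); chk.release("t2", "l")
print(chk.warnings)  # prints ['x']
```

The warning fires at the unprotected write: t2 holds no locks, so the candidate lockset for `x` becomes empty while the variable is in the shared-modified state.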
Parallel Performance
• Efficiency
• Scalability
• 2 worker threads as baseline
• 8-CPU platform
• Benchmarks:
  – SPLASH-2 (water, lu, ocean, fft, radix)
  – Parallel app: pbzip2
Efficiency
[Figure: efficiency results using exactly 2 cores. Annotations: median 13% faster; median 8% faster]
Scaling Parallel FastTrack
[Figure: scaling results for Parallel FastTrack. Labels: 2 worker threads; median 4.4x with 8 cores]
Scaling Parallel Eraser
[Figure: scaling results for Parallel Eraser. Labels: 2 worker threads; median 3.3x with 8 cores]
Conclusion
3-Phase Uniparallel architecture
– Parallel FastTrack
– Parallel Eraser
Parallel algorithms:
– Are more efficient
– Scale better
Questions?