Sound and Precise Analysis of Parallel Programs through Schedule Specialization
Jingyue Wu, Yang Tang, Gang Hu, Heming Cui, Junfeng Yang
Columbia University
Motivation
• Analyzing parallel programs is difficult.
[Figure: precision vs. soundness (# of analyzed schedules / # of total schedules). Static analysis covers all schedules but imprecisely; dynamic analysis is precise but covers only the few analyzed schedules.]
Schedule Specialization
• Precision: Analyze the program over a small set of schedules.
• Soundness: Enforce these schedules at runtime.
[Figure: in the precision/soundness space, schedule specialization makes the analyzed schedules and the enforced schedules coincide, unlike static analysis (all schedules, imprecise) and dynamic analysis (few schedules, unenforced).]
Enforcing Schedules Using Peregrine
• Deterministic multithreading
– e.g., DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
– Performance overhead, e.g., Kendo: 16%, Tern & Peregrine: 39.1%
• Peregrine
– Records schedules, and reuses them on a wide range of inputs.
– Represents schedules explicitly.
Framework
• Extract the control flow and data flow enforced by a set of schedules.
[Figure: the schedule specialization framework takes a program (C/C++ with Pthreads) and a schedule (a total order of synchronizations), and produces a specialized program with extra def-use chains.]
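To make the framework's input concrete, here is a minimal sketch of a schedule represented as a total order of synchronization events. The struct layout, event names, and the query function are illustrative assumptions, not Peregrine's actual schedule format.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical representation: a schedule is a totally ordered
 * list of (thread id, synchronization operation) pairs. */
typedef struct {
    int thread;          /* which thread performs the operation */
    const char *op;      /* e.g., "create", "lock", "unlock", "join" */
} SyncEvent;

/* The schedule from the running example: thread 0 creates threads
 * 1 and 2, thread 1 takes the lock before thread 2, and thread 0
 * joins both children. */
static const SyncEvent schedule[] = {
    {0, "create"}, {0, "create"},
    {1, "lock"},   {1, "unlock"},
    {2, "lock"},   {2, "unlock"},
    {0, "join"},   {0, "join"},
};
static const int schedule_len = sizeof(schedule) / sizeof(schedule[0]);

/* Position of the `occurrence`-th (thread, op) event in the total
 * order, or -1 if absent. Specialization answers queries like this
 * to pin program points to schedule events. */
int event_position(int thread, const char *op, int occurrence) {
    int seen = 0;
    for (int i = 0; i < schedule_len; ++i)
        if (schedule[i].thread == thread && strcmp(schedule[i].op, op) == 0)
            if (seen++ == occurrence)
                return i;
    return -1;
}
```

In this encoding, thread 1's lock (position 2) precedes thread 2's lock (position 4); that ordering is exactly what the specializer gets to assume when analyzing the program.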
Outline
• Example
• Control-Flow Specialization
• Data-Flow Specialization
• Results
• Conclusion
Running Example
int results[p_max];
int global_id = 0;

int main(int argc, char *argv[]) {
  int i;
  int p = atoi(argv[1]);
  for (i = 0; i < p; ++i)
    pthread_create(&child[i], 0, worker, 0);
  for (i = 0; i < p; ++i)
    pthread_join(child[i], 0);
  return 0;
}

void *worker(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  int my_id = global_id++;
  pthread_mutex_unlock(&global_id_lock);
  results[my_id] = compute(my_id);
  return 0;
}
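The slide elides several declarations. A self-contained, compilable version might look like the following; the value of p_max, the child array, global_id_lock, and the body of compute() are assumptions filled in for illustration only.

```c
#include <assert.h>
#include <pthread.h>

enum { p_max = 16 };               /* assumed bound on thread count */

int results[p_max];
int global_id = 0;
static pthread_t child[p_max];
static pthread_mutex_t global_id_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for the slide's compute(). */
static int compute(int id) { return id * id; }

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&global_id_lock);
    int my_id = global_id++;       /* race-free only because of the lock */
    pthread_mutex_unlock(&global_id_lock);
    results[my_id] = compute(my_id);
    return 0;
}

/* The body of main() from the slide, parameterized on p for testing. */
int run_example(int p) {
    int i;
    global_id = 0;
    for (i = 0; i < p; ++i)
        pthread_create(&child[i], 0, worker, 0);
    for (i = 0; i < p; ++i)
        pthread_join(child[i], 0);
    return 0;
}
```

With p = 2, the ids 0 and 1 are each handed out exactly once, so results[0] and results[1] hold compute(0) and compute(1) no matter which worker runs first; which thread gets which id, however, depends on the schedule.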
[Figure: an example schedule with three threads. Thread 0 creates thread 1 and thread 2, then joins both; thread 1 acquires and releases global_id_lock before thread 2 does. Race-free?]
Control-Flow Specialization
(the main() from the running example above)
[Figure: the control-flow graph of main (atoi; i = 0; i < p; create; ++i; join; return) is matched against the schedule; each create and join event in the schedule pins down one iteration of the corresponding loop.]
Control-Flow Specialized Program

int main(int argc, char *argv[]) {
  int i;
  int p = atoi(argv[1]);
  i = 0;                                          // i < p == true
  pthread_create(&child[i], 0, worker.clone1, 0);
  ++i;                                            // i < p == true
  pthread_create(&child[i], 0, worker.clone2, 0);
  ++i;                                            // i < p == false
  i = 0;                                          // i < p == true
  pthread_join(child[i], 0);
  ++i;                                            // i < p == true
  pthread_join(child[i], 0);
  ++i;                                            // i < p == false
  return 0;
}

[Figure: the fully unrolled control-flow graph of the specialized main, with both loops unrolled for two threads.]
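For the specialization to be sound, the specialized code must only run when the input actually reproduces the analyzed schedule. The following is a hypothetical sketch of such a runtime guard (the dispatch function, the precondition p == 2, and the placeholder bodies are illustrative assumptions, not the paper's actual enforcement mechanism, which replays the recorded synchronization order):

```c
#include <assert.h>

/* Hypothetical dispatcher: run the schedule-specialized path only
 * when the input satisfies the schedule's precondition (here,
 * exactly two worker threads); otherwise fall back to the generic
 * program. */
typedef int (*main_fn)(int p);

static int specialized_main_p2(int p) {
    (void)p;
    /* would run the unrolled create/create/join/join path */
    return 2;   /* placeholder result tag: specialized path taken */
}

static int generic_main(int p) {
    (void)p;
    /* would run the original loops over p */
    return 1;   /* placeholder result tag: generic path taken */
}

int dispatch(int p) {
    main_fn f = (p == 2) ? specialized_main_p2 : generic_main;
    return f(p);
}
```

The design point this illustrates: the guard is what converts "we analyzed only these schedules" (precision) into "only these schedules can occur" (soundness).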
More Challenges on Control-Flow Specialization
• Ambiguity
[Figure: a caller and a callee, each containing a call; synchronization events S1 (call) and S2 (ret).]
• A schedule may have too many synchronizations.
Data-Flow Specialization

int global_id = 0;

void *worker.clone1(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  int my_id = global_id++;
  pthread_mutex_unlock(&global_id_lock);
  results[my_id] = compute(my_id);
  return 0;
}

void *worker.clone2(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  int my_id = global_id++;
  pthread_mutex_unlock(&global_id_lock);
  results[my_id] = compute(my_id);
  return 0;
}
[Figure: the schedule again. With the total order fixed, the initial global_id = 0 flows into thread 1’s global_id++, whose result flows into thread 2’s global_id++.]
Data-Flow Specialization

int global_id = 0;

void *worker.clone1(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  global_id = 1;
  pthread_mutex_unlock(&global_id_lock);
  results[0] = compute(0);
  return 0;
}

void *worker.clone2(void *arg) {
  pthread_mutex_lock(&global_id_lock);
  global_id = 2;
  pthread_mutex_unlock(&global_id_lock);
  results[1] = compute(1);
  return 0;
}
[Figure: under the enforced schedule, constant propagation resolves my_id to 0 in thread 1 and 1 in thread 2 (global_id goes 0 → 1 → 2), so the writes results[0] and results[1] target distinct locations.]
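The step above works because, once the schedule fixes a total order, cross-thread def-use chains behave like straight-line code. A toy illustration of that idea (this linearization is for exposition only; it is not the paper's algorithm): replay the two critical sections in schedule order and observe that every read of global_id sees a known constant.

```c
#include <assert.h>

/* Toy constant folding over the schedule-linearized trace of
 * global_id. Thread 1's critical section runs entirely before
 * thread 2's, so the interleaving is fixed and each my_id is a
 * compile-time constant. */
int fold_my_ids(int my_id_out[2]) {
    int global_id = 0;            /* initial def */
    my_id_out[0] = global_id++;   /* thread 1: my_id = 0, global_id = 1 */
    my_id_out[1] = global_id++;   /* thread 2: my_id = 1, global_id = 2 */
    return global_id;             /* final value after both increments */
}
```

Without the schedule, the two increments could interleave either way and neither my_id would fold to a constant.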
More Challenges on Data-Flow Specialization
• Must/may alias analysis (e.g., for global_id)
• Reasoning about integers (e.g., results[0] = compute(0) vs. results[1] = compute(1))
• Many def-use chains
Evaluation
• Applications
– Static race detector
– Alias analyzer
– Path slicer
• Programs
– PBZip2 1.1.5
– aget 0.4.1
– 8 programs in SPLASH2
– 7 programs in PARSEC
Static Race Detector: # of False Positives

Program         Original  Specialized
aget                  72            0
PBZip2               125            0
fft                   96            0
blackscholes           3            0
swaptions            165            0
streamcluster          4            0
canneal               21            0
bodytrack              4            0
ferret                 6            0
raytrace             215            0
cholesky              31            7
radix                 53           14
water-spatial       2447         1799
lu-contig             18           18
barnes               370          369
water-nsquared       354          333
ocean                331          292
Static Race Detector: Harmful Races Detected
• 4 in aget
• 2 in radix
• 1 in fft
Precision of Schedule-Aware Alias Analysis
[Figure: precision results of the schedule-aware alias analysis.]
Conclusion and Future Work
• Designed and implemented a schedule specialization framework
– Analyzes the program over a small set of schedules
– Enforces these schedules at runtime
• Built and evaluated three applications
– Easy to use
– Precise
• Future work
– More applications
– Similar specialization ideas for sequential programs
Related Work
• Program analysis for parallel programs
– Chord (PLDI ’06), RADAR (PLDI ’08), FastTrack (PLDI ’09)
• Slicing
– Horgan (PLDI ’90), Bouncer (SOSP ’07), Jhala (PLDI ’05), Weiser (PhD thesis), Zhang (PLDI ’04)
• Deterministic multithreading
– DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
• Program specialization
– Consel (POPL ’93), Glück (ISPL ’95), Jørgensen (POPL ’92), Nirkhe (POPL ’92), Reps (PDSPE ’96)
Backup Slides
Specialization Time
Handling Races
• We do not assume data-race freedom.
• We could, if our only goal were optimization.
Input Coverage
• Use runtime verification for the inputs not covered
• A small set of schedules can cover a wide range of inputs