Asymmetry Aware Scheduling Algorithms for Asymmetric Processors
Nagesh Lakshminarayana, Sushma Rao, Hyesoon Kim
Computer Science, Georgia Institute of Technology
Outline
• Background and Problem
• Application characteristics on AMP/SMP
• LJFPF Policy
• CJFPF Policy
• Conclusion
Heterogeneous Architectures
• A particularly interesting class of parallel machines is the heterogeneous architecture:
– Multiple types of Processing Elements (PEs) available on the same machine
[Figure: one PE of type A and four PEs of type B connected by an interconnect]
Heterogeneous Architectures
• Heterogeneous architectures are becoming very common:
– Multicore CPU + GPU
– IBM Cell processor
– Special accelerator
– Asymmetric processors (fast and slow cores): the focus of this talk
[Figure: asymmetric processor with a fast core and four slow cores]
Scheduling Problem: Multiple applications
[Figure: scalable and non-scalable applications to be scheduled on a machine with fast and slow cores]
Scheduling Problem: Multi-threaded application
[Figure: threads of one application to be scheduled on a machine with fast and slow cores]
Problem
How to schedule multi-threaded applications on Asymmetric Multiprocessors (AMP)?
Experimental Methodology
• Use a 1.87 GHz two-socket quad-core machine to measure performance
• Use SpeedStep technology to emulate an AMP
Configuration     Processor frequencies
All-slow (SMP)    All 8 processors run at 1.6 GHz
One-fast (AMP)    1 processor runs at 1.87 GHz, 7 processors run at 1.6 GHz
Half-half (AMP)   4 processors run at 1.87 GHz, 4 processors run at 1.6 GHz
All-fast (SMP)    All processors run at 1.87 GHz
Performance Results on AMP/SMP
[Figure: normalized execution time (y-axis, 0.8 to 1.05) for the All-slow, One-fast, Half-half, and All-fast configurations]
Slow-Limited Applications
[Figure: thread timeline ending at a barrier]
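A minimal sketch of why a barrier can make an application slow-limited. It assumes (a simplification, not a measurement from the talk) that a thread's run time is its work divided by its core's clock frequency. The function name and the work amounts are hypothetical; the 1.87 GHz and 1.6 GHz values come from the experimental setup.

```python
def barrier_phase_time(work_per_thread, core_speeds):
    """Time until every thread reaches the barrier: the slowest thread decides."""
    if len(work_per_thread) != len(core_speeds):
        raise ValueError("one core speed is needed per thread")
    return max(w / s for w, s in zip(work_per_thread, core_speeds))


FAST, SLOW = 1.87, 1.6   # GHz, from the experimental setup
work = [100.0] * 8       # hypothetical: equal work for each of 8 threads

all_slow = barrier_phase_time(work, [SLOW] * 8)
half_half = barrier_phase_time(work, [FAST] * 4 + [SLOW] * 4)
all_fast = barrier_phase_time(work, [FAST] * 8)
```

Under this model Half-half finishes the phase no sooner than All-slow, because every thread waits at the barrier for the threads on the slow cores.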
Middle-perf Benchmarks
[Figure: thread timeline ending at a barrier]
• Similar to a slow-limited benchmark, but the sequential section is much longer
Unstable Benchmarks
[Figure: thread timeline with multiple barriers]
• Lots of barriers
• Asymmetric workloads
PARSEC Benchmarks
Application     Locks        Barriers   Cond. Variables   AMP performance category
Blackscholes    39           8          0.000             slow-limited
Bodytrack       6824702      111160     0.003             unstable
Canneal         34           0          0.003             middle-perf
dedup           10002625     0          0.009             unstable
ferret          1422579      0          0.014             slow-limited
facesim         7384488      0          0.03              middle-perf
Fluidanimate    1153407308   31998      0.02              unstable
Freqmine        39           0          0.12              middle-perf
streamcluster   1379         633174     0.013             middle-perf
swaptions       9            0          0.00              slow-limited
vips            11           0          0.0049            unstable
x264            207692       0          13793             middle-perf
LJFPF Policy
• Longest Job to a Fast Processor First
[Figure: jobs of different lengths running up to a barrier on fast and slow cores]
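A minimal sketch of the LJFPF idea, assuming one job per core and job lengths that are known in advance (the talk's current mechanism has the application send this information). The function names, the work units, and the two-core setup are hypothetical; this is not the authors' implementation.

```python
def ljfpf_assign(job_lengths, core_speeds):
    """Map job index -> core index, pairing the longest job with the fastest core."""
    if len(job_lengths) != len(core_speeds):
        raise ValueError("this sketch assumes exactly one job per core")
    jobs = sorted(range(len(job_lengths)), key=lambda j: job_lengths[j], reverse=True)
    cores = sorted(range(len(core_speeds)), key=lambda c: core_speeds[c], reverse=True)
    return dict(zip(jobs, cores))


def finish_time(job_lengths, core_speeds, assignment):
    """Time at which the last job completes (work / frequency model, an assumption)."""
    return max(job_lengths[j] / core_speeds[c] for j, c in assignment.items())


lengths = [240, 360]    # hypothetical work units for two jobs
speeds = [1.6, 1.87]    # core 0 is slow, core 1 is fast (GHz)

placement = ljfpf_assign(lengths, speeds)   # job 1 (longest) goes to core 1 (fastest)
reversed_placement = {0: 1, 1: 0}           # the opposite pairing, for comparison
```

With these numbers, `finish_time(lengths, speeds, placement)` is smaller than the same call with `reversed_placement`, since the longest job no longer runs on the slow core.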
How Does the Scheduler Know
• Length of work?
• Current mechanism: application sends the information
• Ongoing work: prediction mechanism
Evaluation
• Matrix Multiplication
– Sequential version
– Parallel version, symmetric workload
– Parallel version, asymmetric workload
Asymmetric Workload (Matrix Multiplication)
[Figure: normalized execution time (y-axis, 0.9 to 1.2) for work splits 300-300, 310-290, 320-280, 330-270, 340-260, 350-250, and 360-240 under All-fast, Half-half (LJFPF), Half-half (RR), and All-slow]
Real Application
• ITK (medical image processing toolkit)
– Open source, but a real application
Evaluation: MultiRegistration
• Kernel loop has 50 iterations
– 50 % 8 ≠ 0, so the iterations cannot be split into 8 equal parts
• Divide 50 iterations into 7, 7, 7, 7, 6, 6, 5, 5
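A short worked check of this split on the Half-half configuration, assuming each chunk's run time is its iteration count divided by the core frequency (a simplification; real iterations need not cost the same). The helper name and the two placements being compared are illustrative only.

```python
chunks = [7, 7, 7, 7, 6, 6, 5, 5]    # split from the slide
speeds = [1.87] * 4 + [1.6] * 4      # Half-half: 4 fast cores, then 4 slow cores (GHz)

total = sum(chunks)                  # all 50 iterations are covered


def phase_time(chunks_in_core_order, core_speeds):
    """Time until the slowest (chunk, core) pair finishes."""
    return max(n / s for n, s in zip(chunks_in_core_order, core_speeds))


# Longest chunks on the fast cores (the LJFPF idea) ...
longest_on_fast = phase_time(sorted(chunks, reverse=True), speeds)
# ... versus the opposite placement, longest chunks on the slow cores.
longest_on_slow = phase_time(sorted(chunks), speeds)
```

Under this model the 7-iteration chunks on 1.87 GHz cores and the 6-iteration chunks on 1.6 GHz cores finish at nearly the same time (about 3.74 versus 3.75 in these units), while placing a 7-iteration chunk on a slow core stretches the phase to about 4.4.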
Results: ITK Benchmark
[Figure: normalized execution time (y-axis, 0.92 to 1.04) for All-fast, Half-half (LJFPF), Half-half (RR), and All-slow; the chart is annotated with 2.3%]
Critical Section
[Figure: threads acquiring a lock around a critical section]
Critical Section Limited Workloads
[Figure: thread timelines for Case (a) and Case (b); legend: critical section, useful work, waiting]
Critical Section Effects
[Figure: speedup (y-axis, 0 to 9) for workloads with 10%, 15%, and 20% critical section (CS) under All-fast, Half-half, and All-slow]
• Half-half performs similarly to All-fast
CJFPF Policy
• Critical Job to a Fast Processor First Policy
[Figure: one fast core and three slow cores]
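A minimal sketch of the CJFPF selection step, assuming the scheduler can tell which thread is currently inside a critical section (how it learns this is not specified here and is treated as given). The function name, the ready-queue representation, and the fallback when no thread is critical are all hypothetical.

```python
def cjfpf_pick_for_fast_core(ready_threads):
    """Pick the thread that should run on the fast core.

    ready_threads: list of (thread_id, in_critical_section) pairs in queue order.
    Returns the first thread that is inside a critical section; if none is,
    falls back to the head of the queue (an assumption of this sketch).
    """
    for thread_id, in_critical_section in ready_threads:
        if in_critical_section:
            return thread_id
    return ready_threads[0][0] if ready_threads else None


queue = [("t0", False), ("t1", False), ("t2", True), ("t3", False)]
chosen = cjfpf_pick_for_fast_core(queue)   # "t2" holds the lock, so it gets the fast core
```

Running the critical job on the fast core shortens the time the lock is held, which in turn shortens the time the other threads spend waiting.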
CJFPF Results
[Figure: speedup (y-axis, 0 to 7) of CJFPF and RR for the 8-12, 16-24, and 40-60 configurations]
• The longer the critical section, the smaller the benefit of the CJFPF policy
Conclusion
• We evaluated the characteristics of multi-threaded applications on AMPs.
• Barriers and critical sections are important factors.
• We propose two new scheduling policies: longest job to a fast processor first (LJFPF) and critical job to a fast processor first (CJFPF)
– The scheduling policies improve performance for asymmetric workloads.
• Future work
– Develop a prediction mechanism
– Evaluate symmetric workloads on AMPs
– Other kinds of heterogeneous architectures
Thank you!