Scalability-Based Manycore Partitioning
Hiroshi Sasaki, Kyushu University
Koji Inoue, Kyushu University
Teruo Tanimoto, The University of Tokyo
Hiroshi Nakamura, The University of Tokyo
PACT 2012
Presented by Kim, Jong-yul, 2013. 7. 31
Contents
• Motivation
• SBMP Scheduler
  • Scalability Prediction
  • Core Partition
  • Core Donation
  • Phase Change Detection
• Evaluation Results
• Conclusions
2 / 27
Prospects
• Limits to increasing clock frequency (F)
  • ILP, power wall, transistor scaling
• Multi-core, many-core system
[Figure: one system running APP1, APP2, APP3, … at the same time (multi-threaded multiprogramming)]
3 / 27
Problem
• Traditional OS: assigns an equal share of CPU to all running apps
• Programs have different scalability
Normalized Turnaround Time = (clock cycles when multiprogrammed with others) / (clock cycles when solo-run)
[Figure: performance per workload, with the average]
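As a small worked example of the metric above, the sketch below computes the normalized turnaround time of each program and the workload average. The cycle counts are hypothetical values chosen for illustration, not measurements from the talk.

```python
def normalized_turnaround_time(cycles_multiprogrammed, cycles_solo):
    """Co-run cycles divided by solo-run cycles; 1.0 means no slowdown."""
    if cycles_solo <= 0:
        raise ValueError("solo-run cycle count must be positive")
    return cycles_multiprogrammed / cycles_solo


def average_ntt(runs):
    """runs: list of (cycles_multiprogrammed, cycles_solo) pairs, one per program."""
    values = [normalized_turnaround_time(m, s) for m, s in runs]
    return sum(values) / len(values)


# Hypothetical cycle counts for a four-program workload (illustration only).
workload = [(3.0e9, 1.5e9), (2.4e9, 2.0e9), (5.0e9, 2.0e9), (1.8e9, 1.2e9)]
print(average_ntt(workload))
```

Lower is better: a value of 2.0 means a program took twice as long as it would have taken running alone.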
4 / 27
Linux: 2.04
Best Partitioning: 1.38
Experimental System
allocation unit
5 / 27
SBMP Scheduler
• Scalability Prediction
• Core Partitioning
• Core Donation
• Phase Change Detection
6 / 27
Overview
• Assign cores considering scalability of applications
• SBMP: Scalability-Based Manycore Partitioning scheduler
[Figure: SBMP scheduler state diagram with the labels Partitioning, Steady, Scalability Prediction, Core Partitioning, Core Donation, and Detect]
7 / 27
8 / 27
Scalability Prediction (1/2)
• Cumulative retired instructions per second (IPS): little effect from # of cores
[Figure: total # of instructions per workload, annotated "8%"]
9 / 27
Scalability Prediction (2/2)
• If obtained directly…
  • Warm up branch prediction & cache system
  • Need 8 allocations (6, 12, 18, …, 48)
• Simple model
  • 3 coefficients (α, β, γ)
  • 3 samplings: 1 single core + 2 different configurations
[Equation not recovered from the slide; its terms were annotated "Performance", "Amdahl's law", and "Overhead caused by additional core"]
Over 3 seconds
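The slide's model equation did not survive extraction, so the sketch below assumes one plausible three-coefficient form purely for illustration: relative execution time t(n) = α + β/n + γ·n, where α + β/n is the Amdahl's-law part and γ·n stands in for the overhead caused by additional cores. With three samples (one single-core run plus two other configurations) the coefficients follow from a 3×3 linear system. The functional form, the function names, and the sample values are all assumptions, not the authors' model.

```python
def solve3(a, b):
    """Solve a 3x3 linear system a*x = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine the coefficients")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_model(samples):
    """samples: three (cores, relative_time) pairs.
    Assumed form: t(n) = alpha + beta / n + gamma * n."""
    rows = [[1.0, 1.0 / n, float(n)] for n, _ in samples]
    times = [t for _, t in samples]
    return solve3(rows, times)


def predict_performance(coeffs, n):
    """Predicted performance (inverse of relative time) with n cores."""
    alpha, beta, gamma = coeffs
    return 1.0 / (alpha + beta / n + gamma * n)


# Hypothetical samples: one single-core run plus two other configurations.
samples = [(1, 1.0), (6, 0.30), (12, 0.25)]
coeffs = fit_model(samples)
# Predicted performance for the 8 die-granularity allocations (6, 12, ..., 48).
table = {n: predict_performance(coeffs, n) for n in range(6, 49, 6)}
```

The point of such a model is that three short samplings stand in for measuring all 8 allocations directly.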
10 / 27
11 / 27
Core Partitioning (1/2)
[Figure: relative performance vs. # of cores, with curves labeled High, Medium, and Low]
12 / 27
Core Partitioning (2/2)
• Scalability table for each program
  • Key-value
    • Key: # of cores
    • Value: performance with [key] cores
• Goal
[Equation not recovered from the slide; its terms were labeled "Single-run" and "Multiprogrammed"]
• Hill climbing algorithm: near-optimal assignment
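A minimal sketch of hill climbing over per-program scalability tables, under stated assumptions. The objective used here (mean slowdown relative to a solo run on all 48 cores) is an assumption, since the slide's goal formula is not recoverable. The 6-core allocation unit and the 48-core total come from the slides; the tables are hypothetical and generated from Amdahl's law only for illustration.

```python
from itertools import permutations

DIE = 6          # allocation unit: one processor die (6 cores)
TOTAL_DIES = 8   # 48 cores in total


def cost(alloc, tables):
    """Assumed objective: mean slowdown versus a solo run on all cores."""
    total = 0.0
    for dies, table in zip(alloc, tables):
        total += table[TOTAL_DIES * DIE] / table[dies * DIE]
    return total / len(alloc)


def hill_climb(tables):
    """Return (cores per program, cost). Assumes at most TOTAL_DIES programs."""
    n = len(tables)
    # Arbitrary starting point: spread the dies as evenly as possible.
    alloc = [TOTAL_DIES // n + (1 if i < TOTAL_DIES % n else 0) for i in range(n)]
    best = cost(alloc, tables)
    improved = True
    while improved:
        improved = False
        # Change a single element at a time: move one die from src to dst.
        for src, dst in permutations(range(n), 2):
            if alloc[src] <= 1:
                continue  # every program keeps at least one die
            trial = alloc[:]
            trial[src] -= 1
            trial[dst] += 1
            c = cost(trial, tables)
            if c < best - 1e-12:
                alloc, best, improved = trial, c, True
                break
    return [d * DIE for d in alloc], best


def make_table(serial_fraction):
    """Hypothetical scalability table (key: # of cores, value: performance)."""
    return {n: 1.0 / (serial_fraction + (1 - serial_fraction) / n)
            for n in range(DIE, TOTAL_DIES * DIE + 1, DIE)}


tables = [make_table(0.01), make_table(0.05), make_table(0.30), make_table(0.50)]
cores, avg_slowdown = hill_climb(tables)
print(cores, avg_slowdown)
```

Each accepted move strictly lowers the cost, so the search terminates at a local optimum, which may or may not be the global one.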
13 / 27
14 / 27
Core Donation
• 1 program for each processor die
• CPU utilization
• Donor: CPU utilization ratio < threshold (70%)
• Donee: most beneficial one
  • Utilization, scalability
• Priority: Donee < Donor
• Finer granularity
• Processor die (6 cores)
[Figure: timeline for Core1 and Core2, with Program1 as the donor and Program2 as the donee]
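The donor and donee selection can be sketched as follows. Only the 70% utilization threshold comes from the slide. The `gain` field (a predicted speedup from one extra core, standing in for scalability) and the product used as a benefit score are assumptions made for illustration, and the sketch does not model the Donee < Donor priority rule.

```python
UTILIZATION_THRESHOLD = 0.70  # from the slide: donors run below 70% CPU utilization


def pick_donation(programs):
    """programs: list of dicts with 'name', 'utilization' (0..1) and
    'gain' (assumed: predicted speedup from one extra core).
    Returns (donor_name, donee_name), or None if no donation is possible."""
    donors = [p for p in programs if p["utilization"] < UTILIZATION_THRESHOLD]
    candidates = [p for p in programs if p["utilization"] >= UTILIZATION_THRESHOLD]
    if not donors or not candidates:
        return None
    # The least-utilized program gives up idle core time first.
    donor = min(donors, key=lambda p: p["utilization"])
    # Assumed benefit score: busy programs that scale well benefit most.
    donee = max(candidates, key=lambda p: p["utilization"] * p["gain"])
    return donor["name"], donee["name"]


# Hypothetical snapshot of three running programs.
programs = [
    {"name": "Program1", "utilization": 0.40, "gain": 1.1},
    {"name": "Program2", "utilization": 0.95, "gain": 1.6},
    {"name": "Program3", "utilization": 0.90, "gain": 1.2},
]
print(pick_donation(programs))
```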
15 / 27
16 / 27
17 / 27
Detection (1/2)
1. Creation or termination of program
2. Phase transition detected in any of the programs
[Figure: performance plot]
18 / 27
Detection (2/2) – Phase Prediction
• SBMP scheduler monitors performance every epoch (2.5s)
• Threshold ( > or <
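A minimal sketch of threshold-based phase detection on per-epoch performance samples. The threshold value is cut off in the transcript, so `threshold` is left as a caller-supplied parameter, and the relative-change test is an assumed interpretation of the "> or <" comparison. The IPS values are hypothetical.

```python
def phase_changed(prev_ips, curr_ips, threshold):
    """True when IPS rose or fell by more than `threshold` (a fraction)
    relative to the previous epoch."""
    if prev_ips <= 0:
        raise ValueError("previous IPS must be positive")
    change = (curr_ips - prev_ips) / prev_ips
    return change > threshold or change < -threshold


def detect_phases(ips_per_epoch, threshold):
    """Indices of epochs (one sample per 2.5 s epoch) where a phase change is flagged."""
    return [i for i in range(1, len(ips_per_epoch))
            if phase_changed(ips_per_epoch[i - 1], ips_per_epoch[i], threshold)]


# Hypothetical IPS samples; 0.2 is an illustrative threshold, not the talk's value.
ips = [1.00e9, 1.02e9, 0.55e9, 0.56e9, 1.10e9]
print(detect_phases(ips, threshold=0.2))  # prints [2, 4]
```

A flagged epoch would send the scheduler back to scalability prediction and core partitioning, as would the creation or termination of a program.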
19 / 27
Evaluation
• Core Partitioning
• Phase Prediction
• Core Donation
• Overall Performance
20 / 27
Experimental System
• PARSEC benchmark suite 2.1
Processor               4 × AMD Opteron 6172
# of dies / processor   2
# of cores / die        6
Total # of cores        48
L3 cache size           12 MB / socket
Main memory             96 GB DDR3 PC3-10600
Linux kernel            2.6.37.6
21 / 27
Core Partitioning
• SBMP-base
  • Scalability Prediction + Core Partitioning
• Single-phase application (2 Medium + 2 Low)
[Figure: performance per workload, with the average]
Linux: 1.88
SBMP-base: 1.54
22 / 27
Phase Prediction
• SBMP-PP (Phase Prediction)
  • SBMP-base + Phase Prediction
• Multiple-phase application
Workloads
Linux: 1.89
SBMP-base: 2.09
SBMP-PP: 1.77
23 / 27
Core Donation
• SBMP-CD (Core Donation)
  • SBMP-PP + Core Donation
• 2 low CPU utilization + 2 normal
Workloads
Linux: 2.06
SBMP-PP: 1.68
SBMP-CD: 1.60
24 / 27
Overall Results
• All programs
Linux: 1.83
SBMP-base: 1.99
SBMP-PP: 1.70 (8%)
SBMP-CD: 1.65 (11%)
72 Workloads
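The slides do not say how the percentages are derived. They appear consistent with the ratio of Linux's normalized turnaround time to the scheduler's, minus one, and the sketch below checks that reading against the reported numbers. This is an inference from the figures, not a definition stated in the talk.

```python
def improvement_percent(baseline_ntt, scheduler_ntt):
    """Assumed definition: (baseline / scheduler - 1) * 100."""
    return (baseline_ntt / scheduler_ntt - 1.0) * 100.0


linux = 1.83
results = {"SBMP-base": 1.99, "SBMP-PP": 1.70, "SBMP-CD": 1.65}
for name, ntt in results.items():
    print(f"{name}: {improvement_percent(linux, ntt):+.1f}%")
```

Under this reading SBMP-PP comes out at about 8% and SBMP-CD at about 11%, matching the slide, while SBMP-base is slower than Linux across all 72 workloads.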
25 / 27
Conclusions
• OS scheduling on manycore systems
  • Multiple multi-threaded applications
• SBMP Scheduler
  • Dynamic scalability prediction + core partitioning
  • Phase recognition
  • Core donation
• 11% over Linux
26 / 27
QnA
27 / 27
Hill Climbing Algorithm
• Find near-optimal solution
• Start with arbitrary solution
• Incrementally change a single element
28 / 27