IBM Research
© 2010 IBM Corporation
Guarded Power Gating in a Multi-core Setting
Niti Madan, Alper Buyuktosunoglu, Pradip Bose, IBM T.J. Watson
Murali Annavaram, USC
June 2010
Outline
Motivation
Queuing Model based Methodology
Results
Conclusions and Future Work
Power Management through Power Gating
– Use header or footer transistor to power-gate the idle circuit
– Apply “sleep” to header or footer => turn off voltage
– Can be applied at unit-level (intra-core or small-knob)
– Can be applied at core-level (per-core or big-knob)
[Figure: power-gated logic block; labels: Vdd, Sleep, Virtual Vdd, Logic Block]
Predictive Power Gating
Anita Lungu, Pradip Bose, Alper Buyuktosunoglu, Daniel Sorin, "Dynamic power gating with quality guarantees," ISLPED '09
• Power-gating algorithms are predictive by nature
• Frequent mispredictions can burn more power than they save
• The break-even point depends on block size and technology parameters
• Lungu et al. (ISLPED '09) proposed a guard mechanism for unit-level power-gating algorithms
• Mispredictions are a concern for per-core power-gating algorithms because the break-even point is much higher for cores
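[Figure: cumulative energy savings over time, from "Decide to power gate" to "Wake-up". Gating first incurs an energy overhead; cumulative savings cross zero at the break-even point.]
Ex. break-even point = 10 cycles
– …10100 0000000000… : decide to power gate; correct prediction => save power
– …10100 001…………. : decide to power gate; the block is busy again before the break-even point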
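The break-even argument on this slide can be sketched numerically. The sketch below assumes a simplified energy model that is not given in the slides: every gated cycle saves one unit of leakage energy, and each gating event costs a fixed overhead equal to the savings of `break_even_cycles` gated cycles. The function name and the model are illustrative assumptions.

```python
def net_energy_saved(idle_cycles: int, break_even_cycles: int = 10) -> int:
    """Net energy (in per-cycle leakage units) from gating one idle period.

    Assumed model (not from the slides): gating costs a fixed overhead equal
    to the energy saved over `break_even_cycles` gated cycles, and every
    gated cycle saves one unit.
    """
    return idle_cycles - break_even_cycles


# A long idle period pays back the overhead; a short one does not.
print(net_energy_saved(25))  # 15: correct prediction, energy is saved
print(net_energy_saved(10))  # 0: the overhead is just recovered
print(net_energy_saved(2))   # -8: misprediction, energy is wasted
```

Under this model, any idle period shorter than the break-even point yields a negative result, which is the case a guard mechanism is meant to catch.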
Power Gating Scenarios
Exploiting the two dimensions of utilization to power-gate idle units or cores
– System Utilization (OS perspective) triggers the big-knob
– Resource Utilization (Core's perspective) triggers the small-knob
• Do we PG cores or execution units or both?
How can we maximize the power-saving opportunities provided by both the small and big knobs?
[Figure: activity of Core 1 to Core 4 over time: (a) Baseline 4-core system, (b) Folded 2-core system]
Goals of this study
Explore the trade-offs between unit-level/small-knob power gating algorithms and per-core/big-knob power gating algorithms for a range of latencies/parameters
Leverage analytical models for early-stage evaluation
Make a case for a guard mechanism for per-core power gating
Sriram Vajapeyam, Pradip Bose
Queuing Theory Based Analytical Model
Representation of multi-processor workloads as a queuing system
– Cores are servers
– Processing tasks are customer requests
– Tasks are processed in FCFS order
– Queuing system tracks average customer waiting time, service time and server utilization
Evaluate our power-management policies using a C++ based queuing model simulator: "QUTE"
[Figure: queuing system: customer arrivals -> queue -> server(s) -> departures]
Overview of QUTE Framework
Simulation of Queuing Models (G/G/N/k/inf/FCFS)
– Faster than cycle-accurate simulations
– Easy to explore design-space early on
Statistical Workload Generation Parameters:
– Task Arrival Times: Exponential Distribution
– Task Lengths: Normal/Exponential/Uniform Distributions
Evaluation Metrics:
– Performance: Average response time
– Power: Average number of cores switched on
– Other Stats: Server utilization, variance in service demand etc.
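As an illustration of the statistical workload generation described above, here is a minimal Python sketch. It is not QUTE code (QUTE is a C++ simulator whose internals are not shown in the slides). The function name, the standard deviation of the task length, and the clamping of lengths to a positive value are all assumptions made for the example.

```python
import random


def generate_workload(num_tasks, mean_interarrival_us, mean_length_us,
                      length_stddev_us, seed=0):
    """Return a list of (arrival_time_us, task_length_us) tuples."""
    rng = random.Random(seed)
    now = 0.0
    tasks = []
    for _ in range(num_tasks):
        # Exponential inter-arrival times; expovariate takes the rate (1/mean).
        now += rng.expovariate(1.0 / mean_interarrival_us)
        # Normally distributed task length, clamped so it stays positive
        # (the clamp is an assumption of this sketch).
        length = max(1.0, rng.gauss(mean_length_us, length_stddev_us))
        tasks.append((now, length))
    return tasks


# 10000 tasks, 300 µs mean inter-arrival time, 5 ms mean task length.
# The 500 µs standard deviation is an assumed value.
tasks = generate_workload(10000, mean_interarrival_us=300.0,
                          mean_length_us=5000.0, length_stddev_us=500.0)
```

Fixing the seed keeps runs repeatable, which makes it easier to compare power-gating policies on an identical task stream.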
QUTE Framework
[Figure: task arrivals (arrival rate distribution using a random number generator) enter a FIFO task queue that feeds cores C1, C2, C3, C4, ...; service time or task length follows a statistical distribution; all cores queue back the task at the end of a time slice]
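The "queue back at the end of a time slice" behavior can be sketched as follows. This is a simplified illustration, not the QUTE implementation: it ignores arrivals during the slice and does not reuse a core that frees up mid-slice. The function name and the example values are assumptions.

```python
from collections import deque


def run_time_slice(queue, num_cores, time_slice_us):
    """Advance the system by one time slice.

    `queue` holds the remaining service time (µs) of each waiting task.
    Up to `num_cores` tasks are taken from the head of the FIFO queue, each
    runs for at most one time slice, and unfinished tasks go back to the tail.
    Returns the number of tasks that completed in this slice.
    """
    running = [queue.popleft() for _ in range(min(num_cores, len(queue)))]
    completed = 0
    for remaining in running:
        remaining -= time_slice_us
        if remaining > 0:
            queue.append(remaining)  # queue the unfinished task back
        else:
            completed += 1
    return completed


queue = deque([2500.0, 800.0, 1000.0, 4000.0, 300.0])
done = run_time_slice(queue, num_cores=4, time_slice_us=1000.0)
print(done, list(queue))  # 2 [300.0, 1500.0, 3000.0]
```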
Big Knob Modeling
Implemented a simple idleness-triggered heuristic:
Set Idleness Threshold (say to 0.5 msec)
Every 0.5 msec (i.e. the idleness threshold),
– Scan all cores
– Identify cores idle for > idleness threshold
– Switch off all such cores (except, make sure there is always at least one core ON, either free or active)
When a task arrives at the head of the task queue:
– If there is no free core,
  • If there is a switched-off core, switch it ON
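The idleness-triggered heuristic above can be sketched in a few lines. The `Core` class, the function names, and the choice to wake the first switched-off core are assumptions made for illustration. The switch-on latency (OnLat) is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class Core:
    on: bool = True
    busy: bool = False
    idle_since_us: float = 0.0  # time at which the core last became idle


def gate_idle_cores(cores, now_us, idleness_threshold_us):
    """Periodic scan: switch off cores idle longer than the threshold,
    but always leave at least one core ON (free or active)."""
    cores_on = sum(c.on for c in cores)
    for c in cores:
        idle_for = now_us - c.idle_since_us
        if (c.on and not c.busy and idle_for > idleness_threshold_us
                and cores_on > 1):
            c.on = False
            cores_on -= 1


def dispatch(cores):
    """Task at the head of the queue: use a free core if there is one,
    otherwise switch ON a switched-off core. Returns the chosen core,
    or None if every core is ON and busy."""
    for c in cores:
        if c.on and not c.busy:
            c.busy = True
            return c
    for c in cores:
        if not c.on:
            c.on = True   # OnLat is not modeled in this sketch
            c.busy = True
            return c
    return None


cores = [Core() for _ in range(4)]
cores[0].busy = True
gate_idle_cores(cores, now_us=1000.0, idleness_threshold_us=500.0)
states_after_scan = [c.on for c in cores]   # only the busy core stays ON
woken = dispatch(cores)                     # no free core, so one is woken
```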
Small Knob Modeling
Cannot directly simulate workload phases
Each core can have N power states
– 2 states for this version: nominal power state and low power state (75% power)
Generate statistical distribution (Gaussian) of each power state duration
Each task always starts in the nominal power state
– Switch between power states in a given time-slice
Parameters: Nominal (Hi) and Low (Lo) power state means, Transition overhead
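A minimal sketch of the two-state small-knob model follows. It alternates Hi and Lo phases with Gaussian durations until the task length is covered. The function name, the standard deviation, and the clamping of each phase to at least 1 µs are assumptions. The transition overhead is counted as a number of transitions rather than charged as time.

```python
import random


def simulate_power_states(task_length_us, hi_mean_us, lo_mean_us,
                          stddev_us, seed=0):
    """Alternate Hi/Lo phases until the task length is covered.

    Returns (time_in_hi_us, time_in_lo_us, num_transitions).
    """
    rng = random.Random(seed)
    elapsed = time_hi = time_lo = 0.0
    transitions = 0
    in_hi = True  # each task starts in the nominal (Hi) power state
    while elapsed < task_length_us:
        mean = hi_mean_us if in_hi else lo_mean_us
        # Gaussian phase duration, clamped to stay positive (assumption).
        duration = max(1.0, rng.gauss(mean, stddev_us))
        duration = min(duration, task_length_us - elapsed)
        if in_hi:
            time_hi += duration
        else:
            time_lo += duration
        elapsed += duration
        if elapsed < task_length_us:
            in_hi = not in_hi
            transitions += 1
    return time_hi, time_lo, transitions


# Hi mean 300 µs and Lo mean 100 µs over a 5 ms task; stddev is assumed.
hi, lo, n = simulate_power_states(5000.0, 300.0, 100.0, stddev_us=30.0)
```

Shorter phase means produce more transitions for the same task length, so the total transition overhead grows as phases shrink.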
Simulation Parameters
System-level Parameters
– Number of cores: 32
– Mean Task Length: 5 ms
– Mean Task Inter-Arrival Rate: 300 µs
– Time Slice: 1 ms
– Simulation Length: 10000 Tasks
Big Knob Parameters
– Core Switch-on Lat (OnLat): 500 µs
– Idleness Threshold (CT): 500 µs
Small Knob Parameters
– Hi state mean: 300 µs
– Lo state mean: 100 µs
– Transition overhead: 1 µs
– Power Factor: 0.75
ρ = λ / (N x µ)
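As a worked example of the utilization formula, the sketch below plugs in the system-level parameters from this slide. It takes λ as the task arrival rate (the reciprocal of the mean inter-arrival time), µ as the per-core service rate (the reciprocal of the mean task length), and N as the number of cores. These are the standard queuing-theory meanings, and the function name is illustrative.

```python
def offered_utilization(mean_interarrival_us, mean_task_length_us, num_cores):
    """rho = lambda / (N * mu), with the rates derived from the mean times."""
    arrival_rate = 1.0 / mean_interarrival_us   # lambda
    service_rate = 1.0 / mean_task_length_us    # mu, per core
    return arrival_rate / (num_cores * service_rate)


rho = offered_utilization(300.0, 5000.0, 32)
print(round(rho, 2))  # 0.52
```

The result is in line with the measured server utilization of 0.52 that the hybrid-model results report for a 300 µs inter-arrival time.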
Outline
Problem Background
Methodology: Queuing Model
Results
Conclusions and Future Work
Big Knob Results
Experiment                    Response time (µs)    Average Power (Num Cores)
Base                          5002.22               32
OnLat = 0.5ms, CT = 0.5ms     5038.46               24.99
OnLat = 0.5ms, CT = 0.3ms     5070.12               23.33
OnLat = 0.5ms, CT = 0.1ms     5158.51               21.83
OnLat = 0.5ms, CT = 10µs      5244.43               21.68
OnLat = 10µs, CT = 0.5ms      5002.93               24.82
OnLat = 10µs, CT = 10µs       5007.07               20.77
• CT controls the degree of power-savings (up to 34%)
• OnLat controls the performance loss (up to 5%)
Idle-Time Durations Histogram
[Figure: histogram of the number of idle durations versus idle-time duration (µs), with the idleness threshold CT marked]
Small Knob Results
System_Power = Num_cores x (%time_in_Hi_state + F x %time_in_Lo_state) x P where F = 0.75 for this analysis
Workload Behavior   Hi Mean   Lo Mean   Hi %   Lo %   Response Time (µs)   Avg Power (Effective Num-Cores)
Short phases        100       100       52     48     5050.51              28.16
(unlabeled)         200       200       57     43     5027.36              28.48
High ILP            300       100       79     21     5026.46              30.08
Low ILP             100       300       30     70     5028.23              26.24
Very High ILP       500       100       89     11     5013.67              31.04
Very Low ILP        100       500       21     79     5019.95              25.6
[Figure: Performance Loss % versus Transition Overhead (µs)]
• Power savings depend on workload behavior
• Short phases increase the number of transitions and the overhead
• Transition overhead is tolerable under our assumptions
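The System_Power formula on this slide can be checked with a short calculation. The sketch normalizes the per-core power P to 1.0 so that the result reads as an effective number of cores. That normalization and the function name are assumptions. The example uses the Hi 52% / Lo 48% row of the results.

```python
def system_power(num_cores, hi_fraction, lo_fraction, power_factor=0.75,
                 core_power=1.0):
    """System_Power = Num_cores x (Hi% + F x Lo%) x P.

    With core_power = 1.0 (assumed normalization) the result is expressed
    as an effective number of cores.
    """
    return num_cores * (hi_fraction + power_factor * lo_fraction) * core_power


print(round(system_power(32, 0.52, 0.48), 2))  # 28.16
```

Passing a smaller `power_factor` shows how a deeper low-power state would increase the savings for the same Hi/Lo split.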
Hybrid Model Results (Big + Small Knob)
[Figure: two panels, "High ILP Workload" and "Low ILP Workload"]
Inter-arrival Rate (µs)    Server Utilization (measured)
50                         1.0
100                        1.0
300                        0.52
500                        0.31
1000                       0.16
2000                       0.08
• High ILP workloads – Big knob is most helpful
• Low ILP workloads – Small knob helpful for even lower utilization
A Case for Guard Mechanism for Multi-core Power Gating
Experiment               Response Time (µs)    Core Switching ON/OFF Frequency
Fixed Arrival Rate       5043.88               91482
Toggling Arrival Rate    5111                  226372
Depending on workload characteristics, per-core power-gating heuristics are prone to mispredictions and can dissipate more power
Aggressive power-gating heuristics (e.g. a lower CT) also increase the performance overhead of mispredictions
Observations
In a fully loaded system, the small knob is helpful
In a lightly loaded system, the big knob is most useful
In an intermediately loaded system, the big knob is useful to have, but the usefulness of the small knob depends on the workload characteristics
– Lower ILP or low resource utilization workloads benefit from the small knob
The small knob is a useful feature to have regardless of system load if we can implement a power state with a lower power factor
– Current power factor is conservative (0.75)
Future Work
Improve methodology by supporting real server utilization traces
Evaluate a system with multiple P-states and DVFS
Architect guard mechanisms for the per-core power gating algorithms
Design implementation of a hybrid PG system
Thanks and Questions!
Backup Slides
Power Factor Sensitivity Analysis for High ILP Workload
[Figure: x-axis ticks 50, 100, 300, 500, 1000, 2000; y-axis from 0 to 1.2; series: Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1]
Power Factor Sensitivity Analysis for Low ILP Workload
[Figure: x-axis ticks 50, 100, 300, 500, 1000, 2000; y-axis from 0 to 1.2; series: Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1]
Two Level Power Gating Algorithms (Lungu et al. ISLPED'09)
Observations:
– Correctness requirement of power saving schemes (efficiency-wise): save power
– Single-level idle prediction algorithms can behave incorrectly and waste power
Proposed Idea:
– Add a second-level monitor to control enabling of the power gating scheme
– Improve efficiency of power-wasting cases without degrading power saving of the common case
Per-core power-gating algorithms also rely on such predictive schemes and will require guard mechanisms
– Cost of misprediction is higher in per-core power-gating
[Figure: two-level scheme. Level 1 (Actuate): states On, Off_U, Off_C, with efficiency counters Cnt1++ and Cnt2++. Level 2 (Monitor & Control): estimate power savings; if > 0 (Yes), Enable = 1; otherwise (No), Enable = 0.]
Off_U: Power gated, uncompensated
Off_C: Power gated, compensated
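The two-level idea can be illustrated with a small sketch. It is not the algorithm from Lungu et al., whose exact savings estimate is not given in these slides. The sketch assumes that one counter accumulates gated cycles spent before the break-even point (Off_U), that the other accumulates gated cycles after it (Off_C), and that each gating event costs the equivalent of `break_even_cycles` saved cycles. The class and method names are hypothetical.

```python
class GuardedPowerGating:
    """Level 2 monitor that enables or disables a Level 1 gating policy."""

    def __init__(self, break_even_cycles):
        self.break_even_cycles = break_even_cycles
        self.uncompensated_cycles = 0  # gated cycles spent in Off_U
        self.compensated_cycles = 0    # gated cycles spent in Off_C
        self.gating_events = 0
        self.enable = True

    def record_gated_period(self, gated_cycles):
        """Level 1 reports how long one power-gated period lasted."""
        off_u = min(gated_cycles, self.break_even_cycles)
        self.uncompensated_cycles += off_u
        self.compensated_cycles += gated_cycles - off_u
        self.gating_events += 1

    def end_of_window(self):
        """Level 2: estimate the window's savings and set Enable."""
        gated = self.uncompensated_cycles + self.compensated_cycles
        # Assumed cost model: each event costs break_even_cycles of savings.
        overhead = self.gating_events * self.break_even_cycles
        self.enable = (gated - overhead) > 0
        self.uncompensated_cycles = self.compensated_cycles = 0
        self.gating_events = 0
        return self.enable


guard = GuardedPowerGating(break_even_cycles=10)
for period in (2, 3, 4):          # short idle periods: mispredictions
    guard.record_gated_period(period)
bad_window = guard.end_of_window()    # False: gating wasted energy
for period in (50, 40):           # long idle periods: correct predictions
    guard.record_gated_period(period)
good_window = guard.end_of_window()   # True: gating saved energy
```

A real design would also need to keep evaluating the Level 1 predictions while gating is disabled, so that it can be re-enabled. That part is omitted here.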