
IBM Research

© 2010 IBM Corporation

Guarded Power Gating in a Multi-core Setting

Niti Madan, Alper Buyuktosunoglu, Pradip Bose (IBM T.J. Watson)

Murali Annavaram (USC)

June 2010


Outline

Motivation

Queuing Model based Methodology

Results

Conclusions and Future Work


Power Management through Power Gating

– Use header or footer transistor to power-gate the idle circuit

– Apply “sleep” to header or footer => turn off voltage

– Can be applied at unit-level (intra-core or small-knob)

– Can be applied at core-level (per-core or big-knob)

[Figure: power-gated circuit. Labels: Vdd, Sleep, Virtual Vdd, Logic Block]


Predictive Power Gating

Anita Lungu, Pradip Bose, Alper Buyuktosunoglu, Daniel Sorin, “Dynamic power gating with quality guarantees”, ISLPED ’09

• Power-gating Algorithms are predictive by nature

• Frequent mispredictions can burn more power than gating saves

• The break-even point depends on block size and technology parameters

• Guard mechanism proposed for unit-level power gating algorithms by Lungu et al. (ISLPED’09)

• Per-core power gating is a bigger concern, since the break-even point is much higher for cores

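[Figure: cumulative energy savings over time, from the decision to power gate to wake-up. Savings start negative (energy overhead) and cross zero at the break-even point.]

Ex. break-even point = 10 cycles

…10100 0000000000… Decide to power gate: correct prediction => save power

…10100 001…………. Decide to power gate: activity resumes before the break-even point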
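The break-even example can be made concrete with a small sketch. It assumes a linear model in which every gated idle cycle saves one unit of leakage and the gating overhead equals `BREAK_EVEN` units; the traces and helper names are hypothetical, not taken from the slides.

```python
def idle_run_length(trace: str, start: int) -> int:
    """Count consecutive idle cycles ('0') in trace beginning at index start."""
    n = 0
    while start + n < len(trace) and trace[start + n] == "0":
        n += 1
    return n


def net_savings_cycles(idle_cycles: int, break_even_cycles: int) -> int:
    """Net savings of one gating decision, in units of one cycle of leakage.

    Assumption: savings grow linearly with gated cycles, so the gating
    overhead is worth break_even_cycles of leakage.
    """
    return idle_cycles - break_even_cycles


BREAK_EVEN = 10  # cycles, as in the slide's example

# Hypothetical activity traces: '1' = busy cycle, '0' = idle cycle.
long_idle = "10100" + "0" * 25 + "1"
short_idle = "10100" + "001"

# In both traces the gating decision is taken at index 5.
for name, trace in [("long idle", long_idle), ("short idle", short_idle)]:
    idle = idle_run_length(trace, 5)
    print(name, idle, net_savings_cycles(idle, BREAK_EVEN))
```

Under these assumptions the long idle run ends with a positive balance, while the short one leaves the block worse off than if it had never been gated.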


Power Gating Scenarios

Exploiting the two dimensions of utilization to power-gate idle units or cores

– System Utilization (OS perspective) triggers the big-knob

– Resource Utilization (Core’s perspective) triggers the small-knob

• Do we PG cores or execution units or both?

How can we maximize the power-savings opportunities provided by both the small and big knobs?

[Figure: per-core activity over time for Core 1 to Core 4. (a) Baseline 4-core system (b) Folded 2-core system]


Goals of this study

Explore the trade-offs between unit-level/small-knob power gating algorithms and per-core/big-knob power gating algorithms for a range of latencies/parameters

Leverage analytical models for early-stage evaluation

Make a case for a guard mechanism for per-core power gating

Sriram Vajapeyam, Pradip Bose


Queuing Theory Based Analytical Model

Representation of multi-processor workloads as a queuing system

– Cores are servers

– Processing tasks are customer requests

– Tasks are processed in FCFS order

– Queuing system tracks average customer waiting time, service time and server utilization

Evaluate our power-management policies using C++ based Queuing model simulator: “QUTE”

[Figure: queuing system. Customers arrive, wait in the queue, are processed by the server(s), and depart]


Overview of QUTE Framework

Simulation of Queuing Models (G/G/N/k/inf/FCFS)

– Faster than cycle-accurate simulations

– Easy to explore design-space early on

Statistical Workload Generation Parameters:

– Task Arrival Times: Exponential Distribution

– Task Lengths: Normal/Exponential/Uniform Distributions

Evaluation Metrics:

– Performance: Average response time

– Power: Average number of cores switched on

– Other Stats: Server utilization, variance in service demand etc.
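The statistical workload generation could look roughly like the sketch below. It is not QUTE code (QUTE is a C++ simulator whose internals are not shown here); the function name, the 10% standard deviation for the normal case, and the 0.5x to 1.5x range for the uniform case are all assumptions made for illustration.

```python
import random


def generate_tasks(n, mean_interarrival_us, mean_length_us,
                   length_dist="exponential", seed=0):
    """Return a list of (arrival_time_us, length_us) tuples."""
    rng = random.Random(seed)
    tasks = []
    now = 0.0
    for _ in range(n):
        # Exponential inter-arrival times; expovariate takes a rate (1/mean).
        now += rng.expovariate(1.0 / mean_interarrival_us)
        if length_dist == "exponential":
            length = rng.expovariate(1.0 / mean_length_us)
        elif length_dist == "normal":
            # Assumed spread: sigma = 10% of the mean; clamp to stay positive.
            length = max(1.0, rng.gauss(mean_length_us, 0.1 * mean_length_us))
        elif length_dist == "uniform":
            # Assumed range: 0.5x to 1.5x the mean.
            length = rng.uniform(0.5 * mean_length_us, 1.5 * mean_length_us)
        else:
            raise ValueError(f"unknown distribution: {length_dist}")
        tasks.append((now, length))
    return tasks
```

For example, `generate_tasks(10000, 300.0, 5000.0, "normal")` would produce a task stream with a 300 µs mean inter-arrival time and a 5 ms mean task length.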


QUTE Framework

[Figure: QUTE framework. Task arrivals (arrival rate distribution using random number generator) enter a FIFO task queue and are served by cores C1, C2, C3, C4, … (service time or task length statistical distribution). All cores queue back the task at the end of a time slice.]


Big Knob Modeling

Implemented a simple Idleness-triggered heuristic: Set Idleness Threshold (say to 0.5 msec)

Every 0.5 msec (i.e. the idleness threshold),

– Scan all cores

– Identify cores idle for > idleness threshold

– Switch off all such cores (except, make sure there is always at least one core ON, either free or active)

When a task arrives at the head of the task queue:

– If there is no free core,

• If there is a switched-off core, switch it ON
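The idleness-triggered heuristic above can be sketched as two small routines: a periodic scan and a dispatch step. The `Core` record, the state names, and the choice to charge the switch-on latency to the task's start time are assumptions for illustration, not the simulator's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Core:
    state: str = "free"      # "active", "free" (idle but powered), or "off"
    idle_since: float = 0.0  # time (µs) at which the core last became free


def periodic_scan(cores, now_us, threshold_us):
    """Switch off cores idle longer than the threshold, keeping one core ON.

    Returns the number of cores that remain ON (free or active).
    """
    on_count = sum(c.state != "off" for c in cores)
    for c in cores:
        if on_count <= 1:
            break  # always leave at least one core ON
        if c.state == "free" and now_us - c.idle_since > threshold_us:
            c.state = "off"
            on_count -= 1
    return on_count


def dispatch(cores, now_us, on_latency_us):
    """Assign the task at the head of the queue to a core.

    Returns (core_index, start_time_us), or None if the task must wait.
    """
    for i, c in enumerate(cores):
        if c.state == "free":
            c.state = "active"
            return i, now_us
    for i, c in enumerate(cores):
        if c.state == "off":
            c.state = "active"  # switch it ON; the task starts after OnLat
            return i, now_us + on_latency_us
    return None
```

In this sketch `periodic_scan` would be called once per idleness threshold (every 0.5 msec in the slide's example) and `dispatch` whenever a task reaches the head of the queue.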


Small Knob Modeling

Cannot directly simulate workload phases

Each core can have N power states

– 2 states for this version: nominal power state and low power state (75% power)

Generate statistical distribution (Gaussian) of each power state duration

Each task always starts in the nominal power state

– Switch between power states in a given time-slice

Parameters: Nominal (Hi) and Low (Lo) power state means, Transition overhead
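A minimal sketch of the two-state model follows, assuming Gaussian phase lengths whose standard deviation is a fixed fraction of the mean (the slides give only the means). The function name and the `sigma_frac` parameter are hypothetical.

```python
import random


def phase_breakdown(duration_us, hi_mean_us, lo_mean_us,
                    sigma_frac=0.1, rng=None):
    """Split duration_us into Hi and Lo power-state time, starting in Hi.

    Returns (hi_time_us, lo_time_us, transitions).
    """
    rng = rng or random.Random(0)
    hi_time = lo_time = 0.0
    transitions = 0
    remaining = float(duration_us)
    in_hi = True  # each task starts in the nominal (Hi) state
    while remaining > 0:
        mean = hi_mean_us if in_hi else lo_mean_us
        # Gaussian phase length; sigma is an assumed fraction of the mean.
        phase = max(1e-3, rng.gauss(mean, sigma_frac * mean))
        phase = min(phase, remaining)
        if in_hi:
            hi_time += phase
        else:
            lo_time += phase
        remaining -= phase
        if remaining > 0:
            in_hi = not in_hi
            transitions += 1
    return hi_time, lo_time, transitions
```

Multiplying the returned transition count by the transition overhead gives the extra time charged to the task under this model.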


Simulation Parameters

System-level Parameters

Number of cores                 32
Mean Task Length                5 ms
Mean Task Inter-Arrival Rate    300 µs
Time Slice                      1 ms
Simulation Length               10000 Tasks

Big Knob Parameters

Core Switch-on Lat (OnLat)      500 µs
Idleness Threshold (CT)         500 µs

Small Knob Parameters

Hi state mean                   300 µs
Lo state mean                   100 µs
Transition overhead             1 µs
Power Factor                    0.75

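ρ = λ / (N × µ), where λ is the task arrival rate, µ is the per-core service rate, and N is the number of cores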
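As a quick check of the utilization formula, the sketch below reads it as λ divided by the product of N and µ, and takes λ and µ as the reciprocals of the mean inter-arrival time (300 µs) and mean task length (5 ms) listed above. The function name is hypothetical.

```python
def offered_utilization(mean_interarrival_us, mean_task_us, num_cores):
    """rho = lambda / (N * mu).

    lambda = 1 / mean inter-arrival time (task arrival rate)
    mu     = 1 / mean task length (per-core service rate)
    """
    lam = 1.0 / mean_interarrival_us
    mu = 1.0 / mean_task_us
    return lam / (num_cores * mu)


rho = offered_utilization(300.0, 5000.0, 32)
print(f"{rho:.2f}")  # prints 0.52
```

A value above 1 from this function means the offered load exceeds what the cores can serve, so a measured utilization would saturate at 1.0.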


Outline

Problem Background

Methodology: Queuing Model

Results

Conclusions and Future Work


Big Knob Results

Experiment                     Response time (µs)   Average Power (Num Cores)
Base                           5002.22              32
OnLat = 0.5ms, CT = 0.5ms      5038.46              24.99
OnLat = 0.5ms, CT = 0.3ms      5070.12              23.33
OnLat = 0.5ms, CT = 0.1ms      5158.51              21.83
OnLat = 0.5ms, CT = 10µs       5244.43              21.68
OnLat = 10µs, CT = 0.5ms       5002.93              24.82
OnLat = 10µs, CT = 10µs        5007.07              20.77

• CT controls the degree of power-savings (up to 34%)

• OnLat controls the performance loss (up to 5%)


Idle-Time Durations Histogram

[Figure: histogram of idle-time durations. x-axis: Idle-time Duration (µs); y-axis: Number of durations; the idleness threshold CT is marked]


Small Knob Results

System_Power = Num_cores x (%time_in_Hi_state + F x %time_in_Lo_state) x P where F = 0.75 for this analysis

Workload Behavior   Hi Mean   Lo Mean   Hi %   Lo %   Response Time (µs)   Avg Power (Effective Num-Cores)
Short phases        100       100       52     48     5050.51              28.16
                    200       200       57     43     5027.36              28.48
High ILP            300       100       79     21     5026.46              30.08
Low ILP             100       300       30     70     5028.23              26.24
Very High ILP       500       100       89     11     5013.67              31.04
Very Low ILP        100       500       21     79     5019.95              25.6
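The System_Power expression can be checked against the first result row (Hi 52%, Lo 48%). The sketch below sets P to 1, so the result is expressed in effective cores, and assumes the 32-core configuration; the function name is hypothetical.

```python
def effective_cores(num_cores, hi_pct, lo_pct, power_factor=0.75):
    """Num_cores x (%time_in_Hi_state + F x %time_in_Lo_state), with P = 1."""
    return num_cores * (hi_pct + power_factor * lo_pct) / 100.0


print(round(effective_cores(32, 52, 48), 2))  # prints 28.16
```

The other rows come out close to, but not exactly at, the tabulated values when computed this way, presumably because the Hi and Lo percentages shown are rounded.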

[Figure: Performance Loss % (0 to 3) vs. Transition Overhead (µs) of 0.5, 1, 5, and 10]

• Power savings depend on workload behavior

• Short phases increase the number of transitions and the overhead

• Transition overhead tolerable for our assumptions


Hybrid Model Results (Big + Small Knob)

[Figure: two panels, High ILP Workload and Low ILP Workload, plotted against Inter-arrival Rate (µs)]

Inter-arrival Rate (µs)   Server Utilization (measured)
50                        1.0
100                       1.0
300                       0.52
500                       0.31
1000                      0.16
2000                      0.08

• High ILP workloads – Big knob is most helpful

• Low ILP workloads – Small knob helpful for even lower utilization


A Case for Guard Mechanism for Multi-core Power Gating

Experiment              Response Time (µs)   Core Switching ON/OFF Frequency
Fixed Arrival Rate      5043.88              91482
Toggling Arrival Rate   5111                 226372

Depending upon workload characteristics, per-core power gating heuristics are prone to mispredictions and can dissipate more power than they save

Aggressive power-gating heuristics (e.g., a lower CT) also increase the performance overhead of mispredictions


Observations

In a fully loaded system, the small knob is helpful

In a lightly loaded system, the big knob is most useful

In an intermediately loaded system, the big knob is useful to have, but the usefulness of the small knob depends upon the workload characteristics

– Lower ILP or low resource utilization workloads benefit from the small knob

Small knob is a useful feature to have regardless of system load if we can implement a power state with a lower power factor

– Current power factor is conservative (0.75)


Future Work

Improve methodology by supporting real server utilization traces

Evaluate a system with multiple P-states and DVFS

Architect guard mechanisms for the per-core power gating algorithms

Design implementation of a hybrid PG system


Thanks and Questions!


Backup Slides


Power Factor Sensitivity Analysis for High ILP Workload

[Figure: x-axis values 50, 100, 300, 500, 1000, 2000; y-axis from 0 to 1.2. Series: Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1]


Power Factor Sensitivity Analysis for Low ILP Workload

[Figure: x-axis values 50, 100, 300, 500, 1000, 2000; y-axis from 0 to 1.2. Series: Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1]


Two Level Power Gating Algorithms (Lungu et al. ISLPED'09)

Observations:

– Correctness requirement of power saving schemes (efficiency-wise): save power

– Single level idle prediction algorithms can behave incorrectly and waste power

Proposed Idea:

– Add second level monitor to control enabling of power gating scheme

– Improve efficiency of power wasting cases without degrading power saving of common case

Per-core power-gating algorithms also rely on such predictive schemes and will require guard mechanisms

– Cost of misprediction is higher in per-core power-gating

[Figure: two-level scheme. Level 1 (Actuate): states On, Off_U, and Off_C, with efficiency counters Cnt1 and Cnt2 incremented. Level 2 (Monitor & Control): estimate power savings from the efficiency counters; if > 0 (Yes), Enable = 1, otherwise (No) Enable = 0.]

Off_U: Power gated, uncompensated

Off_C: Power gated, compensated
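The two-level idea can be illustrated with a simplified guard that estimates net savings over an epoch and sets an enable bit. This is only a sketch of the concept: it does not reproduce the Cnt1/Cnt2 counter design of Lungu et al., and the class name, the epoch length, and the linear savings estimate are assumptions.

```python
class GuardedGating:
    """Level 2 monitor that enables or disables a Level 1 gating policy."""

    def __init__(self, break_even, epoch_len):
        self.break_even = break_even  # idle length at which gating pays off
        self.epoch_len = epoch_len    # idle periods per epoch (assumed)
        self.enable = True            # Level 1 may gate only while True
        self._net = 0                 # estimated net savings this epoch
        self._seen = 0                # idle periods observed this epoch

    def observe_idle_period(self, idle_len):
        """Record one completed idle period, whether or not it was gated.

        Observing periods even while disabled lets the guard re-enable
        gating once idle periods become long again.
        """
        self._net += idle_len - self.break_even
        self._seen += 1
        if self._seen == self.epoch_len:
            # End of epoch: keep gating only if it would have saved power.
            self.enable = self._net > 0
            self._net = 0
            self._seen = 0
        return self.enable
```

With `GuardedGating(break_even=10, epoch_len=4)`, a run of short idle periods such as 2, 3, 1, 4 clears the enable bit, and a later run of long idle periods sets it again.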
