runtime power measurement/modeling and thermal modeling

Runtime Power Measurement/Modeling and Thermal Modeling. Research Seminar, Canturk ISCI


Post on 13-Jan-2016


TRANSCRIPT

Page 1: Runtime Power Measurement/Modeling and Thermal Modeling

Runtime Power Measurement/Modeling and Thermal Modeling

Research Seminar
Canturk ISCI

Page 2: Runtime Power Measurement/Modeling and Thermal Modeling

MOTIVATION
Power Matters!
  Performance improves exponentially; SO DOES POWER DENSITY
  Chip areas increase 7%/year
  Battery life: improves much slower
  Thermal issues: follow power density
  Packaging costs: +$1/W over ~40W
Need good measurement/modeling techniques for power & thermally aware/adaptive systems
  Using measurement to probe microarchitectural details
    CASTLE, data activity experiment
  Compiler level power optimizations
    SW power profiling and optimization
  Power aware OS
    Power modeling for decision making
  Dynamic thermal/power management
    Thermal hotspots & power threshold

Page 3: Runtime Power Measurement/Modeling and Thermal Modeling

MOTIVATION
Power models reflecting modern processors
  Clock gating, power Voltage regulation, di/dt
Need for fast, real-time modeling and measurement to observe long time periods
  Thermal time constants: O(s)
  Not feasible even with architectural simulators
    e.g.: 1s of real run ~5 x IPC hrs of WATTCH simulation
Need live, run-time power/thermal measures
  Dynamic Thermal Management
  Power-aware OS & systems control

Page 4: Runtime Power Measurement/Modeling and Thermal Modeling

THE BIG PICTURE
To estimate component power & temperature breakdowns for P4 at runtime…
Bottom line…
  Performance Monitoring
  Real Power Measurement
  Power Modeling
  Thermal Modeling

Page 5: Runtime Power Measurement/Modeling and Thermal Modeling

Remainder of Talk
Related Work
Performance Monitoring
  P4 Performance Counters
  Performance Reader LKM
Real Power Measurement
  P4 Power Measurement Setup
  Examples
Power Modeling
  P4 Power Model
  Model + Measurement Sync Setup, Verification
Thermal Modeling
  Refined Thermal Model
  Ex: Ppro Thermal Model

Page 6: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK
Implementing counter readers:
  PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002]
Using counters for performance:
  HPC [Crummey 2001], CPU profilers
Using counters for power:
  CASTLE [Joseph 2001], power profilers
  Event driven OS/cruise control [Bellosa 2000, 2002]
Real power measurement:
  Compiler optimizations [Seng 2003]
  Cycle-accurate measurement with switch caps [Chang 2002]

Page 7: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK
Power Management and Modeling Support:
  Instruction level energy [Tiwari 1994]
  PowerScope: Procedure level energy [Flinn 1999]
  Event counter driven energy coprocessor [Haid 2003]
  Power-breakdown driven energy reduction [Huang 2001]
  Virtual Energy Counters for Mem. [Kadayif 2001]
  ECOsystem: OS energy accounting [Ellis 2002]
Thermal Management and Modeling Support:
  PID based DTM [Skadron 2002]
  Architectural Thermal Model [Skadron 2003]
  Evaluating DTM techniques [Brooks 2001]

Page 8: Runtime Power Measurement/Modeling and Thermal Modeling

Milestone 1: Performance Monitoring

Page 9: Runtime Power Measurement/Modeling and Thermal Modeling

Live CPU Performance Monitoring with Hardware Counters
Most CPUs have hardware performance counters
P4 performance monitoring HW:
  18 event counters
  18 counter configuration control registers
    Configure how to count
  45 event selection control registers
    Configure what to count
  Additional control registers

Page 10: Runtime Power Measurement/Modeling and Thermal Modeling

Counter Overview
Counting Types
  Non-retirement
  At-retirement:
    Can count BOGUS vs NBOGUS, tag uops, etc.
    Mechanisms:
      Front end tagging
      Execution tagging
      Replay tagging
      No tags
  Also:
    Event counting
    Event based sampling
    Precise EBS
Event Types
  59 event classes
  100s of events to count
  Metric classifications:
    General (Ex: Speculative uops retired)
    Branching (Ex: Mispredicted conditionals)
    Trace Cache and Front End (Ex: Processor N deliver mode)
    Memory (Ex: MOB load replays)
    Bus (Ex: Prefetch bus accesses)
    Characterization (Ex: Packed SP retired)
    Machine Clear (Ex: Memory order machine clear)

Page 11: Runtime Power Measurement/Modeling and Thermal Modeling

Our Event-Counter: Performance Reader
Performance Reader implemented as Linux Loadable Kernel Module
Implements 6 syscalls:
  select_events()
  reset_event_counter()
  start_event_counter()
  stop_event_counter()
  get_event_counts()
  set_replay_MSRs()
User level interface:
  Defines the events
  Starts counters
  Stops counters
  Reads counters & TSC

Page 12: Runtime Power Measurement/Modeling and Thermal Modeling

Performance Reader: Example Validation
L1_Dcache benchmark
  Controls cache hit behavior
  Validated against measured cache events
  Vary hit rate from 0-100%
[Figure: L1 Hit Rate Experiment. Acquired hit rates (0% to 120%) vs. desired hit rate given as benchmark input (0.1 to 1). Series: ideal hit rate, acquired L1 hit rate, L1 hit rate from L2 access.]

Page 13: Runtime Power Measurement/Modeling and Thermal Modeling

Milestone 2: Real Power Measurement

Page 14: Runtime Power Measurement/Modeling and Thermal Modeling

P4 Power Measuring Setup
Clamp ammeter on 12V lines on measured CPU
  1mV/Adc conversion
DMM reading clamp voltages
Voltage readings via RS232 to logging machine
Serial Reader (PowerMeter) (PowerPlotter)
  Convert to power vs. time window
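The measurement chain on this slide reduces to a small conversion, sketched here in Python under stated assumptions: the clamp outputs 1 mV per ampere as stated on the slide, the supply rail is taken as a nominal 12.0 V (the real rail voltage may differ slightly), and the function names and sample readings are hypothetical.

```python
# Sketch: turn clamp-ammeter voltage readings into power samples.
CLAMP_VOLTS_PER_AMP = 1e-3   # 1 mV per ampere (DC), from the slide
SUPPLY_VOLTAGE = 12.0        # assumed nominal voltage of the 12 V lines


def clamp_voltage_to_power(v_clamp):
    """Return power in watts for one clamp voltage reading in volts."""
    current = v_clamp / CLAMP_VOLTS_PER_AMP
    return SUPPLY_VOLTAGE * current


def windowed_power(readings, window):
    """Average power over consecutive windows of `window` samples."""
    powers = [clamp_voltage_to_power(v) for v in readings]
    chunks = [powers[i:i + window] for i in range(0, len(powers), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Hypothetical DMM readings in volts (1 mV, 3 mV, 5 mV).
example = windowed_power([0.001, 0.003, 0.005], window=2)
```

Averaging over a window is one plausible reading of "power vs. time window"; the actual PowerPlotter behaviour is not described in the slides.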

Page 15: Runtime Power Measurement/Modeling and Thermal Modeling

PowerPlotter: Example
[Figure: PowerPlotter power trace, with Initialization and Benchmark Execution phases annotated. Runs shown: “Fast”; “Branch exercise” (Taken rate: 1); “High-Low”; “L1Dcache” (array size 1/100 of L1); “L1Dcache” (array size x25 of L1 ~ L2); “L1Dcache” (array size x4 of L2).]

Page 16: Runtime Power Measurement/Modeling and Thermal Modeling

SPEC Power Examples
Different programs show very different power characteristics
Timescale of interest can be huge => inaccessible via simulation
[Figure: Spec GCC (O3) with specrun -a run. Power [W] (0 to 80) vs. time (s) (0 to 200).]
[Figure: Spec VPR (O3) with specrun -a run. Power [W] (0 to 60) vs. time (s) (0 to 500).]

Page 17: Runtime Power Measurement/Modeling and Thermal Modeling

Milestone 3: Power Modeling

Page 18: Runtime Power Measurement/Modeling and Thermal Modeling

P4 POWER MODEL
Define Components
  Define components (e.g. L1 cache, BPU, Regs, etc.) whose powers we’ll model, from annotated layout
Define Events
  Determine the combination of P4 events that best represents component accesses
Performance Monitoring
  Gather counter info with minimal power overhead and program interruption
Power Modeling
  Convert counter info into component power breakdowns
Real Power Measurement
  Verify total power against measured processor power

Page 19: Runtime Power Measurement/Modeling and Thermal Modeling

Defining Components

Page 20: Runtime Power Measurement/Modeling and Thermal Modeling

Defining Components

Page 21: Runtime Power Measurement/Modeling and Thermal Modeling

Defining Events → Access Rates
We determined 24 events to approximate access rates for 22 components
Used several heuristics to represent each access rate
  Ex: 2nd Level BPU:
    Metric 1: Instructions fetched from L2 (predict)
      Event: ITLB_Reference
        Counts ITLB translations
      Mask: All hits, misses
    Metric 2: Branches retired (history update)
      Event: branch_retired
        Counts branches retired
      Mask: Count all Taken/NT/Predicted/MissP
Need to rotate counters 4 times to collect all event data
  Used 15 counters & 4 rotations
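One way to picture counter rotation is as time multiplexing: each sampling interval programs one group of events, and the most recent rate for every event is retained. The Python sketch below assumes that scheme. The event grouping, the `read_group` callback and the fake counts are hypothetical and do not represent the Performance Reader's actual interface.

```python
from itertools import cycle


def rotate_and_collect(groups, read_group, n_intervals):
    """Cycle through event groups, one group per sampling interval.

    groups      -- list of event-name lists, one list per rotation
    read_group  -- callable(events) -> (counts dict, elapsed cycles); hypothetical
    n_intervals -- number of sampling intervals to run
    Returns the latest events-per-cycle rate seen for each event.
    """
    latest_rates = {}
    schedule = cycle(range(len(groups)))
    for _ in range(n_intervals):
        events = groups[next(schedule)]
        counts, cycles = read_group(events)
        for event in events:
            latest_rates[event] = counts[event] / cycles
    return latest_rates


def fake_read_group(events):
    """Stand-in for a real counter read; the values are made up."""
    return {event: 500 * (i + 1) for i, event in enumerate(events)}, 1000


# Event names appear in the slides; this grouping is hypothetical.
example_groups = [
    ["ITLB_Reference", "branch_retired"],
    ["IOQ_allocation", "FSB_data_activity"],
]
rates = rotate_and_collect(example_groups, fake_read_group, n_intervals=4)
```

With four rotations, each event's rate is refreshed only every fourth interval, so a sketch like this trades temporal resolution for event coverage.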

Page 22: Runtime Power Measurement/Modeling and Thermal Modeling

Access Rates → Component Powers
We gather counter data at the measured computer via the tiny counter reader
We send the access rates to the logger machine
  Don’t want to do any computation at host
Logger machine converts access rates to the component power breakdowns
  Computation done externally, still at runtime
Access rates used as proxy to max component power weighting, together with microarchitectural details
  Ex: Trace cache delivers 3 uops/cycle max
    Power(TC) = Access-Rate(TC)/3 * MaxPower(TC) + Non-gated TC CLK power

Page 23: Runtime Power Measurement/Modeling and Thermal Modeling

Generic Equation
Power(Component) = Access-Rate(Component) x Microarchitectural Scaling x MaxPower(Component) + Non-gated Component Clock Power
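As a worked instance of the generic equation, the following Python sketch evaluates it for the trace cache case from the previous slide, where the scaling factor is 1/3 because the trace cache delivers at most 3 uops/cycle. The numeric values for maximum power and non-gated clock power are hypothetical placeholders, not measured P4 numbers.

```python
def component_power(access_rate, scaling, max_power, nongated_clock_power):
    """Power = access rate x microarchitectural scaling x max power + non-gated clock power."""
    return access_rate * scaling * max_power + nongated_clock_power


# Trace cache example: at most 3 uops/cycle, so the access rate is scaled by 1/3.
TC_SCALING = 1.0 / 3.0
TC_MAX_POWER_W = 4.0         # hypothetical value
TC_NONGATED_CLOCK_W = 1.0    # hypothetical value

# An access rate of 1.5 uops/cycle is half of the maximum delivery rate.
tc_power = component_power(1.5, TC_SCALING, TC_MAX_POWER_W, TC_NONGATED_CLOCK_W)
```

Under these made-up numbers the trace cache draws half of its maximum power plus the clock term, about 3 W in total.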

Page 24: Runtime Power Measurement/Modeling and Thermal Modeling

Experiment Setup – Recall:
Clamp ammeter on 12V lines on measured CPU
  1mV/Adc conversion
DMM reading clamp voltages
Voltage readings via RS232 to logging machine
Serial Reader (PowerMeter) (PowerPlotter)
  Convert to power vs. time window

Page 25: Runtime Power Measurement/Modeling and Thermal Modeling

Experiment Setup
1mV/Adc conversion
Voltage readings via RS232 to logging machine

Page 26: Runtime Power Measurement/Modeling and Thermal Modeling

Experiment Setup
POWERCLIENT
POWERSERVER
1mV/Adc conversion
Voltage readings via RS232 to logging machine
Component access rates over ethernet
Convert voltage to measured power
Convert access rates to modeled powers
Sync together in time window
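The synchronization step can be sketched as binning both streams into common time windows and pairing the averages. This is only an illustration under assumptions: timestamps are in seconds on a shared clock, and the function name, window length and sample data are hypothetical.

```python
def sync_in_windows(measured, modeled, window_s):
    """Pair measured and modeled power by time window.

    measured, modeled -- lists of (timestamp_s, watts) samples
    window_s          -- window length in seconds
    Returns (window_start_s, mean_measured_w, mean_modeled_w) for every
    window in which both streams have at least one sample.
    """
    def bin_average(samples):
        bins = {}
        for t, watts in samples:
            bins.setdefault(int(t // window_s), []).append(watts)
        return {k: sum(v) / len(v) for k, v in bins.items()}

    meas = bin_average(measured)
    model = bin_average(modeled)
    shared = sorted(meas.keys() & model.keys())
    return [(k * window_s, meas[k], model[k]) for k in shared]


# Hypothetical samples: (time in s, power in W).
measured_samples = [(0.1, 40.0), (0.6, 50.0), (1.2, 60.0)]
modeled_samples = [(0.5, 44.0), (1.5, 58.0), (2.5, 30.0)]
paired = sync_in_windows(measured_samples, modeled_samples, window_s=1.0)
```

Windows that contain samples from only one stream are dropped here; a real setup might instead interpolate or hold the last value.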

Page 27: Runtime Power Measurement/Modeling and Thermal Modeling

Area Based Power Estimate – Total Power Result
[Figure: measured vs. modeled total power for “Fast”, “Branch exercise” (Taken rate: 1), “High-Low” and “L1Dcache” (Hit Rate: 0.1).]

Page 28: Runtime Power Measurement/Modeling and Thermal Modeling

After Tuning?
[Figure: measured vs. modeled total power for “Fast”, “Branch exercise” (Taken rate: 1), “High-Low” and “L1Dcache” (Hit Rate: 0.1).]

Page 29: Runtime Power Measurement/Modeling and Thermal Modeling

Component Breakdowns
Component breakdowns for “branch_exercise”
Colors for 4 CPU subsystems
  Issue - Retire
  Execution

Page 30: Runtime Power Measurement/Modeling and Thermal Modeling

SPEC Results
[Figure: measured vs. modeled power for Gcc, Gzip, Vpr, Vortex, Gap and Crafty.]

Page 31: Runtime Power Measurement/Modeling and Thermal Modeling

Milestone 4: Thermal Modeling

Page 32: Runtime Power Measurement/Modeling and Thermal Modeling

THERMAL MODELING: A Basic Model
Based on lumped R-C model from packaging
Built upon:
  Power modeling: sampled component powers
  Respective component areas
  Physical processor parameters
  Packaging
  Heat transfer
[Figure: lumped thermal R-C network. Die blocks Blk i, Blk j, Blk k and Blk l each have a node with temperature Tb, thermal resistance Rth, thermal capacitance Cth and power input P (for example Tb,i, Rth,i, Cth,i, Pi). The heatsink node has Th, Rth,h, Cth,h and power input Pi+Pj+Pk+Pl.]

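For each block i:

$$P_i = C_{th,i}\,\frac{dT_i}{dt} + \frac{T_{b,i} - T_h}{R_{th,i}}$$

Final difference equation:

$$\Delta T_i = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$$

Δt: sampling interval
T_i: the temperature difference between block and the heatsink, T_i = T_{b,i} - T_h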
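A minimal Python sketch of the basic model's update, assuming the forward (explicit) difference form in which each sample adds (P/C)·Δt and subtracts T/(R·C)·Δt, with T the block-to-heatsink temperature difference. The R, C, power and time-step values are hypothetical and chosen only so the result is easy to check: at constant power the difference should settle at P·R.

```python
def step_block_temperature(t_diff, power, r_th, c_th, dt):
    """Advance the block-to-heatsink temperature difference by one sample."""
    delta = (power / c_th) * dt - (t_diff / (r_th * c_th)) * dt
    return t_diff + delta


def simulate(power_samples, r_th, c_th, dt, t0=0.0):
    """Apply the difference equation to a sequence of sampled powers."""
    temps = []
    t_diff = t0
    for power in power_samples:
        t_diff = step_block_temperature(t_diff, power, r_th, c_th, dt)
        temps.append(t_diff)
    return temps


# Hypothetical block: R = 2 K/W, C = 0.5 J/K (time constant R*C = 1 s),
# constant 10 W for 20 s, sampled every 10 ms.
trace = simulate([10.0] * 2000, r_th=2.0, c_th=0.5, dt=0.01)
```

The explicit form is only stable while Δt stays well below R·C, which is worth keeping in mind when the sampling interval and the smallest block time constant are of similar size.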

Page 33: Runtime Power Measurement/Modeling and Thermal Modeling

Refined Thermal Model
Steady state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
Proceeding to a multilayer thermal model:
  Active die thickness
  Metalization/insulation
  Chip-package interface
  Package
  Heatsink
Requires searching of several materials/dimensions and thermal properties
Multiple layers, multiple T nodes, multiple DEs
Baseline heat removal structure:
  Heatsink
  Thermal grease
  Heat spreader
  Package
  Die

Page 34: Runtime Power Measurement/Modeling and Thermal Modeling

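Physical Structure vs. Thermal Model
[Figure: physical stack (ambient airflow, heatsink, thermal grease, heat spreader, package, die) beside the equivalent thermal R-C network. Network labels: ambient temperature TA with R_hXA; heatsink Th, Rh, Ch; R_grXspr; heat spreader Tspr, Rspr, Cspr; package block Tp,i, Rp,i, Cp,i; die block Tdie,i, Rdie,i, Cdie,i. Power inputs: Pi and Ptotal.]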

Page 35: Runtime Power Measurement/Modeling and Thermal Modeling

Analytical Derivation
4 nodes, 4 DEs
1) Tspr:

$$P_{total} = C_{spr}\,\frac{dT_{spr}}{dt} + \frac{T_{spr} - T_h}{R_{grXspr}}$$

Discretizing time:

$$P_{total} = C_{spr}\,\frac{\Delta T_{spr}}{\Delta t} + \frac{T_{spr} - T_h}{R_{grXspr}}$$

Final difference equation:

$$\Delta T_{spr} = \frac{P_{total}}{C_{spr}}\,\Delta t - \frac{1}{C_{spr}\,R_{grXspr}}\,(T_{spr} - T_h)\,\Delta t, \qquad T_{spr} \leftarrow T_{spr} + \Delta T_{spr}$$

[Figure: thermal R-C network as on the previous slide, with nodes Th, Tspr, Tp,i and Tdie,i.]

Page 36: Runtime Power Measurement/Modeling and Thermal Modeling

EX: Ppro Thermal Model
Use CASTLE [Joseph, 2001] computed component powers
Determine component areas from die photo
Determine processor/packaging physical parameters
Generate numerical thermal model
Apply component difference equations recursively along power flow:
  Update Tdie,i → Update Tp,i → Update Tspr → Update Th
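The update order along the power flow can be sketched as a loop over the layers for a single block. This is an assumption-laden illustration rather than the seminar's model: every layer is assumed to follow the same first-order R-C difference equation, all of the block's power is assumed to pass through each layer, ambient temperature is fixed, and all R, C, power and time-step values are hypothetical.

```python
def update_layers(temps, power, resist, capac, t_ambient, dt):
    """Update layer temperatures in place, in die -> heatsink order.

    temps  -- [T_die, T_package, T_spreader, T_heatsink]
    resist -- thermal resistance from each layer to the next one
              (the last entry is heatsink to ambient)
    capac  -- thermal capacitance of each layer
    """
    for n in range(len(temps)):
        # Each layer conducts to the next layer; the heatsink conducts to ambient.
        t_next = temps[n + 1] if n + 1 < len(temps) else t_ambient
        delta = (power / capac[n]) * dt
        delta -= (temps[n] - t_next) / (resist[n] * capac[n]) * dt
        temps[n] += delta
    return temps


# Hypothetical parameters for one block.
T_AMBIENT = 25.0                   # degrees C
resist = [0.5, 0.3, 0.2, 0.4]      # K/W
capac = [0.1, 0.5, 5.0, 50.0]      # J/K
temps = [T_AMBIENT] * 4
for _ in range(30000):             # 600 s of simulated time at dt = 20 ms
    update_layers(temps, 10.0, resist, capac, T_AMBIENT, dt=0.02)
```

With these made-up values the heatsink has by far the largest time constant (R·C = 20 s), so it is the last node to settle, which is the qualitative behaviour one would expect from a large thermal mass.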

Page 37: Runtime Power Measurement/Modeling and Thermal Modeling

Simulation Outputs
Thermal nodes updated every Δt ~ 20ms
Component temperatures build up to ~350K in ~5hrs
Theatsink moves very slowly, as expected
[Figure: Pentium Pro Thermal Simulation. Temperature (C), axis 0 to 80, at startup and after 5 hours, for Ambient, Heatsink, Heat Spreader, Decode, Issue, Reorder, DMem, IMem, FUs and Other.]

Page 38: Runtime Power Measurement/Modeling and Thermal Modeling

SUMMARY
Performance Monitoring
Real Power Measurement
Power Modeling
Thermal Modeling

Page 39: Runtime Power Measurement/Modeling and Thermal Modeling

Conclusions
Contributions:
  Portable runtime real power measurement system
  Performance counter based runtime power & thermal model and runtime verification with synchronous real power measurement
  Thermal model, which can be applied to ANY power model - with good physical characterization - as long as physical component based power breakdowns are used
  Runtime modeling & measurement system for arbitrarily long timescales!
Outcomes:
  We can do reasonably accurate real power measurements at runtime without interfering with HW
  We can perform runtime power modeling with the tiny performance reader, without inducing any significant overhead to the power profile

Page 40: Runtime Power Measurement/Modeling and Thermal Modeling

What to do next?
Keep tuning for SPECs
<1st Stop>
Try regression at several corners
  Won’t do well due to clk gating??
Get data from Intel?
Try runtime self updating model?
Compare all to actual data
Experiment with µarch., evaluate several power properties
<2nd Stop>
Add thermal
Try to add lateral heat diffusion
Get contour results <New bkmrk>
<3rd Result>
P4 thermal monitor stuff
  Could be played from kernel to modulate clock
  Can we use with our models to do power savings on REAL HW??

Page 41: Runtime Power Measurement/Modeling and Thermal Modeling

41

Page 42: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – performance monitoring
Implementing counter readers:
  PCL Performance Counter Library, by Rudolf Berrendorf (University of Applied Sciences Bonn-Rhein-Sieg), Heinz Ziegler, and Bernd Mohr at the Central Institute for Applied Mathematics (ZAM) at the Research Centre Juelich, Germany
    Uniform interface for several architectures (Intel Pentium, MMX, Pro, III, 4/Linux; IBM Power3, Power3-II/AIX; etc.)
    Software library with C, C++, Java & Fortran bindings
    Kernel patch (Mikael Pettersson), recompile
  PAPI Performance Application Programming Interface Project, by Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, etc., at Innovative Computing Lab, CS dept., University of Tennessee
    Standard simple high level API and low level programmable interface
    Supports Pentium, MMX, Pro, III/Linux, Windows; Power 3,4/AIX; etc.
    PerfCtr kernel patch (Mikael Pettersson), recompile

Page 43: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – performance monitoring
Implementing counter readers:
  Perfmon Performance Monitoring Tool by Richard Enbody, Associate Professor, Department of Computer Science and Engineering, Michigan State University
    For SUN Ultra-Sparc & Ppro
    Device driver (LKM)
  Rabbit Performance Counters Library by Don Heller, Scalable Computing Laboratory, Iowa State University
    For Intel Pentium MMX, Pro, II, III/Linux; AMD/Linux
    Functions to access from within C
    Cleanest of all, but still ~30 files & ~50 instructions
    LKM
  Intel’s VTune Performance Analyzer
    Windows & Linux <New>
  IBM’s HPM toolkit
    Power 3,4/AIX
  Brink and Abyss Pentium 4 Performance Counter Tools For Linux, by Brinkley Sprunt, Electrical Engineering, Bucknell University
    brink: high level perl script to read experiment/config files
    abyss: C program to access counters
    abyss_dev: device driver for counter access
    EBS kernel patches: to handle PMIs

Page 44: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – performance monitoring
Using counter readers:
  CASTLE Project by Margaret Martonosi and Russ Joseph, Princeton University
    Acquire Ppro counter data to model component power breakdowns
  Frank Bellosa, “Benefits of Event Driven energy Accounting in Power Sensitive Systems”, 9th SIGOPS European workshop, 2000
    Counters to show power ~ k x instr-ns/cycle (PII)
    OS power optimizations:
      Throttle down CPU/extend thread time for cache hit/slow down CPU core if main memory is accessed
  Andreas Weissel, Frank Bellosa, “Process Cruise Control: Event driven clock scaling for dynamic power management”, CASES 2002
    Use event counter info to scale individual thread frequencies
    Intel XScale / modified Linux kernel

Page 45: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – performance monitoring
Using counter readers:
  HPC Toolkit, by John Mellor-Crummey, Rob Fowler, CS Dept., Rice University
    Uses perf counter data for profiling
    Converts raw profiling information into platform independent XML formats and produces performance metric correlations from multiple sources
    Used in compiler optimizations
  Jennifer Anderson, et al, “Continuous Profiling: Where Have All the Cycles Gone?”, ACM Transactions on Computer Systems, Vol. 15, No. 4, November 1997, pp. 357 - 390.
    Performance analysis example – from DEC
    Data collection by counter sampling, performance info from program level to individual instructions

Page 46: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – real power
CASTLE Project by Margaret Martonosi and Russ Joseph, Princeton University
  Shunt R over Ppro power lines to measure total processor power
John Seng, Dean Tullsen, “Effect of compiler optimizations on Pentium 4 Power consumption”, 7th Annual Workshop on Interaction between Compilers and Computer Architectures, February, 2003
  Shunt R between VRM and CPU
Marc A. Viredaz, Deborah A. Wallach, “Power Evaluation of Itsy Version 2.3”, tech. note TN-57, WRL, Compaq Computer Corp., 2000
  Similar series R to estimate battery life of Itsy pocket computer

Page 47: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – real power
Frank Bellosa, “Benefits of Event Driven energy Accounting in Power Sensitive Systems”, 9th SIGOPS European workshop, 2000
  Crude current measurement with DMM for Pentium II to help define per instruction powers
Andreas Weissel, Frank Bellosa, “Process Cruise Control: Event driven clock scaling for dynamic power management”, CASES 2002
  Series sense resistor added to Intel IQ 80310 evaluation platform power supply, to measure energy effect of frequency scaling
Naehyuck Chang, Kwanho Kim, and Hyun Gyu Lee, "Cycle-Accurate Energy Consumption Measurement and Analysis: Case Study of ARM7TDMI", ISLPED 2000 & IEEE Transactions on VLSI Systems, Vol. 10, pp. 146 - 154, Apr., 2002.
  Cycle accurate energy consumption measurement based on charge transfer
  Inserts switch caps between power supply and processor that switch with the same clock frequency!!

Page 48: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
Simulation Tools:
  WATTCH, by David Brooks and Margaret Martonosi, Princeton University, ISCA 2000
    Architectural power simulator
    Power models integrated upon SimpleScalar
  SimplePower by W. Ye, N. Vijaykrishnan, M. Kandemir, Penn-State University, and M. Irwin, “The Design and Use of SimplePower: A cycle-accurate energy estimation tool”, DAC, June 2000
    Execution driven, cycle accurate, RTL power estimation
    Emulates 5 stage pipe with SimpleScalar’s integer ISA

Page 49: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
Power Modeling:
  R. Joseph and M. Martonosi. “Run-Time Power Estimation in High Performance Microprocessors”, International Symposium on Low Power Electronics and Design, 2001
    Complete CASTLE Project: collects Ppro counter data and models component power breakdowns, verifying against measured total power
    Also Wattch simulation vs. counter approximation for SimpleScalar architecture
  Russ Joseph, David Brooks, and Margaret Martonosi, "Live, Runtime Power Measurements as a Foundation for Evaluating Power/Performance Tradeoffs", Workshop on Complexity-Effective Design (WCED, held in conjunction with ISCA-28), 2001
    Evaluate power vs. performance by measuring total power and acquiring performance data from counters – i.e. cache hit rate, branch prediction, bitline activity

Page 50: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
H. Zeng, X. Fan, C. Ellis, A. Lebeck, and A. Vahdat, “ECOSystem: Managing Energy as a First Class Operating System Resource”, Proceedings of ASPLOS X, Oct. 2002
  Uses Currentcy Model (fixed power & time budget for a task) for OS level energy management for battery life
  ECOsystem is the Linux OS implementation <No counters>
  Considers CPU ON/OFF; could do better with power model
H. Zeng, C. Ellis, A. Lebeck, A. Vahdat, “Currentcy: Unifying Policies for Resource Management”, USENIX 2003 Annual Technical Conference
  Detailed description of currentcy (OS scheduling, etc.)
Flinn J., Satyanarayanan, M., “PowerScope: A Tool for Profiling the Energy Usage of Mobile Applications”, Proceedings of the Second IEEE Workshop on Mobile Computing Systems and Applications, February, 1999
  Maps energy to program structure (power profiling – energy efficient SW design)
  DMM gets energy for machine
  Kernel modification (system monitor) gets PIDs for processes and identifies procedures for profiling offline

Page 51: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: A first step towards software power minimization”, International Conference on Computer-Aided Design & IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1994
  PIONEER WORK in power measurement/modeling
  Measure current drawn by an Intel 486DX2 processor and DRAM
  Generate energy cost table for instructions
  Identify inter-instruction effects: circuit state overhead, resource constraint effect, cache miss effects
  There are 1 million like this: modeling SW energy, I won’t put here
Lee, A. Ermedahl, and S. Min. “An accurate instruction-level energy consumption model for embedded RISC processors”, ACM SIGPLAN Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'01), Jun 2001
  Derives energy consumption for instructions rather than functional units for RISC ARM7TDMI processor
  Uses their cycle-accurate power measurement scheme
  Black box approach (similar to F. Bellosa) with linear regression

Page 52: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
J. Russell and M.F. Jacome, "Software Power Estimation and Optimization for High Performance, 32-bit Embedded Processors," Proc. of ICCD '98
  Estimates SW energy for i960 family 32 bit embedded RISC processors
  Uses digitizing oscilloscope/series resistor over processor power lines for measurement
  Uses const Pest for processor power and estimates energy based on runtime (won’t work with clock gating!)
J. Haid, G. Kafer, et al, "Run-Time Energy Estimation in System-On-a-Chip Designs", ASP-DAC 2003
  Proposes a coprocessor for runtime energy estimation for SoC
  Defines similar event counters in coprocessor and uses power macro-models
M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno, and A. Sangiovanni-Vincentelli. “Efficient power estimation techniques for hw/sw systems”, IEEE Proc. VOLTA'99 International Workshop on Low Power Design, pages 191--199, March 1999.
  Power estimation for HW/SW SoC designs
  RTL HW simulator and instruction set simulator using instruction level power models

Page 53: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
M. Huang, J. Renau, and J. Torrellas. “Profile-based energy reduction in high-performance processors”, In 4th Workshop on Feedback-Directed and Dynamic Optimization, December 2001
  Use profiling to determine when to activate/deactivate low power methods – i.e. DVS, clock gating, etc.
  Use energy statistics (power breakdowns) from performance counters for profiling (SIM)
I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, A. Sivasubramaniam, “vEC: virtual energy counters”, Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, 2001
  Uses Perfmon library for UltraSPARC to read SPARC HW perf counters related to memory
  Converts readings to power using analytical memory energy model
  Estimates memory system energy consumption

Page 54: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – power model
Luca Benini et al
  “System-level power estimation and optimization”, Proceedings 1998 International Symposium on Low Power Electronics and Design
  “System-level power optimization: techniques and tools”, Proceedings of International Symposium on Low Power Electronics and Design, 1999
    Tutorial on power conscious system level design
    Memory optimizations, hardware/software partitioning, instruction level power optimizations, DVS, DPM (allow components to sleep)
  “Supporting system-level power exploration for DSP applications”, Proceedings of the 10th Great Lakes Symposium on VLSI, 2000
    Modified ARM simulator for instruction level power estimation

Page 55: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – thermal model
K. Skadron, T. Abdelzaher, and M. R. Stan. “Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management”, In Proc. HPCA-8, pages 17--28, Feb. 2002.
  Single degree component based thermal R-C model for MIPS R10000 scaled to 0.18 µm
  Only die-to-heatsink thermal conduction, with const. heatsink and Si properties only
  Power/thermal simulation using Wattch for verification of DTM with PID controller
Sabry, M.-N.; Bontemps, A.; Aubert, V.; Vahrmann, R., “Realistic and efficient simulation of electro-thermal effects in VLSI circuits”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 5, Issue 3, Sep 1997
  Transistor level with interdevice thermal resistances
Szekely, V.; Poppe, A.; Pahi, A.; Csendes, A.; Hajas, G.; Rencz, M., “Electro-thermal and logi-thermal simulation of VLSI designs”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 5, Issue 3, Sep 1997
  LOGITHERM simulator module for gate level thermal simulation, by thermal characterization of logic gates

Page 56: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – thermal model
COSMOS/FloWorks by NIKA
  Fluid flow and thermal analysis program
  Heat flow computation based on mesh analysis
A. Dhodapkar, C. H. Lim, G. Cai, and W. R. Daasch. “TEMPEST: A thermal enabled multi-model power / performance estimator”, Proceedings of Workshop on Power-Aware Computer Systems, Nov. 2000.
  Thermally enabled architectural simulator based on SimpleScalar
  Single R, C for the whole processor; packaging oriented
D. Brooks and M. Martonosi. “Dynamic thermal management for high-performance microprocessors”. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, pages 171--82, Jan. 2001.
  Discusses microarchitectural and scaling DTM mechanisms
  Uses moving average of power for ~100K cycles of Wattch simulation as a proxy for temperature to detect thermal emergencies for DTM triggering

Page 57: Runtime Power Measurement/Modeling and Thermal Modeling

RELATED WORK – thermal model
Thermal Monitoring, “Intel Architecture SW developer’s Manual vol. 3”
  Catastrophic shutdown detector
    Thermal diode resets stop clock duty cycle
  Automatic thermal monitor
    Internally modulate stop clock duty cycle
  Software controlled clock modulation
    SW modulates stop clock duty cycle
Kevin Skadron et al, “Temperature aware Microarchitecture”, 30th ISCA, 2003
  HotSpot: architecture level thermal simulator built upon Wattch
  Uses multiple degree thermal R-C model for die, packaging, heatsink and convection to ambient
  More realistic area estimates based on Alpha 21364

Page 58: Runtime Power Measurement/Modeling and Thermal Modeling

58

Page 59: Runtime Power Measurement/Modeling and Thermal Modeling

59

Counter Access Heuristics
1) BUS CONTROL:
No 3rd level cache: BSQ allocations ~ IOQ allocations
Metric 1: Bus accesses from all agents
  Event: IOQ_allocation
    Counts various types of bus transactions
    Should account for BSQ as well
    Access based rather than duration
  Mask: Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
Metric 2: Bus utilization (the % of time the bus is utilized)
  Event: FSB_data_activity
    Counts DataReaDY and DataBuSY events on the bus
  Mask: Count when processor or other agents drive/read/reserve the bus
  Expression: FSB_data_activity x BusRatio / Clocks Elapsed
    To account for clock ratios
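The bus utilization expression above can be sketched as a small helper. This is only an illustration: the function name and the sample numbers (a bus ratio of 14 and the event count) are assumptions chosen for the example, not measured values.

```python
def bus_utilization(fsb_data_activity, bus_ratio, clocks_elapsed):
    """Fraction of elapsed core clocks during which the bus was active.

    fsb_data_activity is counted in bus clocks, so it is scaled by the
    core-to-bus clock ratio before dividing by elapsed core clocks.
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


# Hypothetical sample: 2M bus events during 1.4G core clocks, ratio 14.
utilization = bus_utilization(2_000_000, 14, 1_400_000_000)
print(f"bus utilization: {utilization:.1%}")
```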

Page 60: Runtime Power Measurement/Modeling and Thermal Modeling

60

Counter Access Heuristics
2) L2 Cache:
Metric: 2nd level cache references
  Event: BSQ_cache_reference
    Counts cache ref-s as seen by bus unit
  Mask:
    All MESI read misses (LD & RFO)
    2nd level WR misses
3) 2nd Level BPU:
Metric 1: Instructions fetched from L2 (predict)
  Event: ITLB_Reference
    Counts ITLB translations
  Mask: All hits, misses & UC hits
Metric 2: Branches retired (history update)
  Event: branch_retired
    Counts branches retired
  Mask: Count all Taken/NT/Predicted/MissP

Page 61: Runtime Power Measurement/Modeling and Thermal Modeling

61

Counter Access Heuristics
4) ITLB & I-Fetch:
etc.
10) FP Execution:
Metric: FP instructions executed
  event1: packed_SP_uop
    counts packed single precision uops
  event2: packed_DP_uop
    counts packed double precision uops
  event3: scalar_SP_uop
    counts scalar single precision uops
  event4: scalar_DP_uop
    counts scalar double precision uops
  event5: 64bit_MMX_uop
    counts MMX uops with 64bit SIMD operands
  event6: 128bit_MMX_uop
    counts integer SSE2 uops with 128bit SIMD operands
  event7: x87_FP_UOP
    counts x87 FP uops
  event8: x87_SIMD_moves_uop
    counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops
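As a rough sketch of how the FP execution metric could be formed from the eight events listed above, the helper below simply sums their counts. Plain summation is an assumption made for illustration (the slides do not state how the events are combined), and the sample counts are invented.

```python
# Event names as listed on the slide; combining them by summation is assumed.
FP_EVENTS = (
    "packed_SP_uop", "packed_DP_uop", "scalar_SP_uop", "scalar_DP_uop",
    "64bit_MMX_uop", "128bit_MMX_uop", "x87_FP_uop", "x87_SIMD_moves_uop",
)


def fp_instructions_executed(counts):
    """Sum the per-event counts; events missing from `counts` count as 0."""
    return sum(counts.get(event, 0) for event in FP_EVENTS)


# Hypothetical counter readings for one sampling interval.
sample = {"scalar_DP_uop": 1200, "x87_FP_uop": 300, "x87_SIMD_moves_uop": 500}
print(fp_instructions_executed(sample))
```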

Page 62: Runtime Power Measurement/Modeling and Thermal Modeling

62

Page 63: Runtime Power Measurement/Modeling and Thermal Modeling

63

INTRODUCTION to RUNTIME

• What is Runtime Power/Thermal Measurement:
Methodology for measuring CPU power / temperature and component breakdowns
3 alternatives:
1. Measuring power/temperature directly from hardware, e.g. with multimeter probes
   Impossible with VLSI
   Runtime speed
2. Simulating processor execution with SW and extracting power/temperature data
   WATTCH, Tempest, etc.
   Computation time problems, especially with thermal
   Cycle level detail
3. Runtime Measurement: Getting processor power/thermal data at runtime using both hardware and software
   Runtime speed and SW support – not cycle detail!

Page 64: Runtime Power Measurement/Modeling and Thermal Modeling

64

INTRODUCTION to RUNTIME

• Why Runtime Power/Thermal Measurement:

Offers a hybrid technique between slow but detailed simulation and crude but fast realtime measurements
Hardware performance counters help extract lots of useful information – both performance and power – on the fly
Can be used for ‘priming’ instead of a long simulation where the last few million instructions are of most interest

Page 65: Runtime Power Measurement/Modeling and Thermal Modeling

65

WHY POWER & THERMAL

• Moore’s Law:
Transistor count x4 / 3 years
DRAM density x4 / 3 years
Performance improves exponentially; SO DOES POWER [1]

• Nuclear Core Example:

Page 66: Runtime Power Measurement/Modeling and Thermal Modeling

66

WHY POWER & THERMAL

Page 67: Runtime Power Measurement/Modeling and Thermal Modeling

67

…WHY POWER & THERMAL

• Battery technology improves much more slowly

• Packaging costs: +$1/W over 35-40W [2]

Page 68: Runtime Power Measurement/Modeling and Thermal Modeling

68

POWER BASICS

• Total Power = Dynamic Power + Static Power + Short Circuit Power

Dynamic Power (switching power):
Discharging of capacitances when switching occurs (0 ↔ 1) – data dependent

$$P_{sw} = \tfrac{1}{2}\,\alpha\,C_L\,V_{dd}^2\,f$$

where α is the switching activity (the probability of switching in a cycle).

Page 69: Runtime Power Measurement/Modeling and Thermal Modeling

69

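Derivation of Switching Power

$$i_C = C\,\frac{dV}{dt}$$

$$\text{Power} = V\,i = C\,V\,\frac{dV}{dt}$$

$$\text{Energy} = \tfrac{1}{2}\,C\,V^2$$

0 → 1, at each transition:

$$\text{Energy} = C_L V_{dd}^2, \qquad \text{Power} = \frac{\text{Energy}}{\text{time}} = \frac{C_L V_{dd}^2}{\text{clock period}} = C_L V_{dd}^2\,f$$

1 → 0, at each transition: this charge is dissipated.

$$\frac{\text{Energy}}{\text{transition}} = \frac{\text{Total Energy}}{\text{total transitions}} = \tfrac{1}{2}\,C_L V_{dd}^2$$

$\alpha = P_{0 \to 1}$: probability of switching in a cycle (switching activity)

$$\text{Power} = \frac{\text{Energy}}{\text{transition}} \cdot \alpha \cdot f = \tfrac{1}{2}\,\alpha\,C_L V_{dd}^2\,f$$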
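A short numeric sketch of the switching power formula P = (1/2)·α·C_L·V_dd²·f. The capacitance and activity values below are made up for illustration; only the 1.75 V and 1.4 GHz figures echo numbers quoted elsewhere in the deck.

```python
def switching_power(alpha, c_load, vdd, freq):
    """Dynamic (switching) power in watts: 0.5 * alpha * C_L * Vdd^2 * f."""
    return 0.5 * alpha * c_load * vdd ** 2 * freq


# Hypothetical block: 1 nF of switched capacitance, 10% switching activity.
p = switching_power(alpha=0.1, c_load=1e-9, vdd=1.75, freq=1.4e9)
print(f"switching power: {p:.3f} W")

# Quadratic dependence on supply voltage: halving Vdd quarters the power.
p_half = switching_power(alpha=0.1, c_load=1e-9, vdd=0.875, freq=1.4e9)
print(f"ratio after halving Vdd: {p_half / p:.2f}")
```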

Page 70: Runtime Power Measurement/Modeling and Thermal Modeling

70

POWER BASICS

Static Power (leakage power):
Due to leakage through the N channel and through the drain-substrate junctions.

Page 71: Runtime Power Measurement/Modeling and Thermal Modeling

71

POWER BASICS

Short Circuit Power:
Due to finite rise time of input signal.
Generic CMOS feature

• In comparison:
Currently: 80% Sw. + 10% Leak + 10% SC
Future: 45% Sw. + 45% Leak + 10% SC [3]

Page 72: Runtime Power Measurement/Modeling and Thermal Modeling

72

NEED FOR SPEED

WATTCH simulates 80K instr-s/sec

SpecINT 164.GZIP runs ~350 s with average upc ~1.3 on a 1.4 GHz P4, producing ~665 billion uops

WATTCH simulation would take ~100 days

Assuming a 1 GHz machine: 1 s of real run ~5 x IPC hrs of WATTCH simulation
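The ~100 day figure for GZIP follows from dividing the uop count by the simulation rate. The sketch below redoes that arithmetic with the numbers quoted on the slide; the function and constant names are illustrative.

```python
WATTCH_INSTR_PER_SEC = 80e3   # simulation speed quoted on the slide
SECONDS_PER_DAY = 24 * 3600


def simulation_days(instructions, rate=WATTCH_INSTR_PER_SEC):
    """Days of simulation needed to replay `instructions` at `rate` instr/s."""
    return instructions / rate / SECONDS_PER_DAY


gzip_uops = 665e9  # ~665 billion uops for SpecINT 164.gzip on the 1.4 GHz P4
print(f"estimated WATTCH time: {simulation_days(gzip_uops):.0f} days")
```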

Page 73: Runtime Power Measurement/Modeling and Thermal Modeling

73

P4 Details
Karelian.ee:
P4 – 1.4 GHz
0.18 µm, C4-FC-PGA-423
Heatsink: Folded Fin
M6, Al interconnect
Die Size: 217 mm²
Package Size: 5.34 cm x 5.17 cm
Power: Idle/typ./max = ??/51.8/71 W
D$1 & T$1 / L2: 8K & 12K uops / 256K
Voltage: 1.7/1.75 V

Page 74: Runtime Power Measurement/Modeling and Thermal Modeling

74

P4 Details
1st LKM: <LKM_CPUinfo & UserLevel_CPUinfo>
Implements syscall: getCPUinfo()
Gathers CPU info from:
  /asm/processor.h
  Intel control registers (CR4)
  CPUID instruction
Reveals:
  Debug Store mechanism exists for PEBS
  TSC exists
  MSRs implemented
We can read/write performance counters
EX:
  karelian (P4, Willamette): UserLevel_CPUinfo
  viale (P4, Northwood): UserLevel_CPUinfo

Page 75: Runtime Power Measurement/Modeling and Thermal Modeling

75

P4 Detector - Counter Clusters
[Figure: block diagram; labels: P4 Components, EVENTS, Event Detectors, Event Counters, 4 bit wide bus]

Page 76: Runtime Power Measurement/Modeling and Thermal Modeling

76

Counters, ESCRs & CCCRs

Simplified Recipe:
1. Select event to count
2. Select a counter (also defines CCCR)
3. Select an ESCR
4. Set ESCR fields
5. Set CCCR fields
6. Enable CCCR

Page 77: Runtime Power Measurement/Modeling and Thermal Modeling

77

Counting Mechanisms
Counting types:
  Non-retirement: events that occur any time during execution
  At-retirement: events at the retirement of an instruction
    Can count BOGUS vs NBOGUS, tag uops to count, etc.
Mechanisms:
  Front end tagging (e.g. LD/ST retired)
  Execution tagging (e.g. packed_DP_retired)
  Replay tagging (e.g. L1 misses)
  No tags (e.g. uops retired)
Also: Event Counting | IEBS | PEBS

Page 78: Runtime Power Measurement/Modeling and Thermal Modeling

78

At Retirement Counting Terminology

BOGUS/NBOGUS (speculative)
Tagging (count uops that encounter event)
Replay (data speculation)

Page 79: Runtime Power Measurement/Modeling and Thermal Modeling

79

Verifying Counter Reader
1) L1Dcache_exercise:
Uses pointer assignment
L1 = 8K, L2 = 256K
Array Size = (L1 Size / Hit Rate)
  e.g. for 10% hit rate: 80K → 20K entries
  Array Size < L2 size
Array elements: PRBS of array indices
Bench loop: new index ← array[old index]
However, gcc puts 5 LDs in the bench loop
  4 static: hit rate ~100%
  1 our load: our desired hit rate
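The sizing rule and the effect of the four extra loads can be sketched as below. The 4-byte entry size is inferred from the "80K → 20K entries" example, and the blended hit-rate formula (four loads that always hit plus one load at the desired rate) is an assumption about how the loop-level rate comes out, not something the slide states.

```python
L1_BYTES = 8 * 1024   # L1 data cache size from the slide
ENTRY_BYTES = 4       # inferred from "80K -> 20K entries"


def array_entries(desired_hit_rate):
    """Number of array entries so that Array Size = L1 Size / Hit Rate."""
    return round(L1_BYTES / desired_hit_rate) // ENTRY_BYTES


def loop_hit_rate(desired_hit_rate, static_loads=4):
    """Hit rate seen over all loads in the bench loop (assumed blend)."""
    return (static_loads * 1.0 + desired_hit_rate) / (static_loads + 1)


print(array_entries(0.10))   # entries needed for a 10% hit rate
print(loop_hit_rate(0.25))   # loop-level hit rate when our load hits 25%
```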

Page 80: Runtime Power Measurement/Modeling and Thermal Modeling

80

…Verifying Counter Reader

1) L1Dcache_exercise results:

[Figure: "L1Dcache Experiment"; x-axis: Desired Hit Rate (0.04 to 1000); y-axis: Acquired Rates (-20% to 120%); series: Acquired L1 Hit Rate, Our L1 Hit Rate From L2 Accesses]

Ex: L1Dcache_exercise, Hit Rate = 0.25

Page 81: Runtime Power Measurement/Modeling and Thermal Modeling

81

…Verifying Counter Reader
2) branch_exercise:
Uses random number comparison
Assigns 400K PRBS array outside bench loop
  To avoid rand() instructions in bench loop
Bench loop:
  Compares array index to threshold
  Threshold = RAND_MAX*TakenRate
  Repeats 1000 times, reseeding each time
However gcc adds 2 more branches into bench loop:
  Loop exit condition (Prediction ~100%)
  Unconditional JMP (Prediction ~100%)
Our Branch’s Expected Mispredict Rate: ~(0.5 - |TakenRate – 0.5|)
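The expected mispredict rate above, and how the two compiler-added branches dilute it, can be sketched as follows. The dilution formula assumes the two extra branches are always predicted correctly, which is an approximation of the "~100%" stated on the slide.

```python
def expected_mispredict_rate(taken_rate):
    """Approximate mispredict rate of the test branch: 0.5 - |TakenRate - 0.5|."""
    return 0.5 - abs(taken_rate - 0.5)


def loop_mispredict_rate(taken_rate, extra_branches=2):
    """Rate over all branches in the loop, assuming the extras never mispredict."""
    return expected_mispredict_rate(taken_rate) / (1 + extra_branches)


for rate in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(rate, expected_mispredict_rate(rate), loop_mispredict_rate(rate))
```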

Page 82: Runtime Power Measurement/Modeling and Thermal Modeling

82

…Verifying Counter Reader

2) branch_exercise results:

Ex: branch_exercise, Taken Rate = 0.5

[Figure: "Branch Prediction Experiment"; x-axis: Desired Taken Rate (0 to 1); y-axis: Acquired Rates (0% to 120%); series: Approximated Mispredict Rate, Our Branch's Taken Rate]

Page 83: Runtime Power Measurement/Modeling and Thermal Modeling

83

P4 POWER MEASUREMENT
Complete Setup:
Clamp current probe over 12V lines (1 mV/Adc conversion)
Voltage [V] readings
Serial Reader (PowerMeter) (PowerPlotter)
Log voltage readings
Convert to instantaneous power: 12 x Vsample x 1000
Log power values
Plot power values

Page 84: Runtime Power Measurement/Modeling and Thermal Modeling

84

MEASUREMENT Method
Select power lines that reflect CPU power
  P4 uses 12 V lines
Clamp the current probe over the 12V lines
  1 mV/Adc conversion
Connect the clamp into DMM
Send voltage reading over serial
Log the voltage readings
Convert to instantaneous power as: 12 x Vsample x 1000
Log power values
Plot power values
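The conversion step is just Ohm-free bookkeeping: the clamp outputs 1 mV per ampere, so the voltage sample times 1000 is the current in amperes, and multiplying by the 12 V rail gives watts. A minimal sketch (names and the sample reading are illustrative):

```python
SUPPLY_VOLTS = 12.0      # the clamped lines are the 12 V lines
AMPS_PER_VOLT = 1000.0   # 1 mV/A clamp output -> 1000 A per volt


def clamp_voltage_to_power(v_sample):
    """Instantaneous power in watts: 12 x Vsample x 1000."""
    return SUPPLY_VOLTS * v_sample * AMPS_PER_VOLT


# Hypothetical DMM reading of 4 mV corresponds to 4 A on the 12 V lines.
print(clamp_voltage_to_power(0.004))
```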

Page 85: Runtime Power Measurement/Modeling and Thermal Modeling

85

MEASUREMENT Tools
Poll serial port every ~20 ms
  quicker: overkill; slower: overlook
Compute running average sample every Δt you select
  Easier to sync with Power Model
PowerMeter:
  Convert voltage reading to power and log
  P = 12 x Vread x 1000
PowerPlotter:
  Plot power samples over sliding time window
  100 s history with 1000 samples (Δt = 100 ms)
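A minimal sketch of the averaging and sliding-window logic described above, assuming 20 ms polls grouped five at a time to give Δt = 100 ms and a 1000-sample history. The function names and the fake readings are illustrative; this is not the actual PowerMeter or PowerPlotter code.

```python
from collections import deque

POLLS_PER_SAMPLE = 5   # 5 polls x 20 ms = one 100 ms power sample
HISTORY_SAMPLES = 1000  # 1000 samples x 100 ms = 100 s sliding window


def averaged_power_samples(v_reads, polls_per_sample=POLLS_PER_SAMPLE):
    """Average each group of voltage polls and convert to watts (12 x V x 1000)."""
    samples = []
    for start in range(0, len(v_reads) - polls_per_sample + 1, polls_per_sample):
        chunk = v_reads[start:start + polls_per_sample]
        samples.append(12.0 * 1000.0 * sum(chunk) / len(chunk))
    return samples


# Sliding window: appending beyond maxlen silently drops the oldest sample.
history = deque(maxlen=HISTORY_SAMPLES)
history.extend(averaged_power_samples([0.004] * 10 + [0.005] * 5))
print(list(history))
```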

Page 86: Runtime Power Measurement/Modeling and Thermal Modeling

86

Current Probe
Fluke i410
Uses Hall voltage to measure current and convert to voltage: 1 mV/Adc
Range: 0.5 – 400 A
Accuracy: 3.5% + 0.5 A
Generated voltage is fed to DMM
Compared against the Ppro Amoeba shunt setup for verification

Page 87: Runtime Power Measurement/Modeling and Thermal Modeling

87

Clamp vs Shunt

[Figure: four panels of current [A] vs sample index (100 ms per sample): "sampled current for L1Dcache from clamp", "sampled current for L1Dcache from shunt", "current for grep from shunt", "current for grep from clamp"]

Page 88: Runtime Power Measurement/Modeling and Thermal Modeling

88

DMM
Agilent 34401A
Measurement motive:
  We should sample as quickly as possible (grep case)
Measurement setup:
  Fast 4 digit, Autozero OFF, Display OFF
  From [8], 1000 readings/s (x150 faster than fast 6 digit)
Serial interface:
  From [9], 55 ASCII readings/s (~18 ms per reading)
  Polling the serial port faster than every 20 ms is overkill

Page 89: Runtime Power Measurement/Modeling and Thermal Modeling

89

P4 Power Lines
Which power lines should we cut / clamp?
[5] shows the power lines:
  1: CPU power connector
  13: System power connector
  P1 ↔ 13 & P2 ↔ 1
[6],[7] say P4 uses 12V lines for CPU, rather than 5V lines
Both P1 & P2 have 12, 5 and 3.3 V lines
I run branch_exercise (takenRate=1) and gzip_static to obtain the current variation on the lines

Page 90: Runtime Power Measurement/Modeling and Thermal Modeling

90

Current on Power Lines

[Figure: current I [A] vs time (s), one panel per supply line: "Current on Connector P1 line7 (12V)", "Current on Connector P1 lines1,3,,6,18,19,20,22 (5V)", "Current on Connector lines 11,12,23 (3.3V)", "Current on connector P2 line1 (3.3V)", "Current on connector P2 line14 (5V)", "Current on Connector P2 line 3 (12V)", "Current on Connector P2 line7 (12V)", "Current on connector P2 line 9 (5V)"]

Reveals that ALL 3 12V lines’ currents follow CPU activity: all add to CPU power!

Page 91: Runtime Power Measurement/Modeling and Thermal Modeling

91

Validating with Optimizations
Compare to Optimizations vs Power of [Seng & Tullsen]

[Figure: "SPECINT AVE. Power vs gcc Optimizations"; x-axis: GZIP, VPR, GCC; y-axis: Average Power [W] (39 to 53); series: O0, O1, O2, O3, O3 unroll, O3 unroll ALL]

Page 92: Runtime Power Measurement/Modeling and Thermal Modeling

92

Optimizations
-O0
  None at all
-O1 -fomit-frame-pointer
  thread-jumps, delayed-branches, defer-pop
-O2 -fomit-frame-pointer
  CSE related blocks, jumps, expensive optimizations, reschedule instr-ns, etc.
-O3 -fomit-frame-pointer
  O2 + inline functions heuristically
-O3 -fomit-frame-pointer -funroll-loops
  Only for #iterations known at compile/run time
-O3 -fomit-frame-pointer -funroll-all-loops
  Do for all loops (usually bad result)

Page 93: Runtime Power Measurement/Modeling and Thermal Modeling

93

GZIP – power vs time

[Figure: "Power for GZIP Optimizations"; x-axis: time (s), 0 to 900; y-axis: power [W], 0 to 70; series: O0, O1, O2, O3, O3unroll, O3unrollALL]

Page 94: Runtime Power Measurement/Modeling and Thermal Modeling

94

…GZIP – power vs time
All have similar power
Exec. time(O0) ~ x2 Exec. time(O-else)
Different data sets provide different power profiles

Page 95: Runtime Power Measurement/Modeling and Thermal Modeling

95

3 specINT average Power

[Figure: "SPECINT AVE. Power vs gcc Optimizations"; x-axis: GZIP, VPR, GCC; y-axis: Average Power [W] (39 to 53); series: O0, O1, O2, O3, O3 unroll, O3 unroll ALL]

Optimized code runs quicker, and yet with less average power

specFP – art seems to be the exception?

Page 96: Runtime Power Measurement/Modeling and Thermal Modeling

96

About the ripples
Add ripple stuff here…

Page 97: Runtime Power Measurement/Modeling and Thermal Modeling

97

P4 Architecture vs Layout

Components to Model:

1) Bus Control
2) L2 Cache
3) 2nd Level BPU
4) ITLB & Ifetch
5) L1 Cache
6) MOB
7) Mem Control
8) DTLB
9) Int EXE
10) FP EXE
11) Int RF
12) FP RF
13) Decode
14) Trace $
15) 1st Level BPU
16) Microcode ROM
17) Allocation
18) Rename
19) Inst-n Qs
20) Schedule
21) Inst-n Qs
22) Retirement

Page 98: Runtime Power Measurement/Modeling and Thermal Modeling

98

Defining Components

Page 99: Runtime Power Measurement/Modeling and Thermal Modeling

99

Counter Rotations

Page 100: Runtime Power Measurement/Modeling and Thermal Modeling

100

Experiment Setup

POWERCLIENT

POWERSERVER

Page 101: Runtime Power Measurement/Modeling and Thermal Modeling

Component Breakdowns

Page 102: Runtime Power Measurement/Modeling and Thermal Modeling

102

THERMAL Basics

Duality: heat flow ↔ electrical flow

Thermal Mass (Capacitance):
$C_{th} = c \cdot A \cdot t$ [J/K]
  c: Specific heat [J/m³K]
  A: Block area [m²]
  t: Wafer thickness [m]

Thermal Resistance:
$R_{th,norm} = \rho \cdot t / A$ [K/W]
  ρ: Thermal resistivity [m·K/W]
  A: Block area [m²]
  t: Wafer thickness [m]

Page 103: Runtime Power Measurement/Modeling and Thermal Modeling

103

Simplified Thermal Model
Divide the CPU into component blocks
Each block dissipates different power, P_block, and reveals different temperature changes, T_block

[Figure: DIE divided into blocks Blk_i, Blk_j, Blk_k, Blk_l on top of the HEATSINK; each block x is a thermal node T_{b,x} with R_{th,x} and C_{th,x}, driven by P_x; the heatsink node T_h has R_{th,h} and C_{th,h} and is driven by P_i+P_j+P_k+P_l]

$$P_i = C_{th,i}\,\frac{dT_{b,i}}{dt} + \frac{T_{b,i} - T_h}{R_{th,i}}$$

Final difference equation:

$$\Delta T_i = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$$

Δt: Sampling interval
T_i: The temperature difference between block and the heatsink
Δt should be much smaller than the RC time constant, τ_{th,i}

Numerical Values? See Quantitative Example

Page 104: Runtime Power Measurement/Modeling and Thermal Modeling

104

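QUANTITATIVE EXAMPLE
Use t = 0.1 mm – thinned wafer
Areas given in table (c = 10^6 [J/m³K] & ρ = 10^-2 [m·K/W])
$\tau_{th} = R_{th} C_{th} = \rho\,c\,t^2 = 10^{-4}\ \text{s} = 100\ \mu\text{s}$, independent of area!

Temperature buildup for Regfile with Δt = 133.4 ns:

$$\Delta T_{blk}\ (\text{w.r.t. heatsink}) = \frac{P_{blk}}{C_{th,blk}}\,\Delta t - \frac{T_{blk}}{R_{th,blk}\,C_{th,blk}}\,\Delta t$$

[Numeric substitution not recoverable from the extracted text; surviving values: 21.11, 42.85]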
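A quick numeric check of the claim that the block time constant is independent of area, using the c, ρ and t values from the example. The two block areas are arbitrary choices for the demonstration.

```python
def thermal_rc(area_m2, thickness_m=1e-4, c=1e6, rho=1e-2):
    """Return (R_th [K/W], C_th [J/K], tau [s]) for a block of given area."""
    r_th = rho * thickness_m / area_m2   # R_th = rho * t / A
    c_th = c * area_m2 * thickness_m     # C_th = c * A * t
    return r_th, c_th, r_th * c_th


for area in (1e-6, 1e-5):  # hypothetical 1 mm^2 and 10 mm^2 blocks
    r_th, c_th, tau = thermal_rc(area)
    print(f"A={area:g} m^2: R={r_th:g} K/W, C={c_th:g} J/K, tau={tau:g} s")

# The 133.4 ns sampling interval is far below the 100 us time constant.
print(133.4e-9 / thermal_rc(1e-6)[2])
```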

Page 105: Runtime Power Measurement/Modeling and Thermal Modeling

105

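THERMAL FORMULATION

For any block i:

[Figure: thermal node T_{b,i} driven by P_i, with C_{th,i} and with R_{th,i} connecting it to the heatsink node T_h]

$$P_i = C_{th,i}\,\frac{dT_{b,i}}{dt} + \frac{T_{b,i} - T_h}{R_{th,i}}$$

Discretizing time:

$$P_i = C_{th,i}\,\frac{\Delta T_{b,i}}{\Delta t} + \frac{T_{b,i} - T_h}{R_{th,i}}$$

Define: $T_i = T_{b,i} - T_h$

Assuming $T_h$ const ($\Delta T_h \approx 0$): $\Delta T_i = \Delta T_{b,i}$, so

$$P_i = \frac{T_i}{R_{th,i}} + C_{th,i}\,\frac{\Delta T_i}{\Delta t}$$

Final difference equation:

$$\Delta T_i = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$$

Δt: Sampling interval
T_i: The temperature difference between block and the heatsink
Δt should be much smaller than the RC time constant, τ_{th,i}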
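The final difference equation can be iterated directly. The sketch below does so for a single block and shows the temperature difference approaching P·R_th. P = 10 W and R_th = 4 K/W match the regfile figures quoted in the deck; the capacitance and the sampling interval are assumed values chosen so that Δt is much smaller than R·C.

```python
def simulate_block(power_w, r_th, c_th, dt, steps, t_rel=0.0):
    """Iterate dT = (P/C)*dt - (T/(R*C))*dt; T is relative to the heatsink."""
    for _ in range(steps):
        t_rel += power_w * dt / c_th - t_rel * dt / (r_th * c_th)
    return t_rel


# Assumed C_th = 1e-4 J/K gives tau = R*C = 400 us; dt = 1 us << tau.
t_final = simulate_block(power_w=10.0, r_th=4.0, c_th=1e-4, dt=1e-6, steps=4000)
print(f"temperature rise after 10 time constants: {t_final:.3f} K")
```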

Page 106: Runtime Power Measurement/Modeling and Thermal Modeling

106

Refined Thermal Model
Steady state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
Proceeding to a multilayer thermal model:
  Active die thickness → metalization/insulation → chip-package interface → package → heatsink
Requires searching of several materials/dimensions and thermal properties
Multiple layers → multiple T nodes → multiple DEs
Baseline heat removal structure:

Page 107: Runtime Power Measurement/Modeling and Thermal Modeling

107

Refined Thermal Model
Need to define the physical structure
  All the layers heat-flux propagates through
Corresponding thermal model
  Multinode
  Different assumptions/decisions
Physical parameters for different elements
  Dimensions
  Material types
  ρth and cth
New set of thermal update DEs

Page 108: Runtime Power Measurement/Modeling and Thermal Modeling

108

Physical Model vs. Thermal Model

[Figure: physical layer stack next to its thermal R-C network. Nodes and elements: Tdie,i (Rdie,i, Cdie,i, driven by Pi); Tp,i (Rp,i, Cp,i); Tspr (Rspr, Cspr, driven by Ptotal); R_grXspr; Th (Rh, Ch); R_hXA; TA]

Page 109: Runtime Power Measurement/Modeling and Thermal Modeling

109

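Analytical Derivation

4 nodes → 4 DEs

1) Tspr:

$$P_{total} = C_{spr}\,\frac{dT_{spr}}{dt} + \frac{T_{spr} - T_h}{R_{spr+gr}}$$

Discretizing time:

$$P_{total} = C_{spr}\,\frac{\Delta T_{spr}}{\Delta t} + \frac{T_{spr} - T_h}{R_{spr+gr}} \;\Rightarrow\; P_{total}\,\Delta t = \frac{T_{spr} - T_h}{R_{spr+gr}}\,\Delta t + C_{spr}\,\Delta T_{spr}$$

Final difference equation:

$$\Delta T_{spr} = \frac{P_{total}}{C_{spr}}\,\Delta t - \frac{1}{C_{spr}\,R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t$$

$$T_{spr} = T_{spr} + \Delta T_{spr}$$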

Page 110: Runtime Power Measurement/Modeling and Thermal Modeling

110

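…Analytical Derivation

2) Th:

$$\Delta T_h = \frac{T_{spr} - T_h}{R_{spr+gr}}\,\frac{\Delta t}{C_h} - \frac{1}{C_h\,R_{hxa}}\,(T_h - T_A)\,\Delta t, \qquad T_h = T_h + \Delta T_h$$

3) Tdie,i:

$$\Delta T_{die,i} = \frac{P_i}{C_{die,i}}\,\Delta t - \frac{1}{C_{die,i}\,R_{die,i}}\,(T_{die,i} - T_{p,i})\,\Delta t, \qquad T_{die,i} = T_{die,i} + \Delta T_{die,i}$$

4) Tp,i:

$$\Delta T_{p,i} = \frac{T_{die,i} - T_{p,i}}{R_{die,i}}\,\frac{\Delta t}{C_{p,i}} - \frac{1}{C_{p,i}\,R_{p,i}}\,(T_{p,i} - T_{spr})\,\Delta t, \qquad T_{p,i} = T_{p,i} + \Delta T_{p,i}$$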

Page 111: Runtime Power Measurement/Modeling and Thermal Modeling

111

Temperature Updating and Initial Conditions

D.E.s should be updated along the direction of current (power) flow: Tdie,i → Tp,i → Tspr → Th

It is not reasonable to start from ambient temperatures as initial conditions. Mostly, the processor is already running.

TA is given as ~50°C by Intel Thermal Design Guidelines
Assume idle power (Ppro ~2 W):
  Th = TA + 2W · Rhxa = ~52°C
  Tspr = Th + 2W · Rspr+gr = ~52°C
  Tp,i = Tdie,i = Tspr = ~52°C

Update order: Update Tdie,i → Update Tp,i → Update Tspr → Update Th
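The update order and the idle-power initial conditions can be sketched as below. This is a toy version under loud assumptions: every R and C is set to 1 (not a physical Ppro value), all blocks share the same die and package parameters, and the node equations follow the four-node R-C structure as I read it from the slides (each die block feeds its package node, total power feeds the spreader, the spreader feeds the heatsink, the heatsink feeds ambient).

```python
def initial_state(n_blocks, t_amb, p_idle, r_hxa, r_spr_gr):
    """Idle steady state as the starting point (temperatures in kelvin)."""
    t_h = t_amb + p_idle * r_hxa
    t_spr = t_h + p_idle * r_spr_gr
    return {"die": [t_spr] * n_blocks, "pkg": [t_spr] * n_blocks,
            "spr": t_spr, "h": t_h}


def update(state, powers, prm, dt):
    """One time step in the order T_die,i -> T_p,i -> T_spr -> T_h."""
    for i, p_i in enumerate(powers):
        flow = (state["die"][i] - state["pkg"][i]) / prm["r_die"]
        state["die"][i] += (p_i - flow) * dt / prm["c_die"]
    for i in range(len(powers)):
        inflow = (state["die"][i] - state["pkg"][i]) / prm["r_die"]
        outflow = (state["pkg"][i] - state["spr"]) / prm["r_pkg"]
        state["pkg"][i] += (inflow - outflow) * dt / prm["c_pkg"]
    outflow = (state["spr"] - state["h"]) / prm["r_spr_gr"]
    state["spr"] += (sum(powers) - outflow) * dt / prm["c_spr"]
    inflow = (state["spr"] - state["h"]) / prm["r_spr_gr"]
    outflow = (state["h"] - prm["t_amb"]) / prm["r_hxa"]
    state["h"] += (inflow - outflow) * dt / prm["c_h"]
    return state


# Hypothetical unit-valued parameters; ambient 323 K as in the parameter slide.
PRM = {"r_die": 1.0, "c_die": 1.0, "r_pkg": 1.0, "c_pkg": 1.0,
       "r_spr_gr": 1.0, "c_spr": 1.0, "r_hxa": 1.0, "c_h": 1.0,
       "t_amb": 323.0}

state = initial_state(2, PRM["t_amb"], p_idle=2.0,
                      r_hxa=PRM["r_hxa"], r_spr_gr=PRM["r_spr_gr"])
for _ in range(5000):  # dt is kept well below every R*C product
    update(state, [2.0, 1.0], PRM, dt=0.01)
print(state)
```

With these toy values the heatsink settles at ambient plus total power times Rhxa, and each layer above it adds its own P·R step, which is a convenient sanity check on the update equations.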

Page 112: Runtime Power Measurement/Modeling and Thermal Modeling

112

Steady State Solution
If Rth,i → Rth,i x20, then Tss,i → Tss,i x20

Regfile ex. of presentation 1: Pi = 10 & Rth,i = 4 → Ti,ss = 40K

[Figure: same die-blocks-over-heatsink R-C network as in the simplified thermal model]

Final difference equation:

$$\Delta T_i = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$$

Steady state solution: $\Delta T_i = 0$

$$0 = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_{ss,i}}{R_{th,i}\,C_{th,i}}\,\Delta t \;\Rightarrow\; T_{ss,i} = P_i\,R_{th,i} = \frac{P_i\,\rho\,t}{A_i}$$

Numerically for decode: $T_{ss,dec}$ [K]

[Numeric substitution not recoverable from the extracted text; surviving values: 15.010.35]
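The steady state relation T_ss = P·R_th and its linear scaling with R_th are easy to check numerically. The sketch uses the regfile numbers quoted above; the function name is illustrative.

```python
def steady_state_rise(power_w, r_th):
    """Steady-state temperature difference to the heatsink: T_ss = P * R_th."""
    return power_w * r_th


regfile = steady_state_rise(10.0, 4.0)
print(regfile)                                  # regfile example: 40 K
print(steady_state_rise(10.0, 4.0 * 20) / regfile)  # R_th x20 scales T_ss x20
```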

Page 113: Runtime Power Measurement/Modeling and Thermal Modeling

113

EX: Ppro Thermal Model
Use CASTLE computed component powers
Select thermal sampling interval
Determine component areas from die photo
Determine processor/packaging physical parameters
Generate numerical thermal model
Apply component difference equations recursively

Page 114: Runtime Power Measurement/Modeling and Thermal Modeling

114

Simulation

ρ and c values hardcoded for materials (except Si)
Areas/relative areas hardcoded for components
Individual R and C computed for components
D.E. loop is re-executed every Δt, in the discussed order
Updated thermal nodes displayed every t ~20 ms
Component temperatures build up to ~350K in ~5 hrs
Clock temp. shoots up
Theatsink moves very slowly, as expected

For the complete set of computed numerical simulation results, go to the additional slides

Page 115: Runtime Power Measurement/Modeling and Thermal Modeling

115

Simulation Outputs – at Startup

Page 116: Runtime Power Measurement/Modeling and Thermal Modeling

116

Simulation Outputs – After 5 hrs

Page 117: Runtime Power Measurement/Modeling and Thermal Modeling

117

Thermal Model Parameters

BASELINE AMBIENT TEMPERATURE
T_ambient = 323; /* in K */ (Intel Thermal Design Guidelines)

SAMPLING INTERVAL
dt = 5e-6 s (my choice)

Processor Specific Parameters

Page 118: Runtime Power Measurement/Modeling and Thermal Modeling

118

Physical Parameters

15% of heatsink area has fins, 85% doesn’t
Overall Rth estimate:
Rfin Rnofin

Page 119: Runtime Power Measurement/Modeling and Thermal Modeling

119

…Physical Parameters

Temperature assumed uniform along heat spreader – and therefore, above

Page 120: Runtime Power Measurement/Modeling and Thermal Modeling

120

…Physical Parameters

We don’t use total R&C for package as it’s decomposed into component areas in the model

DIE:
Process info scaled from P4 data in [7] using ITRS 1999 & 2001 and interpolating MPU ½ pitch vs. wire pitch
Metal layer & isolation scale factor: 2.15
ITRS FEP Si final device thickness ~100 nm (130 nm tech.)
  I used the overall wafer thickness
Temperature dependent Si thermal conductivity: $k_{Si}(T) = 1.5486 \times 10^{2} \cdot (300/T)^{4/3}$ [W/(m·K)]

Page 121: Runtime Power Measurement/Modeling and Thermal Modeling

121

…Physical Parameters
DIE Rth Estimate:
Rdie = RSi + Rmetal + Rpoly + RSiO2
For 10% die area:
  RSi ~ 0.1 K/W
  Rmetal ~ 0.0008 K/W
  Rpoly ~ single layer, ignorable
  RSiO2 ~ 0.86 K/W
Rdie ~ RSi + RSiO2

DIE Cth Estimate:
Only Si considered, as the rest is much thinner

Page 122: Runtime Power Measurement/Modeling and Thermal Modeling

122

Numerical Values

Page 123: Runtime Power Measurement/Modeling and Thermal Modeling

123

Computed Thermal values

Page 124: Runtime Power Measurement/Modeling and Thermal Modeling

124

Computed Thermal v.2 values

Page 125: Runtime Power Measurement/Modeling and Thermal Modeling

125

Ppro info & Areas
Complete processor info ([4],[5],[6]):
  200 MHz
  4 metal layers
  Package: 387 pin DC-PGA
  Package size: 6.76 cm x 6.25 cm
  0.35 µm BiCMOS
  Die size: 196 mm² (14x14)

Area estimates for die
Scale component areas from [1]:

              [1]        Ours
  Clock       150 MHz    200 MHz
  Process     0.50 µm    0.35 µm    <process scaling x0.7>
  Die size    306 mm²    196 mm²    <area scaling x0.64>

I use x0.64 area scaling and [1]’s breakdowns for component area estimates

Page 126: Runtime Power Measurement/Modeling and Thermal Modeling

126

Component Areas

[Figure: die photo annotated with component area fractions: 3.9%, 11.8%, 7.9%, 4.0%, 4.4%, 4.2%, 7.6%, 8.6%, 14.3%, 4.1%, 2.5%, 2.2%, 4.6%, 1.3%]

Close to Intel data:
These areas cover ~81.3% of die
Clock area found from Intel data as:
Aclk = Pclk / PwrDensityclk = 1.7%

Page 127: Runtime Power Measurement/Modeling and Thermal Modeling

127

CASTLE Breakdown Areas
We need to convert the given areas to CASTLE comp-s:

DECODE ← ID + MIS = 11.7%
ISSUE ← RS = 7.6%
REORDER ← RAT + (ROB & RRF) = 8.6%
DMEM ← DCU = 8.6%
IMEM ← IFU = 11.8%
FUNC_UNIT ← AGU + IEU + FEU = 10%
OTHER ← 100 - above = 41.7%
CLOCK ← 1.7%
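The OTHER share and the absolute block areas follow from simple arithmetic on the percentages above and the 196 mm² die size quoted on the Ppro info slide. A small sketch (the dictionary layout is only for illustration):

```python
DIE_AREA_MM2 = 196.0  # Ppro die size from the processor info slide

# Area fractions in percent, as listed for the CASTLE components.
castle_pct = {
    "DECODE": 11.7, "ISSUE": 7.6, "REORDER": 8.6,
    "DMEM": 8.6, "IMEM": 11.8, "FUNC_UNIT": 10.0,
}
castle_pct["OTHER"] = 100.0 - sum(castle_pct.values())  # "100 - above"

for name, pct in castle_pct.items():
    print(f"{name:10s} {pct:5.1f}%  {DIE_AREA_MM2 * pct / 100.0:6.1f} mm^2")
```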

Page 128: Runtime Power Measurement/Modeling and Thermal Modeling

128

CASTLE

• Power measurement / profiling tool
• Developed by Prof Martonosi and Russ
• Implemented on a P6, Linux
• Generates power profiles for benchmarks at runtime
  Uses performance counters to gather utilization information
  Uses WATTCH’s per usage wattage values for max power values ([8 p.3])
  Uses heuristics to extract usage counts for blocks
  Uses register sampling to compute activity factors for single ended bitlines
  Computes total processor power
  Uses a digital multimeter for validation

Page 129: Runtime Power Measurement/Modeling and Thermal Modeling

129

Performance Counters

• Exist on most new processors
• Mainly used to track performance related events
  Cache misses
  Committed instr-s, etc.
• Can be used to gather power related data
• P6 has 2 performance counters that count 77 events
  Can be accessed with:
    RDMSR (Read Model Specific Register)
    WRMSR (Write Model Specific Register)
    RDTSC (Read Time Stamp Counter)
  Kernel level (Ring 0) instructions
  Exemplary events:
    0. TSC: elapsed machine cycles
    03. 03H: L1 read misses
    44. C0H: instr-ns retired

Page 130: Runtime Power Measurement/Modeling and Thermal Modeling

130

Heuristics

• To extract power related data from performance counters

• Platform Dependent!

Page 131: Runtime Power Measurement/Modeling and Thermal Modeling

131

CASTLE implementation

• Platform: P6, 200 MHz | Linux kernel v2.2.16-3

HW counters

Kernel Code

Server code

Series Resistance

Xmultimeter server

Client Code

Page 132: Runtime Power Measurement/Modeling and Thermal Modeling

132

CASTLE Filesystem – User Code

• Client: <cpu-probe>
Includes cpu-monitor & cpu-network
Cpu-monitor:
  Provides the x-windows for power breakdown bar graphs <gtk and threads>
  Acquires power breakdowns from cpu-network
Cpu-network:
  Connects to server side through ethernet <sockets and threads>
  Gets event counts and number of elapsed cycles for each tracked event
  Constructs component power values from event data using heuristics

Client Code

Page 133: Runtime Power Measurement/Modeling and Thermal Modeling

133

CASTLE Filesystem – User Code

• Multimeter: <xmmeter>
Real multimeter reads the voltage over series R and sends it over RS232

Xmmeter reads the serial port and converts the voltage reading into power as:

P = (Vread / Rs) · Vdd

X-window displays the readings

Series Resistance

Xmultimeter server
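The xmmeter conversion P = (Vread / Rs) · Vdd is the shunt counterpart of the clamp formula: the voltage drop over the series resistor gives the current, which is multiplied by the supply voltage. In the sketch below the resistor value, supply voltage and reading are hypothetical; the slides do not give them.

```python
def shunt_power(v_read, r_series, vdd):
    """Power in watts from the voltage drop over a series (shunt) resistor."""
    if r_series <= 0:
        raise ValueError("series resistance must be positive")
    current = v_read / r_series   # Ohm's law over the shunt
    return current * vdd


# Hypothetical values: 0.2 V over a 0.1 ohm shunt on a 3.3 V supply.
print(shunt_power(v_read=0.2, r_series=0.1, vdd=3.3))
```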

Page 134: Runtime Power Measurement/Modeling and Thermal Modeling

134

CASTLE Filesystem – User Code

• Server: <probe-server>
Reads the performance counts every second with syscall “getglobaleventcount”, defined in kernel code

Acquires event counts and elapsed cycles for all events

Sends the event and cycle data to client as a stream of chars.

Server code

Page 135: Runtime Power Measurement/Modeling and Thermal Modeling

135

CASTLE Filesystem – Kernel Code

• Required to access counters
• Scattered in:
  /usr/src/linux/arch/i386/kernel/entry.S
  /usr/src/linux/include/linux/sched.h
  /usr/src/linux/kernel/fork.c
  /usr/src/linux/kernel/sched.c

• Defines 2 new system calls:
  Geteventcount
  Getglobaleventcount

• Accesses the counters, gets counter & cycle data
  Syscall returns the event and cycle counts to the server as a 2D array

Kernel Code

Page 136: Runtime Power Measurement/Modeling and Thermal Modeling

136

CASTLE Details
• In castle code, 12 distinct events are defined
• From [1] and [8], 10 of the events are used:
  instructions decoded
  instructions executed
  instructions retired
  floating point operations executed
  branches retired
  branches decoded
  L1 instruction cache accesses
  L1 data cache accesses
  L2 unified cache accesses
  main memory requests

• [1] and [8] suggest a 10 ms sampling period
• Probe-server samples counters every second

Page 137: Runtime Power Measurement/Modeling and Thermal Modeling

137

Power Breakdown Components
CASTLE tracks 12 events

Develops power breakdowns for 8 units:
  DECODE
  ISSUE
  REORDER
  DMEM
  IMEM
  FUNC_UNIT
  OTHER
  CLOCK

Component powers recomputed every second in CPU-network

Page 138: Runtime Power Measurement/Modeling and Thermal Modeling

138

Thermal Modeling with CASTLE

• Thermal model requires only power and sampling time information

Thermal model can be added at user level, by:
  extending cpu-network for temperature updates
  extending cpu-monitor for a new thermal x-window

• A pitfall resides in the sampling period
  Sampling time should be much smaller than the time constant for reliable modeling (<< 100 µs)

Page 139: Runtime Power Measurement/Modeling and Thermal Modeling

139

EOP