

POWER, THERMAL, AND RELIABILITY MODELING IN NANOMETER-SCALE MICROPROCESSORS

David Brooks, Harvard University
Robert P. Dick, Northwestern University
Russ Joseph, Northwestern University
Li Shang, Queen's University

Power is the source of the greatest problems facing microprocessor designers. Rapid power variation brings transient errors. High power densities bring high temperatures, harming reliability and increasing leakage power. The wages of power are bulky, short-lived batteries, huge heat sinks, large on-die capacitors, high server electric bills, and unreliable microprocessors. Optimizing power depends on accurate and efficient modeling that spans different disciplines and levels, from device physics, to numerical methods, to microarchitectural design.

System integration and performance requirements are dramatically increasing the power consumptions and power densities of high-performance microprocessors; some now consume 100 watts. High power consumption introduces challenges to various aspects of microprocessor and computer system design. It increases the cost of cooling and packaging design, reduces system reliability, complicates power supply circuitry design, and reduces battery life. Researchers have recently dedicated intensive effort to power-related design problems. Modeling is the essential first step toward design optimization. In this article, we explain the power, thermal, and reliability modeling problems and survey recent advances in their accurate and efficient analysis.

Implications of power consumption

Figure 1 illustrates the relationships among power consumption, temperature, reliability, and process variation. Shaded boxes indicate attributes that do not significantly influence others. Dashed boxes indicate the processes by which one attribute affects others. Dynamic and leakage power consumption are at the roots of many problems.

[Figure 1. Power and its implications.]

Rapid changes in dynamic power consumption (indicated by a solid box in Figure 1) can result in transient voltage fluctuations as a result of the distributed inductance and resistance in off-chip and on-chip microprocessor power delivery networks. These dI/dt effects, indicated by dashed boxes in Figure 1, can change combinational logic path delays, resulting in timing violations; this results in transient faults or requires decreasing the processor frequency. High dynamic power consumption implies high current density, resulting in accelerated wear and permanent faults due to processes such as electromigration. Dynamic power consumption produces heat, which increases microprocessor temperature. The effects of leakage power consumption are similar, with the exception of dI/dt-related problems. In general, leakage power variation is smaller than dynamic power variation. Both dynamic and leakage power consumption decrease battery life spans in portable devices and increase server electric bills.

Both dynamic and leakage power increase temperature. Temperature profiles depend on the temporal and spatial distribution of power as well as the microprocessor's cooling and packaging solutions. High temperature increases charge-carrier concentrations, resulting in increased subthreshold leakage power consumption. In addition, it decreases charge-carrier mobility, decreasing transistor and interconnect performance, and decreases threshold voltage, increasing transistor performance. Finally, temperature has a tremendous impact on the fault processes that lead to the majority of microprocessor permanent faults. Modeling and optimizing microprocessor thermal properties is thus essential for reliability, power consumption, and performance.

Process variation greatly influences the other power-related integrated circuit (IC) characteristics explained in this article. It influences transient fault rate via changes to critical timing paths; permanent fault rate via changes to numerous parameters such as wire and oxide dimensions; leakage power consumption via changes in dopant concentration; and dynamic power consumption.

As Figure 1 illustrates, the relationships among power, temperature, and reliability are complex. Controlling power, temperature, and reliability requires modeling and optimization at multiple design stages as well as runtime support.

Power modeling

Architectural power modeling before the register-transfer level (RTL) has gained prominence with the rising importance of energy-constrained portable devices and thermally constrained high-end machines. Understanding the power characteristics and ramifications of early-stage design decisions is essential, because architectural decisions can have a huge impact on overall power efficiency, and serious missteps made at the architecture level are often too difficult to correct at later stages in a design project.

Architectural power simulators are often tightly coupled with existing cycle-level architectural performance models. Microarchitectural utilization information, including bit-level activity in some cases, is combined with energy models for the power consumption of microarchitectural blocks under various usage scenarios (for example, power consumption with various amounts of activity after applying power- or clock-gating techniques). Collecting utilization information is not particularly challenging, so we focus on the more difficult task of developing power models for microarchitectural blocks. Most existing efforts in this area can be broadly characterized as either analytical or empirical techniques for both dynamic and leakage power consumption.

Analytical power models

We can construct analytical models for dynamic power dissipation by identifying a microarchitectural block's key internal nodes and deriving analytical formulas for the capacitance of these nodes. Microarchitects have applied this technique to RAM and CAM structures such as caches, register files, buffers, and queues,1-3 and to regular combinational logic such as decoders, select trees, and arbiters. (In contrast, empirical models are generally used to estimate the power of irregular combinational logic such as integer and floating-point ALUs.)

For example, most popular analytical models use the following equations to calculate SRAM array bitline energy:

$$E = C_{\mathrm{bitline}} \times V_{DD} \times V_{\mathrm{swing}} \qquad (1)$$

$$C_{\mathrm{bitline}} = \left( C_{\mathrm{diff\text{-}access}} \times N \right) + C_{\mathrm{diff\text{-}precharge}} + C_{\mathrm{diff\text{-}column}} + C_{\mathrm{wire}} \qquad (2)$$

Here, C_bitline is the total stacked bitline capacitance and V_swing is the voltage swing of the bitline. The total bitline capacitance includes the diffusion capacitance of the N stacked access transistors, precharge circuit capacitance, bitline wire capacitance, and column select circuit capacitance.
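As a minimal illustration of Equations 1 and 2, the following Python sketch computes bitline energy per access from per-component capacitances. The function name and the specific capacitance values are hypothetical placeholders for illustration, not figures from any real design.

```python
def bitline_energy(n_rows, c_diff_access, c_diff_precharge,
                   c_diff_column, c_wire, v_dd, v_swing):
    """Eq. 2: total stacked bitline capacitance; Eq. 1: dynamic energy."""
    c_bitline = (c_diff_access * n_rows   # N stacked access-transistor diffusions
                 + c_diff_precharge       # precharge circuit
                 + c_diff_column          # column-select circuit
                 + c_wire)                # bitline wire
    return c_bitline * v_dd * v_swing     # E = C_bitline * V_DD * V_swing

# Hypothetical example: 256-row array, femtofarad-scale capacitances.
e = bitline_energy(n_rows=256, c_diff_access=0.5e-15, c_diff_precharge=5e-15,
                   c_diff_column=10e-15, c_wire=50e-15, v_dd=1.2, v_swing=0.2)
print(f"bitline energy per access: {e:.3e} J")
```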

Analytical power models provide a fast, flexible method of estimating dynamic power consumption. However, the accuracy of these models suffers for several reasons:

- Inaccurate capacitance estimates. Node capacitance depends heavily on the actual design layout (including overlap and node-to-node coupling capacitances) and the supply voltage. Thus, it can be difficult to accurately estimate using simple analytical formulas.
- Direct-path current. Analytical models neglect direct-path current, which is difficult to estimate because it depends heavily on the circuit design. However, direct-path current can be appreciable in the precharge circuitry of certain memory designs.
- Memory cell toggle power. Many analytical models neglect the power to toggle memory cell bits when writing to memory. Although this power is usually small, it can be a fairly significant error source when there are periods of intensive writes and toggles to the memory bits.

Analytical power models for leakage power are also available, based on simple analytical formulas for subthreshold leakage and gate leakage.4 For example, a common way to estimate leakage current is through the following equation:

$$I_{\mathrm{leak}} = b \, e^{b \left( V_{DD} - V_{DD0} \right)} \, V_t^2 \left( 1 - e^{-V_{DD}/V_t} \right) e^{\frac{-\left| V_{TH} \right| - V_{\mathrm{off}}}{n V_t}} \qquad (3)$$

Here, V_t is the thermal voltage and V_TH is the threshold voltage. This leakage model is an approximation; the leakage of a circuit is highly dependent on circuit topology and circuit state. Leakage power is also closely related to temperature, requiring tight coupling between leakage-power and temperature models.
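The temperature dependence enters Equation 3 through the thermal voltage V_t = kT/q. The sketch below evaluates the equation in Python; the fitting constants (b, V_DD0, V_off, n) are hypothetical placeholders, not calibrated to any process.

```python
import math

K_OVER_Q = 8.617e-5  # Boltzmann constant / electron charge, in V/K

def leakage_current(t_kelvin, v_dd, v_th, b=2.0, v_dd0=1.0, v_off=-0.05, n=1.5):
    """Eq. 3: subthreshold leakage current (arbitrary units).

    All fitting constants here are illustrative placeholders.
    """
    v_t = K_OVER_Q * t_kelvin  # thermal voltage kT/q rises with temperature
    return (b * math.exp(b * (v_dd - v_dd0))
            * v_t ** 2
            * (1.0 - math.exp(-v_dd / v_t))
            * math.exp((-abs(v_th) - v_off) / (n * v_t)))

# Leakage rises superlinearly with temperature:
for t in (300, 330, 360):
    print(t, leakage_current(t, v_dd=1.0, v_th=0.3))
```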

Empirical power models

Empirical modeling approaches are based on circuit-level power analysis performed on structures from recently designed processors. SimplePower and PowerTimer use this approach.5,6 PowerTimer's energy models are based on circuit-level power analysis of all microarchitectural structures in the Power4 microprocessor. Macro-level circuit power analysis, where several large macros combine to form a microarchitectural structure, determines power for microarchitectural blocks as a function of the activities of individual blocks. Empirical modeling approaches then combine these power estimates with utilization information from performance simulators.

Empirical approaches are accurate in formulating models for design macros that tend to be reused in newer designs with relatively small changes. As such, they are quite useful for exploration of the design space of existing microarchitectures that are similar to the original analyzed design. However, the models rely on complete circuit simulation, which is generally slow. In addition, flexibility is a major problem, especially if a new process technology or a new microarchitecture is under investigation. Simple linear scaling of architectural parameters might not be accurate, and the use of newer technologies often brings even larger errors.

Microarchitects have widely used both analytical and empirical models to explore power-efficient architectures, and these models are a critical building block for temperature and reliability modeling and research. However, architects must understand the limitations of the modeling infrastructure before attempting a detailed quantitative evaluation. Later, we'll highlight some future directions in the area of power modeling that have the potential to address the accuracy and scalability concerns of existing modeling approaches.

Thermal modeling and analysis

IC thermal analysis requires the simulation of thermal conduction from an IC's power sources (that is, transistors and interconnects) through cooling package layers to the ambient environment. Modeling thermal conduction is analogous to modeling electrical conduction, with thermal conductance corresponding to electrical conductance, power dissipation corresponding to electrical current, heat capacity corresponding to electrical capacitance, and temperature corresponding to voltage.

The following equation governs thermal conduction in chips and cooling packages:

$$\rho c \frac{\partial T(\vec{r}, t)}{\partial t} = \nabla \cdot \left[ k(\vec{r}) \, \nabla T(\vec{r}, t) \right] + p(\vec{r}, t) \qquad (4)$$

where ρ is the material density, c is the mass heat capacity, T(r⃗, t) and k(r⃗) are the temperature and thermal conductivity of the material at position r⃗ and time t, and p(r⃗, t) is the power density of the heat source. To solve this differential equation using numerical methods, we apply a finite-difference discretization method to decompose the chip and package into numerous discrete thermal elements, which may be of nonuniform sizes and shapes. Adjacent thermal elements interact via heat diffusion. Each element has a power dissipation, temperature, thermal capacitance, and thermal resistance to adjacent elements. Thus,

$$C \frac{dT(t)}{dt} + A T(t) = P u(t) \qquad (5)$$

where thermal capacitance matrix C is an N × N diagonal matrix, thermal conductivity matrix A is an N × N sparse matrix, T(t) and P are N × 1 temperature and power vectors, and u(t) is the unit step function. In this tutorial, we assume that P is constant within an analysis interval but may vary between analysis intervals.

Steady-state thermal analysis

Steady-state thermal analysis characterizes temperature distribution when heat flow does not vary with time. For microprocessors, steady-state thermal analysis is sufficient for applications with stable power profiles or periodically changing power profiles that cycle quickly.

For steady-state thermal analysis, we drop the left term in Equation 5, which expresses temperature variation as a function of time t. Thus,

$$A T = P \;\Rightarrow\; T = A^{-1} P \qquad (6)$$

Therefore, given thermal conductivity matrix A and power vector P, the main task of steady-state thermal analysis is to invert matrix A. The computational complexity of steady-state thermal analysis is thus determined by the size of matrix A, which is in turn determined by the discretization granularity of the chip and the cooling package. Accurate IC thermal analysis requires fine-grained discretization, so the size of matrix A is typically large. Matrices may be inverted using direct or iterative methods. Matrix inversion using direct methods, such as LU decomposition, is sufficient for small matrices, such as those associated with models that represent functional units with single thermal elements.7 However, it is intractable for large matrices. Iterative methods permit the solution of larger problems. Recently, researchers have used multigrid relaxation for steady-state IC thermal analysis.8,9 Each relaxation stage is responsible for eliminating errors in the frequency band corresponding to the stage's discretization granularity. Although multigrid methods often permit rapid convergence for linear systems, our recent analysis shows no significant performance improvement from using multigrid relaxation for steady-state IC thermal analysis, compared to an otherwise identical iterative solver.
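In practice one never forms A⁻¹ explicitly; a sparse factorization or iterative solver is used instead. A minimal sketch with SciPy, assuming A has already been assembled from the element conductances and P from the per-element power dissipations (the toy values below are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

# Toy 3-element thermal network (illustrative values, in W/K).
# Off-diagonals are negated inter-element conductances; diagonals also
# include each element's conductance to the ambient.
A = csr_matrix(np.array([[ 1.5, -0.5,  0.0],
                         [-0.5,  1.2, -0.4],
                         [ 0.0, -0.4,  0.9]]))
P = np.array([2.0, 1.0, 0.5])  # power per element, in W

T = spsolve(A, P)              # solves A T = P (Eq. 6); T is rise above ambient
print(T)
```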

Dynamic thermal analysis

Dynamic thermal analysis is used to characterize the runtime IC thermal profile when transient variations are significant. Researchers have applied time-domain and frequency-domain methods to this problem.

Time-domain methods use numerical integration to estimate the runtime IC thermal profile. The targeted time interval is partitioned into several time steps. If the size of each time step is small enough, we can accurately estimate the IC thermal transition within each time step using the finite-difference temperature approximation function

$$T(t + h) = h \, C^{-1} \left[ P u(t) - A T(t) \right] + T(t) \qquad (7)$$

where h is the time step size. Numerical integration comprises a broad family of methods, which are typically classified on the basis of the approximation error bound. Well-known numerical integration methods include Euler's method, the trapezoidal method, Runge-Kutta methods, and Runge-Kutta-Fehlberg methods. Fourth-order Runge-Kutta methods are popular because of their accuracy and speed.
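A sketch of the Equation 7 update (forward Euler, chosen for clarity) follows in Python; it reuses the toy A and P above and adds a diagonal heat-capacity matrix C with illustrative values. A production simulator would more likely use an adaptive fourth-order Runge-Kutta scheme, but the structure is the same.

```python
import numpy as np

A = np.array([[ 1.5, -0.5,  0.0],
              [-0.5,  1.2, -0.4],
              [ 0.0, -0.4,  0.9]])       # thermal conductivity matrix, W/K
C_inv = 1.0 / np.array([2.0, 2.5, 1.5])  # inverse diagonal heat capacities, K/J
P = np.array([2.0, 1.0, 0.5])            # step power input, W

h = 1e-3                              # time step, s (must be small for stability)
T = np.zeros(3)                       # initial temperature rise above ambient
for _ in range(20_000):               # march 20 s of simulated time
    T = T + h * C_inv * (P - A @ T)   # Eq. 7: forward-Euler update
print(T)                              # approaches the steady-state solution of Eq. 6
```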

Frequency-domain methods estimate a runtime IC thermal profile using approximate analytical solutions, thereby avoiding repeated numerical integration. As Equation 8 shows, we use the Laplace transform to translate the dynamic thermal analysis problem into the frequency domain:

$$C \left[ s T(s) - T(0^-) \right] = A T(s) + P/s$$

$$T(s) = -A^{-1} \left( I - s C A^{-1} \right)^{-1} \left[ P/s + C \, T(0^-) \right] \qquad (8)$$

Expanding (I − sCA⁻¹)⁻¹ at s = 0:

$$T(s) = -A^{-1} \sum_{k=0}^{\infty} \left( s C A^{-1} \right)^k \left[ P/s + C \, T(0^-) \right] \qquad (9)$$

Thus, IC runtime temperature can be accurately represented as an infinite series in the frequency domain. Because the long-time-scale impact of high-frequency components is negligible, truncating Equation 9 and ignoring high-frequency components yields an analytical approximation.

Time-domain and frequency-domain methods are each best suited to different modeling scenarios. Time-domain methods rely on numerical integration. Their running times increase with increasing time scale. In addition, because approximation error accumulates, the accuracy of time-domain methods generally degrades as time scales increase. Therefore, time-domain methods are most suitable for short-time-scale thermal analysis, that is, up to a few tens of milliseconds. Frequency-domain methods, on the other hand, approximate an IC runtime thermal profile using analytical solutions, avoiding the need for numerical integration. However, the one-time computational cost of deriving the analytical approximation is high. Moreover, frequency-domain methods ignore high-frequency components, introducing short-time-scale errors. Therefore, frequency-domain methods are more appropriate for long-time-scale thermal analysis, that is, more than a few milliseconds.

Adaptive thermal analysis

The major challenges of numerical IC thermal analysis are high computational complexity and memory usage. High modeling accuracy requires fine-grained discretization, resulting in numerous grid elements. For dynamic thermal analysis using time-domain methods, higher modeling accuracy requires fine spatial and temporal discretization granularities, increasing the computational overhead and memory usage. For dynamic thermal analysis using frequency-domain methods, deriving analytical approximations involves computation- and memory-intensive numerical operations, such as the inversion and multiplication of large matrices. A high thermal element count may hinder or prevent the application of frequency-domain methods to model complicated IC thermal setups.

Recent progress on adaptive numerical modeling methods tackles the high computational complexity and memory usage of steady-state, time-domain, and frequency-domain thermal analysis.9

Spatial adaptation. Chip and package thermal gradients exhibit significant spatial variation because of the heterogeneity of thermal conductivity and heat capacity in different chip and cooling package materials, as well as variation in power profiles. The wide distribution of temperature differences across the chip and the cooling package motivates the design of a thermal-gradient-sensitive, adaptive spatial-discretization refinement technique that automatically adjusts the spatial-partitioning resolution to maximize thermal modeling efficiency and guarantee modeling accuracy. In this technique, temperature difference constraints govern the spatial discretization process. Iterative refinement takes place in a hierarchical fashion. Regions with high-magnitude gradients are recursively partitioned until elements are isothermal and the temperature profile converges. This method uses fine-grained discretization only when necessary for accuracy, thereby improving the efficiency of steady-state and time-domain dynamic thermal analysis, and enabling the use of frequency-domain methods in modeling complicated chips and packages.

Temporal adaptation. There are two categories of time-domain methods: nonadaptive and adaptive. Nonadaptive methods use the same time-step size throughout analysis. Therefore, their performance is bounded by the smallest time step required by any thermal element at any time. Temporally adaptive methods, in contrast, can improve performance substantially without loss of accuracy by adapting time-step sizes during dynamic analysis.

Temporally adaptive methods, in turn, fall into two categories: synchronous and asynchronous time-marching methods. In synchronous time-marching methods, such as adaptive fourth-order Runge-Kutta methods, all thermal elements advance in time in lockstep. However, before each step, these methods adjust the step size to the minimal size required for accuracy by any thermal element. In contrast, asynchronous time-marching methods permit different elements to advance forward in time at different rates.9 Asynchronous methods exploit temporal and spatial differences in temperature variation within the chip and cooling package. Each thermal element uses the largest safe step size for its position and time, instead of being forced to use the same step size as all other elements in the chip and package. By allowing elements to progress in time asynchronously, these methods can dramatically accelerate thermal analysis without losing accuracy.

Temperature-dependent leakage power analysis

Technology scaling is increasing the proportion of power consumption due to leakage; accurate leakage analysis is now important. IC leakage-power consumption is a strong function of temperature. Many researchers have used iterative temperature-dependent leakage analysis for accurate leakage estimation. This approach was developed based on the following observations. First, subthreshold leakage power consumption increases superlinearly with temperature. This has led to the incorrect assumption that highly accurate thermal modeling embedded within the power analysis flow is necessary to accurately determine temperature-dependent leakage-power consumption. Second, leakage variation per unit temperature change is less than 1; that is, dP_leakage/dT < 1. An iterative thermal-leakage flow can thus guarantee convergence and correctness. The major challenge of existing thermal-leakage analysis flows is high computational complexity, which significantly increases the running times of both IC thermal analysis and power analysis.

A recent study shows that highly efficient and accurate temperature-dependent leakage power analysis is possible using coarse-grained thermal modeling.10 Leakage power depends mainly on the IC thermal profile and circuit design style. Despite the nonlinear dependence of leakage power on temperature, within the operating temperature ranges of real ICs, using a linear leakage model for individual functional units results in less than a 1 percent error in leakage estimation and permits a more than four orders of magnitude reduction in analysis time relative to an approach relying on detailed, accurate thermal analysis.
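The convergence argument above (dP_leakage/dT < 1 makes the thermal-leakage loop a contraction) can be illustrated with a simple fixed-point iteration. The sketch below uses a hypothetical linearized leakage model P_leak(T) = P0 + g(T − T0) for one functional unit and a scalar thermal resistance R to ambient; all coefficients are made-up illustrative values.

```python
def converge_thermal_leakage(p_dyn, r_th, p0, g, t0=300.0, tol=1e-6):
    """Iterate T -> P_leak(T) -> T until the fixed point is reached.

    p_dyn : dynamic power (W); r_th : thermal resistance to ambient (K/W);
    p0, g : linearized leakage model P_leak = p0 + g*(T - t0).
    Converges when g * r_th < 1 (the dP/dT < 1 contraction condition).
    """
    t = t0
    while True:
        p_leak = p0 + g * (t - t0)             # leakage at current temperature
        t_next = t0 + r_th * (p_dyn + p_leak)  # thermal solve (scalar model here)
        if abs(t_next - t) < tol:
            return t_next, p_leak
        t = t_next

t, p_leak = converge_thermal_leakage(p_dyn=20.0, r_th=0.8, p0=3.0, g=0.2)
print(f"converged: T = {t:.2f} K, leakage = {p_leak:.2f} W")
```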

Power- and temperature-related microprocessor reliability

As Figure 1 shows, power and its by-products are now a great threat to the reliability of microprocessors. High power consumption translates to high temperatures: a recipe for permanent faults due to electromigration and other failure processes. Rapidly changing power consumption of components supplied by resistive and inductive power delivery networks results in transient voltage fluctuations and therefore transient faults. Soft errors are also becoming an increasingly important source of IC reliability problems.11 However, the focus of this article is power consumption, and most soft errors are orthogonal to power consumption.

Modeling permanent power- and temperature-related faults

For IC permanent faults, the major failure mechanisms include electromigration, thermal cycling, time-dependent dielectric breakdown, and stress migration.12,13

Electromigration refers to the gradual displacement of the atoms in metal wires caused by electrical current. It leads to voids and hillocks within metal wires that result in open- and short-circuit failures. The following equation gives the mean time to failure (MTTF) due to electromigration:

$$\mathrm{MTTF}_{\mathrm{EM}} = A_{\mathrm{EM}} \, J^{-n} \, e^{\frac{E_{a\mathrm{EM}}}{kT}} \qquad (10)$$

where A_EM is a constant determined by the physical characteristics of the metal interconnect, J is the current density, E_aEM is the activation energy of electromigration, n is an empirically determined constant, k is Boltzmann's constant, and T is the temperature.

Thermal cycling refers to IC fatigue failures caused by thermal mismatch deformation. In the chip and package, adjacent material layers such as copper and low-k dielectric have different coefficients of thermal expansion. As a result, runtime thermal variation causes fatigue deformation, leading to failures. The following equation gives the MTTF due to thermal cycling:

$$\mathrm{MTTF}_{\mathrm{TC}} = A_{\mathrm{TC}} \left( T_{\mathrm{average}} - T_{\mathrm{ambient}} \right)^{-q} \qquad (11)$$

where A_TC is a constant coefficient, T_average is the chip's average runtime temperature, T_ambient is the ambient temperature, and q is the Coffin-Manson exponent constant.

Time-dependent dielectric breakdown refers to deterioration of the gate dielectric layer. This effect is a strong function of temperature and is becoming increasingly prominent with the reduction of gate-oxide dielectric thickness and nonideal supply voltage reduction. The following equation gives the MTTF due to time-dependent dielectric breakdown:

$$\mathrm{MTTF}_{\mathrm{TDDB}} = A_{\mathrm{TDDB}} \left( \frac{1}{V} \right)^{(a - bT)} e^{\frac{A + B/T + CT}{kT}} \qquad (12)$$

where A_TDDB is a constant; V is the supply voltage; and a, b, A, B, and C are fitting parameters.

Stress migration refers to the mass transport of metal atoms in metal wires due to mechanical stress caused by thermal mismatch among metal and dielectric materials. The following equation gives the MTTF due to stress migration:

$$\mathrm{MTTF}_{\mathrm{SM}} = A_{\mathrm{SM}} \left| T_0 - T \right|^{-n} e^{\frac{E_{a\mathrm{SM}}}{kT}} \qquad (13)$$

where A_SM is a constant, T_0 is the metal deposition temperature during fabrication, T is the runtime temperature of the metal layer, n is an empirically determined constant, and E_aSM is the activation energy for stress migration.

Equations 10 through 13 indicate that IC fault processes are strongly influenced by temperature, and in many cases are exponentially dependent on it. As a result, it appears that detailed and accurate thermal modeling is necessary for reliability estimation. Some of these fault processes are also accelerated by increases in other parameters related to power, such as current density and voltage.
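To make the temperature sensitivity in Equations 10 through 13 concrete, the sketch below evaluates the electromigration and thermal-cycling MTTF models at two temperatures. The proportionality constants and exponents are arbitrary placeholders (real values are fit to accelerated-life test data), so only the ratios between the two temperatures are meaningful.

```python
import math

K_BOLTZMANN = 8.617e-5  # eV/K

def mttf_em(j, t, a_em=1.0, n=1.1, ea=0.9):
    """Eq. 10: electromigration MTTF (arbitrary time units)."""
    return a_em * j ** (-n) * math.exp(ea / (K_BOLTZMANN * t))

def mttf_tc(t_avg, t_ambient, a_tc=1.0, q=2.35):
    """Eq. 11: thermal-cycling MTTF (arbitrary time units)."""
    return a_tc * (t_avg - t_ambient) ** (-q)

for t in (330.0, 360.0):
    print(f"T={t:.0f} K  EM: {mttf_em(j=1e6, t=t):.3e}"
          f"  TC: {mttf_tc(t_avg=t, t_ambient=300.0):.3e}")
# With Ea ~ 0.9 eV, a 30 K rise cuts electromigration MTTF by more than 10x.
```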

The most dangerous fault processes in microprocessors accelerate as a result of wear, and this complicates reliability modeling and analysis. Several researchers have assumed that microprocessor fault processes are Poisson processes and have used exponential distributions to model them. Exponential distributions are mathematically convenient because they permit the addition of the rates of different fault processes operating on different components to determine the failure rate of the entire microprocessor. However, they do not model wear, which is generally required for accurate reliability modeling.

The log-normal distribution better models prominent microprocessor fault processes because it models the increase in failure rate with increasing wear. However, this property complicates system-level modeling. There is no straightforward method of deriving a closed-form expression for the failure rate of a microprocessor composed of numerous components for which log-normal fault process models are used. Therefore, some microprocessor reliability estimation work assumes exponential fault processes, which may be inaccurate; other work uses Monte Carlo simulation14 or techniques based on statistical curve fitting, each of which may be slow. At present, efficient and accurate system-level lifetime reliability estimation remains just out of reach, but research is progressing rapidly toward it.
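A minimal Monte Carlo sketch of the system-level estimation problem described above: each component's lifetime is drawn from a log-normal distribution, the system fails at the earliest component failure (a series-system assumption), and the system MTTF is the sample mean. The per-component parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-component log-normal lifetime parameters (mu, sigma of ln T).
components = [(3.0, 0.5),   # e.g., an electromigration-limited structure
              (3.4, 0.7),   # e.g., a TDDB-limited structure
              (3.8, 0.6)]   # e.g., a thermal-cycling-limited structure

N = 100_000                                   # Monte Carlo trials
lifetimes = np.column_stack([rng.lognormal(mu, sigma, N)
                             for mu, sigma in components])
system_life = lifetimes.min(axis=1)           # series system: first failure kills it
print(f"estimated system MTTF: {system_life.mean():.2f} (arbitrary units)")
```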

Modeling transient power-related faults

In recent years, there has been increased interest in microarchitectural support for an increasingly important reliability concern: power supply noise. The circuits on high-performance chips place stringent demands on the power delivery infrastructure responsible for satisfying current demands and maintaining reference voltages. Power supply integrity is essential because even minor deviations in power or ground reference levels can introduce noise or delay into critical signal transitions, leading to an unrecoverable error. Noise reduction is difficult due to parasitic inductance in the power delivery system. Whenever the processor's current demands vary, the transient load causes voltage fluctuations given by the well-known equation V = L(dI/dt), where L is the inductance and dI/dt is the rate of change of the current. This relationship gives inductive noise its informal name, the dI/dt problem.

Interest in microarchitectural support has grown because of the increasing difficulty of mitigating noise through conventional avenues.15-17 Traditionally, microarchitects have addressed inductive noise by reducing effective inductance or by adding capacitance on die or in the package to dampen out the noise. However, the voltage noise tolerance of future processors will decrease at a rate that outstrips our ability to remove the parasitics or add capacitance. In particular, current processor designs use considerable on-die capacitance; they might devote as much as 15 to 20 percent of die area to decoupling capacitors. Although on-die capacitors are very effective at dampening inductive noise, they consume leakage power and die area because they are implemented as nonswitching transistors with large gate capacitances. Furthermore, in future designs, absolute voltage tolerance will decrease as the supply voltage scales, and load transients will increase as total power increases. Consequently, the rate of change of load current (dI/dt) will increase, and the allowable variation in supply voltage will shrink.

Microarchitectural techniques seek to limit inductive noise by controlling the rate at which load current changes. They do this essentially by smoothing out or altering the processor's current profile to eliminate problematic load transients. These techniques can reduce the burden on traditional circuit and package solutions for inductive noise.

Impact of frequency. The severity of inductive noise depends not only on the magnitude of transient current, but also on the frequency range over which it changes. Conventional power supply systems consist of a complex network of die, package, and motherboard capacitances, inductances, and resistances. Specific current variation patterns excite these elements to different degrees. Mid-frequency noise in the range of 50 to 200 MHz and high-frequency noise near the processor clock rate have received the most attention in the literature.

Mid-frequency noise is associated with package inductance that reacts with die and package capacitors whenever the processor current varies within the problematic frequency range. Variations at these critical frequencies are problematic because they allow resonance to build in the power delivery network, producing violent swings in supply voltage. High-frequency noise produces sharp, localized fluctuations in on-chip supply voltage. These high-frequency noise patterns arise when processor execution resources have abrupt changes in current demand that cannot be adequately serviced by neighboring on-chip or in-package capacitors and solder-bump connections to the off-chip network. We focus here on mid-frequency noise because it provides the greatest opportunity for architectural solutions.

Mid-frequency models. Researchers in the electronic packaging community have characterized the overall response of the power delivery system as a second-order linear system,18 which we can essentially represent as a single lumped resistance-inductance-capacitance (RLC) underdamped network.19 In this simple model, the R, L, and C values do not correspond to any specific physical elements in the real system; rather, they are effective values that capture the composite effects of all elements in the network.

The dominant characteristic of mid-frequency models is resonance. The most problematic load profile is a pulse pattern at the natural frequency of the RLC network, namely $f_c = 1 / \left( 2\pi \sqrt{LC} \right)$. In current processors, this translates to alternating patterns of on/off activity on the order of many tens to about one hundred processor cycles. Figure 2 illustrates this phenomenon, showing the impact of a nonresonant current load and a resonant current pulse train, which has a far more significant noise impact. Architecture-level phenomena such as cache misses and variations in instruction-level parallelism (ILP) can produce bursty current behavior at this frequency range if they are sequenced in an unfortunate manner.

[Figure 2. Noise impact of different current consumption patterns: (a) an execution sequence from 164.gzip, which does not hit a resonant frequency; (b) a trace from a dI/dt microbenchmark, which features severe ILP variation and mimics a resonant pulse.]

Microarchitectural simulators with support for cycle-by-cycle power estimates can model the impact of mid-frequency inductive noise. In essence, the processor's total current load can be assumed to be its instantaneous power divided by the ideal supply voltage. In an actual circuit, a real supply voltage droop would cause current draw to decrease, making division by a constant value a conservative estimate. However, dividing by the varying voltage would be unnecessarily pessimistic. This instantaneous current can be related to varying supply voltage through filter techniques that use convolution to relate a history of current consumption to voltage,15,16 or through circuit simulation that directly models the RLC network.20 In either case, models can be extended to capture higher-order power delivery system effects through use of a more complex convolution filter or a detailed circuit model.
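A sketch of the direct circuit-simulation approach under the lumped-model assumption: the load current drives a parallel RLC tank, and the simulated node voltage gives the supply droop. The R, L, and C values are illustrative effective values chosen to put resonance near 100 MHz, not measurements of any real package; a resonant square-wave load excites far larger droop than a steady one.

```python
import numpy as np

# Effective lumped PDN model (illustrative values, resonance ~100 MHz):
L, C, R = 25e-12, 100e-9, 0.05  # H, F, ohms (R damps the tank)
f_res = 1 / (2 * np.pi * np.sqrt(L * C))
print(f"resonant frequency: {f_res / 1e6:.0f} MHz")

dt = 1e-11                       # 10 ps step
steps = 20_000                   # 200 ns of simulated time
t = np.arange(steps) * dt

# Load current: 10 A baseline plus a 10 A square wave at the resonant frequency.
i_load = 10.0 + 10.0 * (np.sin(2 * np.pi * f_res * t) > 0)

v, i_l = 0.0, 0.0                # droop across the tank, inductor current
droop = np.empty(steps)
for k in range(steps):           # forward-Euler integration of the tank ODEs
    dv = (i_load[k] - v / R - i_l) / C  # KCL at the supply node
    di = v / L                           # inductor branch
    v += dv * dt
    i_l += di * dt
    droop[k] = v
print(f"peak droop: {droop.max() * 1e3:.1f} mV")
```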

Modeling process variation

Parameter variation is an unavoidable consequence of continued technology scaling. The impact of random and systematic variation on physical factors such as gate length and interconnect spacing will have a profound impact on performance and power consumption. In current designs, foundry-induced physical deviations already produce significant die-to-die variation. In particular, industry data for a high-performance processor in 180-nm technology shows that individual dies produced with the same fabrication equipment can have as much as a 30 percent die-to-die frequency variation and a 20× leakage power variation.21 The International Technology Roadmap for Semiconductors (http://www.itrs.net) predicts that manufacturing variability will have an increasing prominence in future designs.

Current architecture-level models for process variation have focused on deviations in effective threshold voltage (V_TH) and attempt to relate physical uncertainty in this highly influential parameter to performance and power. This involves combining analytical and empirical transistor models, circuit structure characteristics, and statistical principles that relate physical uncertainty to architecturally visible characteristics.

Variation under statistical distributions

Physical variation at the transistor level can be modeled according to statistical relationships. Researchers and microarchitects often use a combination of analytical models and Monte Carlo simulations to give modeled transistor parameters the appropriate statistical distributions. Dopant ion concentration is normally modeled as a Gaussian distribution. Under this model, there is no correlation between any pair of transistors in the design. On the other hand, gate length has strong spatial-correlation properties.22 Because of localized imperfections in lithography, neighboring transistors are likely to have similar gate-length deviations. As the distance between transistors increases, correlation decreases linearly.22

This is most frequently analyzed using Monte Carlo simulations, as sketched below. The die is modeled as a collection of grid sections, and all transistors within a section are assumed to have identical parameter variation. Researchers have used several methods, including hierarchical methods23 and convolution kernels,24 to assign parameter deviations to neighboring blocks. An alternative to using Monte Carlo simulations is to model systematic components of leakage variation through an empirically derived gate-length deviation model.25 This approach is useful for average-case studies but cannot capture probabilistic aspects of leakage variation.
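A minimal Monte Carlo sketch of the grid-based approach under stated assumptions: uncorrelated Gaussian dopant variation per grid section, plus a spatially correlated gate-length field produced by smoothing white noise with a convolution kernel (one of the assignment methods cited above). The grid size, kernel, and sigmas are arbitrary.

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(1)
GRID = 16                       # die modeled as GRID x GRID sections

def sample_die():
    # Random component: dopant concentration, i.i.d. Gaussian per section.
    dopant = rng.normal(0.0, 1.0, (GRID, GRID))
    # Systematic component: gate length, white noise smoothed by a
    # convolution kernel so that neighboring sections are correlated.
    gate_len = uniform_filter(rng.normal(0.0, 1.0, (GRID, GRID)), size=5)
    return dopant, gate_len

# Monte Carlo over many dies; spatial correlation appears only in the
# smoothed gate-length field, and it decays with distance.
gl = np.stack([sample_die()[1] for _ in range(1000)])
print("corr(adjacent sections):", np.corrcoef(gl[:, 0, 0], gl[:, 0, 1])[0, 1])
print("corr(far sections):     ", np.corrcoef(gl[:, 0, 0], gl[:, 0, 15])[0, 1])
```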

Relating V_TH variation to microarchitecture

Performance and power variations at the transistor level are dominated by effective threshold voltage, which is in turn determined by drawn gate length and dopant concentration.26 Process variations in either of these physical factors will cause deviations in V_TH and lead to conflicting impacts on performance and leakage power.21 Lower effective threshold voltages caused by short gate lengths or over-doping decrease delay in a roughly linear fashion but lead to an exponential increase in subthreshold leakage current. Deviations that increase V_TH reduce leakage with the penalty of increased delay. Although V_TH variation can also affect dynamic power, its impact on leakage power dominates and has been the primary focus of power models for parameter variation.

At the gate level, we can model the effects of varying threshold voltage analytically or empirically. Analytical models use device models to relate physical variation in critical parameters, namely gate length and dopant ion concentration, to effective threshold voltage. Given an effective threshold voltage, we can compute the leakage current using Equation 3.

Likewise, we can use the well-known alpha-power law to model the effect of threshold voltage on transistor delay:

$$t_d \propto \frac{V_{DD}}{\left( V_{DD} - V_{TH} \right)^{\alpha}} \qquad (14)$$

Together, these equations govern the relative impact of process variation at the transistor level and form the basis for analytical models. For leakage, existing architectural models provide baseline leakage power on a per-component basis, and the analytical models project how leakage scales. For performance, the impact of V_TH variation can also be related to architectural structure. Within each pipeline stage, the number of critical paths determines the minimum frequency of the stage.27 In essence, a large number of critical paths increases the probability that one of the paths will be slow and hence decreases the clock frequency. Architectural models for critical-path count can be derived from circuit structure.25 In particular, we can use hardware description language (HDL) descriptions of processor designs to count critical paths in each stage.28 SPICE-based empirical models offer another alternative: detailed simulation of either complete circuits or representative sections can determine the impact on power and performance.
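The sketch below ties Equations 3 and 14 to the critical-path argument: it samples per-path V_TH deviations, takes each pipeline stage's delay as the maximum over its critical paths (so more paths means a slower expected stage), and scales leakage exponentially with −ΔV_TH, as in the exponent of Equation 3. All distributions and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
ALPHA = 1.3            # alpha-power-law exponent (typical short-channel value)
V_DD, V_TH0 = 1.0, 0.3
N_VT = 1.5 * 0.0259    # n * thermal voltage at ~300 K, for leakage scaling

def stage_stats(n_paths, sigma_vth=0.02, trials=2_000):
    """Sample stage delay (max over critical paths) and mean leakage scale."""
    dvth = rng.normal(0.0, sigma_vth, (trials, n_paths))
    # Eq. 14: per-path delay relative to nominal.
    delay = (V_DD - V_TH0) ** ALPHA / (V_DD - (V_TH0 + dvth)) ** ALPHA
    # Eq. 3 exponent term: leakage scales as exp(-dVth / (n * Vt)).
    leak = np.exp(-dvth / N_VT)
    return delay.max(axis=1).mean(), leak.mean()

for n_paths in (10, 100, 1000):
    d, l = stage_stats(n_paths)
    print(f"{n_paths:5d} paths: mean worst-path delay {d:.3f}x, leakage {l:.3f}x")
# More critical paths -> slower expected worst path -> lower stage frequency.
```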

Continued technology scaling and emerging directions in design will change the problems of power, thermal, reliability, and process-variation modeling. In the near future, several problems are likely to arise in these modeling domains.

In the power-modeling realm, the move to nanoscale technologies will break down traditional circuit-scaling theory, exacerbating the difficulties associated with analytical dynamic- and leakage-power modeling. Furthermore, asymmetric and heterogeneous chip-level multiprocessors (CMPs) will lead to an increased diversity in microarchitectures and accelerator cores, limiting the utility of empirical modeling approaches. As more designers explore many-core and heterogeneous core systems, power-modeling tools will need to incorporate better estimates for interconnect and combinational logic structures. To address these challenges, future power-modeling approaches might rely on a mixture of analytical and empirical modeling techniques, to leverage the advantages of both.

Increasing IC integration and power density will bring new thermal-modeling challenges. Moreover, the changes that new fabrication processes, materials, and cooling solutions bring to thermal models will require advances in analysis to permit efficiency and speed. For instance, the thermal properties of stacked 3D ICs will differ greatly from those of conventional 2D ICs, and high power density remains one of the major 3D integration challenges. Increasing microprocessor power density will require novel cooling solutions such as microchannel cooling or nanotube thermal vias, requiring multiresolution modeling down to the micrometer or nanometer scale.

Most system-level reliability models are derived using statistical techniques based on empirical models. These statistical approximations can benefit greatly from calibration and validation. Technology scaling further increases the importance and challenges of reliability modeling. Reduced feature size will increase the vulnerability of individual devices and interconnects. This will require models that capture the system-level impact of nanoscale components while permitting efficient analysis. However, increasing integration density complicates the development of accurate microarchitectural reliability models. Future CMP architectures also present additional opportunities for per-core power-down and dynamic voltage and frequency scaling, leading to additional dI/dt noise challenges.

To effectively model process variations, architects will need to improve the accessibility of their models and adjust them to keep pace with manufacturing trends. In addition, architects will have to develop unified models that better capture couplings between the power, performance, and reliability aspects of variation. Although Monte Carlo variation analysis is relatively easy to apply for power-performance studies, it does not offer the same intuition and insights that analytic models can. Mathematical models that are easily parameterizable but can capture probabilistic characteristics such as mean, variance, and skew could be extremely beneficial.

Power-related design challenges have become a critically important topic for computer architects and system designers. Microarchitectural design requires detailed models of power-related phenomena. As advanced design techniques and fabrication process changes reveal new power-related phenomena, power, thermal, reliability, and process-variation models will require ongoing improvement. MICRO

Acknowledgments

This work was supported in part by the US National Science Foundation under awards CCF-0048313 and CNS-0347941, in part by Intel, in part by IBM, and in part by the Natural Sciences and Engineering Research Council of Canada under Discovery Grant 388694-01.

References

1. P. Shivakumar and N.P. Jouppi, CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, tech. report, Western Research Lab, 2001.
2. D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," Proc. Int'l Symp. Computer Architecture (ISCA 00), IEEE CS Press, 2000, pp. 83-94.
3. N.S. Kim et al., "Microarchitectural Power Modeling Techniques for Deep Sub-Micron Microprocessors," Proc. Int'l Symp. Low-Power Electronics and Design (ISLPED 04), IEEE CS Press, 2004, pp. 212-217.
4. J.A. Butts and G.S. Sohi, "A Static Power Model for Architects," Proc. Int'l Symp. Microarchitecture (MICRO 00), IEEE CS Press, 2000, pp. 191-201.
5. W. Ye et al., "The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool," Proc. 37th Design Automation Conf. (DAC 00), ACM Press, 2000, pp. 340-345.
6. D. Brooks et al., "New Methodology for Early-Stage, Microarchitecture-Level Power-Performance Analysis of Microprocessors," IBM J. Research and Development, vol. 47, no. 5-6, 2003, pp. 653-670.
7. K. Skadron et al., "Temperature-Aware Microarchitecture," Proc. Int'l Symp. Computer Architecture (ISCA 03), IEEE CS Press, 2003, pp. 2-13.
8. P. Li et al., "Efficient Full-Chip Thermal Modeling and Analysis," Proc. Int'l Conf. Computer-Aided Design (ICCAD 04), IEEE CS Press, 2004, pp. 319-326.
9. Y. Yang et al., "Adaptive Multi-Domain Thermal Modeling and Analysis for Integrated Circuit Synthesis and Design," Proc. Int'l Conf. Computer-Aided Design (ICCAD 06), IEEE CS Press, 2006, pp. 575-582.
10. Y. Liu et al., "Accurate Temperature-Dependent Integrated Circuit Leakage Power Estimation Is Easy," Proc. Design Automation and Test in Europe Conf. (DATE 07), IEEE CS Press, 2007, pp. 204-209.
11. S.S. Mukherjee et al., "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," Proc. Int'l Symp. Microarchitecture (MICRO 03), IEEE CS Press, 2003, pp. 29-40.
12. Failure Mechanisms and Models for Semiconductor Devices, publication JEP 122-B, JEDEC, 2003.
13. J. Srinivasan et al., "Exploiting Structural Duplication for Lifetime Reliability Enhancement," Proc. Int'l Symp. Computer Architecture (ISCA 05), IEEE CS Press, 2005, pp. 520-531.
14. J. Srinivasan et al., "The Impact of Technology Scaling on Lifetime Reliability," Proc. Int'l Conf. Dependable Systems and Networks (DSN 04), IEEE Press, 2004, pp. 177-186.
15. W. El-Essawy and D. Albonesi, "Mitigating Inductive Noise in SMT Processors," Proc. Int'l Symp. Low-Power Electronics and Design (ISLPED 04), IEEE CS Press, 2004, pp. 332-337.
16. R. Joseph, D. Brooks, and M. Martonosi, "Control Techniques to Eliminate Voltage Emergencies in High Performance Processors," Proc. 9th Int'l Symp. High-Performance Computer Architecture (HPCA 03), IEEE CS Press, 2003, pp. 79-90.
17. M.D. Powell and T.N. Vijaykumar, "Exploiting Resonant Behavior to Reduce Inductive Noise," Proc. 31st Int'l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 288-299.
18. D.J. Herrell and B. Beker, "Modeling of Power Distribution Systems for High-Performance Microprocessors," IEEE Trans. Advanced Packaging, vol. 22, no. 3, Aug. 1999, pp. 240-248.
19. I. Kantorovich et al., "Measurement of Low Impedance On Chip Power Supply Loop," IEEE Trans. Advanced Packaging, vol. 27, no. 1, Feb. 2004, pp. 10-14.
20. M.D. Powell and T.N. Vijaykumar, "Pipeline Muffling and A Priori Current Ramping: Architectural Techniques to Reduce High-Frequency Inductive Noise," Proc. Int'l Symp. Low-Power Electronics and Design (ISLPED 03), IEEE Press, 2003, pp. 223-228.
21. S. Borkar et al., "Parameter Variations and Impact on Circuits and Microarchitecture," Proc. 40th Design Automation Conf. (DAC 03), ACM Press, 2003, p. 338.
22. P. Friedberg et al., "Modeling Within-Die Spatial Correlation Effects for Process-Design Co-optimization," Proc. 6th Int'l Symp. Quality Electronic Design (ISQED 05), IEEE Press, 2005, pp. 516-521.
23. A. Agarwal et al., "Path-Based Statistical Timing Analysis Using Bounds and Selective Enumeration," Proc. Int'l Workshop Timing Issues in the Specification and Synthesis of Digital Systems (TAU 02), ACM Press, 2002, pp. 16-21.
24. K. Meng et al., "Modeling and Characterizing Power Variability in Multicore Architectures," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software (ISPASS 07), IEEE Press, 2007, pp. 146-153.
25. E. Humenay, D. Tarjan, and K. Skadron, "Impact of Parameter Variations on Multi-core Chips," Proc. Workshop Architectural Support for Gigascale Integration (ASGI 06), 2006, http://www.cs.virginia.edu/~skadron/Papers/PV_asgi06.pdf.
26. S.R. Nassif, "Modeling and Forecasting of Manufacturing Variations," Proc. 5th Int'l Workshop Statistical Metrology, IEEE Press, 2000, pp. 2-10.
27. K.A. Bowman, S.G. Duvall, and J.D. Meindl, "Impact of Die-to-Die and Within-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration," IEEE J. Solid-State Circuits, vol. 37, no. 2, Feb. 2002, pp. 183-190.
28. B.F. Romanescu, S. Ozev, and D.J. Sorin, "Quantifying the Impact of Process Variability on Microprocessor Behavior," Proc. Workshop Architectural Reliability (WAR 06), 2006, http://www.ee.duke.edu/~sorin/papers/war06_variability.pdf.

David Brooks is an associate professor of computer science at Harvard University. His research interests include power- and thermal-efficient hardware-software system design, and approaches to cope with reliability and variability in next-generation computing systems. He has a BS from the University of Southern California, and an MA and a PhD from Princeton University, all in electrical engineering.

Robert P. Dick is an assistant professor of electrical engineering and computer science at Northwestern University. His research interests span a broad area in computer engineering, with a recent focus on embedded systems and temperature-aware and low-power design. He has a BS in computer engineering from Clarkson University and a PhD in electrical engineering from Princeton University.

Russ Joseph is an assistant professor of electrical engineering and computer science at Northwestern University. His research interests focus primarily on power- and reliability-aware computer architecture. He has a BS in electrical and computer engineering, with an additional major in computer science, from Carnegie Mellon University and a PhD in electrical engineering from Princeton University.

Li Shang is an assistant professor in the Department of Electrical and Computer Engineering at Queen's University, Canada. His research interests include computer architecture, interconnection networks, design automation for embedded systems, and design for nanotechnologies. He has a BE from Tsinghua University and a PhD from Princeton University, both in electrical engineering.

Direct questions and comments about this article to Robert Dick, L477 Technological Institute, EECS Department, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208; [email protected].
