Self-calibrated Online Wearout Detection. Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke. University of Michigan, Electrical Engineering and Computer Science. MICRO-40, December 3, 2007.

Page 1

Self-calibrated Online Wearout Detection

Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke

University of Michigan, Electrical Engineering and Computer Science

MICRO-40, December 3, 2007

Page 2

Motivation

“Designing Reliable Systems from Unreliable Components…”

- Shekhar Borkar (Intel)

[Srinivasan, DSN‘04] [Borkar, MICRO‘05]

More failures to come

Failures will be wearout induced

Page 3

Current Approaches

Traditional: design margins, burn-in

Detection: based on replication of computation

• TMR (Tandem/HP NonStop servers)

• DIVA (Bower, MICRO’05)

Prediction: utilizes precise analytical models and/or sensors

• Canary circuits (Sentinel Silicon, Ridgetop)

• RAMP (Srinivasan, UIUC/IBM)

Callouts on the slide: Costly, Static, Impractical

Page 4

Wearout Mechanisms

Many failure mechanisms have been shown to be progressive

• Hot carrier injection (HCI)

• Electromigration (EM)

• Oxide breakdown (OBD)

• Negative bias temperature instability (NBTI)

[Figure: MOSFET cross-section with gate (G), source (S), drain (D), body (B), N+ regions, P-well, and oxide, annotated with the gate leakage currents Igs, Igd, Igb, Igcs, and Igcd]

Page 5

Objective

Propose a failure prediction technique that exploits the progressive nature of wearout

Monitor impact on path delays

Prediction

• Monitors evolution of wearout

• Proactive

• enables failure avoidance/mitigation

• Continuous feedback

• False negatives and positives

Detection

• Identifies existing fault

• Reactive

• enables failure recovery

• End-of-life feedback

• False negatives

Page 6

Oxide Breakdown (OBD)

Accumulation of defects leads to a conductive path

Percolation Model [Stathis, JAP‘06]

[Figure: MOSFET cross-section with gate (G), source (S), drain (D), body (B), N+ regions, P-well, and oxide, annotated with the change in oxide leakage current, ΔIoxide]

Page 7

OBD HSPICE Model

Post-breakdown leakage modeling [Rodriguez, Stathis, Linder, IRPS ‘03]

$I_{gd} = K \cdot I_{gd}^{0}$

$I_{gs} = K \cdot I_{gs}^{0}$

$I_{gcs}$, $I_{gcd}$, and $I_{gb}$ remain unchanged

[Figure: MOSFET cross-sections with gate (G), source (S), drain (D), body (B), N+ regions, and P-well, annotated with the gate leakage components Igs, Igd, Igb, Igcs, and Igcd] [BSIM4.6.0, ‘06]

Page 8

Characterization Testbench

90nm standard cell library

[Figure: testbench schematic. Labels: BUFX4, FO4, GATE, DC, Gate UUT, t_cell, t_circuit]

Page 9

Impact on Propagation Delay

Page 10

Delay Profiling Unit (DPU)

[Figure: DPU block diagram. Labels: uArch Module, input signal, Latency Sampling, with example 0/1 bit values]

Page 11

TRIX Analysis

Magnitude of divergence between TRIX_global and TRIX_local reflects the amount of degradation
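
The divergence idea on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' hardware: the function names, the two smoothing factors, and the percent threshold are hypothetical, and it assumes the local profile is a responsive triple-smoothed average while the global profile is a slow-moving one.

```python
def trix(samples, alpha):
    """Triple-smoothed exponential moving average of a latency sequence."""
    # Assumption: all three stages are seeded with the first sample.
    e1 = e2 = e3 = float(samples[0])
    out = []
    for x in samples:
        e1 += alpha * (x - e1)   # first smoothing stage
        e2 += alpha * (e1 - e2)  # second smoothing stage
        e3 += alpha * (e2 - e3)  # third smoothing stage
        out.append(e3)
    return out


def divergence_percent(samples, alpha_local=0.2, alpha_global=0.01):
    """Per-sample gap between local and global TRIX, as a percent of global."""
    # Hypothetical factors: a large alpha tracks recent behavior (local),
    # a small alpha tracks long-term behavior (global).
    local = trix(samples, alpha_local)
    global_ = trix(samples, alpha_global)
    return [100.0 * (l - g) / g for l, g in zip(local, global_)]


def first_flag(samples, threshold_percent=2.0, alpha_local=0.2, alpha_global=0.01):
    """Index of the first sample whose divergence exceeds the threshold, else None."""
    gaps = divergence_percent(samples, alpha_local, alpha_global)
    for i, gap in enumerate(gaps):
        if gap > threshold_percent:
            return i
    return None


if __name__ == "__main__":
    # Synthetic profile: stable latency, then a slow upward drift.
    aging = [100.0] * 100 + [100.0 + 0.1 * i for i in range(200)]
    print("first flagged sample:", first_flag(aging))
```

On a perfectly flat profile the two averages coincide, so `first_flag` returns `None`; the threshold trades earlier warnings against more false positives.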

Page 12

TRIX Analysis Details

Exponential Moving Average (EMA)

$\mathrm{EMA}(t) = \mathrm{EMA}_{t-1} + \alpha \cdot (\mathrm{price} - \mathrm{EMA}_{t-1})$

where $\alpha$ is defined by the window size

Triple-smoothed Exponential Moving Average

$\mathrm{EMA1}(t) = \mathrm{EMA1}_{t-1} + \alpha \cdot (\mathrm{price} - \mathrm{EMA1}_{t-1})$

$\mathrm{EMA2}(t) = \mathrm{EMA2}_{t-1} + \alpha \cdot (\mathrm{EMA1} - \mathrm{EMA2}_{t-1})$

$\mathrm{EMA3}(t) = \mathrm{EMA3}_{t-1} + \alpha \cdot (\mathrm{EMA2} - \mathrm{EMA3}_{t-1})$
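
A streaming form of the triple-smoothed average may make the recurrence easier to follow: each stage moves toward its input by a fraction alpha of the difference, and each stage feeds the next. The sketch below is illustrative only. The class name `Trix` is hypothetical, and the mapping alpha = 2 / (window + 1) is a common EMA convention assumed here, since the slide says only that alpha is defined by the window size.

```python
class Trix:
    """Streaming triple-smoothed exponential moving average (hypothetical helper)."""

    def __init__(self, window):
        # Assumption: alpha = 2 / (window + 1), a common EMA convention.
        self.alpha = 2.0 / (window + 1)
        self.ema1 = self.ema2 = self.ema3 = None

    def update(self, sample):
        """Feed one latency sample and return the current TRIX value."""
        if self.ema1 is None:
            # Assumption: seed every stage with the first sample.
            self.ema1 = self.ema2 = self.ema3 = float(sample)
        else:
            a = self.alpha
            self.ema1 += a * (sample - self.ema1)     # smooths the raw samples
            self.ema2 += a * (self.ema1 - self.ema2)  # smooths stage 1
            self.ema3 += a * (self.ema2 - self.ema3)  # smooths stage 2
        return self.ema3


if __name__ == "__main__":
    # Samples alternating around 100 stand in for a noisy latency profile.
    profile = Trix(window=9)
    smoothed = [profile.update(x) for x in [99.0, 101.0] * 50]
    print("last raw sample: 101.0, last smoothed value:", round(smoothed[-1], 3))
```

Keeping only three running values per monitored signal is what makes this form attractive for a small hardware or firmware unit: no history buffer is needed.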

Page 13

Noisy Latency Profile

[Figure: line plot of Percent Nominal Delay (%), axis range 94 to 110, versus Increasing Age. Series: Raw Latency Profile, Trix Profile (local), Trix Profile (global)]

Page 14

DPU with TRIX Hardware

[Figure: block diagram. Labels: input signal, Latency Sampling, TRIX_l Calculation, TRIX_g Calculation, Prediction, with example 0/1 bit values]

Page 15

Wearout Detection Unit (WDU)

[Figure: block diagram. Labels: Latency Sampling, TRIX_l Calculation, TRIX_g Calculation, Prediction]

Page 16

Evaluation Framework

[Figure: evaluation flow diagram with the following components]

• OR1200 Verilog

• 90nm Library

• Synthesis and Place and Route

• Fully Synthesized, P&R, OR1200 Core

• MediaBench Suite

• Timing, Power, and Temperature Simulations

• OBD Wearout Model

• HSPICE Simulations

• Monte Carlo Simulator

• Gate-level Processor Simulator

• Workload Simulator

• Wearout Simulator

Page 17

WDU Accuracy

[Figure: bar chart of Percentage (%), axis range 0 to 120, by Module (ALU, Register File, LSU, Next PC). Series: Life Expended, Signals Flagged]

Page 18

WDU Overhead

[Figure: chart of Percentage Overhead (%), axis range 0 to 50, versus # Signals Monitored (1, 2, 4, 8). Series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]

Page 19

WDU Overhead

[Figure: chart of Percentage Overhead (%), axis range 0 to 3, versus # Signals Monitored (1, 2, 4, 8). Series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]

Page 20

Long-term Vision

Introspective Reliability Management (IRM)

• Intelligent reliability management directed by on-chip sensor feedback

Prospective sensors

• Delay (WDU)

• Leakage/Vt

• Temperature

Page 21

Introspective Reliability Management

[Figure: IRM block diagram. Labels: WDU (five instances), Raw Sensor Data, Filtering and Analysis, Filtered Data Stream, Aggregate Analysis, Processed Data, Sensor Data, Runtime Analysis, Virtualization Layer, Reliability Assessment, OS, Scheduled Jobs, IRM Policy, Job Assignment, Thread Migration, Reconfiguration, Power/CLK Gating, DVFS Configuration/Settings]

Page 22

Conclusions

Many progressive wearout phenomena impact device-level performance.

It’s possible to characterize this impact and anticipate failures.

WDU performance

• Failure predicted within 20% of end of life (tunable)

• Area overhead < 3% (hybrid)

Low-level sensors can be used to enable intelligent reliability management.

Page 23

Questions?