University of Michigan, Electrical Engineering and Computer Science
Self-calibrated Online Wearout Detection
Authors: Jason Blome
Shuguang Feng
Shantanu Gupta
Scott Mahlke
MICRO-40
December 3, 2007
Motivation
“Designing Reliable Systems from Unreliable Components…”
- Shekhar Borkar (Intel)
[Srinivasan, DSN‘04] [Borkar, MICRO‘05]
More failures to come
Failures will be wearout-induced
Current Approaches
Traditional
• Design margins
• Burn-in
Detection: based on replication of computation
• TMR (Tandem/HP NonStop servers)
• DIVA (Bower, MICRO’05)
Prediction: utilizes precise analytical models and/or sensors
• Canary circuits (Sentinel Silicon, RidgeTop)
• RAMP (Srinivasan, UIUC/IBM)
Drawbacks called out on the slide: costly, static, impractical
Wearout Mechanisms
Many failure mechanisms have been shown to be progressive
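• Hot carrier injection (HCI)
• Electromigration (EM)
• Oxide breakdown (OBD)
• Negative bias temperature instability (NBTI)
[Figure: MOSFET cross-section showing gate (G), source (S), drain (D), and body (B) terminals, the gate oxide, N+ regions, and the P-well, annotated with the currents Igs, Igd, Igb, Igcs, and Igcd]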
Objective
Propose a failure prediction technique that exploits the progressive nature of wearout
Monitor impact on path delays
Prediction
• Monitors evolution of wearout
• Proactive
• enables failure avoidance/mitigation
• Continuous feedback
• False negatives and positives
Detection
• Identifies existing fault
• Reactive
• enables failure recovery
• End-of-life feedback
• False negatives
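Oxide Breakdown (OBD)
Accumulation of defects leads to a conductive path
Percolation Model [Stathis, JAP‘06]
[Figure: sequence of gate (G) oxide cross-sections accumulating defects until a conductive path forms, with the resulting change in oxide current (ΔIoxide), shown on a MOSFET cross-section with G, S, D, and B terminals, N+ regions, and P-well]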
OBD HSPICE Model
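Post-breakdown leakage modeling
[Rodriguez, Stathis, Linder, IRPS ‘03]
$I_{gd} = K \cdot I_{gd}^{0}$
$I_{gs} = K \cdot I_{gs}^{0}$
$I_{gcs}$, $I_{gcd}$, and $I_{gb}$ remain unchanged
[BSIM4.6.0, ‘06]
[Figure: two MOSFET cross-sections with G, S, D, and B terminals, N+ regions, and P-well, annotated with the currents Igs, Igd, Igb, Igcs, and Igcd]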
Characterization Testbench
90nm standard cell library
[Figure: characterization testbench. Labels: DC source, BUFX4 buffers, FO4 loads, gate under test (UUT), and the measured delays t_cell and t_circuit]
Impact on Propagation Delay
Delay Profiling Unit (DPU)
[Figure: DPU block diagram. Labels: uArch module, input signal, latency sampling, with example columns of sampled 0/1 values]
TRIX Analysis
Magnitude of divergence between TRIX_global and TRIX_local reflects the amount of degradation
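The comparison between the local and global TRIX values can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' prediction logic: the function name `first_flagged_sample`, the use of a relative (fractional) divergence, the default threshold of 5%, and the sample values are all hypothetical. It also assumes that wearout lengthens delay, so the local trend rises above the global one.

```python
def first_flagged_sample(trix_local, trix_global, threshold=0.05):
    """Return the index of the first sample whose local TRIX value exceeds
    the global TRIX value by more than `threshold` (as a fraction of the
    global value), or None if that never happens.

    Assumption: both sequences are sampled at the same points in time.
    """
    for i, (local, glob) in enumerate(zip(trix_local, trix_global)):
        if glob == 0:
            continue  # relative divergence is undefined; skip this sample
        divergence = (local - glob) / glob
        if divergence > threshold:
            return i
    return None


if __name__ == "__main__":
    # Made-up trends, in percent of nominal delay, for illustration only.
    local_trend = [100.0, 100.5, 101.0, 103.0, 106.0, 110.0]
    global_trend = [100.0, 100.1, 100.3, 100.8, 101.5, 102.5]
    print(first_flagged_sample(local_trend, global_trend, threshold=0.05))
```

A lower threshold flags degradation earlier at the cost of more false positives, which is one way to read the "tunable" prediction point mentioned in the conclusions.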
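TRIX Analysis Details
Exponential Moving Average (EMA)
$\mathrm{EMA}(t) = \alpha\,(\mathrm{price}_t - \mathrm{EMA}_{t-1}) + \mathrm{EMA}_{t-1}$
where $\alpha$ is defined by the window size
Triple-smoothed Exponential Moving Average
$\mathrm{EMA}_1(t) = \alpha\,(\mathrm{price}_t - \mathrm{EMA}_{1,t-1}) + \mathrm{EMA}_{1,t-1}$
$\mathrm{EMA}_2(t) = \alpha\,(\mathrm{EMA}_{1,t} - \mathrm{EMA}_{2,t-1}) + \mathrm{EMA}_{2,t-1}$
$\mathrm{EMA}_3(t) = \alpha\,(\mathrm{EMA}_{2,t} - \mathrm{EMA}_{3,t-1}) + \mathrm{EMA}_{3,t-1}$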
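The triple-smoothed recurrence on this slide can be sketched in Python as follows. This is a minimal software illustration, not the authors' hardware: the names `ema` and `triple_ema` are hypothetical, `price` is taken to be a sampled latency, each stage is seeded with its first input, and the smoothing factor is assumed to follow the common convention alpha = 2 / (window + 1), which the slides do not state.

```python
def ema(samples, alpha):
    """One EMA pass: EMA(t) = alpha * (x_t - EMA_{t-1}) + EMA_{t-1}."""
    out = []
    prev = None
    for x in samples:
        # Assumption: the average is seeded with the first sample.
        prev = x if prev is None else alpha * (x - prev) + prev
        out.append(prev)
    return out


def triple_ema(samples, window):
    """Apply the EMA recurrence three times (EMA1 -> EMA2 -> EMA3)."""
    # Assumption: alpha = 2 / (window + 1); the slides only say that
    # alpha is defined by the window size.
    alpha = 2.0 / (window + 1)
    ema1 = ema(samples, alpha)
    ema2 = ema(ema1, alpha)
    ema3 = ema(ema2, alpha)
    return ema3


if __name__ == "__main__":
    # Synthetic latency samples (percent of nominal delay), made up here.
    latencies = [100, 102, 99, 101, 103, 100, 104, 102, 105, 103]
    for raw, smooth in zip(latencies, triple_ema(latencies, window=4)):
        print(f"{raw:6.1f} -> {smooth:7.3f}")
```

Under these assumptions, a small window gives a trend that follows recent samples closely, while a large window gives a slowly moving trend; the slides' local and global TRIX values would presumably differ in this way.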
Noisy Latency Profile
[Figure: percent nominal delay (%), axis range 94 to 110, versus increasing age. Series: raw latency profile, TRIX profile (local), TRIX profile (global)]
DPU with TRIX Hardware
[Figure: DPU extended with TRIX hardware. Labels: input signal, latency sampling, TRIX_l calculation, TRIX_g calculation, prediction]
Wearout Detection Unit (WDU)
[Figure: WDU block diagram. Labels: latency sampling, TRIX_l calculation, TRIX_g calculation, prediction]
Evaluation Framework
[Figure: evaluation flow built from the components below]
• OR1200 Verilog
• 90nm library
• Synthesis and place and route
• Fully synthesized, placed-and-routed OR1200 core
• MediaBench suite
• Timing, power, and temperature simulations
• OBD wearout model
• HSPICE simulations
• Monte Carlo simulator
• Gate-level processor simulator
• Workload simulator
• Wearout simulator
WDU Accuracy
[Figure: bar chart of percentage (%), axis range 0 to 120, per module (ALU, Register File, LSU, Next PC). Series: life expended, signals flagged]
WDU Overhead
[Figure: percentage overhead (%), axis range 0 to 50, versus number of signals monitored (1, 2, 4, 8). Series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]
WDU Overhead
[Figure: percentage overhead (%), axis range 0 to 3, versus number of signals monitored (1, 2, 4, 8). Series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]
Long-term Vision
Introspective Reliability Management (IRM)
• Intelligent reliability management directed by on-chip sensor feedback
Prospective sensors
• Delay (WDU)
• Leakage/Vt
• Temperature
Introspective Reliability Management
[Figure: IRM system diagram. Labels: WDU sensors; raw sensor data; filtering and analysis; filtered data stream; aggregate analysis; processed data; runtime analysis; virtualization layer; reliability assessment; IRM policy; OS; scheduled jobs]
Actions shown in the diagram:
• Job assignment
• Thread migration
• Reconfiguration
• Power/CLK gating
• DVFS configuration
Conclusions
Many progressive wearout phenomena impact device-level performance.
It’s possible to characterize this impact and anticipate failures
WDU performance
• Failure predicted within 20% of end of life (tunable)
• Area overhead < 3% (hybrid)
Low-level sensors can be used to enable intelligent reliability management
Questions?