Aging Test Strategy and Adaptive Test Scheduling for SoC Failure Prediction
Hyunbean Yi, Tomokazu Yoneda, Michiko Inoue, and Hideo Fujiwara, Nara Institute of Science and Technology (NAIST), Japan
Yasuo Sato and Seiji Kajihara, Kyushu Institute of Technology (KIT), Japan
Presenter: Hyunbean Yi
2/19
Outline
Motivation and Goal
Background
-. Failure Mechanisms
Related Work
Proposed Aging Test Strategy
-. Assumptions
-. Delay test/measurement scheme
-. Adaptive test scheduling
Conclusions
3/19
Motivation and Goal
Highly Reliable Systems
-. Satellites, Data/Web servers, Power plants …
-. Automobiles, Aircraft, Medical equipment …
System failure
-. Leads to a Catastrophe
-. Can occur due to Aging
Failure Prediction
-. On-Line self aging test
-. Alarm or Self-repair before Failure
4/19
Background – Failure Mechanisms (1/4)
Hot Carrier Injection (HCI)
-. In NMOS transistors
-. By strong electric field when Vd > 0
Negative Bias Temperature Instability (NBTI)
-. In PMOS transistors
-. Under high-temperature environment when Vg < 0
< Example of HCI > (figure: transistor cross-section; labels: Gate, Vg, Vd, Vs, n+ regions)
< Example of NBTI > (figure: transistor cross-section; labels: Gate, Vg, Vd, Vs, p+ regions, Gate Insulating Film (SiO2), H and Si)
⇒ Threshold Voltage (Vth) ↑ ⇒ Transconductance (gm) ↓ : Performance Degradation !!
5/19
Background – Failure Mechanisms (2/4)
Time Dependent Dielectric Breakdown (TDDB)
-. In Gate Oxide film
-. By charges trapped under long-time application of an electric field
-. Hopping or Tunneling
< Tunneling due to TDDB > (figure labels: Gate, Si)
⇒ Leakage Current ↑ ⇒ Breakdown !!
6/19
Background – Failure Mechanisms (3/4)
Electromigration (EM)
-. By metal atoms' migration due to Current Flow
Stress Migration (SM)
-. By metal atoms' migration due to Thermal Expansion
What is Migration?
< Polycrystalline Structure > (figure: Before Migration and After Migration; legend: Metal Atom, Defect)
(Source: Sony Semi., “Quality and Reliability Handbook,” 2000)
⇒ Resistance ↑ ⇒ Disconnection or other defects!!
7/19
Background – Failure Mechanisms (4/4)
< Different Types of Delay Increase > (figure: Path Delay vs. Elapsed Time; curves labeled NBTI and HCI, Electromigration and Stress Migration, TDDB)
8/19
Related Work
Aging Sensor
-. Delay monitors on actual data paths [Nakura ISSCC’07], [Agarwal VTS’07]
On-Line Self Test Architecture
-. CASP (Concurrent Autonomous Chip Self-Test Using Stored Test Patterns) [Li DATE’08]
-. DART (Degrade factor, Accuracy, Report, and Test coverage) [Sato ASICON’09]
Aging Test Scheduling
-. Operating system scheduling [Li ICCAD’09]
Others …
-. Aging path selection [Baba VTS’09], [Noda ETS’10]
-. Aging adaptive system [Khan DATE’09]
9/19
Aging Test Strategy
Degree-of-aging-based test scheduling
-. More aged parts → tested more often
Power-on/off time for testing
Various types of delay degradation
-. Detection and Prediction
Accurate delay measurement
-. Consider Volt. and Temp.
< Different Types of Delay Increase > (figure: Path Delay vs. Elapsed Time; curves labeled NBTI and HCI, Electromigration and Stress Migration, TDDB)
< Delay Variation according to Volt. and Temp. >
10/19
Architecture Overview (1/2)
To improve accuracy
-. Multiple Volt. and Temp. sensors
-. Thermal-Aware test patterns
To reduce pattern size
-. Aging Path selection
-. Compression techniques
< SoC Aging Test Architecture > (figure labels: SoC; CPU Core; Core; ROM or Nonvolatile Memory; External Memory; boot_enable; SoC Test Controller; Decompressor; Compactor; Tclk Gen.; Protocol Interface; k; Test Access Mechanism (TAM); Functional Interconnect (Bus or Network-on-chip (NoC)); Core Test Wrapper; Core Test Controller; Core Logic; VT Sensors; clk; from/to TAM; from/to Functional Interconnect)
11/19
Architecture Overview (2/2)
SoC Test Controller
-. on/off test mode
-. Scheduling: Select next CUT
-. Transfer test patterns
-. Aging analysis
Core Test Controller
-. Triggered by SoC test controller
-. Test Configuration
-. on/off sensors
-. Transfer test results
< Test Interface between SoC and Core test controllers > (figure: SoC Test Controller and Core Test Controller exchanging the items below)
1. Capture Clk Timing info.
2. Test Pattern
3. Test Results (Pass or Fail and measured Volt. and Temp.)
12/19
Aging Test Flow
< Aging Test Flow > (flowchart; node labels as they appear on the slide, branch wiring not recoverable from the transcript)
-. Start
-. Select the TPS to be Applied
-. Delay Test/Measurement for Failure Prediction
-. Decisions (Yes/No): Warning? / More Log Needed? / Aging Detected? / Previous Prediction Canceled? / Available for Another Test Session?
-. Actions: Alarm / Assign a High Weight to the TPS / Assign a Lower Weight to the TPS / Calculate Remaining Test Time
-. End
Side notes on the flow
-. Detect Sudden delay.
-. Measure amount of Gradual delay (referring to V. and T.)
-. Calculate amount of aging
-. Log results
-. Compare Degree of Aging (DoA)
-. Assign TPSs to Danger List Tables considering DoA
13/19
Delay Test/Measurement (1/2)
Capture timing shift technique
-. Adjust Capture Timing: Launch-to-Capture Period (LCP)
-. On-Die Clock Shrink [Josephson ITC’01]
< Capture Timing Window > (figure: test clock with Launch and Capture edges; labels: LCPmin, LCPmax, Functional clock period, guard-band)
< Using LCP to test a Gradual Delay Increase > (figure: Path Delay vs. Elapsed Time; labels: LCPmin, LCPmax, Warning Point, Failure, test)
14/19
Delay Test/Measurement (2/2)
-. Detect Sudden Delay Increase
-. Measure Gradual Delay Increase: Find out minimum LCPtest
-. Log and Analysis: Volt. & Temp. to Delay mapping and translation
-. Report Analysis results
< Delay Test/Measurement Flow > (flowchart; steps: Test with LCPmax, Decide the LCPtest, Test with LCPtest, Decrease LCPtest, Increase LCPtest, Calculate Amount of Aging; reports: “Warning”, “More Log Needed”, “Aging Detected”, “Previous Prediction Canceled”; branch labels: P = Pass, F = Fail)
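The "find out minimum LCPtest" step can be illustrated with a minimal sketch. It is not the authors' flow: it assumes a discrete, ascending list of LCP levels, a monotonic pass/fail behavior (a longer LCP never fails where a shorter one passes), and a simple step-up/step-down search. The names `measure_min_passing_lcp`, `passes`, and the status strings are hypothetical, and treating a failure at LCPmax as a separate case is also an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple


def measure_min_passing_lcp(
    levels: Sequence[int],
    passes: Callable[[int], bool],
    start: int,
) -> Tuple[str, Optional[int]]:
    """Return (status, index of the shortest LCP level that still passes).

    levels -- LCP values in ascending order; levels[-1] plays the role of LCPmax.
    passes -- hypothetical callback that applies the delay test at a given LCP.
    start  -- index of the LCPtest used as the starting point (e.g. the last result).
    """
    top = len(levels) - 1
    # First test at LCPmax: a failure here cannot be measured any further.
    if not passes(levels[top]):
        return ("fail_at_lcp_max", None)

    i = min(max(start, 0), top)
    if passes(levels[i]):
        # Decrease LCPtest while the next shorter period still passes.
        while i > 0 and passes(levels[i - 1]):
            i -= 1
    else:
        # Increase LCPtest until the test passes (LCPmax is known to pass).
        while not passes(levels[i]):
            i += 1
    return ("measured", i)


# Hypothetical usage: 16 levels in picoseconds and a path delay of 1450 ps.
lcp_levels = list(range(1000, 2600, 100))
status, index = measure_min_passing_lcp(lcp_levels, lambda lcp: lcp >= 1450, start=10)
print(status, lcp_levels[index])
```

Logging the returned index together with the measured voltage and temperature would give the raw data for the "Log and Analysis" step; how that mapping is done is not specified on the slide.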
15/19
Comparison of Degree of Aging
< Examples of Delay Increases of PUTs > (figure: Path Delay vs. Elapsed Time for PUT A, PUT B, and PUT C; labels: $d_{worst}$, $d_A(0) = d_B(0)$, $d_C(0)$, $A_W$, $B_W$, $C_W$, test1, test2)
Using Predictive Model of NBTI effect [Bhardwaj, CICC’06]
-. $\Delta d_g$: Increase in the gate delay, $\Delta d_g = A \cdot \alpha^n \cdot t^n \cdot d_0$
-. $\Delta d_p$: Increase in the path delay, $\Delta d_p = S \cdot t^n$
-. $d_A(t)$: Measured delay of PUT A at time t, $d_A(t) = d_A(0) + S_A \cdot t^n$
⇒ PUT A and C at test1: $d_A(\text{test1}) = d_C(\text{test1})$, $d_A(0) < d_C(0)$, and $d_A(A_W) = d_C(C_W)$.
⇒ PUT A and B at test2: $d_A(\text{test2}) > d_B(\text{test2})$, $d_A(0) = d_B(0)$, and $d_A(A_W) = d_B(B_W)$.
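As a worked illustration of the path-delay model d(t) = d(0) + S · t^n, the sketch below estimates S from one measurement and extrapolates when a PUT would reach the worst-case delay. All numbers, including the exponent n = 0.25, are hypothetical, and reading A_W and C_W as the times at which each PUT reaches d_worst is an assumption. The helper names `aging_slope` and `time_to_reach` are not from the slides.

```python
def aging_slope(d0: float, d_t: float, t: float, n: float) -> float:
    """Solve d(t) = d(0) + S * t**n for S, given one measurement d_t at time t > 0."""
    return (d_t - d0) / t ** n


def time_to_reach(d0: float, s: float, d_worst: float, n: float) -> float:
    """Time at which d(0) + S * t**n reaches d_worst (requires s > 0)."""
    return ((d_worst - d0) / s) ** (1.0 / n)


N = 0.25          # hypothetical time exponent
D_WORST = 1.40    # hypothetical worst-case delay (arbitrary units)
T_TEST1 = 1000.0  # hypothetical time of test1

# PUT A and PUT C show the same delay at test1, but A started from a smaller d(0).
s_a = aging_slope(1.00, 1.20, T_TEST1, N)
s_c = aging_slope(1.10, 1.20, T_TEST1, N)

a_w = time_to_reach(1.00, s_a, D_WORST, N)
c_w = time_to_reach(1.10, s_c, D_WORST, N)
print(s_a > s_c, a_w < c_w)
```

Under these assumptions the PUT with the smaller initial delay has the larger slope and reaches d_worst first, which is the situation the test1 comparison describes: equal measured delay does not imply equal degree of aging.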
16/19
Adaptive Test Scheduling
< TPS Information Table >, < Scheduling Table >, < Danger List Tables > (figure: three tables; column labels visible on the slide: TPSn, Cn, TS, LCPmax, DF, Cnter; Danger List Tables for Level 1, Level 2, Level 3, … Level n, each with a counter; flag transitions 1→0 and 0→1; cell values not recoverable from the transcript)
-. Less Dangerous: Tested Less often
-. More Dangerous: Tested More often
( : Example of TPS movements)
Priority order of the next TPS to be applied
1. The remaining TPS in the DLT used in the previous test
2. The TPS in the higher-level DLT among the DLTs whose counter is full
3. In the scheduling table, the next TPS whose DF is ‘0’
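The three-step priority order can be sketched as a single selection function. This is an illustration under stated assumptions, not the presented implementation: the `DangerList` record, the `counter_full` threshold, the convention that a larger level number is the "higher" level, and the representation of the scheduling table as (TPS id, DF) pairs are all hypothetical. The meaning of DF beyond "a flag that is 0 or 1" is not given on the slide.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class DangerList(NamedTuple):
    level: int          # assumption: larger number = higher (more dangerous) level
    tps_ids: List[int]  # TPSs currently assigned to this DLT
    counter: int        # current counter value
    counter_full: int   # assumed value at which the counter counts as "full"


def select_next_tps(
    remaining_prev: Sequence[int],
    dlts: Sequence[DangerList],
    schedule: Sequence[Tuple[int, int]],
) -> Optional[int]:
    """Pick the next TPS id following the three-step priority order."""
    # 1. A TPS left over from the DLT used in the previous test.
    if remaining_prev:
        return remaining_prev[0]
    # 2. Among DLTs whose counter is full, take one from the highest level.
    full = [d for d in dlts if d.tps_ids and d.counter >= d.counter_full]
    if full:
        return max(full, key=lambda d: d.level).tps_ids[0]
    # 3. Otherwise, the next scheduling-table entry whose DF is 0.
    for tps_id, df in schedule:
        if df == 0:
            return tps_id
    return None


# Hypothetical usage with made-up table contents.
example_dlts = [
    DangerList(level=1, tps_ids=[4, 2], counter=3, counter_full=3),
    DangerList(level=2, tps_ids=[7], counter=5, counter_full=5),
    DangerList(level=3, tps_ids=[9], counter=0, counter_full=2),
]
example_schedule = [(1, 1), (2, 0), (3, 0)]
print(select_next_tps([], example_dlts, example_schedule))
```

The function only selects; updating counters, flags, and DLT membership after each test is left out because the slide does not specify those rules.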
17/19
Case Study – Environment Settings
Power-on time (= Power-off time): 10 ms
Scan Shift Clock Frequency: 75 MHz
Maximum Number of Scan Chains in a Core: 32
The Size of Aging Test Patterns Selected for an SoC: 1/4 of the total test patterns of the given SoC
Compression Ratio: 50x
Number of LCPtest Levels: 16
Number of Logged Data for a TPS: 10
Maximum size of a TPS: 32 x 25,000 bits (≒ 10 ms × 75 MHz / 10 / 3)
Log size for a TPS: 10 bytes
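The maximum TPS size quoted above can be checked with a few lines of arithmetic. The divisors 10 and 3 are taken from the slide as given; the slide does not say what they represent, so the sketch does not interpret them. Variable names are illustrative only.

```python
# Values taken from the environment settings above.
POWER_ON_TIME_MS = 10
SCAN_SHIFT_CLOCK_HZ = 75_000_000
MAX_SCAN_CHAINS = 32

# Scan shift cycles available in one power-on window: 10 ms at 75 MHz.
shift_cycles = SCAN_SHIFT_CLOCK_HZ * POWER_ON_TIME_MS // 1000

# The slide divides by 10 and by 3 (meaning not stated there).
bits_per_chain = shift_cycles // 10 // 3

max_tps_bits = MAX_SCAN_CHAINS * bits_per_chain
max_tps_bytes = max_tps_bits // 8

print(shift_cycles, bits_per_chain, max_tps_bits, max_tps_bytes)
# prints: 750000 25000 800000 100000
```

So the bound of 32 x 25,000 bits corresponds to 800,000 bits, or 100,000 bytes, per TPS.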
18/19
Case Study – Size of TP, Tables, and Log
ITC’02 SoC Bench.   # of cores   TP size (bytes)   # of TPSs   S&I size + DLT size (bytes)   Log size (bytes)
u226                9            23.5 K            29          72.5                          145
d281                8            2.3 K             12          25.5                          60
d695                10           0.4 K             10          22.5                          50
h953                8            0.7 K             8           15                            40
g1023               14           0.3 K             14          31.5                          70
f2126               4            3.1 K             3           4.5                           15
q12710              4            11.3 K            7           12.1                          35
p22810              28           4.1 K             29          76.1                          145
p34392              19           8.5 K             21          55.1                          105
p93791              32           15.7 K            29          76.1                          145
t512505             31           99.6 K            79          246.9                         395
a586710             7            2.3 M             1184        4.5 K                         5.8 K
Average size of test patterns: 206.2 Kbytes
Average size of tables: 437.2 bytes
Average log amount: 593.7 bytes
19/19
Conclusions
Aging Test Strategy
  Failure mechanisms
  Delay Test/Measurement
  -. Considering Volt. and Temp. for Accuracy
  Degree of aging based Test Scheduling
Future work
  Aging path selection and PUT grouping
  Sensor placement
  Self-repair technique