microprocessor reliability
Microprocessor Reliability. Robert Pawlowski, ECE 570, 2/19/2013. Reliability involves different aspects of a processor that can affect performance and functionality, and that can ultimately reduce the lifetime of the processor. Issues typically manifest themselves at the device level.
1
Microprocessor Reliability
Robert Pawlowski
ECE 570 – 2/19/2013
2
Reliability
• Involves different aspects of a processor that can affect performance and functionality.
  – Ultimately can reduce the lifetime of the processor.
• Issues typically manifest themselves at the device level.
  – Solutions can be implemented at multiple design levels.
3
Why the concern?
• Operating at highest frequencies and/or lowest power possible increases sensitivity to process-related variabilities.
  – Gate length/doping concentration variations
  – Temperature
  – Supply voltage droops
• This decreases processor yield
• Decreasing device sizes → Increased effect of external issues
4
Outline
• Error Classification
• Hard Errors
• Soft Errors
• Sources of radiation
• Device/Circuit approaches
• Architectural approaches
• Error detection
• Error correction
• System level impact
5
Processor Error Classification
• Hard Errors will result in permanent processor failure.
• Processor lifetime is inversely proportional to hard error rate.
• Soft Errors do not permanently damage the device.
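The inverse relationship between lifetime and hard error rate can be sketched as follows. This is a minimal illustration that assumes a constant failure rate, so that mean time to failure (MTTF) is its reciprocal. The FIT unit (failures per 10^9 device-hours) and the example rates are assumptions for illustration, not figures from the slides.

```python
HOURS_PER_YEAR = 24 * 365

def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a failure rate given in FIT.

    Assumptions: 1 FIT = 1 failure per 1e9 device-hours, constant rate over time.
    """
    if fit <= 0:
        raise ValueError("failure rate must be positive")
    return 1e9 / fit

def mttf_years(fit: float) -> float:
    """Same quantity expressed in years."""
    return mttf_hours(fit) / HOURS_PER_YEAR

if __name__ == "__main__":
    # Hypothetical rates: each doubling of the rate halves the expected lifetime.
    for fit in (1000.0, 2000.0, 4000.0):
        print(f"{fit:6.0f} FIT -> MTTF of about {mttf_years(fit):.1f} years")
```

Real hard-error rates change over time, so a constant rate is only a first approximation.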
6
Hard Errors
• Extrinsic failures
  – Caused by process and manufacturing defects
  – Occur with decreasing rate over time
  – No impact from micro-architecture
• Intrinsic failures
  – Related to processor wear-out
  – Occur with increasing rate over time
  – Related to wafer packaging, process parameters, and processor design.
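One common way to picture a failure rate that falls or rises over time is a Weibull hazard function. The sketch below is an assumption brought in for illustration (the slides do not name a distribution): a shape parameter below 1 gives a decreasing rate, like extrinsic failures, and a shape above 1 gives an increasing rate, like wear-out. All parameter values are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)

if __name__ == "__main__":
    times = (1.0, 2.0, 5.0, 10.0)  # arbitrary time units
    # shape < 1: rate falls with time (defect-driven, extrinsic failures)
    print("extrinsic:", [round(weibull_hazard(t, 0.5, 10.0), 4) for t in times])
    # shape > 1: rate rises with time (wear-out, intrinsic failures)
    print("intrinsic:", [round(weibull_hazard(t, 3.0, 10.0), 4) for t in times])
```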
7
Hard Errors
8
Soft Errors
• Occur in both memory and logic
  – External radiation main issue in memory
    • Alpha particles
    • High energy neutrons
    • Thermal neutrons
• Different causes of transient errors in logic
  – External radiation
  – Supply voltage droop
    • Power supply fluctuations
  – Ground bounce, cross-talk
  – Process variation, temperature
    – Affect delay of computational paths
10
Radiation-Induced Soft Errors
• Ionized particle strike causing a state change
• No permanent damage (unlike a hard error)
• Combinational logic – Single Event Transients (SET)
• Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU)
• Three causes of soft errors
  – Alpha particles
  – Thermal neutrons
  – High-energy neutrons
11
Alpha-Particles
• Emitted from impurities in packaging materials.
• Create electron-hole pairs through direct ionization
• Range for a 10 MeV particle < 100 µm
  – Typical energy 4-9 MeV
• Improved manufacturing trends → Reduced effect
  – Purified materials
  – Shielding layers
12
Neutrons
• Result of cosmic ray reactions with atmosphere
• High-Energy neutrons react with chip materials.
• Concrete is the only shielding material
  – 1.4x lower flux per foot of thickness
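To get a feel for the shielding figure above, the sketch below assumes the 1.4x reduction compounds with each additional foot of concrete (an exponential attenuation model; the slide only gives the per-foot factor). The function names are hypothetical.

```python
import math

def neutron_flux_fraction(thickness_ft: float, factor_per_ft: float = 1.4) -> float:
    """Fraction of the incoming flux left after thickness_ft feet of concrete.

    Assumption: the per-foot reduction factor compounds with thickness.
    """
    if thickness_ft < 0:
        raise ValueError("thickness must be non-negative")
    return factor_per_ft ** (-thickness_ft)

def feet_for_reduction(reduction: float, factor_per_ft: float = 1.4) -> float:
    """Concrete thickness in feet needed to cut the flux by the given factor."""
    if reduction < 1:
        raise ValueError("reduction factor must be at least 1")
    return math.log(reduction) / math.log(factor_per_ft)

if __name__ == "__main__":
    print(f"2 ft of concrete leaves {neutron_flux_fraction(2.0):.2f} of the flux")
    print(f"a 10x reduction needs about {feet_for_reduction(10.0):.1f} ft")
```

Under this assumption, meaningful shielding takes several feet of concrete, which is why shielding is rarely a practical fix for high-energy neutrons.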
13
Neutrons
• Thermal neutrons (<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer.
  – Produce ionized particles that can cause soft errors
• Solution → Remove BPSG from advanced processes
• Mostly solved – SEUs still found in 45 nm, 90 nm
15
Device-level solutions
• Larger device sizes → Larger capacitance
  – Increases the amount of charge necessary to flip a bit (critical charge)
• Multiple VT design
  – Sensitivity to variation at low VDD may limit effectiveness.
• Body biasing is also common to both radiation hardening and variation tolerance
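The link between capacitance and critical charge can be sketched with the usual first-order estimate Q_crit ≈ C_node × V_DD. This is an approximation introduced here for illustration: it ignores the restoring current of the cell, and the capacitance and voltage values are hypothetical.

```python
def critical_charge_fc(node_cap_ff: float, vdd_v: float) -> float:
    """First-order critical charge in fC: Q_crit ~ C_node * V_DD (fF * V = fC)."""
    if node_cap_ff <= 0 or vdd_v <= 0:
        raise ValueError("capacitance and supply voltage must be positive")
    return node_cap_ff * vdd_v

if __name__ == "__main__":
    # Hypothetical storage-node capacitances (fF) and supply voltages (V).
    for cap_ff, vdd in ((1.0, 1.0), (2.0, 1.0), (2.0, 0.5)):
        q = critical_charge_fc(cap_ff, vdd)
        print(f"C = {cap_ff} fF, VDD = {vdd} V -> Q_crit ~ {q} fC")
```

The same estimate shows why low-VDD operation makes upsets easier: halving the supply cancels the benefit of doubling the node capacitance.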
16
Circuit-level solutions
• DICE cell
  – Used for SRAM, FFs, latches
• Built-in current sensors on supply lines of memory cells.
18
Modular redundancy
• Dual Modular Redundancy (DMR)
• Triple Modular Redundancy (TMR)
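The difference between the two schemes can be sketched in a few lines: two copies can only detect a disagreement, while three copies allow a majority vote that masks a fault in any single copy. The function names below are hypothetical and the model is purely behavioural.

```python
def dmr_agree(a: int, b: int) -> bool:
    """DMR: compare two copies. A mismatch flags an error but cannot locate it."""
    return a == b

def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority vote over three copies of the same result."""
    return (a & b) | (a & c) | (b & c)

if __name__ == "__main__":
    good = 0b10110010
    faulty = good ^ (1 << 5)  # one module suffers a single-bit upset
    print("DMR copies agree:", dmr_agree(good, faulty))
    print("TMR output correct:", tmr_vote(good, faulty, good) == good)
```

The vote is bitwise, so upsets in different bit positions of different copies are still masked; only two copies failing in the same bit defeat it.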
19
Redundant Circuits
• Redundancy increases area/power
• DMR/TMR in sub/near-VT
– Timing variation between circuits increases
• Utilization of redundant lanes for parallel operation can increase throughput at low-VDD
20
Self-Checking Circuits
• Partition circuit into smaller blocks
  – Error checker for each block
• Use error detection codes
  – Berger codes
  – Arithmetic codes
• Increases circuit delay for error computation
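The two code families named above can be sketched as follows. A Berger code appends the count of 0 bits in the information word as a check symbol, which catches unidirectional errors. An AN arithmetic code multiplies values by a constant A so that valid results stay divisible by A through addition. The function names and the choice A = 3 are assumptions for illustration.

```python
def berger_check_symbol(info: int, k: int) -> int:
    """Berger check symbol: number of 0 bits among the k information bits."""
    mask = (1 << k) - 1
    return k - bin(info & mask).count("1")

def berger_valid(info: int, check: int, k: int) -> bool:
    """A word passes if the stored check symbol matches the recomputed one."""
    return berger_check_symbol(info, k) == check

AN_CONSTANT = 3  # hypothetical choice of A

def an_encode(n: int) -> int:
    """AN code: represent n as A * n."""
    return AN_CONSTANT * n

def an_valid(code: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code % AN_CONSTANT == 0

if __name__ == "__main__":
    info, k = 0b10110010, 8
    check = berger_check_symbol(info, k)
    print("Berger clean word valid:", berger_valid(info, check, k))
    print("Berger 1->0 fault valid:", berger_valid(info & ~(1 << 1), check, k))
    total = an_encode(5) + an_encode(7)  # addition preserves the code
    print("AN sum valid:", an_valid(total), "| after bit flip:", an_valid(total ^ 4))
```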
21
Circuit-Level Speculation
• Uses approximated circuit implementation
  – Goal is to reduce critical path
22
Tunable Replica Circuits
• Mirrors delay of critical path
• Monitors for errors over voltage/frequency changes
23
Timing Speculation
[Figure: Razor flip-flop (main DFF, shadow latch on a delayed clk, 0/1 mux) with a timing diagram of clk, delayed clk, data in (D0, D1, D2), data out, and error.]
• Razor timing error detection
  – Designed for transient faults
  – Effective against SETs and SBUs on flip-flops
• Requires error recovery
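A behavioural sketch of the Razor idea: the main flip-flop samples at the clock edge, the shadow latch samples the same data on a delayed clock, and a mismatch raises the error signal so the shadow value can be restored. The function name, the timing parameters, and the single-cycle model are assumptions for illustration, not the circuit from the slide.

```python
from dataclasses import dataclass

@dataclass
class RazorSample:
    main_q: int     # value captured by the main flip-flop at the clock edge
    shadow_q: int   # value captured by the shadow latch on the delayed clock
    error: bool     # asserted when the two captures disagree
    data_out: int   # value forwarded after any restore from the shadow latch

def razor_sample(old_value: int, new_value: int, arrival_time: float,
                 clk_period: float, shadow_delay: float) -> RazorSample:
    """Model one cycle of a Razor flip-flop (hypothetical, idealised timing)."""
    if arrival_time > clk_period + shadow_delay:
        # Data also misses the shadow latch, so the error would go undetected.
        raise ValueError("arrival time exceeds the detection window")
    # Late data means the main flip-flop still holds the stale value.
    main_q = new_value if arrival_time <= clk_period else old_value
    shadow_q = new_value  # the delayed clock gives the data time to settle
    error = main_q != shadow_q
    data_out = shadow_q if error else main_q
    return RazorSample(main_q, shadow_q, error, data_out)

if __name__ == "__main__":
    print(razor_sample(0, 1, arrival_time=0.8, clk_period=1.0, shadow_delay=0.3))
    print(razor_sample(0, 1, arrival_time=1.2, clk_period=1.0, shadow_delay=0.3))
```

The model also shows the limit of the scheme: data arriving after the delayed clock edge is outside the detection window.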
25
Error Recovery Options in Scalar Processors
• Clock Gating:
  – Global error signal
  – Clock gating
  – 1-cycle penalty
26
Error Recovery Options in Scalar Processors
• Multiple Issue:
  – Error signals propagated to control unit
  – Instructions must be flushed
  – Error instruction then replayed
  – 2N-cycle penalty
27
Error Recovery Options in Scalar Processors
• Counter-flow pipelining
• Micro-rollback
28
Error correcting codes for memories
• Most common is Hamming code
• Check bits stored when data written
• Identifies error and erroneous bit position
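The three points above can be sketched with the small Hamming(7,4) code: check bits are computed when data is written, and on a read the syndrome gives the 1-based position of a single flipped bit. Memory ECC typically uses wider words, and the function names here are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit code word (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    c = [0] * 8  # index 0 unused so indices match bit positions
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def hamming74_syndrome(code):
    """Return 0 for a clean word, else the 1-based position of the flipped bit."""
    c = [0] + list(code)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    return s1 + 2 * s2 + 4 * s4

def hamming74_correct(code):
    """Fix a single-bit error; return (data bits, error position or 0)."""
    pos = hamming74_syndrome(code)
    fixed = list(code)
    if pos:
        fixed[pos - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]], pos

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # upset the bit at position 5
    print(hamming74_correct(word))
```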
29
Error correcting codes for memories
• Single-bit ECC adds area/power and delay
  – Low VDD → Increased delay
  – Hybrid VDD operation will reduce delay
• Overhead increases for multi-bit ECC
  – Increased memory density → higher probability of MBU
  – Current research: increase in the ratio of MBU to total SER in sub-VT
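The storage overhead of single-bit correction can be estimated from the Hamming condition 2^r >= m + r + 1, where m is the number of data bits and r the number of check bits. The sketch below applies only that condition, plus one extra overall parity bit for double-error detection. It does not model multi-bit correcting codes, and the function names are hypothetical.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    if data_bits < 1:
        raise ValueError("need at least one data bit")
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r

def secded_check_bits(data_bits: int) -> int:
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1

if __name__ == "__main__":
    for m in (8, 16, 32, 64):
        r = secded_check_bits(m)
        print(f"{m:2d} data bits: {r} check bits, {100 * r / m:.1f}% overhead")
```

The relative overhead shrinks as words get wider, which is one reason ECC is usually applied to wide memory words.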
31
System-Level Impact
• Soft errors can have a large effect on processor functionality
  – Increasing issue with further device scaling
• All methods of error detection/correction are costly
  – Need to be added to system blocks wisely
    • SEU distribution
    • Effects of process variation
32
System-Level Impact
• How to determine what blocks have the highest system-level impact?
  – Mostly through simulation
• For radiation: all-encompassing
  – Includes fault injection @ circuit level
• Different models have been developed
  – ReStore – University of Illinois at Urbana-Champaign
    • Focuses on system level effect of radiation-induced errors
  – RAMP – IBM
    • Directed more towards hard-errors and processor failure.
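Simulation-based fault injection can be sketched at a much higher level than the circuit-level flows mentioned above: flip one bit at a time in the inputs of a block, rerun it, and count how many upsets are masked versus how many corrupt the output. The workload below is a hypothetical stand-in for a block under test and is not taken from ReStore or RAMP.

```python
def workload(regs):
    """Hypothetical block under test: only the low 4 bits of each value matter."""
    return sum(r & 0x0F for r in regs)

def inject_single_bit_flips(regs, width=8):
    """Flip every bit of every register once; return (masked, corrupted) counts."""
    golden = workload(regs)
    masked = corrupted = 0
    for index in range(len(regs)):
        for bit in range(width):
            faulty = list(regs)
            faulty[index] ^= 1 << bit  # inject one single-bit upset
            if workload(faulty) == golden:
                masked += 1      # the upset never reaches the output
            else:
                corrupted += 1   # the upset changes the result
    return masked, corrupted

if __name__ == "__main__":
    masked, corrupted = inject_single_bit_flips([0x3A, 0x7F, 0x05])
    total = masked + corrupted
    print(f"{masked}/{total} upsets masked, {corrupted}/{total} corrupt the result")
```

Comparing such counts across blocks is one simple way to decide where detection and correction hardware is worth its cost.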
33
Questions?