Microprocessor Reliability. Robert Pawlowski, ECE 570 – 2/19/2013


Page 1: Microprocessor Reliability

Microprocessor Reliability

Robert Pawlowski
ECE 570 – 2/19/2013

Page 2: Microprocessor Reliability

Reliability

• Involves different aspects of a processor that can affect performance and functionality.
  – Ultimately can reduce the lifetime of the processor.
• Issues typically manifest themselves at the device level.
  – Solutions can be implemented at multiple design levels.

Page 3: Microprocessor Reliability

Why the concern?

• Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variability.
  – Gate length/doping concentration variations
  – Temperature
  – Supply voltage droops
• This decreases processor yield
• Decreasing device sizes → increased effect of external issues

Page 4: Microprocessor Reliability

Outline

• Error Classification
• Hard Errors
• Soft Errors
• Sources of radiation
• Device/Circuit approaches
• Architectural approaches
• Error detection
• Error correction
• System level impact

Page 5: Microprocessor Reliability

Processor Error Classification

• Hard Errors will result in permanent processor failure.
• Processor lifetime is inversely proportional to hard error rate.
• Soft Errors do not permanently damage the device.
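
The inverse relation between lifetime and hard error rate can be sketched numerically. This is a minimal illustration that assumes a constant failure rate (an assumption, not something stated on the slide), so that mean time to failure is MTTF = 1/λ. Rates are expressed in FIT (failures per 10^9 device-hours), and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate given in FIT.

    1 FIT = 1 failure per 1e9 device-hours, so MTTF = 1e9 / FIT.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return 1e9 / fit


def mttf_years(fit: float) -> float:
    """Same quantity expressed in years."""
    return mttf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative rates only: each doubling of the rate halves the lifetime.
    for fit in (1000, 2000, 4000):
        print(f"{fit} FIT -> MTTF of about {mttf_years(fit):.1f} years")
```

Under this model, doubling the hard error rate halves the expected lifetime, which is the proportionality the slide refers to.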

Page 6: Microprocessor Reliability

Hard Errors

• Extrinsic failures
  – Caused by process and manufacturing defects
  – Occur with decreasing rate over time
  – No impact from micro-architecture
• Intrinsic failures
  – Related to processor wear-out
  – Occur with increasing rate over time
  – Related to wafer packaging, process parameters, and processor design.
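
The decreasing and increasing failure rates above are often modelled with a Weibull hazard function. The sketch below assumes that model (the slides do not name one), and the shape and scale values are hypothetical. A shape below 1 gives a falling rate, as with extrinsic failures, and a shape above 1 gives a rising rate, as with wear-out.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical parameters chosen only to show the two trends.
EXTRINSIC = {"shape": 0.5, "scale": 10.0}  # defect-driven, rate falls over time
INTRINSIC = {"shape": 3.0, "scale": 10.0}  # wear-out, rate rises over time

if __name__ == "__main__":
    for years in (0.5, 2.0, 8.0):
        early = weibull_hazard(years, **EXTRINSIC)
        wear = weibull_hazard(years, **INTRINSIC)
        print(f"t={years} y: extrinsic rate {early:.4f}, intrinsic rate {wear:.4f}")
```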

Page 7: Microprocessor Reliability

Hard Errors

[Figure only; no slide text recovered]

Page 8: Microprocessor Reliability

Soft Errors

• Occur in both memory and logic
  – External radiation is the main issue in memory
    • Alpha particles
    • High-energy neutrons
    • Thermal neutrons
• Different causes of transient errors in logic
  – External radiation
  – Supply voltage droop
    • Power supply fluctuations
  – Ground bounce, cross-talk
  – Process variation, temperature
  – Affect delay of computational paths

Page 10: Microprocessor Reliability

Radiation-Induced Soft Errors

• An ionizing particle strike causes a state change
• No permanent damage (unlike a hard error)
• Combinational logic – Single Event Transients (SET)
• Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU)
• Three causes of soft errors
  – Alpha particles
  – Thermal neutrons
  – High-energy neutrons

Page 11: Microprocessor Reliability

Alpha Particles

• Emitted from impurities in packaging materials.
• Create electron-hole pairs through direct ionization
• Range for a 10 MeV particle < 100 µm
  – Typical energy 4–9 MeV
• Improved manufacturing trends → reduced effect
  – Purified materials
  – Shielding layers
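
To get a feel for the charge involved in direct ionization, the sketch below converts particle energy into generated charge. It assumes the commonly quoted figure of about 3.6 eV per electron-hole pair in silicon and that all of the energy is deposited in silicon; only a fraction of this charge is actually collected by any one circuit node. The function name is hypothetical.

```python
ELECTRON_CHARGE_C = 1.602176634e-19  # coulombs
EV_PER_PAIR_SILICON = 3.6            # assumed average energy per electron-hole pair


def generated_charge_fc(energy_mev: float) -> float:
    """Upper bound on generated charge (in fC) for a particle stopping in silicon."""
    if energy_mev < 0:
        raise ValueError("energy must be non-negative")
    pairs = energy_mev * 1e6 / EV_PER_PAIR_SILICON
    return pairs * ELECTRON_CHARGE_C * 1e15


if __name__ == "__main__":
    for energy in (4.0, 5.0, 9.0):
        print(f"{energy} MeV alpha: up to {generated_charge_fc(energy):.0f} fC generated")
```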

Page 12: Microprocessor Reliability

Neutrons

• Result of cosmic ray reactions with the atmosphere
• High-energy neutrons react with chip materials.
• Concrete is the only shielding material
  – 1.4× lower flux per foot of thickness
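
The per-foot figure above can be turned into a rough attenuation estimate. This sketch assumes the 1.4× reduction compounds geometrically with thickness, which is an interpretation of the slide rather than something it states. Function and parameter names are hypothetical.

```python
import math

REDUCTION_PER_FOOT = 1.4  # from the slide: 1.4x lower flux per foot of concrete


def flux_behind_concrete(flux_outside: float, thickness_ft: float) -> float:
    """Neutron flux after passing through the given thickness of concrete."""
    if thickness_ft < 0:
        raise ValueError("thickness must be non-negative")
    return flux_outside / (REDUCTION_PER_FOOT ** thickness_ft)


def thickness_for_reduction(factor: float) -> float:
    """Feet of concrete needed to cut the flux by the given factor (>= 1)."""
    if factor < 1:
        raise ValueError("reduction factor must be at least 1")
    return math.log(factor) / math.log(REDUCTION_PER_FOOT)


if __name__ == "__main__":
    print(f"10x reduction needs about {thickness_for_reduction(10):.1f} ft of concrete")
```

Under this assumption, a tenfold reduction takes several feet of concrete, which is why shielding is rarely a practical fix at the system level.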

Page 13: Microprocessor Reliability

Neutrons

• Thermal neutrons (≪ 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer.
  – Produce ionized particles that can cause soft errors
• Solution → remove BPSG from advanced processes
• Mostly solved – SEUs are still found in 45 nm and 90 nm processes

Page 15: Microprocessor Reliability

Device-level solutions

• Larger device sizes → larger capacitance
  – Increases the amount of charge necessary to flip a bit (critical charge)
• Multiple VT design
  – Sensitivity to variation at low VDD may limit effectiveness.
• Body biasing is also common to both radiation hardening and variation tolerance
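
The link between capacitance and critical charge can be sketched with a first-order model. Two assumptions are made here that are not on the slide: Qcrit is approximated as node capacitance times supply voltage, and the soft error rate is taken to fall off exponentially with Qcrit, a commonly used empirical form. The collection-slope value is hypothetical.

```python
import math


def critical_charge_fc(node_capacitance_ff: float, vdd: float) -> float:
    """First-order critical charge: Q = C * V (fF times volts gives fC)."""
    return node_capacitance_ff * vdd


def relative_ser(qcrit_fc: float, collection_slope_fc: float) -> float:
    """Relative soft error rate, assumed proportional to exp(-Qcrit / Qs)."""
    if collection_slope_fc <= 0:
        raise ValueError("collection slope must be positive")
    return math.exp(-qcrit_fc / collection_slope_fc)


if __name__ == "__main__":
    qs = 2.0  # fC, hypothetical charge-collection slope
    for cap in (1.0, 2.0, 4.0):
        q = critical_charge_fc(cap, vdd=1.0)
        print(f"C={cap} fF: Qcrit={q} fC, relative SER={relative_ser(q, qs):.3f}")
```

The same model shows why lowering VDD hurts: Qcrit shrinks with the supply, so the relative error rate rises.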

Page 16: Microprocessor Reliability

Circuit-level solutions

• DICE (Dual Interlocked storage Cell)
  – Used for SRAM, flip-flops, and latches
• Built-in current sensors on supply lines of memory cells.

Page 18: Microprocessor Reliability

Modular redundancy

• Dual Modular Redundancy
• Triple Modular Redundancy
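
The difference between the two schemes can be shown with a small behavioral sketch: DMR can only detect a mismatch between two copies, while TMR can mask a single faulty copy through majority voting. The function names are hypothetical, and the module outputs are modelled as plain integers.

```python
def dmr_check(copy_a: int, copy_b: int):
    """Dual modular redundancy: return (output, error_flag).

    A mismatch signals an error, but DMR alone cannot tell which copy is wrong.
    """
    return copy_a, copy_a != copy_b


def tmr_vote(copy_a: int, copy_b: int, copy_c: int) -> int:
    """Triple modular redundancy: bitwise majority of three copies."""
    return (copy_a & copy_b) | (copy_a & copy_c) | (copy_b & copy_c)


if __name__ == "__main__":
    good = 0b1010
    upset = good ^ 0b1000  # one copy suffers a single-bit upset
    print("DMR error flag:", dmr_check(good, upset)[1])
    print("TMR output matches the fault-free value:", tmr_vote(good, good, upset) == good)
```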

Page 19: Microprocessor Reliability

Redundant Circuits

• Redundancy increases area/power
• DMR/TMR in sub/near-VT
  – Timing variation between circuits increases
• Utilization of redundant lanes for parallel operation can increase throughput at low VDD

Page 20: Microprocessor Reliability

Self-Checking Circuits

• Partition circuit into smaller blocks
  – Error checker for each block
• Use error detection codes
  – Berger codes
  – Arithmetic codes
• Increases circuit delay for error computation
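
As an illustration of one of the codes named above, a Berger code appends the count of 0 bits in the data word as its check symbol. That lets it detect any unidirectional error (all flips in the same direction), but not every mixed error. The sketch below is a minimal software model with hypothetical function names, not a circuit description.

```python
def berger_check_symbol(data: int, width: int) -> int:
    """Check symbol of a Berger code: the number of 0 bits in the data word."""
    ones = bin(data & ((1 << width) - 1)).count("1")
    return width - ones


def berger_is_valid(data: int, check: int, width: int) -> bool:
    """True when the stored check symbol matches the received data word."""
    return berger_check_symbol(data, width) == check


if __name__ == "__main__":
    word, width = 0b10110010, 8
    check = berger_check_symbol(word, width)
    corrupted = word & ~0b10010000  # two 1 -> 0 flips (unidirectional)
    print("clean word valid:", berger_is_valid(word, check, width))
    print("corrupted word valid:", berger_is_valid(corrupted, check, width))
```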

Page 21: Microprocessor Reliability

Circuit-Level Speculation

• Uses approximated circuit implementation
  – Goal is to reduce critical path

Page 22: Microprocessor Reliability

Tunable Replica Circuits

• Mirrors delay of critical path
• Monitors for errors over voltage/frequency changes

Page 23: Microprocessor Reliability

Timing Speculation

[Figure: Razor flip-flop schematic (main DFF and shadow latch; signals clk, delayed clk, data in, data out, error) with a timing diagram for data values D0, D1, D2]

• Razor timing error detection
  – Designed for transient faults
  – Effective against SETs and SBUs on flip-flops
• Requires error recovery
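
The Razor mechanism can be sketched behaviorally: the main flip-flop samples at the clock edge, the shadow latch samples the same input on a delayed clock, and a mismatch raises the error signal. The model below is a simplification with hypothetical names; it ignores metastability and the short-path (hold) constraint that a real design has to meet.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    main_value: int    # what the main flip-flop captured
    shadow_value: int  # what the shadow latch captured on the delayed clock
    error: bool        # asserted when the two disagree
    data_out: int      # value forwarded once any recovery has taken place


def razor_sample(old_value: int, new_value: int, arrival_time: float,
                 clock_period: float, shadow_delay: float) -> RazorResult:
    """Model one capture; arrival_time is when new_value is stable at the D input."""
    if arrival_time > clock_period + shadow_delay:
        raise ValueError("data misses the shadow latch too; this scheme cannot recover")
    # Late data means the main flip-flop still sees the previous value.
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value
    error = main != shadow
    return RazorResult(main, shadow, error, shadow if error else main)


if __name__ == "__main__":
    print(razor_sample(0, 1, arrival_time=0.9, clock_period=1.0, shadow_delay=0.3))
    print(razor_sample(0, 1, arrival_time=1.2, clock_period=1.0, shadow_delay=0.3))
```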

Page 25: Microprocessor Reliability

Error Recovery Options in Scalar Processors

• Clock Gating:
  – Global error signal
  – Clock gating
  – 1-cycle penalty

Page 26: Microprocessor Reliability

Error Recovery Options in Scalar Processors

• Multiple Issue:
  – Error signals propagated to control unit
  – Instructions must be flushed
  – Error instruction then replayed
  – 2N-cycle penalty

Page 27: Microprocessor Reliability

Error Recovery Options in Scalar Processors

• Counter-flow pipelining
• Micro-rollback

Page 28: Microprocessor Reliability

Error correcting codes for memories

• Most common is Hamming code
• Check bits stored when data written
• Identifies error and erroneous bit position
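
The three points above can be illustrated with the smallest Hamming code, Hamming(7,4): check bits are computed when data is written, and on read the syndrome gives the 1-based position of a single flipped bit. This is a minimal sketch with hypothetical function names; real memories use wider codes, often with an extra parity bit for double-error detection.

```python
def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] as the codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, syndrome); syndrome is the flipped position, 0 if none."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1  # single-bit upset at codeword position 5
    data, position = hamming74_correct(stored)
    print("recovered data:", data, "error position:", position)
```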

Page 29: Microprocessor Reliability

Error correcting codes for memories

• Single-bit ECC adds area/power and delay
  – Low VDD → increased delay
  – Hybrid VDD operation will reduce delay
• Overhead increases for multi-bit ECC
  – Increased memory density → higher probability of MBU
  – Current research shows an increase in the ratio of MBU to total SER in sub-VT operation

Page 31: Microprocessor Reliability

System-Level Impact

• Soft errors can have a large effect on processor functionality
  – Increasing issue with further device scaling
• All methods of error detection/correction are costly
  – Need to be added to system blocks wisely
• SEU distribution
• Effects of process variation

Page 32: Microprocessor Reliability

System-Level Impact

• How to determine which blocks have the highest system-level impact?
  – Mostly through simulation
    • For radiation: all-encompassing
  – Includes fault injection at the circuit level
• Different models have been developed
  – ReStore – University of Illinois at Urbana-Champaign
    • Focuses on system-level effect of radiation-induced errors
  – RAMP – IBM
    • Directed more towards hard errors and processor failure.
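
Fault injection in simulation can be pictured with a toy Monte Carlo experiment: flip one random bit in an operand, rerun the computation, and count how often the output is unchanged (the fault is masked). This is only an assumed illustration of the idea, not the method used by ReStore or RAMP, and all names are hypothetical.

```python
import random


def inject_bit_flip(value: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of the value."""
    return value ^ (1 << bit)


def masking_rate(func, operand: int, width: int, trials: int, seed: int = 0) -> float:
    """Fraction of injected single-bit faults that leave func's output unchanged."""
    rng = random.Random(seed)
    golden = func(operand)  # fault-free reference result
    masked = 0
    for _ in range(trials):
        faulty = inject_bit_flip(operand, rng.randrange(width))
        if func(faulty) == golden:
            masked += 1
    return masked / trials


def low_byte_only(x: int) -> int:
    """Example block that ignores the upper bits of its input."""
    return x & 0xFF


if __name__ == "__main__":
    rate = masking_rate(low_byte_only, 0x1234, width=16, trials=1000)
    print(f"masked faults: {rate:.0%}")
```

Blocks with a low masking rate are the ones where an upset is most likely to become visible at the system level, so they are the first candidates for protection.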

Page 33: Microprocessor Reliability

Questions?