1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Post on 26-Dec-2015

Page 1: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

1st Combined R2E Workshop & School-Days

Error Detection and Correction Techniques

A. Marchioro / PH-ESE-ME

Page 2: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Outline

• SEU Basic Facts
• Special technologies for SEU protection
• Mitigation techniques
• Circuit Techniques
  • In logic
  • In registers
  • In RAMs
• Logic (Redundancy) Techniques
• Coding techniques
  • Error detection only techniques
• Conclusions

Page 3: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Significant also in industry

A. Marchioro / PH-ESE3

Terrestrial cosmic rays and soft errorsVol. 40, No. 1, 1996

Soft Errors in Circuits and SystemsVol. 52, No. 3, 2008

Page 4: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SEU errors in “analog” circuitry

We live in a (mostly) digital world:
• (Occasional) errors in analog circuitry will be ignored or will be fixed at the digital level

Particle strikes at sensing elements:
• Happen all the time at particle detectors
  • System should be designed to cope with a single wrong measurement
• Can happen easily in photo-receivers

Particle strikes at critical nodes:
• Biasing nodes
  • Self-recovery
  • Hits at high current nodes are probably going to remain unobserved
• DAC registers
  • Not self-recovered, but detectable in a digital way
• Oscillator circuits and PLLs:
  • Recovery could take ms, but should eventually occur
  • May require training or synchronization sequences to be sent
  • Can cause long sequences of errors in applications such as self-clocking serial streams

Page 5: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SEU Basics

Page 6: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SEU: where does it occur

A. Marchioro / PH-ESE6

“0”

from Darracq et al.: IEEE Trans. on Nuclear Science, VOL. 49, NO. 3, JUNE 2002

Page 7: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Are all particles equally “dangerous” for SEU?

A. Marchioro / PH-ESE7

Energy loss (dE/dx) for protons in Si

for reference see: http://pdg.lbl.gov/2008/reviews/rpp2008-rev-passage-particles-matter.pdf

Bethe-Bloch energy loss equation

Page 8: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

When and where should we care?

“I have this particular component in my system, should I be worried about SEU?”

Page 9: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SEU: Impact on components

A. Marchioro / PH-ESE9

Component type | Technology (likely) used | Digital SEU risk | Mitigation technique applicable
High end microprocessor and DSP | < 90 nm | Very high | System level redundancy
Low-end microcontroller | > 130 nm | High | System level redundancy, software protection techniques
High density memory | < 90 nm | Very high | Error correction (coding)
Discrete digital logic | > 250 nm | Medium | Logic redundancy
Discrete analog components | > 250 nm & bipolar | Low | n.a.
SRAM FPGA | < 90 nm | Very high (*) | Redundancy or reload (needs special tool)
AntiFuse FPGA | < 90 nm | High (**) | Redundancy
ASICs | 130 or 90 nm | | Architectural and circuit level protections

(*) Both user and configuration logic are sensitive
(**) Only user logic is sensitive

Page 10: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SEU in a circuit

SEU can occur in several places in a circuit:
• In a storage node (Register, Latch or RAM)
• Along a logic path (needs to be synchronized with clock sampling to be relevant)
• On a clock line (rather bad!)
• On a global line such as Reset (catastrophic!)

Different techniques are necessary to protect from these different events.

No one-size-fits-all solution!

Page 11: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Device Techniques

Page 12: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Device level SEU protection: SOI

A. Marchioro / PH-ESE12

[Figure: device cross-sections showing charge pairs (+ −) created along a particle track; labels: well, substrate, oxide, STI. WARNING: drawing not to scale!]

The majority of commercial ICs are fabricated on bulk technologies. Charge can be collected from several microns of silicon under a device.

In thin-film SOI, the active silicon layer can be very thin (< 300 nm), therefore little free charge can be produced.

Page 13: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SOI and SEU

A. Marchioro / PH-ESE13

[Figure: plot with curves labelled Bulk SRAM - A, Bulk SRAM - B, SOI SRAM 1 and SOI SRAM 2]

From J. Doff, TNS, 8/2007

Page 14: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

SOI based ASIC design

SOI could be considered for specific and very demanding custom designs, but:
• Requires special technology (few vendors)
• Has virtually no library support
• Has few if any IP available
• Requires high volume
• Price: expensive to very expensive, no second source
• What about the other chips in your system?

Still, it is used in space and military applications

A. Marchioro / PH-ESE14

Page 15: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Circuit Techniques

Page 16: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Single Event Upset in logic

A. Marchioro / PH-ESE16

[Figure: logic gates with inputs A, B and output Y on a path sampled by a flip-flop clocked by CLK]

If the spike is longer than the typical gate delay, it will propagate down the logic path and possibly be sampled in the next FF.

This used to be a very rare event in logic up to the 0.25 µm generation.

Unfortunately it is common in 130, 90 and 65 nm (which means in most commercial chips today).

Page 17: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Protection against SEU in logic

A. Marchioro / PH-ESE17

[Figure: logic path ending in a register, built either with regular (fast) gates or with slow gates that filter glitches]

... or double sample at the register

Page 18: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Circuit level mitigation techniques

A. Marchioro / PH-ESE18

[Figure: four latch schematics, each with inputs Din, CK and CK*: Normal Latch, Strong Feedback Latch, Extra Cap Latch, Large Size Latch]

Page 19: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Special topology D-FF cell

A. Marchioro / PH-ESE19

SEU robust FF: DICE cell

From Calin et al. IEEE TNS Dec 1996

Page 20: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Single Event Upset in SRAM

A. Marchioro / PH-ESE20

[Figure: SRAM cell with word line WL and bit lines BL and BL*, internal nodes holding “0” and “1”]

Sensitive nodes are the drains of off-state transistors

Page 21: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Circuit level protection

A. Marchioro / PH-ESE21

from Canaris, Whitaker: Circuit Techniques for the Radiation Environment of Space, IEEE 1995 CUSTOM INTEGRATED CIRCUITS CONFERENCE

Page 22: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Remarks about SEU in RAMs

In today’s technologies, cells are so small (< 1 µm²) that single ions can hit two or more locations at once, so multiple SEUs are common. Single bit EDAC is likely not sufficient!

While it is true that most of the memory area is covered by the matrix of cells, hits in other areas (decoder, sense-amp), though rare, can be even more catastrophic

A. Marchioro / PH-ESE22

Page 23: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

A 65 nm 2-Billion Transistor Itanium

A. Marchioro / PH-ESE23

Page 24: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

More on SER…

A. Marchioro / PH-ESE24

Page 25: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Logic Techniques

Page 26: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Redundancy

Redundancy is actually a coding technique, technically a simple “repetition” code, where the information is duplicated or triplicated and checked at convenient boundaries.

Redundancy is well suited to control blocks. Data paths are better protected by other techniques, such as parity etc.

A. Marchioro / PH-ESE26

Page 27: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

27

Repetition Code

Take each symbol $s_i$ in $S$ and repeat it $n$ times.

This is an $(n, 1)$ code.

For example the word $\{s_1 s_2 s_3\}$ becomes the codeword $\{s_1 s_1 s_1 s_2 s_2 s_2 s_3 s_3 s_3\}$

Efficiency (= rate) of the code is: $1/n$

The minimum distance (see later) is $n$ and the number of errors $t$ that can be corrected is:

$t = \lfloor (n - 1)/2 \rfloor$, i.e. $t = \tfrac{1}{2}(n - 1)$ for odd $n$
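The repetition code above can be sketched in a few lines. This is only an illustration for binary symbols; the function names `rep_encode` and `rep_decode` are made up for this example and do not come from the slides.

```python
def rep_encode(symbols, n=3):
    """Repeat every symbol n times: an (n, 1) repetition code."""
    return [s for s in symbols for _ in range(n)]


def rep_decode(codeword, n=3):
    """Majority-vote each group of n received bits (bits are 0 or 1)."""
    groups = [codeword[i:i + n] for i in range(0, len(codeword), n)]
    return [1 if sum(g) > n // 2 else 0 for g in groups]


word = [1, 0, 1]
sent = rep_encode(word)      # s1 s1 s1 s2 s2 s2 s3 s3 s3
received = list(sent)
received[4] ^= 1             # one upset inside the second group
decoded = rep_decode(received)
```

With n = 3 each group tolerates one flipped bit; with n = 5 it tolerates two, in line with t = (n − 1)/2.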

A. Marchioro / PH-ESE

Page 28: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

28

Triple Module Redundancy

Triple redundancy:
• Three copies of same user logic + state_register
• Voting logic decides 2 out of three (majority)

Used regularly in:
• High reliability electronics
• Mainframes

Problems:
• 300% area and power
• Corrects only 1 error
• Can get very wrong with two errors
• Problem: how do you make sure that the voting logic itself is not affected by SEU?

[Figure: Input feeds FSM1, FSM2 and FSM3 (common CLK); voting logic produces Output]

Logic for voting: AB + AC + BC
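A minimal software model of the 2-out-of-3 voter, assuming the three copies hold their state as integers so the vote can be taken bit by bit. The 4-bit counter used as "user logic", and the choice to feed the voted state back into all three copies, are assumptions made for this sketch and are not taken from the slides.

```python
def vote(a, b, c):
    """Bitwise 2-out-of-3 majority: AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


def tmr_step(next_state, states, inp):
    """Vote on the three stored states, then advance all copies from the voted value."""
    voted = vote(*states)
    return tuple(next_state(voted, inp) for _ in range(3))


def counter(state, inp):
    """Hypothetical user logic: a 4-bit counter."""
    return (state + inp) & 0xF


states = (5, 5, 5)
upset = (states[0] ^ 0b0100, states[1], states[2])  # SEU in one copy
repaired = tmr_step(counter, upset, 1)               # all copies agree again
```

Two upsets hitting the same bit in two copies would out-vote the good copy, which is the "can get very wrong with two errors" case listed above.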

Page 29: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Example of triplicated design

Gigabit Optical Link (CERN design: GOL, http://proj-gol.web.cern.ch/proj-gol/)
• 0.8 and 1.60 Gb/s optical link
• Unidirectional
• < 300 mW
• G-Link and Gigabit Ethernet protocol
• Redundant logic
• More than 20,000 units in Atlas, CMS, LHCb and Alice

Page 30: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

30

Reduced Module Redundancy

Double redundancy:
• Two copies of same user logic + state_register
• Voting logic decides if outputs are unequal
• If mismatch:
  • Report to system

Problems:
• 200% area and power
• Can’t be used in “real-time” but may be sufficient for many applications

[Figure: Input feeds FSM1 and FSM2 (common CLK); comparison logic produces Output and a Reset Request]

Page 31: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

31

What to duplicate?

[Figure: two alternative duplication schemes, each drawn with Reg, Logic and comparison logic blocks between Input and Output; each scheme carries one of the two captions below]

Use this: if clock frequency is high and technology is “advanced”.

Use this: if clock frequency is low and technology is “old”.

Page 32: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

32

FSM general structure

A. Marchioro / PH-ESE

[Figure: two FSM structures drawn with Reg, Logic and comparison logic blocks between Input and Output, one labelled “Do this!” and the other “Not this.”]

Page 33: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

33

Temporal Redundancy

Redundancy in time:
• Single user logic block and two state_registers
• Two clocks (F1 and F2)
• Voting logic decides if outputs are unequal at completion of F2
• If error:
  • Compute again

Problems:
• Needs time for 3 evaluations (…not really, three transient time constants are enough)
• No problem at 40 MHz and “modern” technology
• Needs multi-phase clock

[Figure: Input feeds a single Logic block whose result is captured by Reg1 (CLK1) and Reg2 (CLK2); comparison logic produces Output and a Re-evaluate Request]

Page 34: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

34

Memory Boundary Redundancy

Check for consistency only when results will be committed to memory:
• For instance when two computers/microcontrollers perform a STORE operation

Advantages:
• Processors can be “standard”
• Write operations are relatively rare and therefore requirements on comparison resources are small
• Fewer resources needed for checking

Used in some mainframes with triple redundancy

Problem: if you detect an error in a processor, how do you resync it?

[Figure: uP 1 and uP 2 write to Shared Memory through comparison logic, which raises Error]

Page 35: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

35

I/O Boundary Redundancy

Check for consistency only when results will be used by external devices:
• For instance when two computers/microcontrollers want to commit results to disk

Advantages:
• Synchronization is less of a problem
• Fewer resources needed for checking
  • In some cases it could even be done in software
• uP architectures and/or hardware could even be different

Used in high-reliability computer boxes and avionics

[Figure: two processors (both labelled uP 1 in the source), each with its own memory (both labelled Mem1) and I/O interface (I/O Intf1, I/O Intfc2); comparison logic with I/O CLK drives the I/O device and a Re-evaluate Request]

Page 36: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Mission critical redundancy

A. Marchioro / PH-ESE36

Various computer configurations used during a Shuttle mission.

from: NASA Shuttle documentation

Page 37: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Redundancy in avionics

A. Marchioro / PH-ESE37

from: IEEE Aerospace & Electronic Systems Magazine, October 2000

Page 38: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Coding Techniques

Page 39: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

39

Hamming Coding

“Two weekends in a row I came in and found that all my stuff had been dumped and nothing was done. I was really aroused and annoyed and I wanted those answers and two weekends had been lost. And so I said, ‘Damn it, if the machine can detect an error, why can’t it locate the position of the error and correct it?’”

from an interview with R. Hamming,

February 3-4, 1977, quoted in T. Thompson, p.17

“The purpose of this memorandum is to give some practical codes which may detect and correct all errors of a given probability of occurrence, and which detect errors of even a rarer occurrence”.

from R. Hamming,

‘Self-Correcting Codes – Case 20878,

Memorandum 1130-RWH-MFW,

Bell Telephone Laboratories, July 27, 1947

A. Marchioro / PH-ESE

Page 40: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

40

Coding for memory repair

A. Marchioro / PH-ESE

Page 41: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Mitigating SEU: Forward Error Correction

A. Marchioro / PH-ESE41

[Figure: the transmitter computes f(D) from the data D and sends T with TP; the receiver computes f(R) from the received R and compares it with RP (=?) to give OK/NotOK]

Examples of FEC:
• Simple Parity (actually only error detection)
• EDC: Hamming coding
  • single error correction capability, popular in computer DRAM
• BCH
  • Sophisticated multiple bit error detection and correction; requires complex logic
• Reed-Solomon
  • Sophisticated and efficient multi-word error detection and correction; requires complex logic

Page 42: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Mitigating SEU: FEC (2)

A. Marchioro / PH-ESE42

The “parity” function must be such that, if an error is detected, one can also use it to recover the right data!

[Figure: receiver block in which f(R) is compared with RP (=?) to give OK/NotOK, and f⁻¹(R) is used to produce the corrected data D]

Page 43: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Families of Error Control Methods
• Block codes: codeword built only on the current message word
• Non-block codes: codeword depends on the current message word and on some past words, e.g. convolutional codes, used (obviously) in streaming channels

Examples of codes:
• Hamming
• Bose-Chaudhuri-Hocquenghem (BCH)
• Golay
• Reed-Solomon (RS)
• Reed-Muller
• Low Density Parity Check Codes
• Turbo Codes
• …

43 A. Marchioro / PH-ESE

Page 44: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

44

Parity

In B = {0,1}, start with a message word: $S = \{s_1 s_2 s_3 s_4 s_5 s_6 s_7\}$

Compute a “Parity” character $c_8$ defined as:

$c_8 = s_1 \oplus s_2 \oplus s_3 \oplus s_4 \oplus s_5 \oplus s_6 \oplus s_7$

where $\oplus$ is the exclusive-OR (or the sum mod 2).

Parity check can detect all single errors (but cannot give the position).
Parity check cannot detect double errors (or any even number of errors).

Used:
- often in computer memories
- in serial terminal data transmission
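As a small illustration of the parity rule, assuming bits are stored as Python integers 0/1 (the helper names `parity_bit` and `parity_ok` are invented for this sketch):

```python
from functools import reduce
from operator import xor


def parity_bit(bits):
    """XOR (sum mod 2) of all bits."""
    return reduce(xor, bits, 0)


def parity_ok(word):
    """A word of data bits plus its parity bit must XOR to 0."""
    return parity_bit(word) == 0


s = [1, 0, 1, 1, 1, 0, 0]          # s1..s7
word = s + [parity_bit(s)]         # append c8

single = list(word)
single[2] ^= 1                     # one error: detected
double = list(single)
double[5] ^= 1                     # second error: parity looks fine again
```

The `double` case shows why a single parity bit misses any even number of errors.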

Page 45: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

45

Two-Dimensional Parity

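Block with ParityX (last column, parity of each row) and ParityY (last row, parity of each column):

1 0 1 1 1 0 0 0
0 1 0 0 0 1 1 1
1 1 0 0 0 0 1 1
0 1 0 1 1 0 1 0
1 0 0 1 0 1 1 0
0 0 0 1 0 1 0 0
1 1 0 0 1 0 0 1
0 0 1 0 1 1 0

Same block with 2 errors (rows 4 and 5, fifth column); the row parities of rows 4 and 5 change, the column parity does not:

1 0 1 1 1 0 0 0
0 1 0 0 0 1 1 1
1 1 0 0 0 0 1 1
0 1 0 1 0 0 1 1
1 0 0 1 1 1 1 1
0 0 0 1 0 1 0 0
1 1 0 0 1 0 0 1
0 0 1 0 1 1 0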
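The block above can be checked mechanically. The sketch below uses the 7x7 data bits from the slide and recomputes row and column parities; the helper names are hypothetical. A single error shows up as exactly one bad row and one bad column (so it can be located and flipped back), while two errors in the same column only flag the two rows.

```python
BLOCK = [
    [1, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0],
]


def parities(block):
    """Even parity of every row and of every column."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols


def mismatches(block, ref_rows, ref_cols):
    """Indices (0-based) of rows and columns whose parity no longer matches."""
    rows, cols = parities(block)
    bad_rows = [i for i, (p, q) in enumerate(zip(rows, ref_rows)) if p != q]
    bad_cols = [j for j, (p, q) in enumerate(zip(cols, ref_cols)) if p != q]
    return bad_rows, bad_cols


REF_ROWS, REF_COLS = parities(BLOCK)

one_error = [row[:] for row in BLOCK]
one_error[3][4] ^= 1                    # row 4, column 5 on the slide

two_errors = [row[:] for row in one_error]
two_errors[4][4] ^= 1                   # second hit in the same column
```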

A. Marchioro / PH-ESE

Page 46: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

46

Two-Dimensional Parity

A. Marchioro / PH-ESE

Page 47: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

47

Hamming (intuitive version)

[Figure: three overlapping circles containing the source bits s1, s2, s3, s4 and the parity bits c5, c6, c7]

Definition: cj = computed to give even parity in the circle

s1 s2 s3 s4 (source) | c5 c6 c7 (parity)
0 0 0 0 | 0 0 0
0 0 0 1 | 0 1 1
0 0 1 0 | 1 1 1
0 0 1 1 | 1 0 0
0 1 0 0 | 1 1 0
0 1 0 1 | 1 0 1
0 1 1 0 | 0 0 1
0 1 1 1 | 0 1 0
1 0 0 0 | 1 0 1
1 0 0 1 | 1 1 0
1 0 1 0 | 0 1 0
1 0 1 1 | 0 0 1
1 1 0 0 | 0 1 1
1 1 0 1 | 0 0 0
1 1 1 0 | 1 0 0
1 1 1 1 | 1 1 1

Notice: the 16 code words in Hamming(7,4) differ from each other by at least 3 bits.

A. Marchioro / PH-ESE

Page 48: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

48

Hamming Codes (3)

[Figure: hardware for encoder; inputs a0, a1, a2, a3; outputs a0, a1, a2, a3, p0, p1, p2]

A. Marchioro / PH-ESE

Page 49: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

49

Hamming Codes (4)

[Figure: hardware for decoder; inputs a’0, a’1, a’2, a’3, p’0, p’1, p’2; correction logic and modulo-2 adders (+) produce a0, a1, a2, a3]

A. Marchioro / PH-ESE

Page 50: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Cost of Hamming SEC

Data word width [nbit] | Correction bits | Total bits
4 | 3 | 7
8 | 4 | 12
16 | 5 | 21
32 | 6 | 38
64 | 7 | 71
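The numbers in this table follow from the standard Hamming bound for single error correction: r check bits must be able to name each of the k + r bit positions plus the "no error" case, so 2^r ≥ k + r + 1. A short check (the function name is chosen for this sketch only):

```python
def sec_check_bits(k):
    """Smallest r with 2**r >= k + r + 1 (single error correction)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


table = [(k, sec_check_bits(k), k + sec_check_bits(k)) for k in (4, 8, 16, 32, 64)]
```

The relative overhead drops from 75% at 4 data bits to about 11% at 64 data bits.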

50 A. Marchioro / PH-ESE

Page 51: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

51

Hamming in use

A. Marchioro / PH-ESE

Page 52: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Multiple-Errors

Errors often come in bursts. For example:
• An ion can strike more than one memory cell in an array
• In close space proximity
• In close time proximity

Most simple schemes can handle only one error, e.g. Parity or Hamming.

Multiple bit correction schemes exist but they are considerably more complicated.

52 A. Marchioro / PH-ESE

Page 53: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Interleaving: Basic idea

53 A. Marchioro / PH-ESE

[Figure: Byte_0, Byte_1, Byte_2 and Byte_3 are diffused (interleaved) and later recombined]

If the error correction capability is limited to one bit/byte, then try to spread error bursts across different data chunks
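A minimal sketch of the idea, assuming four 8-bit words and a simple bit-wise block interleaver (bit 0 of every word first, then bit 1 of every word, and so on). The layout and the function names are assumptions for illustration; the slides do not specify a particular interleaving order.

```python
def interleave(words, width=8):
    """Emit bit 0 of every word, then bit 1 of every word, ..."""
    return [(w >> b) & 1 for b in range(width) for w in words]


def deinterleave(bits, n_words):
    """Inverse of interleave: stream index = bit_index * n_words + word_index."""
    words = [0] * n_words
    for idx, bit in enumerate(bits):
        b, w = divmod(idx, n_words)
        words[w] |= bit << b
    return words


original = [0x12, 0x34, 0x56, 0x78]
stream = interleave(original)

for idx in range(9, 13):        # burst of 4 adjacent bits in the stream
    stream[idx] ^= 1

recovered = deinterleave(stream, len(original))
errors_per_word = [bin(a ^ b).count("1") for a, b in zip(original, recovered)]
```

After de-interleaving, the 4-bit burst has become one error in each word, which a one-bit-per-word correcting code can then handle.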

Page 54: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Interleaving in Memories

Requires more complicated addressing and decoders, but it is comparatively simple to implement in ASICs

54 A. Marchioro / PH-ESE

[Figure: memory array addressed by a0 … an, with the bits b0 … b7 of each word placed in physically separated (interleaved) columns]

Page 55: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

55

Cross-Interleaver

[Figure: cross-interleaver built from delay elements d, with one, two and three delays on successive lines]

A. Marchioro / PH-ESE

Page 56: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Techniques for serial links

Today’s high (and low) speed links all use some form of coding for reasons related to the electrical or optical characteristic of the links

A. Marchioro / PH-ESE56

Page 57: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Elementary review of link types

Link | Coding | Error Det/Corr | Comment
RS-232 | None + Parity | 1/0 |
USB2 | NRZI + Bit stuffing | 0/0 | Error detect through CRC at protocol layer
Ethernet 1000 Base X | 8b/10b | some/0 | Line balancing
SATA | 8b/10b | some/0 | Line balancing
GOL (CERN design) | 8b/10b or 16/20 | | EC at protocol layer
GBT (CERN design) | Reed-Solomon FEC | 16 out of 120 | Complex block coding

A. Marchioro / PH-ESE57

Page 58: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Error detection, no correction

In some cases detecting the presence of an error may be sufficient to avoid problems:
• In applications or protocols allowing for re-computation or re-transmission
  • Example: file reading from a disk can be reattempted in case of error
• Very often measurements can be repeated without bad consequences for systems.

A. Marchioro / PH-ESE58

Page 59: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Error detection with CRC

For occasional single or non-burst errors an extremely popular and powerful error detection technique is based on computing a “Cyclic Redundancy Check” code to attach to the data

This is based on the properties of so called “Cyclic Groups”, and the basic mathematics is related to the fact that while a protection code computed additively is relatively easy to fool, one computed on the properties of the remainder of a division turns out to be much more robust

A. Marchioro / PH-ESE59

Page 60: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

CRC in practice

Use one of the recognized standard CRC polynomials:

CRC-4 $g(x) = x^4+x^3+x^2+x+1$

CRC-7 $g(x) = x^7+x^6+x^4+1$

CRC-8 $g(x) = x^8+x^7+x^6+x^4+x^2+1$

CRC-12 $g(x) = x^{12}+x^{11}+x^3+x^2+x+1$

CRC-ANSI $g(x) = x^{16}+x^{15}+x^2+1$

CRC-CCITT $g(x) = x^{16}+x^{12}+x^5+1$

CRC-24 $g(x) = x^{24}+x^{23}+x^{14}+x^{12}+x^8+1$

CRC-32b $g(x) = x^{32}+x^{26}+x^{23}+x^{22}+x^{16}+x^{12}+x^{11}+x^{10}+x^8+x^7+x^5+x^4+x^2+x+1$
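As an illustration of "remainder of a division", here is a bit-serial sketch for the CRC-CCITT polynomial listed above (0x1021 once the leading x^16 term is dropped). The zero initial value, MSB-first bit order and absence of a final XOR are assumptions of this sketch (this combination is often called CRC-16/XMODEM); real protocols differ in these details.

```python
def crc16_ccitt(data: bytes, crc: int = 0x0000) -> int:
    """Remainder of data(x) * x^16 divided by x^16 + x^12 + x^5 + 1, MSB first."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


message = b"123456789"
check = crc16_ccitt(message)
framed = message + check.to_bytes(2, "big")   # data with CRC attached

corrupted = bytearray(framed)
corrupted[3] ^= 0x10                          # single bit upset
```

With these conventions the receiver can simply run the same function over data plus CRC: a zero result means no error was detected.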

A. Marchioro / PH-ESE60

Page 61: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Conclusion

• SEU events are more and more important in digital logic
• Mitigation of SEU can be performed at several levels
  • Device, circuit, system levels
• The correct strategy can only be decided once the relevance of a given error on an overall system is clear
  • Do not apply expensive mission critical techniques when simple recovery techniques are applicable!
• How efficient a given strategy really is can (unfortunately) only be assessed through thorough testing; rough estimations can be very wrong and can lead to disasters.

A. Marchioro / PH-ESE61

Page 62: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Extra material

Page 63: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

63

Bibliography on Error Coding

Good books on Coding:

R. Blahut, Algebraic Codes for Data Transmission, Cambridge U.P., 2003

O. Pretzel, Error Correcting Codes and Finite Fields, Oxford U.P. 1992

S. Wicker, Error Control Systems, Prentice Hall, 1995

The Mathematics underneath:

J. A. Gallian, Contemporary Abstract Algebra, Houghton Mifflin, 2006

McEliece, Finite Fields for Scientists and Engineers, Kluwer, 1986

A. Marchioro / PH-ESE

Page 64: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

A. Marchioro / PH-ESE64

Density of e-h pairs is important

[Figure: a heavy ion crossing an Nwell with a p+ diffusion in p- silicon, shown in three steps with the electron-hole (e, h) pairs along and around the track]

1. Ion strike: ionization takes place along the track (column of high-density pairs).

2. Charges start to migrate in the electric field across the junctions. Some drift (fast collection, relevant for SEEs), some diffuse (slow collection, less relevant for SEEs).

3. Charges are collected at circuit nodes. Note that, if the relevant node for the SEE is the p+ diffusion, not all charge deposited by the ion is collected there.

Illustration from F. Faccio, this Course

Page 65: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Units

LET = Linear Energy Transfer, i.e. how much energy (to create charged pairs) has been deposited by an ionizing particle in a given amount of material.

Units:

$[\mathrm{LET}] = \dfrac{[\mathrm{MeV}] \cdot [\mathrm{cm}^2]}{[\mathrm{g}]}$

or, multiplying by the density of the material:

$[\mathrm{LET}] = \dfrac{[\mathrm{MeV}]}{[\mathrm{cm}]}$

Page 66: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

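Metrics (1)

MTTF: Mean Time To Failure
• Time between two faults in a given component

Total System MTTF:

$\mathrm{MTTF}_{System} = \dfrac{1}{\sum_{i=0}^{n} \dfrac{1}{\mathrm{MTTF}_i}}$

(Units: could be measured in hours, days or years)

[Figure: time axis showing MTTF up to “Error detected”, MTTR from there to “System Re-Start”, and MTBF spanning both]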

A. Marchioro / PH-ESE

Page 67: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Metrics (2)

FIT: Failure In Time
Definition: 1 FIT is one error in $10^9$ device-hours of operation

Total System FIT:

$\mathrm{FIT}_{System} = \sum_{i=0}^{n} \mathrm{FIT}_i$

A. Marchioro / PH-ESE

Page 68: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

Metrics (3)

Converting between them:

$\mathrm{MTTF}[\text{years}] = \dfrac{10^9}{\mathrm{FIT} \cdot 24[\text{hours}] \cdot 365[\text{days}]}$

Example: a FIT of 500 corresponds to an MTTF of 228 years.

[This conversion is valid for an exponential probability distribution, i.e. a distribution where events (i.e. errors) have no memory of time, which is indeed the case for particle hits, under constant beam intensity assumptions. Notice that this would not apply for a distribution representing ageing]
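The three metric formulas can be checked numerically with a few lines. The function names are invented for this sketch, and the component values in the last two lines are made-up illustration numbers, not data from the slides.

```python
HOURS_PER_YEAR = 24 * 365


def fit_to_mttf_years(fit):
    """MTTF in years for a failure rate given in FIT (failures per 1e9 device-hours)."""
    return 1e9 / (fit * HOURS_PER_YEAR)


def system_fit(component_fits):
    """Failure rates of independent components add."""
    return sum(component_fits)


def system_mttf(component_mttfs):
    """Reciprocal of the summed reciprocal MTTFs."""
    return 1.0 / sum(1.0 / m for m in component_mttfs)


example = fit_to_mttf_years(500)            # about 228 years, as on the slide
board_fit = system_fit([200, 200, 100])     # hypothetical three-chip board
board_mttf = system_mttf([100.0, 100.0])    # two hypothetical 100-year parts
```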

A. Marchioro / PH-ESE

Page 69: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

A. Marchioro / PH-ESE69

A commercial fault-tolerant computerfor telecom applications

Page 70: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

70

Coding as a map $F^k \to F^n$

[Figure: the message space $F^k$ mapped into the larger codeword space $F^n$]

A. Marchioro / PH-ESE

Page 71: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

71

Error Detection in $F^n$

[Figure: codewords in $F^n$ degraded by transmission or storage (retrieval); outcomes labelled “Recoverable”, “Undetected Error” and “Confused, unrecoverable”]

A. Marchioro / PH-ESE

Page 72: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

72

Cyclic Codes: Simple Example

The code:

c0 : 0000000    c1 : 1011100
c2 : 0101110    c3 : 1110010
c4 : 0010111    c5 : 1001011
c6 : 0111001    c7 : 1100101

is cyclic; in fact it can be noticed that, using shift and linearity and starting with cg = (1011100):

c0 : 0000000    c1 : cg
c2 : cg>>1      c3 : c1+c2
c4 : cg>>2      c5 : c1+c4
c6 : c2+c4      c7 : c1+c2+c4
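The cyclic and linear properties of this small code are easy to verify by brute force. The sketch below treats codewords as bit strings; `rot` and `add` are helper names chosen for the example.

```python
CODE = [
    "0000000", "1011100", "0101110", "1110010",
    "0010111", "1001011", "0111001", "1100101",
]


def rot(word):
    """Cyclic shift right by one position (the slide's >>)."""
    return word[-1] + word[:-1]


def add(a, b):
    """Bitwise modulo-2 sum of two codewords."""
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))


is_cyclic = all(rot(w) in CODE for w in CODE)
is_linear = all(add(a, b) in CODE for a in CODE for b in CODE)
min_weight = min(w.count("1") for w in CODE if w != "0000000")
```

For a linear code the minimum distance equals the minimum non-zero weight, which is 4 here.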

A. Marchioro / PH-ESE

Page 73: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

73

Hamming (2)

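[Figure: three overlapping circles containing the received bits r1 … r7]

During transmission the message word

$s_1 s_2 s_3 s_4 c_5 c_6 c_7$

is (potentially) modified by an error in (the unknown) position j and is received as:

$r_1 r_2 r_3 r_4 r_5 r_6 r_7$

for example, for j = 2:

$r_1 r_2 r_3 r_4 r_5 r_6 r_7 = s_1 s_2 s_3 s_4 c_5 c_6 c_7 \oplus 0100000$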

A. Marchioro / PH-ESE

Page 74: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

74

Hamming (3)

[Figure: the circle diagram filled with the received bits r = 1100101; the erroneous bit r2 is marked with *]

Example: for an original word 1000101, assume that e = 0100000 occurred, resulting in r = 1100101.

Circles with odd (= wrong) parity are now marked.

Decoding and correcting trick: can we find a single bit (assuming that there was just one error) that lies inside all the marked circles and outside of the unmarked one?

A. Marchioro / PH-ESE

Page 75: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

75

Hamming Codes (1)

Another simple construction of Hamming Code:

Given the four data bits $(a_0, a_1, a_2, a_3)$, construct three parity bits as follows:

$p_0 = a_0 + a_1 + a_2$

$p_1 = a_1 + a_2 + a_3$

$p_2 = a_0 + a_1 + a_3$

(here “+” is modulo 2 addition) and send the codeword: $(a_0, a_1, a_2, a_3, p_0, p_1, p_2)$.

Notice that we use the space of $2^7$ code-words to represent $2^4$ possible message-words.

The valid codewords are therefore:

0 0 0 0 0 0 0

0 0 0 1 0 1 1

0 0 1 0 1 1 0

0 0 1 1 1 0 1

0 1 0 0 1 1 1

0 1 0 1 1 0 0

0 1 1 0 0 0 1

0 1 1 1 0 1 0

1 0 0 0 1 0 1

1 0 0 1 1 1 0

1 0 1 0 0 1 1

1 0 1 1 0 0 0

1 1 0 0 0 1 0

1 1 0 1 0 0 1

1 1 1 0 1 0 0

1 1 1 1 1 1 1

A. Marchioro / PH-ESE

Page 76: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

76

Hamming Codes (2)

The decoder receives $(a'_0, a'_1, a'_2, a'_3, p'_0, p'_1, p'_2)$ and computes:

$s_0 = p'_0 + a'_0 + a'_1 + a'_2$

$s_1 = p'_1 + a'_1 + a'_2 + a'_3$

$s_2 = p'_2 + a'_0 + a'_1 + a'_3$

called the “syndromes”. If there has been no error, these are all zero; if there has been a single error, at least one of them is non-zero. The syndromes depend only on the error pattern, as in the table below:

Syndrome (s0 s1 s2) | Error (a0 a1 a2 a3 p0 p1 p2)
0 0 0 | 0 0 0 0 0 0 0
0 0 1 | 0 0 0 0 0 0 1
0 1 0 | 0 0 0 0 0 1 0
0 1 1 | 0 0 0 1 0 0 0
1 0 0 | 0 0 0 0 1 0 0
1 0 1 | 1 0 0 0 0 0 0
1 1 0 | 0 0 1 0 0 0 0
1 1 1 | 0 1 0 0 0 0 0
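The parity equations and the syndrome table of this construction translate directly into code. This is a minimal sketch for single-error correction only; the function names and the dictionary `SYNDROME_TO_POSITION` are introduced here for illustration.

```python
def encode(a):
    """(a0, a1, a2, a3) -> (a0, a1, a2, a3, p0, p1, p2)."""
    a0, a1, a2, a3 = a
    return [a0, a1, a2, a3, a0 ^ a1 ^ a2, a1 ^ a2 ^ a3, a0 ^ a1 ^ a3]


def syndrome(r):
    """(s0, s1, s2) computed from the received word."""
    a0, a1, a2, a3, p0, p1, p2 = r
    return (p0 ^ a0 ^ a1 ^ a2, p1 ^ a1 ^ a2 ^ a3, p2 ^ a0 ^ a1 ^ a3)


# Syndrome -> index of the flipped bit, copied from the table above.
SYNDROME_TO_POSITION = {
    (1, 0, 1): 0, (1, 1, 1): 1, (1, 1, 0): 2, (0, 1, 1): 3,
    (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6,
}


def decode(r):
    """Correct at most one flipped bit and return the four data bits."""
    r = list(r)
    s = syndrome(r)
    if s != (0, 0, 0):
        r[SYNDROME_TO_POSITION[s]] ^= 1
    return r[:4]


sent = encode([1, 0, 0, 0])
received = list(sent)
received[1] ^= 1                 # upset in a1
data = decode(received)
```

Two simultaneous errors produce a syndrome that points at a third, innocent bit, so the decoder would then mis-correct; this is the limitation discussed on the multiple-errors slide.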

A. Marchioro / PH-ESE

Page 77: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

77

Hamming Codes (5)

A compact description of the encoding operation and of the syndrome computations may be given by using matrix notation such as:

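$$
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ p_0 \\ p_1 \\ p_2 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
$$

$$
\begin{bmatrix} s_0 \\ s_1 \\ s_2 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a'_0 \\ a'_1 \\ a'_2 \\ a'_3 \\ p'_0 \\ p'_1 \\ p'_2 \end{bmatrix}
$$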

A. Marchioro / PH-ESE

Page 78: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

78

Encoding block in CD

[Figure: Din{24x8} enters the C2 Encoder, RS(28,24); the 28 output symbols pass through delay lines of d, 2d, …, 26d, 27d; the C1 Encoder, RS(32,28), produces Dout{32x8}]

A. Marchioro / PH-ESE

Page 79: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

79

RS combined with interleaving
• UDP packets in the TCP/IP protocol suite do not have guaranteed delivery
• RS is used to replace lost packets (“erasures”)
• Data stream is framed into blocks of 249 bytes and encoded in RS(255,249) blocks; this has dmin = 7 and can correct 6 erasures
• Messages are interleaved in blocks of 255xN
• Blocks are sent by columns
• If a packet is lost, it is replaced by a “0” column
• The receiver knows that packet “j” is lost because it is missing in the sequence
• The RS code (organized in N rows) can recover up to 6 missing columns

c1,1 c1,2 c1,3 … c1,255
c2,1 c2,2 c2,3 … c2,255
… … … … …
cN,1 cN,2 cN,3 … cN,255

A. Marchioro / PH-ESE

Page 80: 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

80

Other coding techniques

Block coding introduces redundancy on finite blocks of data, without reference to previous blocks, and with all redundant information contained in the block itself.

Convolutional coding performs encoding based on the current set of data to be coded and on the history of previous blocks, i.e., a given data set is mapped on a number of different data sets, depending on the content of the previously coded sets. These coding techniques are extremely powerful and are largely used in

telecom and space applications.

A. Marchioro / PH-ESE