September 11, 2009 ESTEC FPGA Workshop Xilinx Confidential • Unpublished Work © Copyright 2009
TMR in Virtex-4 and RHBD in Virtex-5
Current Status of Two Approaches to Attaining Robustness of Reconfigurable FPGA Applications in Upset Environments
Gary Swift and the Xilinx Radiation Test Consortium
Three Options for Dealing with Upsets
1. Do nothing - live with intrinsic upset rate
– For rFPGAs, not all config upsets yield errors
– However, nth error ‘breaks’ design (n≈10)
2. Upset mitigation - upsets ≠ errors
– Prevent a single upset from causing an error
– Prevent upset accumulation
3. Harden to upset - no upset = no error
Outline
XRTC (Xilinx Radiation Test Consortium) Background
– Voluntary membership
– Test types: static, dynamic, and mitigation
Design Robustness
– TMR (triple modular redundant) designs plus config management
– RHBD (rad-hard by design) fabric
Calculating Upset and Error Rates
– Ordinary (single-node) - SEFI example
– TMR case - application example
– Dual-node case - data-based example
XRTC Apparatus in Action
[Photo: testing at Texas A&M Cyclotron Institute in air; labels: BEAM, FPGA under test, Boeing, Xilinx Radiation Test Consortium]
XRTC Beam Tests
Static Results on V4-QV
– Config cells
– User BRAM & FFs
– Functional Upsets (aka SEFIs)
– Both Heavy Ions & Protons
Dynamic Results
– Digital Clock Managers
– Half Latches
Mitigation Campaign, Ongoing
[Figure: POR SEFI; cross section (cm²/device), 10^-8 to 10^-5, vs. effective LET (MeV-cm²/mg), 0 to 100; devices: SX55, FX60, LX200]
[Figure: Configuration Cells; cross section (cm²/bit), 10^-16 to 10^-13, vs. energy (MeV), 0 to 120; devices: SX55, FX60, LX200]
[Figure: Configuration Cells; cross section (cm²/bit), 10^-10 to 10^-7, vs. effective LET (MeV-cm²/mg), 0 to 120; devices: SX55, FX60, LX200]
from http://parts.jpl.nasa.gov/docs/NEPP07/NEPP07FPGAv4Static.pdf
Upset Mitigation
Redundancy – Extra information (bits) prevents all upsets from yielding system errors.
Scrubbing required – Accumulation of errors rapidly kills mitigation effectiveness.
Effective – Most spacecraft now fly large arrays of upset-soft memories with few or no errors. Typically, uncorrectable errors are detectable.
Upset Hardening – Two Basic Approaches
Both Approaches
- Add circuit elements to basic storage cell
- Increase storage cell stability
Approach 1 - Increase “critical charge” to upset
- Add passive element(s) into cell feedback path.
- Cell size increase may be small, but it’s slower
- Standard upset rate calculation does work
Approach 2 - Require charge in two nodes
- Add geometrically separated active elements.
- Standard upset rate calculation doesn’t work
Space Upset Rate Calculation
Involves three basic elements:
1. Upset susceptibility measurements: cross section vs. “effective” LET
2. Environment specification: integral flux vs. LET
3. Angular response model: RPP† chord length distribution, with one adjustable parameter: charge collection depth (aka funnel length)
† RPP = rectangular parallelepiped, i.e. a 3-D box
Simplifying concepts (or useful fictions)
Critical charge: If a node collects more charge than the critical amount, then the cell upsets.
Effective LET: An ion’s “effective” energy (or charge) deposition scales as 1/cos of the tilt angle (off normal incidence) at which it strikes.
RPP charge collection volume: All charge deposited in the RPP goes to the node, while all charge outside does not.
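The three simplifications above can be sketched in a few lines. This is a minimal illustration, assuming the standard definition of effective LET (normal-incidence LET divided by the cosine of the tilt angle) and the usual silicon conversion of roughly 10.35 fC of charge per micron per unit LET (MeV-cm²/mg). The function names and the depth and critical-charge values are hypothetical, chosen only for illustration.

```python
import math

# Approximate charge deposited in silicon: fC per micron per (MeV-cm2/mg).
FC_PER_UM_PER_LET = 10.35


def effective_let(let, tilt_deg):
    """Effective LET for an ion striking tilt_deg off normal incidence."""
    return let / math.cos(math.radians(tilt_deg))


def collected_charge_fc(let, tilt_deg, depth_um):
    """Charge (fC) deposited along the path through a thin collection volume.

    The path through a slab of thickness depth_um grows as 1/cos(tilt),
    which is equivalent to using the effective LET with the normal depth.
    """
    return FC_PER_UM_PER_LET * effective_let(let, tilt_deg) * depth_um


def upsets(let, tilt_deg, depth_um, q_crit_fc):
    """Critical-charge criterion: upset if collected charge exceeds q_crit_fc."""
    return collected_charge_fc(let, tilt_deg, depth_um) > q_crit_fc


# Hypothetical cell: 1 um collection depth, 20 fC critical charge.
print(upsets(1.0, 0.0, 1.0, 20.0), upsets(1.0, 60.0, 1.0, 20.0))
```

In this toy cell the same ion does not upset at normal incidence but does at a 60 degree tilt, because the effective LET doubles.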
Inherently, this is a “single node” calculation
Although a cell may contain multiple charge collection nodes capable of upsetting the cell, the charge collected depends only on the “tilt” angle and not on the rotational orientation.
[Figure: several ion trajectories, all with the same tilt angle but various rotation angles; labels: rotation, tilt]
Results for Virtex-4QV FPGAs in GEO
Configuration upsets: Less than twelve per day
SEFIs: About one per century
Processor Upset Rates – Mild Environment
for GEO:
Processor                 Hardening   Upset Rate (per year)   Ratio
BAE RAD750 (estimated)    RHBD        2.2                     x1
Maxwell SCS750            TMR*        1.1x10^-5               ÷200,000
Virtex II-Pro ePPC405     none        13                      x6
Notes:
Assumes 100% duty cycle on all bits (registers and L1 caches)
Environment = Galactic Cosmic Ray (GCR) background at solar minimum
Shielding = 100 mil Aluminum-equivalent
RHBD = radiation (upset) hardened by design
* Processor-level TMR, scrubbed ten times per second, with ~3% performance hit
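The Ratio column can be reproduced from the listed rates. The short sketch below assumes the ratios are taken relative to the RAD750 rate; the helper name is hypothetical.

```python
# Upset rates per year in GEO, as listed in the table above.
rates_per_year = {
    "BAE RAD750 (estimated)": 2.2,
    "Maxwell SCS750": 1.1e-5,
    "Virtex II-Pro ePPC405": 13.0,
}


def ratio_to_baseline(rates, baseline):
    """Return each rate divided by the baseline processor's rate."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items()}


ratios = ratio_to_baseline(rates_per_year, "BAE RAD750 (estimated)")
for name, ratio in ratios.items():
    # Show ratios below one as a divisor, matching the table's notation.
    label = f"x{ratio:.1f}" if ratio >= 1 else f"÷{1 / ratio:,.0f}"
    print(f"{name}: {label}")
```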
Limits of Upset Mitigation
Common sense says – At some point, upsets will occur too rapidly and the mitigation will be “overwhelmed.”
In fact, Edmonds approx. equation says – There’s not really a “cliff.” The relationships are known; the error rate:
(1) increases with the square of upset rate
(2) decreases linearly with faster scrub rates
(3) is directly proportional to EDAC word size†
† EDAC word size = data bits + check bits; EDAC = error detection and correction
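The three relationships can be illustrated with a generic double-error estimate for a scrubbed, single-error-correcting memory. This is not Edmonds' equation itself, only a textbook-style approximation that assumes independent upsets, periodic scrubbing, and an upset probability per bit per scrub period much smaller than one. The function and parameter names are hypothetical.

```python
from math import comb


def edac_error_rate(total_bits, word_bits, r, scrub_period):
    """Approximate uncorrectable-error rate (errors/second).

    A word fails when two of its bits upset within one scrub period.
    Per word and per period that probability is about C(n, 2) * (r*T)**2,
    so the rate per word is C(n, 2) * r**2 * T.
    """
    words = total_bits / word_bits
    return words * comb(word_bits, 2) * r**2 * scrub_period


# Hypothetical 1 Gbit memory with 39-bit words (32 data + 7 check bits).
base = edac_error_rate(total_bits=2**30, word_bits=39, r=1e-10, scrub_period=1.0)
double_r = edac_error_rate(2**30, 39, 2e-10, 1.0)       # upset rate doubled
faster_scrub = edac_error_rate(2**30, 39, 1e-10, 0.5)   # scrubbed twice as often
wider_word = edac_error_rate(2**30, 72, 1e-10, 1.0)     # 64 data + 8 check bits
print(double_r / base, faster_scrub / base, wider_word / base)
```

In this approximation the rate for a fixed total memory size scales with (word size minus one), which is close to proportional for realistic word sizes.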
Edmonds TMR Equation
Approximation when r (upset rate) is small:
$R \approx 3\,M\,T_C\,(M_2\,r)^2$
R = system error rate
M = total modules
T_C = scrub time
r = underlying upset rate
M_2 = “fitting” parameter, the root mean square module size
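One way to see why the fitting parameter comes out as a root-mean-square module size is the heuristic sketch below. It assumes the small-r form reads R ≈ 3 M T_C (M_2 r)^2, that scrubbing is periodic with period T_C, that upsets are independent, and that every one of the N_i bits in a module is sensitive. It is an illustration under those assumptions, not Edmonds' own derivation.

```latex
% Stage i has three redundant copies, each with N_i bits upsetting at rate r.
% Probability that a given copy is upset within one scrub period:
P_i \approx N_i\, r\, T_C
% An error needs two of the three copies upset in the same period (3 pairs):
P_{\mathrm{err},i} \approx 3\,(N_i\, r\, T_C)^2
% Dividing by T_C gives a rate; summing over the M stages:
R \approx \sum_{i=1}^{M} \frac{3\,(N_i\, r\, T_C)^2}{T_C}
  = 3\, T_C\, r^2 \sum_{i=1}^{M} N_i^2
  = 3\, M\, T_C\, (M_2\, r)^2,
\qquad M_2 = \sqrt{\frac{1}{M}\sum_{i=1}^{M} N_i^2}
```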
Single-String Design
Conceptually, a design is a string of logic blocks (sequential or combinational) bounded by feedback loops.
[Diagram labels: In, Functional Block, Out, Feedback]
TMR Design
Feedback from the voters corrects state errors inside blocks
[Diagram labels: Functional Block, Voter, Feedback; three redundant copies A, B, C]
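The voter-with-feedback idea can be sketched in software. This is a simplified illustration, not the Xilinx-style netlist: it uses a hypothetical triplicated 8-bit counter and a single voting function, whereas a real TMR design triplicates the voters as well.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority vote."""
    return (a & b) | (a & c) | (b & c)


def tmr_counter_step(state):
    """Advance a triplicated 8-bit counter by one clock.

    state is a tuple of the three redundant register values (A, B, C).
    Each copy computes its next value from the voted state, so a copy
    whose register was upset is corrected on the next clock.
    """
    voted = majority(*state)
    next_value = (voted + 1) & 0xFF
    return (next_value, next_value, next_value)


# Flip one bit in copy B only: the voted value is unaffected and the
# feedback path restores all three copies to the same state.
upset_state = (5, 5 ^ 0x10, 5)
print(tmr_counter_step(upset_state))
```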
TMR prevents almost all errors
Single upsets cannot cause errors
Error propagation requires upsets in two parallel modules (within a scrub cycle).
[Diagram: upsets marked X in several modules; caption: Multiple upsets but no error]
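The error condition can be written as a small check. This sketch assumes a worst case in which any upset inside a copy corrupts that copy's output; the representation of upsets as (stage, copy) pairs is hypothetical.

```python
from collections import defaultdict


def tmr_has_error(upsets):
    """Return True if any stage has upsets in two or more of its three copies.

    upsets is an iterable of (stage_index, copy_name) pairs accumulated
    within one scrub cycle; copy_name is "A", "B", or "C".
    """
    copies_hit = defaultdict(set)
    for stage, copy in upsets:
        copies_hit[stage].add(copy)
    return any(len(copies) >= 2 for copies in copies_hit.values())


# Four upsets spread over different stages: outvoted, no error.
print(tmr_has_error([(0, "A"), (1, "B"), (2, "C"), (3, "A")]))
# Two upsets in parallel copies of the same stage: the voter is defeated.
print(tmr_has_error([(0, "A"), (1, "B"), (1, "C")]))
```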
Designer’s TMR “Burden”
Run the working single-string design through the TMRtool to obtain the correct Xilinx-style triplicated and voted design.
[Diagram label: TMRtool]
Example App - XQR2V6000 BRAM Scrubber
[Figure: BRSCRUB; R (system errors/second), 0.001 to 1000, vs. r (bit errors/bit-second), 1e-6 to 1e-1; series: data, general approximation, small-r form extrapolated]
Given parameters: T = 2 ms, M = 48000
Fit parameter: M2 = M3 = M4 = 250
Extrapolating to Space Rates
[Figure: BRSCRUB; R (system errors/second) vs. r (bit errors/bit-second), both axes extended down to 1e-10; series: data, general approximation, small-r form extrapolated]
GEO upset rate of ≈10^-10 per bit-sec. gives error rate of < one per millennium.
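The extrapolation can be checked numerically. The sketch below assumes the small-r form is R = 3 M T_C (M_2 r)^2 and plugs in the parameters quoted for the BRAM scrubber example (T = 2 ms, M = 48000, M2 = 250); the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def tmr_error_rate(r, modules, scrub_time, rms_module_size):
    """Small-r form, assumed here to be R = 3 * M * T_C * (M_2 * r)**2."""
    return 3 * modules * scrub_time * (rms_module_size * r) ** 2


# GEO upset rate of about 1e-10 per bit-second, example design parameters.
R_geo = tmr_error_rate(r=1e-10, modules=48000, scrub_time=2e-3, rms_module_size=250)
errors_per_millennium = R_geo * 1000 * SECONDS_PER_YEAR
print(f"R = {R_geo:.2e} errors/s, {errors_per_millennium:.2e} per millennium")
```

Under these assumptions the result is of order 1e-13 errors per second, comfortably below one per millennium.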
V4 DCM Dynamic Results
All DCM failures were fixed by a DCM reset, by re-writing settings through the DRP, or by scrubbing with GLUTMASK disabled.
GLUTMASK enabled during beam
Greg Allen
V4 DCM Mitigation Results
“per DCM” means “per DCM triplet”
Greg Allen
V4-QV TMR-Counter Results
37% full, no single points-of-failure
[Figure: system error rate (errors/sec), 1.E-04 to 1.E+01, vs. raw bit flip rate (upsets/bit-sec), 1.E-07 to 1.E-03]
Kevin Heldt & Scott A. Anderson
Geometrical RHBD is a two-node problem
Upsetting a cell requires some charge collection at both nodes of a pair: if one node collects no charge, the cell will not upset no matter how much charge is collected at the other.
A cell may contain one or several such pairs, but the two nodes of a given pair must be as widely separated as possible.
Two-node case makes rotation important
The more an ion trajectory aligns with the line defined by the two nodes, the more likely it is to be able to cause an upset.
[Figure: for a given tilt, different rotation angles give more or less alignment with the line through the nodes; labels: rotation, tilt]
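The geometric part of this statement can be sketched directly. The example below only measures how well a track lines up with the node pair; it is not the charge collection model used for the data fits. It assumes the node pair lies in the die plane along a hypothetical axis at rotation angle `node_axis_deg`.

```python
import math


def alignment(tilt_deg, rotation_deg, node_axis_deg=0.0):
    """|cos| of the angle between the ion track and the node-pair line.

    Tilt is measured from the die normal and rotation about it. The node
    pair is assumed to lie in the die plane along node_axis_deg.
    Returns 1.0 when the track runs along the node line and 0.0 when it
    is perpendicular to it.
    """
    tilt = math.radians(tilt_deg)
    rot = math.radians(rotation_deg - node_axis_deg)
    # Track direction (sin t cos p, sin t sin p, cos t) dotted with (1, 0, 0).
    return abs(math.sin(tilt) * math.cos(rot))


# Same 75 degree tilt, three different rotation angles.
for rotation in (90, 135, 180):
    print(rotation, round(alignment(75, rotation), 3))
```

At normal incidence the alignment is zero for every rotation angle, which is why rotation only starts to matter as the tilt increases.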
Model to Guide Data Fits
Model necessary because ‘brute force’:
– requires too much data.
– needs extrapolation to impossible tilts (90º).
Model assumes existence of a charge collection efficiency function with ellipsoidal volumes (like rounded RPPs).
Many (8) fitting parameters in current model:
– two (A, B) relate to ellipsoid shape
– four: LET threshold and sat. cross section per node
– plus two others
Directional Upset Response
Clearly there is a strong dependence on rotation angle:
[Figure: directional cross section (µm²/bit), 0.00 to 0.04, vs. rotation angle (degrees), 90 to 450; SRAM 6: 65nm feature size, 2µm epi, 2µm spacing; all 1's pattern, LET = 3.40, tilt = 75 degrees; series: data, worst-case model, nominal model]
Necessary Extrapolation
Note factor of 4 or 5
[Figure: directional cross section (µm²/bit), 0.00 to 0.16, vs. tilt angle (degrees), 0 to 90; SRAM 6: 65nm feature size, 2µm epi, 2µm spacing; all 1's pattern, LET = 3.40, rotation = 180 degrees; series: data, worst-case model, nominal model]
Model Results
GEO rate for ones is <9E-10 upsets per bit-day.
GEO rate for zeros is <9E-11 upsets per bit-day.
Typical design has more than 90% zeros and takes about ten (or more) upsets to cause an error:
GEO rate for typical design is <2E-11 errors per bit-day, or approx. one every 2 years.
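The step from the per-pattern upset rates to the typical-design error rate is a weighted average followed by a division. The sketch below assumes exactly 90% zeros and exactly ten upsets per error, the limiting values quoted above; the function names and the bit count in the final call are hypothetical.

```python
def typical_design_error_rate(rate_ones, rate_zeros, fraction_zeros, upsets_per_error):
    """Pattern-weighted upset rate divided by the upsets needed per error."""
    upset_rate = fraction_zeros * rate_zeros + (1.0 - fraction_zeros) * rate_ones
    return upset_rate / upsets_per_error


rate = typical_design_error_rate(
    rate_ones=9e-10,       # upper bound, upsets per bit-day
    rate_zeros=9e-11,      # upper bound, upsets per bit-day
    fraction_zeros=0.90,
    upsets_per_error=10,
)
print(f"{rate:.2e} errors per bit-day")


def years_between_errors(rate_per_bit_day, n_bits):
    """Mean time between errors for a device with n_bits configuration bits."""
    return 1.0 / (rate_per_bit_day * n_bits * 365.25)


# Hypothetical device size, for illustration only.
print(f"{years_between_errors(2e-11, 50_000_000):.1f} years")
```

The weighted result is about 1.7E-11 errors per bit-day, consistent with the quoted bound of <2E-11.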
Data & Model for an LET Sweep
Good agreement at worst rotation:
[Figure: directional cross section (µm²/bit), 0.0001 to 1, vs. LET (MeV-cm²/mg), 0.1 to 10; SRAM 6: 65nm feature size, 2µm epi, 2µm spacing; all 1's pattern, tilt = 75, rotation = 180 degrees; series: data, model (conservative), model (nominal)]
Average Cross Sections
… are useful for ‘estimating’ rates via standard calculation
[Figure: directional-average cross section (cm²/bit), 1e-14 to 1e-8, vs. LET (MeV-cm²/mg), 0.1 to 100; SRAM6: 65nm feature size, 2µm epi, 2µm spacing, all 1's pattern; series: model (conservative), model (nominal)]
Preliminary Results
Energy (MeV/u)   Ion   Eff. LET (MeV-cm^2/mg)   Flux (ions/cm^2/s)   Fluence (ions/cm^2)   Resets (events/device)   Runaways (events/device)
15               Au    90.1                     1.340E+03            7.271E+05             17                       1*
24.8             Xe    61.6                     9.870E+03            5.500E+06             38                       0
24.8             Ne    1.9                      1.000E+05            2.000E+08             18                       0
* - due to SEFI
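Each run converts to a device cross section by dividing the event count by the fluence. The sketch below applies that to the reset counts listed above; the variable and function names are hypothetical.

```python
runs = [
    # (ion, effective LET in MeV-cm2/mg, fluence in ions/cm2, resets)
    ("Au", 90.1, 7.271e5, 17),
    ("Xe", 61.6, 5.500e6, 38),
    ("Ne", 1.9, 2.000e8, 18),
]


def cross_section(events, fluence):
    """Cross section in cm2/device: observed events divided by fluence."""
    return events / fluence


reset_xs = {ion: cross_section(resets, fluence) for ion, _, fluence, resets in runs}
for ion, xs in reset_xs.items():
    print(f"{ion}: {xs:.2e} cm2/device")
```

With only 17 to 38 events per run, the statistical uncertainty on each point is substantial, in line with the "preliminary" label.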
Preliminary Results – Resets
[Figure: SIRF µBlaze Resets; cross section (resets/device), 1e-8 to 1e-4, vs. LET (MeV-cm²/mg), 0 to 100; series: data, fit using Lth = 0.1, S = 1.43, W = 4620, Sigmasat = 5.2E-3]
Note: Weibull fit is just a guide for the eye
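The fit curve can be evaluated with the standard four-parameter Weibull form used for cross-section data. The sketch below assumes the quoted S and W are the shape and width parameters of that form and that Sigmasat is in cm² per device; the function name is hypothetical.

```python
import math


def weibull_xs(let, l_th, shape, width, sigma_sat):
    """Four-parameter Weibull cross-section curve.

    sigma(L) = sigma_sat * (1 - exp(-((L - l_th) / width) ** shape)) above
    threshold, and zero at or below it.
    """
    if let <= l_th:
        return 0.0
    x = ((let - l_th) / width) ** shape
    return sigma_sat * -math.expm1(-x)  # expm1 keeps precision for tiny x


# Fit parameters quoted on the plot.
fit = dict(l_th=0.1, shape=1.43, width=4620, sigma_sat=5.2e-3)
for let in (1.9, 61.6, 90.1):
    print(f"LET {let}: {weibull_xs(let, **fit):.2e}")
```

Because the width parameter is far larger than any tested LET, the curve is still well below saturation across the whole measured range.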
MicroBlaze Results
TMRed in V4
Theoretically, TMRed MicroBlaze in V4 will extrapolate to a lower system error rate in space than single-string in RHBD V5, but SEFI performance makes RHBD V5 better overall.
Analysis in progress
Mike Pratt
Maximum Robustness Conclusion
Single-FPGA design robustness is limited by the SEFI rate:
– Approx. 1 per century in GEO for V4
– Approx. 1 per 100 centuries in GEO for RHBD V5
Properly TMRed Virtex 4-QV designs, i.e. having no single points of failure, extrapolate to an upset-induced system error rate lower than the SEFI rate
– 100 bits that are single points of failure translate to a system error rate of about one per century in GEO
Not good enough? Fly through SEFIs by using three FPGAs
– See XAPP987
Assuming a SEFI outage lasts one second, then one per century is better than 9 nines of availability (unavailability of about 3×10^-10).
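The availability figure follows from simple arithmetic. The sketch below assumes a one-second outage and a Julian year of 365.25 days; the function name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def unavailability(outage_seconds, events_per_century):
    """Fraction of time lost to outages."""
    return outage_seconds * events_per_century / (100 * SECONDS_PER_YEAR)


u = unavailability(outage_seconds=1.0, events_per_century=1.0)
nines = -math.log10(u)
print(f"unavailability = {u:.2e}, about {nines:.1f} nines")
```

One second lost per century is an unavailability of roughly 3.2e-10, a bit over nine nines; a longer assumed outage lowers the figure proportionally.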
Conclusions - RHBD vs. TMR
Both can yield good system robustness.
– TMR
• Requires designer involvement
• Costs times 3+ in gates and power
• Extrapolation required for space error rates
– RHBD
• Transparent to the designer
• Requires extra silicon area
• Extrapolation required for space error rates
• Potentially more robust in “extreme” environments
BACKUP MATERIAL
Processor Upset Rates – Severe Environment
for JPL Design Case Flare (DCF):
Processor                 Upset Rate (per flare)   Ratio
BAE RAD750 (estimated)    6.6*                     x3
Maxwell SCS750            0.36**                   ÷6
Virtex II-Pro ePPC405     85*                      x40
Notes:
Assumes 100% duty cycle on all bits (registers and L1 caches)
Environment ≅ actual events in October 1989 and January 1972
Shielding = 100 mil Aluminum-equivalent
* Upsets from heavy ions only; proton upsets insignificant or neglected
** Includes 0.14 /flare from protons
TMR System Model
[Diagram: a string of M stages, each with functional block i and voter i in three redundant copies; per-stage bit counts N1 … Ni-1, Ni, Ni+1 … NM; feedback to stage i+1]
Ni = # of “care” bits in each block and voter
M = # of modules
TC = scrub time
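The per-stage care-bit counts Ni connect to the “root mean square module size” fitting parameter of the Edmonds TMR equation. The sketch below computes that RMS for a made-up design; the stage sizes are hypothetical and serve only to show that a few large modules dominate the RMS.

```python
import math


def rms_module_size(care_bits):
    """Root-mean-square of the per-stage care-bit counts N_i."""
    return math.sqrt(sum(n * n for n in care_bits) / len(care_bits))


def mean_module_size(care_bits):
    """Arithmetic mean of the per-stage care-bit counts N_i."""
    return sum(care_bits) / len(care_bits)


# Hypothetical design: 90 small stages and 10 large ones.
care_bits = [100] * 90 + [1000] * 10
print(f"mean = {mean_module_size(care_bits):.0f} bits, "
      f"RMS = {rms_module_size(care_bits):.0f} bits")
```

Because the system error rate goes with the square of module size, splitting large blocks with additional voters lowers the RMS and therefore the error rate for the same total logic.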