Brigham Young University
BYU ScholarsArchive
Theses and Dissertations
2011-02-11
Analysis and Mitigation of SEU-induced Noise in FPGA-based DSP Systems
Brian Hogan Pratt
Brigham Young University - Provo
Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Electrical and Computer Engineering Commons
BYU ScholarsArchive Citation
Pratt, Brian Hogan, "Analysis and Mitigation of SEU-induced Noise in FPGA-based DSP Systems" (2011). Theses and Dissertations. 2482. https://scholarsarchive.byu.edu/etd/2482
This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected], [email protected].
Analysis and Mitigation of SEU-induced Noise
in FPGA-based DSP Systems
Brian H. Pratt
A dissertation submitted to the faculty of Brigham Young University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Michael J. Wirthlin, Chair
Brent E. Nelson
Michael D. Rice
David A. Penry
Doran K. Wilde
Department of Electrical and Computer Engineering
Brigham Young University
April 2011
Copyright © 2011 Brian H. Pratt
All Rights Reserved
ABSTRACT
Analysis and Mitigation of SEU-induced Noise
in FPGA-based DSP Systems
Brian H. Pratt
Department of Electrical and Computer Engineering
Doctor of Philosophy
This dissertation studies the effects of radiation-induced single-event upsets (SEUs) on digital signal processing (DSP) systems designed for field-programmable gate arrays (FPGAs). It presents a novel method for evaluating the effects of radiation on DSP and digital communication systems. By using an application-specific measurement of performance in the presence of SEUs, this dissertation demonstrates that only 5–15% of SEUs affecting a communications receiver (i.e. 5–15% of sensitive SEUs) cause critical performance loss. It also reports that the most critical SEUs are those that affect the clock, global reset, and most significant bits (MSBs) of computation.
This dissertation also demonstrates reduced-precision redundancy (RPR) as an effective and efficient alternative to the popular triple modular redundancy (TMR) for FPGA-based communications systems. Fault injection experiments show that, by focusing on the critical SEUs, RPR can improve the failure rate of a communications system by more than 20 times relative to the unmitigated system at a cost less than half that of TMR. This dissertation contrasts the cost and performance of three different variations of RPR, one of which is a novel variation developed here, and concludes that the variation referred to as “Threshold RPR” is superior to the others for FPGA systems. Finally, this dissertation presents several methods for applying Threshold RPR to a system with the goal of reducing mitigation cost and increasing the system performance in the presence of SEUs. Additional fault injection experiments show that optimizing the application of RPR can result in a decrease in critical SEUs by as much as 65% at no additional hardware cost.
Keywords: FPGA, reliability, single-event upset, radiation effects, triple modular redundancy, reduced-precision redundancy, digital signal processing, digital communications
ACKNOWLEDGMENTS
This dissertation is the result of several years of hard work and wouldn’t have been
possible without the support of many people, to whom I am very grateful.
First and foremost, I would like to thank my family. My wife Aubrey has been my
inspiration and my best friend throughout the years of my studies. I thank her and our
daughter Celeste for their support, encouragement, and patience. I am also grateful to my
parents for the great start in life and for all the advice and support they have given me over
the years.
My professors at BYU assisted me in this work in many ways. I would like to thank
my advisor, Dr. Michael Wirthlin, for his guidance during this process. He helped me find a
path of research that I am excited to share and encouraged me when things didn’t go quite
as planned. Drs. Brent Nelson and Michael Rice were also great sources of assistance as I
planned what to do and how to do it.
I could not have made the contributions I have without the support and past
research of many BYU students. In particular, I’d like to thank Nathan Rollins, Jonathan
Johnson, Megan Fuller, Jon-Paul Anderson, Chris Lavin, Marc Padilla, William Howes,
Derrick Gibelyou, Keith Morgan, Daniel McMurtrey, and Eric Johnson.
I also would like to acknowledge the major sources of funding for the research that
went into this dissertation. The ISR division at Los Alamos National Laboratory has been a
longtime supporter of this and other work in FPGA reliability done at BYU. This research
was also supported by the I/UCRC Program of the National Science Foundation under Grant
No. 0801876 through the NSF Center for High-Performance Reconfigurable Computing
(CHREC).
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2 Radiation Effects and Mitigation on FPGAs . . . . . . . . . . . . 7
2.1 Single Event Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Types of Single Event Effects . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 SEE within ASICs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 SEE within FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 SEUs on SRAM-based FPGAs . . . . . . . . . . . . . . . . . . . . . . 10
2.2 SEU Mitigation for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Configuration Scrubbing . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Redundancy Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Triple Modular Redundancy . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Application Specific Fault Tolerance . . . . . . . . . . . . . . . . . . 17
2.3 Evaluating FPGA Design Reliability . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Fault Injection Experiments . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Failure Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 3 Evaluating the Performance of FPGA-based DSP Systems in
the Presence of SEUs . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Reliability Analysis of DSP Systems . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Reliability Analysis of Communications Systems . . . . . . . . . . . . . . . . 31
3.3 Application-Specific Fault Injection . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Fault Injection for Communications Systems . . . . . . . . . . . . . . . . . . 34
3.5 Feed-forward System Experiments . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.1 Experimental Configuration . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.3 Application-Specific Failure Rate . . . . . . . . . . . . . . . . . . . . 42
3.6 Recursive System Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6.1 Experimental Configuration . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 4 Reduced Precision Redundancy . . . . . . . . . . . . . . . . . . . . 49
4.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Protecting Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 RPR Upset Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Bit-width Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.1 General Bit-width Selection . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.2 RPR Bit-widths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 RPR Decision Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 RPR Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7.1 Experimental Configuration . . . . . . . . . . . . . . . . . . . . . . . 64
4.7.2 Mitigation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.7.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 5 Comparison of RPR Variations . . . . . . . . . . . . . . . . . . . . 73
5.1 RPR Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.1 Threshold RPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1.2 Bounded RPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.3 RP-TMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 RPR Variation Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Decision Block Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Reduced-precision Module Implementation . . . . . . . . . . . . . . . 87
5.2.3 Upset Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.4 Error Detection Limits . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.5 Suitability for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3 Fault Injection Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Experimental Configuration . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Chapter 6 Application of Threshold RPR . . . . . . . . . . . . . . . . . . . . . 103
6.1 Threshold Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1.1 Average Threshold RPR Noise Limit . . . . . . . . . . . . . . . . . . 104
6.1.2 Reduction of Th . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.1.3 Experimental Determination of Th . . . . . . . . . . . . . . . . . . . . 109
6.1.4 Reduced Threshold Experiments . . . . . . . . . . . . . . . . . . . . 112
6.2 Bit-width Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.1 Bit-width Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2.2 General Bit-width Selection . . . . . . . . . . . . . . . . . . . . . . . 117
6.2.3 Bit-width Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 RPR System Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.1 RPR Decision Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3.2 Mixing RPR with TMR . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3.3 System Mitigation Design . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3.4 Recursive System Experiments . . . . . . . . . . . . . . . . . . . . . . 132
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Chapter 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
ACRONYMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
GLOSSARY OF TERMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Appendix A Fault Injection Experiment Configuration . . . . . . . . . . . . 159
A.1 Sensitivity Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
A.2 Bit Error Rate Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Appendix B Sample Noise Data . . . . . . . . . . . . . . . . . . . . . . . . . . 165
B.1 FIR Filter Estimation Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
B.2 SEU-Induced Noise Probability Mass Functions . . . . . . . . . . . . . . . . 166
B.3 SEU-Induced Noise Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Appendix C RPR Comparison Designs . . . . . . . . . . . . . . . . . . . . . . 175
C.1 General Filter Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
C.2 System Generator FIR Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
C.3 VHDL FIR Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Appendix D RPR Decision Blocks . . . . . . . . . . . . . . . . . . . . . . . . . 179
D.1 Decision Block Area Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
D.1.1 Threshold RPR Decision Block . . . . . . . . . . . . . . . . . . . . . 179
D.1.2 Bounded RPR Decision Block . . . . . . . . . . . . . . . . . . . . . . 181
D.1.3 RP-TMR Decision Block . . . . . . . . . . . . . . . . . . . . . . . . . 181
D.2 RPR Decision Block Placement . . . . . . . . . . . . . . . . . . . . . . . . . 182
D.3 Triplicated Decision Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Appendix E Component Utilization Tables . . . . . . . . . . . . . . . . . . . . 187
Appendix F On-Orbit Experiments . . . . . . . . . . . . . . . . . . . . . . . . 193
F.1 MISSE-8 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
F.2 CFE Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
LIST OF TABLES
2.1 Orbit characteristics and composite upset rates for the Xilinx Virtex-4 SX-55
FPGA from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Sensitivity of some simple designs and the Virtex-4 SX-55 device on which
they were implemented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Failure rates (λ) in various orbits for some simple designs and the Virtex-4
SX-55 device on which they were implemented. . . . . . . . . . . . . . . . . 24
2.4 Number of “nines” in the steady-state availability (As) of some sample designs
in terms of sensitive upsets with a scrubbing interval of 100 ms. . . . . . . . 25
3.1 Number of SEUs causing each class of effect for several designs. . . . . . . . 42
3.2 Percentage of SEUs causing certain SNR losses at BER of 10⁻⁵. . . . . . . . 42
3.3 Sensitive failure rates (λ) for several designs in various orbits. . . . . . . . . 43
3.4 Catastrophic failure rates (λ) for several designs in various orbits. . . . . . . 44
3.5 Number of SEUs causing each class of effect for the binary PAM demodulator
with timing synchronization. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Percentage of SEUs causing certain SNR losses at BER of 10⁻⁵ for the binary
PAM demodulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 Sensitive failure rates (λ) for the recursive demodulator design in various
orbits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8 Catastrophic failure rates (λ) for the recursive demodulator design in various
orbits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Summary of the possible upset cases for a general RPR module. . . . . . . . 58
4.2 Fault injection results for three FIR filter designs protected with RPR and
TMR, compared against the unmitigated filters. . . . . . . . . . . . . . . . 67
4.3 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for an FIR filter protected with RPR and TMR compared against the
unmitigated filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Comparison of the error signals and bounds of three variations of RPR for
each RPR upset case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Number of SEUs causing each class of effect for the FIR filter protected with
full TMR and Threshold RPR, compared against the unmitigated filter. . . 97
5.3 Number of SEUs causing each class of effect for the FIR filter protected with
full TMR and Bounded RPR, compared against the unmitigated filter. . . 97
5.4 Number of SEUs causing each class of effect for the FIR filter protected with
full TMR and RP-TMR, compared against the unmitigated filter. . . . . . 97
6.1 Pfp values for a Gaussian-distributed εe signal. . . . . . . . . . . . . . . . . . 107
6.2 Mathematical (Th) vs. experimental (T*h) threshold values for RPR FIR filter
designs with several different reduced-precision bit-widths (Br). The mean
(µe) and standard deviation (σe) values for the signal εe are also shown. . . . 111
6.3 Number of SEUs causing each class of effect for an FIR filter protected with
TMR and several levels of Threshold RPR using experimentally-determined
thresholds (T*h), compared to mathematically-determined thresholds (Th). . . 112
6.4 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for an FIR filter protected with several levels of Threshold RPR, comparing
the use of experimentally-determined thresholds (T*h) and
mathematically-determined thresholds (Th). . . . . . . . . . . . . . . . . . . . 113
6.5 Detection factor (a) for an FIR filter protected with several levels of Threshold
RPR, comparing the use of experimentally-determined thresholds (T*h) with
mathematically-determined thresholds (Th) at an SNR of 8 dB. . . . . . . . 113
6.6 Number of SEUs causing each class of effect for an FIR filter protected with
TMR and several levels of Threshold RPR using experimentally-determined
thresholds (T*h), compared to the unmitigated filter. . . . . . . . . . . . . . 121
6.7 Estimated cost of several 4-tap FIR filter circuits protected with RPR. . . . 124
6.8 Number of SEUs causing each class of effect for the binary PAM demodulator
protected with full TMR and RPR+TMR, compared to the unmitigated
demodulator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.9 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for the binary PAM demodulator protected with full TMR and RPR+TMR. 137
A.1 Fault injection run times for each SNR input value. . . . . . . . . . . . . . . 164
C.1 Number of SEUs causing each class of effect for the FIR filter design with
α = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
C.2 Percentage of SEUs causing certain SNR losses at BER of 10⁻⁵ for the FIR
filter design with α = 0.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
C.3 Sensitive failure rates (λ) for the FIR filter design in various orbits. . . . . . 177
C.4 Catastrophic failure rates (λ) for the FIR filter design in various orbits. . . . 177
E.1 Resource utilization for two-input adder modules with a range of input bit-
widths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
E.2 Resource utilization for single-input (constant coefficient) multiplier modules
with a range of input bit-widths. . . . . . . . . . . . . . . . . . . . . . . . . 188
E.3 Resource utilization for two-input multiplier modules with a range of input
bit-widths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
E.4 Resource utilization for FIR filter modules with a range of input bit-widths. 189
E.5 Resource utilization for standard Threshold RPR decision modules with 17-bit
full-precision input and a range of reduced-precision input bit-widths. . . . . 190
E.6 Resource utilization for Shim’s optimized Threshold RPR modules with 17-bit
full-precision input and a range of reduced-precision input bit-widths. . . . . 190
E.7 Resource utilization for Bounded RPR decision modules with 17-bit full-
precision input and a range of reduced-precision input bit-widths. . . . . . . 191
E.8 Resource utilization for TMR voter modules in a range of input bit-widths. . 191
F.1 Results of the CFE RPR Test . . . . . . . . . . . . . . . . . . . . . . . . . . 197
LIST OF FIGURES
2.1 (a) An abstraction of an FPGA logic cell with 1’s and 0’s representing the
contents of the configuration memory and the red indicating the routing and
functions implemented, (b) an upset in a LUT module, and (c) an upset in
the routing matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Sample of the reliability over time, R(t), of a TMR system with and without
repair, compared to an unmitigated system [2]. . . . . . . . . . . . . . . . . . 16
2.3 Simplified block diagram of an FIR filter protected with triple modular
redundancy (TMR). The portion surrounded by the dotted box is implemented
on the FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Fault injection of an FIR filter using two FPGAs. . . . . . . . . . . . . . . . 20
2.5 The exhaustive fault injection flow described in [3]. . . . . . . . . . . . . . . 21
2.6 The continuous-time reliability function for the FIR Filter design in a GPS
orbit assuming an exponential fault distribution. . . . . . . . . . . . . . . . . 26
3.1 (a) Model of a DSP system with an additive noise component and (b) the
same system with an additional SEU-induced noise component. . . . . . . . 30
3.2 Bit error rate curves for several phase-shift keying (PSK) communications
systems with an AWGN channel. . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 A fault injection flow for general DSP systems. . . . . . . . . . . . . . . . . . 33
3.4 A fault injection flow for communications systems. . . . . . . . . . . . . . . . 33
3.5 Model of a binary pulse amplitude modulation (PAM) communications
system with an AWGN channel. . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 A high-level block diagram of the receiver system. . . . . . . . . . . . . . . . 36
3.7 The FIR filter structures examined in the fault injection experiments: (a)
direct form 1 FIR filter; (b) transposed direct form 1 FIR filter. . . . . . . . 37
3.8 BER plot showing representative samples from each of the four error classes
from the 16-bit logic-based FIR filter with α = 1.0. . . . . . . . . . . . . . . 38
3.9 BER plot for the 16-bit logic-based FIR filter with α = 1.0. . . . . . . . . . . 40
3.10 BER plot for the 16-bit logic-based FIR filter with α = 0.25. . . . . . . . . . 40
3.11 BER plot for the 8-bit logic-based FIR filter with α = 1.0. . . . . . . . . . . 40
3.12 BER plot for the 8-bit logic-based FIR filter with α = 0.25. . . . . . . . . . . 40
3.13 BER plot for the 16-bit DSP48-based FIR filter with α = 1.0. . . . . . . . . 41
3.14 BER plot for the 16-bit DSP48-based FIR filter with α = 0.25. . . . . . . . . 41
3.15 Block diagram of the binary PAM demodulator with timing synchronization. 44
3.16 BER plot for the unmitigated binary PAM receiver system with timing
synchronization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Simplified block diagram of a module protected with reduced-precision
redundancy (RPR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Simplified block diagram of a module protected with reduced-precision
redundancy (RPR) designed for soft error environments. . . . . . . . . . . . 53
4.3 Block diagram of an 8-bit register holding a fractional fixed-point number. . 55
4.4 Truncation of a fixed-point binary number to several levels of precision. . . . 61
4.5 Simplified block diagram of a 16-bit FIR filter protected with Threshold
RPR using two 8-bit filters as the reduced precision modules. . . . . . . . . . 65
4.6 BER plot for the 16-bit logic-based FIR filter with α = 1.0 with RPR using
two 8-bit reduced-precision filter replicas. . . . . . . . . . . . . . . . . . . . . 69
4.7 BER plot for the 16-bit logic-based FIR filter with α = 0.25 with RPR using
two 8-bit reduced-precision filter replicas. . . . . . . . . . . . . . . . . . . . . 69
4.8 BER plot for the 16-bit DSP Block-based FIR filter with α = 1.0 with RPR
using two 8-bit reduced-precision filter replicas. . . . . . . . . . . . . . . . . 70
5.1 Simplified block diagram of an n-bit (B = n) full-precision module protected
with Threshold RPR using two k-bit (Br = k) reduced-precision modules,
where k < n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Block diagram of a Threshold RPR decision block. . . . . . . . . . . . . . . 76
5.3 Block diagram of an optimization of the Threshold RPR decision block
suggested by Shim [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Simplified block diagram of a full-precision module protected with Bounded
RPR using upper-bound and lower-bound reduced precision modules. . . . . 79
5.5 Error cases for Bounded RPR, modified from [5]. Categorized in rows by the
location of the error and in columns by the response to each type of event. . 79
5.6 Block diagram of a Bounded RPR decision block. Sign extensions, where
necessary, are not shown in this diagram. . . . . . . . . . . . . . . . . . . . . 81
5.7 Simplified block diagram of an n-bit full-precision module protected with
Bounded RPR using an add-and-subtract-threshold method of bounding the
full-precision output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.8 Block diagram of an 8-bit register protected by RP-TMR. For simplicity, the
register inputs are not shown. The three clock domains are indicated by
dotted lines and are labeled clk1, clk2, and clk3. . . . . . . . . . . . . . . 83
5.9 Block diagram of an 8-bit adder protected by RP-TMR. For simplicity, the x
and y inputs of each full adder are not shown. The three clock domains,
corresponding to the clock domains of the inputs to each full adder sub-module,
are indicated by dotted lines and are labeled clk1, clk2, and clk3. The full
adder submodule is detailed in the inset. . . . . . . . . . . . . . . . . . . . . 84
5.10 Block diagram of an array multiplier with annotations for RP-TMR. The
shading indicates the protected modules and the underlines note the replicated
partial product inputs. The full adder and half adder sub-modules used are
detailed in the insets, with each partial product shown as one input to each
module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.11 Relative cost of RPR decision blocks in terms of 4-input LUTs for a range of
reduced-precision bit-widths. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.12 Threshold RPR multiplier: full-precision output and reduced-precision output
with error bounds. h = 0.4921875, B = 7, Br = 3, and Th = εmax . . . . . . . 92
5.13 Bounded RPR multiplier: full-precision and reduced-precision outputs. h =
0.4921875, B = 7, and Br = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.14 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for an FIR filter protected with three levels of Threshold RPR compared to
the unmitigated design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.15 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for an FIR filter protected with three levels of Bounded RPR compared to the
unmitigated design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.16 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for an FIR filter protected with three levels of RP-TMR compared to the
unmitigated design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.1 (a) The pmf of the estimation error, εe, of an RPR module, (b) the pmf for the
maximum undetected upset error signal, εu, and (c) the pmf for a mid-range
upset which crosses the reduced threshold, T*h. . . . . . . . . . . . . . . . . . 108
6.2 Bit error rate curves for several FIR filters (SRRC pulse shape, α = 0.5) with
different bit-widths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3 RPR filter decision signals for RPR with Br = 6 and Th = 0.3106. No errors
are present in the system. The upper and lower comparison bound signals are
calculated by adding and subtracting Th to and from RPout. . . . . . . . . . 116
6.4 RPR filter decision signals for RPR with Br = 3 and Th = 2.3871. The FPout
signal is frozen at zero. The upper and lower comparison bound signals are
calculated by adding and subtracting Th to and from RPout. . . . . . . . . . 117
6.5 ERPR-avg of the FIR filter design for several bit-widths and using two failure
rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.6 Normalized percentage of SEUs causing certain SNR losses at BER of 10⁻⁵
for an FIR filter protected with several levels of Threshold RPR compared to
the unmitigated design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.7 Block diagram of a 4-tap FIR filter. . . . . . . . . . . . . . . . . . . . . . . . 124
6.8 Workflow for choosing the location and number of decision blocks in an RPR
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.9 Block diagram of a simple circuit with feedback. . . . . . . . . . . . . . . . . 129
6.10 Workflow for applying RPR+TMR to a digital system. . . . . . . . . . . . . 131
6.11 Block diagram of the recursive binary PAM demodulator with annotations for
RPR+TMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.12 Block diagram of the NCO block within the recursive binary PAM
demodulator, exported from Xilinx System Generator. . . . . . . . . . . . . . 134
6.13 BER plot for the binary PAM receiver system with timing synchronization
using RPR+TMR for mitigation. . . . . . . . . . . . . . . . . . . . . . . . . 137
A.1 A photograph of the fault injection test board. . . . . . . . . . . . . . . . . . 160
A.2 A block diagram of the ConfigMon FPGA used in the fault injection tests. . 160
A.3 Comparison between the (a) sensitivity test architecture and the (b)
utilization test architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
A.4 A block diagram of the BER fault injection test. . . . . . . . . . . . . . . . . 164
B.1 Probability mass function (pmf) of the estimation error, εe, of the reduced-
precision FIR Filter with Br = 2. . . . . . . . . . . . . . . . . . . . . . . . . 166
B.2 Probability mass function (pmf) of the estimation error, εe, of the reduced-
precision FIR Filter with Br = 3. . . . . . . . . . . . . . . . . . . . . . . . . 166
B.3 Probability mass function (pmf) of the estimation error, εe, of the reduced-
precision FIR Filter with Br = 4. . . . . . . . . . . . . . . . . . . . . . . . . 167
B.4 Probability mass function (pmf) of the estimation error, εe, of the reduced-
precision FIR Filter with Br = 5. . . . . . . . . . . . . . . . . . . . . . . . . 167
B.5 Probability mass function (pmf) of the estimation error, εe, of the reduced-
precision FIR Filter with Br = 6. . . . . . . . . . . . . . . . . . . . . . . . . 167
B.6 Probability mass function (pmf) of the estimation error, εe, of the reduced-
precision FIR Filter with Br = 7. . . . . . . . . . . . . . . . . . . . . . . . . 167
B.7 Sample probability mass functions (pmfs) of the SEU-induced noise signals,
εu, for several upsets in an FIR filter design. . . . . . . . . . . . . . . . . . . 168
B.8 More sample probability mass functions (pmfs) of the SEU-induced noise sig-
nals, εu, for several upsets in an FIR filter design. . . . . . . . . . . . . . . . 169
B.9 More sample probability mass functions (pmfs) of the SEU-induced noise sig-
nals, εu, for several upsets in an FIR filter design. . . . . . . . . . . . . . . . 170
B.10 More sample probability mass functions (pmfs) of the SEU-induced noise sig-
nals, εu, for several upsets in an FIR filter design. . . . . . . . . . . . . . . . 171
B.11 Histogram of the mean of the SEU-induced noise signals, εu, for all sensitive
SEUs in an FIR filter design. . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.12 Detail of the histogram in Figure B.11. . . . . . . . . . . . . . . . . . . . . . 173
B.13 Histogram of the variance of the SEU-induced noise signals, εu, for all sensitive
SEUs in an FIR filter design. . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.14 Detail of the histogram in Figure B.13. . . . . . . . . . . . . . . . . . . . . . 173
xx
B.15 Histogram of the power (mean square) of the SEU-induced noise signals, εu,
for all sensitive SEUs in an FIR filter design. . . . . . . . . . . . . . . . . . . 174
B.16 Detail of the histogram in Figure B.15. . . . . . . . . . . . . . . . . . . . . . 174
C.1 Block diagram of a type I direct form FIR filter with seven taps, optimized
for symmetric coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
D.1 Block diagram of a 4-tap FIR filter. . . . . . . . . . . . . . . . . . . . . . . . 183
F.1 Block diagram of the experiment designed for the MISSE-8 experiment on the
International Space Station. . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
F.2 Block diagram of the experiment designed for the Cibola Flight Experiment
satellite. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
xxi
CHAPTER 1. INTRODUCTION
1.1 Motivation
Field-programmable gate arrays (FPGAs) are becoming an increasingly popular al-
ternative to general purpose CPUs and application-specific integrated circuits (ASICs) in
many application domains. Compared to general purpose CPUs, FPGAs can offer faster
processing and increased performance per watt [6], [7]. Compared to custom ASICs, FP-
GAs provide a lower cost per device in small quantities and more flexibility due to their
re-programmability [8]. FPGAs provide an alternative to these two technologies, offering an
attractive trade-off between the features and costs of each.
Given these trade-offs, FPGAs are becoming a popular target for processing and
communications in space systems. As scientific experiments on board satellites become
more complex, the amount of data collected often exceeds the capacity of the downlink
from the satellite to the ground station. In order to reduce the amount of data that must
be transferred to the ground, an increasing number of satellites include on-board processing
modules and systems [9], [10]. FPGAs provide good performance for digital signal processing
(DSP) and communications applications often used by these systems [11]–[17].
Aside from processing power, FPGAs offer other attractive features to satellite sys-
tems. FPGA-based systems can be re-programmed on demand after deployment to perform
the functions of several different devices at different times through time-sharing. This can
reduce system weight and power requirements, which are important in satellite systems.
This re-programmability also allows the circuit implemented to be changed in-flight for later
upgrades, bug fixes, and added functionality. Also, since satellites are typically a
low-volume product, the low cost of a single FPGA is attractive compared to the high cost
of the first ASIC chip manufactured.
Unfortunately, the harsh space environment makes processing using standard SRAM-
based (static random access memory) FPGAs difficult. Outside the atmosphere of the Earth,
there is a large amount of radiation that may interfere with the electronics of a spacecraft.
Memory cells are especially susceptible to the effects of this radiation. Since SRAM-based
FPGAs are based on large arrays of memory cells, they are particularly susceptible to
radiation-induced upsets, called single event upsets (SEUs). This problem is exacerbated
by the fact that the configuration of the FPGA is stored in these memory cells in addition
to the basic data normally stored in a digital circuit’s memory bank. That is, the hardware
implemented by the FPGA is defined by the configuration memory cells and any SEU in
these cells has the potential to corrupt the hardware implemented by the FPGA.
Though there are existing methods for dealing with radiation, these methods are
costly in terms of area, power, and/or circuit timing. These techniques add redundancy to
the circuit in the form of additional hardware, redundant data, or repeated processing. The
most popular technique used is triple modular redundancy (TMR) coupled with configuration
scrubbing. This method, although effective, requires three times the area and power of the
original circuit along with a degradation in its speed.
FPGA-based DSP and communications applications considered for space systems
must deal with these radiation effects and typically use the same expensive redundancy
techniques to mitigate SEUs. The hypothesis in this dissertation is that it is possible to
reduce the cost of mitigation by exploiting the properties of these types of applications.
DSP and communications systems are designed to process data that has been corrupted
by noise inherent in the applications. If that same processing can filter out some of the
corruption caused by SEUs, a reduced-cost mitigation approach may be feasible.
1.2 Summary of Research
This dissertation shows that FPGA-based DSP and communications systems can be
protected from radiation effects at a lower cost than TMR. It demonstrates the inherent
resilience of these systems to radiation effects and pinpoints their most critical sections. It
also demonstrates a specific reduced-cost mitigation technique that takes advantage of the
noise-handling properties of DSP and communications systems. This dissertation suggests
specific methods for implementing this technique on FPGA systems.
First, this dissertation presents a novel method for analyzing the reliability of FPGA-
based DSP and communications systems. This method focuses on measuring the perfor-
mance of the system in the presence of an SEU in order to classify SEUs according to the
severity of their effects. Fault injection experiments demonstrate that only 5–15% of SEUs
affecting a communications receiver (i.e. 5–15% of sensitive SEUs) cause critical performance
loss. The most critical SEUs were found to be those that affect the clock, global reset, and
most significant bits (MSBs) of computation of the FPGA design.
Using this detailed analysis of the SEU effects on a communications system, this dis-
sertation suggests a technique known as reduced-precision redundancy (RPR) to combat the
negative effects of SEUs. This technique focuses redundancy on the MSBs of computation
and leaves the less critical SEUs to the noise-handling processing of the DSP or communica-
tions application. Fault injection experiments show that RPR is able to improve the failure
rate of several simple communications systems by 20 times at a cost of less than half that of
TMR in most cases.
After identifying RPR as a reduced-cost alternative to TMR, this dissertation presents
methods for optimizing the application of RPR on a system. This includes a comparison
of three variations of the RPR technique, including a novel variation introduced here called
Reduced-Precision TMR (RP-TMR). These variations are compared for their area cost and
their ability to protect against SEUs. The variation called Threshold RPR is demonstrated as
the best fit for FPGA-based systems with an analysis of the projected cost and performance
of each as well as with fault injection experiments.
Finally, this dissertation presents several methods for applying Threshold RPR to a
system with the goal of reducing mitigation cost and increasing the system performance in
the presence of SEUs. Additional fault injection experiments show that optimizing the ap-
plication of RPR can result in a decrease in critical SEUs by as much as 65% at no additional
hardware cost. A final example demonstrates the application of RPR to a more complex
communications receiver system, showing how RPR may be applied to larger systems and
providing a workflow to do so.
1.3 Dissertation Organization
This dissertation is divided into seven chapters:
• Chapter 2 gives background on reliable processing on FPGAs. It describes the radi-
ation effects faced by these devices and current methods of dealing with these issues,
including the aforementioned TMR and configuration scrubbing.
• Chapter 3 presents a novel method for evaluating the effects of radiation on FPGA-
based DSP systems. Fault injection experiments show that several communications
systems are naturally resilient to radiation effects. The chapter also identifies the most
critical sections of these systems.
• Chapter 4 describes the RPR technique as an alternative to TMR which reduces costs
by focusing on protecting the most critical sections of the circuit and largely ignoring
the naturally resilient sections. This chapter demonstrates RPR’s effectiveness as well
as its potential area savings over TMR with fault injection experiments on simple
communications systems.
• Chapter 5 compares and contrasts three different variations of the RPR technique
by comparing their area cost and evaluating the error bounds of each technique. Two
of these methods are previously-suggested implementations and the third is a new
variation introduced here called Reduced-Precision TMR (RP-TMR). This chapter
shows that one of these variations, called Threshold RPR, is superior to the other two
for FPGA-based systems. This conclusion is verified with fault injection experiments.
• Chapter 6 suggests methods to optimize the application of Threshold RPR to an
existing communications system. This chapter demonstrates the trade-offs of selecting
different parameters for the RPR implementation and validates the methods presented
on a more complex communications system with fault injection experiments.
• Chapter 7 summarizes the research and contributions provided by this dissertation
and gives suggestions for future work in this area.
CHAPTER 2. RADIATION EFFECTS AND MITIGATION ON FPGAS
Satellites and other spacecraft operate in the harsh radiation environment outside
the Earth’s atmosphere. Charged particles in this environment can cause voltage or cur-
rent spikes in a circuit which can alter the contents of digital memory cells. Any comput-
ing systems in these environments must somehow tolerate or mitigate these complications.
SRAM-based FPGAs are especially susceptible to these radiation effects.
This chapter discusses the various radiation effects faced by FPGAs and other elec-
tronic systems. Next, it presents some of the standard techniques used to protect FPGA
systems from these effects and introduces the most common fault tolerance technique used
in FPGAs, triple modular redundancy (TMR). Additional application-specific mitigation
techniques are also mentioned, including reduced-precision redundancy (RPR). Finally, this
chapter describes the methods used in this dissertation to evaluate the sensitivity of a par-
ticular FPGA design to radiation effects.
2.1 Single Event Effects
Outside the Earth’s atmosphere, objects are regularly bombarded with various en-
ergetic particles including solar and extra-solar cosmic rays as well as protons trapped in
the Earth’s magnetic field [18], [19]. On the ground, electronic systems are protected from
most of these energetic particles by our atmosphere.1 A particle which passes through a
digital system may alter the current or voltage in a portion of the circuit. The result of an
energetic particle affecting a circuit is called a single event effect (SEE) [23], [24].
1With shrinking transistor sizes, cosmic rays are predicted to soon become a larger problem for computer systems on the ground as well [20]–[22].
2.1.1 Types of Single Event Effects
Single event effects affect both ASIC and FPGA devices in several different forms.
These effects include single event upsets (SEU), single event transients (SET), single event
latchup (SEL), single event burnout (SEB), and single event gate rupture (SEGR) [24]. Both
SEU and SET are non-destructive events, which are called “soft errors.” The other events can
cause permanent damage to the device if not properly monitored or if sufficient mitigation
is not in place. The single event effects which are of main concern on SRAM-based FPGAs,
and upon which this dissertation focuses, are SEU and SET [25].
Particle strikes which occur in the transistors making up a memory element in the
device can alter the contents of memory. That is, a memory cell storing a binary ‘1’ could be
upset and its contents changed to a ‘0.’ This event is called an SEU2. An SET is the result
of a charged particle temporarily altering the amount of current or voltage passing through
a circuit element. If this transient effect passes through a memory cell at the moment that
the cell is capturing and storing its input, the result is the same as an SEU.
2.1.2 SEE within ASICs
Single event effects affect ASICs in addition to FPGAs. Soft errors, including SEU
and SET, are a significant concern in ASIC-based systems in radiation environments. SEUs
can alter the contents of memory elements in the system including flip-flops (FFs), random
access memories (RAMs), and processor caches. The common static random access memory
(SRAM) and dynamic random access memory (DRAM) are especially susceptible to SEUs
compared to electrically-erasable programmable read only memory (EEPROM) and flash
memory [27]. Similarly, SETs may cause transient voltage or current pulses in any logic,
which may in turn be latched into a memory element causing an SEU.
2A single particle strike may affect multiple memory cells, in which case the SEU is called a multi-bit upset (MBU) [26]. For simplicity, this dissertation considers only SEUs which are single-bit upsets (SBUs).
These soft errors can cause several types of errors in ASIC devices. An SEU in a
memory array can cause data corruption. A particle strike within a processor can halt, reset,
or cause an unintended jump within the program flow. Other SEUs can cause miscellaneous
corruption of the data stored within and being operated on by processing modules. These
effects are problematic, but the processing components themselves are of less concern than
the memory components since errors in the logic are temporary unless they are latched by a
memory element [28].
2.1.3 SEE within FPGAs
In contrast to ASICs, FPGAs use a large memory array to store their configuration.
This configuration memory defines the hardware implemented in the FPGA. By changing
the contents of this memory, the FPGA may be configured to operate as an FIR filter,
a microprocessor, or any other custom circuitry. A major concern with using FPGAs in
radiation environments, then, is that an SEU in the configuration memory could alter the
hardware implemented in addition to the user memory (flip-flops, RAMs, etc.). This can
result in more significant errors than those expected in ASICs.
There are several types of FPGAs available, each of which has different characteristics
in radiation environments. All standard FPGA fabrics are susceptible to upsets directly in
the user memory as well as through SETs in the logic that may be latched into the user
memory. The technology used to define the configuration of the FPGA, however, greatly
affects its resilience against radiation-induced upsets.
• SRAM FPGAs use a large array of SRAM memory cells to store the hardware
configuration of the device. Typical SRAM cells, and thus the configuration of the
FPGA, are especially susceptible to SEUs.
• Antifuse FPGAs are configured by antifuses rather than memory cells. These devices
are configured once and their functionality cannot be changed again. This type of
configuration is immune to SEUs [29].
• Flash memory FPGAs use non-volatile flash memory to store the FPGA configu-
ration. These memory cells are also immune to SEUs.
Although SRAM-based FPGAs are the most susceptible to radiation-induced upsets,
they are desirable for other reasons. Antifuse FPGAs can only be programmed once, which
eliminates the benefits of reconfigurability that SRAM- and flash-based FPGAs have. Flash
FPGAs currently suffer from a low tolerance to total ionizing dose (TID) effects, resulting in decreased clock
speeds and loss of reconfigurability after the threshold radiation dose is reached [30]. For
these reasons, SRAM-based FPGAs are preferred in many applications.
2.1.4 SEUs on SRAM-based FPGAs
As mentioned above, SRAM-based FPGAs are susceptible to SEUs in the user mem-
ory (flip-flops, RAMs, etc.) as well as the configuration memory. This dissertation primarily
focuses on SEUs in the configuration memory of the FPGA device. The configuration mem-
ory makes up the vast majority of the memory cells available on an FPGA [31].
The configuration memory controls the type of logic implemented by the FPGA de-
vice as well as the interconnect between logic functions, as illustrated in Figure 2.1(a). An
SEU in the configuration memory can alter the function of the circuit, as shown in Fig-
ure 2.1. Figure 2.1(b) illustrates how an upset in an FPGA lookup table (LUT) can alter
the function implemented by that LUT. Figure 2.1(c) shows an example of an upset in a
routing matrix, which controls the routing of signals between FPGA logic blocks. These
upsets can disconnect routes, create new routes, or even bridge two routes together [32].
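The LUT case in Figure 2.1(b) can be illustrated with a minimal Python sketch. It models a 4-input LUT as a 16-entry truth table held in configuration bits and shows how inverting one bit changes the implemented function. The function names and the 4-input AND example are assumptions chosen for illustration; they are not taken from the dissertation or from any vendor tool.

```python
def lut_eval(config_bits, inputs):
    """Evaluate a k-input LUT: the inputs form an index into the truth table."""
    index = 0
    for bit in inputs:              # inputs[0] is the most significant index bit
        index = (index << 1) | bit
    return config_bits[index]


def flip_bit(config_bits, position):
    """Return a copy of the configuration with one bit inverted (a modeled SEU)."""
    upset = list(config_bits)
    upset[position] ^= 1
    return upset


# Hypothetical configuration: a 4-input AND gate (only entry 15 holds a 1).
and4 = [0] * 15 + [1]
upset_and4 = flip_bit(and4, 15)     # the single '1' entry becomes '0'

print(lut_eval(and4, [1, 1, 1, 1]))        # 1
print(lut_eval(upset_and4, [1, 1, 1, 1]))  # 0
```

In this toy model the upset LUT outputs 0 for every input, so a single configuration upset has turned an AND gate into a constant.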
The consequences of these configuration SEUs can be drastic. The logic implemented
by the FPGA can be altered to produce a different function than intended. Routing upsets
can prevent critical signals from reaching their destination. An upset in the clocking logic
can effectively turn off an entire FPGA design.
Fortunately, SEUs in SRAM FPGAs are not permanent and are repairable simply by
restoring the original configuration of the FPGA. This can be done by reloading the entire
Figure 2.1: (a) An abstraction of an FPGA logic cell with 1’s and 0’s representing the contents of the configuration memory and the red indicating the routing and functions implemented, (b) an upset in a LUT module, and (c) an upset in the routing matrix.
FPGA configuration or by reloading only the portion of the configuration that has been
corrupted.
With their susceptibility to SEUs in the configuration memory, it is often desirable to
protect a design from SEU-induced errors. Section 2.2 will discuss some common methods
for mitigating SEUs in the FPGA configuration. Section 2.3 will describe how to measure the
sensitivity of FPGA designs to SEUs, which will provide a way to analyze the effectiveness
of SEU mitigation techniques.
2.2 SEU Mitigation for FPGAs
To protect an FPGA system from errors caused by SEUs, upsets must be prevented or
tolerated in some manner. In space environments, prevention of upsets is impractical due to
the high energy of the particles in question and the size and weight of physical shielding that
would be required. For this reason, SEU mitigation methods are used instead to minimize
the negative impact of upsets.
A variety of SEU mitigation techniques have been developed and tested for FPGAs.
These mitigation approaches typically involve some form of redundancy, whether that be
multiple processing modules, repeated processing steps, or data redundancy. In addition,
each technique is coupled with a repair process which restores the original configuration of
the FPGA after an SEU occurs.
This section begins with a description of the most common repair processes, collectively
known as configuration scrubbing. A brief overview of the different types of redundancy
techniques follows. The most popular of these techniques is triple modular redundancy
(TMR), which will be described in detail. Finally, this section concludes by mentioning
some alternatives to TMR which take advantage of knowledge of the specific application
to reduce the cost of mitigation in some way. One of these methods is reduced-precision
redundancy (RPR), which is a main focus of this dissertation.
2.2.1 Configuration Scrubbing
Section 2.1.4 mentioned that SEUs can be repaired by re-writing the configuration
memory of the FPGA with its original content. This is often done using a method known as
configuration scrubbing [33], [34]. Scrubbing is a method for repairing SEUs in the configu-
ration memory by periodically rewriting the original configuration of the FPGA. It is also
used to prevent the accumulation of upsets to improve the reliability of SEU mitigation
techniques. Scrubbing has several forms, each of which satisfies these goals.
One scrubbing method simply re-writes the entire configuration of the FPGA at a
chosen interval. The re-write is done whether an upset exists in the configuration or not.
This is the simplest scrubbing method, requiring little system overhead. Some FPGAs can
be reconfigured while continuing to run so the design does not have to be paused during the
writing process.
Another scrubbing method periodically reads the configuration memory to detect
upsets before re-writing the configuration. For this scrubbing method, the configuration
memory is read out and compared to the original configuration, perhaps stored in an external
radiation-hardened memory. If a difference is discovered, the correct configuration is restored.
This form of scrubbing is also called “readback and compare.”
It is important to include configuration scrubbing in any SEU mitigation scheme.
Without scrubbing, SEUs would build up over time, eventually overwhelming even the most
robust mitigation technique. The scrubbing rate should be sufficiently higher than the rate
of SEU occurrence such that the most probable outcome is that no more than a single upset
will exist in the FPGA configuration at one time. Unless otherwise noted, this dissertation
assumes an adequate scrubbing system and that no more than one upset is present in the
FPGA configuration at one time.
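The "readback and compare" method and the scrub-rate guideline above can be sketched as follows. This is a simplified software model, not the scrubber used in this dissertation: the configuration is represented as a list of byte strings called frames, and upset arrivals are assumed to follow a Poisson process. Both modeling choices, along with all names, are assumptions made for illustration.

```python
import math


def readback_and_compare(device_frames, golden_frames):
    """Scan the device configuration and rewrite any frame that differs
    from the golden copy. Returns the indices of the repaired frames."""
    repaired = []
    for i, golden in enumerate(golden_frames):
        if device_frames[i] != golden:
            device_frames[i] = golden      # restore the original contents
            repaired.append(i)
    return repaired


def prob_at_most_one_upset(upset_rate, scrub_period):
    """P(0 or 1 upsets in one scrub period), assuming Poisson arrivals."""
    mean = upset_rate * scrub_period
    return math.exp(-mean) * (1.0 + mean)


# Hypothetical golden configuration: 8 frames of 4 bytes each.
golden = [bytes([0xA5] * 4) for _ in range(8)]
device = list(golden)

# Model a single-bit upset in frame 3.
corrupted = bytearray(device[3])
corrupted[1] ^= 0x10
device[3] = bytes(corrupted)

print(readback_and_compare(device, golden))  # [3]
print(device == golden)                      # True
```

Under the Poisson assumption, `prob_at_most_one_upset` approaches 1 as the scrub period shrinks relative to the mean time between upsets, which is the condition the single-upset assumption relies on.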
2.2.2 Redundancy Techniques
In addition to preventing the build-up of configuration upsets with scrubbing, it is
desirable to prevent the effects of any single SEU from reaching the circuit outputs. To do
this, scrubbing must be combined with a redundancy technique which masks errors while
SEUs are present in the system. This redundancy may be in space (parallel computing),
time (repeated computing), or information (e.g. data encoding) [35].
Spatial Redundancy
Spatial redundancy uses parallel computation to mask errors. Using multiple copies of
a circuit and comparing the outputs, the most likely outcome can be determined. With three
copies of a circuit, any single module can fail and the system can still provide the correct
output. With five copies of a circuit, any two modules can fail, etc. Spatial redundancy
techniques tend to have high area costs due to this circuit replication.
Temporal Redundancy
Temporal redundancy, as its name implies, involves repeated computation. This is
done using a single processing module, in contrast to spatial redundancy which uses multiple
processing modules in parallel. Both error detection and error correction can be achieved
using temporal redundancy. It can be used to detect and correct both transient (SET) and
permanent (SEU) faults [36], [37].
Though temporal redundancy aims to have a lower area cost than spatial redundancy
methods, the extra hardware to detect and correct faults after running multiple computations
is also susceptible to SEUs in FPGAs. This has been shown to significantly reduce the
reliability of these methods for FPGA systems [35]. In an FPGA design, spatial redundancy
can be added to temporal redundancy schemes to protect this additional hardware and
obtain adequate reliability [38].
Information Redundancy
Information redundancy is a third option for protecting a system from errors. This
type of redundancy is often used in blocks of memory or in data streams in the form of error-
correcting codes [39]. Information redundancy can also be used to protect circuits in the
form of state machine encoding [40]. State machines are protected by only allowing certain
valid states and using error correction to determine the most likely correct state when an
error occurs. This form of redundancy only protects state machines and may also suffer from
the high costs of protecting the coding and decoding circuitry [35].
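As a small illustration of state machine encoding, the Python sketch below corrects an upset state register by selecting the valid state at the smallest Hamming distance. The 5-bit encoding is a hypothetical example chosen so that every pair of valid states differs in at least three bits; it is not an encoding taken from [40].

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def correct_state(observed, valid_states):
    """Return the valid state closest to the observed register value."""
    return min(valid_states, key=lambda s: hamming_distance(observed, s))


# Hypothetical encoding of four states with pairwise distance of at least 3,
# so any single-bit upset remains closest to its original state.
VALID_STATES = [0b00000, 0b00111, 0b11001, 0b11110]

upset_register = 0b00111 ^ 0b00100   # one bit of state 0b00111 is flipped
print(correct_state(upset_register, VALID_STATES) == 0b00111)  # True
```

Note that the correction logic itself would occupy FPGA resources and could be upset, which is the cost the paragraph above refers to.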
2.2.3 Triple Modular Redundancy
Though there are various forms of redundancy, the most popular for FPGA-based
systems is triple modular redundancy (TMR). John von Neumann suggested this method in
1956 as a way of creating a reliable system from unreliable components [41]. An under-
standing of TMR is essential since the mitigation techniques developed in this work will be
compared directly to this standard.
TMR triplicates the circuit module to be protected and the circuit output is deter-
mined by a majority voter module with the three circuit replicas as input. In this manner,
if any one of the three replicas is in error, the other two replicas “out vote” the erroneous
module and the correct output is given by the voter.
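The majority voter can be expressed as a bitwise function of the three replica outputs. The following Python sketch is a software illustration only; the function name and the 8-bit example values are assumptions.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three equal-width words: each output bit is 1
    when at least two of the corresponding input bits are 1."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
bad = good ^ 0b01000000          # one replica has a corrupted bit

print(majority_vote(good, bad, good) == good)  # True
```

The voter masks the error regardless of which replica is corrupted, but it cannot do so if two replicas fail in the same bit position.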
To obtain maximum reliability, a system protected with TMR should include a repair
process. The repair process fixes any existing faults in the system to prevent their build-up.
Figure 2.2 plots the reliability over time of a TMR system with and without repair, compared
to an unmitigated system [2]. The repair process vastly improves the reliability of TMR.
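The benefit of repair can be motivated with the standard textbook model of TMR without repair. Assuming each module fails independently with a constant rate and the voter is perfect, the module reliability is R(t) = exp(-λt) and the TMR reliability is 3R(t)^2 - 2R(t)^3. The Python sketch below evaluates both; the failure rate is an arbitrary assumption, the repaired case is not modeled, and this is not necessarily the exact model behind Figure 2.2.

```python
import math


def r_unmitigated(failure_rate, t):
    """Reliability of a single module with a constant failure rate."""
    return math.exp(-failure_rate * t)


def r_tmr_no_repair(failure_rate, t):
    """Reliability of TMR without repair: at least two of three modules work."""
    r = r_unmitigated(failure_rate, t)
    return 3 * r**2 - 2 * r**3


FAILURE_RATE = 1e-3   # hypothetical failures per unit time

for t in (100, 500, 1000, 2000):
    print(t, r_unmitigated(FAILURE_RATE, t), r_tmr_no_repair(FAILURE_RATE, t))
```

Under these assumptions, TMR without repair is more reliable than the unmitigated module only while R(t) > 0.5, that is, before t = ln(2)/λ. After that point accumulated faults make it less reliable, which is why a repair process is needed.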
In an FPGA, TMR is often coupled with configuration scrubbing as the repair process.
To simplify analysis, this dissertation assumes that only a single SEU exists
in an FPGA at any one time. When the scrubbing rate is sufficiently higher than the rate
Figure 2.2: Sample of the reliability over time, R(t), of a TMR system with and without repair, compared to an unmitigated system [2]. (Plot of R(t), from 0 to 1, versus time, from 0 to 4000; curves: unmitigated, TMR with repair, TMR without repair.)
of SEU occurrence, this is not an unreasonable assumption. Coupled with scrubbing, TMR
is very effective at protecting against SEUs in FPGAs [42], [43].
Figure 2.3 shows a simplified block diagram of an FIR filter design protected with
TMR. The dotted line shows the bounds of the FPGA. Since, in an FPGA, even the signal
routing and voter circuitry is susceptible to SEUs, triplicated inputs and outputs are often
utilized and voting is performed off-chip, often with radiation-hardened circuitry. In addition
to the data inputs, the clock and reset input signals that connect to all of the internal memory
components in the filter module are also triplicated (not pictured). This ensures that even
an SEU affecting the clock distribution network will not affect more than one module at
one time.
In feed-forward systems, such as the finite impulse response (FIR) filter in Figure 2.3,
voters only need to be added at the final outputs of the circuit in order to reduce the three
outputs down to one. Circuits with feedback logic, such as phase-locked loops (PLLs) and
infinite impulse response (IIR) filters, contain extra internal memory state that must be
Figure 2.3: Simplified block diagram of an FIR filter protected with triple modular redundancy (TMR). The portion surrounded by the dotted box is implemented on the FPGA.
synchronized between the three circuit replicas. These more complicated circuits must also
have extra voter modules inserted within the feedback loops in each replica to ensure that
memory state is maintained [44], [45]. Due to the triplication of the circuit and the addition
of voter modules, TMR has a hardware overhead of over 200%.
2.2.4 Application Specific Fault Tolerance
Due to the high cost of TMR, researchers have looked into alternative mitigation
strategies. In searching for alternatives to TMR, various authors have noted that reduced-
cost mitigation techniques might be obtained by using knowledge of the system in question.
These approaches, primarily targeting ASIC-based systems, have been called algorithm-
based fault tolerance (ABFT) [46], algorithmic soft error tolerance (ASET) [47], and system
knowledge [48].
Some authors have shown that the effects of soft errors in a DSP system can sometimes
be viewed as noise. Several papers have examined soft errors produced in ASICs by deep-
submicron (DSM) noise as well as those produced by using voltage overscaling (VOS) to
reduce power [4], [47], [49], [50]. Although this dissertation makes a similar analysis, the
effects of soft errors in ASICs are distinct from those which are of main concern for SRAM
FPGA systems as explained in Sections 2.1.2–2.1.4.
Others have published papers dealing with the effects of radiation-induced SEUs
in ASIC-based DSP systems [48], [51]–[53]. These papers focus on errors in the memory
elements of the systems, which is the dominant issue in ASIC technologies. In contrast, this
dissertation considers the effects of SEUs in any part of the FPGA configuration memory,
which specifies the logic implemented in addition to the user memory.
This dissertation evaluates reduced-precision redundancy (RPR) as an alternative
to TMR in FPGA-based DSP and communications systems. RPR was introduced as an
alternative to TMR for ASIC-based DSP systems [54]. RPR offers less protection than
TMR, but at a much lower cost. Chapter 4 will describe RPR in detail and Chapters 5 and
6 will present its application on FPGAs for SEU mitigation.
2.3 Evaluating FPGA Design Reliability
The reliability of an FPGA design can be assessed by experimentally determining
the effects of SEUs on the design. Evaluating the sensitivity of a design to SEUs allows the
designer to predict the failure rate of the design once deployed. A reliability assessment can
also be used to evaluate the effectiveness of a mitigation technique or to compare different
mitigation schemes. This section describes how fault injection experiments are used to
determine the effects of SEUs on an FPGA design and to predict the reliability of the
design.
2.3.1 Sensitivity
Each individual FPGA design has a distinct level of susceptibility to SEUs. The
FPGA configuration is made up of a large array of memory cells which control the hardware
implemented. For any particular FPGA design, however, only a fraction of these cells are
utilized. The FPGA fabric includes many different options for routing and logic configura-
tion. Even a design with “100%” logic utilization only uses a small percentage of the total
number of resources available since it does not make use of all of these options [55]–[57]. The
configuration cells which are utilized by a particular design are called the utilized bits of the
design.
A subset of the utilized configuration bits is the set of sensitive bits. Sensitive bits are
those which cause the output of the design to change when they are upset. For an unmitigated
design, the set of utilized bits and the set of sensitive bits are the same. SEU mitigation applied to
a design may mask the errors caused by some upsets, resulting in some utilized bits which are
not sensitive to SEUs. The number and location of the sensitive bits is called the sensitivity
of the design [3].
2.3.2 Fault Injection Experiments
The utilized and sensitive bits of a particular FPGA design can be discovered through
fault injection experiments. Fault injection involves manually inserting faults into the config-
uration bitstream by changing the contents of individual memory cells. Using fault injection,
every configuration bit in the FPGA can be tested one by one to determine the utilization
and sensitivity of a particular design.
Several fault injection methods have been suggested for evaluating FPGA designs [3],
[32], [58]. Each of these methods alters the contents of the configuration memory and then
examines the output of the design for errors. The experiments presented in this dissertation
are based on the fault injection method presented in [3]. This method is described here and
will be extended in Section 3.1. Appendix A describes the specific hardware used for the
experiments presented in this dissertation.
Figure 2.4 illustrates the method used for fault injection in [3]. In this figure, an FIR
filter design is the target for characterization. The figure shows two FPGAs, each with a
copy of the filter design. The golden FPGA contains the original filter with no modifications.
The design under test (DUT) FPGA contains the filter being tested by injecting faults in
the configuration. The two FPGAs receive identical input streams, in this case random bits,
and the outputs of the two chips are compared.
Figure 2.4: Fault injection of an FIR filter using two FPGAs.
The control flow for a fault injection test is illustrated in Figure 2.5. A fault is
injected by choosing a configuration cell and inverting its memory contents. Output errors
are detected by comparing the outputs of the golden and DUT FPGAs bit for bit across a
number of clock cycles. If any deviation is observed, the bit is marked sensitive. The test is
repeated until every configuration bit has been tested.
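The exhaustive flow can be sketched in software. The following is a minimal Python sketch, not the test fixture used in [3]: the function run_design and the 32-cell toy configuration are hypothetical stand-ins for an FPGA and its bitstream, and a fixed input pattern replaces the random input bits so that the result is repeatable.

```python
def run_design(config, inputs):
    """Toy stand-in for an FPGA design (hypothetical): a 4-tap filter whose
    taps are stored as 4-bit fields in the first 16 configuration cells.
    The remaining cells are unused, like unutilized configuration bits."""
    taps = [int("".join(str(b) for b in config[4 * i:4 * i + 4]), 2)
            for i in range(4)]
    history = [0, 0, 0, 0]
    outputs = []
    for x in inputs:
        history = [x] + history[:-1]
        outputs.append(sum(t * h for t, h in zip(taps, history)))
    return outputs


def find_sensitive_bits(golden_config, inputs):
    """Invert each configuration cell in turn and compare the DUT output
    with the golden output, sample for sample."""
    golden_out = run_design(golden_config, inputs)
    sensitive = []
    for bit in range(len(golden_config)):
        dut_config = list(golden_config)
        dut_config[bit] ^= 1                      # inject the fault
        if run_design(dut_config, inputs) != golden_out:
            sensitive.append(bit)                 # any deviation marks the bit
        # the fault is repaired implicitly: dut_config is discarded
    return sensitive


golden_config = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0] + [0] * 16
inputs = [1, 0, 1, 1, 0, 0, 1, 0] * 8
sensitive_bits = find_sensitive_bits(golden_config, inputs)
```

In this toy model only the 16 cells that hold tap values are marked sensitive; the unused cells never change the output, which mirrors the distinction between utilized and unutilized configuration bits.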
This test determines the sensitivity of the FPGA design, as described above. Fig-
ure 2.4 includes a graphical representation of a filter design characterized with this tool.
For this particular design, 149,696 configuration bits out of the total 5,810,024 available in
the Virtex 1000 FPGA were marked as sensitive. With the count of sensitive bits and a
description of an upset environment, the failure rate of the design can be predicted in that
environment. The failure rate and its various uses will be discussed in Section 2.3.3.
2.3.3 Failure Rate
In this dissertation, failure rate will be used to compare the reliability of different
designs and mitigation techniques. Each design has a distinct failure rate and different
mitigation techniques will improve the failure rate to varying degrees. The improvement in
failure rate that a particular mitigation approach offers will be used to evaluate the different
approaches.
Figure 2.5: The exhaustive fault injection flow described in [3].
The failure rate, λ, of any system is so named because it describes the rate at which
failures occur in time. More precisely, λ is the number of expected failures in the system
per unit time. For random independent events such as SEUs, a constant failure rate is often
assumed which ignores effects such as wear-out and infant mortality [2].
The failure rate for a system, of course, depends on the definition of failure. Failure
may be defined in many ways including non-optimal operation, an error count above a certain
threshold, or as complete failure to operate. The definition of failure can have a great impact
on the reported failure rate of a system. This chapter defines failure in an FPGA design
as any deviation in the output from an SEU-free version of the design. In later chapters, a
looser interpretation of failure will be used in some circumstances.
Failure Rate of an FPGA Design
The failure rate of an FPGA design due to SEUs is dependent upon the radiation
environment, the physical characteristics of the FPGA device, and the cross-section of (i.e.
the area taken up by) the design. The radiation environment defines the type, flux, and
energy of the charged particles in the environment. The flux is the rate at which the particles
flow through a certain area of space. The physical characteristics of the FPGA device define
how the radiation environment characteristics affect the rate of upset occurrence in the
FPGA fabric. The physical cross-section of the design determines the rate that SEUs occur
in that particular design.
Table 2.1 gives the expected upset rates for the Xilinx Virtex-4 FPGA family. The
device upset rates for low Earth orbit (LEO), polar orbit (Polar), and geosynchronous orbit
(GEO) for the Virtex-4 SX-55 device were obtained from [1]. This is a composite number of
upsets per device per day over several types of solar conditions. The configuration bit upset
rates are simply the device upset rates divided by the number of configuration memory cells
in the device and represent the number of configuration bits that are expected to be upset
per unit time in each radiation environment. In this dissertation, these rates will be taken as
constant upset rates for simplicity. Although upset rates may change over time even within
the same orbit (such as the increase in radiation when a satellite passes through the South
Atlantic Anomaly in a LEO orbit), such considerations are beyond the scope of this work.
Table 2.1: Orbit characteristics and composite upset rates for the Xilinx Virtex-4 SX-55 FPGA from [1].

Orbit     Altitude (km)   Inclination   Device Upset Rate    Configuration Bit Upset Rate
                                        (SEUs/device/s)      (SEUs/bit/s)
GEO       35,786          0°            3.46×10^-3           1.52×10^-10
GPS       20,200          55°           3.03×10^-3           1.34×10^-10
Molniya   39,305/1,507    63.2°         3.30×10^-3           1.45×10^-10
Polar     833             98.7°         8.01×10^-4           3.53×10^-11
LEO       560             35.0°         2.16×10^-5           9.52×10^-13
By combining the upset rate of the environment and the sensitivity of an FPGA
design, the failure rate can be predicted. The failure rate, λ, of an FPGA design is the
configuration bit upset rate multiplied by the number of sensitive configuration bits in a
particular FPGA design.
Sample Failure Rate Calculations
Table 2.2 shows the size and sensitivity of a small FIR filter design implemented
on a Virtex-4 SX-55 FPGA. For comparison, the first row of Table 2.2 shows the number
of FPGA “slice” resources and configuration bits available in the entire FPGA as if every
configuration bit were utilized and marked as sensitive. The second row of Table 2.2 indicates
that the FIR Filter design uses 2.9% of the slices in the FPGA device but only 0.189% of
the total configuration bits in the device are sensitive to SEUs. The third row shows these
same numbers for the FIR Filter design as protected with TMR. The TMR FIR Filter
design utilized roughly 3 times the amount of hardware as the original FIR Filter design, as
expected.
Table 2.2: Sensitivity of some simple designs and the Virtex-4 SX-55 device on which they were implemented.

Target           Slices Utilized    Sensitive Bits
Entire Device    24,576 (100%)      22,702,848 (100%)
FIR Filter       712 (2.90%)        42,978 (0.189%)
TMR FIR Filter   2,089 (8.50%)      2 (8.81×10^-6%)
Table 2.3 gives the failure rates (λ) for each design based on the number of sensitive
bits and the configuration bit upset rate for each orbit. This table is simply the configu-
ration bit upset rates in Table 2.1 multiplied by the number of sensitive bits in Table 2.2.
Predictably, the failure rate of the FIR Filter design is much lower than that of the entire
device and the failure rate of the TMR filter is lower still.
Table 2.3: Failure rates (λ) in various orbits for some simple designs and the Virtex-4 SX-55 device on which they were implemented. For the circuit designs, these rates are based on the number of sensitive bits in the design.

Target           GEO           GPS           Molniya       Polar         LEO
Entire Device    3.46×10^-3    3.03×10^-3    3.30×10^-3    8.01×10^-4    2.16×10^-5
FIR Filter       6.55×10^-6    5.74×10^-6    6.25×10^-6    1.52×10^-6    4.09×10^-8
TMR FIR Filter   3.05×10^-10   2.67×10^-10   2.91×10^-10   7.06×10^-11   1.90×10^-12

Although TMR theoretically offers complete protection of the configuration memory,
the fault injection experiments revealed two configuration bits that were still susceptible to
SEUs in the TMR design. This left the value of λ slightly higher than zero in all cases, but
the improvement over the unmitigated design is clear. The failure rate of the TMR design
is over 21,000 times lower than that of the original design.
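The failure rate calculation above is a single multiplication per design and orbit. The following is a minimal Python sketch using the values from Tables 2.1 and 2.2; the function and dictionary names are illustrative only. Because the tabulated bit upset rates are rounded to three digits, the products agree with Table 2.3 only to within that rounding.

```python
# Configuration bit upset rates in SEUs/bit/s, copied from Table 2.1.
BIT_UPSET_RATE = {
    "GEO": 1.52e-10,
    "GPS": 1.34e-10,
    "Molniya": 1.45e-10,
    "Polar": 3.53e-11,
    "LEO": 9.52e-13,
}

# Sensitive configuration bit counts, copied from Table 2.2.
SENSITIVE_BITS = {
    "Entire Device": 22_702_848,
    "FIR Filter": 42_978,
    "TMR FIR Filter": 2,
}


def failure_rate(sensitive_bits, bit_upset_rate):
    """Failure rate (failures/s) = sensitive bits x upset rate per bit."""
    return sensitive_bits * bit_upset_rate


def improvement_factor(lambda_unmitigated, lambda_mitigated):
    """Factor by which mitigation lowers the failure rate."""
    return lambda_unmitigated / lambda_mitigated


lambda_fir_geo = failure_rate(SENSITIVE_BITS["FIR Filter"], BIT_UPSET_RATE["GEO"])
lambda_tmr_geo = failure_rate(SENSITIVE_BITS["TMR FIR Filter"], BIT_UPSET_RATE["GEO"])
tmr_improvement = improvement_factor(lambda_fir_geo, lambda_tmr_geo)
```

Since both designs see the same bit upset rate, the improvement factor reduces to the ratio of sensitive bit counts, 42,978 / 2.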
Applications of Failure Rate
Failure rate can be used to describe the reliability of an FPGA design in several ways.
The raw failure rate of a design gives the number of expected failures per unit time. This
rate can also be used to compute other interesting characteristics of the design including
mean time to failure (MTTF), availability, and continuous time reliability. Each of these
measures may be used for different purposes in different applications.
The mean time to failure (MTTF) of a design is the expected time from initial
operation until a failure occurs. Assuming a constant failure rate in an unmitigated design,
MTTF is simply the inverse of that rate:
\[ \mathrm{MTTF} = \frac{1}{\lambda}. \tag{2.1} \]
This is a useful quantity that may be easier to visualize than the raw failure rate since it
has units of time (rather than 1/time). For example, in the GPS orbit, the MTTF of the
sample FIR Filter design would be 174,216 seconds. Thus after beginning operation, this
small design would not be expected to be affected by an SEU for roughly 2 days. This
is only an expected value, of course. Failure could occur much sooner or later than this
estimate.
Availability is another useful metric which describes the probability that a system
which includes a repair process is functioning correctly. System availability can be expressed
as a function of time, A(t). It is defined as the probability that a system is functional at
the instant of time t [2]. As t→∞, A(t) approaches its steady-state value, As. The steady-
state value expresses the fraction of a time interval that the correct output of the system is
available.
For a constant failure rate λ and constant repair rate µ, this steady-state availability
can be expressed as
\[ A_s = \frac{\mu}{\lambda + \mu}. \tag{2.2} \]
For an FPGA design, µ is the rate of configuration scrubbing.
Table 2.4 gives some sample availability estimations. The availability values are very
close to 1, so they are given in terms of the number of “nines”
in each case. For example, As = 0.90 has an availability of “one nine” and As = 0.999990
has an availability of “five nines.” For these examples, the scrubbing interval is chosen to be
100 ms, or µ = 1/0.1 scrubs per second (as in [1]).
Table 2.4: Number of “nines” in the steady-state availability (As) of some sample designs in terms of sensitive upsets with a scrubbing interval of 100 ms.

Target           GEO   GPS   Molniya   Polar   LEO
Entire Device    3     3     4         4       5
FIR Filter       6     6     6         6       8
TMR FIR Filter   10    10    10        11      12
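The steady-state availability and its count of “nines” can be computed directly from Equation (2.2). The following is a minimal Python sketch; the function names are illustrative, and counting nines from the unavailability λ/(λ + µ) rather than from 1 − As is a numerical choice made here to avoid losing precision when As is very close to 1.

```python
import math


def steady_state_availability(failure_rate, repair_rate):
    """Equation (2.2): A_s = mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


def count_nines(failure_rate, repair_rate):
    """Number of leading nines in A_s, computed from the unavailability
    lambda / (lambda + mu) so that tiny values are not rounded away."""
    unavailability = failure_rate / (failure_rate + repair_rate)
    return math.floor(-math.log10(unavailability))


SCRUB_RATE = 1 / 0.1  # mu: one scrub every 100 ms, i.e. 10 scrubs per second

# Failure rates (failures/s) taken from Table 2.3.
nines_fir_geo = count_nines(6.55e-6, SCRUB_RATE)
nines_tmr_leo = count_nines(1.90e-12, SCRUB_RATE)
availability_fir_geo = steady_state_availability(6.55e-6, SCRUB_RATE)
```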
Another common use of the failure rate metric is to predict the continuous-time
reliability of a system. An example of this function was plotted in Figure 2.2 to demonstrate
the advantage of adding a repair process to TMR. A continuous-time reliability function,
R(t), describes the probability of not observing any failure before time t [2]. Several fault
distribution models can be used to form the R(t) for a particular system. The most common
fault distribution used to describe the time between SEUs in a radiation environment is the
exponential distribution, for which the reliability function of an unmitigated design is:
\[ R(t) = e^{-\lambda t}. \tag{2.3} \]
Figure 2.6 plots the reliability function of the FIR Filter design in the GPS orbit, assuming
an exponential fault distribution. The MTTF of the design is also indicated.
Figure 2.6: The continuous-time reliability function for the FIR Filter design in a GPS orbit assuming an exponential fault distribution. [Plot: R(t), from 0 to 1, versus time, from 0 to 10×10^5 seconds, with the MTTF marked.]
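Equations (2.1) and (2.3) can be evaluated together. The following is a minimal Python sketch using the GPS-orbit failure rate of the FIR Filter design from Table 2.3; the function names are illustrative. It also shows a general property of the exponential model: the reliability at t = MTTF is e^-1, about 0.37, so a design is more likely than not to have failed by the time its MTTF is reached.

```python
import math


def mttf(failure_rate):
    """Equation (2.1): mean time to failure for a constant failure rate."""
    return 1.0 / failure_rate


def reliability(t_seconds, failure_rate):
    """Equation (2.3): probability of observing no failure before time t."""
    return math.exp(-failure_rate * t_seconds)


LAMBDA_FIR_GPS = 5.74e-6  # failures/s, FIR Filter design in the GPS orbit

mttf_seconds = mttf(LAMBDA_FIR_GPS)
reliability_at_mttf = reliability(mttf_seconds, LAMBDA_FIR_GPS)
```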
Given that there are many methods for expressing the reliability of a system, this
dissertation will use the most basic to evaluate and compare different designs. The failure
rates, λ, for each design and mitigation approach will be given for the orbits used above. For
cases in which some SEU mitigation scheme is used to improve the reliability of a design, an
improvement factor will be given. When comparing these designs, the factor of improvement
in failure rate, which is equivalent to the increase in MTTF, will be provided.
2.4 Summary
SRAM-based FPGAs operating in space are susceptible to radiation-induced upsets
(SEUs) in their configuration memory array. These SEUs corrupt the data processed within
the FPGA as well as the function of the circuit implemented. The configuration memory is
the most significant source of SEU-induced errors due to its large size compared to the other
memory elements in an FPGA.
SEUs in the configuration memory are soft errors and can be repaired through con-
figuration scrubbing. Scrubbing can be combined with SEU mitigation techniques to mask
errors caused by SEUs. TMR, the most popular SEU mitigation technique for FPGAs, is
very effective but is expensive in terms of circuit area and power. Some alternative mitiga-
tion techniques have been suggested that are specific to a particular application domain and
have a lower cost than TMR.
The effectiveness of different mitigation techniques can be compared with fault in-
jection. Fault injection is an effective method for measuring the sensitivity of a particular
design to configuration SEUs. The failure rate derived from the fault injection results can be
used to predict the reliability and availability of a mitigated design in a particular radiation
environment.
Chapter 3 presents a novel fault injection method for evaluating DSP and communi-
cations systems for susceptibility to SEUs which measures performance loss instead of raw
sensitivity. Chapter 4 then presents RPR as an application-specific mitigation technique for
DSP communications systems using this new measure of performance and compares it with
TMR.
CHAPTER 3. EVALUATING THE PERFORMANCE OF FPGA-BASED DSP SYSTEMS IN THE PRESENCE OF SEUS
Although all designs on SRAM-based FPGAs are susceptible to radiation-induced
SEUs, the effects of each SEU are not identical. In addition to characterizing a design’s
sensitivity, as described in Section 2.3, the performance of a design in the presence of SEUs
can be measured. SEUs can degrade the performance of a design by preventing it from
operating as intended. This performance metric should be specific to the design and system
in question. In a communications system, for example, this metric may be bit error rate
(BER).
This chapter introduces a new method for evaluating the impact of SEUs on commu-
nications systems. This new method will be used to evaluate several sample communications
systems. Using the application-specific performance metric of BER clearly shows that most
of the SEUs affecting these systems do not cause critical errors. Later chapters will use this
performance measurement approach to evaluate and compare SEU mitigation techniques.
3.1 Reliability Analysis of DSP Systems
The sensitivity metric described in Section 2.3 and in other previous work simply
marks configuration bits as sensitive or non-sensitive [61]. The advantage of using this
measure is that any system may be tested with the same simple criteria. In many cases,
however, considering all sensitive configuration bits to be equal gives an overly pessimistic
view of the system. By limiting the reliability analysis to a particular system or application
domain, however, it can be possible to utilize a more detailed measure of the performance
of the system in the soft error environment.
Figure 3.1: (a) Model of a DSP system with an additive noise component and (b) the same system with an additional SEU-induced noise component.
In many DSP systems, processing is expected to be somewhat imprecise due to noise.
Noise in a data transmission channel corrupts the signals being processed. This noise is
often expressed in terms of the ratio of the signal power to the noise power, or signal-to-noise
ratio (SNR). With more noise added to the input signal of the system (a lower SNR), the
output tends to degrade further.
Analog signal processing systems carry with them the notion of a noise figure, the
measure of noise added to the system by the processing element itself. The noise figure is
defined as the difference between the output SNR and the input SNR (in decibels (dB)):
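\[ \mathrm{NF} = 10 \log_{10}\left(\frac{\mathrm{SNR}_{\mathrm{in}}}{\mathrm{SNR}_{\mathrm{out}}}\right) = \mathrm{SNR}_{\mathrm{in,dB}} - \mathrm{SNR}_{\mathrm{out,dB}}. \tag{3.1} \]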
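As a small worked instance of Equation (3.1), the following Python sketch computes a noise figure from linear SNR values and confirms that it equals the difference of the two SNRs in decibels. The function names and the example values are illustrative only.

```python
import math


def to_db(power_ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(power_ratio)


def noise_figure_db(snr_in, snr_out):
    """Equation (3.1) with snr_in and snr_out given as linear power ratios."""
    return to_db(snr_in / snr_out)


# Example: a processing element that halves the SNR (100 in, 50 out)
# has a noise figure of about 3 dB.
nf = noise_figure_db(100.0, 50.0)
```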
In the presence of soft errors, the performance of a DSP system may degrade in a
similar manner to channel noise. In many instances, a DSP system corrupted by an SEU
may be thought of as having a noise figure, since the SEU adds “noise” to the system in a
similar way. Figure 3.1 compares a standard additive noise channel model (Figure 3.1(a))
with a model including this SEU-induced noise (Figure 3.1(b)).
Figure 3.2: Bit error rate curves for several phase-shift keying (PSK) communications systems with an AWGN channel. [Plot: BER, from 10^0 down to 10^-8 on a logarithmic axis, versus Eb/N0, from 0 to 18 dB, with curves for 16-PSK, 8-PSK, and BPSK/QPSK.]
3.2 Reliability Analysis of Communications Systems
This dissertation uses the digital communications application domain as a specific
example of a type of DSP system to be evaluated. Rather than using SNR to measure
performance, communications systems typically use bit error rate (BER), the number of
incorrectly-received bits in a signal divided by the total number of bits transferred.
The BER of a system is often reported as a function of the SNR, assuming an additive
white Gaussian noise (AWGN) channel. Gaussian noise is often a product of the thermal
noise in the analog components of the communications system [62]. Figure 3.2 plots the effect
of different levels of noise for several phase-shift keying (PSK) communications systems. As
the noise in the communications channel decreases, increasing the SNR (Eb/N0), the BER
for each system decreases.
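For reference, the theoretical curve for the BPSK/QPSK case can be generated from the standard AWGN result Pb = Q(sqrt(2 Eb/N0)) = (1/2) erfc(sqrt(Eb/N0)). The following is a minimal Python sketch of that textbook expression; it is not the code used to produce Figure 3.2, and the function name is illustrative.

```python
import math


def bpsk_ber(ebn0_db):
    """Theoretical bit error rate of coherent BPSK (and Gray-coded QPSK)
    in an AWGN channel, for Eb/N0 given in dB."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)          # dB to linear ratio
    return 0.5 * math.erfc(math.sqrt(ebn0))


# Sweep the SNR to trace out one BER curve.
curve = [(snr_db, bpsk_ber(snr_db)) for snr_db in range(0, 13, 2)]
```

Each 2 dB step in the sweep lowers the BER, and the curve crosses a BER of 10^-5 near Eb/N0 = 9.6 dB.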
Communications systems are designed to tolerate some degree of noise. Although
the BER of such systems is theoretically directly related to the SNR (in an ideal AWGN
channel), the important metric in the end is the BER. Thus if a system is able to tolerate
some SEU-induced noise in addition to the Gaussian noise (or other type of noise) it was
designed for, the BER may remain low for these SEUs.
Having the ability to tolerate some SEU-induced noise would also mean that some
forms of SEU-induced noise may be ignored when developing an SEU mitigation approach.
As this chapter will show, the percentage of SEUs that can be mitigated by the inherent
noise handling of the communications systems can be quite high. This allows the use of
a mitigation approach that reduces overhead cost significantly by ignoring these types of
upsets. Thus rather than protecting the entire circuit with TMR, incurring 200% overhead
or more, only the most critical parts of the circuit may need to be protected: those in which
the SEUs have the most detrimental effect on the system performance. This protection could
be added using TMR selectively or using some other approach.
Section 3.5 will show that the different possible SEUs in an FPGA communications
system cause varying levels of noise. Those that cause lower levels of noise could be ignored
by a mitigation approach. That section will also identify the sections of our test circuits that
are most susceptible to high levels of noise with the intent of focusing a mitigation approach
on those most critical sections.
3.3 Application-Specific Fault Injection
Section 2.3.2 described a method for evaluating the sensitivity of a design to SEUs
using fault injection. Performing traditional sensitivity measurement using fault injection,
however, is pessimistic in nature for DSP systems. This form of fault injection assumes that
each configuration cell in the design is equally important. For DSP and communications
systems, each SEU has a different effect on the output of the design. To measure these
differences, the application of the design must be accounted for.
For example, a DSP system could be evaluated in terms of the SNR loss at its output
instead of bitwise equality. Figure 3.3 shows an example of such a test system using a digital
filter as the test design. An identical set of inputs is fed to both a golden filter and a DUT
filter, in which faults are inserted. The output signal of each filter is recorded and the SNR
(in dB) of each is calculated. The difference between these SNR values is the noise figure
of the filter corrupted by that particular SEU. By testing every sensitive bit in the FPGA
design, a noise figure can be recorded for every possible SEU.
Figure 3.3: A fault injection flow for general DSP systems.
Figure 3.4: A fault injection flow for communications systems.
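The SNR-based comparison can be illustrated in software. The following is a minimal Python sketch under several assumptions that are not taken from the experiments: the 5-tap filter and its 8-bit coefficients are hypothetical, a fault is modeled as a single bit flip in one stored coefficient, and each SNR is measured against the golden filter's response to the noise-free input.

```python
import math
import random


def fir(coeffs, samples):
    """Direct-form FIR filter."""
    history = [0.0] * len(coeffs)
    out = []
    for x in samples:
        history = [x] + history[:-1]
        out.append(sum(c * h for c, h in zip(coeffs, history)))
    return out


def snr_db(reference, observed):
    """SNR of `observed`, treating its deviation from `reference` as noise."""
    signal = sum(r * r for r in reference)
    noise = sum((o - r) ** 2 for r, o in zip(reference, observed))
    return 10.0 * math.log10(signal / noise)


def flip_bit(coeff, bit, width=8):
    """Flip one bit of a two's-complement integer coefficient."""
    raw = (coeff & ((1 << width) - 1)) ^ (1 << bit)
    return raw - (1 << width) if raw >= 1 << (width - 1) else raw


rng = random.Random(1)
clean = [rng.choice((-1.0, 1.0)) for _ in range(2000)]
noisy = [s + rng.gauss(0.0, 0.3) for s in clean]   # additive channel noise
golden = [3, 20, 64, 20, 3]                        # hypothetical 8-bit taps

reference = fir(golden, clean)
snr_golden = snr_db(reference, fir(golden, noisy))


def snr_loss_db(tap, bit):
    """Noise figure of the filter with one coefficient bit upset."""
    dut = list(golden)
    dut[tap] = flip_bit(dut[tap], bit)
    return snr_golden - snr_db(reference, fir(dut, noisy))


loss_lsb = snr_loss_db(2, 0)   # low-order bit of the center tap
loss_msb = snr_loss_db(2, 6)   # high-order bit of the center tap
```

In this toy setup the low-order upset costs a small fraction of a decibel, while the high-order upset removes the center tap entirely and costs several decibels, which is the kind of difference that bitwise comparison cannot express.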
For a digital communications application, BER is the metric of interest. Figure 3.4
illustrates how a communications system could be tested. For each upset, a BER curve
similar to those in Figure 3.2 could be generated by sweeping the SNR of the signal at the
input to the FPGAs. The BER curve of the golden FPGA could then be compared to that of
the DUT FPGA for each upset. Thus the effect of every sensitive bit on the communications
system (not just the direct effect on the module being tested) can be determined.
Figure 3.5: Model of a binary pulse amplitude modulation (PAM) communications system with an AWGN channel.
3.4 Fault Injection for Communications Systems
This section describes the method used to test a communications system using fault
injection. Figure 3.5 shows the block diagram of a simple binary pulse amplitude modulation
(PAM) communications system with a Gaussian noise channel. This system will be used
throughout this dissertation as an example of a communications system. The binary PAM
system is the basis for many complex systems including other PAM systems and phase-shift
keying (PSK) systems. The demodulator portion of the system is the focus of the fault
injection experiments reported on here.
The fault injection experiments used to evaluate communications systems are similar
to those described in Section 2.3.2 except that BER is used as a measure of performance.
The fault injection hardware used is described in Appendix A in Section A.2.
The fault injection experiments were conducted as follows:
1. The demodulator design was targeted to a Xilinx Virtex-4 SX-55 FPGA (the DUT
FPGA).
2. The sensitive bits of the demodulator were identified according to the method described
in Section 2.3.2.
3. One of the bits in the set defined in Step 2 was inverted in the original, clean configu-
ration bit file and the FPGA was configured using this corrupt file.
4. For this configuration upset, a bit error rate curve was generated by processing the
modulated signal from the FuncMon with the system defined by the corrupted config-
uration bit file.
5. For the non-catastrophic SEUs, the bit error rate curve produced by the previous step
was compared to the curve for the system in the absence of upsets. The performance
loss (in terms of SNR) is estimated by taking the difference of the SNR value of each
curve at a bit error rate of 10^-5.
Steps 3–5 were repeated for each of the sensitive configuration bits, as defined in Step 2.
This process simulated the occurrence of all relevant SEUs, each being present one at a time
as expected in an FPGA system with a proper scrubbing system.
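The comparison in Step 5 amounts to finding where each BER curve crosses the target bit error rate and subtracting the two SNR values. The following is a minimal Python sketch of one way to do this, assuming linear interpolation of log10(BER) between measured points; the interpolation rule, the 1 dB sweep grid, and the synthetic 1 dB-shifted DUT curve are assumptions made for illustration, not details of the experiments.

```python
import math


def bpsk_ber(ebn0_db):
    """Theoretical BPSK bit error rate in AWGN, used as the golden curve."""
    return 0.5 * math.erfc(math.sqrt(10.0 ** (ebn0_db / 10.0)))


def snr_at_ber(snr_points_db, ber_points, target_ber):
    """SNR (dB) at which a measured BER curve crosses target_ber, using
    linear interpolation of log10(BER). Returns None if the curve never
    reaches the target, for example when it has an error floor."""
    target = math.log10(target_ber)
    points = list(zip(snr_points_db, ber_points))
    for (s0, b0), (s1, b1) in zip(points, points[1:]):
        l0, l1 = math.log10(b0), math.log10(b1)
        if l0 >= target >= l1 and l0 != l1:
            return s0 + (l0 - target) * (s1 - s0) / (l0 - l1)
    return None


snr_grid = list(range(0, 15))                       # hypothetical sweep, dB
golden_curve = [bpsk_ber(s) for s in snr_grid]
dut_curve = [bpsk_ber(s - 1.0) for s in snr_grid]   # synthetic 1 dB penalty

golden_snr = snr_at_ber(snr_grid, golden_curve, 1e-5)
dut_snr = snr_at_ber(snr_grid, dut_curve, 1e-5)
snr_loss = dut_snr - golden_snr
```

A curve that stays flat at a BER of 1/2 never crosses the target, so snr_at_ber returns None for it; this matches the restriction of Step 5 to non-catastrophic SEUs.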
Because the test is hardware-driven and requires minimal communication with the host PC,
the BER tests for an entire design ran quickly. These tests measured bit error
rates down to 10^-6 at SNR values of 2, 4, 6, 8, and 10 dB for every sensitive configuration
bit. For a filter design utilizing 50,000 configuration bits, these tests ran in about 18 hours.
For more details, see Appendix A.
This fault injection method is used in this as well as in future chapters. Sections 3.5
and 3.6 will show the results of using this application-specific method on feed-forward and
recursive communications systems, respectively. Chapters 4–5 will use this fault injection
method to evaluate and compare different SEU mitigation techniques.
3.5 Feed-forward System Experiments
This section reports on fault injection experiments run on a simple binary PAM
demodulator system. This demodulator design is shown in Figure 3.6. The matched filter
makes up the bulk of the system in terms of FPGA resources (see footnote 1). To simplify the analysis of
the fault injection results, the filter was the only block implemented on the test FPGA.
Footnote 1: The downsample block is simply an enabled register, and the decision block reads and inverts the MSB of the downsample block output as a comparison against zero in two’s complement arithmetic.
Figure 3.6: A high-level block diagram of the receiver system.
3.5.1 Experimental Configuration
A fault injection experiment with the method described in Section 3.4 was used to
examine the impact of SEUs on system performance for several versions of the matched
filter in Figure 3.6. In these experiments, the pulse shape of the modulating and matched
filters was the square-root raised-cosine (SRRC) pulse shape with excess bandwidth α using
Lp = 3 [63]. In each case, the matched filter operated at N = 4 samples/bit. Filter
implementations with 16-bit filter coefficients and 8-bit filter coefficients were examined.
The inputs and filter registers had the same bit-widths as the coefficients. Two filter designs
were considered:
• A direct form 1 FIR (finite impulse response) filter, as shown in Figure 3.7 (a), was
constructed directly from FPGA slices (see footnote 2).
• An alternative approach, based on the built-in DSP blocks of the Xilinx FPGA (called
“DSP48” blocks), was used to design a transposed direct form 1 FIR filter, as illustrated
in Figure 3.7 (b).
Six combinations of these design parameters were investigated:
• “16b logic α = 1.0” means the SRRC pulse shape with α = 1.0 using 16-bit inputs and
filter coefficients in the arrangement illustrated in Figure 3.7 (a).
• “16b logic α = 0.25” means the SRRC pulse shape with α = 0.25 using 16-bit inputs
and filter coefficients in the arrangement illustrated in Figure 3.7 (a).
• “8b logic α = 1.0” means the SRRC pulse shape with α = 1.0 using 8-bit inputs and
filter coefficients in the arrangement illustrated in Figure 3.7 (a).
• “8b logic α = 0.25” means the SRRC pulse shape with α = 0.25 using 8-bit inputs and
filter coefficients in the arrangement illustrated in Figure 3.7 (a).
• “16b dsp48 α = 1.0” means the SRRC pulse shape with α = 1.0 using 16-bit inputs
and filter coefficients in the arrangement illustrated in Figure 3.7 (b).
• “16b dsp48 α = 0.25” means the SRRC pulse shape with α = 0.25 using 16-bit inputs
and filter coefficients in the arrangement illustrated in Figure 3.7 (b).
Figure 3.7: The FIR filter structures examined in the fault injection experiments: (a) direct form 1 FIR filter; (b) transposed direct form 1 FIR filter. [Diagram labels: input r(nT), output x(nT), unit delays z^-1, tap coefficients p(−LpN), …, p(0), …, p(LpN), and, in (b), “DSP Block”.]
Footnote 2: The hardware architecture used for this type of filter is illustrated in Figure C.1.
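The two structures compute the same output; they differ only in where the delay registers sit. The following is a minimal Python sketch of both forms at the behavioral level. It is an illustration only: it does not model bit widths, pipelining, or the actual slice and DSP48 implementations, and the function names are hypothetical.

```python
def fir_direct(coeffs, samples):
    """Direct form: the input passes through a delay line and all tap
    products are summed in each cycle."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_transposed(coeffs, samples):
    """Transposed form: the input is broadcast to every multiplier and the
    delays sit between the adders, forming a multiply-add-register chain."""
    n = len(coeffs)
    state = [0] * n            # state[k] is the register feeding adder k
    out = []
    for x in samples:
        products = [c * x for c in coeffs]
        out.append(products[0] + state[0])
        # each register captures its own product plus the next register's value
        state = [products[k + 1] + state[k + 1] for k in range(n - 1)] + [0]
    return out


taps = [1, -2, 3, 4]
stimulus = [5, 0, -1, 2, 7, 3]
same_output = fir_direct(taps, stimulus) == fir_transposed(taps, stimulus)
```

The multiply-add-register chain of the transposed form is one reason it is a natural fit for cascaded multiply-accumulate blocks, although the mapping onto DSP48 blocks used in the experiments is not reproduced here.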
3.5.2 Experimental Results
Results from these experiments confirmed that the sensitive SEUs do in fact differ in
their impact on BER. Some examples of the bit error rate curves resulting from the fault-
injection experiment are illustrated in Figure 3.8. The examples included in the figure are
representative cases for what we consider to be four types of effects. We label these SEU
categories “Class 1 SEU” through “Class 4 SEU.”
Figure 3.8: BER plot showing representative samples from each of the four error classes from the 16-bit logic-based FIR filter with α = 1.0. [Plot: BER, from 10^0 down to 10^-7 on a logarithmic axis, versus Eb/N0, from 0 to 12 dB, with curves labeled Theoretical, Class 1 SEU, Class 2 SEU, Class 3 SEU, and Class 4 SEU.]
In addition to categorizing the SEUs, this section presents the location and function
of the different classes of SEUs. This analysis is based on a reverse-engineered configuration
bitstream in conjunction with the FPGA design implementation file in the Xilinx Design
Language (XDL) format. Knowing the criticality of each configuration bit provides insight
which can be very useful when crafting a reduced-cost mitigation technique.
A description of the SEU classes and their main causes is summarized as follows:
1. A Class 1 SEU causes virtually no perturbation in the bit error rate performance of
the matched filter detector. The measured loss is less than 0.2 dB, allowing for mea-
surement error of the SNR loss value. The SEUs in this class are those that alter the
memory cells defining the low-order bits of the filter coefficients, the low-order bits of
the outputs of the arithmetic units (i.e. the addition and multiplication blocks), etc.
2. A Class 2 SEU degrades the bit error rate performance in the same way an additional
source of additive noise degrades performance. This effect can be thought of as either an
implementation loss or, as mentioned earlier, as a noise figure. Class 2 SEUs are those
that impact the memory cells defining the middle-order bits of the filter coefficients,
the middle-order bits of the outputs of the arithmetic units, etc.
3. A Class 3 SEU produces an unusably high bit error rate floor (see footnote 3). SEUs impacting the
memory cells that define the high-order bits in the filter coefficients, the high-order
bits in the outputs of the arithmetic units, etc. are the main causes of SEUs in this
category. These SEUs are considered catastrophic.
4. A Class 4 SEU produces a bit error rate of 1/2. These SEUs are also catastrophic and
are caused by faults in the memory cells defining the clock distribution network, the
global reset signal, the most significant bits (MSBs) of the matched filter output, etc.
The number of SEUs in each class is a function of the properties of the filter coefficients
(controlled in these experiments using the excess bandwidth parameter, α), the number of
bits used to quantify the filter coefficients, and the degree to which built-in units such as the
DSP48 blocks are used.
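The four classes can be expressed as a simple decision rule. The following is a minimal Python sketch; only the 0.2 dB limit for Class 1 and the BER of 1/2 for Class 4 come from the descriptions above. The tolerance around 1/2 and the convention that an SNR loss of None means the BER curve never reached the target rate (an error floor) are assumptions made for illustration.

```python
def classify_seu(snr_loss_db, ber_at_max_snr, half_tolerance=0.05):
    """Assign an SEU to Class 1-4.

    snr_loss_db    : SNR loss at the target BER, or None if the DUT curve
                     never reaches the target (assumed to indicate a floor)
    ber_at_max_snr : BER measured at the highest SNR tested
    half_tolerance : hypothetical tolerance for deciding that BER is 1/2
    """
    if abs(ber_at_max_snr - 0.5) <= half_tolerance:
        return 4          # output carries no information
    if snr_loss_db is None:
        return 3          # unusably high error floor
    if snr_loss_db < 0.2:
        return 1          # within measurement error
    return 2              # behaves like additional additive noise


examples = [
    classify_seu(0.1, 1e-6),
    classify_seu(1.5, 1e-5),
    classify_seu(None, 1e-2),
    classify_seu(None, 0.5),
]
```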
Graphical representations of the impact of all SEUs on the six designs used in the fault
injection experiments are shown in Figures 3.9–3.14. For each design, five fault injection
tests were run for input SNR values of 2, 4, 6, 8, and 10 dB. Each plot shows five histograms,
one for each of these tests. Each histogram shows the BER values measured for
each upset sensitive configuration bit in the design. Combined, the five histograms give a
summary of the effects of SEUs on each design similar to BER curves.
Figure 3.9: BER plot for the 16-bit logic-based FIR filter with α = 1.0.
Figure 3.10: BER plot for the 16-bit logic-based FIR filter with α = 0.25.
Figure 3.11: BER plot for the 8-bit logic-based FIR filter with α = 1.0.
Figure 3.12: BER plot for the 8-bit logic-based FIR filter with α = 0.25.
Footnote 3: Note that our simulations ran only long enough to estimate bit error rates greater than 10^-6 with any useful fidelity. It could be the case that many of the Class 2 SEUs really do have a bit error rate floor somewhere below 10^-6. A case could be made that these Class 2 SEUs should be Class 3 SEUs. Given the fact that most modern digital communication systems use some form of error control coding and that any useful error correcting code can easily correct random errors at the rate of 10^-6 or less, there is little merit in determining if such low bit error rate floors exist.
These plots dramatically illustrate how the majority of the SEUs are Class 1 and
Class 2 SEUs. For example, the Class 4 errors can be seen in the histogram spikes at a BER
of 0.5 (seen between 0 and 1e-1). The Class 1 and 2 errors are concentrated near the BER
curve of the unmitigated design, marked in black. Or, stated in another way, a relatively
small percentage of the SEUs are catastrophic.
Numerical summaries are tabulated in Table 3.1. An important observation is that
the distribution of SEUs between Class 1 and Class 2 depends on the excess bandwidth α.
Figure 3.13: BER plot for the 16-bit DSP48-based FIR filter with α = 1.0.
Figure 3.14: BER plot for the 16-bit DSP48-based FIR filter with α = 0.25.
This is due to the fact that when α = 1.0, almost half of the filter coefficients are very close
to 0. In fact, when 8-bit coefficients are used, these small filter coefficients are quantized to 0.
The FPGA synthesis tools are smart enough to recognize that “multiplication by 0 followed
by accumulation” is unnecessary and does not devote any resources to this operation. When
α = 0.25, most of the filter coefficients are sufficiently non-zero to survive quantization.
Hence, the shortcut is not available to the synthesis tools and FPGA resources are devoted
to the computation. The number of FPGA slices used as well as the total number of utilized
bits in the design are larger for the α = 0.25 design than for the corresponding α = 1.0 design.
Interestingly, the percentage of non-catastrophic SEUs remains approximately constant.
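The quantization effect described above can be illustrated with a minimal sketch. The tap values are hypothetical (they are not the filter coefficients used in the experiments), and round-to-nearest quantization with 7 fractional bits for the 8-bit case is an assumption made here for illustration.

```python
def quantize(coeff, frac_bits):
    """Round a real coefficient to a signed fixed-point code with frac_bits fractional bits."""
    return round(coeff * (1 << frac_bits))


def count_zero_taps(coeffs, frac_bits):
    """Count taps that quantize to exactly zero and so need no multiply-accumulate."""
    return sum(1 for c in coeffs if quantize(c, frac_bits) == 0)


# Hypothetical, symmetric tap values chosen only for illustration.
taps = [0.0005, -0.0012, 0.0031, -0.0020, 0.25, 0.5, 0.25,
        -0.0020, 0.0031, -0.0012, 0.0005]

zeros_8b = count_zero_taps(taps, frac_bits=7)    # 8-bit word: sign + 7 fractional bits (assumed)
zeros_16b = count_zero_taps(taps, frac_bits=15)  # 16-bit word: sign + 15 fractional bits
print(zeros_8b, zeros_16b)
```

With these made-up taps, the eight small coefficients vanish at 8 bits and all survive at 16 bits, which is the kind of difference a synthesis tool can exploit by omitting the zero-valued multiplications.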
The SEUs may also be quantified by the SNR loss they cause. These results are
summarized in Table 3.2. These data define a cumulative distribution of the SNR loss (see footnote 4) for
each of the 6 designs. As an example, consider the designs using 16-bit filter coefficients
with the filter structure of Figure 3.7 (a). Approximately 14% of all possible SEUs lead to
an SNR loss in excess of 1 dB. In other words, 86% of all sensitive SEUs give an SNR loss
less than 1 dB. The consequence of this observation is significant. If a 1 dB SNR loss is
acceptable, only 14% of the SEUs need to be targeted for mitigation. This represents a huge
potential savings in FPGA resources.
Footnote 4: Note that Class 3 and Class 4 SEUs have infinite SNR loss and are included in the percentages shown.
Table 3.1: Number of SEUs causing each class of effect for several designs.

Design               Slices/DSP48s  Class 1 Bits  Class 2 Bits  Class 3 Bits  Class 4 Bits  Total Utilized Bits  Total Cat. Bits
16b logic α = 1.0    712/0          34,829 (81%)  5,612 (13%)   1,638 (3.8%)  899 (2.1%)    42,978               2,537 (5.90%)
16b logic α = 0.25   1,029/0        50,798 (73%)  14,479 (21%)  2,908 (4.2%)  1,022 (1.5%)  69,207               3,930 (5.68%)
8b logic α = 1.0     194/0          3,158 (33%)   4,914 (51%)   768 (7.9%)    841 (8.7%)    9,681                1,609 (16.62%)
8b logic α = 0.25    297/0          2,210 (12%)   12,445 (70%)  1,816 (10%)   908 (5.1%)    17,779               2,724 (15.32%)
16b dsp48 α = 1.0    554/13         22,047 (75%)  5,498 (19%)   867 (2.9%)    1,118 (3.8%)  29,530               1,985 (6.72%)
16b dsp48 α = 0.25   554/13         24,140 (60%)  13,861 (34%)  1,263 (3.1%)  1,031 (2.6%)  40,295               2,294 (5.69%)
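As a quick consistency check on Table 3.1, the catastrophic percentage of each design can be recomputed as the Class 3 plus Class 4 bit counts divided by the total utilized bits. The sketch below copies those three columns from the table; the dictionary layout and function name are illustrative choices, not part of the original work.

```python
# (Class 3 bits, Class 4 bits, total utilized bits) copied from Table 3.1.
TABLE_3_1 = {
    "16b logic, alpha=1.0":  (1638, 899, 42978),
    "16b logic, alpha=0.25": (2908, 1022, 69207),
    "8b logic, alpha=1.0":   (768, 841, 9681),
    "8b logic, alpha=0.25":  (1816, 908, 17779),
    "16b dsp48, alpha=1.0":  (867, 1118, 29530),
    "16b dsp48, alpha=0.25": (1263, 1031, 40295),
}


def catastrophic_percent(class3, class4, total):
    """Percentage of utilized bits whose upset is catastrophic (Class 3 or Class 4)."""
    return 100.0 * (class3 + class4) / total


percents = {name: catastrophic_percent(*row) for name, row in TABLE_3_1.items()}
for name, pct in percents.items():
    print(f"{name:24s} {pct:6.2f}%")
```

The recomputed values agree with the Total Cat. Bits column to two decimal places, and they show the 8-bit designs standing apart from the 16-bit designs.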
Table 3.2: Percentage of SEUs causing certain SNR losses at a BER of $10^{-5}$.
Design > 0.2 dB > 0.5 dB > 1 dB > 3 dB > 6 dB
16b logic α = 1.0 18.96% 16.37% 14.32% 11.20% 9.21%
16b logic α = 0.25 26.60% 17.39% 14.36% 10.58% 9.08%
8b logic α = 1.0 67.38% 51.93% 43.65% 33.33% 26.20%
8b logic α = 0.25 85.32% 45.27% 37.19% 27.92% 24.11%
16b dsp48 α = 1.0 25.34% 22.13% 20.18% 15.92% 12.05%
16b dsp48 α = 0.25 40.09% 22.54% 18.40% 13.69% 10.92%
The situation is less dramatic for the design based on 8-bit filter coefficients. This
is because a higher percentage of the filter coefficient bits are significant in terms of how
much they contribute to the output of the filter. These coefficients are the same as those of the
16-bit filters, but the lower-order (less significant) half of the bits has been truncated. As a
result, the Class 2 SEUs are associated with higher SNR losses relative to the corresponding
16-bit designs and a larger percentage of the SEUs are Class 3 SEUs.
3.5.3 Application-Specific Failure Rate
Using this fault injection data, the failure rate numbers in Section 2.3.3 can be updated
for this specific application. The failure rate of a system, of course, depends on the
definition of failure for that specific application. For some applications, failure may be
defined as a drop in performance below a certain threshold.
For example, when calculating the failure rate of a communications system, it is
reasonable to define failure as the bit error rate of the system rising above $10^{-3}$. Or, in the
context of the results presented here, failure could be an SEU causing an SNR loss of 3 dB
from the theoretical value at a BER of $10^{-5}$, or failure could be defined as any catastrophic
SEU.
Table 3.3 presents the failure rate predictions for these filter designs in various orbit
environments. This table shows the failure rates for these filters when considering every
sensitive upset a failure. In contrast, Table 3.4 presents the failure rates when considering
only catastrophic upsets as failures. As expected, the failure rates for catastrophic upsets
are roughly an order of magnitude less than for sensitive upsets. These tables emphasize the
importance of defining failure appropriately for the system in question.
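The "roughly an order of magnitude" gap can be checked directly from the GEO columns of Tables 3.3 and 3.4. The sketch below only divides the tabulated rates; the names are illustrative, and the underlying rate model (Section 2.3.3) is not reproduced here.

```python
# (sensitive rate, catastrophic rate) for the GEO orbit, copied from Tables 3.3 and 3.4.
GEO_RATES = {
    "16b logic, alpha=1.0":  (6.55e-6, 3.87e-7),
    "16b logic, alpha=0.25": (1.05e-5, 5.99e-7),
    "8b logic, alpha=1.0":   (1.48e-6, 2.45e-7),
    "8b logic, alpha=0.25":  (2.71e-6, 4.15e-7),
    "16b dsp48, alpha=1.0":  (4.50e-6, 3.03e-7),
    "16b dsp48, alpha=0.25": (6.14e-6, 3.50e-7),
}


def reduction_factor(sensitive, catastrophic):
    """How many times lower the failure rate is when only catastrophic upsets count."""
    return sensitive / catastrophic


factors = {name: reduction_factor(*rates) for name, rates in GEO_RATES.items()}
for name, factor in factors.items():
    print(f"{name:24s} {factor:5.1f}x lower")
```

The factors range from about 6 for the 8-bit designs to about 17 for the 16-bit designs. Their reciprocals land close to the catastrophic percentages in Table 3.1, which is what one would expect if the failure rate scales with the number of bits counted as failures (an inference from the tables, not a statement from the text).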
Table 3.3: Sensitive failure rates (λ) for several designs in various orbits.

Design               GEO         GPS         Molniya     Polar       LEO
16b logic α = 1.0    6.55×10^-6  5.74×10^-6  6.25×10^-6  1.52×10^-6  4.09×10^-8
16b logic α = 0.25   1.05×10^-5  9.24×10^-6  1.01×10^-5  2.44×10^-6  6.59×10^-8
8b logic α = 1.0     1.48×10^-6  1.29×10^-6  1.41×10^-6  3.42×10^-7  9.21×10^-9
8b logic α = 0.25    2.71×10^-6  2.37×10^-6  2.58×10^-6  6.27×10^-7  1.69×10^-8
16b dsp48 α = 1.0    4.50×10^-6  3.94×10^-6  4.29×10^-6  1.04×10^-6  2.81×10^-8
16b dsp48 α = 0.25   6.14×10^-6  5.38×10^-6  5.86×10^-6  1.42×10^-6  3.84×10^-8
3.6 Recursive System Experiments
In addition to the simple feed-forward system demonstrated in the previous section,
we have tested the SEU robustness of a communications system with a recursive structure.
Table 3.4: Catastrophic failure rates (λ) for several designs in various orbits.

Design               GEO         GPS         Molniya     Polar       LEO
16b logic α = 1.0    3.87×10^-7  3.39×10^-7  3.69×10^-7  8.95×10^-8  2.41×10^-9
16b logic α = 0.25   5.99×10^-7  5.25×10^-7  5.71×10^-7  1.39×10^-7  3.74×10^-9
8b logic α = 1.0     2.45×10^-7  2.15×10^-7  2.34×10^-7  5.68×10^-8  1.53×10^-9
8b logic α = 0.25    4.15×10^-7  3.64×10^-7  3.96×10^-7  9.61×10^-8  2.59×10^-9
16b dsp48 α = 1.0    3.03×10^-7  2.65×10^-7  2.89×10^-7  7.01×10^-8  1.89×10^-9
16b dsp48 α = 0.25   3.50×10^-7  3.06×10^-7  3.33×10^-7  8.10×10^-8  2.18×10^-9
This type of test is significant because recursive (or feedback) systems often have more
complex error dynamics than feed-forward systems. This test was intended to determine if
the conclusions from the previous section would hold for a recursive system as well. This
section presents the experimental results from fault injection on a binary PAM receiver
with a symbol timing synchronization phase-locked loop (PLL). The full receiver system is
pictured in Figure 3.15.
Figure 3.15: Block diagram of the binary PAM demodulator with timing synchronization.
3.6.1 Experimental Configuration
In this experiment, the matched filter pulse shape was the square-root raised-cosine
(SRRC) pulse shape with excess bandwidth α = 0.5 using Lp = 3. The matched filter
operated at N = 4 samples/bit. The unmitigated filter used 16-bit registers, coefficients,
and input all using signed fixed-point numbers with 15 fractional bits.
The timing recovery loop operates at a rate of 2 samples/bit. The interpolator is a
piecewise parabolic Farrow interpolator [63]. The TED block is a zero-crossing timing error
detector. The loop filter is a first order filter—a single constant multiplier. The NCO is
the numerically-controlled oscillator which generates the timing synchronization pulses and
provides the fractional interpolation interval back to the interpolator.
3.6.2 Experimental Results
This experiment confirms that the results presented for the feed-forward system are
valid for this more complex communications receiver system. Tables 3.5 and 3.6 show the
numerical results for the binary PAM receiver system. The results are similar to those
observed for the feed-forward system. The total number of configuration bits is larger for
this design due to the added components. Still, the number of catastrophic bits was only
6.24% of the total sensitive configuration bits.
Table 3.5: Number of SEUs causing each class of effect for the binary PAM demodulator with timing synchronization.

Design           Slices Used  Class 1 Bits  Class 2 Bits  Class 3 Bits  Class 4 Bits  Total Utilized Bits  Total Cat. Bits
recursive demod  1,410        75,783        14,335        4,450         1,548         96,116               5,998 (6.24%)
Figure 3.16 shows the BER histogram for the recursive system. Similar to the feed-
forward system, most of the SEUs created BER curves near the theoretical curve. This is
reflected in the table by the number of Class 1 and 2 SEUs recorded compared to the total.
Table 3.6: Percentage of SEUs causing certain SNR losses at a BER of $10^{-5}$ for the binary PAM demodulator.

Design           > 0.2 dB  > 0.5 dB  > 1 dB  > 3 dB  > 6 dB
recursive demod  21.15%    15.19%    13.12%  9.917%  8.400%

Figure 3.16: BER plot for the unmitigated binary PAM receiver system with timing synchronization.
An analysis of the catastrophic bits again reveals a bias towards the most significant bits of
computation and the clock and global reset signals.
Tables 3.7 and 3.8 give the failure rate numbers for the recursive receiver design. As
with the feed-forward designs, the catastrophic failure rates are significantly lower than the
standard sensitive failure rates.
Table 3.7: Sensitive failure rates (λ) for the recursive demodulator design in various orbits.

Design           GEO         GPS         Molniya     Polar       LEO
recursive demod  1.47×10^-5  1.28×10^-5  1.40×10^-5  3.39×10^-6  9.15×10^-8

Table 3.8: Catastrophic failure rates (λ) for the recursive demodulator design in various orbits.

Design           GEO         GPS         Molniya     Polar       LEO
recursive demod  9.14×10^-7  8.01×10^-7  8.72×10^-7  2.12×10^-7  5.71×10^-9
3.7 Summary
This chapter presented an application-specific method for evaluating the impact of
SEUs on FPGA-based communications systems. The experimental results confirm that it
can be very useful to consider the specific application in question when measuring a design’s
performance in the presence of SEUs.
The experiments suggest that not all SEUs need to be targeted for mitigation in
an FPGA design subject to SEUs, depending on the design and application. This desirable
result follows from the fact that the figure of merit is bit error rate (rather than bit-level accuracy)
and that the majority of the SEUs have the same effect as additive noise. Analysis of the
experimental data showed that the sections that must be protected from SEUs are the clock
distribution networks, global reset, and the MSBs of the arithmetic modules.
Because not all SEUs cause critical errors, mitigation techniques with much lower
cost are possible. For example, one might use TMR to protect only the upper bits of
the arithmetic modules, leaving the lower bits unprotected. This type of approach may
substantially reduce the resources required to produce a reliable system. Chapter 4 will
describe a mitigation technique that can be used for this purpose.
CHAPTER 4. REDUCED PRECISION REDUNDANCY
With the knowledge that few SEUs cause critical errors in a communications system
and knowing which portions of the circuit to protect, a reduced-cost mitigation approach can
be suggested. The ideal candidate protects the clock, global reset, and the most significant
bits of computation while possibly ignoring the least significant bits in order to mitigate
critical SEUs at a lower cost than TMR.
This chapter provides background on reduced-precision redundancy (RPR), a reduced-cost
mitigation technique designed to protect arithmetic computation. RPR focuses on protecting
the most significant bits of computation, those which were found in Chapter 3 to
be associated with catastrophic SEUs. RPR can also be implemented such that the global
clock and reset signals are protected as well. With this focus, RPR is a good candidate for
reducing the cost of SEU mitigation on FPGA-based DSP systems.
This chapter describes the mechanics of RPR, the various operating modes of an
RPR system in the presence of SEUs, and discusses the size vs. performance trade-off of RPR. It
lays the groundwork for a comparison of different RPR techniques which will be presented
in Chapter 5. The chapter concludes with an initial demonstration of RPR on the simple
communications system designs of Section 3.5. Fault injection experiments show that RPR
is well-suited to protect against the most critical SEUs at a much lower cost than TMR.
4.1 Previous Work
RPR was introduced by Shim, et al. as part of a power reduction technique for
ASIC-based DSP systems [4], [54]. Shim used RPR to overcome errors introduced by voltage
overscaling (VOS), which reduces the supply voltage of a circuit to save power. This voltage
reduction slows the operation of the circuit and can cause intermittent errors at the circuit
output when the longer logic paths are excited. RPR was used to reduce the effects of these
intermittent errors, which had the tendency to occur in the most significant bits of the circuit
output since those generally correspond to the longer paths through the logic. Shim’s version
of RPR is referred to as Threshold RPR in this dissertation.
Shim later modified this RPR technique and analyzed it as a means for protecting
against deep-submicron noise and soft errors in ASIC-based DSP systems [47]. This modification
of RPR is more suited towards SEU mitigation for FPGAs than the original. In
a radiation environment, SEUs are distributed uniformly across an FPGA similar to soft
errors in ASIC systems. These errors are not biased towards the most significant bits as in
the VOS case. Still, because SEUs may impact the logic implemented by the FPGA, soft
errors in ASIC systems tend to be less severe than those of concern in FPGAs.
Snodgrass presented an alternate RPR configuration (called Bounded RPR in this
dissertation) and demonstrated it on FPGAs in [5]. Sullivan later provided details on how
to implement Bounded RPR on several elementary arithmetic operations and characterized
the performance of some RPR systems in simulation [64]. Both of these authors confirmed
that RPR could be a valuable SEU mitigation technique for certain FPGA-based systems.
This dissertation expands on previous work regarding RPR in several ways. Fault
injection experiments in this chapter and in Chapters 5–6 make direct comparisons of RPR
with TMR, clearly showing their costs and benefits. Chapter 5 compares several variations of
RPR, including those suggested by Shim and Snodgrass, in order to determine the best option
for FPGA implementation. Chapter 6 then presents methods to optimize the application of
RPR on communications systems and demonstrates how to apply RPR to complex systems
which are not completely suited to RPR.
4.2 Overview
RPR is a redundancy technique used to protect the most significant bits of an arithmetic
operation. RPR can be implemented in several different ways, but the core idea is
the same: by focusing redundancy on the most significant bits of computation, RPR can be
implemented with a lower cost than TMR. Each implementation of RPR includes a reduced-
precision (RP) replica of the module in question and uses the reduced-precision output as a
rough check on the output of the full-precision (FP) module, as illustrated in Figure 4.1.
Figure 4.1: Simplified block diagram of a module protected with reduced-precision redundancy (RPR).
The intent of RPR is to use the output of the full-precision module when it is operating
correctly and to use the output of the reduced-precision module otherwise. The output of
the reduced-precision module (which is assumed to be free of soft errors) is compared to
that of the full-precision module in order to determine whether the full-precision module is
operating correctly or not. If the full-precision module is found to be in error, the estimate
produced by the reduced-precision module is used instead.
In this dissertation, the following shorthand is used to refer to the output signals
involved in RPR:
• FPout - the output of the full-precision module
• RPout - the output of the reduced-precision module
• FPtrue - the ideal (full-precision) output
• RPRout - the output of the RPR module as a whole
The core functionality of RPR is summarized as follows:
if FPout ≈ RPout then
RPRout ← FPout
else
RPRout ← RPout
end if,
where the specifics of the approximation operation are dependent on the RPR implementation.
The difference between the RPRout signal and the desired FPtrue signal is the error
signal, εRPR, of the RPR module. The error of the RPR module, then, is defined as
εRPR = FPtrue − RPRout. (4.1)
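The selection rule and the error definition in (4.1) can be sketched in a few lines. The "approximately equal" test is modeled here as a simple magnitude threshold, which is an assumption made for illustration since the text leaves the comparison to the specific RPR implementation; the function names and the numeric values are likewise hypothetical.

```python
def rpr_select(fp_out, rp_out, threshold):
    """Use the full-precision output unless it disagrees with the RP estimate."""
    if abs(fp_out - rp_out) <= threshold:  # assumed form of "FPout ≈ RPout"
        return fp_out
    return rp_out


def rpr_error(fp_true, rpr_out):
    """Error of the RPR module as defined in (4.1)."""
    return fp_true - rpr_out


fp_true = 0.3359375          # ideal full-precision output (illustrative value)
rp_out = 0.3125              # the same value truncated to 4 fractional bits
threshold = 2.0 ** -4        # hypothetical comparison threshold

no_upset = rpr_select(fp_true, rp_out, threshold)                 # FP output is correct
small_upset = rpr_select(fp_true + 2.0 ** -7, rp_out, threshold)  # low-order bit disturbed
large_upset = rpr_select(fp_true + 0.5, rp_out, threshold)        # high-order bit disturbed

for label, out in [("no upset", no_upset), ("small upset", small_upset),
                   ("large upset", large_upset)]:
    print(f"{label:12s} RPRout={out:.7f} error={rpr_error(fp_true, out):+.7f}")
```

In this sketch the small disturbance passes the comparison and is carried to the output, while the large one is replaced by the reduced-precision estimate, so the output error stays bounded in both cases.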
RPR can be operating in three different modes: full-precision perfect, full-precision
degraded, and reduced-precision.
• In full-precision perfect mode, there are no upsets in the FP module and the output of
the system is the correct full-precision output. In this case, εRPR = 0.
• In full-precision degraded mode, the FP module is not operating perfectly, but its output
is still approximately equal to the reduced-precision output, so the slightly-degraded
FP output is used. In this case, εRPR = FPtrue − FPout.
• In reduced-precision mode, the FP module output is different enough from the RP
output to determine that there is an error in the FP module and the RP output is
used instead of the erroneous FP output. In this case, εRPR = FPtrue − RPout.
In Shim’s initial implementation of RPR, the reduced-precision module was small
enough to avoid the VOS errors which affected the full-precision module, which were his
primary concern. The reduced-precision module was thus assumed to always be a valid
estimator of the full-precision output. In a soft error environment where both the FP and
RP modules may be affected, a second reduced-precision module is used to identify the
problem module [47] (see footnote 1). Since FPGAs operating in an SEU environment fall into the soft
error category, the RPR implementations presented here use two reduced-precision modules,
as in Figure 4.2. In this case, the RPR decision block also forms the reduced-precision output
from the three inputs.
With three separate modules, RPR can also be designed to protect the clock and reset
signals. If these global input signals are triplicated, as they often are with TMR, each of the
three modules of RPR can receive a distinct set. With this architecture, if one of these signal
replicas is upset, the worst case is that one of the three modules is disabled completely. With
a single module disabled, RPR can still operate correctly. Thus, in addition to protecting
the most significant bits of computation, the critical global signals are also protected.
Figure 4.2: Simplified block diagram of a module protected with reduced-precision redundancy (RPR) designed for soft error environments.
RPR is similar to TMR, but sacrifices some of the protection offered by TMR in
order to reduce area cost. First, while TMR can protect any type of circuitry, RPR is only
suitable for arithmetic operations. Second, while TMR uses exact replication to produce
an error-free output, RPR uses smaller reduced-precision modules to limit the output error.
RPR has an advantage over TMR when it is able to sufficiently limit the magnitude of the
SEU-induced noise at a lower hardware cost.

Footnote 1: The way the second reduced-precision module adds the ability to identify the problem module is dependent on the RPR implementation chosen and will be discussed in Chapter 5.
The following sections will elaborate on these two points. Section 4.3 explains the
suitability of RPR for protecting arithmetic circuits. Section 4.4 presents the operating
modes of RPR along with a description of the general error bounds for each mode.
4.3 Protecting Arithmetic
As mentioned above, RPR is designed to protect arithmetic operations. Arithmetic
operations have a natural ordering and weighting of data with the more significant bits
located to the left of a vector of bits. In general, digital logic is not organized in this manner
and the relative significance of different portions of a circuit is not clear. The natural ordering
of arithmetic operations allows RPR to focus on the most important sections of a circuit.
The numeric operands of arithmetic modules are naturally ordered by their significance.
For these operations, the bits in a data word are arranged in descending weight from
the most significant bit (MSB) to the least significant bit (LSB). For example, an unsigned
binary number is represented as
$$ b_N b_{N-1} \ldots b_1 b_0, \tag{4.2} $$
$b_N$ being the MSB and $b_0$ being the LSB. This binary representation is interpreted as
$$ \sum_{i=0}^{N} b_i 2^i. \tag{4.3} $$
Thus the bits on the left have a larger value and are more significant than the bits on the
right.
A mitigation technique might exploit this property by giving priority to the upper
bits of the number since those have the greatest value. The simplest demonstration of this
concept in hardware is a register holding a binary number. Each flip-flop (FF) in the register
holds a single binary value. Figure 4.3 shows such a register where each FF is labeled with
the weight of the binary value held. In this case, the binary number stored is a fixed-point
value with the range [0, 1). With 8 bits of precision, any real number in this range can be
represented within a maximum error of $E_{reg} = |\pm 2^{-9}| = 0.001953125$.
Figure 4.3: Block diagram of an 8-bit register holding a fractional fixed-point number.
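A small sketch of the register in Figure 4.3 follows, assuming round-to-nearest quantization (which is what a maximum error of half an LSB implies) and saturation at the all-ones code. The helper names are hypothetical.

```python
def to_register(x, bits=8):
    """Quantize x in [0, 1) to `bits` fractional bits; return the bits MSB first."""
    code = min(round(x * (1 << bits)), (1 << bits) - 1)  # saturate at the all-ones code
    return [(code >> (bits - 1 - i)) & 1 for i in range(bits)]


def register_value(bit_list):
    """Weighted sum of the flip-flop contents: the i-th FF from the left weighs 2^-(i+1)."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bit_list))


reg = to_register(0.3)
stored = register_value(reg)
print(reg, stored, abs(0.3 - stored))
```

Away from the saturation point at the top of the range, the stored value differs from the input by at most 2^-9, matching the bound quoted for the 8-bit register.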
The importance of protecting the most significant bits of this register can be illustrated
by computing the expected error resulting from an upset in the register. The effect of
upsetting a particular bit in the register depends on the position of that bit in the register.
The absolute error caused by upsetting the MSB of the register is $E_{MSB} = 2^{-1}$ since the
numerical output of the register is altered by that quantity. The error caused by upsetting
the LSB of the register is $E_{LSB} = 2^{-8}$ in this case. Since all of the FFs are the same in
size, we assume that they are all equally likely to be upset by an energetic particle. If the
probability of changing any bit in the register is p, the expected error at the output of an
n-bit register is:
$$ E_{unmitigated} = \frac{p}{n} \sum_{i=1}^{n} 2^{-i} = \frac{p}{n}\left(1 - 2^{-n}\right). \tag{4.4} $$
If the upper k bits of the register are protected with a technique such as RPR, the
expected error becomes:
$$ E_{RPR-k} = \frac{p}{n} \sum_{i=k+1}^{n} 2^{-i} = \frac{p}{n}\left(2^{-k} - 2^{-n}\right). \tag{4.5} $$
Note that the first (and largest) k terms of the sum were eliminated since each of those FFs
were protected.
As an example, if the probability of an upset in the original register is p = 0.5, an
unmitigated 8-bit register has an expected error of $E_{unmitigated} \approx 0.0623$. With this same
value for p, the same register with the upper 3 bits protected has an expected error of
only $E_{RPR-3} \approx 0.0076$. If the same amount of redundancy were added to protect the least
significant bits, the expected error would be
$$ E = \frac{0.5}{8} \sum_{i=1}^{5} 2^{-i} \approx 0.0605, \tag{4.6} $$
nearly equal to that of the unprotected register. This example highlights the importance of
protecting the upper bits of a numerical value or arithmetic computation.
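The three numbers in this example can be reproduced with a short sketch of (4.4) through (4.6). The function name and the `protected` argument (a set of 1-based bit positions counted from the MSB) are illustrative choices.

```python
def expected_error(p, n, protected=frozenset()):
    """Expected error of an n-bit fractional register, as in (4.4).

    Bit i (1 = MSB) has weight 2^-i; protected bits contribute no error.
    """
    weights = (2.0 ** -i for i in range(1, n + 1) if i not in protected)
    return (p / n) * sum(weights)


e_unmitigated = expected_error(0.5, 8)                   # no protection, (4.4)
e_rpr3 = expected_error(0.5, 8, protected={1, 2, 3})     # upper 3 bits protected, (4.5)
e_lsb3 = expected_error(0.5, 8, protected={6, 7, 8})     # lower 3 bits protected, (4.6)

print(f"{e_unmitigated:.4f} {e_rpr3:.4f} {e_lsb3:.4f}")
```

Protecting the three MSBs removes most of the expected error, while spending the same redundancy on the three LSBs changes it only in the third decimal place.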
4.4 RPR Upset Cases
The performance of an RPR system in the presence of soft errors can be measured
by the deviation of its output from the unmitigated system in the absence of soft errors. In
the context of DSP systems, this deviation could be termed “noise.” The performance of an
RPR DSP system, then, can be described in terms of the noise of the system in the presence
of upsets.
Each individual upset causes a different amount of noise to be added to the system
output. The amount of noise added to the output depends on the location of the upset within
the circuit. For example, an upset affecting a high-order bit of computation is expected to
cause more noise than an upset affecting a low-order bit.
The upsets in a system protected with RPR can be categorized by the location of the
upset and its effect on the system. There are four possible upset cases for RPR in general:
• Detected upset (DU) — An upset occurs in the full-precision module and the RPR
decision block determines that there is an error in the full-precision module. The RPR
system enters reduced-precision mode.
• Undetected upset (UU) — An upset occurs in the full-precision module but the RPR
decision block does not indicate an error. The RPR system operates in full-precision
degraded mode.
• False detection, no upset (FD) — Though there is no upset in the full-precision module,
the RPR decision block indicates that there is an error. In this case, the RPR system
is incorrectly in reduced-precision mode.
• No upset, no false detection (NU) — No upset exists in the full-precision module and
there is no false detection. The RPR system is in full-precision perfect mode.
The details of the RPR implementation (discussed in more detail in Chapter 5) control the
distribution of upsets between the upset cases.
Upset Case Probabilities
Each upset case has a distinct probability of occurrence and a distinct noise level or
range that is added to the system output. The probability of these upset cases is dependent
on several factors.
• Pupset is the probability of a soft error in the full-precision module, altering its output
in some way. This is a function of the environment upset rate and the size of the
unmitigated design.
• a is the detection factor, the fraction of upsets which trigger the reduced-precision
mode in a particular RPR implementation. This factor is dependent on the detection
capability of the specific RPR implementation: the type and magnitude of upsets that
can be detected.
• Pfp is the probability of a false positive detection event, which occurs when RPR
erroneously chooses the reduced-precision output over the full-precision output even
when the full-precision module was correct. The frequency of occurrence is dependent
on the RPR implementation and the properties of the signals being processed. For
some implementations of RPR, Pfp can be forced to be zero by design.
Table 4.1 includes the probabilities of the four upset cases.
Table 4.1: Summary of the possible upset cases for a general RPR module.

Upset Case  Probability               Noise Signal Added  Absolute Noise Limit
DU          Pupset · a                εe                  εmax
UU          Pupset · (1 − a)          εu                  εmax
FD          (1 − Pupset) · Pfp        εe                  εmax
NU          (1 − Pupset) · (1 − Pfp)  0                   0
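The probability column of Table 4.1 can be written out as a small function. The input values below are hypothetical and chosen only to exercise the formulas; they are not measurements from this work.

```python
def upset_case_probabilities(p_upset, a, p_fp):
    """Probabilities of the four RPR upset cases, following Table 4.1."""
    return {
        "DU": p_upset * a,                # upset in FP module, detected
        "UU": p_upset * (1 - a),          # upset in FP module, not detected
        "FD": (1 - p_upset) * p_fp,       # no upset, false detection
        "NU": (1 - p_upset) * (1 - p_fp), # no upset, no false detection
    }


# Hypothetical inputs: 1% upset probability, 90% detection factor, no false positives.
cases = upset_case_probabilities(p_upset=0.01, a=0.9, p_fp=0.0)
for name, prob in cases.items():
    print(f"{name}: {prob:.4f}")
```

The four cases are mutually exclusive and exhaustive, so their probabilities sum to one, and setting Pfp to zero removes the FD case entirely.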
Upset Case Noise Levels
The noise added in each upset case is dependent on the estimation error of the
RP module, εe, and the error in the full-precision module induced by a specific upset, εu.
Table 4.1 summarizes the amount of noise added in each case for RPR in general (see footnote 2).
The estimation error of the reduced-precision module is simply the difference between
the true (no errors) full-precision output and the reduced-precision output:
εe = FPtrue − RPout. (4.7)
The statistical properties of this signal measure how well the reduced-precision module estimates
the full-precision output. It can also be thought of as the quantization noise incurred
by using a reduced-precision operation. This signal and its statistics can sometimes be computed
for a specific implementation of an RPR module [4]. Appendix B includes some sample
εe data for an FIR filter design as an example of the properties of this signal.

Footnote 2: As will be demonstrated in Section 5.2.3, the noise limits shown can be lowered for specific implementations of RPR.
The signal εu is the difference between the true and actual full-precision outputs:
εu = FPtrue − FPout. (4.8)
This signal is non-zero when an upset has affected the full-precision module. The statistical
properties of this signal are heavily dependent on the module implemented, the signal being
processed, and the location of the upset in the module. For some upsets, this signal is very
small compared to the desired output signal. For others, it can be very large. For this reason,
it is impossible to generalize the properties of this signal. Section B.2 in Appendix B shows
the probability mass functions (pmfs) of εu for several SEUs within an FIR filter design,
demonstrating how different these signals can be.
Although the specifics of εu cannot be generalized, the maximum magnitude of this
signal can be stated when limited to the UU upset case, which is the case in which this signal
is important, according to Table 4.1. This maximum value is the maximum undetected error
value of the RPR system, which is also the maximum magnitude of εe:
εmax = max |εe|. (4.9)
This maximum value is dependent on the type of RPR and the bit-width of the RP modules.
RPR cannot be guaranteed to detect errors smaller than this value. Since the full-precision
and reduced-precision modules may differ by this amount, RPR cannot distinguish between
such low-magnitude upsets and the natural difference between the full- and reduced-precision
outputs.
In the DU case, the output noise of the RPR system is equal to the difference between
the ideal full-precision output and the reduced-precision output (εe) since the reduced-precision
output will be used.
In the UU case, the noise is dependent on the upset. Some upsets result in low-magnitude
noise and others result in higher-magnitude noise. As explained above, the maximum
undetected error value is εmax.
The FD case occurs if the RPR system erroneously chooses the reduced-precision
output when no error exists in the full-precision module. For this false positive error event,
the noise is again equal to the estimation error, εe.
The NU case is simply when no upset exists in the full-precision module. In the
absence of upsets and false positive error events, the noise at the output of the RPR system
is zero.
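Taken together, the probabilities and noise limits of Table 4.1 suggest a loose bound on the expected magnitude of the output noise. The derivation below is a sketch added for illustration and is not stated in the source; the conditional expectations are notation introduced only here, and the inequality simply applies the absolute noise limit εmax to each of the three noisy cases.

$$
\begin{aligned}
E\big[|\varepsilon_{RPR}|\big] &= P_{upset}\,a\,E\big[|\varepsilon_e| \mid DU\big] + P_{upset}\,(1-a)\,E\big[|\varepsilon_u| \mid UU\big] + (1-P_{upset})\,P_{fp}\,E\big[|\varepsilon_e| \mid FD\big] \\
&\le \big[P_{upset} + (1-P_{upset})\,P_{fp}\big]\,\varepsilon_{max}.
\end{aligned}
$$

For implementations in which Pfp is forced to zero by design, the bound reduces to Pupset · εmax.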
4.5 Bit-width Selection
When applying RPR to a module, the size of the reduced-precision modules must be
chosen. This dissertation refers to the relative sizes of these modules in terms of the bit-width of
their inputs. Engineers must always choose the bit-width of any arithmetic module based on
the system requirements and the available resources. After the bit-width of the full-precision
module has been set, RPR requires that the engineer also choose the bit-width of the
reduced-precision modules. As this section will show, the reduced-precision bit-width affects
both the performance of the system in reduced-precision mode, as well as the ability of the
RPR system to detect errors in the full-precision module.
4.5.1 General Bit-width Selection
For any computer hardware, designers must choose the bit-widths used in arithmetic
computations. The number of bits selected for each value or signal affects the size of the
system as well as its ability to represent numbers. Using more bits for a value gives greater
integer range and/or fractional precision but uses more hardware.
In DSP systems, where real numbers are typically represented, either fixed-point or
floating-point numbers are used. FPGAs most often use fixed-point arithmetic rather than
the more flexible but more costly floating-point arithmetic [65]. Except where indicated
otherwise, the numbers represented in this dissertation are in fixed-point, Qn format: 2’s
complement numbers in the range [-1,1) with n bits to the right of the binary point and only
the sign bit to the left.
The inability of a fixed bit-width number to precisely represent a real number is called
quantization. The difference between the real number and its quantized digital value is the
quantization error. For a signal, the error signal is called the quantization noise [66]. A
DSP engineer must take this quantization noise into account when designing a system. The
optimization of bit-widths for the signals within DSP systems is an actively studied field and
will not be treated here [67]–[69].
As an example of quantization error, Figure 4.4 shows the number 0.3359375 rep-
resented with various amounts of precision in binary and decimal representations. As the
number of bits used to represent the number shrinks, the estimation becomes worse. In the
case of a fractional number truncated from n bits to k bits, the maximum error is $2^{-k} - 2^{-n}$,
which would happen if all of the bits truncated in the n-bit number were 1.
Figure 4.4: Truncation of a fixed-point binary number to several levels of precision.
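As a rough illustration of the truncation bound described above, the following Python sketch truncates 0.3359375 (which equals 43/128 and therefore needs n = 7 fractional bits) to k bits and compares the resulting error with the bound 2^(-k) - 2^(-n). The helper names `truncate` and `max_truncation_error` and the use of exact fractions are choices made for this sketch only.

```python
import math
from fractions import Fraction

def truncate(x, k):
    """Truncate x to k fractional bits by discarding the lower bits (floor)."""
    return Fraction(math.floor(x * 2**k), 2**k)

def max_truncation_error(k, n):
    """Largest possible error when an n-bit fraction is truncated to k bits."""
    return Fraction(1, 2**k) - Fraction(1, 2**n)

x = Fraction(43, 128)  # 0.3359375 = 0.0101011 in binary, so n = 7
n = 7
for k in range(n, 0, -1):
    xq = truncate(x, k)
    print(f"k = {k}: value = {float(xq):.7f}, error = {float(x - xq):.7f}, "
          f"bound = {float(max_truncation_error(k, n)):.7f}")
```

The bound is reached only when every discarded bit is 1, for example when 127/128 is truncated to 4 bits.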
4.5.2 RPR Bit-widths
The selection of bit-widths for the modules in an RPR system is similar to the general
bit-width selection problem. The full-precision module in an RPR system is assumed to
use the same bit-width that an engineer would select for the unmitigated module. The
reduced-precision module naturally has a smaller bit-width and uses less hardware than the
original module. This dissertation refers to the bit-widths of the full-precision and reduced-
precision modules as B and Br, respectively. The full-precision module has QB inputs and
the reduced-precision modules have QBr inputs.
The bit-width of the RP modules, Br, affects two main properties: the estimation
error, εe, of the reduced-precision modules and the error detection capability, limited by
εmax, of the RPR system. The estimation error, εe, is directly determined by this bit-width.
The larger the value of Br, the smaller the estimation error and the lower the noise at the
output of the system in the DU and FD error cases. If Br = B, the expected difference
between the two is zero and the result is essentially equivalent to TMR. As Br decreases,
however, the range of this expected difference grows because the module is a less-perfect
estimator of the full-precision output. Figure 4.4 emphasizes that as a bit-width such as Br
decreases, the estimation error of the true value increases.
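The dependence of the estimation error on Br can be explored numerically. The sketch below is only an assumption-laden illustration: it uses a hypothetical 8-tap filter, truncation as the quantizer, and random inputs, and it reports the largest observed difference between a B = 16 output and a reduced-precision output. It does not reproduce the analysis of εe or εmax given elsewhere in this dissertation.

```python
import math
import random

B = 16  # full-precision fractional bits

def quantize(value, bits):
    """Truncate a value in [-1, 1) to `bits` fractional bits."""
    return math.floor(value * 2**bits) / 2**bits

def fir_output(samples, coeffs, bits):
    """One FIR output with inputs and coefficients quantized to `bits`."""
    return sum(quantize(x, bits) * quantize(h, bits)
               for x, h in zip(samples, coeffs))

def max_estimation_error(br, trials=500, seed=1):
    """Largest observed |FP - RP| over random input vectors."""
    rng = random.Random(seed)
    coeffs = [0.05, 0.12, 0.20, 0.26, 0.26, 0.20, 0.12, 0.05]  # hypothetical taps
    worst = 0.0
    for _ in range(trials):
        samples = [rng.uniform(-1.0, 1.0) for _ in coeffs]
        fp = fir_output(samples, coeffs, B)
        rp = fir_output(samples, coeffs, br)
        worst = max(worst, abs(fp - rp))
    return worst

for br in (16, 12, 8, 4):
    print(f"Br = {br:2d}: largest observed |FP - RP| = {max_estimation_error(br):.6f}")
```

With Br = B the two computations are identical, so the observed difference is exactly zero, which matches the TMR-like limit described above.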
The error detection capability is also dependent on Br. A larger bit-width results in
a better estimate of the FP output, which means a higher confidence in the estimate and a
tighter bound on the error. Br determines the value εmax, the maximum difference between
the full-precision and reduced-precision outputs. From Table 4.1, this value bounds the error
in the UU condition since upsets any larger than this value can be detected.
In relation to the estimation error, Br also determines the performance of the system
in the DU and FD error cases, when the RPR system is in reduced-precision mode. Br can be
chosen in a manner similar to the method the engineer used for choosing B, but with relaxed constraints
on the quantization noise. Although this noise is greater in reduced-precision mode, this mode
is only active when the full-precision module has a significant fault due to a soft error or in
the case of a false detection event. Depending on the SEU rate the system operates in and
the false positive probability, operation in this mode may be quite infrequent. With a low
probability of occurrence, lower performance may be tolerated from the reduced-precision
module by using a smaller Br value, resulting in a savings in circuit area. Section 6.2 will
explore the methods for choosing Br in more detail for one variation of RPR.
4.6 RPR Decision Blocks
In addition to the reduced-precision blocks which must be designed and for which
Br must be chosen, the RPR decision block must also be created. The RPR decision block
uses the outputs of the three RPR modules to determine whether to use the full-precision
or reduced-precision output. In essence, RPR can be thought of as an encoding scheme in
which the reduced-precision outputs are the redundant portion of the codeword. The RPR
decision block is the decoding circuit used to map the redundant word back to a single, valid
code word.
The design of the decision block is dependent upon the variation of RPR used. Chap-
ter 5 will describe the decision blocks used for each type of RPR and how their design affects
the cost and performance of the RPR implementation.
4.7 RPR Demonstration
As a demonstration of the RPR technique, this section will provide the results of fault
injection experiments on several FPGA designs. To evaluate RPR’s potential for protecting
communications applications, we applied RPR to the feed-forward binary PAM matched filter
detector of Section 3.5. This demonstration applies the RPR technique to three different
receiver configurations, varying the matched filter architecture and the coefficient values.
These initial experiments with RPR confirm that RPR is able to provide good pro-
tection against the most critical SEUs in these communications systems at a much lower cost
than TMR. The experiments also show that RPR’s effectiveness is not strongly dependent on
either of these filter parameters, while it can be a cost-effective alternative to TMR. Future
chapters will look further into the implementation details of RPR, such as how to select the
best Br value and what specific implementation of RPR is best.
4.7.1 Experimental Configuration
In order to test the effectiveness of RPR on a communications system, we applied the
technique to several of the designs described in Section 3.5:
• “16b logic α = 1.0” means the SRRC pulse shape with α = 1.0 using 16-bit inputs and
filter coefficients.
• “16b logic α = 0.25” means the SRRC pulse shape with α = 0.25 using 16-bit inputs
and filter coefficients.
• “16b dsp48 α = 1.0” means the SRRC pulse shape with α = 1.0 using 16-bit inputs
and filter coefficients and using the embedded DSP blocks of the Virtex 4 SX55 FPGA.
The fault injection experiments run were of the same form as those in Section 3.4
except that the utilized configuration bits were discovered rather than the sensitive bits.
This allowed the full designs to be tested, even when redundancy would typically mask some
errors. Section A.2 in Appendix A describes this distinction in detail.
To determine the effectiveness of RPR, we measured the effect on the BER of every
possible SEU in each of these systems. To measure the efficiency of RPR, we used TMR to
protect the same circuits and compared the results in terms of circuit area consumed. By
using RPR, we expected to eliminate or significantly reduce “catastrophic” SEUs compared
to the original designs. We also expected to see a significantly smaller implementation cost
(in terms of circuit area overhead) than TMR.
4.7.2 Mitigation Details
Figure 4.5 shows a block diagram of a 16-bit FIR filter (B = 16) protected with RPR.
For these experiments, Threshold RPR, which will be discussed in Section 5.1.1 in detail,
was used. The figure shows that the inputs to the filter are triplicated, as with TMR, and
the second and third replicas of the circuit are implemented with reduced-precision (Br = 8)
FIR filters. Note that the decision blocks and outputs are triplicated as well. The outputs of
the three identical decision blocks are voted on, just as in a TMR system, to avoid problems
with SEUs in a single decision block.
This RPR implementation was designed to protect the critical clock and reset lines
in addition to the standard protection of the most significant bits of computation. The FP
filter and each of the RP filters receive an independent set of clock and reset signals (not
pictured). Thus even if one of these “global” signals is upset, the other two modules and
their associated decision blocks continue to operate and RPR continues to function.
Figure 4.5: Simplified block diagram of a 16-bit FIR filter protected with Threshold RPR using two 8-bit filters as the reduced precision modules.
For each of these designs, the reduced-precision modules were logic-based³ FIR filters
with Br = 8. As discussed in Section 4.5.2, the value of Br affects the area overhead of RPR
as well as the amount of protection offered. Though other values of Br might be appropriate
as well for these designs, Br = 8 was selected as a compromise between these two factors.
³ Logic-based filters were used for all designs because the embedded DSP blocks have a fixed bit-width of 18 and a reduced-precision filter using these modules would consume as many DSP blocks as a TMR implementation.
A value of Th = 0.5 was chosen as the threshold to compare between the reduced-
precision and full-precision outputs. For these filter sizes, this threshold ensured that an
error would never be declared when no SEUs were present in the system. Sections 5.1.1
and 6.1 will discuss the selection of the value of Th in more detail.
In these experiments, the triplicated RPR decision blocks were imperfect and were
found to be susceptible to several single SEU-induced errors. Although the RPR decision
block was triplicated in each case (essentially, TMR was applied to that module), some
SEUs caused upsets in more than one of the replicas and overcame the TMR protection. This
is reflected in the test results, where some of the catastrophic SEUs remain in the RPR
implementation. TMR has been shown to be imperfect in FPGAs in some instances, where
a single configuration bit affects signal routing in two of the three TMR domains [70]. This
problem has also been shown to be correctable, to a large extent, using reliability-oriented
routing techniques [71]. Section D.3 (in Appendix D) discusses this issue in more detail.
4.7.3 Experimental Results
Tables 4.2 and 4.3 show the numerical results from the fault injection experiments.
The tables illustrate the sensitivity differences between the RPR and TMR implementations
of each filter design and their sensitivity improvements over the original unmitigated filter.
Table 4.2 reports on the number of SEUs observed in each of the four classes defined
in Section 3.5.2. The implementation overhead of each mitigation technique in terms of
FPGA slices is given as well as the reduction in the number of catastrophic (Class 3 and 4)
SEUs. This table also shows the factor by which the failure rate of each system (in terms of
catastrophic upsets) improved.
Table 4.2: Fault injection results for three FIR filter designs protected with RPR and TMR, compared against the unmitigated filters (repeated from Table 3.1).
Design              Slices/    Slices/DSP48s  Class 1   Class 2  Class 3  Class 4  Total Utilized  Total Catastrophic  Improv. in
                    DSP48s     Overhead       Bits      Bits     Bits     Bits     Bits            (% Reduction)       failure rate
logic α = 1.0       712/0      -              34,829    5,612    1,638    899      42,978          2,537 (-)           -
TMR logic α = 1.0   2,089/0    193%/0%        129,387   0        0        2        129,389         2 (99.9%)           1,268.5×
RPR logic α = 1.0   1,191/0    67.3%/0%       58,092    5,627    96       2        63,817          98 (96.1%)          25.89×
logic α = 0.25      1,029/0    -              50,798    14,479   2,908    1,022    69,207          3,930 (-)           -
TMR logic α = 0.25  3,084/0    200%/0%        212,102   0        0        2        212,104         2 (99.9%)           1,965×
RPR logic α = 0.25  1,718/0    67.0%/0%       98,070    12,067   192      2        110,331         194 (95.1%)         20.26×
dsp48 α = 1.0       554/13     -              22,047    5,498    867      1,118    29,530          1,985 (-)           -
TMR dsp48 α = 1.0   1,659/39   199%/200%      62,483    0        0        2        62,485          2 (99.9%)           992.5×
RPR dsp48 α = 1.0   1,232/13   122%/0%        55,661    6,767    9        11       62,448          20 (98.99%)         99.25×

This table shows that virtually all of the SEUs affecting the three TMR designs fall
in the Class 1 category (no measurable SNR loss). In fact, there were only two upsets that
adversely affected the output of the TMR design. This confirms the effectiveness of the
TMR approach in eliminating virtually all SEU-induced errors. The number of FPGA slices
and the total number of utilized bits in each TMR design, of course, increased due to the
logic added by the filter replicas. The addition of these SEUs is reflected in the increase in
the number of Class 1 SEUs as compared to the original design. Also notice that the failure
rate improvement factor for each TMR design is very high. Theoretically, TMR eliminates
all SEUs and offers an infinite improvement in failure rate. Similar to the triplicated RPR
decision blocks, however, some catastrophic SEUs remain in these TMR implementations,
resulting in a finite failure rate improvement.
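The derived columns of Table 4.2 follow from simple ratios against the unmitigated design. The short Python sketch below recomputes the slice overhead, the percentage reduction in catastrophic SEUs, and the failure-rate improvement factor from the slice counts and catastrophic SEU counts in the table. The dictionary layout and the function name `derived_metrics` are illustrative only.

```python
# (slices, catastrophic SEUs) copied from Table 4.2
TABLE_4_2 = {
    "logic α = 1.0":  {"unmitigated": (712, 2537),  "TMR": (2089, 2), "RPR": (1191, 98)},
    "logic α = 0.25": {"unmitigated": (1029, 3930), "TMR": (3084, 2), "RPR": (1718, 194)},
    "dsp48 α = 1.0":  {"unmitigated": (554, 1985),  "TMR": (1659, 2), "RPR": (1232, 20)},
}

def derived_metrics(base, mitigated):
    """Return (slice overhead %, catastrophic reduction %, improvement factor)."""
    base_slices, base_catastrophic = base
    slices, catastrophic = mitigated
    overhead_pct = 100.0 * (slices - base_slices) / base_slices
    reduction_pct = 100.0 * (base_catastrophic - catastrophic) / base_catastrophic
    improvement = base_catastrophic / catastrophic
    return overhead_pct, reduction_pct, improvement

for design, variants in TABLE_4_2.items():
    for technique in ("TMR", "RPR"):
        overhead, reduction, factor = derived_metrics(
            variants["unmitigated"], variants[technique])
        print(f"{technique} {design}: overhead {overhead:.1f}%, "
              f"reduction {reduction:.2f}%, improvement {factor:.2f}x")
```

For the RPR version of the logic-based filter with α = 1.0, this gives an overhead of about 67.3%, a reduction of about 96.1%, and an improvement factor of about 25.89, in agreement with the table.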
The RPR designs also showed good resilience to SEUs. For example, for the 16-bit
logic-based filter with α = 1.0, the 899 Class 4 SEUs were reduced to 2 and the number of
Class 3 SEUs was reduced from 1,638 to 96. In fact, RPR reduced the number of catastrophic
SEUs by over 95% for each design, improving the failure rate of each by over 20×. Similar
to TMR, the added redundant modules increased the total number of SEUs affecting the
design as well as the number of Class 1 errors. In two of the RPR designs, there was also an
increase in the number of Class 2 SEUs. As discussed earlier, however, these SEUs are not
as critical since the errors induced by these SEUs are likely to be correctable by standard
communications error-handling techniques such as error-control coding.
Table 4.3 shows the SNR loss results for the RPR and TMR filter designs. For this
table, we present a normalized percentage figure by dividing the number of SEUs in each
category by the number of utilized configuration bits in the original unmitigated design to
obtain each percentage. Note again that the TMR designs show no SNR loss for virtually
all SEUs. Again, taking the 16-bit logic-based filter with α = 1.0 as an example, we see
that the RPR technique reduced the percentage of SEUs that caused large SNR losses. Losses
of more than 6 dB were reduced from 9.21% to 1.19% while losses of more than 3 dB were
reduced from 11.20% to 4.17%.
Table 4.3: Normalized percentage of SEUs causing certain SNR losses at a BER of $10^{-5}$
for an FIR filter protected with RPR and TMR compared against the unmitigated filters (repeated from Table 3.2). The number of SEUs in each category was
divided by the total number of utilized bits in the unmitigated design.
Design > 0.2 dB > 0.5 dB > 1 dB > 3 dB > 6 dB
16b logic α = 1.0 18.96% 16.37% 14.32% 11.20% 9.21%
TMR 16b logic α = 1.0 0% 0% 0% 0% $4.65\times10^{-3}$%
RPR 16b logic α = 1.0 13.32% 9.65% 7.66% 4.17% 1.19%
16b logic α = 0.25 26.60% 17.39% 14.36% 10.58% 9.08%
TMR 16b logic α = 0.25 0% 0% 0% 0% $4.65\times10^{-3}$%
RPR 16b logic α = 0.25 17.72% 11.38% 7.88% 3.00% 1.69%
16b dsp48 α = 1.0 25.34% 22.13% 20.18% 15.92% 12.05%
TMR 16b dsp48 α = 1.0 0% 0% 0% 0% $4.65\times10^{-3}$%
RPR 16b dsp48 α = 1.0 22.98% 17.69% 15.31% 9.02% 1.59%

Figures 4.6 – 4.8 show BER plots which illustrate the impact of all SEUs on each
of the three RPR designs. As with the unmitigated versions of the designs reported on in
Section 3.5.2, the vast majority of SEUs cause little impact on the BER curve. Compare
these figures to their counterparts in Figures 3.9, 3.10, and 3.13. The lack of visible histogram
content away from the theoretical curves shows that the number of SEUs causing higher bit
error rates has been significantly reduced.
Figure 4.6: BER plot for the 16-bit logic-based FIR filter with α = 1.0 with RPR using two 8-bit reduced-precision filter replicas.
Figure 4.7: BER plot for the 16-bit logic-based FIR filter with α = 0.25 with RPR using two 8-bit reduced-precision filter replicas.
Figure 4.8: BER plot for the 16-bit DSP Block-based FIR filter with α = 1.0 with RPR using two 8-bit reduced-precision filter replicas.
Naturally, TMR was much more effective at protecting the receiver system against
SEUs than RPR in our experiments. Note, however, the number of FPGA slices needed to
implement each mitigated system, shown in Table 4.2. In the case of the logic-based FIR
filters, the overhead cost of implementing RPR in terms of configuration bits was about one-
third that of TMR. Though the protection is not as thorough, RPR was able to accomplish
the goal of significantly reducing the number of catastrophic SEUs at a cost much lower than
TMR.
Comparing the implementation cost of TMR and RPR for the DSP48-based filters
is more complicated than for the logic-based filters. The overhead for the TMR filter was
predictably about 200% for both FPGA slices and DSP48 blocks. The RPR filter, on the
other hand, used no extra DSP48 blocks but used 122% more slices. This is due to the use
of logic-based filters for the reduced-precision modules (see footnote 3 on page 65).
Interestingly, the TMR and RPR versions of the DSP48-based filter had about the
same number of total utilized configuration bits. Fewer than 3 times the number of con-
figuration bits are needed to fully triplicate this design, presumably because the DSP48
blocks require fewer configuration bits per operation than general logic. In the case of this
DSP block-based filter, then, TMR may be preferable to RPR. If resource constraints limit
the availability of DSP blocks, however, or possibly if a different set of filter bit-widths is
selected, some form of RPR may be appropriate for this type of filter.
4.8 Summary
This chapter gave an introduction to the RPR technique for FPGA systems. It
presented the general architecture of RPR and explained the different operating modes of
RPR. Fault injection experiments demonstrated that RPR is an efficient means to protect
an FPGA-based communications system from catastrophic SEUs. Combined with the fact
that most non-catastrophic SEUs result in effects similar to additive noise, RPR can be a
good alternative to TMR for low-cost SEU mitigation.
These experiments are only an initial demonstration of RPR. Future chapters will look
into RPR in more detail and show how the results shown here can be improved. Chapter 5
will describe and contrast three different variations of RPR and select the best type for use
in FPGA systems. Chapter 6 focuses on one version of RPR and shows how to calculate the
best Br value and other parameters as well as how to integrate RPR into larger systems.
These chapters will also present fault injection results to demonstrate improvements over
these initial results.
CHAPTER 5. COMPARISON OF RPR VARIATIONS
As suggested in the previous chapter, there are many ways to implement RPR. This
chapter describes and analyzes three different variations of RPR. Each variation includes a
full-precision module as well as two reduced-precision modules and each has a method of
choosing between the different outputs. We give these three variations of RPR the names
Threshold RPR, Bounded RPR, and Reduced-Precision TMR (RP-TMR), in order to dis-
tinguish between them. Threshold RPR and Bounded RPR were suggested by previous
researchers while RP-TMR is a novel variation of RPR introduced in this dissertation. This
chapter will describe the three methods in detail and will explain the relative costs and
benefits of each.
Section 5.1 describes the architecture and function of each variation of RPR. Sec-
tion 5.2 compares the potential cost and performance of each variation of RPR on FPGA
systems. As a practical demonstration of these variations, Section 5.3 presents fault injec-
tion experiments on a simple communication system protected with each type of RPR and
compares the results.
5.1 RPR Variations
Each variation of RPR has a distinct architecture and error handling method. Before
describing each variation of RPR in detail, a brief summary of each is appropriate:
Threshold RPR uses two identical reduced-precision modules to estimate the full-
precision result and determine if the full-precision module is in error. It was suggested
by Shim, et al. as a protection against voltage over-scaling (VOS) for ASICs [4]. It was
also suggested as a possible protection against soft errors in ASICs [72]. In contrast, this
dissertation evaluates this type of RPR specifically for FPGAs and communications systems.
Bounded RPR uses distinct reduced-precision modules whose outputs bound the full-
precision module in the absence of errors. The two reduced-precision modules form two less
precise bounds on the desired full-precision result. Bounded RPR was designed by Snodgrass
specifically for FPGA-based systems [5].
RP-TMR uses gate-level TMR on the upper-most bits of computation to directly
protect the most significant bits of the module output. This novel variation of RPR could also
be considered a variation of TMR. Section 3.7 suggested that TMR could possibly be applied
to the most critical sections of the circuit directly in order to reduce the cost of full TMR
while retaining the most significant benefits. RP-TMR uses this straightforward approach
and will be considered against the first two RPR techniques which were first presented
elsewhere.
The following three sections describe the function and implementation of these three
variations of RPR. The general architecture of each is given, including the structure and
function of the decision block. In addition, the design of the reduced-precision modules is
discussed for each variation.
5.1.1 Threshold RPR
This section analyzes the type of RPR introduced by Shim, et al. [54]. It is called
“Threshold RPR” due to the use of a pre-set threshold to determine when the full-precision
module is in error.
Overview
Threshold RPR is implemented by creating two identical reduced-precision (RP)
versions of the module to be protected, as illustrated in Figure 5.1. The outputs of the two
RP modules are used to determine if there is a fault in the full-precision (FP) module. If the
FP output differs from the RP outputs by more than a pre-set threshold, Th, the FP module
is assumed to be in error. When the FP module is found to be in error, the output of the
RP modules is used instead as an estimate of the FP output. If the FP output differs from
the RP outputs by less than Th, the FP module is assumed to be correct and its output is
used.
Figure 5.1: Simplified block diagram of an n-bit (B = n) full-precision module protected with Threshold RPR using two k-bit (Br = k) reduced-precision modules, where k < n.
Decision Block
In order to determine if there is an error in the RPR system, assuming no more than
a single upset at one time, the decision block compares the outputs of the full-precision (FP)
and two reduced-precision (RP1 and RP2) filters as follows:
if ( (|FPout − RP1out| > Th) and (RP1out = RP2out) ) then
output ⇐ RP2out
else
output ⇐ FPout
end if.
In other words, if the FP and RP1 outputs differ by more than Th, an error in one
of those two modules (or the decision block) exists. If the RP1 and RP2 outputs also differ, the
detected error is in the RP1 filter, so the FP filter is assumed to be correct and its output is
used. If the threshold is exceeded and the RP1 and RP2 outputs are equal, the FP module
is assumed to be in error and the output of the RP2 filter is used (though either RP output
would be suitable). Thus the full-precision output is used when no error is found or when the
two reduced-precision modules disagree. Otherwise, the reduced-precision output is used,
providing an estimate of the correct full-precision output.
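The decision rule above can be written as a small behavioral model. The following Python sketch is a software illustration only, not the hardware decision block. The function name `threshold_rpr_decision` is introduced here, and the outputs are assumed to have already been aligned to a common numeric scale.

```python
def threshold_rpr_decision(fp_out, rp1_out, rp2_out, th):
    """Behavioral model of the Threshold RPR decision rule.

    The RP output is selected only when the FP output is far from RP1
    and the two RP outputs agree; otherwise the FP output is used.
    """
    if abs(fp_out - rp1_out) > th and rp1_out == rp2_out:
        return rp2_out
    return fp_out

# No upset: FP is within the threshold of the RP estimate.
print(threshold_rpr_decision(0.30, 0.3125, 0.3125, th=0.5))
# Upset in the FP module: large deviation and the RP modules agree.
print(threshold_rpr_decision(-0.70, 0.3125, 0.3125, th=0.5))
# Upset in RP1: the RP modules disagree, so FP is trusted.
print(threshold_rpr_decision(0.30, -0.90, 0.3125, th=0.5))
```

Under the single-upset assumption, the three calls cover the no-error case, an error in the FP module, and an error in one RP module.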
A block diagram of this decision block is shown in Figure 5.2 and its place in the
overall Threshold RPR system is seen in Figure 5.1. Note in Figure 5.1 that the RPR
decision block can be triplicated since these modules are just as susceptible to SEUs as the
computation modules. As with TMR, the three outputs are combined with an SEU-immune
voter off chip.
Figure 5.2: Block diagram of a Threshold RPR decision block.
Shim suggested that the Threshold RPR decision block can be optimized to consume
less area if the value of Th is constrained to a power of two. This modified decision block
is shown in Figure 5.3. The module takes the difference of the full-precision and reduced-
precision module outputs, and uses simple m-bit NAND and OR gates on the upper bits
of this difference to determine if the threshold is exceeded. The width m of the simplified
comparator gates is dependent on the threshold value chosen. This circuit replaces the three
upper modules in Figure 5.2.
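One plausible reading of this optimization, offered here as an assumption rather than as a description of Shim's exact circuit, is that a two's-complement difference is small when its upper m bits are all zeros (small positive value) or all ones (small negative value). An OR gate detects "not all zeros" and a NAND gate detects "not all ones", so the threshold is exceeded when both gates are true. The Python sketch below models that reading. The function name and the `width` parameter are hypothetical.

```python
def exceeds_pow2_threshold(diff, width, m):
    """Model of a power-of-two threshold test on a two's-complement difference.

    diff  : signed integer difference FP - RP, assumed to fit in `width` bits
    width : total bit-width of the difference
    m     : number of upper bits examined by the OR and NAND gates

    The implied threshold is 2**(width - m) in units of the least significant bit.
    """
    bits = diff & ((1 << width) - 1)      # two's-complement encoding
    upper = bits >> (width - m)           # the m most significant bits
    or_gate = upper != 0                  # true unless the upper bits are all zeros
    nand_gate = upper != (1 << m) - 1     # true unless the upper bits are all ones
    return or_gate and nand_gate

for d in (15, 16, -16, -17):
    print(d, exceeds_pow2_threshold(d, width=8, m=4))
```

In this model the accepted range is slightly asymmetric (from -2^(width-m) to 2^(width-m) - 1), which is a property of the sketch and is not claimed for the original design.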
The lower cost of this optimized decision block can make Threshold RPR more ef-
ficient, requiring less overhead. Due to the restriction of Th to a power of two, however,
this type of decision block is limited in the precision offered by the threshold value. This
can limit the effectiveness of the decision block in situations when a more precise value is
preferable.
Figure 5.3: Block diagram of an optimization of the Threshold RPR decision block suggested by Shim [4].
For either type of decision block, the value of Th affects the balance between the DU
and the UU upset cases (as defined in Section 4.4). Th controls the a factor in Table 4.1,
the fraction of upsets in the full-precision module that RPR detects. A smaller Th allows
lower-magnitude errors to be detected, increasing the a factor. A larger Th decreases the a
factor and allows higher-magnitude errors in the UU upset case.¹
For a particular instantiation of Threshold RPR (i.e. for a particular module and Br
value), there is an optimal range for Th. If Th is too large, the full-precision output will be
used even when there are significant errors in that module. A Th that is too small will cause
the RP output to be chosen even when there are no errors in the FP module, resulting in
the false detection (FD) upset case. The limits on the optimal range of Th will be discussed
in Sections 5.2 and 6.1.
¹ Table 6.5 in Section 6.1.4 reports on some measured values of the a factor for changing Th values.
Reduced-Precision Module Design
In addition to the decision block, the reduced-precision modules must also be de-
signed. For Threshold RPR, the two reduced-precision modules are identical, making any
error in one of the modules trivial to detect. If the reduced-precision outputs differ, an upset
certainly exists in one of them and the full-precision output is used. Thus no upset in the
reduced-precision modules causes any error in the RPR system output.
Aside from having identical outputs, the reduced-precision modules may be designed
in any way and their architecture need not match that of the full-precision module. For
simplicity, this dissertation uses the same technique used to design the full-precision module
in order to design the reduced-precision module, but with reduced-precision inputs. For
example, if the full-precision module is a standard array multiplier with two (B + 1)-bit
inputs and a (B + 1)-bit output, the reduced-precision module is also designed as an array
multiplier with two (Br + 1)-bit inputs with a (Br + 1)-bit output.
5.1.2 Bounded RPR
Snodgrass introduced another type of RPR specifically for soft-error environments [5].
It is called “Bounded RPR” here because it uses two reduced-precision modules to create
bounds on the full-precision output.
Overview
Bounded RPR is similar to Threshold RPR in that two reduced-precision modules
are utilized. In this case, however, the two RP modules are not identical. Instead, the RP
modules are designed to create bounds on the FP output, as illustrated in Figure 5.4. The
output of the first RP module, RPupper, is the upper bound. The output of RPlower is the
lower bound. RPupper must be designed such that its output is always greater than or equal
to the FP module’s output for any set of inputs in the absence of SEUs. Similarly, the output
of RPlower must always be less than or equal to the FP output.
Figure 5.4: Simplified block diagram of a full-precision module protected with Bounded RPR using upper-bound and lower-bound reduced precision modules.
Decision Block
The function of the Bounded RPR decision block is somewhat more complex than
that of Threshold RPR. The final output is determined by comparing the relative positions
of each of the outputs. There are several possible error cases to consider. These error cases
are illustrated in Figure 5.5 and are categorized into rows according to the module in which
the error has occurred.
Figure 5.5: Error cases for Bounded RPR, modified from [5]. Categorized in rows by the location of the error and in columns by the response to each type of event.
The error cases of Figure 5.5 are partitioned into columns as follows:
• Left column: the decision block cannot detect any error since the bounds are not
violated. In these cases the decision block chooses the full-precision output.
• Middle column: the decision block can tell that one of the reduced-precision modules
is in error since the upper and lower bounds have crossed and chooses the full-precision
output.
• Right column: the decision block detects an error but cannot determine the location
of the error. It cannot distinguish between cases 2 and 6 since the relative positions
of the three outputs are the same in each. That is, the full-precision output is found
to be greater than the reduced-precision output in both cases but where the error lies
is impossible to determine from this information. The same ambiguity exists between
cases 3 and 9. For these cases, the decision block must assume that the full-precision
module is in error and chooses the reduced-precision output.
Snodgrass did not consider cases 6 and 9 and thus did not consider this ambiguity in his
work. The consequences of this ambiguity are discussed further in Section 5.2.3.
Applying the error cases above, the decision block for Bounded RPR performs the
following function:
if (RPupper < RPlower) or (RPlower > RPupper) or (RPlower < FPout < RPupper) then
RPRout ← FPout
else
RPRout ← (1/2)(RPupper + RPlower)
end if.
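As with Threshold RPR, this function can be modeled in a few lines of software. The Python sketch below is a behavioral illustration under the assumption that all three outputs share a common numeric scale. The function name `bounded_rpr_decision` is introduced here. The first two clauses of the condition above express the same comparison, so the sketch tests it once.

```python
def bounded_rpr_decision(fp_out, rp_upper, rp_lower):
    """Behavioral model of the Bounded RPR decision function.

    FP is used when the bounds have crossed (an RP module is in error)
    or when FP lies strictly between the bounds. Otherwise the midpoint
    of the two bounds is used as the estimate.
    """
    bounds_crossed = rp_upper < rp_lower
    fp_within_bounds = rp_lower < fp_out < rp_upper
    if bounds_crossed or fp_within_bounds:
        return fp_out
    return 0.5 * (rp_upper + rp_lower)

# No detectable error: FP lies between the bounds.
print(bounded_rpr_decision(0.30, rp_upper=0.375, rp_lower=0.25))
# FP outside the bounds: the midpoint of the bounds is used instead.
print(bounded_rpr_decision(0.90, rp_upper=0.375, rp_lower=0.25))
# Bounds crossed: an RP module is in error, so FP is used.
print(bounded_rpr_decision(0.30, rp_upper=0.10, rp_lower=0.25))
```

The second call returns 0.3125, the midpoint of the two bounds, which corresponds to the right column of Figure 5.5.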
A block diagram of this decision block is shown in Figure 5.6. This RPR decision
block is slightly less costly, in general, than the Threshold RPR decision block in Figure 5.2.
Figure 5.6: Block diagram of a Bounded RPR decision block. Sign extensions, where necessary, are not shown in this diagram.
Reduced-Precision Module Design
Designing the reduced-precision blocks for Bounded RPR is also more complex than
for Threshold RPR. As discussed above, Bounded RPR requires that the outputs of the upper
and lower reduced-precision modules bound the full-precision output for all possible inputs
and input sequences. For simple modules this can be done by rounding up the input to the
upper bound module and rounding down the input to the lower bound module. However, this
works only for the most simplistic functions and, as discussed by Snodgrass [5], could require
the addition of complex hardware to ensure the correct bounds even for simple functions.
Sullivan was able to design bounding modules for several arithmetic modules by making
some simplifying assumptions [64].
Alternately, the RP modules can be designed in the same manner as for Threshold RPR,
with the maximum estimation error, εmax, then added to or subtracted from their outputs, as in Figure 5.7. This
would essentially re-create Threshold RPR, but requires an extra adder for each RP module.
Figure 5.7: Simplified block diagram of an n-bit full-precision module protected with Bounded RPR using an add-and-subtract-threshold method of bounding the full-precision output.
Snodgrass also suggests that a reduced-precision computation module could be
implemented as a lookup table. If this lookup table can be pre-generated, the output of the two
RP modules can be made to ensure correct bounding of the FP output. In fact, they can be
designed such that for every possible input, the output bounds the full-precision module’s
output as tightly as possible. The bound will be tighter for some inputs than others, but
each would be ensured to be as close to the FP output as it can be, given the input and
output precision limits. This results in an error detection limit that is better than the other
variations of RPR, as will be shown in Section 5.2.4.
Although small lookup tables are quite efficient in current FPGAs, lookup tables
grow exponentially in size with the number of input bits. This makes them impractical
for modules with several inputs or with wide input buses. For simple operations such as
small constant-coefficient multipliers and other unary operators, however, the use of lookup
table-based RP modules for Bounded RPR can be beneficial. Lookup table-based modules
are used in the constant-coefficient multipliers in the demonstration systems in Section 5.3.
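The idea of pre-generated bounding tables can be sketched for a constant-coefficient multiplier. The Python example below is a hypothetical construction, not the implementation used in Section 5.3: it assumes small bit-widths (B = 8, Br = 4), a made-up coefficient, truncating arithmetic, and an RP input formed from the upper bits of the FP input. For each RP input it enumerates every FP input that shares those upper bits and stores the tightest RP-precision values that bound the resulting FP outputs.

```python
B, BR = 8, 4            # fractional bits of the FP and RP values (illustrative)
SHIFT = B - BR          # number of low-order bits dropped by the RP modules
COEFF = 77              # hypothetical coefficient, 77/256 in Q8 (about 0.30)

def fp_multiply(x):
    """Full-precision constant multiply: Q8 input to Q8 output (truncated)."""
    return (COEFF * x) >> B

def build_bound_tables():
    """Return (upper, lower) lookup tables indexed by the RP input value."""
    upper, lower = {}, {}
    for xr in range(-(1 << BR), 1 << BR):
        # Every FP input whose upper bits equal xr.
        outputs = [fp_multiply((xr << SHIFT) + low) for low in range(1 << SHIFT)]
        lower[xr] = min(outputs) >> SHIFT          # round down to RP precision
        upper[xr] = -((-max(outputs)) >> SHIFT)    # round up to RP precision
    return upper, lower

UPPER, LOWER = build_bound_tables()

def bounds_hold(x):
    """Check that the table outputs bound the FP output for FP input x."""
    xr = x >> SHIFT
    return (LOWER[xr] << SHIFT) <= fp_multiply(x) <= (UPPER[xr] << SHIFT)

print(all(bounds_hold(x) for x in range(-(1 << B), 1 << B)))
```

Each table has only 2^(Br+1) entries in this sketch, which reflects why the approach suits unary operations with narrow inputs and becomes impractical as the input width grows.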
5.1.3 RP-TMR
Reduced-precision TMR (RP-TMR) is a novel type of RPR introduced in this dis-
sertation. Chapter 3 suggested that a mitigation approach could be applied only to sections
of a circuit most susceptible to error. RP-TMR is the application of TMR to only the most
significant bits of computation, protecting these most critical sections.
Overview
RP-TMR involves direct triplication of the higher-weighted components of an arith-
metic module as shown in Figure 5.8. This results in three identical higher-order branches
(the upper bits of computation) all supported by the same lower-order trunk (the lower bits
of computation). A majority voter at the output of these three branches, identical to that
used in TMR, combines the outputs into one. The output of the single lower-order section
is concatenated to this voted signal to produce the final full-precision output.
In a clocked circuit, the three upper branches are not identical. The single trunk
section of the RP-TMR module must be associated with a clock. As shown in Figure 5.8,
where each clock domain is indicated by the dotted lines, the trunk shares a clock with one
of the three branches. For convenience, this trunk and branch with the shared clock (the
clk1 domain in this figure) is referred to as the “full precision (FP) module.” The other two
branches are referred to as “reduced-precision (RP)” modules. This nomenclature simplifies
comparisons with the other types of RPR.
Figure 5.8: Block diagram of an 8-bit register protected by RP-TMR. For simplicity, the register inputs are not shown. The three clock domains are indicated by dotted lines and are labeled clk1, clk2, and clk3.
As examples of RP-TMR implementation, Figures 5.9 and 5.10 illustrate an RP-
TMR adder and multiplier, respectively. Figure 5.9 shows the entire redundant adder circuit.
Figure 5.10 shows a standard array multiplier with fractional inputs and marks the portion of
the circuit that is triplicated for RP-TMR. Note that the redundant portion of the RP-TMR
multiplier makes up a reduced-precision multiplier similar to a standalone reduced-precision
module.
Figure 5.9: Block diagram of an 8-bit adder protected by RP-TMR. For simplicity, the x and y inputs of each full adder are not shown. The three clock domains, corresponding to the clock domains of the inputs to each full adder sub-module, are indicated by dotted lines and are labeled clk1, clk2, and clk3. The full adder sub-module is detailed in the inset.
For each module, the branches and trunk work together to produce the full-precision
output. The three higher-order branches of each RP-TMR module are essentially stand-alone
reduced-precision modules. Alone, each branch produces an estimate of the full-precision
output. In the case of the adder or multiplier, the output of the lower-order trunk feeds into
all three branches. When the lower-order trunk produces a useful output, these branches are
able to give a better estimate of the true full-precision result.
The output of the upper branches depends on the location of any upsets within the
RP-TMR module. When the entire circuit is free of upsets, the branches all produce the
Figure 5.10: Block diagram of an array multiplier with annotations for RP-TMR. The shading indicates the protected modules and the underlines note the replicated partial product inputs. The full adder and half adder sub-modules used are detailed in the insets, with each partial product shown as one input to each module.
true full-precision result and the voted output is correct. When an upset affects one of the
upper branches, the voters correct the error and the voted output produces the correct result.
When the lower-order trunk is incorrect, however, the upper branches all produce a poorer
estimate since the erroneous output from the trunk feeds into all three branches.
Decision Block
As mentioned above, the decision blocks for RP-TMR are identical to the majority
voters of TMR. These voters are much smaller than the decision blocks required by Threshold
and Bounded RPR. Each bit of the voter takes three inputs and produces one output. These
are by far the simplest and least costly of the three types of RPR decision blocks.
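A behavioral sketch of an RP-TMR adder may help make the branch, trunk, and voter roles concrete. It is an illustrative software model only, assuming unsigned operands, an 8-bit width, and a 4-bit trunk; the function names and the fault-mask parameters (branch_faults, trunk_fault) are hypothetical and simply XOR an error pattern onto the corresponding output. A fault on the trunk's carry output is not modeled.

```python
def majority(a, b, c):
    """Bitwise majority vote, as used in a TMR voter."""
    return (a & b) | (a & c) | (b & c)


def rp_tmr_add(x, y, width=8, trunk_bits=4,
               branch_faults=(0, 0, 0), trunk_fault=0):
    """Add x and y with the upper bits triplicated and the lower bits shared."""
    lo_mask = (1 << trunk_bits) - 1
    hi_mask = (1 << (width - trunk_bits)) - 1

    # Single lower-order trunk: low sum bits plus the carry into the branches.
    lo_sum = (x & lo_mask) + (y & lo_mask)
    carry = lo_sum >> trunk_bits
    trunk_out = (lo_sum & lo_mask) ^ trunk_fault

    # Three higher-order branches, all fed by the same trunk carry.
    hi_sum = ((x >> trunk_bits) + (y >> trunk_bits) + carry) & hi_mask
    branches = [hi_sum ^ fault for fault in branch_faults]

    # Vote the branches, then concatenate the unvoted trunk output.
    return (majority(*branches) << trunk_bits) | trunk_out
```

In this model, an error pattern injected into a single branch is voted out, while an error injected into the trunk sum bits reaches the output but is smaller than 2^trunk_bits in magnitude.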
Reduced-Precision Module Design
RP-TMR does not have separate reduced-precision modules as Threshold RPR and
Bounded RPR do. Redundancy is added by directly triplicating the most significant bits of
computation of the unmitigated module. Thus the architecture of the redundant portions of
the circuit (the “reduced-precision modules”) mirrors that of the unmitigated module. This
makes the design of the redundant portions of the circuit simple, but inflexible.
5.2 RPR Variation Comparison
This section compares the cost and performance of the three variations of RPR.
Section 5.2.1 compares the area cost of the various RPR decision blocks. Section 5.2.2
compares the implementation issues for the reduced-precision modules of each type of RPR.
Section 5.2.3 compares the differences in how the four RPR upset cases manifest themselves
for each RPR variation. Section 5.2.4 compares the error detection limits for each type of
RPR. Section 5.2.5 discusses some issues related to the FPGA implementation of the RPR
variations.
5.2.1 Decision Block Cost
The three variations of RPR have distinct overhead costs. The cost of the reduced-
precision modules can be similar across all three variations, depending on how they are
designed. The cost of the RPR decision blocks, however, can vary greatly. Appendix D
estimates the cost of the RPR decision blocks for each variation of RPR. Figure 5.11 plots
the results of these estimations in terms of the number of 4-input LUTs required for each
decision block with a range of reduced-precision bit-widths, Br.
Clearly, RP-TMR has the lowest cost of the three variations. The Bounded RPR
decision blocks have slightly lower cost than the standard Threshold RPR blocks. The
optimized Threshold RPR decision blocks are more efficient than either of these, though
these blocks have limited application, as explained in Section 5.1.1.
Figure 5.11: Relative cost of RPR decision blocks in terms of 4-input LUTs for a range of reduced-precision bit-widths. (Plot: approximate 4-input LUT cost versus Br for Threshold RPR, Bounded RPR, Optimized Threshold RPR, and RP-TMR.)
5.2.2 Reduced-precision Module Implementation
The design of reduced-precision modules for Threshold RPR and RP-TMR is rela-
tively straightforward. For Threshold RPR, both reduced-precision modules are identical
and are designed to approximate the output of the full-precision module. For RP-TMR,
redundancy is added by directly triplicating the most significant bits of computation of the
unmitigated module.
Designing reduced-precision modules for Bounded RPR, on the other hand, is more
complicated. As discussed in Section 5.1.2, it can be difficult to design the upper and lower
bound modules such that they completely bound the full-precision output for all inputs.
This makes the implementation of Bounded RPR more difficult, in general, than the other
types of RPR.
For Threshold and Bounded RPR, there is no limitation on the architecture of the
reduced-precision modules. The reduced-precision modules are designed separately from the
full-precision module. They can be created with a standard module with reduced-precision
inputs, implemented with lookup tables, or using any number of circuit optimization tech-
niques. The smaller bit-widths may result in relaxed constraints (such as timing) compared
to the full-precision module. For example, Shim suggested that reduced-precision constant-
coefficient multipliers could be replaced with shift and add operations to reduce the hardware
cost of the reduced-precision modules [72].
The redundancy added to RP-TMR, on the other hand, follows the architecture of
the unmitigated module, even if a more optimized reduced-precision module could be de-
signed. Although this strict method of creating the RP-TMR redundancy limits optimiza-
tion, it likely makes the application of RP-TMR easier to automate than either Threshold
or Bounded RPR. It should be possible to create an automated tool which could identify
the most significant bits of computation in each module within a system, assuming the ar-
chitecture of the components is known. The identified components could then be passed to
a tool such as the BYU-LANL TMR (BL-TMR) tool which can apply TMR to only those
components [73], [74].
5.2.3 Upset Cases
The four upset cases are another point in which the three variations of RPR differ. Each
variation adds a different error or noise signal in each upset case and the upset cases occur
with distinct probabilities for each type of RPR. Table 4.1 presented the general noise signals
and added noise bounds for the RPR upset cases. Table 5.1 summarizes the differences in
the added noise when considering each variation of RPR individually. This section also
compares the noise signal added in each case for the three variations of RPR.
Table 5.1: Comparison of the error signals and bounds of three variations of RPR for each RPR upset case.

             Threshold RPR              Bounded RPR                RP-TMR
Upset Case   Noise Signal  Absolute     Noise Signal  Absolute     Noise Signal  Absolute
             Added         Noise Limit  Added         Noise Limit  Added         Noise Limit
DU           εe            εmax         εe            εmax         0             0
UU           εu            Th           εu            εmax         εu            εmax
FD           εe            εmax         εe            εmax         0             0
NU           0             0            0             0            0             0
DU case
In the DU case, where an upset in the full-precision module is detected, both Thresh-
old and Bounded RPR use the reduced-precision output, adding the estimation error to the
signal output. For both of these variations of RPR, εRPR = εe in the DU case. For RP-TMR,
an upset in the upper bits of computation in any of the three branches corresponds to a de-
tected upset. In this case, RP-TMR perfectly votes out the error and adds zero noise to the
system output, i.e. εRPR = 0.
UU case
In the UU case, an upset has occurred in the full-precision module but is undetected.
In the case of RP-TMR, an upset has affected the lower bits of computation, the non-
redundant trunk. For all variations of RPR, the SEU-induced noise passes to the output of
the system. In general, this noise is limited in magnitude by the minimum detectable noise,
which is the maximum of the estimation error signal, εmax.
For Threshold RPR, however, the minimum detectable noise is a separately-controlled
parameter. The error detection threshold, Th, is the minimum detectable noise, by definition.
Threshold RPR allows the designer to set this value to something other than εmax. Lowering
Th can increase the detection capability, a, of Threshold RPR, but also may increase the
false positive probability, as explained below.
FD Case
For the FD case, there has been no upset in the full-precision module, yet the RPR
system mistakenly chooses the reduced-precision output. This can happen for both Thresh-
old RPR and Bounded RPR, but does not apply to RP-TMR: all upsets in the redundant
portion of the system are masked and all upsets in the non-redundant portion are classified
as UU events.
For Bounded RPR, the FD case can occur when there are upsets in the reduced-
precision modules. These are shown in Figure 5.5 as error cases 6 and 9, where an upset
in one of the bounding modules causes its output to cross the full-precision output, but
not the other bound’s output. These FD events reduce the overall performance of RPR by
introducing noise equal to εe at the RPR system output whenever these upsets occur, even
when no upset is present in the full-precision module.
Threshold RPR does not share this Bounded RPR problem. Any single upset affecting
the output of one of the two Threshold RPR reduced-precision modules causes a mismatch in
the comparison between the two reduced-precision modules in which case the full-precision
output is used. In fact, FD events can be completely eliminated by carefully setting the
error detection threshold.
For Threshold RPR, the false positive probability as used in Table 4.1 is:
Pfp = Pr(|FPout − RPout| > Th), (5.1)
or the probability of the difference between the full-precision and reduced-precision outputs
exceeding the set threshold in the absence of upsets. In order to avoid these false error
detection events, Th can be chosen such that RPout is never chosen in the absence of any
errors in the system by setting Th >= εmax.
Shim concluded that the optimal value for threshold is Th = εmax, in the general case.
A threshold for which Th > εmax prevents false positives, but is undesirable since upsets with
higher noise magnitude go undetected. On the other hand, Th < εmax adds the possibility of
FD events, which add unnecessary noise to the system.
In this chapter, we set Th = εmax. This results in Pr(FD)= 0 for Threshold RPR and
simplifies the comparisons to the other types of RPR. However, Section 6.1 will discuss an
alternative approach that can lower Th while maintaining good performance.
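The decision rule and the false positive probability of Equation (5.1) can be illustrated with a short sketch. The function names and the sample values below are hypothetical, and the false positive rate is estimated over an assumed, equally weighted set of fault-free output samples rather than a real signal distribution.

```python
def threshold_rpr_select(fp_out, rp_a, rp_b, th):
    """Return the value a Threshold RPR decision block would pass on."""
    if rp_a != rp_b:
        # Mismatched reduced-precision replicas: fall back to full precision.
        return fp_out
    if abs(fp_out - rp_a) > th:
        # Difference exceeds the threshold: treat the FP module as upset.
        return rp_a
    return fp_out


def false_positive_rate(fp_outs, rp_outs, th):
    """Estimate Pr(|FPout - RPout| > Th) over fault-free samples."""
    flagged = sum(1 for fp, rp in zip(fp_outs, rp_outs) if abs(fp - rp) > th)
    return flagged / len(fp_outs)


# Hypothetical fault-free outputs of the full- and reduced-precision modules.
fp_samples = [12, -7, 30, 45, -20]
rp_samples = [8, -8, 32, 40, -24]
eps_max = max(abs(fp - rp) for fp, rp in zip(fp_samples, rp_samples))
```

With Th set to eps_max, no fault-free sample is flagged; lowering Th below eps_max makes some fault-free samples exceed the threshold, which corresponds to the FD events discussed above.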
NU Case
The NU case is simply the case in which there is no upset in the full-precision module
and there is no false positive error detection. All variations of RPR add no noise to the
system output in this case.
5.2.4 Error Detection Limits
The magnitude of the error that each RPR system can detect is another point in which
the three variations differ. The lookup table implementation used here for the constant
coefficient multiplier blocks gives Bounded RPR an advantage over Threshold RPR and
RP-TMR. The ability to set the error detection threshold (Th) manually, however, gives
Threshold RPR added flexibility.
As an example of these advantages, consider a constant coefficient multiplier module
with the reduced-precision replica implemented as an array multiplier for Threshold RPR
(or RP-TMR) and as a lookup table for Bounded RPR. Figures 5.12 and 5.13 illustrate the
error bounds of a sample RPR multiplier for Threshold and Bounded RPR, respectively.
Each RPR system is configured with B = 7, Br = 3, and a constant coefficient value of
h = 1/2 − 2^{−B} = 0.4921875 (a value fully representable by the full-precision module but not
the reduced-precision module).
Figure 5.12 shows the full-precision and reduced-precision module outputs for the
Threshold RPR multiplier. The error detection limits are simply the full-precision output
plus (and minus) the threshold value, Th. The error detection threshold is based on the
Figure 5.12: Threshold RPR multiplier: full-precision output and reduced-precision output with error bounds. h = 0.4921875, B = 7, Br = 3, and Th = εmax. (Plot: multiplier output versus multiplier input, showing FPout + Th, FPout, RPout, and FPout − Th.)
maximum quantization error of the full-precision and is constant for all multiplier inputs.
The error limits for an RP-TMR multiplier are the same when Th = εmax since the reduced
multiplier also has the architecture of a standard array multiplier as shown in Figure 5.10.
Figure 5.13 shows full-precision output and the error limits for the Bounded RPR
multiplier. Note that the error detection limits change depending on the multiplier input
value. The drawn grid shows the quantization of the reduced-precision inputs and outputs.
The output of the reduced-precision multiplier follows the full-precision output as closely
as the quantization limits allow. Also note that the worst case error for the Bounded RPR
multiplier is smaller than the worst case error for the Threshold RPR multiplier.
As discussed in Section 5.2.3, the error detection threshold of Threshold RPR is
configurable. Although setting Th = εmax guarantees that no false positive error detection
events will occur, Section 6.1 will show that a lower threshold is sometimes desirable. Thus
the configurability of this error detection limit can be an advantage of Threshold RPR.
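The difference in worst-case detection limits can be reproduced with a small numerical sketch for the same parameters (B = 7, Br = 3, h = 0.4921875). The quantization choices here are assumptions made for illustration and may differ from the modules actually plotted: inputs, the coefficient, and products are truncated rather than rounded, and the Threshold RPR replica is modeled as a multiplier with a truncated coefficient. All names are hypothetical, and all values are expressed in full-precision LSBs.

```python
B, BR = 7, 3
SHIFT = B - BR
H_INT = 63  # h = 63/128 = 0.4921875, scaled by 2**B


def fp_mult(x):
    """Full-precision product, truncated to B fractional bits."""
    return (x * H_INT) >> B


def rp_mult(x):
    """Reduced-precision product with truncated input and coefficient."""
    xr = x >> SHIFT
    hr = H_INT >> SHIFT
    return ((xr * hr) >> BR) << SHIFT


def threshold_limit():
    """Worst-case |FPout - RPout|, i.e. eps_max for the modeled replica."""
    return max(abs(fp_mult(x) - rp_mult(x)) for x in range(-(1 << B), 1 << B))


def bounded_limit():
    """Worst-case distance from the FP output to a lookup-table bound."""
    worst = 0
    for xr in range(-(1 << BR), 1 << BR):
        outs = [fp_mult(x) for x in range(xr << SHIFT, (xr + 1) << SHIFT)]
        lower = (min(outs) >> SHIFT) << SHIFT       # floor to the RP grid
        ceil_q = -((-max(outs)) >> SHIFT)           # ceiling on the RP grid
        upper = ceil_q << SHIFT
        worst = max(worst, max(outs) - lower, upper - min(outs))
    return worst
```

Under these assumptions the lookup-table bounds stay closer to the full-precision output than the single static threshold does, in line with the comparison of Figures 5.12 and 5.13.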
Figure 5.13: Bounded RPR multiplier: full-precision and reduced-precision outputs. h = 0.4921875, B = 7, and Br = 3. (Plot: multiplier output versus multiplier input, showing RPupper, FPout, and RPlower.)
5.2.5 Suitability for FPGAs
In addition to other strengths and weaknesses, each type of RPR varies in its suit-
ability for FPGA implementation. In general, RPR lends itself well to use in FPGAs, as
demonstrated in Section 4.7. Some weaknesses are apparent in Bounded RPR and RP-TMR,
however. These issues are explained here and will be demonstrated in Section 5.3.
As mentioned in Section 4.2, the smaller reduced-precision modules in Shim’s RPR
system were not susceptible to VOS errors in ASIC systems. In contrast, the reduced-
precision modules in an FPGA system are still susceptible to SEUs. Threshold RPR and
RP-TMR handle this situation well and any upsets in the reduced-precision modules are
masked. SEUs in the reduced-precision modules in Bounded RPR, however, are not fully
masked, as explained in Section 5.2.3.
RP-TMR has a particular disadvantage in current FPGA technology in the nature of
its redundancy. Current FPGAs often contain fast carry chain logic to implement arithmetic
modules such as adders and multipliers. By necessity, RP-TMR interrupts this carry chain
in order to split the output of the trunk to drive the three upper branches of logic. This may
result in an increase in the critical path of the system and thus a decrease in its clock rate.
5.2.6 Summary
A summary of the relative strengths and weaknesses of the three variations of RPR
is presented here:
Threshold RPR
• Strengths:
– Straightforward implementation of reduced-precision modules
– Configurable error detection threshold (to eliminate false positive detection events
or increase performance)
• Weaknesses:
– Large decision block
– Static error detection threshold for all inputs (unlike Bounded RPR)
Bounded RPR
• Strengths:
– Tightest error detection limits (when using lookup tables as RP modules)
• Weaknesses:
– Large decision block
– Sensitive to false positive detection events
– Difficult to design reduced-precision modules as bounds (except when using lookup
tables, but these grow exponentially with the input bit width)
RP-TMR
• Strengths:
– Small decision block
– Only propagates errors in UU mode
– High potential for automated application of redundancy
• Weaknesses:
– No flexibility in reduced-precision module implementation
– Interrupts fast carry chain in current FPGAs
5.3 Fault Injection Experiments
This section presents fault injection experiments which were run in order to validate
the comparison presented above. Each variation of RPR will be applied to a communications
receiver and compared to the unmitigated design as well as a TMR version of the receiver.
These experiments will show the actual overhead cost and performance of each design in
the presence of SEUs. The results will also show to what extent each of the strengths and
weaknesses presented in Section 5.2 affects the overall system tested.
5.3.1 Experimental Configuration
To demonstrate and compare the performance of the three different RPR techniques,
several FIR filter designs of the form described in Section C.1 and shown in Figure C.1 were
implemented. For each type of RPR, both TMR and the RPR method were applied to an
unmitigated FIR filter (B = 15). For each RPR version, three levels of RPR
were applied using Br values of 3, 5, and 7 for the reduced-precision modules. Fault injection
experiments were then run to characterize each version of the filter, unmitigated and TMR
versions included. The test methodology for these experiments is identical to that used in
earlier chapters and is described in Section 3.5.1.
For both Threshold and Bounded RPR, the same unmitigated filter was used which
was developed with the Xilinx System Generator software, as described in Section C.2. In
order to apply the RP-TMR method, however, a custom FIR filter was created using VHDL
and EDIF design tools with the same parameters and function as the other filter. For reasons
described in Section C.3, the filter used in the RP-TMR tests is larger than that used in the
Threshold and Bounded RPR tests. This chapter will present the experimental results in
percentages rather than raw numbers to make fair comparisons between the RPR variations
using different filters.
In these experiments, as in those presented in Section 4.7, the triplicated RPR decision
blocks were found to be susceptible to some SEUs. Section D.3 explains that these TMR
failures can most likely be corrected with a specialized tool. Since this enhancement is
beyond the scope of this work, the SEUs causing TMR failures in the decision blocks are
ignored in the SEU classification. The configuration bits of the decision blocks are assigned
to the Class 1 category (less than 0.2 dB SNR loss). This avoids skewing the performance
measures of RPR due to the imperfect triplication of the decision block.
5.3.2 Experimental Results
Tables 5.2–5.4 and Figures 5.14–5.16 give a summary of the fault injection results
obtained. For each test design, the tables show the number of utilized configuration bits
in each SEU class. Each table also highlights the number of catastrophic bits discovered
in each design. To compare against the unmitigated design, the tables show the hardware
overhead and reduction in catastrophic bits for each mitigated version of the design.
Table 5.2 shows the SEU classification results for Threshold RPR. Notice that the
unmitigated filter has 2,444 catastrophic bits out of its total of 68,072 bits. This means that
Table 5.2: Number of SEUs causing each class of effect for the FIR filter protected with full TMR and Threshold RPR, compared against the unmitigated filter.

Design               Slices  Slices    Class 1  Class 2  Class 3  Class 4  Total     Total Catastrophic  Improvement
                     Used    Overhead  Bits     Bits     Bits     Bits     Utilized  Bits (% Reduction)  in Failure
                                                                           Bits                          Rate
Unmitigated          1,030   -         59,156   6,472    1,501    943      68,072    2,444 (-%)          -
TMR                  3,171   208%      218,304  0        0        2        218,306   2 (99.92%)          1,222×
Thresh. RPR, Br = 7  1,755   70.4%     106,751  6,239    11       2        113,003   13 (99.47%)         188×
Thresh. RPR, Br = 5  1,470   42.7%     84,284   7,819    226      2        92,331    228 (90.67%)        10.7×
Thresh. RPR, Br = 3  1,313   27.5%     73,992   6,875    1,598    666      83,601    2,264 (7.36%)       1.1×

Table 5.3: Number of SEUs causing each class of effect for the FIR filter protected with full TMR and Bounded RPR, compared against the unmitigated filter.

Design               Slices  Slices    Class 1  Class 2  Class 3  Class 4  Total     Total Catastrophic  Improvement
                     Used    Overhead  Bits     Bits     Bits     Bits     Utilized  Bits (% Reduction)  in Failure
                                                                           Bits                          Rate
Unmitigated          1,030   -         59,156   6,472    1,501    943      68,072    2,444 (-%)          -
TMR                  3,171   208%      218,304  0        0        2        218,306   2 (99.92%)          1,222×
Bound. RPR, Br = 7   2,214   115%      123,720  4,746    0        1        128,467   1 (99.96%)          2,444×
Bound. RPR, Br = 5   1,593   54.7%     88,121   9,957    88       1        98,167    89 (96.36%)         27.5×
Bound. RPR, Br = 3   1,382   34.2%     75,817   7,189    3,037    423      86,466    3,460 (-41.57%)     0.7×

Table 5.4: Number of SEUs causing each class of effect for the FIR filter protected with full TMR and RP-TMR, compared against the unmitigated filter.

Design               Slices  Slices    Class 1  Class 2  Class 3  Class 4  Total     Total Catastrophic  Improvement
                     Used    Overhead  Bits     Bits     Bits     Bits     Utilized  Bits (% Reduction)  in Failure
                                                                           Bits                          Rate
Unmitigated          2,457   -         112,066  14,719   6,581    1,646    135,012   8,227 (-%)          -
TMR                  7,351   199%      422,665  0        0        4        422,669   4 (99.95%)          2,056×
RP-TMR, Br = 7       4,183   70.2%     228,910  302      135      1        229,348   136 (98.35%)        60.5×
RP-TMR, Br = 5       3,587   46.0%     189,211  6,464    2,313    2        197,990   2,315 (71.86%)      3.6×
RP-TMR, Br = 3       3,111   26.6%     150,751  13,056   3,323    20       167,150   3,343 (40.63%)      2.5×
over 96% of the configuration bits are likely to be protected through standard error handling
techniques at the application level.
The results from the filter designs with three levels of Threshold RPR (Br = 3, 5, 7)
are shown in this table. Each design has a hardware overhead significantly less than TMR
and each significantly reduces the number of catastrophic bits in the resulting design. The
RPR filter with Br = 5 (using 6-bit inputs to the reduced-precision modules), reduced the
number of catastrophic bits by over 90% at a cost of 43% on top of the cost of the original
filter. The RPR filter with Br = 7 eliminated nearly all of the catastrophic bits at a cost of
only 70%, about one-third that of TMR. In contrast, the RPR filter with Br = 3 was only
able to reduce the number of catastrophic bits by about 7%. Though the overhead of this
filter is very low, this bit-width appears to be too small to adequately protect this design.
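The overhead, reduction, and improvement figures quoted above follow directly from the slice and catastrophic-bit counts in Table 5.2. The short sketch below reproduces that arithmetic for the Threshold RPR designs; the function name mitigation_metrics is hypothetical, and the numeric inputs are taken from Table 5.2.

```python
def mitigation_metrics(unmit_slices, unmit_catastrophic, slices, catastrophic):
    """Return (slice overhead %, catastrophic-bit reduction %, improvement factor)."""
    overhead_pct = 100.0 * (slices - unmit_slices) / unmit_slices
    reduction_pct = 100.0 * (1 - catastrophic / unmit_catastrophic)
    improvement = unmit_catastrophic / catastrophic
    return overhead_pct, reduction_pct, improvement


# Unmitigated filter: 1,030 slices and 2,444 catastrophic bits (Table 5.2).
BR7 = mitigation_metrics(1030, 2444, 1755, 13)
BR5 = mitigation_metrics(1030, 2444, 1470, 228)
BR3 = mitigation_metrics(1030, 2444, 1313, 2264)
```

For example, the Br = 7 design gives roughly 70.4% slice overhead, a 99.47% reduction in catastrophic bits, and a 188× improvement in failure rate, matching the corresponding table row.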
Table 5.3 shows the SEU classification results for the filters with Bounded RPR
applied. The unmitigated and TMR results are repeated from Table 5.2 for ease of compar-
ison. Note that for the same Br value, Bounded RPR required more area than Threshold
RPR. This is not surprising given the lookup table implementation of the multipliers in this
implementation of Bounded RPR. For the RPR filter with Br = 5 and 7, Bounded RPR
performed slightly above Threshold RPR, as predicted by the slightly smaller error detection
limits shown in Section 5.2.4.
The Bounded RPR filter with Br = 3, however, performed very poorly, even increasing
the number of catastrophic SEUs over the unmitigated design. Recall that an upset in the
reduced-precision modules can affect the RPR output, as explained in Section 5.2.3, which
allows for this increase to occur. This adverse effect is not possible when using Threshold
RPR or RP-TMR, whose outputs are not affected by upsets in the reduced-precision modules.
Table 5.4 shows the SEU classification results for RP-TMR. Recall that the RP-TMR
test used a different unmitigated FIR filter than the Threshold and Bounded RPR tests and
uses a larger number of FPGA slices and configuration bits. The percentage of catastrophic
bits in this design, however, is similar: about 6.09%. The TMR version of the filter had a
similar overhead cost of about 200% and was able to eliminate nearly all of the catastrophic
bits.
The RP-TMR filters had a similar overhead cost to that of Threshold RPR in terms
of FPGA slices. Although RP-TMR does not require large, complex decision blocks like
Threshold RPR, the resulting savings are not seen in these experiments, where the decision
blocks are significantly smaller than the module being protected. The overhead in terms of
utilized configuration bits, however, was larger. Therefore, the additional sensitive area of
RP-TMR is likely due to additional configuration bits used for signal routing. This is not
unexpected due to the added routing congestion of RP-TMR, since the three upper branches
of logic are all dependent on the lower trunk logic and must be located physically close to
each other. In contrast, the three module replicas of Threshold and Bounded RPR are not
tied to each other at all and can be physically spread out.
RP-TMR did not provide the same protection against catastrophic SEUs as Threshold
RPR or Bounded RPR for the filters with Br = 5 and 7. RP-TMR actually exceeded the
performance of Threshold and Bounded RPR in the Br = 3 case in terms of percentage of
catastrophic SEUs mitigated. Although RP-TMR mitigated the Class 4 SEUs very well for
all bit-widths, it underperformed in preventing Class 3 SEUs in the Br = 5 and 7 designs.
It is not clear why RP-TMR did not mitigate these catastrophic upsets as well as the other
forms of RPR.
Figures 5.14–5.16 show the SNR loss values for all the designs tested. These figures
give a different view on the performance of the different types of RPR. While the distribution
of SEUs in the SNR loss cases shown is similar for Threshold and Bounded RPR, that of
RP-TMR is more favorable. Although RP-TMR did not perform as well with the high-noise
upsets, Figure 5.16 indicates that it performed better than Threshold and Bounded RPR
for lower-noise upsets, keeping the SNR loss lower on average. This is likely due to the
fact that, when RP-TMR is operating in reduced-precision mode, it uses both the upper and
lower bits of output—the output of the branches and the trunk, respectively. In contrast,
Threshold and Bounded RPR use only the truncated reduced-precision output, comparable
to only using the upper bits of the RP-TMR output.
Figure 5.14: Normalized percentage of SEUs causing certain SNR losses at BER of 10^−5 for an FIR filter protected with three levels of Threshold RPR compared to the unmitigated design. (Plot: normalized percentage of upsets versus SNR loss thresholds of > 0.2, > 0.5, > 1, > 3, and > 6 dB.)
5.4 Summary
This chapter presented three different variations of reduced-precision redundancy.
Threshold RPR and Bounded RPR were previously suggested while RP-TMR is a new
technique introduced here. The preceding sections compared three types of RPR in several
aspects and fault injection experiments demonstrated each of the three techniques on a
simple communications receiver.
From the combination of the theoretical analysis and the experimental results presented
here, the unique properties of each variation of RPR are summarized as follows:
Figure 5.15: Normalized percentage of SEUs causing certain SNR losses at BER of 10^−5 for an FIR filter protected with three levels of Bounded RPR compared to the unmitigated design. (Plot: normalized percentage of upsets versus SNR loss thresholds of > 0.2, > 0.5, > 1, > 3, and > 6 dB.)
RP-TMR is a straightforward way to protect the upper bits of computation. It
offers theoretically comparable error detection limits to Threshold RPR and uses significantly
smaller decision modules than either form of RPR. Its ability to use the lower-order bits of
computation even when the upper bits are in error gives it an advantage over the other
forms of RPR in some cases. As demonstrated, however, RP-TMR is not suitable for FPGA
implementation due to its additional timing cost and routing congestion.
Bounded RPR, as shown here and as suggested in previous work, obtains very small
error detection limits by implementing reduced-precision modules with lookup tables with
pre-computed contents. Bounded RPR, however, has an architectural flaw that allows for an
increase in sensitivity to catastrophic SEUs in some cases due to ambiguous error detection
cases. In general, the reduced-precision modules for this type of RPR are difficult to design.
Figure 5.16: Normalized percentage of SEUs causing certain SNR losses at BER of 10^−5 for an FIR filter protected with three levels of RP-TMR compared to the unmitigated design. (Plot: normalized percentage of upsets versus SNR loss thresholds of > 0.2, > 0.5, > 1, > 3, and > 6 dB.)
Threshold RPR is relatively straightforward to implement and has a better decision
architecture than Bounded RPR in which false positive detection events can be completely
avoided. Also, although the static error comparison threshold resulted in higher average
error detection limits than Bounded RPR, this threshold is configurable and can be used to
enhance the performance of Threshold RPR.
In light of these results, Threshold RPR appears to be superior to the other two
variations of RPR. Chapter 6 will develop the Threshold RPR technique further. It will
present methods to select the Th and Br parameters to improve performance in the presence
of SEUs and reduce overhead cost.
CHAPTER 6. APPLICATION OF THRESHOLD RPR
Chapters 4 and 5 discussed many issues related to the implementation of RPR and
provided sample experimental results. These experiments were run simply to determine if
RPR could be suitable for these types of systems and to determine which variation of RPR
was best for FPGA systems. However, the RPR implementation details for these experiments
were chosen somewhat naively, without any detailed analysis of how RPR could best meet
the system requirements. This chapter provides a detailed look at several issues that need
to be addressed when designing a system with Threshold RPR.
Section 6.1 describes the benefits and drawbacks of lowering the error detection
threshold, Th. It presents a method for lowering Th which increases the SEU performance
of Threshold RPR in some cases. Fault injection experiments are presented to compare the
performance of the systems utilizing the standard and optimized Th values.
Section 6.2 discusses the effects of setting the bit-width of the reduced-precision mod-
ules, Br, in a Threshold RPR system. It describes how to determine the valid upper and
lower bounds on Br for a specific system. The section concludes with fault injection ex-
periments that demonstrate the effects of a wide range of Br values on a communications
receiver protected with Threshold RPR.
Section 6.3 provides insight into incorporating Threshold RPR into more complex
systems. Until this point, RPR has been demonstrated only on simple modules and systems.
This section gives insight into several issues that must be considered when implementing
RPR on a recursive system with several different types of components. This section presents
a workflow for applying Threshold RPR to such a system. It concludes with a detailed
demonstration of mitigating SEUs in a recursive communications receiver using RPR.
6.1 Threshold Selection
As described in Section 5.1.1, Th is the error detection threshold of Threshold RPR.
Th is an important parameter which controls the magnitude of errors that are detected by
RPR. This value controls the noise limits of the RPR output.
In Chapter 5, Th was set to the maximum estimation error, εmax, as in previous work
by Shim. Shim’s Th value is the optimal value in the general case, where the probability
distribution of the estimation error signal is unknown. If the designer of a particular system
has additional information about this εe signal, however, a lower threshold value may
offer better RPR performance.
This section describes the factors involved in setting the value of Th and suggests a
method for obtaining higher performance with a value of Th < εmax for a fixed Br value.
This novel method is made possible by limiting the scope of the RPR implementation to
a particular system and cannot offer higher performance for all systems. Fault injection
experiments then demonstrate the added benefit of these new Th values over the FIR filter
experiments presented in Section 5.3.
6.1.1 Average Threshold RPR Noise Limit
In order to summarize the effect of changing Th on the performance of the system,
we define an average noise limit for Threshold RPR, ERPR-avg. The average Threshold RPR
noise limit is based on the probabilities and noise limits of Table 4.1:
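$$
\begin{aligned}
E_{\text{RPR-avg}} &= \Pr(\mathrm{DU}) \cdot \varepsilon_{\max} + \Pr(\mathrm{UU}) \cdot T_h + \Pr(\mathrm{FD}) \cdot \varepsilon_{\max} \\
&= P_{\text{upset}} \cdot a \cdot \varepsilon_{\max} + P_{\text{upset}} \cdot (1 - a) \cdot T_h + (1 - P_{\text{upset}}) \cdot P_{fp} \cdot \varepsilon_{\max}
\end{aligned}
\tag{6.1}
$$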
This takes into account the probability of occurrence of each upset case and gives an average
value of the noise limit ERPR over time. This formula will be used to illustrate the ways in
which altering the Th value of the RPR system affects the performance of the system.
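
As a rough illustration of Equation 6.1, the following Python sketch evaluates the average
noise limit from its three terms. The function name, the argument names, and the numeric
inputs in the usage line are illustrative assumptions and do not come from the experiments
reported here.

```python
def rpr_avg_noise_limit(p_upset, a, p_fp, eps_max, th):
    """Average Threshold RPR noise limit, term by term as in Equation 6.1."""
    for name, p in (("p_upset", p_upset), ("a", a), ("p_fp", p_fp)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    pr_du = p_upset * a               # detected upset: noise limited by eps_max
    pr_uu = p_upset * (1.0 - a)       # undetected upset: noise limited by th
    pr_fd = (1.0 - p_upset) * p_fp    # false detection: noise limited by eps_max
    return pr_du * eps_max + pr_uu * th + pr_fd * eps_max


# Hypothetical inputs. With th == eps_max and p_fp == 0 the result reduces to
# p_upset * eps_max, independent of the detection factor a.
print(rpr_avg_noise_limit(p_upset=1e-3, a=0.1, p_fp=0.0, eps_max=0.6, th=0.6))
```

Evaluating the function over a range of candidate Th values would still require a as a
function of Th, which, as discussed in the following section, is specific to the module and
the upset environment.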
6.1.2 Reduction of Th
Recall that the value for Th affects both the distribution of UU and DU events as
well as the noise limits for each of these event types. This shift is represented by the change
in the value a in Equation 6.1 (see footnote 1). Increasing Th causes more UU events and fewer DU events,
decreasing a. Decreasing Th has the opposite effect. Decreasing Th also affects the noise
limit in the UU upset case. This makes it difficult to determine the overall effect of altering
Th on ERPR-avg.
A low value of Th (lower than εmax) is desirable because it lowers the noise limit in
the UU case. However, there are two possible disadvantages to a lower Th value:
1. There are possible false-positive error detection events, as discussed earlier. This in-
troduces noise equal to εe even when no upsets exist in the system.
2. Upsets that cause errors with magnitude above Th but below εmax are replaced with
the estimation error which has a bound at εmax. The resulting error, then, could be
larger than the error caused by the upset itself in some cases.
In each of these cases, the RPR system introduces a higher-magnitude noise than would
otherwise be present (in the unmitigated module). Each of these cases will now be described
in detail.
False Positive Error Events
In previous work, Th was set to the maximum estimation error, εmax. This value of Th
ensures that the false detection upset case does not occur. If the probability PFD is sufficiently
small, however, it may be desirable to lower Th to allow some false positive events. Knowledge
of the input signal characteristics or the operating environment could allow one to predict
PFD for lower Th values. Similarly, knowledge of the statistical properties of the εe signal
directly can provide enough information to be able to lower Th to obtain a better ERPR-avg.
Footnote 1: Table 6.5 in Section 6.1.4 reports some measured values of the a factor for changing Th values.
In some cases, with knowledge of the input signal and the properties of a specific
module, it is possible to choose Th < εmax to avoid false positive detection events a large
portion of the time. In this case, PFD << 1, but may be non-zero. This alters the final term
in Equation 6.1, which is zero when using Th = εmax. However, the first and second terms
are also altered since the value a is dependent on Th and Th itself is the noise limit in the
UU case. Without knowing the value of a as a function of Th, it is difficult to predict the
effect on ERPR-avg. This function is dependent on the module being protected and the upset
environment and is difficult to generalize.
A more direct method is to examine the distribution of the estimation error signal, εe.
Shim showed that, for a uniformly-distributed εe signal, the optimal value for Th is εmax [47].
This is reasonable because all values of εe between 0 and εmax are equally probable, including
those above any value Th less than εmax. Thus Pfp increases sharply as Th is lowered below
εmax. This, in turn, increases the frequency of the FD upset event which decreases the overall
performance of RPR.
If, on the other hand, the distribution of the εe signal is such that higher values of εe
are less probable than lower values, the increase in Pfp may not be enough to severely affect
the performance of the system. For example, if the distribution of εe is Gaussian (see footnote 2), the false
error probability can be predicted based on the relation of Th to the standard deviation (σ)
of the distribution. Table 6.1 shows the relation of Pfp to Th for this case. A system with
Th = σ can expect a false positive every third clock cycle, on average. Values of Th = 5σ
and Th = 6σ, however, result in false positive error rates of less than 10^-6. With rates this
low, it can certainly be feasible to lower Th without fear of significantly increasing the FD
upset case probability.
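
The Pfp values quoted for the Gaussian case can be checked with a short calculation. The
sketch below assumes a zero-mean Gaussian εe and a two-sided comparison on |εe|, so that
Pfp = Pr(|εe| > kσ) = erfc(k/√2). The function name is an illustrative assumption.

```python
import math


def gaussian_false_positive_prob(k):
    """Two-sided tail probability Pr(|x| > k * sigma) for a zero-mean Gaussian."""
    return math.erfc(k / math.sqrt(2.0))


# Print Pfp for thresholds from one to six standard deviations.
for k in range(1, 7):
    print(f"Th = {k} sigma: Pfp ~ {gaussian_false_positive_prob(k):.3g}")
```

Under these assumptions the computed tail probabilities agree with the entries of Table 6.1
to the precision shown there.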

Table 6.1: Pfp values for a Gaussian-distributed εe signal.

Th     Pfp
σ      0.317
2σ     0.0455
3σ     2.70 × 10^-3
4σ     6.33 × 10^-5
5σ     5.73 × 10^-7
6σ     1.97 × 10^-9

Footnote 2: The actual εe signal cannot be a true Gaussian, of course. The εe signal has an
actual cutoff at εmax while a true Gaussian distribution has infinite support.

The distribution of εe is highly dependent on the type of module being protected as
well as the signal environment at its input. For example, a simple register with a uniformly-
distributed input would have a uniformly-distributed εe signal due to the simple truncation
effect. In our testing, the constant-coefficient multipliers showed varying distributions for εe
based on the coefficient value and the Br value. For each of these combinations, a different
amount of truncation occurred in the coefficient resulting in several error distributions. These
included distributions that appeared approximately uniform, Gaussian, or triangular. For
the full FIR filter, however, with the modulated input signal, the εe signal appeared Gaussian
when the input signal had an SNR less than 30 dB. Section B.1 in Appendix B plots a sample
probability distribution of εe for the FIR filter. This property is exploited in Section 6.1.3
in order to find a valid Th < εmax for this circuit.
Mid-range Upset Errors
The second problem mentioned with lowering Th below εmax is the possible increase
in the error level for some upsets. In this case, the noise induced by some upsets will be
replaced by the noise of the RPout signal: εe. This results in the εmax value being the noise
limit a higher percentage of the time while reducing the time the reduced threshold value,
T*h, is the noise limit. Depending on the noise induced by the SEU, this could result in a
higher overall noise level.
For example, consider the probability mass functions (pmfs) shown in Figure 6.1,
representing some error signals of a hypothetical RPR system (see footnote 3). Figure 6.1(a)
shows the pmf of the estimation error signal, εe, of an RPR module along with its noise limit,
εmax. Figure 6.1(b) shows the pmf of the upset error signal, εu, of an SEU which causes the
maximum undetected error signal for a given reduced threshold, T*h. Figure 6.1(c) shows the
pmf of another upset error signal for which the maximum value of εu satisfies
T*h < εu-max < εmax.

Figure 6.1: (a) The pmf of the estimation error, εe, of an RPR module, (b) the pmf for the
maximum undetected upset error signal, εu, and (c) the pmf for a mid-range upset which
crosses the reduced threshold, T*h. [Three panels plotted over the range −1 to 1, with limits
marked at ±εmax in (a), ±T*h in (b), and ±εu-max in (c).]

Footnote 3: The pmfs displayed were created to be Gaussian distributions for illustration
purposes. It is important to note that these types of error signals do not always have this
type of distribution.

In the case of Figure 6.1(c), the upset causes noise higher than T*h and is detected as
an error. The RPR system thus enters the reduced-precision mode and the error signal of
Figure 6.1(c) is replaced with that of Figure 6.1(a). In this case, the error of the system is
increased due to the lowered threshold value.
This discussion shows that the effect of lowering Th below εmax can have mixed conse-
quences. With additional knowledge about a specific system (including characteristics of the
input signal, SEU-induced noise, and estimation error) it would be possible to pre-determine
the optimal value for Th. In the end, however, the most general acceptable rule is that Th
should not be lowered below εmax. With that in mind, the following section introduces a
method for finding an acceptable lower value for Th experimentally.
6.1.3 Experimental Determination of Th
Although lowering the value of Th below εmax can have negative consequences, these
negative effects only occur during time periods when Th < εe < εmax. The value of εmax
is determined mathematically based on the structure of the module in question and the
possible input signals. This section shows that it is possible to determine an acceptable
tighter bound on εe experimentally. For some modules, the practical maximum value of εe
can be significantly lower than the theoretical value. Section 6.1.4 will then demonstrate the
RPR performance gains that can be achieved by basing Th on this lower value.
Recall the following definition from Section 4.4:
$$
\varepsilon_{\max} = \max |\varepsilon_e| = \max |FP_{\text{true}} - RP_{\text{out}}| \tag{6.2}
$$
The value of εmax was determined mathematically for several simple modules for several
types of RPR. This determination was done with the theoretical maximum error values for
each module. The values of εmax for the register, adder, and multiplier were then combined
to form εmax for an FIR filter.
Although the theoretical maximum values for these modules are accurate, meaning
there is some input sequence that can produce the maximum value given, there was no notion
given of their probability of occurrence. For simple components and known input signal
characteristics, the probability of the maximum estimation error (Pr(εe = εmax)) or the
probability of any value above a certain threshold (Pr(εe > T*h)) can be easily determined.
For more complex modules, or for combinations of simple modules such as the FIR filter in
Figure 3.6, these probabilities can be much more difficult to calculate theoretically.
If, under known conditions, the probability Pr(εe > T*h) for a chosen threshold, T*h, is
very close to zero, it may be desirable to use T*h as the RPR detection threshold Th. If this
probability is sufficiently close to zero, we may use the same assumptions as if we had used
Th = εmax, namely that Pr(FD) = 0 and that the noise in the FD error case will never be
larger than the upset noise. The measure of sufficiently close to zero is subjective and tied
to the system in question. In a communications system where there is a BER requirement
of 10^-5 at a certain SNR, the value of Pr(FD) should be low enough such that FD events
cause bit errors well below that rate.
Rather than using the theoretical value, as suggested by Shim, the maximum value
of εe can be determined experimentally. We label this experimentally-determined value ε*max,
which is used to determine the experimental decision threshold labeled T*h, where ε*max < εmax
and T*h < Th. An experimentally-determined threshold, of course, is only valid for a specific
circuit. Without a specific assumption like this, Shim’s Th = εmax is the correct value to use.
For the FIR filter circuit, we have experimentally measured the signal εe for several
different RPR bit-widths. To do this, we created bit-accurate simulation models of the full-
precision and reduced-precision FIR filter circuits using Matlab. We then generated several
representative modulated input signals, each with a different SNR level (SNR values of 2, 4,
6, 8, and 10 dB). These models were then used as follows:
1. Each of the input signals was processed by the FP filter and the output signals recorded
2. The same input signals were processed by each RP filter and the output signals recorded
3. For each RP filter and each SNR, the estimation error signal, εe, was calculated
4. The absolute maximum value of each εe signal was recorded as ε∗max
5. The mean (µe) and standard deviation (σe) of each εe signal were calculated
For this design and these input characteristics, the signal εe was roughly Gaussian-
distributed. As expected, the ε∗max value was dependent on the test duration. We also
discovered that the SNR of the input signal did not have a significant impact on the statistics
of the εe signal. Section B.1 in Appendix B plots the probability mass functions (pmfs) for
the FIR filter design with various bit-widths at an SNR of 8 dB, demonstrating this Gaussian
distribution and the effect of changing Br.
Using the Gaussian distribution of εe and the values in Table 6.1 as a hint, we calcu-
lated the experimental threshold as:
$$
T_h^{*} = \mu_e + 6\sigma_e \tag{6.3}
$$
We confirmed this to be a valid threshold (i.e. T*h > ε*max) for simulation durations up to 10^6
samples. With this value of T*h, we expected Pfp to be very low, as suggested by Table 6.1.
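
The measurement steps listed above and Equation 6.3 can be sketched in Python as follows.
This is not the Matlab model used for the experiments. It assumes the full-precision and
reduced-precision output samples have already been recorded from bit-accurate models, it
uses the population standard deviation (an assumption, since the estimator is not specified
here), and all names and the sample data are hypothetical.

```python
import statistics


def experimental_threshold(fp_out, rp_out, k_sigma=6.0):
    """Estimate T*h = mu_e + k_sigma * sigma_e from recorded filter outputs.

    fp_out and rp_out are equal-length sequences holding the full-precision and
    reduced-precision output samples produced from the same input signal.
    """
    if len(fp_out) != len(rp_out) or len(fp_out) == 0:
        raise ValueError("outputs must be non-empty and of equal length")
    eps_e = [fp - rp for fp, rp in zip(fp_out, rp_out)]  # estimation error signal
    eps_star_max = max(abs(e) for e in eps_e)            # observed maximum |eps_e|
    mu_e = statistics.fmean(eps_e)
    sigma_e = statistics.pstdev(eps_e)                   # assumption: population std
    t_star = mu_e + k_sigma * sigma_e                    # Equation 6.3 for k_sigma = 6
    return {
        "t_star": t_star,
        "eps_star_max": eps_star_max,
        "mu_e": mu_e,
        "sigma_e": sigma_e,
        "valid": t_star > eps_star_max,  # validity check: T*h > eps*max
    }


# Tiny hypothetical data set, only to show the call.
print(experimental_threshold([1.0, 2.0, 3.0, 4.0], [0.5, 2.0, 2.5, 4.0]))
```

The `valid` flag only reflects the samples supplied, so, as noted above, the check would need
to be repeated for the longest simulation duration of interest.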
Table 6.2 shows the different threshold values obtained for several different reduced-
precision FIR filters. Both the theoretical (Th) and experimental (T*h) threshold values are
shown for each filter as well as the mean (µ) and standard deviation (σ) values for the signal
εe. Notice that the experimentally-determined values, in these cases, become increasingly
lower than their theoretical counterparts as Br decreases. This can greatly increase the
number of errors detected for a particular bit-width and has the potential to make even
lower Br values feasible for a particular system.
Table 6.2: Mathematical (Th) vs. experimental (T*h) threshold values for RPR FIR filter
designs with several different reduced-precision bit-widths (Br). The mean (µe) and
standard deviation (σe) values for the signal εe are also shown.

Br   Th       T*h      % Change   µe        σe
7    0.1597   0.1049   −34.3%     0.3659    0.09465
6    0.3106   0.1844   −40.6%     0.2453    0.0562
5    0.6046   0.3182   −47.4%     0.1431    0.02891
4    1.2212   0.5849   −52.1%     0.08365   0.01563
3    2.3871   0.9222   −61.4%     0.05380   0.008500
Table 6.3: Number of SEUs causing each class of effect for an FIR filter protected with
TMR and several levels of Threshold RPR using experimentally-determined
thresholds (T*h), compared to mathematically-determined thresholds (Th).

Design             Slices   Class 1   Class 2   Class 3   Class 4   Total Catastrophic   Improv. in
                   Used     Bits      Bits      Bits      Bits      (% Reduction)        failure rate
Unmitigated        1,030    59,156    6,472     1,501     943       2,444 (-%)           -
RPR, Br = 7, Th    1,755    106,751   6,239     11        2         13 (99.47%)          188×
RPR, Br = 7, T*h   1,755    106,863   6,191     11        2         13 (99.47%)          188×
RPR, Br = 5, Th    1,470    84,284    7,819     226       2         228 (90.67%)         10.7×
RPR, Br = 5, T*h   1,470    84,583    7,709     42        2         44 (98.20%)          55.5×
RPR, Br = 3, Th    1,313    73,992    6,875     1,598     666       2,264 (7.36%)        1.08×
RPR, Br = 3, T*h   1,313    74,129    8,267     634       36        670 (72.59%)         3.65×

The Th values shown in the table are those used in the fault injection experiments
presented in Chapter 5. The next sections will present experimental results for the same
designs using the T*h values. The results will show that the lowered threshold values can
have a significant impact on the performance of RPR, especially for the lower values of Br.
6.1.4 Reduced Threshold Experiments
To demonstrate the effects of using the experimentally-determined T*h values, fault
injection experiments were run on the same FIR filter designs as those used in Chapter 5. The
configuration of these experiments was the same as those described in Section 4.7. Tables 6.3
and 6.4 show the results of these experiments, including results repeated from Table 5.2 and
Figure 5.14 for convenience.
Table 6.3 shows the SEU classification results from the fault injection experiments.
Notice that there was no change in the number of catastrophic upsets for Br = 7, which
had the smallest percentage change from Th to T*h shown in Table 6.2. For the lower Br
values, the difference in threshold value is larger and the effect on performance is greater.
The coverage of catastrophic errors increased by about 8 percentage points for Br = 5 and by
about 65 percentage points for Br = 3.
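
The percentage-reduction and failure-rate columns of Table 6.3 can be reproduced from the
catastrophic bit counts. The sketch below assumes that the reduction is measured relative to
the unmitigated count and that the improvement factor is the ratio of the two counts; these
formulas are inferred from the table values rather than stated in the text, and the function
name is illustrative.

```python
UNMITIGATED_CATASTROPHIC = 2444  # catastrophic bits of the unmitigated design (Table 6.3)


def catastrophic_metrics(catastrophic_bits, baseline=UNMITIGATED_CATASTROPHIC):
    """Return (percent reduction, improvement factor) relative to the baseline count."""
    reduction_pct = 100.0 * (baseline - catastrophic_bits) / baseline
    improvement = baseline / catastrophic_bits
    return reduction_pct, improvement


# Catastrophic bit counts taken from Table 6.3.
for label, bits in [("Br = 5, Th", 228), ("Br = 5, T*h", 44),
                    ("Br = 3, Th", 2264), ("Br = 3, T*h", 670)]:
    pct, factor = catastrophic_metrics(bits)
    print(f"{label}: {pct:.2f}% reduction, {factor:.3g}x improvement")
```

The coverage gains quoted above are then simply the differences between the T*h and Th
reduction percentages for the same Br.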
Table 6.4 compares the SNR loss results for the designs using Th and T*h. For all
values of Br there was moderate improvement in nearly every dB range using the
experimentally-determined thresholds; only the Br = 7 results above 3 dB were unchanged.
This emphasizes the benefit of using a threshold with T*h < εmax.
Table 6.4: Normalized percentage of SEUs causing certain SNR losses at BER of 10^-5
for an FIR filter protected with several levels of Threshold RPR using
experimentally-determined thresholds (T*h), compared to
mathematically-determined thresholds (Th).

Design             > 0.2 dB   > 0.5 dB   > 1 dB   > 3 dB    > 6 dB
Unmitigated        13.1%      10.6%      9.07%    6.29%     5.32%
RPR, Br = 7, Th    9.19%      3.29%      1.62%    0.0191%   0.0191%
RPR, Br = 7, T*h   9.11%      2.48%      0.696%   0.0191%   0.0191%
RPR, Br = 5, Th    11.8%      9.16%      7.23%    3.08%     1.62%
RPR, Br = 5, T*h   11.39%     8.85%      6.89%    1.51%     0.602%
RPR, Br = 3, Th    13.4%      10.7%      8.97%    5.98%     5.01%
RPR, Br = 3, T*h   13.1%      10.3%      8.50%    5.47%     3.71%

Table 6.5: Detection factor (a) for an FIR filter protected with several levels of
Threshold RPR using experimentally-determined thresholds (T*h), compared
to mathematically-determined thresholds (Th) at an SNR of 8 dB.

Design             a
RPR, Br = 7, Th    0.0754
RPR, Br = 7, T*h   0.1082
RPR, Br = 5, Th    0.0519
RPR, Br = 5, T*h   0.0859
RPR, Br = 3, Th    0.0495
RPR, Br = 3, T*h   0.0699

Table 6.5 reports on measured values of the RPR detection factor, a, for both thresh-
old values. This value is the fraction of upsets in the full-precision module that were detected
by the RPR system and for which the reduced-precision output was used. Note that, as
expected, the a factor increases with the lower threshold T*h for each Br value.
6.2 Bit-width Selection
The previous section discussed setting Th for a fixed reduced-precision bit-width,
Br. This section presents the considerations necessary when setting Br. The value of Br
determines the quality of the estimate that the reduced-precision modules produce relative
to the full-precision module. This in turn controls the valid range of Th and the level of noise
that is detectable by the system.
In general, a higher Br has a higher area cost and gives better performance. The
effect on performance can be seen in Equation 6.1: since both εmax and Th decrease with an
increase in Br, the average noise limit of Threshold RPR decreases as well.
This section emphasizes that the selection of Br has a large impact on the performance
and cost of RPR. It describes this impact and presents how to calculate the valid range of
Br available for a particular module. It also demonstrates the trade-offs between the cost
and performance factors with fault injection experiments.
6.2.1 Bit-width Effects
The primary effect of setting Br is to set the accuracy of the estimate of the full-
precision module and thus the estimation error signal, εe. This affects the noise of the
system in reduced-precision mode, but also the level of SEU-induced noise that is detectable.
Effect on Performance
The Br value directly sets the noise level of the RPR system while in reduced-precision
mode. RPR operates in this mode when an error is detected in the full-precision module and
the reduced-precision output is used. Thus the noise level in this mode depends solely on
the performance of the reduced-precision module and upon its bit-width.
For example, Figure 6.2 shows several BER curves for the binary PAM system de-
scribed in Section 3.5, each for an FIR filter with a different input bit-width. If one of
the application requirements specifies that the BER in reduced-precision mode should be at
most 10^-4 at an SNR of 10 dB, the input bit-width of the RP modules must be Br ≥ 5.
The Br value also controls the level of SEU-induced noise that is detectable. A
smaller Br value means the reduced-precision module produces a poorer estimate of the full-
precision output, resulting in a larger possible difference between the two outputs. Thus a
higher threshold, Th, is needed for a smaller Br.

Figure 6.2: Bit error rate curves for several FIR filters (SRRC pulse shape, α = 0.5) with
different bit-widths. [Plot: x-axis Eb/No (0 to 12); y-axis BER (10^-10 to 10^0); curves:
Theoretical and Br = 1 through 8.]

Effect on Error Detection Threshold
Lowering the Br value decreases the performance of an RPR system, resulting in a
cutoff of its usefulness as Br approaches zero. As Br is lowered, Th must become larger.
Obviously, there are few interesting circuits that would be estimated well by a reduced-
precision module with Br = 0 (a 1-bit signed number). Depending on the application, the
value for Th could become too large to be usable at Br values significantly higher than 0.
Using the feed-forward binary PAM system as an example, the output of the full-
precision FIR filter has a bit-width of Q1.15, giving it a possible range of [-2,2). From
Table 6.2, the theoretical value of Th for Br = 3 is 2.3871. This is over 50% of the total
range of the output signal of the filter. In fact, the output range of the filter is typically
smaller than this.
As an example of a system with a valid threshold, Figure 6.3 gives a representation
of the signals used by the RPR decision block to determine if there is an error in the
system. This figure was generated from the outputs of an RPR FIR filter with Br = 6 and
Th = 0.3106 and no errors present. By adding and subtracting Th to and from the RPout
signal, the upper and lower bounds for the FPout signal can be visualized. Note that in
this system, the noise limits are fairly close to the full-precision output. An error in the
full-precision module which caused the output to exit these bounds would be flagged as an
error and the reduced-precision output would be used instead.
Figure 6.3: RPR filter decision signals for RPR with Br = 6 and Th = 0.3106. No errors
are present in the system. The upper and lower comparison bound signals are calculated by
adding and subtracting Th to and from RPout. [Plot: x-axis time (0 to 100); y-axis range
−2 to 2; signals: RPout + Th, FPout, RPout − Th.]
In contrast, Figure 6.4 shows the signals
for the FIR filter with Br = 3 and Th = 2.3871. The figure illustrates the system with a
catastrophic error in the full-precision module: FPout is frozen at 0. With this value of Th,
the erroneous FPout signal is always completely within the displayed bounds. Thus the RPR
decision block determines that no error is present in the full-precision module and uses the
frozen output as RPRout.
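
The behavior described for Figures 6.3 and 6.4 can be illustrated with a small Python sketch
of the decision rule: the reduced-precision output replaces the full-precision output when
the two differ by more than Th. The strict comparison, the sinusoidal stand-in for RPout with
a peak of 1.5, and all names are assumptions made for illustration; only the two Th values are
taken from the text.

```python
import math


def threshold_rpr_decision(fp_out, rp_out, th):
    """Return (rpr_out, error_flag), flagging an error when |FPout - RPout| > th."""
    error = abs(fp_out - rp_out) > th
    return (rp_out if error else fp_out), error


def detection_fraction(rp_signal, frozen_value, th):
    """Fraction of samples on which a frozen full-precision output is flagged."""
    flags = [threshold_rpr_decision(frozen_value, rp, th)[1] for rp in rp_signal]
    return sum(flags) / len(flags)


# Hypothetical reduced-precision output: a sinusoid with peak amplitude 1.5.
rp_signal = [1.5 * math.sin(2.0 * math.pi * n / 20.0) for n in range(100)]

# FPout frozen at 0, as in Figure 6.4.
print("Th = 2.3871:", detection_fraction(rp_signal, 0.0, 2.3871))
print("Th = 0.3106:", detection_fraction(rp_signal, 0.0, 0.3106))
```

Because the stand-in signal never exceeds 1.5 in magnitude, the frozen output is never
flagged with Th = 2.3871, while the smaller threshold flags it on most samples.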
Figure 6.4: RPR filter decision signals for RPR with Br = 3 and Th = 2.3871. The FPout
signal is frozen at zero. The upper and lower comparison bound signals are calculated by
adding and subtracting Th to and from RPout. [Plot: x-axis time (0 to 100); y-axis range
−4 to 4; signals: RPout + Th, FPout, RPout − Th.]
This Th value is too large to handle this type of error. This type of error is fairly
common for this FPGA design when the clock or reset line is upset. This explains the poor
performance of RPR with Br = 3 in terms of preventing catastrophic errors as reported in
Table 5.2. For this design, then, a larger Br value must be used to give adequate performance.
With a larger Br and a lower Th value, the frozen full-precision output would be more likely
to be outside the noise limits. Using the theoretical Th values, a bit-width of Br = 6 or
Br = 7 would be more appropriate for a signal with this output range.
6.2.2 General Bit-width Selection
Selecting the best value of Br is highly dependent on the application in question.
This section presents a general overview of selecting possible Br values for an RPR module.
Upper Bound
The upper bound of Br depends on several factors. The most obvious of these is
Br < B (the full-precision bit-width) since Br = B is essentially TMR, which gives full
protection against single upsets. Even values close to B are undesirable due to the increased
overhead of the RPR decision blocks compared to TMR voters.
Another simple upper bound is an area or power limit imposed by application con-
straints. Besides the area and power costs of higher Br values, there is no general downside to
increased precision in the reduced-precision modules. This can only increase the performance
of the system.
Lower Bound
The lower bound of Br is determined by when the detection capabilities of RPR
degrade to unusable levels. Section 6.2.1 described an example where the Br value caused
the Th value to increase such that critical errors went undetected. Similar methods can be
used for other systems.
In a more general sense, the Th value is the general noise limit on the RPR system,
as seen in Equation (6.1). The designer of the RPR module can thus define an acceptable
noise limit at the output of the RPR decision block and increase Br until the calculated or
measured value of Th falls below this bound.
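
As a minimal sketch of this lower-bound procedure, the following Python function steps
through increasing Br values until the corresponding Th falls below a chosen noise limit. The
Th values are the theoretical ones from Table 6.2; the function name and the noise limit of
0.5 used in the usage line are hypothetical.

```python
# Theoretical Th values for the FIR filter, keyed by Br (Table 6.2).
TH_BY_BR = {3: 2.3871, 4: 1.2212, 5: 0.6046, 6: 0.3106, 7: 0.1597}


def min_bitwidth(th_by_br, noise_limit):
    """Return the smallest Br whose Th is below noise_limit, or None if none qualifies."""
    for br in sorted(th_by_br):
        if th_by_br[br] < noise_limit:
            return br
    return None


# Hypothetical acceptable noise limit at the output of the RPR decision block.
print(min_bitwidth(TH_BY_BR, 0.5))
```

The search relies on Th decreasing as Br increases, which holds for the values in Table 6.2.
The same function could be applied to measured T*h values in place of the theoretical ones.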
Optimization
These bounds, of course, are only a starting point for selecting Br for a particular
module. At this point, the designer must find the optimal trade-off between the cost of im-
plementation and the performance of the system. If the upset rate of the target environment
is very low, ERPR-avg will be small even with a low Br value. If the upset rate is higher, it
may be more important to use a high Br value to keep the noise low in the DU upset case.
For example, Figure 6.5 plots the value of ERPR-avg of the FIR filter design for several
bit-widths in two different upset environments: GPS orbit and Polar orbit. If the target
ERPR-avg for this system is 10^-6, the system in the Polar orbit requires a Br of 5. With the
higher upset rate of the GPS orbit, however, the system requires a Br of at least 7 to meet
the noise limit target.
Figure 6.5: ERPR-avg of the FIR filter design for several bit-widths and using two failure
rates. [Plot: x-axis Br (3 to 7); y-axis ERPR-avg (10^-7 to 10^-5); curves: GPS orbit, Polar
orbit, Error Target.]
In this case, using ERPR-avg as the measure of performance of the RPR system, the
upsets are not frequent enough in the Polar orbit to warrant a high cost of RPR. In the GPS
orbit, however, the RPR system is predicted to enter reduced-precision mode much more
often, increasing ERPR-avg significantly.
The effects of these trade-offs are highly dependent on the application in question and
cannot be generalized. What is important is that RPR can give many options for increasing
the performance of a system in the presence of SEUs. The next section presents results from
fault injection experiments that demonstrate these options, which trade off circuit area for
performance.
6.2.3 Bit-width Experiments
In order to demonstrate the effects of varying the reduced-precision bit-width (Br)
for Threshold RPR, the previous fault injection experiments were expanded. This section
reports on the performance of the simple feed-forward communications system of Section 5.3
for Br = 3–7. The designs tested used the experimentally-determined thresholds T*h in
Table 6.2. The results emphasize the flexibility of RPR by demonstrating the wide range of
cost and performance trade-off points that Threshold RPR offers this system.
Table 6.6 shows the SEU classification results from the fault injection experiments.
As expected, increasing the bit-width of the reduced-precision filters improved the handling
of catastrophic SEUs. The cost of implementation increased with Br as well.
Figure 6.6 plots the SNR loss values for the various versions of this filter. Notice that
the increase in Br does more than increase the design’s resistance to catastrophic SEUs. As
the size of the reduced-precision filters increases, the number of higher-noise SEUs decreases
as well. As expected, the more costly the RPR system, the lower the overall noise and the
higher the performance.
Again, TMR was much more effective at protecting the receiver system against SEUs
than RPR in our experiments. In the case of the RPR implementation with Br = 6, the
overhead cost of implementing RPR was about one-quarter that of TMR. This version of
RPR reduced the number of catastrophic bits by over 99% and significantly reduced the
number of high-noise SEUs. Although the RPR implementation with Br = 7 did not offer
any improvement in protection against catastrophic SEUs over the Br = 6 design, Figure 6.6
reflects the improvements in SNR loss offered by the extra hardware required. Even the
implementation with Br = 3 offers a significant improvement. At a cost of only 28% more
hardware, the number of catastrophic bits decreased by over 70%.
These results emphasize that RPR offers flexibility in its implementation options. It
is fairly straightforward to increase the performance of an RPR system in the presence of
SEUs by increasing the amount of redundancy in the reduced-precision modules. The range
of options RPR offers a particular application depends on the system to be protected and
the application requirements. It is clear, however, that RPR can offer intriguing trade-offs
between cost and performance.

Table 6.6: Number of SEUs causing each class of effect for an FIR filter protected with TMR and several levels of Threshold RPR using experimentally-determined thresholds (T*_h), compared to the unmitigated filter.

Design        Slices  Slices    Class 1  Class 2  Class 3  Class 4  Total Utilized  Total Catastrophic  Improv. in
              Used    Overhead  Bits     Bits     Bits     Bits     Bits            Bits (% Reduction)  failure rate
Unmitigated   1,030   -         59,156   6,472    1,501    943      68,072          2,444 (-%)          -
TMR           3,171   208%      218,304  0        0        2        218,306         2 (99.92%)          1222×
RPR, Br = 7   1,755   70.4%     106,863  6,191    11       2        113,067         13 (99.47%)         188×
RPR, Br = 6   1,602   55.5%     95,980   7,731    9        2        103,722         11 (99.55%)         222×
RPR, Br = 5   1,470   42.7%     84,583   7,709    42       2        92,336          44 (98.20%)         55.5×
RPR, Br = 4   1,394   35.3%     79,334   8,252    254      2        87,842          256 (89.53%)        9.55×
RPR, Br = 3   1,313   27.5%     74,129   8,267    634      36       83,066          670 (72.59%)        3.65×

Figure 6.6: Normalized percentage of SEUs causing certain SNR losses at BER of 10^-5 for an FIR filter protected with several levels of Threshold RPR compared to the unmitigated design. (Bar chart; x-axis: SNR loss of > 0.2, > 0.5, > 1, > 3, and > 6 dB; y-axis: normalized percentage of upsets, 0 to 14; series: Unmitigated and RPR with Br = 3 through 7.)
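The derived columns of Table 6.6 can be reproduced from its raw counts. The following Python sketch is illustrative only: it assumes, based on the arithmetic of the table itself, that the catastrophic bit count is the sum of the Class 3 and Class 4 bits and that the improvement in failure rate is the ratio of unmitigated to mitigated catastrophic bits. The function and variable names are hypothetical.

```python
# Raw values copied from Table 6.6: (slices used, Class 3 bits, Class 4 bits).
designs = {
    "Unmitigated": (1030, 1501, 943),
    "TMR":         (3171, 0, 2),
    "RPR, Br = 7": (1755, 11, 2),
    "RPR, Br = 6": (1602, 9, 2),
    "RPR, Br = 5": (1470, 42, 2),
    "RPR, Br = 4": (1394, 254, 2),
    "RPR, Br = 3": (1313, 634, 36),
}

def derived_columns(designs, baseline="Unmitigated"):
    """Return {design: (overhead %, catastrophic bits, % reduction, improvement factor)}.

    Assumes catastrophic bits = Class 3 + Class 4 and that every design has at
    least one catastrophic bit (true for all rows of Table 6.6).
    """
    base_slices, base_c3, base_c4 = designs[baseline]
    base_cat = base_c3 + base_c4
    rows = {}
    for name, (slices, c3, c4) in designs.items():
        cat = c3 + c4
        overhead = 100.0 * (slices - base_slices) / base_slices
        reduction = 100.0 * (base_cat - cat) / base_cat
        improvement = base_cat / cat
        rows[name] = (overhead, cat, reduction, improvement)
    return rows

rows = derived_columns(designs)
for name, (overhead, cat, reduction, improvement) in rows.items():
    print(f"{name:12s} overhead={overhead:6.1f}%  catastrophic={cat:5d}  "
          f"reduction={reduction:6.2f}%  improvement={improvement:8.2f}x")
```

Under these assumptions the computed values round to the entries in the table, which makes the cost and benefit of each additional reduced-precision bit easy to compare.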
6.3 RPR System Considerations
The preceding sections focused on the application of the RPR technique to a single
module. When considering a system made up of several smaller modules, there are additional
considerations. In addition to the bit-width parameter on the individual reduced-precision
modules, one must also consider the RPR decision blocks. Also, since RPR can only be
applied to arithmetic modules, it is important to choose which modules should be protected
with RPR and which should be protected with another method such as TMR. This section
discusses these issues and provides guidelines for making these decisions.
6.3.1 RPR Decision Blocks
RPR decision blocks, or RPR voters, must be used to resolve the full-precision and
reduced-precision outputs into a single output. The number and placement
of voters in a TMR- or RPR-mitigated system can have a large effect on the reliability and
cost of that system [45], [75]–[78]. RPR voters are needed for all types of RPR, though their
complexity varies. The voters required for Threshold and Bounded RPR are more costly
than those of RP-TMR, requiring approximately 8–9 times more logic.
Voter Count and Placement
When several reduced-precision modules are connected, the quantization noise of
each contributes to the total quantization noise at the output. In essence, the maximum
quantization error of a simple module is generally less than that of a more complex module. Sullivan
noted the effects of these “compound operations” in her thesis and calculated the increased
bounds needed for some sample operations [64].
In the extreme cases RPR voters could be placed: 1) at the output of every individual
arithmetic module or 2) only at the final output of the system. With voters at the output of
every module, either Br or Th could be lowered for each module and decision block. With a
voter only at the output of the system, either Br or Th must be increased to account for
the extra quantization error accumulated through multiple reduced-precision modules.
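As a rough illustration of why a single output voter forces a larger Br, the Python sketch below uses a deliberately simple model that is an assumption of this example, not the derivation in Appendix D: each reduced-precision module contributes a worst-case error of 2^-Br, and the errors of chained modules add with unit gain. The function names are hypothetical.

```python
def max_accumulated_error(num_modules, br):
    """Worst-case error at the voter input when num_modules module errors add.

    Toy model (an assumption): each module contributes at most 2**-br.
    """
    return num_modules * 2.0 ** -br

def min_br_for_threshold(num_modules, th):
    """Smallest Br whose accumulated worst-case error does not exceed th (th > 0)."""
    br = 1
    while max_accumulated_error(num_modules, br) > th:
        br += 1
    return br

th = 2.0 ** -4                                 # error detection limit to maintain
per_module_br = min_br_for_threshold(1, th)    # a voter after every module
output_only_br = min_br_for_threshold(4, th)   # one voter after four chained modules
print(per_module_br, output_only_br)
```

In this toy model, moving from per-module voters to a single voter after four modules costs two extra bits per reduced-precision module; real modules with gain or correlated errors would need the bounds calculated for the specific compound operation.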
As an example of the trade-offs involved in selecting the number and locations of
voters, consider the 4-tap FIR filter illustrated in Figure 6.7. Section D.2 estimates the cost
of this circuit with several different voter configurations:
1. RPR voters at the output of each multiplier
2. Triplicated RPR voters at the output of each multiplier
3. A single RPR voter at the output of the filter
4. One triplicated RPR voter at the output of the filter
In each case, the RPR voters use Shim’s optimized architecture. The appendix shows that,
when moving the voters from the multiplier outputs to the filter output, the bit-width of
the reduced-precision modules must increase in order to maintain the same error detection
limit, Th.
Figure 6.7: Block diagram of a 4-tap FIR filter.
Table 6.7 shows the cost estimates for these configurations, derived from the resource
utilization tables in Appendix E. In this example, the cost of each system is nearly equal in the
case of optimized, non-triplicated voters. If the voters are to be protected from SEUs as well,
however, the cost of the first system is significantly higher. In this case it is preferable to
increase the bit-widths of the reduced-precision modules rather than add more RPR decision
blocks in order to maintain the same threshold.
Table 6.7: Estimated cost of several 4-tap FIR filter circuits protected with RPR.

Voter Locations      Single Voters   Triplicated Voters
After multipliers    1,169           1,609
At filter output     1,157           1,245
This analysis is a simple example and will certainly be different for each system
considered. In general, however, it is important to consider the number and position of the
RPR decision blocks in a digital system. There is a trade-off between the accumulation of
quantization noise and the area cost of more voters.
Suggested Workflow for Placing RPR Decision Blocks
A reasonable starting point for selecting the number and location of RPR voters is
to constrain the noise in the system by fixing the noise limit, or Th value, at a certain point
in the circuit. The simplistic approach places one RPR decision block at that point in the
circuit and calculates the necessary Br values of the components leading up to that point
to achieve that Th value. If there are many components leading to that point, however, the
Br values required may be very large. As an optimization step, one or more RPR decision
blocks could be added to the system. This can reduce the strain on the initial decision block
and allow smaller Br values, reducing the cost of the reduced-precision modules.
Figure 6.8 presents the workflow diagram for this process. As suggested above, the
first time through the loop, a single decision block, or voter, can be placed at the output
of the system. This yields the minimum possible voter cost and the simplest
choice of the decision error threshold, Th. When multiple voters are present in the system,
a Th value must be chosen for each point in the system. This dissertation does not discuss
the optimal locations of RPR voters in a system. The reader is referred to voter placement
work for TMR systems as a starting point for this analysis [45], [75]–[78].
6.3.2 Mixing RPR with TMR
In a full system, one must consider where to apply RPR and where to use some
other mitigation method. RPR cannot be applied to all forms of logic and other constraints
can make RPR less desirable than other mitigation approaches for certain modules. It
is important to find the right balance between RPR and other mitigation methods for a
particular system.
Figure 6.8: Workflow for choosing the location and number of decision blocks in an RPR system.
This section makes the assumption that all modules within the system are to be
protected. For simplicity, RPR will be used whenever possible to reduce costs and TMR will
be used otherwise. This is a reasonable approach for a designer wishing to save cost over a
TMR implementation. Of course, there are many other trade-offs which can be made such
as leaving some non-critical modules unprotected or applying a broader mix of mitigation
techniques. These assumptions, however, allow a simple and fair comparison with a system
fully protected with TMR, which will be demonstrated in Section 6.3.4.
Following is a list of basic rules to follow when partitioning a design into RPR and
TMR sections and placing voters:
1. A voter should be inserted in every feedback loop: either a TMR voter or an RPR
voter.
2. A voter must be inserted before changing from RPR to TMR or from mitigated to
unmitigated.
3. Non-arithmetic modules must be protected with TMR.
4. Small modules with feedback should be protected with TMR due to the large cost of
RPR voters.
5. RPR voters, which are large, should be used sparingly.
6. Both RPR and TMR voters should be triplicated for maximum reliability.
Each of these points will be explained individually in the following paragraphs.
A voter should be inserted in every feedback loop
Some method must be used to synchronize the data within the three loop replicas in
the event of an SEU corruption. The simplest way to do this is to cut the path of every
feedback loop with a voter. When an SEU causes an error in one of the feedback loops, the data
in that loop remains corrupted even after scrubbing repairs the SEU. At that point, either
the circuit must be reset or the loop must be resynchronized with the other two functioning
loops by a voter in the feedback loop [44], [45].
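The persistence of corrupted state can be seen in a small behavioral model. The Python sketch below is a toy illustration, not the circuit used in this work: it assumes three accumulator replicas, models an upset as a one-time XOR on the register of one replica (as if scrubbing repaired the configuration immediately afterward), and uses a bitwise majority voter. All names are hypothetical.

```python
def majority(a, b, c):
    """Bitwise majority vote of three integers."""
    return (a & b) | (a & c) | (b & c)

def run_accumulators(inputs, upset_cycle, upset_mask, vote_in_loop):
    """Simulate three accumulator replicas and return their final register values.

    At upset_cycle the register of replica 0 is XORed with upset_mask once.
    If vote_in_loop is True, the fed-back value of every replica is the
    majority of the three registers; otherwise each replica feeds back its own.
    """
    state = [0, 0, 0]
    for cycle, x in enumerate(inputs):
        if cycle == upset_cycle:
            state[0] ^= upset_mask              # transient corruption of one replica
        fed_back = [majority(*state)] * 3 if vote_in_loop else state
        state = [s + x for s in fed_back]
    return state

inputs = [1] * 5
without_voter = run_accumulators(inputs, upset_cycle=2, upset_mask=0x100, vote_in_loop=False)
with_voter = run_accumulators(inputs, upset_cycle=2, upset_mask=0x100, vote_in_loop=True)
print(without_voter, with_voter)
```

Without the voter, replica 0 carries its corrupted value forward indefinitely even though no further upsets occur; with the voter cutting the loop, the three replicas agree again on the next cycle.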
A voter must be inserted before changing from RPR to TMR
Changing from RPR to non-RPR mitigation (or to unmitigated) requires the insertion
of an RPR voter. As explained in Section 6.3.1, an RPR voter is required to decode the
three RPR outputs into a generally-usable signal. This applies when moving from a section
of the circuit protected with RPR to a section protected by TMR, as well. TMR requires
three identical inputs, so the differing outputs of RPR must be resolved into a single output
that is then read by three TMR modules, or directly into three identical outputs.
A voter is not needed when moving from TMR to RPR. In this case, one of the
TMR outputs is taken as the full-precision input and the two other TMR outputs are simply
truncated to the precision needed by the two reduced-precision modules. Similarly, moving
from unmitigated to RPR is accomplished by the three RPR modules tapping off from the
same signal and no voter modules are required.
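The TMR-to-RPR transition described above can be sketched in a few lines of Python. This is a minimal illustration that assumes unsigned fixed-width words and truncation by dropping least significant bits; signed or fractional formats would need their own handling. The function names are hypothetical.

```python
def truncate_msbs(value, full_bits, reduced_bits):
    """Keep the reduced_bits most significant bits of an unsigned full_bits-wide word."""
    if not 0 < reduced_bits <= full_bits:
        raise ValueError("reduced_bits must be in 1..full_bits")
    if not 0 <= value < (1 << full_bits):
        raise ValueError("value does not fit in full_bits")
    return value >> (full_bits - reduced_bits)

def tmr_to_rpr_inputs(tmr_outputs, full_bits, reduced_bits):
    """Map three TMR outputs to one full-precision and two reduced-precision inputs.

    No voter is involved: one TMR output passes through unchanged and the
    other two are truncated independently.
    """
    fp, rp_a, rp_b = tmr_outputs
    return (fp,
            truncate_msbs(rp_a, full_bits, reduced_bits),
            truncate_msbs(rp_b, full_bits, reduced_bits))

rpr_inputs = tmr_to_rpr_inputs((0xABCD, 0xABCD, 0xABCD), full_bits=16, reduced_bits=4)
print([hex(v) for v in rpr_inputs])
```

Because each RPR input is driven by a different TMR replica, an upset in one upstream replica reaches only one of the three RPR modules.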
Non-arithmetic modules must be protected with TMR
As explained in Section 4.3, RPR is only suitable for protecting arithmetic modules.
There is no general method for creating reduced-precision state machines or other arbitrary
logic.4 Thus non-arithmetic modules must either be left unprotected or be protected with
another method such as TMR.
Small modules with feedback should be protected with TMR
Because of the high cost of RPR voters, it is undesirable to use them within small
feedback loops—those with a relatively small amount of hardware. For example, consider
the simple feedback circuit illustrated in Figure 6.9. The feedback loop should be cut by
a voter to keep the redundant circuit synchronized. In addition, if the register holds an
arithmetic value, an RPR implementation of the circuit might be considered. An RPR voter
for this module, however, would be the same size as that for a more complex module such as
a multiplier or filter with the same bit-widths. For this particular module, it would actually
be more efficient to apply full TMR to the circuit to make use of simple TMR voters rather
than to create an RPR version.
This simple circuit requires roughly 16 LUTs (for the multiplexer)
and 16 FFs (for the register) in a Xilinx Virtex architecture. If the reduced-precision
replicas of this circuit use signals that are 4 bits wide, each requires 8 LUTs and 8 FFs.
4 It is possible that suitable reduced-cost circuits could be designed which perform some type of estimate similar to RPR for certain systems, though such cases are beyond the scope of this work. Snodgrass presented a theoretical discussion of the characteristics of possible candidates for an extension of RPR [5].
Figure 6.9: Block diagram of a simple circuit with feedback.
The estimated cost of the RPR version of the circuit (with a triplicated, optimized decision
block) is:
A_{simple} = (M + 2M_r) + (R + 2R_r) + 3V_r
           = 196.    (6.4)
In contrast, the cost of the TMR version of this circuit is:
A_{simple} = 3M + 3R + 3V_{TMR}
           = 144.    (6.5)
Even though the reduced-precision modules are half the cost of each TMR replica,
the higher cost of the RPR voters outweighs the benefit of using RPR. Outside of a feedback
loop, the RPR decision blocks can be spread out to amortize their cost across many modules.
Inside the feedback loop, however, at least one voter must be used. In this case, TMR is
preferable to RPR.
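The arithmetic behind Equations (6.4) and (6.5) can be checked with a short Python sketch. The module costs are those stated in the text; the per-voter costs are not stated here, so the sketch back-solves them from the two totals. Those inferred voter costs, and the function name, are assumptions of this example rather than figures quoted from Appendix E.

```python
# Costs stated in the text, with LUTs and FFs summed as in Equations (6.4) and (6.5).
M, R = 16, 16              # full-precision multiplexer (LUTs) and register (FFs)
M_r, R_r = 8, 8            # reduced-precision multiplexer and register
A_RPR, A_TMR = 196, 144    # totals from Equations (6.4) and (6.5)

def implied_voter_cost(total, module_cost, num_voters=3):
    """Back-solve the cost of one voter from a stated total (an inference)."""
    return (total - module_cost) / num_voters

rpr_modules = (M + 2 * M_r) + (R + 2 * R_r)      # module terms of Equation (6.4)
tmr_modules = 3 * M + 3 * R                      # module terms of Equation (6.5)
V_r = implied_voter_cost(A_RPR, rpr_modules)     # implied cost of one RPR voter
V_TMR = implied_voter_cost(A_TMR, tmr_modules)   # implied cost of one TMR voter

module_saving = tmr_modules - rpr_modules        # what RPR saves on the replicas
voter_penalty = 3 * (V_r - V_TMR)                # what the triplicated RPR voters add
print(rpr_modules, tmr_modules, V_r, V_TMR, module_saving, voter_penalty)
```

With these numbers the replicas save 32 units under RPR, but the implied voter penalty is 84 units, which accounts for the difference of 52 between the two totals.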
RPR voters should be used sparingly
As discussed in Section 6.3.1, RPR voters can be placed at the output of every
arithmetic module or spread out in the system. Placing RPR voters closer together decreases
the estimation error at the input to the voters, but the cost of the voters can quickly outpace
the cost of TMR, as shown in the previous example. To achieve an area gain over TMR,
then, the number of RPR voters should be limited as much as possible while still achieving
the desired performance.
Both RPR and TMR voters should be triplicated
In an FPGA, RPR and TMR voters are implemented using the same configurable
logic as the modules to be protected. This means the voters themselves are susceptible to
SEUs. It has long been known, therefore, that TMR voters internal to the FPGA should be
triplicated [42], [44] for maximum reliability. The same is true for RPR voters. A single RPR
voter creates a single point of failure and triplicating the voter removes that vulnerability.
Triplicating the RPR voter essentially protects it completely using TMR.
By following these basic rules, a system can be partitioned into TMR and RPR
sections and the locations of voters chosen. Section 6.3.4 will use these rules to partition the
recursive binary PAM receiver from Section 3.6.
6.3.3 System Mitigation Design
When applying SEU mitigation to an entire system, it is clear that many issues
and options must be considered. Section 6.1 provided a method for setting the decision
threshold Th for an RPR decision block. Section 6.2 discussed the considerations for setting
the reduced-precision bit-width, Br, for an RPR module. Section 6.3 has presented important
issues for using RPR voters and for mixing RPR with TMR. This section presents a possible
workflow for designing a system using RPR with TMR.
Figure 6.10 shows this suggested workflow, which includes elements from the decision
block workflow shown in Figure 6.8. First, the rules of Section 6.3.2 should be applied
to choose which modules should be protected with RPR and which should be protected
Figure 6.10: Workflow for applying RPR+TMR to a digital system.
with TMR. The locations of TMR voters for synchronization should be straightforward to
identify by analyzing the feedback loops in the design. Then an initial placement of RPR
voters should be made again using the rules from Section 6.3.2. Initial Br values for any
RPR modules can then be selected as discussed in Section 6.2. Having set the Br values,
the Th values for each RPR voter can be determined mathematically or experimentally as
described in Section 6.1.
With this initial implementation of RPR, the Th values can be analyzed at their
respective locations in the system to determine if they are acceptable for the system in
question, as discussed in Section 6.2.1. If any of the Th values is too large, either more voters
should be added or Br should be increased in that section. Once an acceptable set of Th
values is found, the resulting noise bounds should be examined with respect to the system
requirements. If the noise bound is tighter than necessary, area overhead can be reduced by
either reducing the number of RPR voters in the system or by decreasing the Br values of
some of the RPR modules. This optimization loop can be repeated until a suitable, low cost
implementation is found.
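The trade-off explored by this optimization loop can be sketched as a brute-force search over the same two knobs, the number of RPR voters and Br. The Python example below is a hypothetical illustration, not the workflow of Figure 6.10 itself: it assumes a chain of identical modules, a toy noise model in which each module adds at most 2^-Br of error between voters, a reduced-precision module cost proportional to Br, and triplicated voters of fixed cost. All names and cost figures are invented for illustration.

```python
import math

def section_threshold(modules_in_section, br):
    """Toy model (an assumption): threshold needed to cover accumulated error."""
    return modules_in_section * 2.0 ** -br

def choose_configuration(total_modules, th_limit, br_max, cost_per_bit, voter_cost):
    """Return (cost, voters, br) of the cheapest configuration meeting th_limit.

    Returns None if no configuration with br <= br_max meets the limit.
    """
    best = None
    for voters in range(1, total_modules + 1):
        per_section = math.ceil(total_modules / voters)
        # Smallest Br that keeps the largest section within the threshold limit.
        br = next((b for b in range(1, br_max + 1)
                   if section_threshold(per_section, b) <= th_limit), None)
        if br is None:
            continue    # Th too large even at br_max: more voters are needed
        # Two reduced-precision replicas per module, plus triplicated voters.
        cost = 2 * total_modules * cost_per_bit * br + 3 * voters * voter_cost
        if best is None or cost < best[0]:
            best = (cost, voters, br)
    return best

expensive_voters = choose_configuration(4, 2.0 ** -4, 8, cost_per_bit=5, voter_cost=44)
cheap_voters = choose_configuration(4, 2.0 ** -4, 8, cost_per_bit=50, voter_cost=1)
print(expensive_voters, cheap_voters)
```

When voters are expensive relative to the arithmetic, the search settles on a single voter with a larger Br; when the arithmetic dominates, it prefers more voters and a smaller Br.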
The following section gives an example of a system that benefits from this type of
workflow. It will require the application of several of the rules and considerations presented
in the preceding sections. By using these techniques, a reliable system can be constructed
at a much lower cost than full TMR.
6.3.4 Recursive System Experiments
This section demonstrates the RPR technique on the recursive system described in
Section 3.6. This system is more complex than the simple feed-forward systems on which RPR
was demonstrated in Sections 4.7, 5.3, and 6.2.3. This system contains feedback loops,
non-arithmetic components, and sections with small feedback loops. As a larger system, the
question of the number of RPR voters to use is also more complicated.
Implementation Details
Figure 6.11 shows a diagram of the receiver system annotated with the type of mit-
igation applied to each component in the system. The locations of a TMR voter (for syn-
chronization) and an RPR decision block are also indicated.
Figure 6.11: Block diagram of the recursive binary PAM demodulator with annotations for RPR+TMR.
This particular design has several characteristics which make it impractical to apply
RPR to the entire design. First, the system contains non-arithmetic logic (in the decision
block) for which RPR is not suited. Second, there are small feedback loops in the NCO
block which are not pictured in Figure 6.11. The logic within these feedback structures is
very small as seen in Figure 6.12.
Notice the two feedback loops in Figure 6.12. One contains a multiplexer, register,
and two addition units. The second contains only a multiplexer and a register. An RPR
version of this module with Br = B/2 would cost about twice that of the original module, not
considering voters. A TMR version would cost about three times as much as the original.
Adding triplicated voters to each of the two feedback loops in this module, however, increases
the cost of RPR significantly.
Since a voter must be inserted in each feedback loop in the design and an RPR voter
is similar in size to the logic in this feedback loop, it is preferable to apply TMR to this
module which uses much simpler and smaller voters. The loop filter and TED blocks are
relatively small modules as well, so it is better to apply TMR to both of these modules,
which feed into the NCO block, rather than switch between RPR and TMR at that point.
Figure 6.12: Block diagram of the NCO block within the recursive binary PAM demodulator, exported from Xilinx System Generator. (Block labels recoverable from the diagram: mu, strobe, zero, one, underflow (a > b), add one, add, Scale, Register 1 to 3, Mux 1 and 2, 1/N + LFout, 1/N, 0.5LFout.)
The matched filter and interpolator blocks, however, are ideal candidates for RPR.
Each contains a significant amount of arithmetic logic. The size of these blocks offsets the
cost of adding an RPR voter. In this case, we have chosen to place a single RPR voter at
the output of the interpolator. This means that the quantization errors of the filter and
interpolator structures accumulate, requiring a larger threshold for the RPR voter. Recall that a
larger threshold means that more SEU-induced noise passes through the system unnoticed.
The cost of the system is reduced, however, compared to an implementation with RPR voters
at the output of both the filter and the interpolator.
For the matched filter and interpolator blocks, reduced-precision modules with Br =
7 bits of precision at their inputs were added. We determined experimentally that this
redundancy factor (k = 7) would be a suitable trade-off between mitigation cost and SEU
protection for this system. The value of Th = 0.35 was also chosen experimentally to avoid
FD events in the RPR decision block.
The “RPR Voter” at the output of the interpolator used three identical decision blocks
and converted the three interpolator output signals (one full-precision and two reduced-
precision) into three identical full-precision outputs. The three identical outputs were needed
by the triplicated TED and decision blocks which were protected with TMR. The TMR voter
at the output of the loop filter block intersects the two feedback loops pictured, correcting
any synchronization issues between the three branches.
Experimental Results
Tables 6.8 and 6.9 show the fault injection results for the recursive demodulator
system. The results are similar to those observed for the feed-forward systems examined.
Specifically, TMR again eliminated virtually all of the catastrophic SEUs, leaving only four
susceptible configuration bits. The TMR implementation was over three times as large as
the unmitigated system in terms of FPGA slices. The system with the combination of RPR
and TMR (RPR+TMR) reduced the number of catastrophic bits by over 97% while only
doubling the size of the design.
Note that the number of Class 3 and Class 4 SEUs is higher than in the feed-forward
designs protected with a similar level of RPR. In this experiment, any TMR failures in
the triplicated RPR decision block were not removed from the SEU classification test, as
explained in Section 5.3.1. The location of the decision block within the feedback loop
made it difficult to identify those SEUs accurately. Thus some of the catastrophic SEUs are
expected to be a result of the imperfect application of TMR to the RPR decision block and
could be removed as described in Section D.3.
Table 6.9 summarizes the SNR losses for each design. Again, while TMR eliminates
any SNR loss, the RPR+TMR approach reduces the overall number of high-noise errors
significantly. Losses of more than 6 dB were reduced from 8.40% to 0.190% while losses of
more than 3 dB were reduced from 9.92% to 1.85%. Recall that these numbers include the
catastrophic SEUs, which cause infinite SNR loss.
Figure 6.13 shows the combined BER plot for the RPR+TMR implementation of the
receiver. This plot is similar to the unmitigated version of Figure 3.16, but most of the
catastrophic bits (including the histogram spikes at a BER of 0.5) have been eliminated.
These results confirm that RPR is a viable option for mitigating catastrophic SEUs
in a recursive communications system as well. Though TMR nearly perfectly protected the
system, the overhead cost was predictably near 200%. The RPR+TMR design was effective
at significantly reducing catastrophic SEUs, improving the failure rate of the system by 42×
at a cost about half that of TMR.

Table 6.8: Number of SEUs causing each class of effect for the binary PAM demodulator protected with full TMR and RPR+TMR, compared to the unmitigated demodulator.

Design       Slices  Slices    Class 1  Class 2  Class 3  Class 4  Total Utilized  Total Catastrophic  Improvement in
             Used    Overhead  Bits     Bits     Bits     Bits     Bits            Bits (% Reduction)  failure rate
Unmitigated  1,410   -         75,783   14,335   4,450    1,548    96,116          5,998 (-%)          -
TMR          4,526   221%      277,714  0        0        4        277,718         4 (99.93%)          1,499.5×
RPR+TMR      3,030   115%      156,610  21,933   136      7        178,686         143 (97.62%)        41.944×

Table 6.9: Normalized percentage of SEUs causing certain SNR losses at BER of 10^-5 for the binary PAM demodulator protected with full TMR and RPR+TMR.

Design       > 0.2 dB  > 0.5 dB  > 1 dB  > 3 dB  > 6 dB
Unmitigated  21.15%    15.19%    13.12%  9.92%   8.40%
TMR          0%        0%        0%      0%      2.08 × 10^-3 %
RPR+TMR      22.97%    10.44%    6.86%   1.87%   0.19%

Figure 6.13: BER plot for the binary PAM receiver system with timing synchronization using RPR+TMR for mitigation.
6.4 Summary
This chapter discussed many factors that are important when applying Threshold
RPR to a system. It showed that, for some systems, an error-detection threshold value T*_h
lower than the maximum reduced-precision estimation error, εmax, can be found experimentally,
which can increase the performance of an RPR system. This chapter also emphasized
the importance of the bit-width of the reduced-precision modules, Br, and demonstrated the
flexibility of RPR by varying this value. Finally, the chapter gave guidelines for the applica-
tion of RPR on complex, recursive systems and suggested a workflow for designing such an
RPR-based system.
CHAPTER 7. CONCLUSION
7.1 Summary of Contributions
The following is a summary of the research presented in this dissertation and its major
contributions:
Chapters 1 and 2 provided motivation for this research. These chapters gave the
background necessary to understand the importance and impact of mitigating SEUs on
FPGA systems and for reducing the cost of said mitigation. Chapter 2 explained current
techniques for mitigating SEUs and for evaluating the impact of SEUs on FPGA systems.
Chapter 3 went beyond previous SEU evaluation methods, focusing on FPGA-based
DSP and digital communications systems. That chapter presented a novel approach for
evaluating the SEU tolerance of these systems. While previous work treated all sensitive
upsets the same, this work showed that by analyzing the bit error rate caused by each
sensitive upset in a communications system, SEUs could be categorized into catastrophic
and non-catastrophic upsets.
Chapter 3 also introduced a novel fault injection platform that used this application-
specific method of evaluating an FPGA system. This platform allowed very rapid evaluation
of the communications systems in the presence of SEUs. With the new analysis method
and optimized fault injection platform, new mitigation approaches could be quickly and
comprehensively evaluated.
This new fault injection platform allowed for a detailed analysis of the locations of
critical and non-critical SEUs in a simple communications system. The critical SEUs in this
communications system were a small fraction of its sensitive SEUs and mainly affected the
global clock, global reset, and the most significant bits of arithmetic. This important result
led to a search for reduced-cost mitigation techniques focusing on this critical subset of
SEUs in DSP circuits.
Using the knowledge gained from these fault injection experiments, Chapter 4 intro-
duced RPR as a possible alternative to TMR for DSP and communications systems. RPR
protects against the most critical SEUs in an arithmetic circuit by focusing mitigation on
the most significant bits of computation. Extensive fault injection experiments demonstrated
RPR to be an effective and efficient alternative to TMR. RPR required less than half the
overhead of TMR while providing good coverage of the most critical SEUs.
After determining RPR to be a valid alternative to TMR, Chapter 5 examined sev-
eral different approaches for applying RPR to a system. That chapter provided a description
and comparison of three different types of RPR, including Threshold RPR, Bounded RPR,
and RP-TMR. RP-TMR is a novel variation of RPR first presented in this dissertation and
was shown to have several desirable traits. Both Threshold and Bounded RPR were intro-
duced elsewhere but had not previously been compared directly. Fault injection experiments
demonstrated each type of RPR and Threshold RPR was determined to be the best technique
for FPGA-based systems.
As the superior implementation of RPR for FPGA systems, Threshold RPR was
examined further in Chapter 6. That chapter explained the effects of choosing reduced-
precision bit-widths and decision threshold and presented methods for determining good
values for each. A novel experimental approach for determining the error detection threshold
of RPR was presented that can significantly improve the performance of RPR in some systems
with no additional hardware cost.
Chapter 6 also included a demonstration of applying RPR to a more complex commu-
nications receiver. No previous work has used RPR to protect a complex system not entirely
suited to RPR. In examining this system, this dissertation identified several important steps
that should be taken to mitigate such a system using RPR and showed how RPR can be used
in conjunction with TMR. Fault injection experiments confirmed that a system protected
with a mixture of RPR and TMR had significantly improved reliability over the unmitigated
system at a much lower cost than TMR only.
Appendix F presents the design of a pair of on-orbit experiments to validate RPR as
a reduced-cost alternative to TMR. One design has already been deployed in orbit and the
other is scheduled to launch in 2011.
7.2 Future Work
This dissertation demonstrated how RPR could be applied to communications sys-
tems in order to reduce the cost of SEU mitigation on FPGAs. There is much that can still
be done to further investigate the properties and utility of RPR. Several examples of future
work are suggested here:
Application of RPR to new modules and applications
Although RPR was analyzed here specifically for DSP and communications systems,
it can be used in other types of systems that use arithmetic. It would be interesting to
examine other application domains and apply RPR in those systems. More types of DSP
systems could also be considered such as fast Fourier transform (FFT) modules, infinite
impulse response (IIR) filters, and trellis decoders.
Optimal placement of RPR decision blocks
Section 6.3.1 brought up the question of the optimal locations of RPR decision blocks
in large systems. This is an open question which is likely related to the optimal placement
of TMR voters, for which applicable research was cited in that section.
Automated tool to apply RPR
All of the example systems presented in this dissertation were created by hand. An
automated tool that could automatically apply RPR to an unmitigated system would be
extremely useful to an FPGA design engineer. This tool could be completely automatic,
determining the location of RPR voters and reduced-precision bit-widths, or could use input
from a designer to assist in these choices.
Extension of RPR with history-aided decision blocks
As presented here and in previous work, RPR decision blocks make a cycle-by-cycle
comparison and decision. This can cause RPR to switch rapidly between the full-precision
and reduced-precision outputs. In some applications, this may not be desirable. By keeping
some history of previous decisions, RPR could be extended to switch over to reduced-precision
mode for an extended period of time rather than this cycle-by-cycle determination. In an
FPGA, for example, RPR could stay in reduced-precision mode until scrubbing has repaired
the configuration.
Use of variable thresholds
Threshold RPR need not have a static threshold in all systems. As mentioned, knowl-
edge of the input signal characteristics or operating environment can allow one to lower the
Th value. If these conditions are known to change, the performance of RPR may be improved
by using lower values of Th when the probability of false positive detections drops based on
these conditions.
Comparison of RPR with error-control coding
RPR in a communications system is designed to reduce the bit error rate of the
system in the presence of SEUs. This comes at a hardware cost. It would be interesting to
compare the cost and performance of RPR against data level redundancy techniques such
as error-control coding circuits in SEU environments.
7.3 Concluding Remarks
With FPGAs being used for DSP and communications applications in space systems,
SEU mitigation techniques for these systems are increasingly important. TMR offers good
protection, but comes at a high implementation cost. Application-specific mitigation tech-
niques such as RPR may be the future of SEU mitigation. These techniques can offer a
significant decrease in failure rate at a much lower cost than TMR.
This dissertation has provided significant insight into evaluating these alternative
mitigation techniques for FPGA systems. The application-specific evaluation technique pre-
sented for communications systems allows a superior evaluation of the effects of SEUs on the
systems and can be mimicked in other applications for similar results. With the knowledge
that catastrophic SEUs are a small fraction of the sensitive SEUs of some communications
systems, reduced-cost mitigation techniques such as RPR can provide significant advantages
to space-bound FPGA systems.
This dissertation has also shown that RPR is an excellent technique for protecting
DSP and communications systems, which rely heavily on arithmetic operations. With the de-
tailed comparison of different RPR techniques presented here, the strengths and weaknesses
of each are readily apparent. The examples given demonstrated that RPR is an effective
mitigation technique for FPGA systems and should be considered where cost of mitigation
is important. Even in a complex system in which RPR is unable to replace TMR completely,
RPR can be used jointly with TMR to significantly reduce costs.
Using the knowledge gained through this research, RPR could find its way aboard
future space systems. Knowing that most SEUs do not cause catastrophic errors in commu-
nications systems is key for evaluating the suitability of an FPGA design for such systems.
RPR, with its relatively low cost, could be used where TMR is prohibitively expensive.
This could allow new space systems to take advantage of the many benefits that commercial
FPGAs offer or could allow additional functionality to be added to systems by freeing up
valuable FPGA resources. RPR can open doors that the expensive TMR technique has
effectively shut.
ACRONYMS
ASIC application-specific integrated circuit. 1
BER see bit error rate. 29, 30
DSP digital signal processing. 1
DU see Detected upset. 57, 60, 160
DUT design under test. 19, 34
FD see False detection, no upset. 57, 60, 160
FPGA field-programmable gate array. 1
LSB least significant bit. 55
MSB most significant bit. 41, 55
MTTF see Mean time to failure. 24, 65, 160
NU see No upset, no false detection. 57, 60, 160
RPR reduced-precision redundancy. 51
SEU single event upset. 2, 8
SNR signal-to-noise ratio. 30
TMR triple modular redundancy. 12, 15, 23
UU see Undetected upset. 57, 60, 160
GLOSSARY OF TERMS
a The detection probability, a, is the fraction of upsets which are detected by the particular
RPR implementation. 57, 77
B The bit-width of the full-precision module in an RPR system. 62, 117
Br The bit-width of the reduced-precision module in an RPR system. 62, 103, 113
εe The error signal formed by the difference of the outputs of the full-precision and reduced-
precision modules in an RPR system. This can be considered the quantization noise
or the estimation error of the reduced-precision module: εe = FPtrue − RPout. 58,
59, 89, 90, 104, 147, 157, 168
εmax The absolute maximum value of the estimation error, εe: εmax = max |εe|. 59, 89, 104
εRPR The error signal at the output of an RPR system: εRPR = FPtrue − RPRout. 52, 89
ERPR-avg The average error bound of an RPR system, taking into account the probability
of each RPR upset case and the error bounds in each case. 104
εu The error signal formed by the difference of the outputs of the true full-precision output
in the absence of upsets and the actual full-precision module in an RPR system. This
is the SEU-induced noise signal for a particular upset: εu = FPtrue − FPout. 59, 157,
158, 164, 165, 168
FPout The output signal of the full-precision module in an RPR system. 51
FPtrue The output signal of the full-precision module in an RPR system in the absence of
soft errors. 52
Pfp The probability of a false positive detection event in any given clock cycle. 58
Pupset The probability of a soft error in the full-precision module of an RPR system in any
given clock cycle. 57
RPout The output signal of the reduced-precision module in an RPR system. 51
RPRout The output signal of an RPR system. 52
Th The error-detection threshold used by Threshold RPR. 75, 91, 103, 104, 115
T*h An alternate error-detection threshold used by Threshold RPR for which the value has
been lowered below the maximum estimation error: T*h < εmax. 107, 165
bit error rate A measure of the performance of a communications system. It is the number
of incorrectly-received bits in a signal divided by the total number of bits transferred.
145
catastrophic SEU An SEU which causes a highly detrimental effect on the DSP system
in question. Non-catastrophic SEUs may cause errors, but these errors are much less
significant than the catastrophic SEUs. 39, 43, 45
configuration scrubbing Any of several methods used to continually repair any SEUs in
the configuration memory of an FPGA. 13
detected upset An upset occurs in the full-precision module of an RPR system and is
detected. The RPR system enters the reduced-precision mode. 145
failure rate λ, the rate at which failures occur in time in a particular system [2]. 20, 21
false detection, no upset In an RPR system, though there is no upset in the full-precision
module, the RPR decision block indicates that there is an error. In this case, the RPR
system is incorrectly in reduced-precision mode. 145
full-precision degraded mode RPR operating mode in which the FP module is not op-
erating perfectly, but its output is still approximately equal to the reduced-precision
output, so the slightly-degraded FP output is used. 52, 57
full-precision perfect mode RPR operating mode in which there are no upsets in the FP
module and the output of the system is the correct full-precision output. 52, 57
mean time to failure The expected time from initial operation until a failure occurs. 145
no upset, no false detection In an RPR system, no upset exists in the full-precision mod-
ule and there is no false detection. The RPR system is in full-precision perfect mode.
145
reduced-precision mode RPR operating mode in which the FP module output is different
enough from the RP output to determine that there is an error in the FP module and
the RP output is used instead of the erroneous FP output. 52, 57
reliability The ability of a system or component to operate correctly for a specified period
of time. Reliability is often reported as a probability or as a function of time. 15
sensitive The sensitive configuration bits are a subset of the utilized bits of an FPGA
design. When a sensitive bit is upset, the output of the design is altered for some
input or input sequence. 19, 20, 33, 34, 37, 43, 45, 64, 139, 149, 153, 164
sensitivity The number and location of the sensitive configuration bits of a particular
FPGA design. 19, 20, 151
SEU-induced noise The corruption of a digital signal due to an SEU in the FPGA con-
figuration. 30, 114, 158, 164
undetected upset An upset occurs in the full-precision module of an RPR system and is
not detected. The RPR system operates in full-precision degraded mode. 145
utilized bit A utilized configuration bit is a memory cell which the FPGA design in question
utilizes. For most FPGA designs, the majority of the available configuration cells are
unused. 19, 41, 64, 66, 68, 70, 149, 153, 154
REFERENCES
[1] P. Ostler, M. Caffrey, D. Gibelyou, P. Graham, K. Morgan, B. Pratt, H. Quinn, and
M. Wirthlin, “SRAM FPGA reliability analysis for harsh radiation environments,” Nu-
clear Science, IEEE Transactions on, vol. 56, no. 6, pp. 3519–3526, 2009.
[2] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems. A K Peters, 1998.
[3] E. Johnson, M. J. Wirthlin, and M. Caffrey, “Single-event upset simulation on an
FPGA,” in Proceedings of the International Conference on Engineering of Reconfig-
urable Systems and Algorithms (ERSA), T. P. Plaks and P. M. Athanas, Eds. CSREA
Press, Jun. 2002, pp. 68–73.
[4] B. Shim, S. Sridhara, and N. Shanbhag, “Reliable low-power digital signal processing
via reduced precision redundancy,” Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, vol. 12, no. 5, pp. 497–510, 2004.
[5] J. Snodgrass, “Low-Power fault tolerance for spacecraft FPGA-Based numerical com-
puting,” Ph.D. dissertation, Naval Postgraduate School, Monterey, CA, Sep. 2006.
[6] O. Mencer, M. Morf, and M. Flynn, “PAM-Blox: High performance FPGA design for
adaptive computing,” in FPGAs for Custom Computing Machines, 1998. Proceedings.
IEEE Symposium on, Apr. 1998, pp. 167–174.
[7] M. Cummings and S. Haruyama, “FPGA in the software radio,” Communications Mag-
azine, IEEE, vol. 37, no. 2, pp. 108–112, Feb. 1999.
[8] R. Tessier and W. Burleson, “Reconfigurable computing for digital signal processing: A
survey,” The Journal of VLSI Signal Processing, vol. 28, pp. 7–27, 2001.
[9] M. Caffrey, “A space-based reconfigurable radio,” in Proceedings of the International
Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), T. P.
Plaks and P. M. Athanas, Eds. CSREA Press, Jun. 2002, pp. 49–53.
[10] P. Graham, M. Caffrey, M. Wirthlin, D. E. Johnson, and N. Rollins, “Reconfigurable
computing in space: From current technology to reconfigurable systems-on-a-chip,” in
Proceedings of the 2003 IEEE Aerospace Conference. Big Sky, MT: IEEE, Mar. 2003,
pp. T07 0603.1–12.
[11] G. R. Goslin, “A guide to using field programmable gate arrays (FPGAs) for application-
specific digital signal processing performance,” in Xilinx Application Notes. Xilinx
Corporation, 1995.
[12] R. Petersen and B. Hutchings, “An assessment of the suitability of FPGA-based systems
for use in digital signal processing,” in Field-Programmable Logic and Applications, 1995,
pp. 293–302.
[13] C. Dick and F. Harris, “Configurable logic for digital communications: Some signal
processing perspectives,” Communications Magazine, IEEE, vol. 37, no. 8, pp. 107–
111, Aug. 1999.
[14] M. Cummings and S. Haruyama, “FPGA in the software radio,” Communications Mag-
azine, IEEE, vol. 37, no. 2, pp. 108–112, Feb. 1999.
[15] B. Salefski and L. Caglar, “Re-configurable computing in wireless,” in Proceedings of
the 38th annual Design Automation Conference. Las Vegas, Nevada, United States:
ACM, 2001, pp. 178–183.
[16] B. L. Hutchings and B. E. Nelson, “GigaOp DSP on FPGA,” The Journal of VLSI
Signal Processing, vol. 36, no. 1, pp. 41–55, Jan. 2004.
[17] C. Dick, F. Harris, and M. Rice, “FPGA implementation of carrier synchronization for
QAM receivers,” The Journal of VLSI Signal Processing, vol. 36, no. 1, pp. 57–71, Jan.
2004.
[18] R. A. Mewaldt. Cosmic rays. California Institute of Technology. [Accessed 15-October-
2010]. [Online]. Available: http://www.srl.caltech.edu/personnel/dick/cos encyc.html
[19] C. Beth Barbier. (2008, Jan.) Cosmicopia. National Aeronautics and Space
Administration. [Accessed 15-October-2010]. [Online]. Available: http://helios.gsfc.
nasa.gov/
[20] N. Cohen, T. Sriram, N. Leland, D. Moyer, S. Butler, and R. Flatley, “Soft error consid-
erations for deep-submicron CMOS circuit applications,” in Electron Devices Meeting,
1999. IEDM Technical Digest. International, 1999, pp. 315–318.
[21] P. Hazucha and C. Svensson, “Impact of CMOS technology scaling on the atmospheric
neutron soft error rate,” Nuclear Science, IEEE Transactions on, vol. 47, no. 6, pp.
2586–2594, 2000.
[22] N. Seifert, X. Zhu, and L. Massengill, “Impact of scaling on soft-error rates in com-
mercial microprocessors,” Nuclear Science, IEEE Transactions on, vol. 49, no. 6, pp.
3100–3106, 2002.
[23] P. Dodd and L. Massengill, “Basic mechanisms and modeling of single-event upset in
digital microelectronics,” Nuclear Science, IEEE Transactions on, vol. 50, no. 3, pp.
583–602, Jun. 2003.
[24] C. Martha O’Bryan. (2009, Mar.) Radiation effects and analysis home page. National
Aeronautics and Space Administration Goddard Space Flight Center. [Accessed
15-October-2010]. [Online]. Available: http://radhome.gsfc.nasa.gov/
[25] D. E. Johnson, “Estimating the dynamic sensitive cross section of an FPGA design
through fault injection,” Master’s thesis, Brigham Young University, Provo, UT, Apr.
2005.
[26] H. Quinn, P. Graham, J. Krone, M. Caffrey, and S. Rezgui, “Radiation-induced multi-
bit upsets in SRAM-based FPGAs,” Nuclear Science, IEEE Transactions on, vol. 52,
no. 6, pp. 2455–2461, Dec. 2005.
[27] K. Chiba, I. Nashiyama, K. Sugimoto, N. Nemoto, H. Asai, Y. Iide, H. Shindo, N. Ikeda,
S. Kuboyama, and S. Matsuda, “Correlation between proton and heavy-ion SEUs in
commercial memory devices,” in Radiation Effects Data Workshop, 2003. IEEE, Jul.
2003, pp. 127–132.
[28] T. Karnik and P. Hazucha, “Characterization of soft errors caused by single event upsets
in CMOS processes,” Dependable and Secure Computing, IEEE Transactions on, vol. 1,
no. 2, pp. 128–143, Apr. 2004.
[29] R. Katz, K. LaBel, J. Wang, B. Cronquist, R. Koga, S. Penzin, and G. Swift, “Radiation
effects on current field programmable technologies,” Nuclear Science, IEEE Transac-
tions on, vol. 44, no. 6, pp. 1945–1956, Dec. 1997.
[30] S. Rezgui, “Radiation-tolerant ProASIC3 FPGAs radiation effects,” Actel Corporation,
Tech. Rep., Apr. 2010.
[31] M. Wirthlin, N. Rollins, M. Caffrey, and P. Graham, “Hardness by design techniques for
field-programmable gate arrays,” in Proceedings of the 11th Annual NASA Symposium
on VLSI design, Coeur d’Alene, ID, May 2003, pp. WA11.1–WA11.6.
[32] M. Bellato, P. Bernardi, D. Bortolato, A. Candelori, M. Ceschia, A. Paccagnella, M. Re-
baudengo, M. Sonza Reorda, M. Violante, and P. Zambolin, “Evaluating the effects of
SEUs affecting the configuration memory of an SRAM-based FPGA,” in DATE ’04:
Proceedings of the conference on Design, automation and test in Europe. Washington,
DC, USA: IEEE Computer Society, 2004.
[33] C. Carmichael, M. Caffrey, and A. Salazar, “Correcting single-event upsets through
Virtex partial configuration,” Xilinx Corporation, Tech. Rep., Jun. 1, 2000, xAPP216
(v1.0).
[34] M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. LaBel, M. Friendlich, H. Kim,
and A. Phan, “Effectiveness of internal versus external SEU scrubbing mitigation strate-
gies in a Xilinx FPGA: Design, test, and analysis,” Nuclear Science, IEEE Transactions
on, vol. 55, no. 4, pp. 2259–2266, Aug. 2008.
[35] K. Morgan, D. McMurtrey, B. Pratt, and M. Wirthlin, “A comparison of TMR with
alternative fault-tolerant design techniques for FPGAs,” Nuclear Science, IEEE Trans-
actions on, vol. 54, no. 6, pp. 2065–2072, 2007.
[36] Y. Hsu and E. Swartzlander, “Time redundant error correcting adders and multipli-
ers,” in Defect and Fault Tolerance in VLSI Systems. Proceedings of the 1992 IEEE
International Workshop on, 1992, pp. 247–256.
[37] W. Townsend, J. Abraham, and E. Swartzlander, “Quadruple time redundancy adders
[error correcting adder],” in Defect and Fault Tolerance in VLSI Systems, 2003. Pro-
ceedings. 18th IEEE International Symposium on, 2003, pp. 250–256.
[38] F. Lima, L. Carro, and R. Reis, “Designing fault tolerant systems into SRAM-based
FPGAs,” in Design Automation Conference, 2003. Proceedings, Jun. 2003, pp. 650–655.
[39] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms. Wiley-
Interscience, 2005.
[40] R. Rochet, R. Leveugle, and G. Saucier, “Analysis and comparison of fault tolerant FSM
architecture based on SEC codes,” in Defect and Fault Tolerance in VLSI Systems, The
IEEE International Workshop on, Oct. 1993, pp. 9–16.
[41] J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from
unreliable components,” in Automata Studies. Princeton University Press, 1956, pp.
43–98.
[42] C. Carmichael, “Triple module redundancy design techniques for Virtex FPGAs,” Xilinx
Corporation, Tech. Rep., Nov. 1, 2001, xAPP197 (v1.0).
[43] C. Carmichael, E. Fuller, J. Fabula, and F. D. Lima, “Proton testing of SEU mitigation
methods for the Virtex FPGA,” in Proceedings of the IEEE Microelectronics Reliability
and Qualification Workshop, Pasadena, CA, Dec. 2001.
[44] N. Rollins, M. Wirthlin, M. Caffrey, and P. Graham, “Evaluating TMR techniques in the
presence of single event upsets,” in Proceedings Conference on Military and Aerospace
Programmable Logic Devices (MAPLD). Washington, D.C.: NASA Office of Logic
Design, AIAA, Sep. 2003, p. P63.
[45] J. M. Johnson and M. J. Wirthlin, “Voter insertion algorithms for FPGA designs using
triple modular redundancy,” in Proceedings of the 18th Annual ACM/SIGDA Interna-
tional Symposium on Field Programmable Gate Arrays. Monterey, California, USA:
ACM, 2010, pp. 249–258.
[46] A. Reddy and P. Banerjee, “Algorithm-based fault detection for signal processing ap-
plications,” Transactions on Computers, vol. 39, no. 10, pp. 1304–1308, Oct 1990.
[47] B. Shim and N. Shanbhag, “Energy-efficient soft error-tolerant digital signal processing,”
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 14, no. 4, pp.
336–348, 2006.
[48] P. Reyes, P. Reviriego, J. Maestro, and O. Ruano, “A new protection technique for finite
impulse response (FIR) filters in the presence of soft errors,” in Industrial Electronics,
IEEE International Symposium on, 2007, pp. 3328–3333.
[49] N. Shanbhag, K. Soumyanath, and S. Martin, “Reliable low-power design in the presence
of deep submicron noise,” in Low Power Electronics and Design. Proceedings of the 2000
International Symposium on, 2000, pp. 295–302.
[50] R. Hegde and N. Shanbhag, “Soft digital signal processing,” Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, vol. 9, no. 6, pp. 813–823, 2001.
[51] P. Reyes, P. Reviriego, J. Maestro, and O. Ruano, “New protection techniques against
SEUs for moving average filters in a radiation environment,” Nuclear Science, IEEE
Transactions on, vol. 54, no. 4, pp. 957–964, 2007.
[52] O. Ruano, P. Reyes, J. Maestro, L. Sterpone, and P. Reviriego, “An experimental
analysis of SEU sensitiveness on system knowledge-based hardening techniques,” in
Design and Diagnostics of Electronic Circuits and Systems, IEEE, 2007, pp. 1–6.
[53] P. Reviriego, J. Maestro, and O. Ruano, “Efficient protection techniques against SEUs
for adaptive filters: An echo canceller case study,” Nuclear Science, IEEE Transactions
on, vol. 55, no. 3, pp. 1700–1707, 2008.
[54] B. Shim and N. Shanbhag, “Reduced precision redundancy for low-power digital filter-
ing,” in Signals, Systems and Computers, Conference Record of the Thirty-Fifth Asilo-
mar Conference on, vol. 1, 2001, pp. 148–152.
[55] E. Fuller, M. Caffrey, P. Blain, C. Carmichael, N. Khalsa, and A. Salazar, “Radiation
test results of the Virtex FPGA and ZBT SRAM for space based reconfigurable comput-
ing,” in 2nd Annual Conference on Military and Aerospace Programmable Logic Devices
(MAPLD), Sep. 1999.
[56] C. Carmichael and C. W. Tseng, “Correcting single-event upsets in Virtex-4 FPGA
configuration memory,” Xilinx Corporation, Tech. Rep., Oct. 5, 2009, xAPP1088 (v1.0).
[57] K. Chapman, “SEU strategies for Virtex-5 devices,” Xilinx Corporation, Tech. Rep.,
Apr. 1, 2010, xAPP864 (v2.0).
[58] M. Alderighi, F. Casini, S. D’Angelo, M. Mancini, S. Pastore, and G. Sechi, “Evaluation
of single event upset mitigation schemes for SRAM based FPGAs using the FLIPPER
fault injection platform,” in Defect and Fault-Tolerance in VLSI Systems, 22nd IEEE
International Symposium on, Sep. 2007, pp. 105–113.
[59] B. Pratt, M. Wirthlin, M. Caffrey, P. Graham, and K. Morgan, “Noise impact of single-
event upsets on an FPGA-based digital filter,” in Field Programmable Logic and Appli-
cations, International Conference on, 2009, pp. 38–43.
[60] B. Pratt, M. Fuller, M. Rice, and M. Wirthlin, “Reliable communications using FPGAs
in high-radiation environments – Part I: Characterization,” in Communications (ICC),
2010 IEEE International Conference on, Cape Town, South Africa, May 2010.
[61] M. Violante, L. Sterpone, M. Ceschia, D. Bortolato, P. Bernardi, M. Reorda, and
A. Paccagnella, “Simulation-based analysis of SEU effects in SRAM-based FPGAs,”
Nuclear Science, IEEE Transactions on, vol. 51, no. 6, pp. 3354–3359, Dec. 2004.
[62] J. G. Proakis, Digital Communications, 4th ed. New York: McGraw-Hill, 2001.
[63] M. Rice, Digital Communications: A Discrete-Time Approach, 1st ed. New Jersey:
Pearson Prentice Hall, 2009.
[64] M. A. Sullivan, “Reduced precision redundancy applied to arithmetic operations in field
programmable gate arrays for satellite control and sensor systems,” Master’s thesis,
Naval Postgraduate School, Monterey, CA, Dec. 2008.
[65] A. Savich, M. Moussa, and S. Areibi, “The impact of arithmetic representation on
implementing MLP-BP on FPGAs: A study,” Neural Networks, IEEE Transactions on,
vol. 18, no. 1, pp. 240–252, Jan. 2007.
[66] B. Widrow and I. Kollar, Quantization Noise: Roundoff Error in Digital Computa-
tion, Signal Processing, Control, and Communications. Cambridge, UK: Cambridge
University Press, 2008.
[67] G. A. Constantinides and G. J. Woeginger, “The complexity of multiple wordlength
assignment,” Applied Mathematics Letters, vol. 15, no. 2, pp. 137–140, 2002.
[68] M.-A. Cantin, Y. Savaria, and P. Lavoie, “A comparison of automatic word length
optimization procedures,” in Circuits and Systems, IEEE International Symposium on,
vol. 2, 2002, pp. II–612–II–615.
[69] W. Osborne, R. Cheung, J. Coutinho, W. Luk, and O. Mencer, “Automatic accuracy-
guaranteed bit-width optimization for fixed and floating-point systems,” in Field Pro-
grammable Logic and Applications, International Conference on, Aug. 2007, pp. 617–
620.
[70] L. Sterpone, M. Violante, and S. Rezgui, “An analysis based on fault injection of
hardening techniques for SRAM-based FPGAs,” Nuclear Science, IEEE Transactions on, vol. 53,
no. 4, pp. 2054–2059, Aug. 2006.
[71] L. Sterpone and M. Violante, “A new reliability-oriented place and route algorithm for
SRAM-based FPGAs,” Computers, IEEE Transactions on, vol. 55, no. 6, pp. 732–744,
Jun. 2006.
[72] B. Shim, “Error-tolerant digital signal processing,” Ph.D. dissertation, University of
Illinois at Urbana-Champaign, 2005.
[73] B. Pratt, M. Caffrey, P. Graham, E. Johnson, K. Morgan, and M. Wirthlin, “Improving
FPGA design robustness with partial TMR,” in Proceedings of the IRPS Conference,
Mar. 2006.
[74] BYU Configurable Computing Lab. (2009, Sep.) BYU-LANL TMR tool usage
guide, version 0.5.2. Brigham Young University. [Online]. Available: http:
//sourceforge.net/projects/byuediftools/files/
[75] K. Gurzi, “Estimates for best placement of voters in a triplicated logic network,” Elec-
tronic Computers, IEEE Transactions on, vol. EC-14, no. 5, pp. 711–717, Oct. 1965.
[76] F. L. Kastensmidt, L. Sterpone, L. Carro, and M. S. Reorda, “On the optimal design of
triple modular redundancy logic for SRAM-based FPGAs,” in DATE ’05: Proceedings
of the conference on Design, Automation and Test in Europe. Washington, DC, USA:
IEEE Computer Society, 2005, pp. 1290–1295.
[77] B. H. Pratt, M. P. Caffrey, D. Gibelyou, P. S. Graham, K. Morgan, and M. J. Wirthlin,
“TMR with more frequent voting for improved FPGA reliability,” in Proceedings of the
2008 International Conference on Engineering of Reconfigurable Systems & Algorithms,
Las Vegas, Nevada, USA, July 14-17, 2008, T. P. Plaks, Ed., 2008, pp. 153–158.
[78] J. Johnson, “Synchronization voter insertion algorithms for FPGA designs using triple
modular redundancy,” Master’s thesis, Brigham Young University, Electrical and Com-
puter Engineering Department, Mar. 2010.
[79] E. Johnson, M. Caffrey, P. Graham, N. Rollins, and M. Wirthlin, “Accelerator validation
of an FPGA SEU simulator,” Nuclear Science, IEEE Transactions on, vol. 50, no. 6,
pp. 2147–2157, Dec. 2003.
[80] MISSE homepage. National Aeronautics and Space Administration (NASA). [Accessed
30-September-2010]. [Online]. Available: http://misse1.larc.nasa.gov/
[81] Sandia lab news: March 26, 2009. Sandia National Laboratories. [Accessed 30-
September-2010]. [Online]. Available: http://www.sandia.gov/LabNews/100326.html
[82] M. Caffrey, K. Morgan, D. Roussel-Dupre, S. Robinson, A. Nelson, A. Salazar,
M. Wirthlin, W. Howes, and D. Richins, “On-orbit flight results from the reconfigurable
Cibola Flight Experiment Satellite (CFESat),” in Field Programmable Custom Computing
Machines, 17th IEEE Symposium on, 2009, pp. 3–10.
APPENDIX A. FAULT INJECTION EXPERIMENT CONFIGURATION
The fault injection experiments presented in this dissertation were conducted using
an FPGA board designed by SEAKR Engineering for the Xilinx Radiation Test Consortium
(XRTC). The board contains two Xilinx Virtex-II Pro FPGAs (XC2VP70-FF1704-6) and
a daughter card with a Virtex-4 FPGA (XC4VSX55-FF1148-10). The first Virtex-II Pro
FPGA is the ConfigMon (configuration monitor) FPGA, which controls the overall test and
injects faults into the design under test (DUT) FPGA. The second Virtex-II Pro FPGA is
the FuncMon (functional monitor) which generates the inputs that drive the DUT FPGA
and receives and analyzes the DUT FPGA’s outputs. The Virtex-4 FPGA is the DUT FPGA
which contains the design to be tested. Figure A.1 shows a photograph of this board with
the three FPGAs labeled.
Figure A.2 is a simplified diagram showing the function of the ConfigMon FPGA. This
FPGA was designed to interface with a host PC over a USB 2.0 connection. The host PC
instructed this FPGA which bits in the configuration to test and received and recorded the
test results. The fault injection of the DUT FPGA was otherwise completely controlled by
the hardware-based fault injection (HW FI) core over the SelectMAP interface (SMAP I/F).
The ConfigMon’s state machine also controlled the duration of each test by sending
commands to the FuncMon FPGA. The results of the test were then passed from the FuncMon
to the ConfigMon and then on to the host PC to be recorded.
A.1 Sensitivity Experiments
To measure the sensitivity of a design, the fault injection experiments followed the
general flow of [3] as described in Section 2.3.2. In the XRTC board experiments, the circuit
driver and the golden copy of the FPGA design resided on the FuncMon FPGA and the DUT
design resided on the DUT FPGA. The outputs of the two designs were then compared on
the FuncMon FPGA and any sensitive upsets were reported to the ConfigMon FPGA, which
recorded the bit location of the upset and sent it to the host PC.
Figure A.1: A photograph of the fault injection test board (labels: FuncMon, DUT, USB I/F, ConfigMon).
Figure A.2: A block diagram of the ConfigMon FPGA used in the fault injection tests.
Since the fault injection was controlled by a hardware module on the ConfigMon
FPGA, the host PC had little interaction with the XRTC board. This allowed the tests
to complete far faster than any software-controlled fault injection test where each bit is
upset by sending a corrupt portion of the configuration bit file from the host PC. Such a
software setup could test the approximately 22 million bits of the SX-55 FPGA in about 24
hours. This hardware-based approach was able to complete the same test in approximately
25 minutes.
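The per-bit cost implied by these figures can be worked out directly. The following is a minimal sketch that uses only the approximate numbers quoted above (22 million bits, 24 hours, 25 minutes); the variable and function names are illustrative and do not come from the test software.

```python
# Figures quoted in the text; everything else is derived from them.
TOTAL_BITS = 22_000_000        # approximate configuration bits tested on the SX-55
SOFTWARE_SECONDS = 24 * 3600   # about 24 hours for software-controlled injection
HARDWARE_SECONDS = 25 * 60     # about 25 minutes for hardware-controlled injection


def per_bit_microseconds(total_seconds: float, bits: int) -> float:
    """Average wall-clock time spent on one configuration bit, in microseconds."""
    return total_seconds / bits * 1e6


software_us = per_bit_microseconds(SOFTWARE_SECONDS, TOTAL_BITS)
hardware_us = per_bit_microseconds(HARDWARE_SECONDS, TOTAL_BITS)
speedup = SOFTWARE_SECONDS / HARDWARE_SECONDS

print(f"software: {software_us:.0f} us per bit")
print(f"hardware: {hardware_us:.0f} us per bit")
print(f"speedup:  {speedup:.1f}x")
```

Because both durations are approximate, the resulting speedup (a little under 60 times) should be read as an order-of-magnitude estimate.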
The test procedure for FPGA designs with redundancy was slightly modified in order
to accurately locate all of the utilized configuration bits in the design. A sensitivity test
locates all of the configuration bits which affect the output of the design including any
voting circuitry. A test to identify the utilized bits must bypass any voting logic to locate
even those bits whose errors are normally masked.
Figure A.3 illustrates the difference between a sensitivity and utilization test for a
TMR system. For the sensitivity test, the voted output of the DUT design is compared to
the golden design output. For the utilization test, the output of each replica is compared to
the golden output, bypassing the masking logic of the voter.
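The distinction between the two tests can be sketched in a few lines. This is an illustrative model only: it assumes a bitwise 2-of-3 majority voter and checks a single output word, whereas the hardware compares output streams over a whole input sequence. The function names and the example values are hypothetical.

```python
from typing import Sequence


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three integer words."""
    return (a & b) | (a & c) | (b & c)


def is_sensitive(replicas: Sequence[int], golden: int) -> bool:
    """Sensitivity test: only the voted output is compared with the golden output."""
    return majority(*replicas) != golden


def is_utilized(replicas: Sequence[int], golden: int) -> bool:
    """Utilization test: each replica is compared with the golden output,
    bypassing the masking logic of the voter."""
    return any(r != golden for r in replicas)


golden = 0b1011
# Hypothetical fault that corrupts the second replica only.
upset_in_one_replica = (0b1011, 0b0011, 0b1011)

print(is_sensitive(upset_in_one_replica, golden))  # the voter masks the error
print(is_utilized(upset_in_one_replica, golden))   # the per-replica compare exposes it
```

In this model an upset confined to one replica is counted as utilized but not sensitive, which is the case the utilization test is designed to capture.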
Locating all of the utilized bits of a particular design provides an alternate measure
of the hardware cost of a design. The size of an FPGA design is typically reported in the
number of logic elements used by the design. The number of utilized bits includes those used
by these logic elements as well as any memory cells used to configure the routing within the
design.
The BER test results in Chapters 4, 5, and 6 report the classification of all of the
utilized bits rather than the sensitive bits. All of the non-sensitive utilized bits fall within
the Class 1 SEU category, which is evident in the TMR results where virtually all of the
utilized bits are Class 1 SEUs. This provides a more comprehensive comparison between the
redundant designs since the number of configuration bits required to implement the TMR
and RPR designs can be seen.
Figure A.3: Comparison between the (a) sensitivity test architecture and the (b) utilization test architecture.
A.2 Bit Error Rate Experiments
This section describes the hardware used to conduct the fault injection experiments
described in Section 3.4. These experiments record the bit error rate (BER) of every utilized
bit of a given communications system design. Both the FIR filter designs of the feed-forward
demodulator and the full recursive demodulator systems were tested with this hardware
architecture.
Figure A.4 shows a block diagram of the FuncMon and DUT FPGAs for the FIR
filter experiments. The design on the FuncMon FPGA generated a pseudorandom sequence
of data to send over the communications link. The random data was modulated with a
square-root raised-cosine (SRRC) pulse shape [63]. The modulated signal was then added
to the output of a noise generator which could be configured with different levels of white
Gaussian noise.
The DUT FPGA contained the FIR filter design as part of the demodulator circuit
under test. After the DUT processed the noisy modulated signal, the result was passed
back to the FuncMon FPGA. The rest of the demodulator was located on the FuncMon
and finished demodulating the signal. A bit error rate tester (BERT) then compared the
demodulated data to the original pseudorandom sequence and counted any bit errors.
From this count and the test duration, the BER of the system was
calculated and passed back to the ConfigMon FPGA. The ConfigMon sent the configuration
bit identifier and the BER resulting from the injected fault back to the host PC to be
recorded.
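The BERT step can be illustrated with a small software model. This sketch is a simplification and not the hardware design: it assumes binary PAM symbols of ±1, omits the SRRC pulse shaping and the FIR filter entirely, and uses Python's standard pseudorandom generator in place of the hardware sequence generator. All names are hypothetical.

```python
import random


def count_bit_errors(sent, received):
    """Core of a bit error rate tester: count positions where the sequences differ."""
    return sum(s != r for s, r in zip(sent, received))


def run_ber_test(num_bits, noise_sigma, seed=0):
    """Send pseudorandom bits through a noisy binary PAM link and return the BER."""
    rng = random.Random(seed)
    sent = [rng.getrandbits(1) for _ in range(num_bits)]
    # Binary PAM mapping (0 -> -1, 1 -> +1) plus white Gaussian noise.
    noisy = [(2 * b - 1) + rng.gauss(0.0, noise_sigma) for b in sent]
    # Threshold decision at zero recovers the bit.
    received = [1 if x > 0 else 0 for x in noisy]
    return count_bit_errors(sent, received) / num_bits


print(run_ber_test(10_000, noise_sigma=0.0))  # noiseless link
print(run_ber_test(10_000, noise_sigma=1.0))  # heavily corrupted link
```

In the experiments, an injected fault in the DUT plays the role of an additional error source between the noisy signal and the decision, so the measured BER for a given configuration bit can be compared with the fault-free BER at the same SNR.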
For the recursive system of Sections 3.6 and 6.3.4, only a small change was necessary.
For these experiments, the entire demodulator design was located on the DUT FPGA. Only
the BERT block was needed on the FuncMon FPGA for processing the test data.
Using this test architecture, a BER curve was generated for every utilized configu-
ration bit in each design tested. As explained in Section 3.4, each utilized bit was tested
with input SNR values of 2, 4, 6, 8, and 10 dB. Each test with a given SNR was run at a
different test duration, shown in Table A.1, each long enough to obtain an accurate BER
measurement for an ideal binary PAM system. The table also shows the total test duration
for each tested configuration bit.
Figure A.4: A block diagram of the BER fault injection test.
Table A.1: Fault injection run times for each SNR input value.
SNR (dB)   Clock Cycles   Data Symbols (bits)
2                40,000                10,000
4                40,000                10,000
6               400,000               100,000
8             4,000,000             1,000,000
10           40,000,000            10,000,000
Total        44,480,000            11,120,000
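To see why the run lengths grow with SNR, the symbol counts can be compared with the textbook error rate of ideal binary antipodal PAM, Q(sqrt(2 Eb/N0)). The sketch below assumes that the quoted SNR values can be read as Eb/N0, which this appendix does not state, so the numbers it prints are indicative only.

```python
import math

# Data symbols per SNR value, taken from Table A.1.
SYMBOLS_PER_SNR = {2: 10_000, 4: 10_000, 6: 100_000, 8: 1_000_000, 10: 10_000_000}


def ideal_binary_pam_ber(snr_db: float) -> float:
    """Q(sqrt(2*Eb/N0)) for ideal binary antipodal PAM.

    Treating the quoted SNR as Eb/N0 is an assumption of this sketch.
    Uses the identity Q(x) = 0.5 * erfc(x / sqrt(2)).
    """
    ebn0 = 10 ** (snr_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))


for snr_db, symbols in SYMBOLS_PER_SNR.items():
    ber = ideal_binary_pam_ber(snr_db)
    print(f"{snr_db:>2} dB  BER ~ {ber:.2e}  expected errors ~ {ber * symbols:.0f}")
```

Under that assumption, each run is long enough to observe on the order of tens to hundreds of bit errors, which is consistent with the statement that each duration yields an accurate BER measurement for the ideal system.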
The full set of BER fault injection experiments collected and condensed a massive
amount of data. With approximately 3 million utilized bits tested throughout this disserta-
tion, over 33 trillion data bits were sent and processed on the XRTC board to produce the 3
million BER values reported in the tables and figures in this dissertation. This represented
a total of approximately 1,100 hours (about 45 days) of real time on the XRTC board.
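The totals quoted here follow from Table A.1 and can be checked with a few lines of arithmetic. The sketch below uses the table values and the approximate figures from the paragraph above (3 million utilized bits, 1,100 hours); the implied time per configuration bit is a derived average that includes all overhead, not a measured quantity.

```python
# Per-SNR run lengths from Table A.1.
CLOCK_CYCLES = {2: 40_000, 4: 40_000, 6: 400_000, 8: 4_000_000, 10: 40_000_000}
DATA_SYMBOLS = {2: 10_000, 4: 10_000, 6: 100_000, 8: 1_000_000, 10: 10_000_000}

UTILIZED_BITS_TESTED = 3_000_000  # "approximately 3 million" in the text
BOARD_HOURS = 1_100               # "approximately 1,100 hours" in the text

cycles_per_config_bit = sum(CLOCK_CYCLES.values())
symbols_per_config_bit = sum(DATA_SYMBOLS.values())
total_data_bits = symbols_per_config_bit * UTILIZED_BITS_TESTED
seconds_per_config_bit = BOARD_HOURS * 3600 / UTILIZED_BITS_TESTED

print(f"clock cycles per configuration bit: {cycles_per_config_bit:,}")
print(f"data symbols per configuration bit: {symbols_per_config_bit:,}")
print(f"total data bits processed:          {total_data_bits:,}")
print(f"average seconds per configuration bit: {seconds_per_config_bit:.2f}")
```

The product of the per-bit symbol count and the number of utilized bits gives roughly 33 trillion data bits, matching the figure stated in the text.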
APPENDIX B. SAMPLE NOISE DATA
This appendix contains sample data to illustrate some of the possible noise signals
for the designs presented in this dissertation. Section B.1 describes the estimation error,
εe, of several reduced-precision FIR filter designs. Section B.2 presents the probability mass
functions (pmf) of the SEU-induced noise, εu, for several SEUs affecting an FIR filter design.
Section B.3 presents some combined statistical measures of the εu signals for all sensitive
SEUs of the FIR filter design.
Although the data presented in this appendix cannot be assumed to be typical across
all designs, it is valuable as an example of the possible noise data. This data can help give
insight into the issues of mitigating SEU-induced noise and into improving techniques such
as RPR.
B.1 FIR Filter Estimation Error
Figures B.1–B.6 plot the probability mass function (pmf) of the estimation error, εe,
for a range of reduced-precision FIR filter designs. The full-precision filter design is the one
described in Section C.2 and shown in Figure C.1, with B = 15. The input signal had an
SNR of 8 dB in each case. The figures plot the pmf of the difference between the full- and
reduced-precision filter outputs for filters with Br = 2–7.
These figures show that εe for each of these reduced-precision designs has an
approximately Gaussian distribution. This fact is used in Section 6.1.3 to apply Threshold RPR to
this filter with a reasonable error-detection threshold, T*h, for which T*h < εmax. These figures
also illustrate the reduction in the magnitude of the error signal as Br increases, reflecting
the fact that the approximation of the full-precision output improves as Br increases.
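The role of the threshold can be illustrated with a per-sample decision rule based on the glossary definitions of the RPR operating modes. This is a minimal sketch under assumptions: the strict comparison, the function name, and the numeric values are all hypothetical and are not taken from the implementation in Section 6.1.3.

```python
def threshold_rpr_output(fp_out: float, rp_out: float, threshold: float) -> float:
    """Select the RPR system output for one sample.

    If the full-precision (FP) and reduced-precision (RP) outputs disagree by
    more than the threshold, the FP module is presumed upset and the RP output
    is used (reduced-precision mode). Otherwise the FP output is used.
    """
    if abs(fp_out - rp_out) > threshold:
        return rp_out
    return fp_out


rp_out = 0.47     # hypothetical RP output; the true FP value is taken to be 0.50
threshold = 0.05  # hypothetical error-detection threshold

print(threshold_rpr_output(0.50, rp_out, threshold))  # no upset: FP output passes through
print(threshold_rpr_output(2.50, rp_out, threshold))  # large upset: RP output is substituted
print(threshold_rpr_output(0.51, rp_out, threshold))  # small upset: degraded FP output is kept
# With the threshold lowered below the estimation error, a fault-free sample
# is flagged anyway (a false detection).
print(threshold_rpr_output(0.50, rp_out, 0.02))
```

The last call shows the trade-off in choosing T*h < εmax: a lower threshold catches smaller upsets but occasionally substitutes the RP output when no upset exists. An approximately Gaussian εe makes such false detections rare when the threshold sits in the tail of the distribution.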
Figure B.1: Probability mass function (pmf) of the estimation error, εe, of the reduced-precision FIR filter with Br = 2.
Figure B.2: Probability mass function (pmf) of the estimation error, εe, of the reduced-precision FIR filter with Br = 3.
Note that while these error distribution functions appear Gaussian, they do not have
zero mean. This is due to the truncation of the signals associated with the reduced-precision
module. The truncation operation introduces a positive error bias which is reflected in the
non-zero mean of each of the pmf plots.
This truncation bias reveals an opportunity for optimization of this particular reduced-
precision module implementation. The mean value of εe could be subtracted from the output
of the reduced-precision filter, resulting in a lower εmax value. This would allow better detec-
tion of SEU-induced errors and better overall performance, though the cost of the module
would increase slightly with the extra hardware for the subtraction operation.
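The effect of removing the truncation bias can be illustrated with a small sketch. It is not the filter of Section C.2: it assumes a hypothetical 16-bit word that is truncated directly to Br bits, sweeps every possible input value, and measures the error in units of the full-precision LSB. The function name and parameters are illustrative only.

```python
def truncation_errors(br, b=16):
    """Error (in full-precision LSBs) from truncating every b-bit value to br bits.

    Assumption: truncation simply drops the low (b - br) bits of a
    two's-complement word, which rounds toward negative infinity.
    """
    shift = b - br
    lo, hi = -(1 << (b - 1)), 1 << (b - 1)
    # Python's >> on negative integers is an arithmetic (floor) shift.
    return [k - ((k >> shift) << shift) for k in range(lo, hi)]


errors = truncation_errors(br=4)
mean = sum(errors) / len(errors)                       # positive bias
eps_max_raw = max(abs(e) for e in errors)              # worst case without correction
eps_max_centered = max(abs(e - mean) for e in errors)  # worst case after subtracting the mean

print(mean, eps_max_raw, eps_max_centered)
```

Under these assumptions the error is never negative, so its mean is positive, and subtracting that mean roughly halves the worst-case error magnitude. This is the kind of εmax reduction described above.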
B.2 SEU-Induced Noise Probability Mass Functions
Figures B.7–B.10 plot the probability mass function (pmf) of the SEU-induced noise
signals, εu, for several SEUs in an FIR filter design. Each subplot represents a distinct
configuration bit upset. The filter design used is described in [59]. It was a 49-tap FIR
filter using the SRRC pulse shape with excess bandwidth α = 0.5 using Lp = 6. The filter
had a 16-bit input with range [-2,2) and an 18-bit output with range [-8,8). The design
was implemented on a Virtex 1000 FPGA and fault injection was performed using the fault
injection platform described in [79].
Figure B.3: Probability mass function (pmf) of the estimation error, εe, of the reduced-precision FIR filter with Br = 4.
Figure B.4: Probability mass function (pmf) of the estimation error, εe, of the reduced-precision FIR filter with Br = 5.
Figure B.5: Probability mass function (pmf) of the estimation error, εe, of the reduced-precision FIR filter with Br = 6.
Figure B.6: Probability mass function (pmf) of the estimation error, εe, of the reduced-precision FIR filter with Br = 7.
The figures demonstrate a wide range of SEU-induced noise. The probability distri-
bution of several of the noise signals resembles a Gaussian distribution, but most are not
smooth distributions. Most of the noise signals shown have low-magnitude noise compared
to the output of the original filter, which had an average range of [-1.5,1.5] in these tests.
Figure B.7: Sample probability mass functions (pmfs) of the SEU-induced noise signals, εu, for several upsets in an FIR filter design.
Figure B.8: More sample probability mass functions (pmfs) of the SEU-induced noise signals, εu, for several upsets in an FIR filter design.
Figure B.9: More sample probability mass functions (pmfs) of the SEU-induced noise signals, εu, for several upsets in an FIR filter design.
Figure B.10: More sample probability mass functions (pmfs) of the SEU-induced noise signals, εu, for several upsets in an FIR filter design.
B.3 SEU-Induced Noise Statistics
Figures B.11–B.16 show the combined statistics of the SEU-induced noise signals, εu,
for an FIR filter design. The filter design is the same as that in Section B.1. The histograms
in each figure include the statistics of the noise signal, εu, induced by every sensitive bit in
the filter design.
Figure B.11 plots a histogram of the mean value of all εu signals. Figure B.12 shows
more detail of this histogram. The sample mean of the noise signal was calculated:
\[ \mu = \frac{1}{N} \sum_{n=1}^{N} x_n, \tag{B.1} \]
where x_n is the n-th of the N samples of the noise signal.
Notice that the vast majority of SEUs have a mean close to zero. Some SEUs cause higher
means, which is to be expected for upsets that cause a stuck-at fault in a high-order bit of
the output, for example.
Figure B.13 is a histogram of the variance of all εu signals. Figure B.14, again, shows
more detail of this histogram. The variance shown is the sample variance:
\[ \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2. \tag{B.2} \]
The variances of the noise signals are, again, mostly close to zero.
Figure B.15 plots a histogram of the power (mean square) of all εu signals while
Figure B.16 shows more detail. The power of the noise signal is calculated as:
\[ \text{power} = \frac{1}{N} \sum_{n=1}^{N} x_n^2. \tag{B.3} \]
The power of the noise signals, which can be used to calculate the SNR of the SEU-induced
noise, is distributed similarly to the variance. This is not surprising given that the mean of most
of the signals is close to zero.
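The three statistics of Equations (B.1)–(B.3) can be sketched in a few lines of Python. The function name and the sample values below are illustrative, not data from the experiments. The sketch also makes the relationship behind the last observation explicit: power equals variance plus the squared mean, so the two coincide when the mean is near zero.

```python
def noise_stats(x):
    """Return (mean, variance, power) of a noise record, each normalized by 1/N."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    power = sum(v * v for v in x) / n  # mean square
    return mean, variance, power


# Hypothetical noise samples, chosen so the arithmetic is exact in binary floating point.
mean, variance, power = noise_stats([0.0, 0.5, -0.5, 1.0])
print(mean, variance, power)
print(abs(power - (variance + mean ** 2)) < 1e-12)
```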
Figure B.11: Histogram of the mean of the SEU-induced noise signals, εu, for all sensitive SEUs in an FIR filter design.
Figure B.12: Detail of the histogram in Figure B.11.
Figure B.13: Histogram of the variance of the SEU-induced noise signals, εu, for all sensitive SEUs in an FIR filter design.
Figure B.14: Detail of the histogram in Figure B.13.
These statistics are given as an example of the distribution and characteristics of the
SEU-induced noise signals. Without knowing the pmf of each of the εu signals, however, it
is difficult to use this information to improve a mitigation technique such as RPR. If, for
example, the distribution of every εu signal were Gaussian, these statistics could be used to
determine a better value for the error-detection threshold, T∗h. Section B.2 demonstrates,
however, that this is not the case. These plots are provided, therefore, simply as an example
for one sample design.
Figure B.15: Histogram of the power (mean square) of the SEU-induced noise signals, εu, for all sensitive SEUs in an FIR filter design.
Figure B.16: Detail of the histogram in Figure B.15.
APPENDIX C. RPR COMPARISON DESIGNS
This appendix describes the filter designs used in the RPR comparison testing in
Chapter 5. Section 5.3.1 explained that two versions of the FIR filter design were created:
one for Threshold and Bounded RPR and one for RP-TMR. Section C.1 describes the general
architecture of the filter, which is shared by both designs. Section C.2 will describe the
first design and Section C.3 will describe the second. Tables C.1–C.4 show fault injection
test results for each filter design and report on the predicted failure rates in several orbit
environments.
C.1 General Filter Architecture
Both FIR filter designs share the same basic architecture. The architecture is a
standard type I direct form FIR filter made up of registers, adders, and multipliers. The
unmitigated filter uses 16-bit registers, coefficients, and input as a fixed-point number with
range [-1,1) (Q15 format). The output is truncated to a 17-bit fixed-point number with range
[-2,2) (Q1.15 format). The filter is a 25-tap FIR filter with symmetric coefficients, which
allows the filter to be implemented with 13 multipliers. The filter is designed to be used in a
communications receiver as a matched filter using a square-root raised-cosine (SRRC) pulse
shape with excess bandwidth α = 0.5 using Lp = 3. A block diagram of a 7-tap filter of the
same form is shown in Figure C.1.
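The saving from coefficient symmetry can be illustrated with a short behavioral sketch. It is not the hardware description used in this work: the coefficients below are hypothetical integers rather than the SRRC values, and the function names are illustrative. The folded version adds each mirrored pair of delay-line samples before multiplying, so a 25-tap filter needs only 13 multiplications per output.

```python
def fir_direct(coeffs, samples):
    """Reference direct-form FIR: one multiplication per tap."""
    delay = [0] * len(coeffs)
    out = []
    for s in samples:
        delay = [s] + delay[:-1]  # shift the new sample into the delay line
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_folded(coeffs, samples):
    """Symmetric FIR: pre-add mirrored delay taps, then multiply once per pair."""
    n = len(coeffs)
    if list(coeffs) != list(reversed(coeffs)):
        raise ValueError("coefficients must be symmetric")
    half = n // 2
    delay = [0] * n
    out = []
    for s in samples:
        delay = [s] + delay[:-1]
        acc = sum(coeffs[i] * (delay[i] + delay[n - 1 - i]) for i in range(half))
        if n % 2:  # an odd-length filter has one unpaired center tap
            acc += coeffs[half] * delay[half]
        out.append(acc)
    return out


def multipliers_needed(n_taps):
    """Multiplier count for a symmetric filter with n_taps taps."""
    return (n_taps + 1) // 2


coeffs = [min(i, 24 - i) + 1 for i in range(25)]   # hypothetical symmetric taps
samples = [((7 * i) % 11) - 5 for i in range(40)]  # hypothetical input sequence
print(fir_direct(coeffs, samples) == fir_folded(coeffs, samples))
print(multipliers_needed(25), multipliers_needed(7))
```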
C.2 System Generator FIR Filter
This FIR filter design was used throughout Chapters 5 and 6 to compare RPR imple-
mentations. It was used for both the Threshold RPR and Bounded RPR implementations.
Figure C.1: Block diagram of a type I direct form FIR filter with seven taps, optimized for symmetric coefficients.
The filter was designed using Xilinx’s System Generator software, which allowed rapid pro-
totyping and easy alterations. The multipliers were implemented using the Xilinx Constant
Multiplier block with coefficients rounded to 16 bits of precision (Q15 format). In addition
to the tables presented here, Sections B.1 and B.3 in Appendix B include some statistics of
the estimation error, εe, and the SEU-induced noise signals, εu, for this design.
Table C.1: Number of SEUs causing each class of effect for the FIR filter design with α = 0.5.
Design | Slices Used | Class 1 Bits | Class 2 Bits | Class 3 Bits | Class 4 Bits | Total Utilized Bits | Total Cat. Bits
SysGen Filter | 1,030 | 59,156 | 6,472 | 1,501 | 943 | 68,072 | 2,444
VHDL Filter | 2,457 | 112,066 | 14,719 | 6,581 | 1,646 | 135,012 | 8,227
Table C.2: Percentage of SEUs causing certain SNR losses at a BER of 10^-5 for the FIR filter design with α = 0.5.
Design | > 0.2 dB | > 0.5 dB | > 1 dB | > 3 dB | > 6 dB
SysGen Filter | 13.10% | 10.57% | 9.068% | 6.289% | 5.322%
VHDL Filter | 17.00% | 14.28% | 12.86% | 9.69% | 8.140%
Table C.3: Sensitive failure rates (λ) for the FIR filter design in various orbits.
Design | GEO | GPS | Molniya | Polar | LEO
SysGen Filter | 1.04×10^-5 | 9.09×10^-6 | 9.90×10^-6 | 2.40×10^-6 | 6.48×10^-8
VHDL Filter | 2.06×10^-5 | 1.80×10^-5 | 1.96×10^-5 | 4.76×10^-6 | 1.29×10^-7
Table C.4: Catastrophic failure rates (λ) for the FIR filter design in various orbits.
Design | GEO | GPS | Molniya | Polar | LEO
SysGen Filter | 3.73×10^-7 | 3.26×10^-7 | 3.55×10^-7 | 8.63×10^-8 | 2.33×10^-9
VHDL Filter | 1.25×10^-6 | 1.10×10^-6 | 1.20×10^-6 | 2.90×10^-7 | 7.83×10^-9
C.3 VHDL FIR Filter
This FIR filter design was used in Chapter 5 for the RP-TMR implementation. In
order to create a correct RP-TMR implementation, this filter was designed structurally in
VHDL and EDIF (electronic design interchange format). This was done in order to ensure
that the correct low-level components were targeted as described in Section 5.1.3. To help
simplify this task, the multipliers in the RP-TMR filter are two-input multipliers with a
constant as one of the inputs. The custom multipliers created also may not be optimized
for the Xilinx architecture used for our testing. For these reasons, the filter used in the
RP-TMR tests is larger than that used in the Threshold and Bounded RPR tests.
APPENDIX D. RPR DECISION BLOCKS
This appendix contains supplemental material related to the decision blocks used with
RPR. Section D.1 calculates the area cost of the decision blocks associated with each variation
of RPR. Section D.2 estimates the cost of a simple design with different configurations of
the decision blocks. Section D.3 explains some of the difficulties seen in the fault injection
experiments when using triplicated RPR decision blocks.
D.1 Decision Block Area Costs
Section 5.2.1 presents a comparison of the relative costs of the decision blocks for each
variation of RPR. Figure 5.11 plotted the estimated area cost of each decision block as a
function of the reduced-precision bit-width, Br. This section supports the plot by estimating
the cost of each type of RPR decision block as a function of Br.
D.1.1 Threshold RPR Decision Block
This RPR decision block is fairly costly, especially on an FPGA. For comparison with
the other variations of RPR, we estimate the hardware cost of this decision block with an
n-bit full-precision input and two k-bit reduced-precision inputs. In a typical FPGA, most
functions of x input bits have roughly the same cost due to their implementation in
lookup tables (LUTs). This is true for the adder, absolute value (abs), equality, comparison,
and multiplexer (mux) blocks shown in Figure 5.2. If we assign a cost of 1 for each LUT
utilized and each bit in an adder or other module consumes one LUT, the area cost of this
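decision block is roughly:
\[ \begin{aligned}
A_{\text{voter}} &= A_{\text{adder-}n} + A_{\text{abs-}n} + A_{\text{comparison-}n} + A_{\text{mux-}n} + A_{\text{equality-}k} + A_{\text{AND}} \\
&= n + n + n + n + k + 1 \\
&= 4n + k + 1.
\end{aligned} \tag{D.1} \]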
Table E.5 shows some actual implementation costs of this decision block on a Xilinx Virtex-4
FPGA, verifying this estimate.
The Threshold RPR decision block can be optimized under certain conditions. Shim
suggested a modification to the Threshold RPR decision block which is shown in Figure 5.3.
This circuit replaces the three upper modules in Figure 5.2. This optimization assumes that
the value chosen for Th is a power of two. With this assumption, the comparison block can
be implemented using simple AND and OR gates in place of a more complex n + 1 adder
block. The width m of the simplified comparator gates is dependent on the threshold value
chosen.
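The area cost of this decision block is roughly¹:
\[ \begin{aligned}
A_{\text{voter opt}} &= A_{\text{adder-}n} + (A_{\text{AND-}m} + A_{\text{OR-}m} + A_{\text{AND}}) + A_{\text{mux-}n} + A_{\text{equality-}k} + A_{\text{AND}} \\
&= n + \tfrac{m}{3} + \tfrac{m}{3} + 1 + n + k + 1 \\
&= 2n + \tfrac{2}{3}m + k + 2.
\end{aligned} \tag{D.2} \]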
Since the value of m must be smaller than n, this arrangement is less costly than the more
general decision block. Using a mid-range value of Th = 2^-7, the necessary comparator width
is m = 8. Table E.6 confirms that this version is cheaper, at a cost of just over one-half that
of the non-optimized version.
¹This uses an approximate area cost of m/3 for an m-bit logic function using the 4-input LUTs on a typical
FPGA, which is roughly the number of LUTs required.
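The idea behind the power-of-two optimization can be sketched behaviorally. The following is not necessarily the circuit of Figure 5.3; it only illustrates, under the assumption of a signed two's-complement difference, why a power-of-two threshold reduces the magnitude comparison to an AND and an OR over the upper bits. The function name and the integer bit position t are hypothetical.

```python
def within_pow2_threshold(diff, n_bits, t):
    """True when -2**t <= diff < 2**t for a signed, n_bits-wide difference.

    Only the upper (n_bits - t) bits are inspected: the difference is small
    when they are all zero (small positive) or all one (small negative).
    """
    raw = diff & ((1 << n_bits) - 1)        # two's-complement bit pattern
    upper = raw >> t                        # the n_bits - t most significant bits
    all_ones = (1 << (n_bits - t)) - 1
    no_bit_set = upper == 0                 # OR of the upper bits is 0
    every_bit_set = upper == all_ones       # AND of the upper bits is 1
    return no_bit_set or every_bit_set


# Compare against the arithmetic definition for every 8-bit difference.
matches = all(
    within_pow2_threshold(d, 8, 3) == (-8 <= d < 8) for d in range(-128, 128)
)
print(matches)
```

No subtractor-based comparator or absolute-value stage is needed in this form, which is consistent with the reduction in area described above.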
D.1.2 Bounded RPR Decision Block
Using the same assumptions as in Section 5.1.1, the estimated cost of this Bounded
RPR decision block is roughly:
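\[ \begin{aligned}
A_{\text{voter}} &= 2 \cdot A_{\text{comparison-}n} + A_{\text{comparison-}k} + A_{\text{AND}} + A_{\text{NOR}} + A_{\text{adder-}k} + A_{\text{mux-}n} \\
&= 2n + k + 1 + 1 + k + n \\
&= 3n + 2k + 2.
\end{aligned} \tag{D.3} \]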
Table E.7 shows some actual implementation costs of this decision block on a Xilinx Virtex-4
FPGA. The estimate matches the table for larger values of k, but is pessimistic for smaller
values, where the FPGA synthesis software is able to optimize the blocks which compare
n-bit and k-bit signals. Even with the pessimistic estimate, however, the Bounded RPR
decision block cost estimate is lower than the Threshold RPR estimate for nearly all values
of n and k.
D.1.3 RP-TMR Decision Block
The decision blocks for RP-TMR are identical to the majority voters of TMR. These
voters are much smaller than the decision blocks required by Threshold and Bounded RPR.
Each bit of the voter takes three inputs and produces one output. In a typical FPGA
architecture, each of these three-input voters consumes one LUT resource. The cost of an
RP-TMR decision block, then, is roughly:
\[ A_{\text{voter}} = A_{\text{voter-}k} = k. \tag{D.4} \]
Table E.8 shows some actual implementation costs of these voters on a Xilinx Virtex-4 FPGA
which verify this estimate. This cost is obviously much lower than those of the Threshold
and Bounded RPR decision blocks. Assuming k ≈ n/2, the decision blocks for Threshold
and Bounded RPR are roughly 9 and 8 times larger, respectively, than the voters required for RP-TMR.
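The estimates (D.1), (D.3), and (D.4) are simple enough to evaluate directly. The sketch below does so for an illustrative case with k = n/2; the function names are hypothetical, and the costs are the LUT-count estimates of this section rather than synthesis results.

```python
def threshold_rpr_cost(n, k):
    """Estimated LUT cost of the standard Threshold RPR decision block, (D.1)."""
    return 4 * n + k + 1


def bounded_rpr_cost(n, k):
    """Estimated LUT cost of the Bounded RPR decision block, (D.3)."""
    return 3 * n + 2 * k + 2


def rp_tmr_cost(k):
    """Estimated LUT cost of a k-bit majority voter, (D.4)."""
    return k


n, k = 16, 8  # illustrative widths with k = n/2
threshold_ratio = threshold_rpr_cost(n, k) / rp_tmr_cost(k)
bounded_ratio = bounded_rpr_cost(n, k) / rp_tmr_cost(k)
print(threshold_ratio, bounded_ratio)
```

For n = 17 and k = 8, (D.1) gives 77 LUTs, close to the 78 reported in Table E.5.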
D.2 RPR Decision Block Placement
This section estimates the cost of several configurations of a 4-tap FIR filter, each
with a different arrangement of RPR decision blocks. Section 6.3 uses these calculations to
compare the efficiency of the different configurations. The 4-tap FIR filter circuit is shown
in Figure D.1.
For this system, RPR voters could be placed at the output of every multiplier in
addition to a voter at the final output. Assuming each reduced-precision multiplier needs to
be Br bits wide to achieve a specific threshold Th and each of the reduced-precision adders is Br
bits wide as well, the total area cost of the system would be:
\[ A_{\text{filter}} = 4 \cdot (M + 2M_r) + 3 \cdot (A + 2A_r) + 4 \cdot (R + 2R_r) + 5V_r, \tag{D.5} \]
where M and Mr are the area costs of a full-precision and reduced-precision multiplier, A
and Ar are the costs of the adders, R and Rr are the costs of the registers, and Vr is the cost
of a voter.
In order to compare two implementations, we can use the resource utilization tables in
Appendix E. We will assign the costs of these variables the number of LUTs and/or flip-flops
in each module. If we set B = 16 and Br = 8, these values become: M = 127, Mr = 26,
A = 17, Ar = 9, R = 16, Rr = 8, and Vr = 44 (in the best case, using Shim’s optimized,
non-triplicated voters). With these estimates, the total cost of the system is roughly:
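\[ \begin{aligned}
A_{\text{filter}} &= 4 \cdot (127 + 2 \cdot 26) + 3 \cdot (17 + 2 \cdot 9) + 4 \cdot (16 + 2 \cdot 8) + 5 \cdot 44 \\
&= 1{,}169,
\end{aligned} \tag{D.6} \]
or with triplicated voters:
\[ \begin{aligned}
A_{\text{filter}} &= 4 \cdot (127 + 2 \cdot 26) + 3 \cdot (17 + 2 \cdot 9) + 4 \cdot (16 + 2 \cdot 8) + 3 \cdot 5 \cdot 44 \\
&= 1{,}609.
\end{aligned} \tag{D.7} \]
Figure D.1: Block diagram of a 4-tap FIR filter.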
Alternatively, a single voter could be placed at the final output. Assuming the same
threshold is desired at the output of the filter, the bit-widths of the reduced-precision modules
must increase. This increases the cost of the reduced-precision components, but there is a
decrease in the cost of the RPR voters. Assuming the error at the output of each multiplier
adds to create the total error for the filter, that error is four times larger than for a single
multiplier. In order to maintain the same error level and thus the same threshold, m =
log2(4) = 2 extra bits of precision must be added to the reduced-precision modules:
\[ A_{\text{filter}} = 4 \cdot (M + 2M_{r+2}) + 3 \cdot (A + 2A_{r+2}) + 4 \cdot (R + 2R_{r+2}) + V_{r+2}. \tag{D.8} \]
The total cost of this version of the system is:
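\[ \begin{aligned}
A_{\text{filter}} &= 4 \cdot (127 + 2 \cdot 43) + 3 \cdot (17 + 2 \cdot 11) + 4 \cdot (16 + 2 \cdot 10) + 44 \\
&= 1{,}157,
\end{aligned} \tag{D.9} \]
or with triplicated voters:
\[ \begin{aligned}
A_{\text{filter}} &= 4 \cdot (127 + 2 \cdot 43) + 3 \cdot (17 + 2 \cdot 11) + 4 \cdot (16 + 2 \cdot 10) + 3 \cdot 44 \\
&= 1{,}245.
\end{aligned} \tag{D.10} \]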
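The four totals above can be reproduced with a short script, which also makes it easy to try other voter placements. The function and dictionary names are hypothetical; the component costs are the values quoted in this section for B = 16, with Br = 8 for the per-multiplier arrangement and Br = 10 for the single-voter arrangement.

```python
def filter_area(mult, mult_rp, add, add_rp, reg, reg_rp, voter, n_voters,
                triplicated=False):
    """Area of the 4-tap RPR filter: 4 multipliers, 3 adders, 4 registers, plus voters.

    Each full-precision module is accompanied by two reduced-precision copies.
    """
    voter_area = n_voters * voter * (3 if triplicated else 1)
    return (4 * (mult + 2 * mult_rp)
            + 3 * (add + 2 * add_rp)
            + 4 * (reg + 2 * reg_rp)
            + voter_area)


full = dict(mult=127, add=17, reg=16)
# Voters after every multiplier and at the output (Br = 8).
per_multiplier = dict(full, mult_rp=26, add_rp=9, reg_rp=8, voter=44, n_voters=5)
# A single voter at the output (Br = 10).
single_output = dict(full, mult_rp=43, add_rp=11, reg_rp=10, voter=44, n_voters=1)

print(filter_area(**per_multiplier), filter_area(**per_multiplier, triplicated=True))
print(filter_area(**single_output), filter_area(**single_output, triplicated=True))
```

With these numbers the two arrangements are nearly equal in cost when the voters are not triplicated, while triplication penalizes the five-voter arrangement much more heavily.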
D.3 Triplicated Decision Blocks
In this dissertation, all of the fault injection tests involving RPR were conducted using
triplicated RPR decision blocks. Because the RPR decision block is implemented using the
same SEU-sensitive logic as the rest of the FPGA design, it is reasonable to protect it
somehow. The most straightforward protection method is to use TMR to eliminate all SEU
sensitivity.
In the fault injection tests, however, it became clear that the RPR decision blocks
were not fully protected by TMR. Some SEU sensitivity remained in each RPR experiment,
to varying degrees. Although the RPR decision block was triplicated in each case (essentially,
TMR was applied to that module), some SEUs caused upsets in more than one of the replicas
and overcame the TMR protection. This can be seen in the results of Section 4.7, where
RPR left many catastrophic SEUs.
TMR has been shown to be imperfect in FPGAs in some instances, where a single
configuration bit affects signal routing in two of the three TMR domains [70]. Events in which
TMR is overcome by a single upset are called cross-domain errors or TMR failures. This
problem has also been shown to be correctable, to a large extent, using reliability-oriented
routing techniques [71].
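A bitwise majority voter makes the cross-domain failure mode easy to see. The sketch below is a behavioral illustration only, with hypothetical values: a fault confined to one TMR domain is outvoted, while the same fault appearing in two domains (as when a single configuration bit affects routing in both) wins the vote.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b10110010
faulty = correct ^ 0b00001000  # hypothetical upset flipping one output bit

single_domain = majority_vote(faulty, correct, correct)  # fault in one replica
cross_domain = majority_vote(faulty, faulty, correct)    # same fault in two replicas

print(single_domain == correct)  # the fault is masked
print(cross_domain == correct)   # the fault propagates
```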
These specialized routing techniques were not available at the time these experiments
were run. Therefore, the configuration bits of the triplicated RPR decision blocks were
ignored in the BER test results presented in Chapter 5 and in Sections 6.1 and 6.2. Instead,
all of the decision block configuration bits were classified as Class 1 SEUs. This made the
comparisons of the performance of different RPR implementations more accurate, especially
when comparing the different bit-widths.
Section 6.3.4 presented fault injection results on the recursive receiver design. In the
RPR implementation, the RPR decision block was located within the feedback loop. This
made it difficult to separate the configuration bits of the triplicated decision block from the
rest of the design. The results presented in that section include any TMR failures within
the RPR decision block.
APPENDIX E. COMPONENT UTILIZATION TABLES
This appendix consists of tables which report on the FPGA resource utilization of
several modules referred to throughout this dissertation. These values are used to show the
area cost of the different types of RPR as well as TMR. Sections 5.1.1, 5.1.2, and 5.1.3 refer
to these tables to confirm area cost estimates for their respective decision blocks in order to
compare the three variations.
Table E.1: Resource utilization for two-input adder modules with a range of input bit-widths.
Input Width | 4-input LUTs | Slices
16 | 17 | 9
15 | 16 | 8
14 | 15 | 8
13 | 14 | 7
12 | 13 | 7
11 | 12 | 6
10 | 11 | 6
9 | 10 | 5
8 | 9 | 5
7 | 8 | 4
6 | 7 | 4
5 | 6 | 3
4 | 5 | 3
3 | 4 | 2
2 | 3 | 2
1 | 2 | 1
To generate these tables, the Xilinx ISE Design Suite 10.1 software was used. The
modules were created using Xilinx System Generator and synthesized using Xilinx XST. For
all tables, the target device was a Xilinx Virtex-4 SX-55 FPGA.
The input width values shown are the full bit-width, including the sign bit. All input
signals are fixed-point signals in the range [-1,1) with the exception of the RPR decision
block inputs, which include an extra bit to the left of the binary point for a range of [-2,2).
For the optimized Threshold RPR decision block, Th was always set to 2^-7, giving an adder
output width of m = 8 (a width that would normally depend on the chosen threshold).
Table E.2: Resource utilization for single-input (constant coefficient) multiplier modules with a range of input bit-widths.
Input Width | 4-input LUTs | LUTs used as route-through | Slices
16 | 119 | 8 | 61
15 | 99 | 7 | 52
14 | 91 | 6 | 47
13 | 72 | 5 | 38
12 | 61 | 1 | 32
11 | 47 | 1 | 25
10 | 42 | 1 | 22
9 | 37 | 7 | 20
8 | 25 | 1 | 13
7 | 18 | 1 | 10
6 | 15 | 1 | 8
5 | 9 | 1 | 5
4 | 5 | 0 | 3
3 | 3 | 0 | 2
2 | 0 | 0 | 0
1 | 0 | 0 | 0
Table E.3: Resource utilization for two-input multiplier modules with a range of input bit-widths.
Input Width | 4-input LUTs | Slices
16 | 281 | 141
15 | 251 | 132
14 | 215 | 108
13 | 189 | 100
12 | 159 | 80
11 | 137 | 73
10 | 111 | 56
9 | 93 | 50
8 | 72 | 36
7 | 58 | 32
6 | 40 | 20
5 | 30 | 17
4 | 18 | 9
3 | 16 | 9
2 | 4 | 2
1 | 1 | 1
Table E.4: Resource utilization for FIR filter modules with a range of input bit-widths.
RP Input Width | 4-input LUTs | LUTs used as route-through | Slices | Flip-Flops
16 | 1,588 | 125 | 1,019 | 384
15 | 1,363 | 63 | 896 | 360
14 | 1,109 | 55 | 750 | 336
13 | 920 | 35 | 641 | 312
12 | 810 | 25 | 573 | 288
11 | 589 | 5 | 448 | 264
10 | 515 | 5 | 396 | 240
9 | 435 | 6 | 342 | 207
8 | 383 | 16 | 302 | 184
7 | 256 | 4 | 227 | 161
6 | 179 | 3 | 161 | 120
5 | 136 | 2 | 124 | 90
4 | 85 | 0 | 84 | 68
3 | 56 | 0 | 56 | 42
2 | 42 | 0 | 42 | 28
1 | 24 | 0 | 24 | 12
Table E.5: Resource utilization for standard Threshold RPR decision modules with 17-bit full-precision input and a range of reduced-precision input bit-widths.
RP Input Width | 4-input LUTs | LUTs used as route-through | Slices
17 | 82 | 1 | 42
16 | 81 | 1 | 41
15 | 81 | 1 | 41
14 | 80 | 1 | 41
13 | 80 | 1 | 41
12 | 79 | 1 | 40
11 | 79 | 1 | 40
10 | 78 | 1 | 40
9 | 78 | 1 | 40
8 | 78 | 1 | 40
7 | 77 | 1 | 40
6 | 76 | 1 | 39
5 | 76 | 1 | 39
4 | 76 | 1 | 39
3 | 74 | 1 | 38
2 | 74 | 1 | 38
Table E.6: Resource utilization for Shim’s optimized Threshold RPR modules with 17-bit full-precision input and a range of reduced-precision input bit-widths.
RP Input Width | 4-input LUTs | Slices
17 | 48 | 25
16 | 47 | 24
15 | 47 | 24
14 | 46 | 25
13 | 46 | 24
12 | 45 | 24
11 | 45 | 23
10 | 44 | 23
9 | 44 | 23
8 | 44 | 23
7 | 43 | 22
6 | 43 | 22
5 | 42 | 22
4 | 42 | 21
3 | 41 | 21
2 | 40 | 21
Table E.7: Resource utilization for Bounded RPR decision modules with 17-bit full-precision input and a range of reduced-precision input bit-widths.
RP Input Width | 4-input LUTs | Slices
17 | 90 | 47
16 | 86 | 44
15 | 82 | 43
14 | 78 | 40
13 | 75 | 40
12 | 71 | 36
11 | 67 | 35
10 | 63 | 32
9 | 60 | 31
8 | 56 | 29
7 | 52 | 27
6 | 48 | 25
5 | 43 | 22
4 | 39 | 20
3 | 35 | 18
2 | 30 | 15
Table E.8: Resource utilization for TMR voter modules with a range of input bit-widths.
Input Width | 4-input LUTs | Slices
16 | 16 | 8
15 | 15 | 8
14 | 14 | 7
13 | 13 | 7
12 | 12 | 6
11 | 11 | 6
10 | 10 | 5
9 | 9 | 5
8 | 8 | 4
7 | 7 | 4
6 | 6 | 3
5 | 5 | 3
4 | 4 | 2
3 | 3 | 2
2 | 2 | 1
1 | 1 | 1
APPENDIX F. ON-ORBIT EXPERIMENTS
The goal of this dissertation was to analyze the effects of SEUs on FPGAs in radi-
ation environments and to present how to apply a mitigation technique with a lower cost
than the traditional TMR. Space was the primary radiation environment focused on in this
dissertation. As such, a pair of on-orbit experiments have been developed to further validate
the results presented in this dissertation.
Through collaborations with Sandia National Laboratory and Los Alamos National
Laboratory, we have gained access to two separate space-based FPGA platforms. These
platforms include an experimental payload to be placed on the International Space Station
(ISS) and an experimental satellite in low Earth orbit (LEO). This appendix describes these
platforms and the experiments developed to run on them.
F.1 MISSE-8 Experiment
The Materials International Space Station Experiment (MISSE) is a series of exper-
iments focused on testing the durability of various materials in the space environment [80].
Each MISSE payload has been mounted externally on the ISS for full exposure to the en-
vironment in low Earth orbit, where the space station resides. The 8th experiment in the
series, MISSE-8, includes the second SEU Xilinx-Sandia Experiment (SEUXSE II). This
experiment contains a Virtex-4 and a space-qualified Virtex-5 FPGA from Xilinx. SEUXSE
II is intended to allow researchers to analyze the effects of the harsh environment of space
on these FPGAs [81].
In collaboration with Sandia National Laboratory, we have developed a design to
be run on these FPGAs. This experiment was designed to verify the results presented in
Chapters 3 and 4. Figure F.1 shows a block diagram of the experiment. The Virtex-5 FPGA
is a radiation tolerant version and handles the data generation and analysis. The Virtex-4
FPGA is the design under test (DUT) chip and contains the receiver circuits being tested.
The DUT FPGA contains three unmitigated FIR filters along with one filter protected
with RPR. These were 25-tap FIR filters with symmetric coefficients using a square-root
raised-cosine (SRRC) pulse shape with excess bandwidth α = 0.5 using Lp = 3 and operating
at N = 4 samples/bit. The filters implemented were:
1. 16-bit logic-based FIR filter
2. 8-bit logic-based FIR filter
3. 16-bit DSP48-based FIR filter
4. 16-bit logic-based FIR filter, protected using Threshold RPR with Br = 7
The experiment was designed similarly to the fault injection experiment described
in Section 3.4. Each filter is fed a modulated, noisy signal. At the output of each filter, a
downsample and decision block complete the simple demodulator. The four demodulated
data streams are then passed through a bit error rate test (BERT) block to measure any
differences between the sent and received data streams. The control state machine watches
for any bit error rates above a pre-chosen threshold and records any such events.
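The role of the BERT block and the control state machine can be sketched in software. This is only a behavioral illustration: the window length, the threshold, and the bit sequences are hypothetical, and the function names do not correspond to modules in the flight design.

```python
def bit_error_rate(sent, received):
    """Fraction of positions where the received bit differs from the sent bit."""
    if len(sent) != len(received) or not sent:
        raise ValueError("sequences must be non-empty and of equal length")
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return errors / len(sent)


def high_ber_events(sent, received, window, threshold):
    """Return (start index, BER) for each window whose BER exceeds the threshold."""
    events = []
    for start in range(0, len(sent) - window + 1, window):
        ber = bit_error_rate(sent[start:start + window],
                             received[start:start + window])
        if ber > threshold:
            events.append((start, ber))
    return events


sent = [0, 1, 1, 0, 1, 0, 0, 1]
received = [0, 1, 1, 0, 0, 1, 0, 1]  # two bit errors in the second half
events = high_ber_events(sent, received, window=4, threshold=0.25)
print(bit_error_rate(sent, received), events)
```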
This experiment is anticipated to confirm that very few SEUs will impact the bit
error rate of these demodulator systems. If enough data is collected, we expect that the
RPR filter system will perform better on average than the other implementations. Further,
we expect the relative frequency of high-BER events between the different filters to match
those reported in Sections 3.5.2 and 4.7.3.
As of this writing, there are no on-orbit results to report from this experiment.
MISSE-8 is currently scheduled to be delivered to ISS aboard the STS-134 Space Shuttle
mission in 2011.
Figure F.1: Block diagram of the experiment designed for the MISSE-8 experiment on the International Space Station.
F.2 CFE Experiment
The Cibola Flight Experiment Satellite (CFESat) was created by Los Alamos Na-
tional Laboratory to test the suitability of SRAM-based FPGAs for on-orbit processing.
The satellite launched in March 2007 and operates in a 560 km low Earth orbit. The
satellite receives approximately 2.4 SEUs per day. The processing payload of the satel-
lite includes three reconfigurable computing processor boards, each with three Virtex 1000
radiation-hardened XQVR1000 CG560 FPGAs [82].
Figure F.2 shows a block diagram of the experiment design for the CFE satellite. This
entire design is implemented on a single FPGA and the design is replicated across several of
the devices to increase the effective testing time. This experiment includes a data generator
similar to the MISSE-8 experiment, but no noise generator is included. The test architecture
allows for several demodulators with filters protected with RPR and a “golden” demodulator
protected with TMR.
During operation, the output of each RPR filter is compared against that of the
TMR filter. If any difference is detected by the control state machine, the output of the
faulty demodulator is selected with the mux. At that point, the squared difference between
the two filters is accumulated over a set number of clock cycles, N , which is later divided
by N to calculate the noise power between the two filters. This power measurement is then
recorded along with the bit error rate measured during this time period.
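The noise power measurement described above amounts to a running sum of squared differences followed by a single division. A minimal sketch, with hypothetical sample values and an illustrative function name:

```python
def noise_power(rpr_out, tmr_out, n_cycles):
    """Mean squared difference between two filter outputs over n_cycles samples.

    The squared difference is accumulated sample by sample, as an on-chip
    accumulator would do, and divided by n_cycles afterwards.
    """
    if n_cycles <= 0 or n_cycles > min(len(rpr_out), len(tmr_out)):
        raise ValueError("n_cycles must fit within both records")
    acc = 0
    for i in range(n_cycles):
        diff = rpr_out[i] - tmr_out[i]
        acc += diff * diff
    return acc / n_cycles


rpr_out = [1, 2, 3, 4]  # hypothetical output of a faulty RPR filter
tmr_out = [1, 2, 5, 4]  # hypothetical output of the TMR reference
print(noise_power(rpr_out, tmr_out, n_cycles=4))
```

Deferring the division keeps the per-cycle work to a subtraction, a multiplication, and an addition.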
Figure F.2: Block diagram of the experiment designed for the Cibola Flight Experiment satellite.
The initial results of the CFE RPR experiment are summarized in Table F.1. This
experiment is relatively power-intensive compared to others sharing operation time on CFE
and thus is not scheduled to run often. At the last report, the RPR test has operated for
49 FPGA device days, during which 140 configuration SEUs were detected. One of these
SEUs caused an error to propagate to the output of one of the RPR filters. This SEU did
not trigger the reduced-precision mode nor did it cause any bit errors in the binary PAM
receiver. Since the upset was seen at the output of the RPR voter and did not trigger the
reduced-precision mode, the event was an undetected upset (UU) event.
Table F.1: Results of the CFE RPR Test
FPGA Operation Device Days | Config SEUs | Total Events | Events With No Bit Errors | Events With Only TMR Bit Errors | Events With Only RPR Bit Errors | Events With TMR & RPR Bit Errors
49.0 | 140 | 1 | 1 | 0 | 0 | 0
It is anticipated that any updates to the CFE experimental data will be posted
on ScholarsArchive, BYU’s institutional repository for the scholarly and creative content
produced by the University. ScholarsArchive may be accessed online at:
http://lib.byu.edu/sites/scholarsarchive/.