
Applied Soft Computing 11 (2011) 1588–1599

Contents lists available at ScienceDirect

Applied Soft Computing

journal homepage: www.elsevier.com/locate/asoc

Autonomic fault-handling and refurbishment using throughput-driven assessment

Ronald F. DeMara, Kening Zhang, Carthik A. Sharma ∗

School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816-2450, United States

Article info

Article history: Received 10 October 2007; Received in revised form 8 March 2009; Accepted 17 January 2010; Available online 25 January 2010.

Keywords: Evolvable hardware; Genetic Algorithms; Populational fault tolerance; Reconfigurable computing; Evolutionary algorithms; Competitive Runtime Reconfiguration

Abstract

An evolvable hardware paradigm for autonomic regeneration called Competitive Runtime Reconfiguration (CRR) is developed whereby an individual's performance is assessed using the dynamic properties of the population rather than a static fitness function. CRR employs a Sliding Evaluation Window of recent throughput data and a periodically updated Outlier Threshold which avoids the extensive downtime associated with exhaustive Genetic Algorithm (GA) based evaluation. The relative fitness measure favors graceful degradation by leveraging the behavioral diversity among the individuals in the population. Throughput-driven assessment identifies configurations whose discrepancy values violate the Outlier Threshold and are thus selected for modification using Genetic Operators. Application of CRR to FPGA-based logic circuits demonstrates the identification of configurations impacted by a set of randomly injected stuck-at faults. Furthermore, regeneration of functionality can be observed within a few hundred repair iterations. The viable throughput of the CRR system during the repair process was maintained at greater than 91.7% of the fault-free throughput rate under a number of circuit scenarios. CRR results are also compared with alternative soft computing approaches for autonomous refurbishment using the MCNC-91 benchmarks.

1. Introduction

Evolvable hardware (EH) mechanisms for self-repair seek to actively restore lost functionality of digital logic circuits realized in reprogrammable logic devices such as Field Programmable Gate Arrays (FPGAs). EH techniques aim to provide fault-recovery capability without incurring the increased weight, size, and cost penalties of redundant spares. Hence, recent research has explored the feasibility of using GA techniques to increase FPGA reliability and autonomy [6,14,2,5,19,24,32]. In particular, a detailed overview of several dozen works which define the field of autonomous regeneration and categorize the possible approaches is presented in Ref. [21]. A key feature of the existing evolutionary techniques for this problem is their reliance upon either exhaustive functional fitness evaluation or exhaustive resource testing during regeneration.

Alternatively, using the Competitive Runtime Reconfiguration (CRR) approach developed herein, an initial population of functionally identical (same input-output behavior), yet physically distinct (alternative design or place-and-route realization) FPGA configurations is produced at design-time. At runtime, these individuals compete for selection favoring fault-free behavior. Discrepant behavior, whereby the outputs of two competing individuals do not agree on a bit-by-bit basis, is used as a metric during the fitness evaluation process. Over a period of time, as the result of successive pair-wise comparisons, performance capabilities are identified for the entire population regarding the fitness of individuals relative to one another. This fitness information can then be leveraged to select underperforming configurations to receive genetic modification such as crossover or mutation.

∗ Corresponding author. Tel.: +1 407 421 3062; fax: +1 407 823 5835. E-mail addresses: [email protected] (R.F. DeMara), [email protected] (C.A. Sharma).

1568-4946/$ – see front matter. Published by Elsevier B.V. doi:10.1016/j.asoc.2010.01.020

Conventional GA techniques employ a population-based optimization approach to the regeneration problem with the objective of producing a single best-fit configuration as the final outcome. They utilize a fitness function for each regenerated circuit which is evaluated exhaustively for all possible input values. However, given that partially complete repairs are often the best attainable with a tractably sized search [14,27], there is no guarantee that the individual with the best absolute fitness measure for an exhaustive set of test inputs will correspond to the individual that has the superior performance under a particular subset of inputs actually applied. Thus, exhaustive evaluation of regenerated alternatives is computationally expensive, yet not necessarily indicative of the optimal performing individual within a population of partially correct repairs. In this paper, we base fitness instead on the actual throughput data of the alternatives when they are placed into service. Hence, two innovations are developed in the CRR approach to facilitate self-adaptive EH regeneration:

(1) relaxation of the requirement for exhaustive assessment vectors; and
(2) fitness evaluation based on outlier identification over time.

These characteristics are used to assess the evaluation overhead and the repair capabilities, respectively, of the CRR approach defined herein.

2. Related work

The existing schemes for EH repair can be broadly classified in terms of their refurbishment strategy, resource coverage, and the granularity at which faults are isolated.

2.1. Offline refurbishment with exhaustive functional testing

Several approaches to GA-based fault-handling in FPGAs utilize exhaustive testing for fault isolation and offline regeneration mechanisms. As listed in Table 1, conventional Triple Modular Redundancy (TMR) [15], Vigander's approach [27] which combines TMR with GAs, and other n-plex spatial voting techniques [17] deliver real-time fault-handling, but increase power consumption n-fold during fault-free operation. These techniques also require exhaustive evaluation of the function being refurbished. On the other hand, STARS [1] is an example of exhaustive evaluation via a resource-oriented diagnostic that performs Built-in Self-Tests (BISTs) on sub-sections of the FPGA. Under this paradigm, the test area moves across all FPGA resources. Portions of the FPGA are continually taken offline in succession for testing while the functionality is moved to a new location within the reprogrammable fabric. One limitation is that detection latency can be large since tests must sweep through all intervening resources before a fault is detected. Potential throughput unavailability due to diagnostic reconfigurations when no faults have yet occurred is also a consideration.

As listed in Table 1, methods proposed by Lohn [14], Larchev [11], and Lach [10] either rely on offline regeneration supported by exhaustive functional testing, or on pre-determined spares defined at design-time. Other soft computing-based approaches to fault tolerance include Jiggling [4,5], which combines TMR with a (1 + 1) GA, genetic-learning based models [24], an Asexual GA for FPGA self-repair [22], and bio-inspired systems for evolving dependability [26] which apply immune-based approaches on circuit case studies similar to those herein. In addition, Liu et al. [13] present another biologically inspired model for transient faults.

Table 1
Characteristics of related FPGA fault-handling schemes.

Approach | Fault-handling method | Latency | Fault detection: distinguishes transients | Coverage: logic / interconnect / comparator | Fault isolation granularity
TMR | Spatial voting | Negligible | No | Yes / Yes / No | Voting element
[27] | Spatial voting and offline evolutionary regeneration | Negligible | No | Yes / No / No | Voting element
[14] | Offline evolutionary regeneration | Negligible | No | Yes / Yes / No | Unnecessary
[10] | Static-capability tile reconfiguration | Relies on independent fault detection mechanism | – | – | –
STARS [1] | Online BIST | Up to 8.5M erroneous outputs | Test pattern transients | Yes / Yes / No | LUT function
[6] | Population-based fault-insensitive design | Design-time prevention emphasis | No | Yes / Yes / No | Not addressed at runtime
CRR (proposed herein) | Competitive runtime-input fitness evaluation and evolutionary regeneration | Negligible | Transients are attenuated automatically | Yes / Yes / Yes | Unnecessary, but can isolate functional components

Of the methods in Table 1, only Keymeulen et al. [6] investigate the possibility of using a population-based approach to desensitize circuits to faults. They exploit evolutionary techniques at design-time so that a circuit is more likely to be designed to remain functional even in the presence of various a-priori envisioned faults. Their population-based fault tolerant design method evolves diverse circuits and then selects the single most fault-insensitive individual. While their population-based fault tolerance approach provides passive runtime fault tolerance, the approach developed herein adapts dynamically to environmental demands through intrinsic evaluation of a fitness consensus from the entire population at runtime.

2.2. Forming a robust consensus using diversity

To form a robust consensus, Layzell and Thompson [12] dealt with fitness evaluation by exploiting populational fault tolerance (PFT). Under a PFT strategy, the creation of the best-fit individual proceeds incrementally by incorporating additional elements into partially correct prototypes as they adapt to faults. They speculate that evaluation becomes focused on the precise regions of relevance within the search space during the execution of online processes. This provides motivation to explore CRR's first goal mentioned in Section 1 of relaxing exhaustive input evaluation, especially before returning a partially repaired configuration back to service.

Yao et al. [30] further emphasize that in evolutionary systems the population as a whole contains more robust information than any one individual alone. They demonstrate the utility of information contained within the population using case studies from the domains of artificial neural networks and rule-based systems. In both cases, the final collection of individuals outperforms any single individual. Yao and Liu [31] further extend this concept by presenting four methods for combining the different individuals in the final population to generate system outputs. While the authors devise a method to utilize the information contained in the population to improve the final solution, they did not attempt to use the information in the population to improve the optimization process itself. More recently in Ref. [29] the authors describe using fitness sharing and negative correlation to create a diverse population of solutions. A combined solution is then obtained using a gating algorithm that ensures the best response to the observed stimuli. The authors claim that applying the described techniques to evolvable hardware applications should be straightforward, but do not provide examples. They state the absence of an optimal way of predicting future performance of evolved circuits in unforeseen environments as an impediment.

Fig. 1. Physical arrangement with two competing configurations.

Problems related to fault tolerance in online evolution identified by the existing approaches can be addressed by a new Consensus-Based Evaluation scheme [32]. With relative fitness measures based on competition, a consensus is produced regarding the fitness of individuals in response to the actual environmental stimuli with hardware in-the-loop. This intrinsic evolutionary process can adapt to runtime requirements and improve fault-handling capability. The approach utilizes a temporal, rather than spatial, voting scheme whereby the outputs of two competing instances are compared at any instant and alternative pairings are considered over time. The presence or absence of a discrepancy is used to adjust the discrepancy values (DVs) of both individuals without rendering any judgment at that instant on which individual of the pair is actually faulty. The faulty, or later exonerated, configuration is determined over time through other pairings of competing configurations.

These concepts are expanded here by evaluating the real-time performance of individuals in comparison to others in the population. Instead of using an absolute fitness function with concomitant exhaustive testing, relative discrepancy values are used as the threshold to identify faulty individuals. The system favors selection of individuals which intrinsically perform well in the current environment as dictated by the resource failure status and recent throughput data for the circuit itself. Each individual competes for selection based on its performance over an observation window as described in the next section.

Fig. 2. Procedural flow in the CRR technique.

3. Competitive Runtime Reconfiguration (CRR) paradigm

3.1. Detecting faults using a population of alternatives

In the CRR paradigm, each individual in the population has an instance of one of the two complementary logic circuits which form a discrepancy detection arrangement [3]. For purposes of illustration in Fig. 1, assume two competing half-configurations labeled Functional Logic Left ("L") and Functional Logic Right ("R") are loaded in tandem on the physical FPGA platform. The half-configurations occupy mutually exclusive physical resources to implement identical functionality. Like a duplex version of TMR, the L and R functional logic elements are identical, but their outputs are enabled if and only if the discrepancy detector in the complementary half finds the outputs to be fault-free. This supports the possibility of a single fault in either half of the functional logic, or in the discrepancy detector logic of either half-configuration.

This arrangement realizes a conventional Concurrent Error Detection (CED) capability to identify at least any single resource fault with certainty [18]. As in traditional CED approaches, comparison of the outputs of the two resident half-configurations will produce either discrepant or matching outputs which indicate the presence or absence of faulty resources in the FPGA hardware platform, respectively.

Whenever two loaded half-configurations disagree on the computed value of throughput data, the discrepancy value (DV) of each half-configuration is incremented. By repeated pairing over a period of time, only those half-configurations which do not use faulty resources will eventually become preferred, as evidenced by their reduced DV. This is because the DV of a faulty half-configuration is always increased regardless of its pairing, yet the DV is never increased whenever fault-free half-configurations are paired together. This pairing process occurs as part of the normal throughput processing of the FPGA, without additional test vectors or other resource diagnostics, by tracking the Health State of each half-configuration.

Fig. 3. Selection and detection in the CRR technique.
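The pairwise scoring above can be sketched in a few lines of Python. This is an illustrative model only, not the authors' C++ simulator: the toy multiplier population, the injected stuck-at fault, and the pairing counts are all hypothetical.

```python
import random

def run_pairing(population, dv, i, j, inputs):
    """Compare two competing configurations on the same runtime inputs.
    Any disagreement increments the DV of BOTH individuals, deferring
    judgment on which of the pair is actually faulty."""
    for x in inputs:
        if population[i](x) != population[j](x):
            dv[i] += 1
            dv[j] += 1

# Toy population of 3-bit x 3-bit multipliers; one has a stuck-at fault.
good = lambda x: (x >> 3) * (x & 7)
faulty = lambda x: (x >> 3) * (x & 6)      # low operand bit stuck-at-0
population = [good, good, good, faulty]

dv = [0] * len(population)
rng = random.Random(42)
for _ in range(200):                        # repeated random pairings
    i, j = rng.sample(range(len(population)), 2)
    run_pairing(population, dv, i, j, [rng.randrange(64) for _ in range(8)])
```

Because the faulty individual is penalized in every pairing it joins while fault-free pairs leave each other's DVs untouched, its DV emerges over time as the population outlier, without any test vectors beyond the normal input stream.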

3.2. Tracking competence using health states

The Health State of the configurations that comprise the L and R functional configuration populations is managed by the procedural flow of the CRR algorithm as depicted in Fig. 2. After initialization, the Selection of the L and R half-configurations occurs. The selected individuals are then loaded onto the FPGA. Next, the Detection process is conducted when the normal data processing inputs are applied to the FPGA. The DVs of the competing half-configurations are updated based on whether or not their outputs are discrepant. The central Primary Loop representing discrepancy-free behavior can repeat without reselection as long as there is no discrepancy. However, even in the absence of any observed discrepancies, one or more of the competing individuals may be replaced to hasten regeneration in the presence of any Under Repair individuals. As described later, the Replacement Rate, RX, determines the frequency with which such discrepancy-free individuals are replaced to allow rotation of other individuals from the Dormant pool. For instance, system availability can be increased by using a relatively small value for RX rather than a relatively high one.

As described in detail in Section 4, the Fitness State Adjustment process is used to validate and update the state of the individual after an Evaluation Window has elapsed. Otherwise reselection will occur without updating the fitness state of the individual being replaced. For Under Repair individuals, if the value of the corresponding History matrix element Hii, as described in Section 4.3, is greater than the threshold value, then Genetic Operators are invoked only once without attempting to achieve complete refurbishment. The modified configuration is then immediately returned to the pool of competing configurations and the process resumes starting with the Selection phase.

3.3. Selecting candidates for competition

The population of configurations to be evolved by the GA consists of L and R individuals chosen by the Selection and Detection processes shown in Fig. 3. During the selection process, Pristine, Suspect, and Refurbished individuals [2] are preferred in that order for one half-configuration. The other half-configuration is selected based on a stochastic process determined by the Reintroduction Rate (λR). In particular, Under Repair individuals are selected as one of the competing half-configurations, on average, at a rate equal to λR. Thus, a genetically modified Under Repair configuration is reintroduced at a controlled rate into the operational throughput flow. The reintroduced configurations act as new competitors which may potentially exhibit fault-free behavior against the larger pool of configurations. Individuals in the population have a finite probability of being selected as the Active individuals, with a suitable interval between successive selections.

The Detection process is presented in the lower right corner of Fig. 3. If a discrepancy is observed as a result of output comparison, the FPGA is reconfigured with a different pair of competing configurations and the output of the FPGA is not propagated externally. Thus, CRR employs the runtime inputs simultaneously as evaluation inputs, with subsequent isolation of outliers as described in Section 4.2. Also, the partially correct outputs generated by competing fault-affected individuals can improve availability, as opposed to keeping a device offline while a perfect solution is evolved.
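A rough sketch of this selection policy follows. The health-state names and the reintroduction rate come from the text; the `select_pair` function, its arguments, and the population data structure are hypothetical, and the sketch assumes at least two non-Under-Repair individuals are always available.

```python
import random

# Health-state preference order for the primary half-configuration
# (lower rank = preferred), per Section 3.3.
PREFERENCE = {"Pristine": 0, "Suspect": 1, "Refurbished": 2}

def select_pair(individuals, states, reintro_rate, rng):
    """Pick two competing half-configurations: one by health-state
    preference, the other by stochastic Under Repair reintroduction."""
    ranked = sorted((c for c in individuals if states[c] in PREFERENCE),
                    key=lambda c: PREFERENCE[states[c]])
    primary = ranked[0]
    under_repair = [c for c in individuals
                    if states[c] == "Under Repair" and c != primary]
    if under_repair and rng.random() < reintro_rate:
        rival = rng.choice(under_repair)   # reintroduced at the chosen rate
    else:
        rival = rng.choice(ranked[1:])     # another competent individual
    return primary, rival

states = {0: "Pristine", 1: "Suspect", 2: "Refurbished", 3: "Under Repair"}
rng = random.Random(1)
primary, rival = select_pair(list(states), states, reintro_rate=0.1, rng=rng)
```

Run repeatedly, the Under Repair individual appears as the rival in roughly 10% of selections, matching the controlled reintroduction described above.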

4. Identifying outliers by evaluating performance

4.1. Determination of the Evaluation Window

CRR uses runtime inputs for individual performance evaluation rather than exhaustive testing with a predefined set of test vectors. Nonetheless, sufficient testing of individuals with throughput data can provide adequate test coverage as shown in Section 5. While the range and sequence of online inputs may not be known at design-time, a probabilistic model is analyzed here to estimate the number of evaluations needed to encounter the full range of inputs with high probability. An initial default width for the Evaluation Window, E, can then be selected based on the analysis.

The characteristics of the circuit Under Repair will influence the determination of E, as illustrated here for an unsigned integer multiplier. Let the circuit input width, W, denote the total number of operand bits to the multiplier. In the case of a 3-bit × 3-bit multiplier, W = 6 and the total number of distinct input combinations is 2^W = 64, so an exhaustive set of inputs would consist of all 64 possible combinations. The problem of determining the number of random inputs needed for all possible inputs to appear at least once is similar to the coupon collector problem [7], in which the expected number of coupons to be collected before at least one each of D total coupons is obtained is given by the simplified expression D × H_D, where H_D is the Dth harmonic sum [7]. However, for the exhaustive test modeling problem at hand, the number of random inputs required for all possible inputs to appear with varying confidence factors needs to be derived. This problem can be modeled as a game involving selection of balls from a set of 64 differently colored balls, where a single ball is selected in each drawing, with replacement. In other words, what is the probability that, after D drawings, at least one ball of each of the 64 colors has appeared? Clearly, for D < 64 the probability is zero, and for D = 64 it is 64!/64^64 ≈ 3.2 × 10^−27, which is highly improbable.

To solve this problem, first consider the case where all balls are of one color. After D drawings we have C(1,1) × x_1 = 1^D, where x_1 is the number of feasible sample events, so x_1 = 1^D. In general, for D ≥ K, a K-color experiment can be described as a sum of experiments involving smaller numbers of colors for any constant value of D, where C(K, m) denotes the binomial coefficient:

C(K,K) x_K + C(K,K−1) x_{K−1} + · · · + C(K,2) x_2 + C(K,1) x_1 = K^D    (1)

or

Σ_{m=1}^{K} C(K,m) x_m = K^D    (2)

Since the numerical value of K^D in Eq. (2) can be excessively large, it may not be representable in an unsigned long variable, the widest variable on a 32-bit system, since for example 64^64 > 2^32 − 1. Therefore, an alternate representation considers x_K as the number of sample events in which all K colored balls appear at least once, occurring with probability P_K, where D is the number of drawings and K^D is the total number of possible outcomes:

P_K = x_K / K^D    (3)

Now, dividing Eqs. (1) and (2) by K^D yields, respectively,

C(K,K) P_K + C(K,K−1) P_{K−1} ((K−1)/K)^D + · · · + C(K,2) P_2 (2/K)^D + C(K,1) P_1 (1/K)^D = 1    (4)

and

Σ_{m=1}^{K} C(K,m) P_m (m/K)^D = 1    (5)

When K = 1, C(1,1) P_1 = 1 implies P_1 = 1, and when K = 2, P_2 + C(2,1) P_1 (1/2)^D = 1 implies P_2 = 1 − C(2,1) P_1 (1/2)^D. Therefore, in general:

P_K = 1 − C(K,K−1) P_{K−1} ((K−1)/K)^D − · · · − C(K,1) P_1 (1/K)^D    (6)

Eq. (6) yields P_K recursively without the computational burden of calculating K^D, since (m/K)^D < 1 for all m < K.

Fig. 4. Effect of sample size on test coverage.

As shown in Fig. 4, when K = 16 colors and D = 100 drawings, the probability P_16 of all 16 colors appearing is ≈100%. Similarly, 250 trials for 32 colors are sufficient given equiprobable inputs. Table 2 shows the result for the case when K = 64, which applies to the 3-bit × 3-bit multiplier. In order to achieve comprehensive coverage with a certainty of 97.59%, approximately 500 evaluations are sufficient. A certainty of 99.50% implies an Evaluation Window of width E = 600, which was adopted for the fault isolation experiments in Section 4.3. Thus, in the case of a 3-bit × 3-bit multiplier design, if 1-out-of-64 inputs articulate a fault in a single individual C_i, and all the input combinations are equally likely to appear, then the expected discrepancy value after E = 600 evaluations is:

DV_i = (1/64) × E = 9.375    (7)

Table 2
Probability of all 64 inputs appearing at least once given D evaluations.

D      350      400      450      500      550      600      650
P_64   76.96%   88.84%   94.77%   97.59%   99.00%   99.50%   99.77%

While analysis does not guarantee any amount of test vector coverage in practice, experiments shown in Section 5 indicate numerous case studies where a 99% or greater equiprobable coverage provides stable adaptive behavior during fitness evaluation.

4.2. Determination of the Sliding Window width

To ascertain whether the DV anticipated by Eq. (7) is indicative of a change in fitness, it is compared to the consensus value of the recent number of discrepancies observed throughout the population. The Sliding Window, S, is used to update the global discrepancy consensus to which all individual values are compared. Typically, S is selected to be an integer multiple of E such that S = q × E, where 1 < q < |C| and |C| is the population size. In the experiments described, a Sliding Window width is selected such that q = 5 for |C| = 20, as these values are shown experimentally below to build a sufficient consensus of recent observations across 5/20 = 25% of the population. In particular, the size of E is indirectly proportional to the update frequency of each discrepancy value DV_i for individual i.

For instance, as shown in Fig. 5, for a Sliding Window width of only 15 and a cut-off value of 0.5, 100% outlier identification can be achieved. To minimize fault location time, the lowest value of S, S_min, is sought which repeatedly identifies the faulty individuals. As S is reduced from 20 to 5, the curves indicate that a higher cut-off value is required to identify outliers and that 0.4 is sufficient for S ≥ 10. However, as shown in Fig. 5, with S = 5 and a cut-off value of 0.9, outliers can be consistently identified, thus yielding S_min = 5. A fault-impact value of 32-out-of-64 was selected to reflect a middle-range damage scenario where the physical resource fault manifests a discrepancy for 50% of the applied inputs. For less catastrophic faults, consistent fault isolation can be relied upon with the parameters identified above. In general, outliers can be identified with a periodicity of approximately …

Fig. 5. Cut-off values and outlier detection rates as a function of Sliding Window width, S.
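The recursion of Eq. (6) is straightforward to evaluate numerically. The sketch below (Python, with a hypothetical function name) reproduces the Table 2 coverage probabilities and the expected discrepancy value of Eq. (7):

```python
from math import comb

def all_colors_probability(K: int, D: int) -> float:
    """P_K of Eq. (6): probability that each of K equally likely colors
    appears at least once in D independent drawings with replacement."""
    if D < K:
        return 0.0
    P = [0.0] * (K + 1)
    P[1] = 1.0  # a single color always appears
    for k in range(2, K + 1):
        # Subtract the probability mass of outcomes using only m < k colors.
        P[k] = 1.0 - sum(comb(k, m) * P[m] * (m / k) ** D
                         for m in range(1, k))
    return P[K]

# Table 2 entry: all 64 multiplier inputs seen within E = 600 evaluations.
print(round(all_colors_probability(64, 600), 4))   # ~0.995

# Eq. (7): one fault-articulating input out of 64 over the window E.
E = 600
expected_dv = (1 / 64) * E   # 9.375
```

Working with the ratios (m/K)^D keeps every intermediate value in floating-point range, which is exactly the motivation the text gives for avoiding K^D.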

.3. Identification of outliers

Outlier diagnostics is based on the Least Squares (LS) projection matrix H [23]. This matrix is well known under the name hat matrix, because it puts a hat on the column vector y = (y1, . . ., yn)t such that ŷ = H × y, where ŷ is the LS prediction for y. The hat matrix H is defined as follows. Consider p explanatory variables and one response variable, each with n observations. The n-by-1 vector of responses is denoted by y = (y1, . . ., yn)t. The linear model states that y = X × β + e, where β is the vector of unknown parameters, e is the error vector, and X is the n-by-p matrix:

    X = [ x11  x12  · · ·  x1p
          x21  x22  · · ·  x2p
          ...  ...         ...
          xn1  xn2  · · ·  xnp ]                                (8)

Then, the H matrix is composed from X as follows:

    H = X(XtX)−1Xt                                              (9)

The diagonal elements of H have a direct interpretation as the effect exerted by the ith observation on the expectation of the response variable, because they equal ∂ŷi/∂yi. The average value of the diagonal element Hii is p/n, and it follows that 0 ≤ Hii ≤ 1 for all i.

In the CRR approach, the DV of each individual can be viewed as one observation of one explanatory variable, and the Observation Interval can be set as the size of the entire population. Fortunately, since the X matrix consists of only one column in our application, the result of the XtX product is a single-element matrix, and its inverse can be computed in a straightforward one-step computation. In general, the computational complexity of the H matrix approach is 2n² + 1.

The recommended threshold for the identification of outliers is Hii > 2p/n, and a stricter cut-off value of 3p/n has been used in previous works [8,9]. For an analysis of the CRR problem for fault isolation, setting p = 1 and n = 20 corresponds to one faulty individual among a population of 20. For example, a cut-off value of 10 × p/n = 10/20 = 0.5 can be used in conjunction with a larger Sliding Window width of 15 to favor fairly consistent outlier identification, as demonstrated in the case studies. Also, to increase the confidence with which outliers are isolated, we increase the threshold from one standard deviation from the mean to a value of 2.5σ.
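Since X reduces to a single column of DVs in CRR (p = 1), Eq. (9) collapses to Hii = DVi²/Σj DVj², and outlier screening can be sketched directly (an illustrative Python sketch; the function names are assumptions, while the 0.5 cut-off follows the text):

```python
def hat_diagonal(dvs):
    """Diagonal of H = X(XtX)^-1 Xt when X is a single column of DVs:
    (XtX) is then the scalar sum of squares, so Hii = DVi^2 / sum(DVj^2)."""
    ss = sum(v * v for v in dvs)
    return [v * v / ss for v in dvs]

def outliers(dvs, cutoff=0.5):
    """Indices whose leverage Hii exceeds the Outlier Threshold."""
    return [i for i, h in enumerate(hat_diagonal(dvs)) if h > cutoff]

# One faulty individual among n = 20 competitors: its DV dwarfs the rest.
dvs = [2.0] * 19 + [70.0]
print(outliers(dvs))           # only the high-DV individual is flagged
print(sum(hat_diagonal(dvs)))  # the diagonal sums to p = 1
```

The normalization by the sum of squares is what makes the same threshold usable regardless of the absolute scale of the DVs.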

5. CRR performance evaluation

5.1. Outlier detection and fault isolation performance

Experimental results regarding the effect of the outlier detection parameters are illustrated in Figs. 6–13. Each has been generated using a simulator written in the C++ programming language which utilizes an equiprobable selection of individuals. In the data reported for experiments, the inputs causing the first discrepancy are applied once after each pair of faulty configurations is replaced to assess the damage definitively under a single-fault model.

To further illustrate how the DVs are mapped to the Hii values, Figs. 6–13 are presented in pairs that show results from the same experiment. The first figure in each pair shows the observed DVs and the subsequent figure shows the Hii values calculated using this data. For example, Fig. 6 shows the DVs observed over 50 individual evaluations, where each evaluation occurs after the particular individual has completed E = 600 computations as an Active configuration on the simulated FPGA. Fig. 7 shows the plot of Hii values for a subset of evaluations corresponding to the identification of the first outlier. Figs. 6 and 7 depict the identification of outlying individuals in a population that has a 10-out-of-64 fault impact caused by a single fault. A Sliding Window width of 15 was used in this experiment. Fig. 6 shows identification of an outlier (a high y-axis value) at evaluation numbers 11, 31, and 48 (along the x-axis).

These outliers are identified with a periodicity of approximately 20 × E for a population size of 20 individuals. By choosing a smaller Sliding Window, outlier identification will take place at an increased frequency, as shown in subsequent experiments. Fig. 7 shows that the outlier at evaluation 11 exhibits Hii ≈ 0.94, which is well over an order of magnitude larger than the Hii ≈ 0.02 of the other competitors. Based on analysis of the Hii values, and an outlier cut-off value of 0.5, the outlying individual is identified without statistically significant ambiguity.

Fig. 6. Discrepancy values observed when one individual has a 10-out-of-64 fault impact.

In Figs. 6 and 7, individual performance was measured using a Winner-Takes-All scheme, where the only information available from the discrepancy detection is bit-wise output equality. An alternate discrepancy detection circuit could provide information such as the Hamming distance of the observed output of individuals. The use of Hamming distance information leads to outliers having a higher discrepancy value, as shown in Fig. 8, when compared to Fig. 6. As in the previous experiment, a 10-out-of-64 fault impact is considered, with a Sliding Window width of 15. The higher DV of approximately 140 can be accounted for by the fact that the observed Hamming distance between the discrepant output and the desired ideal can be greater than 1. This is opposed to the previous case, where the presence of a discrepancy increases the DV of the corresponding individual by one, yielding DV ≈ 70. The Outlier Threshold remains the same, nonetheless, since the hat matrix operates on normalized information. Figs. 8 and 9 show plots of the discrepancy value and the H values when the Hamming distance is used to quantify divergence, which still clearly delineate outliers as high values on the y-axis.

Fig. 7. Plot of Hii showing outlier identification.

Fig. 8. Discrepancy values observed when Hamming distance is used.

Although the Hamming distance of the outputs serves as a useful metric to quantify divergence, other performance evaluation schemes can be used to compare and evaluate the fitness of individuals. As an example, a bit-weight performance evaluation scheme can be used, where the output of a configuration is converted to a numerical value by assigning weights to the bits in the output. In this scheme, the binary output is evaluated as a binary number, and the numerical difference between the outputs serves as the discrepancy metric, where errors in the most significant bits correspond to more drastic errors in output values.
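The three scoring schemes discussed, bit-wise equality (Winner-Takes-All), Hamming distance, and bit-weighting, differ only in how a pair of outputs is reduced to a discrepancy increment; a minimal sketch (function names assumed):

```python
def wta(expected, observed):
    """Winner-Takes-All: 1 if any output bit differs, else 0."""
    return int(expected != observed)

def hamming(expected, observed):
    """Number of differing output bits."""
    return bin(expected ^ observed).count("1")

def bit_weight(expected, observed):
    """Outputs read as binary numbers; MSB errors weigh most."""
    return abs(expected - observed)

e, o = 0b101100, 0b001101   # 6-bit outputs, e.g. from a 3-bit multiplier
print(wta(e, o), hamming(e, o), bit_weight(e, o))
```

For the same discrepant output pair, the three metrics yield increasingly graded penalties: a fixed unit, the count of wrong bits, and the numeric error magnitude, respectively.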

In the case when a single faulty L individual with a less catastrophic 1-out-of-64 fault impact is analyzed, two outlier points are successfully isolated, as shown in Fig. 10. Fig. 11 shows the corresponding plot of the Hii for the same experiment. The detection rate was observed to be 100% for this fault injection scenario.

When compared to the results in Figs. 6 and 7, it can be seen that the identification takes place more frequently, with a periodicity of approximately 5 × E. This corresponds to the use of a narrower, yet equally effective, Sliding Window width as opposed to the 15 × E used in the earlier experiment. In Fig. 11, the outlier cut-off value is 0.3 as compared to 0.5 in Fig. 7. Also, the first outlier in Fig. 11 is closer to the cut-off value, which can be expected with a narrow Sliding Window. A wider Sliding Window width helps reinforce identification, yet too large a value can delay identification without improving the discrimination between faulty and viable competitors.

For greater fault-impact scenarios, where one or more faults impact one or more configurations, the isolation is more challenging and time-consuming, as shown in Figs. 12 and 13. Both figures depict the isolation characteristics for a single faulty L individual with a 32-out-of-64 (50% correctness) fault impact. A greater number of observations is required than in the 1-out-of-64 scenario, and the divergence of the outlier is also greater. Individuals that are eventually identified as outliers are replaced by the CRR algorithm more often, since the computations involving these individuals produce discrepant outputs. Under the default replacement strategy for discrepancy-free behavior depicted in Fig. 3, fault-free individuals reside on the FPGA indefinitely. However, in this experiment, they are replaced in accordance with the Replacement Rate RX = 0.16, which corresponds to a guaranteed evaluation period of 100 contiguous iterations out of the E = 600 window. Individuals that do not produce discrepant outputs are replaced with other individuals less frequently than ones that do. Thus, individuals that are not fault-affected complete the required E number of iterations to finish evaluation much sooner than the fault-affected individuals. This is because discrepancies trigger immediate reconfiguration as a means of maintaining throughput and improving system availability within limits dictated by ρR.

Fig. 9. Plot of Hii showing outlier identification when Hamming distance is used.

Fig. 10. DV of a single faulty L individual with a 1-out-of-64 fault impact.

Fig. 11. Isolation of a single faulty L individual with a 1-out-of-64 fault impact.

Fig. 12. DVs observed when a single faulty individual has a 32-out-of-64 fault impact.

Fig. 13. Isolation of a single faulty L individual with a 32-out-of-64 fault impact.

5.2. FPGA circuit representation and characteristics

The FPGA structure used in the following experiments is similar to that used by Miller and Thomson for GA-based arithmetic circuit design [16]. The feed-forward combinational logic circuit uses a rectangular array of nodes with two inputs and one output. Each node represents a Look-up Table (LUT) in the FPGA device, and a Configurable Logic Block (CLB) is composed of four LUTs. In the array, each CLB occupies a row, and its LUTs are represented as four columns of the array. There are five dyadic functions – OR, AND, XOR, NOR, NAND – and one unary function, NOT, each of which can be assigned to an LUT. The LUTs in the CLB array are indexed linearly from 1 to n. Array routing is defined by the internal connectivity and the inputs/outputs of the array. Internal connectivity is specified by the connections between the array cells. The inputs of the cells can only be the outputs of cells with lower row numbers. Thus, the linear labeling and connection restrictions impose a feed-forward structure on the combinational circuit.

Fig. 14. Example of a 3-bit × 3-bit multiplier design.
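As a sketch of this representation, the feed-forward restriction can be checked mechanically over a candidate genome (illustrative Python; encoding each LUT as a (gate, input, input) triple with negative ids for primary inputs is an assumption about the data layout, not the paper's encoding):

```python
COLS = 4  # LUTs per CLB row; LUT i sits in row i // COLS

def row_of(lut_index):
    return lut_index // COLS

def is_feed_forward(genome):
    """True if every LUT draws only from primary inputs (negative ids)
    or from the outputs of LUTs in strictly lower rows."""
    for i, (_gate, a, b) in enumerate(genome):
        for src in (a, b):
            if src >= 0 and row_of(src) >= row_of(i):
                return False
    return True

good = [("AND", -1, -2), ("OR", -2, -3), ("NOT", -1, -1), ("XOR", -3, -4),
        ("NAND", 0, 1), ("NOR", 2, 3)]      # row 1 reads only row 0 outputs
bad = [("AND", -1, -2), ("OR", 0, -3)]      # LUT 1 reads LUT 0 in its own row
print(is_feed_forward(good), is_feed_forward(bad))
```

Such a check is useful because any genotype satisfying it is guaranteed to describe an acyclic, purely combinational circuit.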

As an example of the circuit representation, the 3-bit × 3-bit multiplier can be implemented using the above FPGA structure, as shown in Fig. 14. The entire configuration utilizes 21 CLBs. XOR gates are excluded from the initial designs to force usage of a higher number of gates than conventional multiplier designs, to increase the design space. XOR gates simplify the process of calculating partial binary sums, and thus reduce the number of gates required to build half-adders and full-adders.

A library of user-defined modules can be defined to instantiate a population of diverse yet functionally equivalent circuits. In this case study, 20 distinct individuals are created at design-time using a set of 10 or more variations of three fundamental sub-circuits. These consist of parallel-AND, half-adder, and full-adder primitives. For example, 24 different full-adder designs and 18 different half-adder designs were created for use in building the individual 3-bit × 3-bit multiplier designs. Thus, each multiplier is a distinct combination of building blocks, where each building block itself is chosen from among alternate designs in the library. Fig. 14 illustrates an individual with three parallel-AND, three full-adder, and six half-adder modules.

The population of competing alternatives is then divided into two groups, L and R, where each group uses an exclusive set of physical resources. For crossover to occur such that offspring are guaranteed to utilize only mutually exclusive physical resources with other resident half-configurations, a two-point crossover operation is carried out with another randomly selected Pristine, Suspect, or Refurbished individual belonging to the same group. By enforcing speciation, breeding occurs exclusively within L or R, and non-interfering resource use is maintained. The crossover points are chosen along the boundary of CLBs so that intra-CLB crossover is precluded. The mutation operator randomly changes the LUT's functionality or reconnects one input of the LUT to a new, randomly selected output inside the CLB.
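The CLB-aligned two-point crossover can be sketched as follows (illustrative Python; representing a half-configuration simply as a list of CLB-granularity genes, so that any cut point automatically falls on a CLB boundary, is an assumed simplification):

```python
import random

def two_point_crossover(parent_a, parent_b, rng):
    """Swap a contiguous run of whole CLBs between two same-group parents;
    cut points land on CLB boundaries by construction."""
    assert len(parent_a) == len(parent_b)
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

rng = random.Random(7)
a = [f"A-CLB{k}" for k in range(6)]   # one half-configuration of 6 CLBs
b = [f"B-CLB{k}" for k in range(6)]   # a mate from the same group (L or R)
child_a, child_b = two_point_crossover(a, b, rng)
print(child_a)
```

Swapping whole CLBs guarantees that no crossover point can split resources inside a CLB, mirroring the intra-CLB restriction described above.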

5.3. Refurbishment of a unique failed configuration – multiplier case study

In this experiment, GA-based recovery operators are applied to regenerate the functionality in the affected individuals. In order to simulate a hardware fault in the FPGA, a single stuck-at fault is inserted at a randomly chosen LUT input pin. This fault happens to affect the L individuals in the population. Later, a different single fault is introduced, which affects R individuals, and the experiment is repeated. Upon observing the first discrepancy, the same inputs are applied once to the reloaded configurations as a definitive means of damage assessment under a single-fault model. Over 25 experimental runs, an average of 2,171 iterations was required to dependably demote the fitness state of the affected individual from Pristine to Under Repair. During regeneration, the Genetic Algorithm performs inter-module crossover and an intra-module mutation operator called the input permutation operator. Unlike traditional mutation, the input permutation operator alters a specific LUT's functionality, choosing from among AND, OR, XOR, NOR, and NAND gates, as well as changing the connections to the input pins. Such mutation, in conjunction with the crossover operations, enables full exploration of a wide range of designs.
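A hedged sketch of the input permutation operator described above (illustrative Python; the LUT record layout follows the sketches earlier in this section, and the equal-probability split between re-gating and re-wiring is an assumption, not a parameter stated in the paper):

```python
import random

GATES = ["AND", "OR", "XOR", "NOR", "NAND"]

def input_permutation(lut, clb_outputs, rng):
    """Intra-module mutation: either pick a new dyadic gate for the LUT,
    or reconnect one of its input pins to an output inside the same CLB.
    (The 50/50 split between the two actions is an assumption.)"""
    gate, in_a, in_b = lut
    if rng.random() < 0.5:
        gate = rng.choice([g for g in GATES if g != gate])   # re-gate
    elif rng.randrange(2) == 0:
        in_a = rng.choice(clb_outputs)                       # re-wire pin 0
    else:
        in_b = rng.choice(clb_outputs)                       # re-wire pin 1
    return (gate, in_a, in_b)

rng = random.Random(3)
mutated = input_permutation(("AND", 4, 7), clb_outputs=[4, 5, 6, 7], rng=rng)
print(mutated)
```

Restricting the re-wiring to outputs inside the same CLB keeps the mutation intra-module, so the two groups' exclusive resource sets are never mixed.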

Table 3 lists the evolutionary regeneration characteristics of CRR for stuck-at-0 and stuck-at-1 faults. The faults were injected at randomly chosen locations in the designs. For the experiment, DVR and DVO, the repair and operational thresholds, were 2.5σ and 1σ, respectively. The use of multiples of the standard deviation as thresholds ensures that the system adapts in the case of catastrophic fault conditions, as well as the condition where very few discrepancies are observed.

The parameters which control the rate at which individuals are rotated on the FPGA, ρR and RX, were set at 0.2 and 0.16, respectively. The Reintroduction Rate of 0.2 implies that 20% of the computations were carried out using a pair of individuals, one of which was Under Repair. In spite of this, the effective throughput remains high, above 97.5% on average. This shows that even the individuals undergoing repair produce useful output approximately (0.975 − (1 − ρR))/ρR × 100% = 87.5% of the time.

Using a higher value for ρR will lead to faster regeneration at an incremental cost to repair throughput. This provides fine-grained control over system performance measured in terms of availability and regeneration latency. Unlike other evolvable hardware approaches, CRR can be optimized to reduce downtime, increase availability, or speed up the fault identification and regeneration process accordingly. The results listed in Table 3 indicate that the evolutionary algorithm is capable of regeneration for the tested fault locations. The correctness of the affected configurations is raised from as low as 22-out-of-64 to complete operational suitability. The effective throughput is maintained at 97.6% on average. It can also be seen that CRR-based regeneration can be more computationally tractable without exhaustive evaluation, as listed in the Repair Iterations column.

Table 3. Regeneration characteristics for a single fault under CRR.

3 × 3 multiplier experiment number | Fault location | Failure type | Correctness after fault | Total iterations | Discrepant iterations | Repair iterations | Final correctness | Throughput (%)
1 | CLB3, LUT0, Input1 | Stuck-at-1 | 52/64 | 1.7 × 10⁷ | 4.2 × 10⁵ | 1194 | 64/64 | 97.7
2 | CLB6, LUT0, Input1 | Stuck-at-0 | 33/64 | 8.0 × 10⁵ | 1.7 × 10⁴ | 47 | 64/64 | 97.9
3 | CLB5, LUT2, Input0 | Stuck-at-1 | 22/64 | 3.1 × 10⁶ | 6.8 × 10⁴ | 193 | 64/64 | 97.8
4 | CLB7, LUT2, Input0 | Stuck-at-0 | 38/64 | 8.1 × 10⁶ | 1.8 × 10⁵ | 513 | 64/64 | 97.7
5 | CLB9, LUT0, Input1 | Stuck-at-0 | 40/64 | 2.3 × 10⁶ | 7.1 × 10⁴ | 219 | 64/64 | 96.9
Average | – | – | 32.6/64 | 6.4 × 10⁶ | 1.5 × 10⁵ | 433 | 64/64 | 97.6

5.4. Refurbishment of multiple failed configurations – MCNC-91 benchmark circuits case study

In order to further evaluate the performance of CRR-based regeneration on a wider variety of circuits, refurbishment experiments were conducted on circuits from the MCNC-91 benchmark suite [28]. Four circuits from the benchmark suite were used, namely the z4ml, cm85a, cm138a, and 2x-decod circuits. These circuits were chosen to represent two different kinds of circuits. The z4ml and cm85a circuits have a fan-in greater than the fan-out, and the cm138a and 2x-decod circuits have a fan-out that is greater than the fan-in. The z4ml circuit is a representation of a binary adder. The cm138a and cm85a circuits are combinational logic circuits, and the decod circuit is a decoder. In these refurbishment experiments, a single stuck-at fault is introduced at a random LUT input which affects 18 of the 20 configurations. As compared to the prior experiment described in Section 5.3, all but one of the L and R configurations are affected by the fault such that a subset of their outputs are incorrect. All other parameters for these experiments are identical to those for the experiment described in Section 5.3. Results of the experiments with multiple fault-affected configurations are shown in Table 4.

As listed in Table 4, for all the circuits, CRR is able to refurbish a minimum of 14 configurations. The throughput is maintained above 91.7% for the experiments. Using a lower value of ρR would further improve the throughput. From these experiments, it is clear that CRR is able to utilize information obtained from pair-wise competition to direct the global search for refurbished configurations. Furthermore, no correlation is observed between the fan-in to fan-out ratio of the circuits and the relative difficulty of refurbishing the circuits.

Table 4. Regeneration characteristics with multiple fault-affected configurations.

Circuit | Fault-affected configurations | Failure type | Total iterations | Discrepant iterations | Repair iterations | Refurbished individuals | Throughput (%)
z4ml | 18/20 | Stuck-at-0 | 1.08 × 10⁶ | 4.94 × 10⁴ | 391 | 14/20 | 95.4
cm85a | 18/20 | Stuck-at-1 | 4.07 × 10⁶ | 3.39 × 10⁵ | 1913 | 14/20 | 91.7
cm138a | 18/20 | Stuck-at-1 | 2.15 × 10⁶ | 1.48 × 10⁵ | 4072 | 15/20 | 93.1
2x-decod | 18/20 | Stuck-at-0 | 3.01 × 10⁶ | 1.15 × 10⁵ | 14429 | 14/20 | 96.2

5.5. Comparison with alternative soft computing fault tolerance schemes

In Vigander's experiment using a voting system in conjunction with TMR [27], the target circuit is a 4-bit × 4-bit multiplier. With a population size of 50 and a crossover rate of 70%, most of the 44 runs Vigander performed developed a set of three modules which vote to provide fully fit output for the exhaustive set of 256 unique input combinations. However, it is not always possible to successfully identify a single fully repaired individual from those three alternatives. Vigander's experiment selected a population size of 50, which is 2.5 times the size of the 20-individual population used in the repair experiments attempted herein. Most significantly, it relies on exhaustive serial testing against the set of all possible inputs. CRR, however, achieves refurbishment with runtime inputs, continually providing some validated outputs that maintain useful throughput above 91.7% across all benchmarks mentioned above.

Furthermore, in Ref. [11], 3-bit × 3-bit multipliers are evolved in the presence of faults using a Genetic Algorithm. Over 17 runs of the experiments, the fitness of the multiplier is improved by 12.6% on average. However, none of the runs succeed in realizing a fully fit multiplier, even after as many as 222,056 generations. This highlights the asymptotic performance of GA optimization in this domain, where improvement to the fitness of a failed configuration is rapid initially, but continued improvement to a fully fit solution is difficult. CRR is able to leverage the diverse genetic information contained in the population of designs to realize fully functional refurbished configurations within a few thousand iterations, as shown in Section 5.3. More importantly, even in the presence of fault-affected configurations, the population of solutions is capable of maintaining some useful throughput.

Compared to Jiggling [4], which is also an evolutionary-algorithm-based approach to repairing permanent faults, CRR can exhibit lower latency by virtue of not relying on exhaustive tracking of the repair candidates. Additionally, the (1 + 1) Evolutionary System described therein relies on rollbacks to preserve best-fit mutants. CRR, by virtue of depending on a population of higher-fit alternatives that are evaluated temporally over many iterations, precludes the need for rollback of configurations and ensures higher populational fault tolerance capability. In Ref. [5], results are presented for the performance of the Jiggling scheme when the technique is applied to some of the MCNC-91 benchmark circuits presented in Section 5.4. For the cm138a circuit, when configurations are available, the system is able to maintain reliability at greater than 99% even when the mean time between faults is 52.8 s. The results also show that repair is achieved for some circuits, such as cm138a, while refurbishments remained elusive for others, such as the decod circuit. Furthering the results of Keymeulen et al. on populational fault tolerance [6], CRR achieves device refurbishment at runtime, while ensuring sustainable levels of throughput with graceful degradation. As compared to the Roving STARs approach [1], CRR minimizes detection latency, as faults are evident immediately upon a discrepancy at the outputs. Also, unlike STARs, by virtue of its runtime-input-based performance evaluation, CRR leverages partially fit configurations to provide some functional throughput. This effectively improves the granularity of spare usage to include resources affected by stuck-at faults, as the GA may evolve solutions that use fault-affected resources in generating repair configurations. Nonetheless, despite its exhaustive testing of FPGA resources, STARs does avoid the

need for convergence inherent in GA-based approaches, including CRR.

6. Conclusion

Evolutionary regeneration can benefit from a population of partially working designs which provide diverse, relevant alternatives. This also allows departure from conventional fitness evaluation with a rigid individual-centric fitness measure specialized at design-time for each functional behavior needing repair. CRR instead uses a self-adapting, population-centric assessment method at runtime based on actual throughput data rather than test vectors. CRR adapts its fitness criteria based on the performance of the population, thus providing graceful degradation. By utilizing outlier detection techniques that work temporally without the need for exhaustive testing, CRR provides a fault tolerance technique that achieves usable device throughput during the fault detection process.

While pre-existing methods focus on creating a single fully fit configuration, CRR extends this to maintain a population of solutions that have preferred fitness. This enhances the adaptability of viable alternatives to a variety of unanticipated faults. An additional benefit of maintaining a population of diverse partially fit individuals is that when the inputs to the system are localized to a subset of the set of all possible inputs, even partially fit individuals can assist in generating many of the required outputs, thereby improving the rate of throughput during recovery.

A concern with self-adapting systems such as CRR is the avoidance of undesirable emergent behaviors. In the case of CRR, one potential risk is functional drift occurring over a period of time. In other words, fitness evaluation without exhaustive testing creates an exposure that the population's behavior will begin to deviate from the originally intended function, as repeated repairs accumulate on long missions encountering multiple failures. One way CRR mitigates this risk is through its designation and tracking of Pristine configurations, which are normally preferred for selection. As long as at least one known Pristine configuration exists, there is an avenue for evolution to be guided back to the desired behavior under a number of fault scenarios.

Future work involves scaling to larger circuit sizes, which is currently being investigated in pipelined combinational logic designs as follows. After discrepancy detection, a Combinatorial Group Testing (CGT) method [25] tracks utilization of resource sets among individuals in the population to quickly identify the stage containing the faulty resource. This can be incorporated within the configuration selection step of CRR. The Genetic Operators can then be applied only to that isolated stage to attempt recovery [20], thus providing an approach to extend the CRR method to larger circuits while remaining computationally tractable. Finally, experiments with hardware in the loop for Xilinx Virtex-II Pro [33] and Virtex-5 FPGAs indicate feasibility on various hardware platforms.

Acknowledgement

This research was supported in part by NASA Intelligent Systems NRA Contract NNA04CL07A.

References

[1] M. Abramovici, J.M. Emmert, C.E. Stroud, Roving STARs: an integrated approachto on-line testing, diagnosis, and fault tolerance for FPGAs in adaptive comput-ing systems, in: NASA/DoD Workshop on Evolvable Hardware, 2001.

[2] R.F. DeMara, K. Zhang, Autonomous FPGA fault handling through competi-tive runtime reconfiguration, in: Proceedings of the NASA/DoD Conference onEvolvable Hardware (EH’05), Washington, DC, USA, June 29–July 1, 2005, pp.109–116.

[3] R.F. DeMara, C.A. Sharma, Self-checking fault detection using discrepancy mirrors, in: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'05), Las Vegas, NV, USA, June 27–30, 2005, pp. 311–317.

[4] M. Garvie, A. Thompson, Scrubbing away transients and jiggling around thepermanent: long survival of FPGA systems through evolutionary self-repair, in:Proceedings of the 10th IEEE International On-Line Testing Symposium, 2004,pp. 155–160.

[5] M. Garvie, Reliable electronics through artificial evolution, Doctoral disserta-tion, University of Sussex, Brighton, 2005.

[6] D. Keymeulen, A. Stoica, R. Zebulum, Fault-tolerant evolvable hardware usingfield programmable transistor arrays, IEEE Transactions on Reliability 49(September (3)) (2000).

[7] P. Flajolet, D. Gardy, L. Thimonier, Birthday paradox, coupon collectors,caching algorithms and self-organizing search, Discrete Applied Mathematics39 (November (3)) (1992) 207–229.

[8] D.C. Hoaglin, R.E. Welsch, The hat matrix in regression and ANOVA, The American Statistician 32 (1) (1978) 17–22.

[9] H.V. Henderson, P.F. Velleman, Building multiple regression models interac-tively, Biometrics 37 (1981) 391–411.

[10] J. Lach, W.H. Mangione-Smith, M. Potkonjak, Low overhead fault-tolerant FPGA systems, IEEE Transactions on VLSI Systems 6 (June (2)) (1998) 212–221.

[11] G.V. Larchev, J.D. Lohn, Evolutionary based techniques for fault tolerant field programmable gate arrays, in: Proceedings of the Second IEEE Conference on Space Mission Challenges for Information Technology, August, 2006.

[12] P. Layzell, A. Thompson, Understanding the inherent qualities of evolved circuits: evolutionary history as a predictor of fault tolerance, in: Proceedings of the Third Conference on Evolvable Systems: From Biology to Hardware (ICES00), LNCS, Vol. 1801, April, Springer, 2000, pp. 133–142.

[13] H. Liu, J.F. Miller, A.M. Tyrell, A biological development model for the design of robust multipliers, Applications of Evolutionary Computing: EvoHot (2005) 195–204.

[14] J.D. Lohn, G. Larchev, R.F. DeMara, Evolutionary fault recovery in a Virtex FPGA using a representation that incorporates routing, in: Proceedings of the 17th International Parallel and Distributed Processing Symposium, Nice, France, April, 2003.

[15] R.E. Lyons, W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM Journal of Research and Development 6 (2) (1962) 200–209.

[16] J.F. Miller, P. Thomson, Cartesian genetic programming, in: Proceedings of the Third European Conference on Genetic Programming (EuroGP2000), LNCS, Vol. 1802, Springer-Verlag, 2000, pp. 121–132.

[17] C.J. Milliord, C.A. Sharma, R.F. DeMara, Dynamic voting schemes to enhance evolutionary repair in reconfigurable logic devices, in: Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig'05), Puebla City, Mexico, September 28–30, 2005, pp. 8.1.1–8.1.6.

[18] S. Mitra, E.J. McCluskey, Which concurrent error detection scheme to choose?, in: Proceedings of the 2000 International Test Conference, Atlantic City, NJ, October 3–5, 2000, pp. 985–994.

[19] P. Moore, G.K. Venayagamoorthy, Evolving combinational logic circuits using a hybrid quantum evolution and particle swarm inspired algorithm, in: Proceedings of the NASA/DoD Conference on Evolvable Hardware, Washington, DC, USA, June 29–July 1, 2005, pp. 97–102.

[20] R.S. Oreifej, C.A. Sharma, R.F. DeMara, Expediting GA-based evolution using group testing techniques for reconfigurable hardware, in: Proceedings of the IEEE International Conference on Reconfigurable Computing and FPGAs (ReConFig'06), San Luis Potosi, Mexico, September 20–22, 2006, pp. 106–113.

[21] M. Parris, C.A. Sharma, R.F. DeMara, Progress in Autonomous Fault Recovery of Field Programmable Gate Arrays, University of Central Florida Computer Architecture Laboratory Technical Report 08-P1, 2009. Available at http://cal.ucf.edu/files/TechRep-08P1.pdf (accessed 02.03.09).

[22] R. Ross, R. Hall, A FPGA simulation using asexual genetic algorithms for integrated self-repair, in: Proceedings of the First NASA/ESA Conference on Adaptive Hardware and Systems, June 15–18, 2006, pp. 301–304.

[23] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley Series in Probability and Mathematical Statistics, 1987, pp. 216–227.

[24] A.P. Shanthi, R. Parthasarathi, Genetic learning based fault tolerant models for digital systems, Applied Soft Computing Journal (2005) 357–371.

[25] C.A. Sharma, R.F. DeMara, A combinatorial group testing method for FPGA fault location, in: Proceedings of the International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 23–25, 2006.

[26] A.M. Tyrell, A.J. Greensted, Evolving dependability, ACM Journal on Emerging Technologies in Computing Systems 3 (July (2)) (2007).

[27] S. Vigander, Evolutionary fault repair of electronics in space applications, M.S. Thesis, Norwegian University of Science and Technology, Trondheim, Norway, February 28, 2001.

[28] S. Yang, Logic Synthesis and Optimization Benchmarks User Guide: Version 3.0, Technical Report, Microelectronics Center of North Carolina (MCNC), 1991.

[29] X. Yao, Y. Liu, Getting most of evolutionary approaches, in: A. Stoica, J. Lohn, R. Kata, D. Keymeulen, R. Zebulum (Eds.), Proceedings of the 2002 NASA/DoD Conference on Evolvable Hardware, IEEE Computer Society, Alexandria, Virginia, July, 2002, pp. 8–14.

[30] X. Yao, Y. Liu, P. Darwen, How to make best use of evolutionary learning, in: R. Stocker, H. Jelinek, B. Durnota (Eds.), Complex Systems: From Local Interactions to Global Phenomena, Amsterdam, 1996, pp. 229–242.

[31] X. Yao, Y. Liu, Making use of population information in evolutionary artificial neural networks, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 28 (3) (1998) 417–425.

[32] K. Zhang, R.F. DeMara, C.A. Sharma, Consensus-based evaluation for fault isolation and on-line evolutionary regeneration, in: Proceedings of the International Conference in Evolvable Systems (ICES'05), Barcelona, Spain, September 12–14, 2005, pp. 12–24.

[33] K. Zhang, R.F. DeMara, J. Alghazo, FPGA self-repair using an organic embedded system architecture, in: Proceedings of the International Workshop on Dependable Circuits Design (DECIDE'07), Buenos Aires, Argentina, December 6–7, 2007.