
Page 1: Deep Healing: Ease the BTI and EM Wearout Crisis by ...people.virginia.edu/~xg2dt/papers/SELSE2017_EM.pdf(BTI) is one of the most prominent wearout mechanisms [2], [4]. It is characterized

Deep Healing: Ease the BTI and EM Wearout Crisis by Activating Recovery

Xinfei Guo and Mircea R. Stan
Department of Electrical and Computer Engineering
University of Virginia, Charlottesville, Virginia 22904
Email: {xg2dt,mircea}@virginia.edu

Abstract: The down-scaling of CMOS technologies into the nano-regime and the advent of the IoT era jointly conspire to elevate wearout effects to the status of major reliability threats. Bias temperature instability (BTI) and Electromigration (EM) are two of the dominant wearout mechanisms, which affect transistors and on-chip interconnect, respectively. Both phenomena have been shown to exhibit partial recovery, but this property has been treated only as a side effect until now, since passive recovery is slow and ineffective due to the permanent portion of wearout. In this paper, we propose and demonstrate that recovery for both wearout mechanisms can be further activated and accelerated, such that the permanent portion of wearout can be fully eliminated by using in-time scheduled recovery. We show that the explored recovery properties can be utilized effectively for reducing the wearout-induced design margins, and that this approach introduces a new design dimension by reducing the effects of wearout in a fundamental way. A novel circuit scheme and potential implementations at the system level that can assist both BTI and EM recovery are also detailed in the paper.

Index Terms: Wearout, IoT, EM, BTI, Active Recovery

I. WEAROUT CRISIS

Wearout (aging) has become one of the dominant failure sources for VLSI systems as technology scaling reaches the nanoscale regime [1], [2]. Transistors become more susceptible to voltage stress [2], [3], [4] because scaling of the thin oxide increases the effective field. Similarly, the shrinking geometries of metal layers lead to higher current densities, and the tremendous number of transistors within a compact area has resulted in higher power densities as well. Together, these lead to increased on-chip temperatures, which potentially accelerate the wearout effects [5]. Moreover, advanced technologies such as FinFET have given rise to several new wearout concerns due to new effects such as self-heating [6]. Besides the technology scaling factors, wearout issues also become more pronounced from an application perspective. In emerging applications like the Internet of Things (IoT) or wearables, where circuits usually work in the near/sub-threshold region for ultra low power (ULP) operation, the sensitivity of transistor ON current to threshold voltage is much higher than in super-threshold regimes. Also, driven by market and application demands, these devices usually have very strict resiliency requirements [1] and require long lifetimes. For example, some biomedical applications require a lifetime of more than 50 years for medical implants [3]. Finally, many of these devices need to operate in extreme environmental conditions, such as high temperatures, which further accelerate wearout.

Wearout phenomena affect all parts of a system. In general, at the transistor level, Bias Temperature Instability (BTI) is one of the most prominent wearout mechanisms [2], [4]. It is characterized by an increase of the absolute value of the threshold voltage |Vth| and a reduction of the carrier mobility (µ). In the metal layers, Electromigration (EM) is the dominant reliability threat; it increases the wire resistance over time (soft wearout) and can ultimately break the wire (hard failure). EM is especially critical for power delivery networks (PDN) in modern ICs [5], [7]. These two wearout effects conspire to worsen system metrics such as performance, and can lead to timing errors at the circuit level and, ultimately, failures at the system level.

The most common solution for wearout issues is adding margins at design time (pre-fabrication). Specifically, for BTI, upsizing the transistors or stretching the clock is widely used. EM effects are mainly addressed by design rules (e.g. metal width requirements) during the physical design phase. However, predicting the margin under dynamic workloads and changing operating conditions is very difficult and often infeasible; therefore, worst-case estimation is commonly used, but this leads to conservative overdesign, which can significantly sacrifice performance and increase area, power and cost. Adaptive post-silicon techniques appear to be more “economic” in terms of costs and margins because they compensate for wearout at run-time. Previous works have proposed novel BTI and EM sensors to track and monitor wearout, so that several knobs can be adjusted correspondingly. Such knobs can be clock frequency, supply voltage or body bias [8], [9]. Although the dynamic margins enabled by these solutions can guarantee that the circuit keeps functioning in the presence of wearout, the wearout itself means that the power/performance metrics degrade, and the system gradually becomes sluggish or burns more power. Thus, a solution that can fundamentally fix wearout instead of compensating for its effects would clearly be preferable.

It has been known that the effects of both BTI and EM wearout recover passively when the stress (voltage or current stress) is removed [2], [5]. Because passive recovery is very slow and unpredictable, and can only “relieve” wearout, a permanent portion of wearout still keeps accumulating [10]. In this paper, we propose that recovery for both wearout mechanisms can be further activated by reversing the direction of the stress and, additionally, can be accelerated (e.g. by increasing the temperature). Based on actual hardware measurement results, we demonstrate both significant improvements in the recovery rate and avoidance of the permanent portion of wearout. To enable the proposed recovery techniques, an on-chip implementation that is able to activate both BTI and EM recovery is presented.

II. BACKGROUND AND PRIOR WORK

A. BTI and EM Recovery Mechanisms

Although a consensus has still not been reached regarding the exact physical mechanisms that cause wearout (especially for BTI), it is now widely accepted that BTI is induced by traps at the Si-SiO2 interface and in the gate oxide [2], [4]. As shown in Fig. 1(a), when a transistor is under stress, traps are able to capture charge and cause a threshold voltage shift. When the transistor is in the passive recovery phase (no stress), some of the interface traps anneal slowly (known as the de-trapping process), and the number of occupied traps reaches a new equilibrium. Since the stress process exhibits a non-negligible permanent component, the attainable level of recovery is limited [11]. A somewhat similar degradation also happens to the power rails, as shown in Fig. 1(b): when a current flows through a metal wire, the conducting electrons produce an electron wind and exchange momentum with the constituent metal atoms [5], [12]. This momentum exchange leads to a flux of metal atoms that can create voids and cause an uneven redistribution of resistance. If the voids keep growing (known as void growth), they can eventually cause an open circuit. EM passive recovery happens when no current flows in the metal: the electron-wind-induced stress is relieved to a certain level, but cannot be fully released [13].

Fig. 1: BTI and EM Mechanisms: (a) BTI Stress and Passive Recovery; (b) EM Stress and Passive Recovery.

B. Prior Work on BTI and EM Recovery

Since both wearout effects are partially recoverable, this property has previously been utilized to improve the lifetime and other metrics (e.g. performance) of the system. For BTI recovery, several methods [14], [15] were proposed to rebalance the signal probabilities to maximize the passive recovery time. An alternative method was to adaptively tune the performance according to the degree of wearout so that certain blocks could start the recovery phase earlier [16]. Since passive recovery is much slower than the wearout process, recovery boost for SRAM arrays was introduced in [17]; the idea was to raise the gate voltages of a memory cell in order to put the PMOS devices into a recovery enhancement mode. As these works address the SRAM cell at the circuit and architectural levels through modeling and simulation, it remained unclear how much benefit recovery boost could achieve, owing to the lack of experimental data. Several recent works [18], [19] have studied the irreversible components of BTI at the device level. However, these works focused only on demonstrating and modeling the permanent component; a solution that can fundamentally repair the irreversible wearout is still missing in the field. Wafer-level and transistor-level experiments and theory [20] indicate that BTI recovery depends strongly on temperature; these works thus provide physical evidence for our proposed active recovery solutions.

The recovery effect of EM under AC stress was first studied in [21]; the experimental results show that the lifetime increases with the frequency. This effect was further analyzed in [22], which shows that healing can increase the lifetime by several orders of magnitude depending on the metal used. However, [13] indicates that EM is not fully recovered even under a pulse current of opposite polarity, which means that EM also has an irreversible component. [5], [12] suggest from a physics perspective that high temperature can lead to faster and more complete recovery, but these studies are simulation based and present no experimental results. In this paper, we investigate how recovery can be accelerated by high temperature and activated by reverse stress for both BTI and EM wearout, based on actual measurements. Furthermore, we study the extent of the irreversible components and the frequency dependence of wearout and recovery. The goal is to fully alleviate or avoid both BTI and EM wearout through effective deep healing techniques.

III. ACTIVATING RECOVERY BY “REVERSING” THE DIRECTIONS OF BTI AND EM WEAROUT

A. Activate the Recovery

In this paper, we postulate that BTI and EM recovery can be further activated and accelerated beyond passive recovery, and that systems can effectively use their sleep time (e.g. intrinsic OFF periods or scheduled OFF time) as active healing periods that are essential for their overall performance, much as sleep is in the biological world. During sleep periods, several active recovery solutions can be applied; they are shown in Fig. 2. In both the BTI (a) and EM (b) cases, passive recovery (No. 1) is treated as the baseline. Unlike passive recovery, where the stress is merely removed, BTI recovery can be made active by driving the transistor further into the OFF state with a negative voltage across the source and gate (No. 2). High temperature increases the kinetic energy of the charge carriers and thus accelerates recovery (No. 3). Negative voltage and high temperature applied together (No. 4) are able to deeply rejuvenate the circuit. Similarly, for EM, the direction of the current can be reversed to assist the electron back flow (active recovery), and high temperature can accelerate the recovery. To validate these hypotheses, we conduct hardware tests and study the recovery behaviors of both wearout phenomena comprehensively; details are presented in the next sections.

B. Experimental Setup

The recovery behaviors for BTI are studied on commercial FPGA chips that are based on 2-input Look-Up Tables (LUTs) and fabricated in the 40nm node. The test structure is a 75-stage LUT-mapped ring oscillator; the change in oscillation frequency is captured during BTI wearout and recovery. For EM, we conduct experiments on a set of on-chip “long” and “narrow” metal wires (with probe pads) that are fabricated in 0.18µm technology. Fig. 3 shows the die photo and the dimensions of the metal wire. The metal wire is fabricated in the highest metal layer (M6) of the technology in a dual-damascene process. The resistance change is measured during the stress and recovery phases. Temperature in both test cases is controlled by a thermal chamber with a fluctuation of ±0.3°C. All tests are carried out on “fresh” devices that have not been powered ON before.

No.   Recovery type          (a) BTI: Vsg   (b) EM: I   Temperature
1     Passive                0              0           room
2     Active                 negative       negative    room
3     Accelerated            0              0           high
4     Accelerated & Active   negative       negative    high

Fig. 2: Illustration of “wearout reversing” techniques for BTI and EM wearout: (a) Activating BTI Recovery; (b) Activating EM Recovery. No. 1 is the passive recovery case, which is used as the baseline, and No. 2, 3, 4 are the proposed active and accelerated recovery solutions.

Technology         180nm
Material           Copper
Thickness          0.8um
Length             2.673mm
Width              1.57um
Resistance (@rt)   35.76 Ω

Fig. 3: Die photo with the test structure for EM recovery: on-chip “long” and “narrow” metal wires (with probe pads) and their dimensions.
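As a quick sanity check on the EM test structure, the wire dimensions listed with Fig. 3 can be combined with the current density quoted for the EM experiments (7.96MA/cm2 in the captions of Fig. 5 and Fig. 6). The sketch below is not part of the original measurements; it assumes a rectangular wire cross-section with uniform current flow, and the function names are illustrative.

```python
def cross_section_cm2(width_um: float, thickness_um: float) -> float:
    """Area of a rectangular wire cross-section in cm^2 (1 um = 1e-4 cm)."""
    return (width_um * 1e-4) * (thickness_um * 1e-4)


def stress_current_a(j_ma_per_cm2: float, width_um: float, thickness_um: float) -> float:
    """Current in A for a current density in MA/cm^2, assuming uniform flow."""
    return j_ma_per_cm2 * 1e6 * cross_section_cm2(width_um, thickness_um)


def implied_resistivity_ohm_m(resistance_ohm: float, length_mm: float,
                              width_um: float, thickness_um: float) -> float:
    """Resistivity implied by R = rho * L / A for a uniform rectangular wire."""
    area_m2 = (width_um * 1e-6) * (thickness_um * 1e-6)
    return resistance_ohm * area_m2 / (length_mm * 1e-3)


# Values taken from the dimension list of Fig. 3 and the Fig. 5 caption.
WIDTH_UM, THICKNESS_UM, LENGTH_MM = 1.57, 0.8, 2.673
R_ROOM_OHM = 35.76
J_MA_PER_CM2 = 7.96

current = stress_current_a(J_MA_PER_CM2, WIDTH_UM, THICKNESS_UM)
rho = implied_resistivity_ohm_m(R_ROOM_OHM, LENGTH_MM, WIDTH_UM, THICKNESS_UM)
print(f"stress current: {current * 1e3:.1f} mA")
print(f"implied resistivity: {rho:.3e} ohm*m")
```

Under these assumptions the stress current comes out at about 100 mA, and the implied resistivity at about 1.68e-8 Ω·m, which is close to the commonly quoted room-temperature value for bulk copper.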

C. BTI Active and Accelerated Recovery Experimental Results

The BTI experiment and measurement results presented in this part were previously published in [10], [11]. Table I shows one group of measurements, in which 72.4% of the wearout is recovered within only 1/4 of the stress time by applying both high temperature and negative voltage. The measurement results are also compared against the analytical model [10]. Our experiments further reveal that even under the high temperature and negative voltage recovery condition (No. 4), there is still a permanent component (>27%) that cannot be recovered even when the recovery period is extended (to much longer than 6 hours). To remove this component, we proposed that periodic scheduled recovery (instead of one-time recovery) can fully eliminate or avoid the permanent BTI wearout. Our experiments demonstrate this successfully; the results are shown in Fig. 4, where the same recovery period (under the same condition as test case No. 4) is scheduled after accelerated stress. The permanent BTI component under a schedule of 1 hour of stress vs. 1 hour of active accelerated recovery is practically 0, which leads to full recovery. We conclude that BTI recovery can be activated and accelerated significantly, and that there is a balance of stress and recovery (e.g. 1hr vs. 1hr in Fig. 4) which can bring the aged system back to almost fresh status. In Section IV, we discuss in detail how to utilize these BTI recovery behaviors.

TABLE I: Summary of the BTI recovery test results for a 6-hour recovery following a 24-hour constant accelerated stress with high voltage and temperature (%: recovery percentage; test case numbers correspond to Fig. 2a)

Test Case   Recovery Condition   Measurement   Model
No. 1       20°C and 0V          0.66%         1%
No. 2       20°C and -0.3V       16.7%         14.4%
No. 3       110°C and 0V         28.7%         29.2%
No. 4       110°C and -0.3V      72.4%         72.7%

Fig. 4: Measurement results that show how BTI permanent components accumulate over time under different stress vs. recovery patterns (recovery condition is the same as in No. 4; each cycle consists of accelerated stress followed by active and accelerated recovery, and Cx marks the end of the xth cycle): under the 1 hour vs. 1 hour case, the permanent component is almost 0.

Fig. 5: Measurement results (resistance in ohm vs. time in min) for EM degradation and recovery under passive recovery (Fig. 2b No. 1) and the proposed recovery conditions (Fig. 2b No. 4, at 230°C and ±7.96MA/cm2) during the void growth phase: there is still a permanent component even under accelerated and active recovery.
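The figures in Table I can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the percentages from the table; the dictionary layout and helper names are illustrative and not part of the original work.

```python
# Recovery percentages from Table I (6-hour recovery after 24-hour stress).
TABLE_I = {
    1: {"condition": "20 C, 0 V", "measured": 0.66, "model": 1.0},
    2: {"condition": "20 C, -0.3 V", "measured": 16.7, "model": 14.4},
    3: {"condition": "110 C, 0 V", "measured": 28.7, "model": 29.2},
    4: {"condition": "110 C, -0.3 V", "measured": 72.4, "model": 72.7},
}


def unrecovered_pct(recovered_pct: float) -> float:
    """Share of the wearout that is still present after recovery, in percent."""
    return 100.0 - recovered_pct


def gain_over_passive(case: int) -> float:
    """Measured recovery of a test case relative to passive recovery (No. 1)."""
    return TABLE_I[case]["measured"] / TABLE_I[1]["measured"]


for case, row in TABLE_I.items():
    gap = row["model"] - row["measured"]  # model minus measurement, in points
    print(f"No. {case} ({row['condition']}): "
          f"unrecovered {unrecovered_pct(row['measured']):.2f}%, "
          f"model gap {gap:+.2f} points, "
          f"{gain_over_passive(case):.1f}x passive recovery")
```

For test case No. 4 the unrecovered share is 27.6%, which matches the permanent component of more than 27% quoted in the text.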

D. EM Active and Accelerated Recovery Experimental Results

Fig. 5 shows the measured EM-induced resistance change under accelerated stress and recovery with a relatively high constant current density and elevated temperature. During the stress phase, the results indicate that the EM evolution consists of two distinct phases: the void nucleation phase and the void growth phase. During the nucleation phase, the EM-induced stress increases until it hits a critical value, at which voids are generated; before this point, the resistance barely changes. Following the void nucleation phase, the generated voids start growing and lead to an increased resistance over time. Our experimental results agree with the measured data in [21], [23], and are also consistent with what is predicted by recently proposed physics-based EM models [5], [12].

Fig. 6: Measurement results (resistance in ohm vs. time in min) for EM accelerated and active recovery during the early period of the void growth phase (at 230°C and ±7.96MA/cm2): full recovery.

Fig. 7: Measurement results (resistance in ohm vs. time in min) for scheduled periodic recovery intervals during the void nucleation phase: it takes much longer for voids to nucleate, and the overall TTF is extended.

During the active recovery phase, a reverse current (with the same absolute value as in the stress phase) and elevated temperature are applied. Fig. 5 shows that the activated recovery is much faster than passive recovery, and that more than 75% of the EM wearout can be recovered within 1/5 of the stress time. However, there is still a lingering permanent component, similar to what we saw in the initial BTI wearout measurements. This suggests that a scheduling strategy similar to the one used in the BTI recovery case can be applied to EM in the hope of reducing, or even eliminating, the permanent component of EM; Fig. 6 demonstrates exactly this. The results show that by scheduling the recovery phase in the early phase of void growth, EM can also be fully recovered. A potential issue with scheduling recovery during void growth is that (reverse) current still flows through the metal during recovery; this can itself induce EM in the opposite direction (shown in the figure), and thus adds uncertainty about the ultimate effect. A more “economic” way is to schedule the recovery periodically before void nucleation happens. The results of this strategy are shown in Fig. 7, where multiple short recovery intervals are scheduled in the early phase of EM stress evolution; this delays void nucleation by a significant amount of time (almost 3× slower compared to Fig. 5). In this way, the overall time-to-failure (TTF) can also be significantly extended.
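Statements such as "more than 75% of the EM wearout can be recovered" can be read as a ratio of resistance changes. The sketch below shows one plausible way to compute such a percentage from three resistance readings; the definition is an assumption (the text does not spell out the exact formula), and the readings in the usage example are round placeholder numbers, not measured data.

```python
def recovery_pct(r_fresh: float, r_peak: float, r_after: float) -> float:
    """Recovered share of the EM-induced resistance increase, in percent.

    r_fresh: resistance before stress
    r_peak:  resistance at the end of the stress phase
    r_after: resistance at the end of the recovery phase
    """
    degradation = r_peak - r_fresh
    if degradation <= 0:
        raise ValueError("no resistance increase to recover from")
    return 100.0 * (r_peak - r_after) / degradation


def permanent_pct(r_fresh: float, r_peak: float, r_after: float) -> float:
    """Share of the resistance increase that remains after recovery."""
    return 100.0 - recovery_pct(r_fresh, r_peak, r_after)


# Placeholder readings in ohm (hypothetical, chosen for easy arithmetic).
print(recovery_pct(100.0, 104.0, 101.0))   # 3 of 4 ohm recovered
print(permanent_pct(100.0, 104.0, 101.0))  # 1 of 4 ohm remains
```

With this definition, a full recovery as in Fig. 6 corresponds to the resistance returning to its pre-stress value, and a value above 100% would indicate that reverse-current EM has started to act in the opposite direction.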

E. Summary on Experimental Results

Based on extensive accelerated tests, we conclude that both BTI and EM recovery can be further activated and accelerated significantly, and that both share a common recovery behavior: the “Push-Pull” stress/active recovery compensation, in which in-time scheduled periodic recovery intervals are able to fully eliminate the permanent wearout component. BTI active recovery needs to take place in an OFF period, whereas EM active recovery happens during an ON period, when reverse current is flowing; this opens new opportunities for scheduling both kinds of recovery over the whole lifetime with the proper circuit solutions, which are discussed in detail in the following section.
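The periodic "Push-Pull" pattern described above can be written down as a simple timeline of alternating phases. The sketch below is illustrative only: the function names and the tuple layout are assumptions, and it describes a schedule such as the 1 hour vs. 1 hour case of Fig. 4 without modeling the wearout itself.

```python
from typing import List, Tuple

Phase = Tuple[float, float, str]  # (start_h, end_h, phase name)


def push_pull_schedule(stress_h: float, recovery_h: float, cycles: int) -> List[Phase]:
    """Alternate stress and active-recovery intervals for a number of cycles."""
    timeline: List[Phase] = []
    t = 0.0
    for _ in range(cycles):
        timeline.append((t, t + stress_h, "stress"))
        t += stress_h
        timeline.append((t, t + recovery_h, "active_recovery"))
        t += recovery_h
    return timeline


def recovery_share(stress_h: float, recovery_h: float) -> float:
    """Fraction of each cycle that is spent in active recovery."""
    return recovery_h / (stress_h + recovery_h)


for start, end, phase in push_pull_schedule(1.0, 1.0, cycles=2):
    print(f"{start:4.1f}h to {end:4.1f}h: {phase}")
print(f"recovery share per cycle: {recovery_share(1.0, 1.0):.0%}")
```

A 1:1 pattern spends half of every cycle in recovery under accelerated test conditions; which stress-to-recovery ratio is sufficient in the field is exactly the balance that the measurements explore.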

IV. IMPLEMENTATIONS

A. Assist Circuitry for Activating BTI and EM Recovery

Since power rails mostly carry single-direction DC current [5], [12], we focus only on EM-induced effects in the power delivery network in this paper. The circuit scheme presented in this section is inspired by the concept proposed in [7], [24], [25]. The difference between this work and previous solutions is that our scheme supports both EM and BTI active recovery modes; we also discuss physical implementations and potential system-level integration, which are missing in the literature. Fig. 8(a) shows the schematic of the assist circuitry, which supports three modes (Normal, EM Active Recovery and BTI Active Recovery). In Normal operating mode, the load works similarly to a regular power-gated system. During EM Active Recovery, the current flowing through the VDD and VSS grids is reversed; the symmetry of the scheme guarantees that the current keeps the same absolute value, so the load (target circuit) still functions as in Normal mode. BTI Active Recovery happens when the load is idle, during which the VDD and VSS of the load are switched. Depending on the input values, NBTI or PBTI recovery can be activated; this is shown in Fig. 8(c).
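The three operating modes lend themselves to a small mode-selection policy. The sketch below is one hypothetical policy, not the control logic of the assist circuitry: the flags `load_idle`, `em_recovery_due` and `bti_recovery_due` are assumed inputs (for example from a scheduler or wearout sensors), and the transistor-level truth table of Fig. 8(b) is deliberately not reproduced.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "Normal"
    EM_ACTIVE_RECOVERY = "EM Active Recovery"
    BTI_ACTIVE_RECOVERY = "BTI Active Recovery"


def select_mode(load_idle: bool, em_recovery_due: bool, bti_recovery_due: bool) -> Mode:
    """Pick an operating mode for the assist circuitry (hypothetical policy).

    BTI active recovery swaps the load's VDD and VSS, so it is only chosen
    while the load is idle. EM active recovery reverses the grid current while
    the load keeps running, so it is only chosen while the load is active.
    """
    if load_idle:
        return Mode.BTI_ACTIVE_RECOVERY if bti_recovery_due else Mode.NORMAL
    if em_recovery_due:
        return Mode.EM_ACTIVE_RECOVERY
    return Mode.NORMAL


print(select_mode(load_idle=True, em_recovery_due=True, bti_recovery_due=True).value)
print(select_mode(load_idle=False, em_recovery_due=True, bti_recovery_due=True).value)
print(select_mode(load_idle=False, em_recovery_due=False, bti_recovery_due=False).value)
```

In this policy a pending EM recovery is simply deferred while the load is idle, because no load current flows through the grid in that state; other arbitration rules are equally conceivable.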

To validate the design, we implemented and simulatedthe assist circuitry in 28nm FD-SOI technology. A set ofring oscillators running in parallel was used as the load,the VDD/VSS grid was treated as a resistor for which wepicked a reasonable value based on the published literature.Fig. 9 presents the functionality simulation under three dif-ferent modes, under BTI Active Recovery mode, the VDDand VSS nodes of the load are switched as expected, andthere is about 0.2V voltage droop/increase induced by thepass transistors, but the voltage is still large enough foractivating BTI recovery (-0.816V is much higher than -0.3Vused in our experiment in Section III-C). One of the biggestchallenges of the assist circuitry is the voltage droop/increase

Page 5: Deep Healing: Ease the BTI and EM Wearout Crisis by ...people.virginia.edu/~xg2dt/papers/SELSE2017_EM.pdf(BTI) is one of the most prominent wearout mechanisms [2], [4]. It is characterized

(b)BTI Active RecoveryEM Active RecoveryNormal Operation

CurrentDirection

VDD Grid

Load

VSS Grid

VDD

VSS

P1

P2

N1

P3

P4

N2 N3

N4

VSS+∆V

VDD-∆V

“1”

Negative

>VDD- >VDD-

Normal

P1

P2

P3

P4

N1

ON

OFF

OFF

ON

ON

N2 OFF

N3

N4

OFF

ON

OFF

ON

ON

OFF

OFF

ON

ON

OFF

DeviceMode

ON

OFF

ON

OFF

ON

OFF

ON

OFF

(a)

(c) Activate NBTI Recovery

Vth 2Vth

Voltage

EM Active Recovery

BTIActive Recovery

Load VDD

Load VSS

Fig. 8: Assist circuitry for activating BTI and EM recovery: (a) Themain circuitry, arrows represent the current direction under differentmodes, VDD and VSS pins can be connected to the on-chip voltageregulator directly, or to the global power delivery network; (b) Truthtable for three operating modes; (c) An example of activating NBTIrecovery under BTI Active Recovery mode, for PBTI recovery, theinput needs to be “0”, ∆V represents voltage droop/increase or noise.

-6.00E-04

-4.00E-04

-2.00E-04

0.00E+00

2.00E-04

4.00E-04

6.00E-04

1.00E-08 1.20E-08 1.40E-08

VD

D G

rid

Cu

rre

nt (

A)

Time (s)

Normal Mode

EM Active Recovery Mode

(a)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.50E-08 1.70E-08 1.90E-08

Vo

ltag

e (

V)

Time (s)

BTI Active Recovery Mode

Load VSS node

Load VDD node

~ 0.816V

~ 0.223V

V: 0.2 ~ 0.3V

(b)

Fig. 9: Functionality simulation in 28nm FDSOI: (a) The currentdirection is reversed under EM Active Recovery Mode, and the currentvalue is still the same; (b) Under BTI Active Recovery Mode, loadVDD and VSS values are switched.

1

1.2

1.4

1.6

1.8

1 2 3 4 5

Nor

mal

ized

Del

ay

Load Size

Load Delay

Switching Time

Fig. 10: Load Size vs. Performance and Switching time: Increasingthe number of loads will reduce the performance as well as theswitching time between modes, to compensate the degradation,header/footer transistors need to be upsized, which will furtherincrease the area.

at the load VDD/VSS nodes that are introduced by the header/footer transistors during Normal operation and EM active recovery mode, during which performance is critical. Another potential concern is the switching time (retention time) between modes. Since both metrics depend on the load, we explore how load size affects them. Fig. 10 shows that as load size increases, performance degrades linearly because of the voltage drop/increase across the footer/header transistors. Switching time also decreases with increased load, but at a slower rate. To compensate for this performance degradation, the header/footer transistors need to be upsized, which results in more area. This study indicates that each load has its own optimal design point that balances area against the other metrics.

[Figure 11 diagram labels: metal layers M2 to M10, C4 bump, VDD_PAD, VDD via tower, global PDN, VDD grid (EM hazards), devices P1 to P4 and load (BTI), connections to the VSS grid.]

Fig. 11: Vertical cross section of the physical implementation of the assist circuitry for the VDD Grid (the VSS Grid is similar): EM hazards happen in high-current-density regions, which could be caused by faster switching activity in the load logic; at the logic level, BTI hazards happen due to continuous stress.
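The load-size trade-off described above can be illustrated with a first-order sketch. It is not fitted to the measured data in Fig. 10: the header/footer device is modeled as a resistance that scales inversely with its width, delay follows the textbook alpha-power law (which is close to linear in the voltage drop only for small drops), and every parameter value (`i_unit`, `r_unit`, `vth`, `alpha`) as well as the function names are hypothetical.

```python
def effective_supply(vdd, n_loads, width, i_unit=1e-4, r_unit=200.0):
    """Rail-to-rail voltage seen by the load after header and footer IR drops."""
    drop = n_loads * i_unit * (r_unit / width)  # drop across one device
    return vdd - 2.0 * drop                      # header and footer both count


def normalized_delay(v_eff, vdd, vth=0.35, alpha=1.3):
    """Alpha-power-law delay at v_eff relative to the delay at full vdd."""
    if v_eff <= vth:
        raise ValueError("effective supply is at or below threshold")

    def delay(v):
        return v / (v - vth) ** alpha

    return delay(v_eff) / delay(vdd)


def min_width_for_budget(n_loads, budget, vdd=1.0, max_width=64):
    """Smallest header/footer width (in unit devices) meeting the delay budget."""
    for width in range(1, max_width + 1):
        v_eff = effective_supply(vdd, n_loads, width)
        try:
            if normalized_delay(v_eff, vdd) <= budget:
                return width
        except ValueError:
            continue  # too much drop at this width; try a wider device
    return None  # no width up to max_width meets the budget
```

Under these assumptions, a larger load needs a wider header/footer to stay within the same delay budget, which is the area cost noted in the text; the per-load optimum depends on how that area is weighed against delay and switching time.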

Fig. 11 gives an example of a physical implementation of the VDD Grid with the assist circuitry integrated (10 metal layers are assumed). It has a global PDN grid, which is usually built with the top one or two metals that are wide and thick, thus being relatively robust against EM. Local VDD/GND grids that are close to logic and use the lower metal layers are more EM sensitive; this implementation will be able to protect the local grids and also enable the flexibility of designing localized assist circuitry for individual loads. The structure is very similar to a conventional power gated PDN, on top of which we add one more layer of header/footer. Since power gating techniques are widely used, this implementation makes it easy to integrate the assist circuitry into the existing design flow.

B. Implications at the System Level

The recent shift of architecture to heterogeneous and many-core systems significantly increases the number of integrated cores. Specialized computing resources serve different load tasks, which also leads to different EM and BTI behaviors, thus requiring different recovery strategies. At the system level, localized active recovery at the core level or block level can contain the cost while rejuvenating the “aged” system. Fig. 12(a) illustrates a potential system with the localized active recovery techniques. Each square represents a core or logic block with a local PDN and can have a different recovery strategy. Meanwhile, “Dark Silicon” remains a big challenge in these systems [26]. The “dark” parts of the chip usually lead to some “redundant” resources which have intrinsic OFF periods, and these resources can be a single core or a subset of the cores. Since we have demonstrated that high temperature accelerates the recovery of both wearout mechanisms, the recovery can be sped up further if these redundant resources (e.g., the core located in the center of the system shown in the figure) are scheduled and allocated in such a way that they are healed by the heat generated by the neighboring active elements.

[Figure 12 diagram: (a) C4 bumps, global VDD and VSS grids (top metal layers), assist circuitry, logic and local VDD/VSS grids; blocks marked Normal Operation, EM Active Recovery, BTI Active Recovery, and OFF, with heat flow; (b) performance vs. time from time 0 to the original lifetime target, showing BTI-induced wearout, BTI/EM sensing, short BTI active recovery intervals, EM active recovery, the worst-case margin, and the new design margin.]

Fig. 12: System-level implementation: (a) Illustration of an on-chip PDN that supports both BTI and EM active recovery: local VDD/VSS grids (most EM-sensitive) are connected to the global grids through the assist circuitry; depending on the application, multiple recovery modes can be enabled, and the heat generated by the neighboring logic can be utilized to accelerate the BTI recovery; (b) Illustration of periodic scheduled EM/BTI active recovery.

Fig. 12(b) presents an example of run-time scheduling for BTI and EM active recovery. In the early lifetime, since EM-induced stress has not reached the nucleation threshold, the performance degradation is caused mainly by BTI; novel BTI and EM sensors can be employed to track wearout and feed back the run-time degradation information. Short intervals of BTI active recovery can then be inserted to bring the chip back to the fresh status in time; during these intervals, certain states need to be held in retention mode, or alternatively, the workload can be shifted to other redundant resources. The EM Active Recovery period can be scheduled either from when void nucleation happens or even earlier. Based on the measurement results presented in Section III-D, early recovery is more economical and efficient, and the system is still in operation during the EM recovery interval, so the EM active period can be scheduled alternately with normal operation with a small switching overhead. Overall, such a scheduling strategy can potentially fully recover both the BTI and EM wearout, such that the system always runs in a “refreshing” mode; the necessary wearout guardbands can then be significantly reduced as well.
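The run-time scheduling idea of Fig. 12(b) can be sketched as a simple sensor-driven loop. This is an illustrative toy, not the authors' scheduler: wearout is tracked as abstract integer counters rather than calibrated sensor readings, all rates and thresholds are hypothetical, a BTI recovery interval is assumed to restore the fresh state completely when triggered in time, and EM stress is assumed to stay unchanged during a BTI recovery interval.

```python
def schedule(steps, bti_rate=1, bti_trigger=5, em_rate=1, em_trigger=3):
    """Return a list of (mode, bti, em) tuples, one per time step.

    bti and em are abstract wearout counters standing in for sensor
    readings; em_trigger is meant to sit below the nucleation threshold
    so that EM recovery starts early.
    """
    bti, em = 0, 0
    trace = []
    for _ in range(steps):
        if bti >= bti_trigger:
            # Short BTI active recovery interval: the load is in
            # retention (or its work is migrated), and BTI is assumed
            # to be fully healed. EM stress is left unchanged here.
            mode = "bti_recovery"
            bti = 0
        elif em >= em_trigger:
            # EM active recovery: grid current is reversed while the
            # logic keeps running, so BTI stress still accumulates.
            mode = "em_recovery"
            em = max(0, em - em_rate)
            bti += bti_rate
        else:
            mode = "normal"
            bti += bti_rate
            em += em_rate
        trace.append((mode, bti, em))
    return trace
```

In this sketch, once the EM counter reaches its trigger, EM recovery steps alternate with normal operation, and both counters stay bounded by their triggers. That bounded wearout is the property that would let the guardband shrink; the actual bound depends on the real recovery dynamics measured earlier in the paper.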

V. CONCLUSIONS AND FUTURE WORK

As BTI and EM wearout effects become more critical, novel techniques that are able to mitigate them with lower overhead are highly desirable. In this paper, we propose one such candidate solution that addresses both wearout mechanisms in a fundamental way. Based on hardware measurements, we demonstrate that BTI and EM recovery can be activated and accelerated, and the permanent components can be effectively eliminated by optimal scheduling. To fully enable the utilization of the explored recovery behaviors, we present an assist circuitry scheme and discuss the implementation details at both the circuit and system levels. As future work, we will continue to develop high-level compact models that capture accurate device- and circuit-level BTI/EM recovery information while remaining applicable at the architectural and system levels; this will enable an enhanced design methodology that integrates active recovery as an effective design knob for system-level design.

ACKNOWLEDGMENT

This work was supported in part by NSF CCF-1255907, SRC 2410.001 and C-FAR, one of six SRC STARnet Centers, sponsored by MARCO and DARPA. The authors would like to thank Mr. Linqiang Luo for helping with wire bonding.

REFERENCES

[1] R. Aitken et al., “Resiliency challenges in sub-10nm technologies,” in IEEE 33rd VTS. IEEE, 2015, pp. 1–4.

[2] S. Mahapatra, Fundamentals of Bias Temperature Instability in MOS Transistors. Springer, 2016.

[3] J. Franco et al., “BTI reliability of ultra-thin EOT MOSFETs for sub-threshold logic,” Microelectronics Reliability, vol. 52, no. 9, pp. 1932–1935, 2012.

[4] Y. Cao et al., “Cross-layer modeling and simulation of circuit reliability,” IEEE TCAD, vol. 33, no. 1, pp. 8–23, 2014.

[5] X. Huang et al., “Dynamic electromigration modeling for transient stress evolution and recovery under time-dependent current and temperature stressing,” Integration, the VLSI Journal, 2016.

[6] C. Prasad et al., “Self-heat reliability considerations on Intel’s 22nm tri-gate technology,” in IEEE IRPS, 2013.

[7] D. C. Sekar et al., “Electromigration Resistant Power Delivery Systems,” IEEE Electron Device Letters, vol. 28, no. 8, pp. 767–769, Aug 2007.

[8] E. Mintarno et al., “Self-tuning for maximized lifetime energy-efficiency in the presence of circuit aging,” IEEE TCAD, vol. 30, no. 5, pp. 760–773, 2011.

[9] S. Narang and A. P. Srivastava, “NBTI detection methodology for building tolerance with respect to NBTI effects employing adaptive body bias,” in IEEE ICCPCT. IEEE, 2015, pp. 1–7.

[10] X. Guo, W. Burleson, and M. Stan, “Modeling and Experimental Demonstration of Accelerated Self-healing Techniques,” in DAC, 2014.

[11] X. Guo and M. R. Stan, “Work hard, sleep well - Avoid irreversible IC wearout with proactive rejuvenation,” in IEEE ASPDAC. IEEE, 2016, pp. 649–654.

[12] V. Sukharev et al., “Electromigration induced stress evolution under alternate current and pulse current loads,” Journal of Applied Physics, vol. 118, no. 3, p. 034504, 2015.

[13] K.-D. Lee, “Electromigration recovery and short lead effect under bipolar- and unipolar-pulse current,” in IEEE IRPS. IEEE, 2012, pp. 6B–3.

[14] S. Gupta and S. S. Sapatnekar, “GNOMO: Greater-than-NOMinal Vdd operation for BTI mitigation,” in IEEE ASPDAC. IEEE, 2012, pp. 271–276.

[15] J. Abella et al., “Penelope: The NBTI-aware processor,” in IEEE/ACM MICRO. IEEE, 2007, pp. 85–96.

[16] A. Tiwari and J. Torrellas, “Facelift: Hiding and slowing down aging in multicores,” in IEEE/ACM MICRO. IEEE, 2008, pp. 129–140.

[17] J. Shin et al., “A proactive wearout recovery approach for exploiting microarchitectural redundancy to extend cache SRAM lifetime,” in ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 353–362.

[18] T. Grasser et al., “The “permanent” component of NBTI revisited: Saturation, degradation-reversal, and annealing,” in IEEE IRPS. IEEE, 2016, pp. 5A–2.

[19] A. A. Katsetos, “Negative bias temperature instability (NBTI) recovery with bake,” Microelectronics Reliability, vol. 48, no. 10, pp. 1655–1659, 2008.

[20] G. Pobegen et al., “Understanding temperature acceleration for NBTI,” in IEDM, 2011, pp. 27–3.

[21] J. Tao et al., “Modeling and characterization of electromigration failures under bidirectional current stress,” IEEE Transactions on Electron Devices, vol. 43, no. 5, pp. 800–808, 1996.

[22] J. Abella and X. Vera, “Electromigration for microarchitects,” ACM CSUR, vol. 42, no. 2, p. 9, 2010.

[23] M. R. Stan and P. Re, “Electromigration-aware design,” in IEEE ECCTD. IEEE, 2009, pp. 786–789.

[24] J. Abella et al., “Refueling: Preventing Wire Degradation due to Electromigration,” IEEE Micro, vol. 28, no. 6, pp. 37–46, 2008.

[25] A. Bansal and J.-J. Kim, “Power napping technique for accelerated negative bias temperature instability (NBTI) and/or positive bias temperature instability (PBTI) recovery,” Jul. 21 2015, US Patent 9,086,865.

[26] M. Shafique and S. Garg, “Computing in the dark silicon era: Current trends and research challenges,” IEEE Design & Test, 2016.