efficient scrub mechanisms for error-prone emerging memories
DESCRIPTION
Efficient Scrub Mechanisms for Error-Prone Emerging Memories. Manu Awasthi ǂ , Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran ‡ , Viji Srinivasan ‡ ǂ Micron, ⁺ University of Utah, ‡ IBM Research. Executive Summary. - PowerPoint PPT PresentationTRANSCRIPT
Efficient Scrub Mechanisms for Error-Prone Emerging Memories
Manu Awasthiǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran‡, Viji
Srinivasan‡
ǂMicron, ⁺University of Utah, ‡IBM Research
2
Executive Summary
• Future memory hierarchies will likely include NVMs– Focus of this work: Phase Change Memory (PCM)
• Multi level cells in PCM appear imminent• A number of proposals exist to handle hard errors
and lifetime issues of PCM devices• Resistance Drift is a lesser explored phenomenon– Will become primary cause of “soft errors” -- Need to
explore holistic solutions to counter drift
3
Phase Change Memory - MLC
• Chalcogenide material can exist in crystalline or amorphous states
• The material can also be programmed into intermediate states– Leads to many intermediate states, paving way for Multi
Level Cells (MLCs)
Resistance AmorphousCrystalline
(11) (00)(10) (01)
(111) (000)(101) (011) (010) (001) (110) (100)
4
What is Resistance Drift?
Resistance
Time11 10 01 00
A
B
ERROR!!
T0
Tn
5
Resistance Drift - Issues
• Programmed resistance drifts according to power law equation -
• R0, α usually follow a Gaussian distribution• Time to drift (error) depends on – Programmed resistance (R0), and – Drift Coefficient (α)– Is highly unpredictable!!
Rdrift(t) = R0 х (t)α
6
Resistance Drift - How it happens
11 10 01 00
Median case cell• Typical R0• Typical α
Worst case cell• High R0• High α
Scrub rate will be dictated by the Worst Case R0 and Worst Case α
Drift Drift
R0 R0Rt Rt
ERROR!!
Number of Cells
7
Resistance Drift DataCell Type Drift Time at Room
temperature (secs)Median 11 cell 10499
Worst 11 Case cell 1015
Median 10 cell 1024
Worst Case 10 cell 5.94
Median 01 cell 108
Worst Case 01 cell 1.81
(11) (00)(10) (01)
8
Uncorrectable Errors
9
Problem Severity
10
Inadequacy of Device Level Solutions
• Precise writes –> write – compare – write– Can be effective, requires multiple iterations– Increases write time, total write energy, decreases
lifetime• Correcting values during read not effective• Coding techniques– Can help, but cannot eliminate the problem by
themselves
11
Naïve Solution : Scrubbing
• Drift resets with every cell reprogram (write)• Leverage existing error correction mechanisms,
e.g., ECC - has its own drawbacks• A Full Refresh/Scrub (read-compare-write) is
extremely costly in PCM– Each PCM write takes 100 - 1000ns– Writing to a 2-bit cell may consume as much as 1.6nJ– Requires 600 refreshes in parallel
Refresh should be reactionary NOT precautionary!
12
Solution Outline
• A three pronged approach
– Incorporate stronger ECC codes
– Incorporate low-cost error detection mechanisms
– Adapt scrub algorithms that are conservative and adaptive
13
Additional ECC support
• With stronger ECC support, scrub operations can be made infrequent– If E errors can be corrected and E+1 detected, then
scrub operation can be postponed till the Eth error– Can provide support for hard error tolerance as well
• We advocate ECC-8 (E-8)– Correct eight errors, detect nine– Leads to 14.25% storage overhead, for 512 bits
14
• Light Array Reads for Drift Detection– Support for E Error-correcting, E+1
error detecting codes assumed– Lines are read periodically and
checked for correctness– Only after the number of errors
reaches a threshold, scrubbing is performed
– Drift detection has to be simple, on-chip
LARDD
Read Line
Check for Errors
Errors < E
Scrub Line
True
False
After N cycles
15
Read Pipeline with LARDD & Parity
• Simple error detection mechanism to bypass complex BCH circuit – 256 cell line divided up into eight 32-cell fields– Count number of drift prone states, store as parity
information– For every LARDD, check if parity has changed– If true, invoke BCH circuitry
1515(11) (00)(10) (01)
16
Headroom
• Headroom-h scheme – scrub is triggered if E-h errors are detected
† Decreases probability of errors slipping through
– Increases frequency of full scrub and hence decreases life time
• Presents trade-off between Hard and Soft errors
Read Line
Check for Errors
Errors < E-h
Scrub Line
True
False
After N cycles
17
Headroom Gradual
• Start with a small LARDD frequency
• Increase frequency as drift-based errors increase– Decreases energy
overhead– Might cause slight
increase in error rate
Read Line
Check for Errors
Errors <= E-h-g True
False
After N cycles
Double LARDD rate
18
Adaptive Scrub
• Dynamic events can affect reliability– Temperature increases can increase α and decrease drift
time– Cell lifetime/wearout is also an issue– Soft error rate depends on prevalence of drift prone states– Hard errors show up with aging cells
• Start with headroom h policy, get rid of headroom when at least E-h-1 hard errors
• Can increase LARDD frequency when total number of hard errors increases preset threshold
19
Results
Stronger ECC + LARDD support leads to decrease in unrecoverable errors
20
Results - Headroom Schemes
• Compared to ECC-1, for 512 s LARDD interval– ECC-8-headroom-3 leads to 99.6% reduction in error rate, 35.4%
reduction in scrub energy– ECC-8-headroom-3-gradual-1 leads to 96.5% reduction in error rate,
37.8% reduction in scrub energy
21
Device Level Solution – Non Uniform Banding
Resistance
Before
After
Mean R0
11 10 01 00
22
Comparing with Device Level Solutions
Baselin
e ECC-1 (2
.75 σ)
ECC-1 +
Precise W
rite (1
.375 σ)
ECC-1 +
Non-uniform
banding
ECC-8 (2
.75 σ)
ECC-8-head
room-3 (2.75 σ)
0E+00
1E+05
2E+05
Number of Uncorrectable Errors
ECC-8-headroom-3 provides for least number of uncorrectable errors
23
Conclusions
• Resistance drift will exacerbate with MLC scaling• Naïve solutions based on ECC support are costly for
PCM– Increased write energy, decreased lifetimes
• Holistic solutions need to be explored to counter drift at device, architectural and system levels– 39% reduction in energy, 4x less errors, 102x increase in
lifetime
24
Thanks!!
www.cs.utah.edu/arch-research
25
Backup Slides