large-scale study of in-the-field flash failures · factors in platform design (e.g., number of...

148
Large-Scale Study of In-the-Field Flash Failures Onur Mutlu [email protected] (joint work with Justin Meza, Qiang Wu, Sanjeev Kumar) August 10, 2016 Flash Memory Summit 2016, Santa Clara, CA

Upload: others

Post on 01-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Large-Scale Study of In-the-Field Flash Failures

Onur Mutlu [email protected]

(joint work with Justin Meza, Qiang Wu, Sanjeev Kumar)

August 10, 2016 Flash Memory Summit 2016, Santa Clara, CA

Page 2: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Original Paper (I)

Page 3: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Original Paper (II) n  Presented at the ACM SIGMETRICS Conference in June 2015.

n  Full paper for details:

q  Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Portland, OR, June 2015. [Slides (pptx) (pdf)]

q  [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]

q  https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf

Page 4: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A Large-Scale Study ofFlash Memory Errors in the FieldJustin Meza Qiang Wu Sanjeev Kumar Onur Mutlu

Page 5: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Overview

First study of flash reliability:▪ at a large scale

▪ in the field

Page 6: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

SSD lifecycle

Readdisturbance

Temperature

Overview

New reliability

trends

Page 7: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Readdisturbance

Temperature

New reliability

trends

SSD lifecycle

Overview

Early detection lifecycle period distinct from hard disk drive lifecycle.

Page 8: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Temperature

New reliability

trends

SSD lifecycle

Readdisturbance

Overview

We do not observe the effects of read disturbance errors in the field.

Page 9: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

New reliability

trends

SSD lifecycle

Readdisturbance

Temperature

Overview

Throttling SSD usage helps mitigate temperature-induced errors.

Page 10: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

New reliability

trends

SSD lifecycle

Readdisturbance

Temperature

Access patterndependence

Overview

We quantify the effects of the page cache and write amplification in the field.

Page 11: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

▪ background and motivation ▪ server SSD architecture ▪ error collection/analysis methodology ▪ SSD reliability trends ▪ summary

Outline

Page 12: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Background and motivation

Page 13: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

▪ persistent▪ high performance▪ hard disk alternative▪ used in solid-state drives (SSDs)

Flash memory

Page 14: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

▪ persistent▪ high performance▪ hard disk alternative▪ used in solid-state drives (SSDs)▪ prone to a variety of errors

▪ wearout, disturbance, retention

Flash memory

Page 15: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Prior  Flash  Error  Studies  (I)  1.  Overall  flash  error  analysis

-­‐  Yu  Cai,  Erich  F.  Haratsch,  Onur  Mutlu,  and  Ken  Mai,  Error  Pa3erns  in  MLC  NAND  Flash  Memory:  Measurement,  CharacterizaBon,  and  Analysis,  DATE  2012.  

-­‐  Yu  Cai,  Gulay  Yalcin,  Onur  Mutlu,  Erich  F.  Haratsch,  Adrian  Cristal,  Osman  Unsal,  and  Ken  Mai,  Error  Analysis  and  RetenBon-­‐Aware  Error  Management  for  NAND  Flash  Memory,  Intel  Technology  Journal  2013.  

2.  Program  and  erase  cycling  noise  analysis

-­‐  Yu  Cai,  Erich  F.  Haratsch,  Onur  Mutlu,  and  Ken  Mai,  Threshold  Voltage  DistribuBon  in  MLC  NAND  Flash  Memory:  CharacterizaBon,  Analysis  and  Modeling,  DATE  2013.  

Page 16: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Prior  Flash  Error  Studies  (II)  3.  RetenBon  noise  analysis  and  management  

-­‐  Yu  Cai,  Gulay  Yalcin,  Onur  Mutlu,  Erich  F.  Haratsch,  Adrian  Cristal,  Osman  Unsal,  and  Ken  Mai,  Flash  Correct-­‐and-­‐Refresh:  RetenBon-­‐Aware  Error  Management  for  Increased  Flash  Memory  LifeBme,  ICCD  2012.  

-­‐  Yu  Cai,  Yixin  Luo,  Erich  F.  Haratsch,  Ken  Mai,  and  Onur  Mutlu,  Data  RetenBon  in  MLC  NAND  Flash  Memory:  CharacterizaBon,  OpBmizaBon  and  Recovery,  HPCA  2015.  

-­‐  Yixin  Luo,  Yu  Cai,  Saugata  Ghose,  Jongmoo  Choi,  and  Onur  Mutlu,  WARM:  Improving  NAND  Flash  Memory  LifeBme  with  Write-­‐hotness  Aware  RetenBon  Management,  MSST  2015.  

Page 17: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Prior  Flash  Error  Studies  (III)  4.  Cell-­‐to-­‐cell  interference  analysis  and  management  

-­‐  Yu  Cai,  Onur  Mutlu,  Erich  F.  Haratsch,  and  Ken  Mai,  Program  Interference  in  MLC  NAND  Flash  Memory:  CharacterizaBon,  Modeling,  and  MiBgaBon,  ICCD  2013.    

-­‐  Yu  Cai,  Gulay  Yalcin,  Onur  Mutlu,  Erich  F.  Haratsch,  Osman  Unsal,  Adrian  Cristal,  and  Ken  Mai,  Neighbor-­‐Cell  Assisted  Error  CorrecBon  for  MLC  NAND  Flash  Memories,  SIGMETRICS  2014.  

5.  Read  disturb  noise  study  

-­‐  Yu  Cai,  Yixin  Luo,  Saugata  Ghose,  Erich  F.  Haratsch,  Ken  Mai,  and  Onur  Mutlu,  Read  Disturb  Errors  in  MLC  NAND  Flash  Memory:  CharacterizaBon  and  MiBgaBon,  DSN  2015.  

Page 18: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Some  Prior  Talks  on  Flash  Errors  

• Saugata  Ghose,  Write-­‐hotness  Aware  Reten0on  Management,  FMS  2016.• Onur  Mutlu,  Read  Disturb  Errors  in  MLC  NAND  Flash  Memory,  FMS  2015.• Yixin  Luo,  Data  Reten0on  in  MLC  NAND  Flash  Memory,  FMS  2015.• Onur  Mutlu,Error  Analysis  and  Management  for  MLC  NAND  Flash  Memory,  FMS  2014.

• FMS  2016  posters:-­‐  WARM:  Improving  NAND  Flash  Memory  LifeOme  with  Write-­‐hotness  AwareRetenOon  Management  

-­‐  Read  Disturb  Errors  in  MLC  NAND  Flash  Memory  -­‐  Data  RetenOon  in  MLC  NAND  Flash  Memory  

Page 19: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Prior Works on Flash Error Analysis n  Controlled studies that experimentally analyze many types

of error q  retention, program interference, read disturb, wear

n  Conducted on raw flash chips, not full SSD-based systems

n  Use synthetic access patterns, not real workloads in production systems

n  Do not account for the storage software stack

n  Small number of chips and small amount of time

Page 20: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Prior Lower-Level Flash Error Studies

n  Provide a lot of insight

n  Lead to new reliability and performance techniques q  E.g., to manage errors in a controller

n  But they do not provide information on q  errors that appear during real-system operation q  beyond the correction capability of the controller

Page 21: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

In-The-Field Operation Effects n  Access patterns not controlled

n  Real applications access SSDs over years

n  Through the storage software stack (employs buffering)

n  Through the SSD controller (employs ECC and wear leveling)

n  Factors in platform design (e.g., number of SSDs) can affect access patterns

n  Many SSDs and flash chips in a real data center

Page 22: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our goal

Understand SSD reliability:▪ at a large scale

▪ millions of device-days, across four years

▪ in the field▪ realistic workloads and systems

Page 23: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Server SSDarchitecture

Page 24: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems
Page 25: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

PCIe

Page 26: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Flash chips

Page 27: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

SSD controller▪ translates addresses▪ schedules accesses▪ performs wear leveling

Page 28: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

10011111 11001111 11000011 00001101 10101110 11100101 11111001 01111011 00011001 11011101 11100011 11111000 11011111 01001101 11110000 10111111 00000001 11011110 00000101 01010110 00001011 10000010 11111110 00011100

...

01001100 01001101 11010010 01000000 10011100 10111111 10101111 11000101

User data

ECC metadata

Page 29: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Types of errorsSmall errors▪ 10's of flipped bits per KB▪ silently corrected by SSD controller

Large errors▪ 100's of flipped bits per KB▪ corrected by host using driver▪ referred to as SSD failure

Page 30: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Small errors

Large errors

Types of errors

▪ ~10's of flipped bits per KB ▪ silently corrected by SSD controller

▪ ~100's of flipped bits per KB▪ corrected by host using driver▪ refer to as SSD failure

We examine large errors (SSD failures) in this study.

Page 31: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Error collection/analysismethodology

Page 32: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

SSD data measurement▪ metrics stored on SSDs▪ measured across SSD lifetime

Page 33: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

SSD characteristics▪ 6 different system configurations

▪ 720GB, 1.2TB, and 3.2TB SSDs▪ servers have 1 or 2 SSDs▪ this talk: representative systems

▪ 6 months to 4 years of operation▪ 15TB to 50TB read and written

Page 34: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Platform and SSD Characteristics n  Six different platforms n  Spanning a majority of SSDs at Facebook’s production servers

Page 35: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Bit error rates (BER)▪ BER = bit errors per bits transmitted▪ 1 error per 385M bits transmitted to

1 error per 19.6B bits transmitted▪ averaged across all SSDs in each system type

▪ 10x to 1000x lower than prior studies▪ large errors, SSD performs wear leveling

Page 36: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Some Definitions

n  Uncorrectable error q  Cannot be corrected by the SSD q  But corrected by the host CPU driver

n  SSD failure rate q  Fraction of SSDs in a “bucket” that have had at least one

uncorrectable error

1

Page 37: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Different Platforms, Different Failure Rates

Page 38: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Older Platforms à Higher SSD Error Rates

Page 39: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Platforms with Multiple SSDs

n  Failures of SSDs in the same platform are correlated q  Multiple SSDs in one host

n  Conclusion: Operational conditions related to platform affect SSD failure trends

Page 40: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A few SSDs cause most errors

Page 41: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A few SSDs cause most errors

10% of SSDshave >80%of errors

Errors followWeibulldistribution

Page 42: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A few SSDs cause most errors

What factors contribute to SSD failures in the field?

Page 43: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Analytical methodology▪ not feasible to log every error▪ instead, analyze lifetime counters▪ snapshot-based analysis

Page 44: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems
Page 45: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Errors 54,326 0 2 10

Data written 10TB 2TB 5TB 6TB

Page 46: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Errors 54,326 0 2 10

Data written 10TB 2TB 5TB 6TB 2014-11-1

Page 47: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Errors 54,326 0 2 10

Data written 10TB 2TB 5TB 6TB 2014-11-1

Data written

Errors

Page 48: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Errors 54,326 0 2 10

Data written 10TB 2TB 5TB 6TB 2014-11-1

Data written

BucketsErrors

Page 49: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Errors 54,326 0 2 10

Data written 10TB 2TB 5TB 6TB 2014-11-1

Data written

Errors

Page 50: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Errors 54,326 0 2 10

Data written 10TB 2TB 5TB 6TB 2014-11-1

Data written

Errors

Page 51: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Some Definitions

n  Uncorrectable error q  Cannot be corrected by the SSD q  But corrected by the host CPU driver

n  SSD failure rate q  Fraction of SSDs in a “bucket” that have had at least one

uncorrectable error

Page 52: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

SSD reliabilitytrends

Page 53: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

SSD lifecycle

Readdisturbance

Temperature

New reliability

trends

Page 54: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Readdisturbance

Temperature

SSD lifecycle

New reliability

trends

Page 55: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

bathtub curveStorage lifecycle background:

the

[Schroeder+,FAST'07]

for disk drives

Usage

Failure rate

Page 56: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

bathtub curveStorage lifecycle background:

the

[Schroeder+,FAST'07]

for disk drives

Failure rate

Usage

Earlyfailureperiod

Useful lifeperiod

Wearoutperiod

Page 57: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

bathtub curveStorage lifecycle background:

the

[Schroeder+,FAST'07]

for disk drives

Failure rate

Usage

Earlyfailureperiod

Useful lifeperiod

Wearoutperiod

Do SSDs display similar lifecycle periods?

Page 58: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Use data written to flashto examine SSD lifecycle

(time-independent utilization metric)

Page 59: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

What We Find

Page 60: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●●

0e+00 4e+13 8e+13

Data written (B)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform A Platform B720GB, 1 SSD 720GB, 2 SSDs

0 40 80

Data written (TB)

Page 61: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●●

0e+00 4e+13 8e+13

Data written (B)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform A Platform B720GB, 1 SSD 720GB, 2 SSDs

0 40 80

Data written (TB)

Early failure period

Useful life period

Wearout period

Page 62: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●●

0e+00 4e+13 8e+13

Data written (B)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform A Platform B720GB, 1 SSD 720GB, 2 SSDs

0 40 80

Data written (TB)

Early failure period

Useful life period

Wearout period

Earlydetectionperiod

Page 63: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Readdisturbance

Temperature

New reliability

trends

SSD lifecycle

Early detection lifecycle period distinct from hard disk drive lifecycle.

Page 64: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Why Early Detection Period?

n  Two pool model of flash blocks: weak and strong

n  Weak ones fail early à increasing failure rate early in lifetime q  SSD takes them offline à lowers the overall failure rate

n  Strong ones fail late à increasing failure rate late in lifetime

Page 65: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Temperature

SSD lifecycle

Readdisturbance

New reliability

trends

Page 66: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Read disturbance▪ reading data can disturb contents▪ failure mode identified in lab setting▪ under adversarial workloads

Page 67: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Read  from  Flash  Cell  Array  

3.0V   3.8V   3.9V   4.8V  

3.5V   2.9V   2.4V   2.1V  

2.2V   4.3V   4.6V   1.8V  

3.5V   2.3V   1.9V   4.3V  

Vread  =  2.5  V  

Vpass  =  5.0  V  

Vpass  =  5.0  V  

Vpass  =  5.0  V  

1   1  0  0  Correct  values  for  page  2:   1  

Page  1  

Page  2  

Page  3  

Page  4  

Pass  (5V)  

Read  (2.5V)  

Pass  (5V)  

Pass  (5V)  

Page 68: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Read  Disturb  Problem:  “Weak  Programming”  Effect  

3.0V   3.8V   3.9V   4.8V  

3.5V   2.9V   2.4V   2.1V  

2.2V   4.3V   4.6V   1.8V  

3.5V   2.3V   1.9V   4.3V  

Repeatedly  read  page  3  (or  any  page  other  than  page  2)   2  

Read  (2.5V)  

Pass  (5V)  

Pass  (5V)  

Pass  (5V)  

Page  1  

Page  2  

Page  3  

Page  4  

Page 69: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

More on Flash Read Disturb Errors n  Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai,

and Onur Mutlu, "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation" Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.

Page 70: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Read disturbance▪ reading data can disturb contents▪ failure mode identified in lab setting▪ under adversarial workloads

Does read disturbance affect SSDs in the field?

Page 71: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Examine SSDs withflash R/W

to understand read effects

(isolate effects of read vs. write errors)

ratiosmost data readand

high

Page 72: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●● ● ●

● ●

● ●● ●

●● ●

0.0e+00 1.5e+14

Data read (B)

0.00

0.50

1.00

SSD

failu

re ra

te

3.2TB, 1 SSD (average R/W = 2.14)

0 100 200

Data read (TB)

Page 73: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●● ● ● ● ● ● ●

● ●● ●

●●

●●

●●

0.0e+00 1.0e+14 2.0e+14

Data read (B)

0.00

0.50

1.00

SSD

failu

re ra

te

1.2TB, 1 SSD (average R/W = 1.15)

0 100 200

Data read (TB)

Page 74: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Temperature

New reliability

trends

SSD lifecycle

Readdisturbance

We do not observe the effects of read disturbance errors in the field.

Page 75: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

SSD lifecycle

Readdisturbance

Temperature

New reliability

trends

Page 76: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems
Page 77: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Temperaturesensor

Page 78: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Three Failure Rate Trends with Temperature

n  Increasing q  SSD not throttled

n  Decreasing after some temperature q  SSD could be throttled

n  Not sensitive q  SSD could be throttled

Page 79: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●● ●

30 40 50 60

Average temperature (°C)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform A Platform B720GB, 1 SSD 720GB, 2 SSDs

Page 80: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

No Throttling on A & B SSDs

Page 81: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

High temperature: may throttle or shut down

Page 82: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●● ● ● ●

30 40 50 60 70

Average temperature (°C)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform C Platform E1.2TB, 1 SSD 3.2TB, 1 SSD

Throttling

Page 83: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Heavy Throttling on C & E SSDs

Page 84: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

New reliability

trends

SSD lifecycle

Readdisturbance

Temperature

Throttling SSD usage helps mitigate temperature-induced errors.

Page 85: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

PCIe Bus Power Consumption

n  Trends for Bus Power Consumption vs. Failure Rate q  Similar to Temperature vs. Failure Rate

n  Temperature might be correlated with Bus Power

Page 86: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Temperature

SSD lifecycle

Readdisturbance

Access patterndependence

New reliability

trends

Page 87: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access pattern effectsSystem buffering▪ data served from OS caches▪ decreases SSD usage

Write amplification▪ updates to small amounts of data▪ increases erasing and copying

Page 88: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access pattern effects

Write amplification▪ updates to small amounts of data▪ increases erasing and copying

System buffering▪ data served from OS caches▪ decreases SSD usage

Page 89: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems
Page 90: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page 91: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 92: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 93: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 94: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 95: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 96: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 97: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Page cache

Page 98: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

the impact of SSD writesSystem caching reduces

Page cache

Page 99: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●●

● ● ●

0e+00 3e+10 6e+10

System data written (sectors)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform A Platform B720GB, 1 SSD 720GB, 2 SSDs

0 15 30

Data written to OS (TB)

Page 100: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

● ●

●●

●●

0e+00 2e+10 4e+10

2e+1

36e

+13

System data written (sectors)

Platform A

Dat

a w

ritte

n to

flas

h ce

lls (B

)

●● ●

0e+00 3e+10 6e+10

2e+1

36e

+13

System data written (sectors)

Platform B

●●●●●●

●●●●

●●●

●●

●●●●

0.0e+00 6.0e+10 1.2e+110.0e

+00

1.0e

+14

System data written (sectors)

Platform C

0.0e+00 1.5e+10 3.0e+10

0e+0

02e

+13

System data written (sectors)

Platform D

●●●●●

●●●●

●●

●●

●●

●●●●

●●●●

●●

●●

0.0e+00 1.0e+11 2.0e+110.0e

+00

1.5e

+14

3.0e

+14

System data written (sectors)

Platform E

● ●

●●

0e+00 2e+10 4e+100.0e

+00

2.0e

+13

System data written (sectors)

Platform F720GB, 2 SSDs

0 15 30

Data written to OS (TB)

Data written toflash cells (TB)

60

20

Page 101: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

System-Level Writes vs. Chip-Level Writes

n  More data written at the software does not imply

n  More data written into flash chips

n  Due to system level buffering

n  More system-level writes can enable more opportunities for coalescing in the system buffers

Page 102: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access pattern effectsSystem buffering▪ data served from OS caches▪ decreases SSD usage

Write amplification▪ updates to small amounts of data▪ increases erasing and copying

Page 103: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Flash devices use atranslation layer

to locate data

Page 104: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

OS

Logical address space

Translation layerPhysical address space

<offset1, size1><offset2, size2>

...

Page 105: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Sparse data layoutmore translation metadata

potential for higher write amplification

e.g., many small file updates

Page 106: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Dense data layoutless translation metadata

potential for lower write amplification

e.g., one huge file update

Page 107: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Use translation data sizeto examine effects of data layout

(relates to application access patterns)

Page 108: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●●

● ● ●

●●

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SSD

failu

re ra

te

0 1 2

Translation data (GB)

SparserDenser

720GB, 1 SSD

Page 109: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SS

D fa

ilure

rate

Platform A Platform B

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SS

D fa

ilure

rate

Platform C Platform D

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SS

D fa

ilure

rate

Platform E Platform F

Figure 8: SSD failure rate vs. DRAM bu↵er usage. Sparse data mappings (e.g., non-contiguous data, indicatedby high DRAM bu↵er usage to store flash translation layer metadata) negatively a↵ect SSD reliability themost (Platforms A, B, and D). Additionally, some dense data mappings (e.g., contiguous data in PlatformsE and F) also negatively a↵ect SSD reliability, likely due to the e↵ect of small, sparse writes.

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Graph Search

SS

D fa

ilure

rate

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Batch Processor

2.5e+08 3.5e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Key−Value Store

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Load Balancer

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Distributed Key−Value Store

4e+08 8e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Flash Cache

Figure 9: SSD failure rate vs. DRAM bu↵er usage across six applications that run on Platform B. We observesimilar DRAM bu↵er e↵ects to Figure 8, even among SSDs running the same application.

5. THE ROLE OF EXTERNAL FACTORSWe next examine how factors external to the SSD influence

the errors observed over an SSD’s lifetime. We examine the ef-fects of temperature, PCIe bus power, and system-level writesreported by the OS.

5.1 TemperatureIt is commonly assumed that higher temperature negatively

a↵ects the operation of flash-based SSDs. In flash cells, highertemperatures have been shown to cause cells to age morequickly due to the temperature-activated Arrhenius e↵ect [39].Temperature-dependent e↵ects are especially important to un-derstand for flash-based SSDs in order to make adequate datacenter provisioning and cooling decisions. To examine the ef-fects of temperature, we used temperature measurements fromtemperature sensors embedded on the SSD cards, which pro-vide a more accurate portrayal of the temperature of flash cellsthan temperature sensors at the server or rack level.Figure 10 plots the failure rate for SSDs that have various

average operating temperatures. We find that at an operatingtemperature range of 30 to 40 C, SSDs across server platformssee similar failure rates or slight increases in failure rates astemperature increases.Outside of this range (at temperatures of 40 C and higher),

we find that SSDs fall into one of three categories with respectto their reliability trends vs. temperature: (1) temperature-sensitive with increasing failure rate (Platforms A and B),(2) less temperature-sensitive (Platforms C and E), and (3)

temperature-sensitive with decreasing failure rate (PlatformsD and F). There are two factors that may a↵ect the trends weobserve with respect to SSD temperature.One potential factor when analyzing the e↵ects of temper-

ature is the operation of the SSD controller in response tochanges in temperature. The SSD controllers in some of theSSDs we examine attempt to ensure that SSDs do not exceedcertain temperature thresholds (starting around 80 C). Simi-lar to techniques employed in processors to reduce the amountof processor activity in order to keep the processor within acertain temperature range, our SSDs attempt to change theirbehavior (e.g., reduce the frequency of SSD access or, in theextreme case, shut down the SSD) in order not to exceed tem-perature thresholds.A second potential factor is the thermal characteristics of

the machines in each platform. The existence of two SSDs ina machine (in Platforms B, D, and F) compared to one SSDin a machine may (1) increase the thermal capacity of themachine (causing its SSDs to reach higher temperatures morequickly and increase the work required to cool the SSDs) and(2) potentially reduce airflow to the components, prolongingthe e↵ects of high temperatures when they occur.One hypothesis is that temperature-sensitive SSDs with in-

creasing error rates, such as Platforms A and B, may not em-ploy as aggressive temperature reduction techniques as otherplatforms. While we cannot directly measure the actions theSSD controllers take in response to temperature events, weexamined an event that can be correlated with temperature

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SS

D fa

ilure

rate

Platform A Platform B

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SS

D fa

ilure

rate

Platform C Platform D

5.0e+08 1.5e+09

DRAM buffer usage (B)

0.00

0.50

1.00

SS

D fa

ilure

rate

Platform E Platform F

Figure 8: SSD failure rate vs. DRAM bu↵er usage. Sparse data mappings (e.g., non-contiguous data, indicatedby high DRAM bu↵er usage to store flash translation layer metadata) negatively a↵ect SSD reliability themost (Platforms A, B, and D). Additionally, some dense data mappings (e.g., contiguous data in PlatformsE and F) also negatively a↵ect SSD reliability, likely due to the e↵ect of small, sparse writes.

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Graph Search

SS

D fa

ilure

rate

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Batch Processor

2.5e+08 3.5e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Key−Value Store

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Load Balancer

2.5e+08 4.0e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Distributed Key−Value Store

4e+08 8e+08

DRAM buffer usage (B)

0.00

0.50

1.00

Flash Cache

Figure 9: SSD failure rate vs. DRAM bu↵er usage across six applications that run on Platform B. We observesimilar DRAM bu↵er e↵ects to Figure 8, even among SSDs running the same application.

5. THE ROLE OF EXTERNAL FACTORSWe next examine how factors external to the SSD influence

the errors observed over an SSD’s lifetime. We examine the ef-fects of temperature, PCIe bus power, and system-level writesreported by the OS.

5.1 TemperatureIt is commonly assumed that higher temperature negatively

a↵ects the operation of flash-based SSDs. In flash cells, highertemperatures have been shown to cause cells to age morequickly due to the temperature-activated Arrhenius e↵ect [39].Temperature-dependent e↵ects are especially important to un-derstand for flash-based SSDs in order to make adequate datacenter provisioning and cooling decisions. To examine the ef-fects of temperature, we used temperature measurements fromtemperature sensors embedded on the SSD cards, which pro-vide a more accurate portrayal of the temperature of flash cellsthan temperature sensors at the server or rack level.

Figure 10 plots the failure rate for SSDs that have variousaverage operating temperatures. We find that at an operatingtemperature range of 30 to 40 C, SSDs across server platformssee similar failure rates or slight increases in failure rates astemperature increases.

Outside of this range (at temperatures of 40 C and higher),we find that SSDs fall into one of three categories with respectto their reliability trends vs. temperature: (1) temperature-sensitive with increasing failure rate (Platforms A and B),(2) less temperature-sensitive (Platforms C and E), and (3)

temperature-sensitive with decreasing failure rate (PlatformsD and F). There are two factors that may a↵ect the trends weobserve with respect to SSD temperature.

One potential factor when analyzing the e↵ects of temper-ature is the operation of the SSD controller in response tochanges in temperature. The SSD controllers in some of theSSDs we examine attempt to ensure that SSDs do not exceedcertain temperature thresholds (starting around 80 C). Simi-lar to techniques employed in processors to reduce the amountof processor activity in order to keep the processor within acertain temperature range, our SSDs attempt to change theirbehavior (e.g., reduce the frequency of SSD access or, in theextreme case, shut down the SSD) in order not to exceed tem-perature thresholds.

A second potential factor is the thermal characteristics ofthe machines in each platform. The existence of two SSDs ina machine (in Platforms B, D, and F) compared to one SSDin a machine may (1) increase the thermal capacity of themachine (causing its SSDs to reach higher temperatures morequickly and increase the work required to cool the SSDs) and(2) potentially reduce airflow to the components, prolongingthe e↵ects of high temperatures when they occur.

One hypothesis is that temperature-sensitive SSDs with in-creasing error rates, such as Platforms A and B, may not em-ploy as aggressive temperature reduction techniques as otherplatforms. While we cannot directly measure the actions theSSD controllers take in response to temperature events, weexamined an event that can be correlated with temperature

0.25 0.45

Translation data (GB)

0.25 0.45

Translation data (GB)

Graph search Key-value store

Write amplification in the field

Page 110: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Why Does Sparsity of Data Matter?

n  More translation data correlates with higher failure rates

n  Sparse data updates, i.e., updates to less contiguous data, lead to more translation data

n  Higher failure rates likely due to more frequent erase and copying caused by non-contiguous updates q  Write amplification

Page 111: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

New reliability

trends

SSD lifecycle

Readdisturbance

Temperature

Access patterndependence

We quantify the effects of the page cache and write amplification in the field.

Page 112: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

SSD lifecycle

Readdisturbance

Temperature

New reliability

trends

Page 113: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

▪ Block erasures and discards▪ Page copies▪ Bus power consumption

More results in paper

Page 114: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Summary

Page 115: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

▪ Large scale▪ In the field

Page 116: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

SSD lifecycle

Readdisturbance

Temperature

Summary

New reliability

trends

Page 117: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Readdisturbance

Temperature

New reliability

trends

SSD lifecycle

Summary

Early detection lifecycle period distinct from hard disk drive lifecycle.

Page 118: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

Temperature

New reliability

trends

SSD lifecycle

Readdisturbance

Summary

We do not observe the effects of read disturbance errors in the field.

Page 119: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Access patterndependence

New reliability

trends

SSD lifecycle

Readdisturbance

Temperature

Summary

Throttling SSD usage helps mitigate temperature-induced errors.

Page 120: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

New reliability

trends

SSD lifecycle

Readdisturbance

Temperature

Access patterndependence

Summary

We quantify the effects of the page cache and write amplification in the field.

Page 121: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A Large-Scale Study ofFlash Memory Errors in the FieldJustin Meza Qiang Wu Sanjeev Kumar Onur Mutlu

Page 122: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our Remaining FMS 2016 Talks n  At 4:20pm Today n  Practical Threshold Voltage Distribution Modeling

q  Yixin Luo (CMU PhD Student) August 10 @ 4:20pm q  Forum E-22: Controllers and Flash Technology

n  At 5:45pm Today n  "WARM: Improving NAND Flash Memory Lifetime

with Write-hotness Aware Retention Management” q  Saugata Ghose (CMU Researcher) August 10 @ 5:45pm q  Forum C-22: SSD Concepts (SSDs Track)

Page 123: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Referenced Papers and Talks

n  All are available at http://users.ece.cmu.edu/~omutlu/projects.htm http://users.ece.cmu.edu/~omutlu/talks.htm

n  And, many other previous works on q  NVM & Persistent Memory q  DRAM q  Hybrid memories q  NAND flash memory

Page 124: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Thank you.

Feel free to email me with any questions & feedback

[email protected] http://users.ece.cmu.edu/~omutlu/

Page 125: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Large-Scale Study of In-the-Field Flash Failures

Onur Mutlu [email protected]

(joint work with Justin Meza, Qiang Wu, Sanjeev Kumar)

August 10, 2016 Flash Memory Summit 2016, Santa Clara, CA

Page 126: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

References to Papers and Talks

Page 127: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Challenges and Opportunities in Memory

n  Onur Mutlu, "Rethinking Memory System Design" Keynote talk at 2016 ACM SIGPLAN International Symposium on Memory Management (ISMM), Santa Barbara, CA, USA, June 2016. [Slides (pptx) (pdf)] [Abstract]

n  Onur Mutlu and Lavanya Subramanian, "Research Problems and Opportunities in Memory Systems" Invited Article in Supercomputing Frontiers and Innovations (SUPERFRI), 2015.

Page 128: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our  FMS  Talks  and  Posters  • Onur  Mutlu,  ThyNVM:  So+ware-­‐Transparent  Crash  Consistency  forPersistent  Memory,  FMS  2016.

• Onur  Mutlu,  Large-­‐Scale  Study  of  In-­‐the-­‐Field  Flash  Failures,  FMS  2016.• Yixin  Luo,  PracBcal  Threshold  Voltage  DistribuBon  Modeling,  FMS  2016.• Saugata  Ghose,  Write-­‐hotness  Aware  RetenBon  Management,  FMS  2016.• Onur  Mutlu,  Read  Disturb  Errors  in  MLC  NAND  Flash  Memory,  FMS  2015.• Yixin  Luo,  Data  RetenBon  in  MLC  NAND  Flash  Memory,  FMS  2015.• Onur  Mutlu,Error  Analysis  and  Management  for  MLC  NAND  Flash  Memory,  FMS  2014.

• FMS  2016  posters:-­‐  WARM:  Improving  NAND  Flash  Memory  LifeOme  with  Write-­‐hotness  AwareRetenOon  Management  

-­‐  Read  Disturb  Errors  in  MLC  NAND  Flash  Memory  -­‐  Data  RetenOon  in  MLC  NAND  Flash  Memory  

Page 129: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our  Flash  Memory  Works  (I)  1.  Reten'on  noise  study  and  management1) Yu  Cai,  Gulay  Yalcin,  Onur  Mutlu,  Erich  F.  Haratsch,  Adrian  Cristal,  Osman  

Unsal,  and  Ken  Mai,  Flash  Correct-­‐and-­‐Refresh:  Reten'on-­‐Aware  Error  Management  forIncreased  Flash  Memory  Life'me,  ICCD  2012.  

2) Yu  Cai,  Yixin  Luo,  Erich  F.  Haratsch,  Ken  Mai,  and  Onur  Mutlu,  Data  Reten'on  in  MLC  NAND  Flash  Memory:  Characteriza'on,  Op'miza'onand  Recovery,  HPCA  2015.  

3) Yixin  Luo,  Yu  Cai,  Saugata  Ghose,  Jongmoo  Choi,  and  Onur  Mutlu,  WARM:  Improving  NAND  Flash  Memory  Life'me  with  Write-­‐hotness  AwareReten'on  Management,  MSST  2015.  

2.  Flash-­‐based  SSD  prototyping  and  tes'ng  plaLorm4) Yu  Cai,  Erich  F.  Haratsh,  Mark  McCartney,  Ken  Mai,  

FPGA-­‐based  solid-­‐state  drive  prototyping  plaLorm,  FCCM  2011.  

Page 130: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our  Flash  Memory  Works  (II)  3.  Overall  flash  error  analysis  5) Yu  Cai,  Erich  F.  Haratsch,  Onur  Mutlu,  and  Ken  Mai,  

Error  PaQerns  in  MLC  NAND  Flash  Memory:  Measurement,  Characteriza'on,  and  Analysis,  DATE  2012.  

6) Yu  Cai,  Gulay  Yalcin,  Onur  Mutlu,  Erich  F.  Haratsch,  Adrian  Cristal,  Osman  Unsal,  and  Ken  Mai,  Error  Analysis  and  Reten'on-­‐Aware  Error  Management  for  NAND  Flash  Memory,  ITJ  2013.  

4.  Program  and  erase  noise  study  7) Yu  Cai,  Erich  F.  Haratsch,  Onur  Mutlu,  and  Ken  Mai,  

Threshold  Voltage  Distribu'on  in  MLC  NAND  Flash  Memory:  Characteriza'on,  Analysis  and  Modeling,  DATE  2013.  

Page 131: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our  Flash  Memory  Works  (III)  5.  Cell-­‐to-­‐cell  interference  characteriza'on  and  tolerance  8) Yu  Cai,  Onur  Mutlu,  Erich  F.  Haratsch,  and  Ken  Mai,  

Program  Interference  in  MLC  NAND  Flash  Memory:  Characteriza'on,  Modeling,  and  Mi'ga'on,  ICCD  2013.    

9) Yu  Cai,  Gulay  Yalcin,  Onur  Mutlu,  Erich  F.  Haratsch,  Osman  Unsal,  Adrian  Cristal,  and  Ken  Mai,  Neighbor-­‐Cell  Assisted  Error  Correc'on  for  MLC  NAND  Flash  Memories,  SIGMETRICS  2014.  

 6.  Read  disturb  noise  study  10) Yu  Cai,  Yixin  Luo,  Saugata  Ghose,  Erich  F.  Haratsch,  Ken  Mai,  and  Onur  Mutlu,  

Read  Disturb  Errors  in  MLC  NAND  Flash  Memory:  Characteriza'on  and  Mi'ga'on,  DSN  2015.  

Page 132: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Our  Flash  Memory  Works  (IV)  7.  Flash  errors  in  the  field  11) JusOn  Meza,  Qiang  Wu,  Sanjeev  Kumar,  and  Onur  Mutlu,  

A  Large-­‐Scale  Study  of  Flash  Memory  Errors  in  the  Field,  SIGMETRICS  2015.  

8.  Persistent  memory  12) Jinglei  Ren,  Jishen  Zhao,  Samira  Khan,  Jongmoo  Choi,  Yongwei  Wu,  and  Onur  

Mutlu,  ThyNVM:  Enabling  SoZware-­‐Transparent  Crash  Consistency  in  Persistent  Memory  Systems,  MICRO  2015.  

Page 133: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Phase Change Memory As DRAM Replacement

n  Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative" Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)

n  Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger, "Phase Change Technology and the Future of Main Memory" IEEE Micro, Special Issue: Micro's Top Picks from 2009 Computer Architecture Conferences (MICRO TOP PICKS), Vol. 30, No. 1, pages 60-70, January/February 2010.

Page 134: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

STT-MRAM As DRAM Replacement

n  Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative" Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2013. Slides (pptx) (pdf)

Page 135: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Taking Advantage of Persistence in Memory n  Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and

Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)

n  Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]

Page 136: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Hybrid DRAM + NVM Systems (I) n  HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding,

and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories" Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012. Slides (pptx) (pdf) Best paper award (in Computer Systems and Applications track).

n  Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management" IEEE Computer Architecture Letters (CAL), February 2012.

Page 137: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Hybrid DRAM + NVM Systems (II)

n  Dongwoo Kang, Seungjae Baek, Jongmoo Choi, Donghee Lee, Sam H. Noh, and Onur Mutlu, "Amnesic Cache Management for Non-Volatile Memory" Proceedings of the 31st International Conference on Massive Storage Systems and Technologies (MSST), Santa Clara, CA, June 2015. [Slides (pdf)]

Page 138: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

NVM Design and Architecture

n  HanBin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, and Onur Mutlu, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories" ACM Transactions on Architecture and Code Optimization (TACO), Vol. 11, No. 4, December 2014. [Slides (ppt) (pdf)] Presented at the 10th HiPEAC Conference, Amsterdam, Netherlands, January 2015. [Slides (ppt) (pdf)]

n  Justin Meza, Jing Li, and Onur Mutlu, "Evaluating Row Buffer Locality in Future Non-Volatile Main Memories" SAFARI Technical Report, TR-SAFARI-2012-002, Carnegie Mellon University, December 2012.

Page 139: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Referenced Papers and Talks

n  All are available at http://users.ece.cmu.edu/~omutlu/projects.htm http://users.ece.cmu.edu/~omutlu/talks.htm

n  And, many other previous works on NAND flash memory errors and management

Page 140: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Related Videos and Course Materials n  Undergraduate Computer Architecture Course Lecture

Videos (2013, 2014, 2015)

n  Undergraduate Computer Architecture Course Materials (2013, 2014, 2015)

n  Graduate Computer Architecture Lecture Videos (2013, 2015)

n  Parallel Computer Architecture Course Materials (Lecture Videos)

n  Memory Systems Short Course Materials (Lecture Video on Main Memory and DRAM Basics)

Page 141: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Additional Slides

Page 142: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Backup slides

Page 143: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

System characteristicsSSD

capacityPCIe

Average age

(years)

SSDs per server

Average written

(TB)

Average read(TB)

720GB v1, x4 2.4 1 27.2 23.82 48.5 45.1

1.2TB v2, x4 1.6 1 37.8 43.42 18.9 30.6

3.2TB v2, x4 0.5 1 23.9 51.12 14.8 18.2

Page 144: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A B C D E F

SSD

failu

re ra

te

0.0

0.2

0.4

0.6

0.8

1.0

720GB 1.2TB 3.2TBDevices: 1 2 1 2 1 2

Page 145: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

A B C D E F

Year

ly u

ncor

rect

able

erro

rs p

er S

SD

0e+0

04e

+05

8e+0

5

720GB 1.2TB 3.2TBDevices: 1 2 1 2 1 2

Page 146: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

Channelsoperate in parallel

Page 147: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

DRAM buffer▪ stores address translations ▪ may buffer writes

Page 148: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems

●● ● ●

35 45 55 65

Average temperature (°C)

0.00

0.50

1.00

SSD

failu

re ra

te

● Platform D Platform F1.2TB, 2 SSDs 3.2TB, 2 SSDs

Early detection+ throttling