Future Compute Memory: Non-Volatile Memory (NVM) in Compute
Al Fazio
Intel Fellow
Director, Memory Technology Development
November 12, 2008
Outline
Motivations for NVM in Compute
Key Principles of NAND Flash Operation & Device Physics from a compute applications viewpoint
Memory Controller Architecture
NVM Impact on Compute Applications
Future NVM Technology Trends
1987 View of NVM in Compute
2008 View of NVM in Compute
Form Factor: 2.5”/ 1.8” Standard SATA 3Gb/s
Performance
• World-class SATA I/O performance
• X25-E (SLC) throughput
  – Sustained R/W: 240 / 170 MB/s
  – Active (avg): 2.4W; Idle: 0.06W
• X25-M/X18-M (MLC) throughput
  – Sustained R/W: 240 / 70 MB/s
  – Active (avg): 0.25W; Idle: 0.06W
A Full Range of NVM in Compute
Motivation for NVM in Compute: Huge Scaling Discrepancy Between CPU and HDD
[Figure: normalized CPU performance vs. normalized media access time for a 20K read, Jan-96 through Jan-09. CPU (later multicore CPU) performance scales steeply while disk access time barely moves: 1.3X vs 175X in 13 years!]
Source: Intel measurements
• Intel® Mainstream SATA SSD
• Intel® Turbo Memory
• Intel® SSD
3+ decades of floating-gate technology scaling, starting with EPROM, then Flash
Technology designed for high-volume manufacturing
20+ Years of Flash Floating-Gate Technology
1986/1.5µm → 1988/1.0µm → 1991/0.8µm → 1993/0.6µm → 1996/0.4µm → 1998/0.25µm → 2000/0.18µm → 2002/0.13µm → 2004/90nm → 2006/72nm → 2007/50nm → 2008/34nm (32Gbit)
Source: Intel
Flash: A License to Disrupt
35mm film, Floppy drives, audio tape…
• Flash use in consumer electronics characterized by:
  – Large block files (.jpg, .mp3, …)
  – # Writes determined by human interaction (i.e., photos taken)
To disrupt HDD, flash must accommodate compute characteristics:
• Small random writes, # writes determined by the OS
• Add to this:
  – Flash requires high fields to overcome energy barriers for non-volatility
  – Flash reliability is dominated by oxide degradation, a result of program/erase
NAND Physical Organization
A block is a sea of cells arranged in a grid
Top gates (TGs) are connected along wordlines (typ. 32, only 5 shown)
Cells in different wordlines are strung together in series
Each string of cells is connected to a bitline at one end and to the source at the other
Select devices control whether the block is connected to bitlines and source
[Figure: one erase block, showing gates along each wordline, the two select devices, the bitline, and the source.]
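To make the organization concrete, here is a minimal Python sketch of the addressing implied above; the 32-wordline block is from the slide, while the bitline count is an illustrative assumption (both vary by device):

```python
# Toy model of the NAND block organization described above.
# Assumed geometry: 32 wordlines per block, illustrative bitline count.
WORDLINES_PER_BLOCK = 32
BITLINES = 4096 * 8   # one cell per (wordline, bitline) intersection

def cell_address(block: int, wordline: int, bitline: int) -> tuple:
    """A cell sits at (block, wordline, bitline).

    All cells on one wordline form a page; all cells on one bitline
    within a block form a series string, connected to the bitline at
    one end and the source at the other via the two select devices.
    """
    assert 0 <= wordline < WORDLINES_PER_BLOCK and 0 <= bitline < BITLINES
    return (block, wordline, bitline)

# A page is the set of cells sharing a wordline; a block erases whole.
page_7_5 = [cell_address(7, 5, b) for b in range(BITLINES)]
print(len(page_7_5), "cells in block 7, wordline 5")
```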
Flash Cell Layout and Cross-Section
N-channel MOSFET with a few distinguishing features:
– Isolated floating gate
– Charge storage on the floating gate modulates the threshold voltage of the underlying MOSFET
[Figure: Ids–Vcg curves for the erased ("1") and programmed ("0") states, and a cell cross-section showing the POLY2 control gate (CG), inter-poly dielectric (ONO), POLY1 floating gate (FG) with stored electrons, tunnel oxide, oxide sidewall, N+ source/drain, and P-type silicon substrate.]
Charge Storage: Program and Erase
Programming means injecting electrons into the FG
• Fowler-Nordheim tunneling
Erase: Fowler-Nordheim tunneling in the reverse direction
[Figure: NAND programming applies ~20V on the control gate with source/drain at 0V; erase reverses the polarity (~-20V effective) to pull electrons back off the FG.]
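For reference, both operations follow the textbook Fowler-Nordheim relation (standard device physics, not taken from the slide), where $E_{ox}$ is the field across the tunnel oxide and $A$, $B$ are constants set by the barrier height and carrier effective mass:

$$J_{FN} \;=\; A\,E_{ox}^{2}\,\exp\!\left(-\frac{B}{E_{ox}}\right)$$

The exponential field dependence is the point: current is enormous at the roughly 10 MV/cm fields used for program/erase, yet negligible at read fields, which is what makes non-volatility possible at the cost of oxide stress.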
[Figure: cell-Vt distributions. SLC: two levels, "1" (erased) and "0" (programmed) => 1 bit/cell. MLC: four levels, "11", "10", "01", "00" => 2 bits/cell.]
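A minimal sketch of MLC sensing, assuming illustrative read-reference voltages and a level-to-bits mapping (real references and Gray-code assignments are vendor-specific):

```python
# Hypothetical read references R1 < R2 < R3 (volts are illustrative).
R1, R2, R3 = 0.0, 1.5, 3.0
LEVELS = ["11", "10", "01", "00"]  # assumed L0 (erased) .. L3 mapping

def read_cell(vt: float) -> str:
    """Bin the cell's threshold voltage against the three references."""
    if vt < R1:
        return LEVELS[0]
    if vt < R2:
        return LEVELS[1]
    if vt < R3:
        return LEVELS[2]
    return LEVELS[3]

for vt in (-1.0, 0.7, 2.2, 3.8):
    print(f"Vt={vt:+.1f}V -> bits {read_cell(vt)}")
```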
Reliability and Oxide Traps
Normally, F-N tunneling occurs only during accelerated stresses done by engineers trying to study oxide degradation…
• In Flash memories, it is the basis of device operation itself
This fact has two fundamental implications:
• Flash reliability is dominated by oxide-degradation effects, notably trap buildup in the tunnel oxide, which occur as a result of program/erase cycling
• More than any other IC technology, developing a Flash technology centers around obtaining acceptable reliability
Over time, charges can detrap
• This shifts VT and can cause data loss
[Figure: energy-vs-distance band diagram showing trapped charge Q' between the channel and the FG, alongside a cell cross-section with top gate, FG, and N+ junctions.]
Bit Errors: Overview
At any instant, some fraction of bits are in the wrong data state, typically 1E-9 to 1E-6, called the “raw bit error rate” or RBER
These failing bits develop with use
[Figure: Vt distributions of levels L0–L3 after write vs. after time, relative to the read and Vpass reference points.]
• During write, some bits program when they shouldn't, or program higher than they should
• Cells shift in VT over time, from time alone ("data retention") or from repetitive read operations ("read disturb")
• Both kinds increase with more program/erase cycles
• Several mechanisms cause bit errors, each with its own dependence on cycles, time, temperature, etc.
This complexity means RBER is a number, but not like pi; it is like temperature: a value for a specific set of conditions, location, and instant.
[Figure: RBER (1E-9 to 1E-5) vs. P/E cycles, on linear (4,000–10,000) and log (1–10,000) cycle axes; RBER climbs steadily with cycling.]
Erratic Nature of Write Errors
Errors are erratic: Most bits failing at 5K didn’t fail at 10K
Explanation: oxide traps are transient
Data verified only at the plotted points (symbols): did we miss errors in between?
Ran experiment to verify data after every cycle
• Example bit failed 11 times, never at previous verify points
• Previous verifies detected only 0.6% of failing bits
Standard “test after stress” qualifications miss most errors!
[Figure: bit fail points over cycling; earlier IMFT data shown for comparison.]
Next several slides are based on 70nm results from: Mielke, N., et al., "Bit error rate in NAND Flash memories", IEEE International Reliability Physics Symposium, 2008
Data-Retention Errors
[Figure: raw bit error rate (1E-9 to 1E-4) vs. retention time (0–10,000 hours), post 10K cycles.]
After cycling, RBER increases over time without bias
Error transitions show cells are losing VT (“charge loss”)
Two products dominated by upper state (L3), others by L1 & L2
Characteristics:
• L1 & L2: Detrapping from the tunnel oxide
• L3: SILC (trap-assisted tunneling) leakage off FG
[Figure: levels L0–L3 with read references R1–R3; cell cross-sections illustrating detrapping from the tunnel oxide (L1 & L2) and SILC leakage off the FG (L3).]
Read Disturb Errors
[Figure: raw bit error rate (0 to 1.5E-6) vs. number of reads (0–10,000), post 10K cycles.]
After cycling, RBER increases with repetitive reading
Error transitions show erased cells gaining VT
Mechanism is well known: SILC under read bias
[Figure: erased cells gain VT as SILC under the ~6V read/pass bias leaks charge onto the FG.]
Effect of ECC
[Figure: cumulative fraction of sectors failing with 1-bit ECC (up to ~2E-4) vs. 4-bit ECC (up to ~4E-13), for four stresses: 5K cycles; 10K cycles; 10K cycles + 0.5 year or 5K reads; 10K cycles + 1 year or 10K reads.]
• Failures drop several orders of magnitude, ~10^12x over no ECC
• Curves get steeper (because of the ECC power law)
• Dominant mechanism switches to retention (because of the underlying error distribution)
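The "power law" can be made explicit with a sketch, assuming independent bit errors and an illustrative 512-byte sector: a t-bit-correcting ECC leaves a sector failing only when at least t+1 bits are bad, so the leading term is C(n, t+1)·p^(t+1):

```python
from math import comb

def sector_fail_prob(p: float, n_bits: int = 512 * 8, t: int = 4) -> float:
    """Leading-order sector failure probability for RBER p and
    t-bit-correcting ECC (valid when n*p << 1, errors independent)."""
    return comb(n_bits, t + 1) * p ** (t + 1)

p = 1e-7  # an RBER in the range of the preceding plots
for t in (1, 4):
    print(f"{t}-bit ECC: sector fail ~ {sector_fail_prob(p, t=t):.1e}")
```

The p^(t+1) dependence is also why the 4-bit curves are so much steeper than the 1-bit curves.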
Workable UBER Definition for NAND
UBER = Uncorrectable Bit Error Rate

$$\mathrm{UBER} \;=\; \frac{\text{Cum Fraction Sectors Failing}}{(\text{bits per sector}) \cdot (\#\text{reads per sector})} \;=\; \frac{\text{Cum Fraction Sectors Failing}}{(\text{bits per sector}) \cdot \left(N_{\mathrm{CYC}} \cdot \#\text{Reads/Cycle} + N_{\mathrm{Post\text{-}CYC}}\right)}$$

For the #Reads/Cycle term: worst case, use 1; for read-disturb stress, use the # reads in the stress; for unbiased (retention) stress, impute the same rate as in cycling.
UBER Estimate
[Figure: cumulative fraction of sectors failing (4-bit ECC, to ~2E-13) vs. bits read per sector (to ~1E8), with write, read-disturb, and retention points; worst-case slope ~3E-21.]
UBER Estimate
Data re-plotted vs. # bits read
UBER at any point is the slope of line to the origin
UBER is very low 3x10-21 at worst-case point (retention)
UBER increases with greater use, so use range must be stated when UBER is specified
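A sketch of the bookkeeping, using made-up inputs in the range of the plot above (the sector size and stress counts are assumptions):

```python
def uber(cum_fraction_sectors_failing: float,
         bits_per_sector: int, reads_per_sector: float) -> float:
    """UBER per the definition above: sector failures normalized
    to total bits read."""
    return cum_fraction_sectors_failing / (bits_per_sector * reads_per_sector)

BITS_PER_SECTOR = 512 * 8           # assumed 512-byte sector
n_cyc, reads_per_cycle = 10_000, 1  # worst case: 1 read per cycle
n_post_cyc = 10_000                 # post-cycling (read-disturb) reads

reads = n_cyc * reads_per_cycle + n_post_cyc
print(f"UBER ~ {uber(2e-13, BITS_PER_SECTOR, reads):.1e}")  # ~2e-21
```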
Concurrency in Intel® SSD ASIC
10 external physical NAND channels
• Up to 2 NAND components per channel
• Component = Dual Die or Quad Die Packages
Each channel supports multiple outstanding tasks
• Each NAND channel fully hardware automated/accelerated
• Hardware fully overlaps & pipelines commands
• Automated ECC generators & correctors
[Block diagram: ten NAND channels (#0–#9), each with dual-CE# NAND components and its own ECC generator, feeding a memory arbiter and a shared ECC corrector into the rest of the ASIC datapath.]
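An illustrative sketch (not the actual firmware) of why ten channels with dual chip-enables help: striping consecutive logical pages across channels lets transfers proceed in parallel:

```python
CHANNELS, CE_PER_CHANNEL = 10, 2  # per the slide

def place(logical_page: int) -> tuple:
    """Round-robin a logical page onto (channel, chip-enable)."""
    return (logical_page % CHANNELS,
            (logical_page // CHANNELS) % CE_PER_CHANNEL)

# 20 consecutive pages land on 20 distinct targets, so all ten
# channels (and both CEs on each) can be busy at once.
print(sorted({place(p) for p in range(20)}))
```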
Algorithmic Efficiency
A high-performance NAND controller is necessary but not sufficient
The primary impact on overall performance is algorithmic efficiency
• Especially the case for small random writes
Write Amplification
Write amplification is the amount of NAND written for a requested amount of write from the host.
[Figure sequence: an erase block (EB) of 64 pages, Page 0 … Page 63, updated by a small host write.]
1. Host presents data to be written.
2. First retrieve all data in the erase block into a DRAM copy.
3. Then insert the new data into the retrieved copy.
4. Then erase the NAND erase block.
5. Finally put all data back (including the new data).
Example amplification is 32 (32X NAND written for the host request: all 64 pages rewritten for a 2-page host write). Traditional schemes have amplification of approx 20–40X.
*Simplified example to illustrate the write-amplification effect. Specific algorithms vary greatly.
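A sketch of the arithmetic in this simplified example (the 64-page block is from the slide; real controllers avoid the whole-block rewrite with remapping and garbage collection):

```python
PAGES_PER_EB = 64  # erase-block size from the example

def naive_write_amplification(host_pages: int) -> float:
    """Read-modify-write of the whole erase block: every page is
    rewritten regardless of how little the host asked to write."""
    return PAGES_PER_EB / host_pages

print(naive_write_amplification(2))   # 32.0 -> the 32X above
print(naive_write_amplification(64))  # 1.0 for a full-block write
```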
Client Workload Write Amplification
[Figure: data written (MB, log scale 0.1–1000) vs. workload duration (0–190 minutes) for XP mobile workload writes; measured writes from the host and measured writes to NAND track closely.]
Intel® High-Performance SATA SSDs: typical write amplification <1.1 for client workloads (this example <1.05)
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
*Third party marks and brands are the property of their respective owners
Variability in Wear Leveling
Controllers vary in wear-leveling effectiveness
Poor wear leveling can have high impact
• 20x in cycles can be 10x or more in RBER
• 10x in RBER is 10^(ECC+1)x in ECC failure rate: 100,000x for 4-bit ECC
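As a quick check of that leverage, under the same independent-error ECC model sketched earlier, a k-fold RBER increase multiplies a t-bit-ECC failure rate by k^(t+1):

```python
def ecc_failure_multiplier(rber_ratio: float, t: int) -> float:
    """Failure rate scales as RBER^(t+1) for t-bit-correcting ECC."""
    return rber_ratio ** (t + 1)

print(ecc_failure_multiplier(10, t=4))  # 100000.0 -> the 100,000x above
```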
[Figure: P/E cycles (0–2,500) per erase block, sorted min to max (~6,145+ blocks). An unsophisticated regioned scheme varies 20x in cycles/block; a more sophisticated scheme shows ~4% variation. Ref: Intel® X18-M]
Putting It Together: SSD Reliability Metrics
SSD UBER values can be << 10^-15
UBER ∝ usage: program/erase/read & subsequent retention
Intel® X18-M and X25-M Mainstream SATA SSD (80GB)
• 10-channel architecture with 50nm MLC ONFI 1.0 NAND
• 5 years usage, 1000G, 1.2 million hrs MTBF
• Client workload @ 1E-15 UBER: >>100 GB/day for 5 years
Intel® X25-M and X18-M Mainstream SATA SSDs deliver >5X the accepted client requirement (20GB/day)
Intel® X25-E Extreme SATA SSD (32GB)
• 10-channel architecture with 50nm SLC ONFI 1.0 NAND
• 1000G, 2 million hrs MTBF
• Intel SLC SSDs support >7,000 random 8K 2:1 R/W IOPS 24/7 for 5 years
Intel X25-E SLC SSDs support the endurance required to replace many 15K RPM HDDs for IOPS applications
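A back-of-envelope endurance check; the cycle rating here is an assumption for illustration (not specified on this slide), and the write amplification comes from the earlier client-workload slide:

```python
CAPACITY_GB = 80     # X25-M capacity, from the slide
PE_CYCLES = 5_000    # assumed MLC program/erase rating
WRITE_AMP = 1.1      # client write amplification (earlier slide: <1.1)
YEARS = 5

host_writes_gb = CAPACITY_GB * PE_CYCLES / WRITE_AMP
print(f"~{host_writes_gb / (YEARS * 365):.0f} GB/day over {YEARS} years")
# ~199 GB/day, consistent with the >>100 GB/day claim above
```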
Why Random Performance Matters (more than sequential transfer rate)
Most requests are non-sequential, where the non-transfer time component is dominant.
Approximate service-time breakdown for a 7200RPM HDD with 8ms average seek time and 75MB/s transfer rate performing a 32KB random read operation:
[Figure: stacked breakdown of the 32KB transfer: seek, rotation, media transfer (440us), interface transfer (110us).]
For non-sequential accesses, >95% of total HDD service time is mechanical latency.
Intel® Mainstream SATA SSD Bridges the HDD Performance Gap: Random Read Performance
[Figure: random read IOPS (QD=32, axis to 50,000) vs. transfer size (512B to 131,072B) for a 7200RPM HDD, 128GB SATA SSD (A), 64GB SSD (B), and the Intel® Mainstream SATA SSD.]
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
Intel® Mainstream SATA SSD Bridges the HDD Performance Gap (cont'd): Random Write Performance
[Figure: random write IOPS (QD=32, axis to 4,000) vs. transfer size (512B to 131,072B) for a 7200RPM HDD, 128GB SATA SSD (A), 64GB SSD (B), and the Intel® Mainstream SATA SSD.]
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
Intel® Mainstream SATA SSDs Save Power: SATA Power Rails With 2-Hour Mobile Workload
[Figure: power (W, 0–4.5) across the workload, samples sorted by increasing power: Intel® X25-M SSD (1X), 5400RPM HDD (8X), 7200RPM HDD (13X).]
HDD spends only 10% in lowest power states
Intel® X25-M SSD spends 96% in lowest power states
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
Intel® Mainstream SATA SSDs Mean Better Mobile CPU Scaling
[Figure: Sysmark07*-Productivity benchmark score for a 2GHz Intel® Core™2 Duo processor vs. a 3GHz Intel® Core™2 Extreme processor: the 5400RPM mobile HDD scales 25%; the Intel® Mainstream SATA SSD scales 36%.]
Intel® Mainstream SATA SSDs maximize the end user's processor performance.
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
SSDs in Data Center
Data center value proposition:
• Performance, especially IOPS performance – IOPS = Input/Output Operations Per Second
• Fewer devices needed to meet IOP need, saving money
• Lower power consumption
• Higher system reliability
SSD Value:
A lower cost, greener, more reliable data center
[Figure (reprise): normalized CPU performance vs. normalized media access time for a 20K read, Jan-96 through Jan-09, now with the Intel SSD closing the gap to CPU scaling. Source: Intel measurements]
Enterprise HDD Performance Gap Results in Multiplication of HDDs
7056 HDDs are expensive
7056 HDDs are hard to manage
7056 HDDs fail often
7056 HDDs burn a lot of power
7056 15K RPM HDDs
TPC-C Report: http://www.tpc.org/
Similar I/O Performance For IOPS Intensive Workload
490 Fibre Channel 15K RPM drives in 4 racks
Vs.
8 lab prototype SSDs (not product) internal to server
HDD: 64,000 IOPS, 490 HDDs, 35 drive shelves, 24 sq ft, 14 kW, 4.6 IOPS/W
SSD: 120,000 IOPS, 8 SSDs, 1 drive shelf, 1 sq ft, 0.6 kW, 200 IOPS/W
Energy savings: $9K/year · Energy costs: 23X reduction · Storage cost: 8X reduction · Floor space: 24X reduction
(Picture not to scale)
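Checking the efficiency arithmetic on this slide:

```python
hdd_iops, hdd_kw = 64_000, 14.0   # 490-drive HDD array
ssd_iops, ssd_kw = 120_000, 0.6   # 8 prototype SSDs

print(f"HDD: {hdd_iops / (hdd_kw * 1e3):.1f} IOPS/W")  # ~4.6
print(f"SSD: {ssd_iops / (ssd_kw * 1e3):.0f} IOPS/W")  # 200
print(f"Power: {hdd_kw / ssd_kw:.0f}X reduction")      # ~23X
```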
IOPS Application Optimization
Intel® Turbo Memory (NAND Cache)
NAND flash solution on PCI-e bus
Intel driver interfaces to Microsoft ReadyBoost* and ReadyDrive*
Option ROM (OROM) handles cache management before the driver loads
Supports on-motherboard and minicard solutions
[Block diagram: the operating system's Microsoft ReadyBoost*/SuperFetch* and ReadyDrive*/T13 NVM command paths run through the Intel® Matrix Storage Manager Driver 7.0, Robson driver, and OROM on the SW side; below the SW/HW boundary, PCIe* connects the NAND controller (CTRL) and NAND, and SATA connects the disk. Memory Management Technology sits alongside.]
Standardized High-Performance NAND Platform
[Diagram: the host connects to a platform NVM subsystem of a flash controller plus flash devices, via NVMHCI on the host side, ONFI 2.0 (~133MB/s per channel) to the flash, and an ONFI NAND DIMM connector.]
All elements necessary for a standardized high-performance platform NAND solution.
*Future platform evolution forecasted
NAND in the Platform
NAND in the platform has started with modules plugged in on PCIe
As NAND becomes more prevalent, the controller will be integrated with the platform
• Down on motherboard or higher levels of integration
OEMs want to offer customers capacity/feature choice, so NAND will remain on a module
Issue: How to plug a NAND-only module into a PC platform?
• NAND does not talk PCIe*
[Diagram: chipset with an Intel® Turbo Memory module.]
Connector for NAND-only Modules
To offer capacity choice, ONFI is defining a standard connector
• Enables OEMs to sell NAND on a module
• Like an unbuffered and unregistered DIMM
The ONFI connector effort is leveraging existing DRAM standards
• Avoids major connector tooling costs
• Re-uses electrical verification
• Ensures low cost with quick time to market
Both right-angle and vertical entry form factors are being delivered
NAND Technology Future Scaling Trends: $$/GB = Bit Area & Reliability Scaling
NAND scaling: multiple potential vectors
Pipeline for next several years
• Traditional FG NAND: 34nm 32Gb MLC NAND
• Vertical integration (Ref: S. Jung, IEDM 2006)
NAND Pathfinding Programs
• Trap-based flash [diagram: control dielectric, charge-storage dots, tunnel oxide]
  – More evolutionary: higher production probability, but less scalable
• Non-NAND NVM: storage element + switch element
  – Less evolutionary: lower production probability, but potentially more scalable
NVM in Compute…
20+ Year Vision driven by Moore's Law
Now we can start…