Future Compute Memory: Non-Volatile Memory (NVM) in Compute
Al Fazio
Intel Fellow
Director, Memory Technology Development
November 12, 2008
Outline
Motivations for NVM in Compute
Key Principles of NAND Flash Operation & Device Physics from a compute applications viewpoint
Memory Controller Architecture
NVM Impact on Compute Applications
Future NVM Technology Trends
1987 View of NVM in Compute
2008 View of NVM in Compute
Form Factor: 2.5”/ 1.8” Standard SATA 3Gb/s
Performance
• World-class SATA I/O performance
• X25-E (SLC) throughput
  – Sustained R/W: 240 / 170 MB/s
  – Active (avg): 2.4W; Idle: 0.06W
• X25-M/X18-M (MLC) throughput
  – Sustained R/W: 240 / 70 MB/s
  – Active (avg): 0.25W; Idle: 0.06W
A Full Range of NVM in Compute
Motivation for NVM in Compute: Huge Scaling Discrepancy Between CPU and HDD
[Figure: normalized CPU performance vs. normalized media access time for a 20K read, Jan-96 through Jan-09. CPU (later multicore CPU) performance scales steeply while disk access time barely moves: 1.3X vs 175X in 13 years!]
Source: Intel measurements
• Intel® Mainstream SATA SSD
• Intel® Turbo Memory
• Intel® SSD
3+ decades of floating-gate technology scaling, starting with EPROM, then Flash
Technology designed for high-volume manufacturing
20+ Years of Flash Floating-Gate Technology
1986/1.5µm → 1988/1.0µm → 1991/0.8µm → 1993/0.6µm → 1996/0.4µm → 1998/0.25µm → 2000/0.18µm → 2002/0.13µm → 2004/90nm → 2006/72nm → 2007/50nm → 2008/34nm (32Gbit)
Source: Intel
Flash: A License to Disrupt
35mm film, Floppy drives, audio tape…
• Flash use in consumer electronics characterized by:
  – Large block files (.jpg, .mp3, …)
  – # Writes determined by human interaction (i.e., photos taken)
To disrupt HDD, flash must accommodate compute characteristics:
• Small random writes, # writes determined by the OS
• Add to this:
  – Flash requires high fields to overcome energy barriers for non-volatility
  – Flash reliability is dominated by oxide degradation, a result of program/erase
NAND Physical Organization
A block is a sea of cells arranged in a grid
Top gates (TGs) are connected along wordlines (typ. 32, only 5 shown)
Cells in different wordlines are strung together in series
Each string of cells is connected to a bitline at one end and to the source at the other
Select devices control whether the block is connected to bitlines and source
[Figure: one erase block, showing gates along each wordline, the two select devices, the bitline, and the source.]
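To make the organization concrete, here is a minimal Python sketch of the addressing implied above; the 32-wordline block is from the slide, while the bitline count is an illustrative assumption (both vary by device):

```python
# Toy model of the NAND block organization described above.
# Assumed geometry: 32 wordlines per block, illustrative bitline count.
WORDLINES_PER_BLOCK = 32
BITLINES = 4096 * 8   # one cell per (wordline, bitline) intersection

def cell_address(block: int, wordline: int, bitline: int) -> tuple:
    """A cell sits at (block, wordline, bitline).

    All cells on one wordline form a page; all cells on one bitline
    within a block form a series string, connected to the bitline at
    one end and the source at the other via the two select devices.
    """
    assert 0 <= wordline < WORDLINES_PER_BLOCK and 0 <= bitline < BITLINES
    return (block, wordline, bitline)

# A page is the set of cells sharing a wordline; a block erases whole.
page_7_5 = [cell_address(7, 5, b) for b in range(BITLINES)]
print(len(page_7_5), "cells in block 7, wordline 5")
```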
Flash Cell Layout and Cross-Section
N-channel MOSFET with a few distinguishing features:
– Isolated floating gate
– Charge storage on the floating gate modulates the threshold voltage of the underlying MOSFET
[Figure: Ids–Vcg curves for the erased ("1") and programmed ("0") states, and a cell cross-section showing the POLY2 control gate (CG), inter-poly dielectric (ONO), POLY1 floating gate (FG) with stored electrons, tunnel oxide, oxide sidewall, N+ source/drain, and P-type silicon substrate.]
Charge Storage: Program and Erase
Programming means injecting electrons into the FG
• Fowler-Nordheim tunneling
Erase: Fowler-Nordheim tunneling in the reverse direction
[Figure: NAND programming applies ~20V on the control gate with source/drain at 0V; erase reverses the polarity (~-20V effective) to pull electrons back off the FG.]
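For reference, both operations follow the textbook Fowler-Nordheim relation (standard device physics, not taken from the slide), where $E_{ox}$ is the field across the tunnel oxide and $A$, $B$ are constants set by the barrier height and carrier effective mass:

$$J_{FN} \;=\; A\,E_{ox}^{2}\,\exp\!\left(-\frac{B}{E_{ox}}\right)$$

The exponential field dependence is the point: current is enormous at the roughly 10 MV/cm fields used for program/erase, yet negligible at read fields, which is what makes non-volatility possible at the cost of oxide stress.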
[Figure: cell-Vt distributions. SLC: two levels, "1" (erased) and "0" (programmed) => 1 bit/cell. MLC: four levels, "11", "10", "01", "00" => 2 bits/cell.]
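A minimal sketch of MLC sensing, assuming illustrative read-reference voltages and a level-to-bits mapping (real references and Gray-code assignments are vendor-specific):

```python
# Hypothetical read references R1 < R2 < R3 (volts are illustrative).
R1, R2, R3 = 0.0, 1.5, 3.0
LEVELS = ["11", "10", "01", "00"]  # assumed L0 (erased) .. L3 mapping

def read_cell(vt: float) -> str:
    """Bin the cell's threshold voltage against the three references."""
    if vt < R1:
        return LEVELS[0]
    if vt < R2:
        return LEVELS[1]
    if vt < R3:
        return LEVELS[2]
    return LEVELS[3]

for vt in (-1.0, 0.7, 2.2, 3.8):
    print(f"Vt={vt:+.1f}V -> bits {read_cell(vt)}")
```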
Reliability and Oxide Traps
Normally, F-N tunneling occurs only during accelerated stresses done by engineers trying to study oxide degradation…
• In Flash memories, it is the basis of device operation itself
This fact has two fundamental implications:
• Flash reliability is dominated by oxide-degradation effects, notably trap buildup in the tunnel oxide, which occur as a result of program/erase cycling
• More than any other IC technology, developing a Flash technology centers around obtaining acceptable reliability
Over time, charges can detrap
• This shifts VT and can cause data loss
[Figure: energy-vs-distance band diagram showing trapped charge Q' between the channel and the FG, alongside a cell cross-section with top gate, FG, and N+ junctions.]
Bit Errors: Overview
At any instant, some fraction of bits are in the wrong data state, typically 1E-9 to 1E-6, called the “raw bit error rate” or RBER
These failing bits develop with use
[Figure: Vt distributions of levels L0–L3 after write vs. after time, relative to the read and Vpass reference points.]
• During write, some bits program when they shouldn't, or program higher than they should
• Cells shift in VT over time, from time alone ("data retention") or from repetitive read operations ("read disturb")
• Both kinds increase with more program/erase cycles
• Several mechanisms cause bit errors, each with its own dependence on cycles, time, temperature, etc.
This complexity means RBER is a number, but not like pi; it is like temperature: a value for a specific set of conditions, location, and instant.
[Figure: RBER (1E-9 to 1E-5) vs. P/E cycles, on linear (4,000–10,000) and log (1–10,000) cycle axes; RBER climbs steadily with cycling.]
Erratic Nature of Write Errors
Errors are erratic: Most bits failing at 5K didn’t fail at 10K
Explanation: oxide traps are transient
Data verified only at the plotted points (symbols): did we miss errors in between?
Ran experiment to verify data after every cycle
• Example bit failed 11 times, never at previous verify points
• Previous verifies detected only 0.6% of failing bits
Standard “test after stress” qualifications miss most errors!
[Figure: bit fail points over cycling; earlier IMFT data shown for comparison.]
Next several slides are based on 70nm results from: Mielke, N., et al., "Bit error rate in NAND Flash memories", IEEE International Reliability Physics Symposium, 2008
Data-Retention Errors
[Figure: raw bit error rate (1E-9 to 1E-4) vs. retention time (0–10,000 hours), post 10K cycles.]
After cycling, RBER increases over time without bias
Error transitions show cells are losing VT (“charge loss”)
Two products dominated by upper state (L3), others by L1 & L2
Characteristics:
• L1 & L2: Detrapping from the tunnel oxide
• L3: SILC (trap-assisted tunneling) leakage off FG
[Figure: levels L0–L3 with read references R1–R3; cell cross-sections illustrating detrapping from the tunnel oxide (L1 & L2) and SILC leakage off the FG (L3).]
Read Disturb Errors
[Figure: raw bit error rate (0 to 1.5E-6) vs. number of reads (0–10,000), post 10K cycles.]
After cycling, RBER increases with repetitive reading
Error transitions show erased cells gaining VT
Mechanism is well known: SILC under read bias
[Figure: erased cells gain VT as SILC under the ~6V read/pass bias leaks charge onto the FG.]
Effect of ECC
[Figure: cumulative fraction of sectors failing with 1-bit ECC (up to ~2E-4) vs. 4-bit ECC (up to ~4E-13), for four stresses: 5K cycles; 10K cycles; 10K cycles + 0.5 year or 5K reads; 10K cycles + 1 year or 10K reads.]
• Failures drop several orders of magnitude, ~10^12x over no ECC
• Curves get steeper (because of the ECC power law)
• Dominant mechanism switches to retention (because of the underlying error distribution)
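The "power law" can be made explicit with a sketch, assuming independent bit errors and an illustrative 512-byte sector: a t-bit-correcting ECC leaves a sector failing only when at least t+1 bits are bad, so the leading term is C(n, t+1)·p^(t+1):

```python
from math import comb

def sector_fail_prob(p: float, n_bits: int = 512 * 8, t: int = 4) -> float:
    """Leading-order sector failure probability for RBER p and
    t-bit-correcting ECC (valid when n*p << 1, errors independent)."""
    return comb(n_bits, t + 1) * p ** (t + 1)

p = 1e-7  # an RBER in the range of the preceding plots
for t in (1, 4):
    print(f"{t}-bit ECC: sector fail ~ {sector_fail_prob(p, t=t):.1e}")
```

The p^(t+1) dependence is also why the 4-bit curves are so much steeper than the 1-bit curves.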
Workable UBER Definition for NAND
UBER = Uncorrectable Bit Error Rate

$$\mathrm{UBER} \;=\; \frac{\text{Cum Fraction Sectors Failing}}{(\text{bits per sector}) \cdot (\#\text{reads per sector})} \;=\; \frac{\text{Cum Fraction Sectors Failing}}{(\text{bits per sector}) \cdot \left(N_{\mathrm{CYC}} \cdot \#\text{Reads/Cycle} + N_{\mathrm{Post\text{-}CYC}}\right)}$$

For the #Reads/Cycle term: worst case, use 1; for read-disturb stress, use the # reads in the stress; for unbiased (retention) stress, impute the same rate as in cycling.
UBER Estimate
[Figure: cumulative fraction of sectors failing (4-bit ECC, to ~2E-13) vs. bits read per sector (to ~1E8), with write, read-disturb, and retention points; worst-case slope ~3E-21.]
UBER Estimate
Data re-plotted vs. # bits read
UBER at any point is the slope of line to the origin
UBER is very low 3x10-21 at worst-case point (retention)
UBER increases with greater use, so use range must be stated when UBER is specified
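A sketch of the bookkeeping, using made-up inputs in the range of the plot above (the sector size and stress counts are assumptions):

```python
def uber(cum_fraction_sectors_failing: float,
         bits_per_sector: int, reads_per_sector: float) -> float:
    """UBER per the definition above: sector failures normalized
    to total bits read."""
    return cum_fraction_sectors_failing / (bits_per_sector * reads_per_sector)

BITS_PER_SECTOR = 512 * 8           # assumed 512-byte sector
n_cyc, reads_per_cycle = 10_000, 1  # worst case: 1 read per cycle
n_post_cyc = 10_000                 # post-cycling (read-disturb) reads

reads = n_cyc * reads_per_cycle + n_post_cyc
print(f"UBER ~ {uber(2e-13, BITS_PER_SECTOR, reads):.1e}")  # ~2e-21
```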
Concurrency in Intel® SSD ASIC
10 external physical NAND channels
• Up to 2 NAND components per channel
• Component = Dual Die or Quad Die Packages
Each channel supports multiple outstanding tasks
• Each NAND channel fully hardware automated/accelerated
• Hardware fully overlaps & pipelines commands
• Automated ECC generators & correctors
[Block diagram: ten NAND channels (#0–#9), each with dual-CE# NAND components and its own ECC generator, feeding a memory arbiter and a shared ECC corrector into the rest of the ASIC datapath.]
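An illustrative sketch (not the actual firmware) of why ten channels with dual chip-enables help: striping consecutive logical pages across channels lets transfers proceed in parallel:

```python
CHANNELS, CE_PER_CHANNEL = 10, 2  # per the slide

def place(logical_page: int) -> tuple:
    """Round-robin a logical page onto (channel, chip-enable)."""
    return (logical_page % CHANNELS,
            (logical_page // CHANNELS) % CE_PER_CHANNEL)

# 20 consecutive pages land on 20 distinct targets, so all ten
# channels (and both CEs on each) can be busy at once.
print(sorted({place(p) for p in range(20)}))
```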
Algorithmic Efficiency
A high-performance NAND controller is necessary but not sufficient
The primary impact on overall performance is algorithmic efficiency
• Especially the case for small random writes
Write Amplification
Write amplification is the amount of NAND written for a requested amount of write from the host.
[Figure sequence: an erase block (EB) of 64 pages, Page 0 … Page 63, updated by a small host write.]
1. Host presents data to be written.
2. First retrieve all data in the erase block into a DRAM copy.
3. Then insert the new data into the retrieved copy.
4. Then erase the NAND erase block.
5. Finally put all data back (including the new data).
Example amplification is 32 (32X NAND written for the host request: all 64 pages rewritten for a 2-page host write). Traditional schemes have amplification of approx 20–40X.
*Simplified example to illustrate the write-amplification effect. Specific algorithms vary greatly.
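A sketch of the arithmetic in this simplified example (the 64-page block is from the slide; real controllers avoid the whole-block rewrite with remapping and garbage collection):

```python
PAGES_PER_EB = 64  # erase-block size from the example

def naive_write_amplification(host_pages: int) -> float:
    """Read-modify-write of the whole erase block: every page is
    rewritten regardless of how little the host asked to write."""
    return PAGES_PER_EB / host_pages

print(naive_write_amplification(2))   # 32.0 -> the 32X above
print(naive_write_amplification(64))  # 1.0 for a full-block write
```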
Client Workload Write Amplification
[Figure: data written (MB, log scale 0.1–1000) vs. workload duration (0–190 minutes) for XP mobile workload writes; measured writes from the host and measured writes to NAND track closely.]
Intel® High-Performance SATA SSDs: typical write amplification <1.1 for client workloads (this example <1.05)
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
*Third party marks and brands are the property of their respective owners
Variability in Wear Leveling
Controllers vary in wear-leveling effectiveness
Poor wear leveling can have high impact
• 20x in cycles can be 10x or more in RBER
• 10x in RBER is 10^(ECC+1)x in ECC failure rate: 100,000x for 4-bit ECC
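As a quick check of that leverage, under the same independent-error ECC model sketched earlier, a k-fold RBER increase multiplies a t-bit-ECC failure rate by k^(t+1):

```python
def ecc_failure_multiplier(rber_ratio: float, t: int) -> float:
    """Failure rate scales as RBER^(t+1) for t-bit-correcting ECC."""
    return rber_ratio ** (t + 1)

print(ecc_failure_multiplier(10, t=4))  # 100000.0 -> the 100,000x above
```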
[Figure: P/E cycles (0–2,500) per erase block, sorted min to max (~6,145+ blocks). An unsophisticated regioned scheme varies 20x in cycles/block; a more sophisticated scheme shows ~4% variation. Ref: Intel® X18-M]
Putting It Together: SSD Reliability Metrics
SSD UBER values can be << 10^-15
UBER ∝ usage: program/erase/read & subsequent retention
Intel® X18-M and X25-M Mainstream SATA SSD (80GB)
• 10-channel architecture with 50nm MLC ONFI 1.0 NAND
• 5 years usage, 1000G, 1.2 million hrs MTBF
• Client workload @ 1E-15 UBER: >>100 GB/day for 5 years
Intel® X25-M and X18-M Mainstream SATA SSDs deliver >5X the accepted client requirement (20GB/day)
Intel® X25-E Extreme SATA SSD (32GB)
• 10-channel architecture with 50nm SLC ONFI 1.0 NAND
• 1000G, 2 million hrs MTBF
• Intel SLC SSDs support >7,000 random 8K 2:1 R/W IOPS 24/7 for 5 years
Intel X25-E SLC SSDs support the endurance required to replace many 15K RPM HDDs for IOPS applications
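A back-of-envelope endurance check; the cycle rating here is an assumption for illustration (not specified on this slide), and the write amplification comes from the earlier client-workload slide:

```python
CAPACITY_GB = 80     # X25-M capacity, from the slide
PE_CYCLES = 5_000    # assumed MLC program/erase rating
WRITE_AMP = 1.1      # client write amplification (earlier slide: <1.1)
YEARS = 5

host_writes_gb = CAPACITY_GB * PE_CYCLES / WRITE_AMP
print(f"~{host_writes_gb / (YEARS * 365):.0f} GB/day over {YEARS} years")
# ~199 GB/day, consistent with the >>100 GB/day claim above
```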
Why Random Performance Matters (more than sequential transfer rate)
Most requests are non-sequential, where the non-transfer time component is dominant.
Approximate service-time breakdown for a 7200RPM HDD with 8ms average seek time and 75MB/s transfer rate performing a 32KB random read operation:
[Figure: stacked breakdown of the 32KB transfer: seek, rotation, media transfer (440us), interface transfer (110us).]
For non-sequential accesses, >95% of total HDD service time is mechanical latency.
Intel® Mainstream SATA SSD Bridges the HDD Performance Gap: Random Read Performance
[Figure: random read IOPS (QD=32, axis to 50,000) vs. transfer size (512B to 131,072B) for a 7200RPM HDD, 128GB SATA SSD (A), 64GB SSD (B), and the Intel® Mainstream SATA SSD.]
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
Intel® Mainstream SATA SSD Bridges the HDD Performance Gap (cont'd): Random Write Performance
[Figure: random write IOPS (QD=32, axis to 4,000) vs. transfer size (512B to 131,072B) for a 7200RPM HDD, 128GB SATA SSD (A), 64GB SSD (B), and the Intel® Mainstream SATA SSD.]
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
Intel® Mainstream SATA SSDs Save Power: SATA Power Rails With 2-Hour Mobile Workload
[Figure: power (W, 0–4.5) across the workload, samples sorted by increasing power: Intel® X25-M SSD (1X), 5400RPM HDD (8X), 7200RPM HDD (13X).]
HDD spends only 10% in lowest power states
Intel® X25-M SSD spends 96% in lowest power states
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
Intel® Mainstream SATA SSDs Mean Better Mobile CPU Scaling
[Figure: Sysmark07*-Productivity benchmark score for a 2GHz Intel® Core™2 Duo processor vs. a 3GHz Intel® Core™2 Extreme processor: the 5400RPM mobile HDD scales 25%; the Intel® Mainstream SATA SSD scales 36%.]
Intel® Mainstream SATA SSDs maximize the end user's processor performance.
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
SSDs in Data Center
Data center value proposition:
• Performance, especially IOPS performance – IOPS = Input/Output Operations Per Second
• Fewer devices needed to meet IOP need, saving money
• Lower power consumption
• Higher system reliability
SSD Value:
A lower cost, greener, more reliable data center
[Figure (reprise): normalized CPU performance vs. normalized media access time for a 20K read, Jan-96 through Jan-09, now with the Intel SSD closing the gap to CPU scaling. Source: Intel measurements]
Enterprise HDD Performance Gap Results in Multiplication of HDDs
7056 HDDs are expensive
7056 HDDs are hard to manage
7056 HDDs fail often
7056 HDDs burn a lot of power
7056 15K RPM HDDs
TPC-C Report: http://www.tpc.org/
Similar I/O Performance For IOPS Intensive Workload
490 Fibre Channel 15K RPM drives in 4 racks
Vs.
8 lab prototype SSDs (not product) internal to server
HDD: 64,000 IOPS, 490 HDDs, 35 drive shelves, 24 sq ft, 14 kW, 4.6 IOPS/W
SSD: 120,000 IOPS, 8 SSDs, 1 drive shelf, 1 sq ft, 0.6 kW, 200 IOPS/W
Energy savings: $9K/year · Energy costs: 23X reduction · Storage cost: 8X reduction · Floor space: 24X reduction
(Picture not to scale)
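Checking the efficiency arithmetic on this slide:

```python
hdd_iops, hdd_kw = 64_000, 14.0   # 490-drive HDD array
ssd_iops, ssd_kw = 120_000, 0.6   # 8 prototype SSDs

print(f"HDD: {hdd_iops / (hdd_kw * 1e3):.1f} IOPS/W")  # ~4.6
print(f"SSD: {ssd_iops / (ssd_kw * 1e3):.0f} IOPS/W")  # 200
print(f"Power: {hdd_kw / ssd_kw:.0f}X reduction")      # ~23X
```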
IOPS Application Optimization
Intel® Turbo Memory (NAND Cache)
NAND flash solution on PCI-e bus
Intel driver interfaces to Microsoft ReadyBoost* and ReadyDrive*
Option ROM (OROM) handles cache management before the driver loads
Supports on-motherboard and minicard solutions
[Block diagram: the operating system's Microsoft ReadyBoost*/SuperFetch* and ReadyDrive*/T13 NVM command paths run through the Intel® Matrix Storage Manager Driver 7.0, Robson driver, and OROM on the SW side; below the SW/HW boundary, PCIe* connects the NAND controller (CTRL) and NAND, and SATA connects the disk. Memory Management Technology sits alongside.]
Standardized High-Performance NAND Platform
[Diagram: the host connects to a platform NVM subsystem of a flash controller plus flash devices, via NVMHCI on the host side, ONFI 2.0 (~133MB/s per channel) to the flash, and an ONFI NAND DIMM connector.]
All elements necessary for a standardized high-performance platform NAND solution.
*Future platform evolution forecasted
NAND in the Platform
NAND in the platform has started with modules plugged in on PCIe
As NAND becomes more prevalent, the controller will be integrated with the platform
• Down on motherboard or higher levels of integration
OEMs want to offer customers capacity/feature choice, so NAND will remain on a module
Issue: How to plug a NAND-only module into a PC platform?
• NAND does not talk PCIe*
[Diagram: chipset with an Intel® Turbo Memory module.]
Connector for NAND-only Modules
To offer capacity choice, ONFI is defining a standard connector
• Enables OEMs to sell NAND on a module
• Like an unbuffered and unregistered DIMM
The ONFI connector effort is leveraging existing DRAM standards
• Avoids major connector tooling costs
• Re-uses electrical verification
• Ensures low cost with quick time to market
Both right-angle and vertical entry form factors are being delivered
NAND Technology Future Scaling Trends: $$/GB = Bit Area & Reliability Scaling
NAND scaling: multiple potential vectors
Pipeline for next several years
• Traditional FG NAND: 34nm 32Gb MLC NAND
• Vertical integration (Ref: S. Jung, IEDM 2006)
NAND Pathfinding Programs
• Trap-based flash [diagram: control dielectric, charge-storage dots, tunnel oxide]
  – More evolutionary: higher production probability, but less scalable
• Non-NAND NVM: storage element + switch element
  – Less evolutionary: lower production probability, but potentially more scalable
NVM in Compute…
20+ Year Vision driven by Moore's Law
Now we can start…