![Page 1: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/1.jpg)
Memory Architecture and Storage Systems
Myoungsoo Jung
Computer Architecture and Memory Systems Lab.
School of Integrated Technology, Yonsei University
(IIT8015) Lecture#7: SSD Architecture and System-level Controllers
![Page 2: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/2.jpg)
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
– Parallelism-Aware Host Interface I/O Scheduler
– GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
![Page 4: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/4.jpg)
Holistic Viewpoint (Hardware)
[Figure: host system and SSD hardware in nested views. Host: CPU cores, Northbridge (memory controller hub) with memory slots and high-speed graphic I/O slots (PCI Express), Southbridge (I/O controller hub) with PCI slots, IDE, SATA, USB, and cables and ports leading off-board. SSD internals: host interface controller, embedded processors, and four channels (Channel 1 to 4), each with a flash controller (CTRL) and NAND flash packages. Flash package internals: Die 0 to Die 3 behind a multiplexed interface. Die internals: Plane 0 to Plane j, each with k blocks (k*j blocks in total), a data register, a cache register, and a NAND flash memory array. Annotation: Lecture 6.]
![Page 5: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/5.jpg)
Holistic Viewpoint (Software)
[Figure: SSD software pipeline. I/O request arrivals enter the device-level queue and pass through three components: NVMHC, Core (Flash Translation Layer), and Flash Controllers. Stages: Queuing, Memory Request Building, Memory Request Commitment, Transaction Handling. Sub-steps shown: Parsing, Data Movement Initiation, Address Translation, Execution Sequence, Striping & Pipelining, Transaction Decision, Interleaving & Sharing. Flash memory requests are labeled with virtual addresses and with physical addresses. Stage annotations: Lecture 6, Lecture 3, Today, Today, Spec-specific.]
• Memory requests: the data size is the same as the atomic flash I/O unit size
![Page 7: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/7.jpg)
SSD Architecture
• At the top of the SSD internals, a host interface controller parses the incoming requests
• Embedded CPU(s) run the flash firmware, such as the FTL, buffer cache, I/O scheduler, and parallelism management
![Page 8: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/8.jpg)
SSD Architecture
• Underneath the embedded processor, there are multiple flash controllers, each connected to a memory bus, referred to as a channel
• Within a bus, there are multiple flash packages, each with its own flash interface, called a way
![Page 9: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/9.jpg)
System-Level Parallelism
• Channel striping
– An I/O request is striped over multiple channels
• Way pipelining
– Because the ways share a channel, an I/O request cannot be perfectly striped in parallel
– However, a NAND flash transaction consists of multiple phases, so individual NAND flash chips can still work simultaneously
[Figure: host interface and microprocessor connected to four channels (CH A to CH D), each with two flash chips (ways); channel striping is highlighted across the channels.]
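The difference between channel striping and way pipelining can be sketched with a small timing model. This is a minimal sketch under stated assumptions, not the lecture's simulator: each page operation is assumed to have a bus-transfer phase that occupies the channel followed by a cell phase that occupies only the chip, and the function name `completion_time` and the 50 us / 200 us phase lengths are hypothetical values chosen for illustration.

```python
def completion_time(pages, xfer_us, cell_us):
    """Return the time (us) at which the last page operation finishes.

    pages   : list of (channel, way) targets, issued in order at t = 0
    xfer_us : bus-transfer phase, occupies the shared channel (assumed value)
    cell_us : cell phase, occupies only the target chip (assumed value)
    """
    bus_free = {}   # channel -> time the bus becomes free
    chip_free = {}  # (channel, way) -> time the chip becomes free
    finish = 0
    for ch, way in pages:
        # A page can start only when both its channel and its chip are free.
        start = max(bus_free.get(ch, 0), chip_free.get((ch, way), 0))
        bus_free[ch] = start + xfer_us
        chip_free[(ch, way)] = start + xfer_us + cell_us
        finish = max(finish, chip_free[(ch, way)])
    return finish


striped = [(0, 0), (1, 0), (2, 0), (3, 0)]    # four channels, one way each
pipelined = [(0, 0), (0, 1), (0, 2), (0, 3)]  # one channel, four ways
serial = [(0, 0)] * 4                         # one chip, no parallelism

print(completion_time(striped, 50, 200))    # 250
print(completion_time(pipelined, 50, 200))  # 400
print(completion_time(serial, 50, 200))     # 1000
```

Under these assumed numbers, way pipelining cannot match channel striping because the transfers serialize on the shared bus, but it still overlaps the cell phases and finishes well ahead of a single chip.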
![Page 12: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/12.jpg)
Flash-Level Parallelism
• Die interleaving
– A striped/pipelined request can be further interleaved across the dies within a chip
• Plane sharing
– Multiple planes work simultaneously using shared wordline(s)
[Figure: a flash chip with DIE 0 to DIE 3 behind a multiplexed interface; each plane has a NAND flash memory array (one block marked), a data register, and a cache register. Die interleaving spans the dies; plane sharing spans planes connected by a shared wordline.]
![Page 13: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/13.jpg)
Parallelism Overview
• Note that different vendors use different naming rules (interleaving, striping, way, etc.)
• You need to understand the four different levels of parallelism based on the context
– Plane sharing (=multiple-mode operation, two-plane operation, etc.)
– Die interleaving (=interleaved die operation, bank interleaving, etc.)
– Way pipelining (=package interleaving)
– Channel striping
![Page 15: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/15.jpg)
Page Allocation Strategies
Software stack (review) & page allocation
• Host Interface Layer (HIL): responsible for communication
• Flash Translation Layer (FTL): address translation between the host address space and physical addresses
• Hardware Abstraction Layer (HAL): commits flash transactions to the underlying flash memory chips
• Page allocation strategies are directly related to the physical data layout and access sequences, which affect performance and internal parallelism
[Figure: HIL, FTL, and HAL stacked above two channels (CH A, CH B), each with WAY 0 and WAY 1. Image: micron.com]
![Page 16: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/16.jpg)
Page Allocation Strategies (Palloc)
• Channel-first pallocs
– Allocate internal resources in favor of the channel striping method
• Way-first pallocs
– Are oriented toward taking advantage of way pipelining
• Die-first and plane-first pallocs
– Allocate die and plane in an attempt to reap the benefit of flash-level parallelism
![Page 17: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/17.jpg)
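Channel-first Page Allocation
[Figure: two channels (CH A, CH B), each with WAY 0 and WAY 1; each NAND flash chip holds DIE 0 and DIE 1, each with Plane 0 and Plane 1. The highlighted example is CWDP (Channel-Way-Die-Plane), with arrows for channel striping, way pipelining, die interleaving, and plane sharing.]
• Channel-first orders: CWDP, CDPW, CDWP, CPDW, CPWD, CWPD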
![Page 18: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/18.jpg)
Channel-first Page Allocation
• These page allocation strategies give priority in the order of channel, way, die, and plane
• Some channel-first page allocation strategies introduce low flash-level locality
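One way to read an allocation order such as CWDP is as a mixed-radix counter over the four resources. The following is a minimal sketch, assuming that the first letter of the order is the resource that changes fastest for consecutive pages; the function name `allocate` and the 2x2x2x2 geometry are hypothetical and only mirror the small example drawn in the figures.

```python
# Hypothetical geometry: 2 channels, 2 ways, 2 dies, 2 planes.
GEOMETRY = {"C": 2, "W": 2, "D": 2, "P": 2}


def allocate(page_index, order, geometry=GEOMETRY):
    """Map a sequential page index to resource indices.

    order is a string such as "CWDP"; its first letter is treated as the
    resource that varies fastest (an assumption of this sketch).
    """
    location = {}
    for resource in order:
        # Peel off one digit of the mixed-radix number per resource.
        page_index, location[resource] = divmod(page_index, geometry[resource])
    return location


for i in range(4):
    print(i, allocate(i, "CWDP"), allocate(i, "PWCD"))
```

With CWDP, pages 0 and 1 land on different channels and page 2 moves to the next way; with PWCD, pages 0 and 1 land on the two planes of the same die, which is what makes plane sharing possible for back-to-back pages.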
![Page 19: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/19.jpg)
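Way-first Page Allocation
[Figure: two channels (CH A, CH B), each with WAY 0 and WAY 1; each NAND flash chip holds DIE 0 and DIE 1, each with Plane 0 and Plane 1. The highlighted example is WDCP (Way-Die-Channel-Plane), with arrows for way pipelining, die interleaving, channel striping, and plane sharing.]
• Way-first orders listed: WCPD, WDCP, WDPC, WPCD, WPDC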
![Page 20: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/20.jpg)
Way-first Page Allocation
• This allocation
1. assigns the way resources in a channel
2. stripes all the requests over multiple ways
3. interleaves the flash-level resources
• Although these strategies allocate a system-level resource first, some of them favor flash-level resources
![Page 21: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/21.jpg)
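Die-first Page Allocation
[Figure: two channels (CH A, CH B), each with WAY 0 and WAY 1; each NAND flash chip holds DIE 0 and DIE 1, each with Plane 0 and Plane 1. The highlighted example is DPWC (Die-Plane-Way-Channel), with arrows for die interleaving, die interleaving with multiplane, way pipelining, and channel striping.]
• Die-first orders: DPWC, DCPW, DCWP, DPCW, DWCP, DWPC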
![Page 22: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/22.jpg)
Die-first Page Allocation
• The die-first page allocation schemes favor exploiting the die-interleaving method
• They can also accommodate system-level resources instead of flash-level resources, depending on the access patterns (DCWP/DWCP)
![Page 23: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/23.jpg)
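Plane-first Page Allocation
[Figure: two channels (CH A, CH B), each with WAY 0 and WAY 1; each NAND flash chip holds DIE 0 and DIE 1, each with Plane 0 and Plane 1. The highlighted example is PWCD (Plane-Way-Channel-Die), with arrows for plane sharing, way pipelining, channel striping, and die interleaving with multiplane.]
• Plane-first orders listed: PWCD, PCWD, PDCW, PDWC, PWDC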
![Page 24: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/24.jpg)
Plane-first Page Allocation
• It parallelizes data accesses with plane sharing, which can in turn improve the storage throughput
• An excellent option for realizing the benefits of both inter- and intra-request parallelisms
![Page 25: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/25.jpg)
Channel-first vs plane-first
• Assumption in this example
– Legacy: 200 us (1 page)
– Plane sharing: 240 us (2 pages)
• Channel-first page allocation: total latency 200 us
• Plane-first page allocation: total latency 240 us
[Figure: placement of one request over CH A and CH B under each allocation.]
![Page 26: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/26.jpg)
Channel-first vs plane-first
• Assumption in this example
– Legacy: 200 us (1 page)
– Plane sharing: 240 us (2 pages)
• Channel-first page allocation: Req1 200 us, Req2 400 us
• Plane-first page allocation in favor of way pipelining: Req1 240 us, Req2 240 us
[Figure: placement of two requests over CH A and CH B under each allocation.]
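The latencies in the two channel-first vs plane-first examples can be reproduced with a few lines of arithmetic. This is a sketch under assumptions that go beyond the slide: every request is two pages and arrives at time 0, a chip serves its operations one after another, bus transfer time is ignored, and the chip names ("A0" for channel A, way 0, and so on) and the placement of each request are hypothetical choices that match the stated results.

```python
LEGACY_US = 200         # one page on one chip (from the slide)
PLANE_SHARING_US = 240  # two pages on one chip via plane sharing (from the slide)


def finish_times(requests):
    """Return the completion time (us) of each request.

    requests: list of requests; each request is a list of (chip, pages) ops.
    Chips serve ops serially; a request completes when its last op completes.
    """
    chip_free = {}
    results = []
    for ops in requests:
        done = 0
        for chip, pages in ops:
            cost = LEGACY_US if pages == 1 else PLANE_SHARING_US
            end = chip_free.get(chip, 0) + cost
            chip_free[chip] = end
            done = max(done, end)
        results.append(done)
    return results


# Channel-first: each request is striped over both channels, one page per chip.
channel_first = [[("A0", 1), ("B0", 1)], [("A0", 1), ("B0", 1)]]
# Plane-first in favor of way pipelining: each request uses both planes of one chip.
plane_first = [[("A0", 2)], [("A1", 2)]]

print(finish_times(channel_first[:1]), finish_times(plane_first[:1]))  # [200] [240]
print(finish_times(channel_first), finish_times(plane_first))  # [200, 400] [240, 240]
```

For a single request, channel-first wins (200 us against 240 us). With two requests, the second channel-first request queues behind the first on the same chips, while the plane-first requests occupy different chips and both finish in 240 us.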
![Page 27: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/27.jpg)
Way-first (WPCD) vs. plane-first (PWCD)
• The system-level and flash-level parallelism are the same
• The performance of the different page allocation strategies varies with the access pattern
[Figure: two layouts, each with CH A and CH B and WAY 0 and WAY 1.]
![Page 29: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/29.jpg)
SSD Setup
• NAND Flash Chip
– Fine-grained NAND commands
– Advanced commands
– Strong address constraints
– Intrinsic latency variation
• SSD Framework
– 8 channels, 8 flash packages per channel (64 total)
– Dual-die package format, 32-entry queue
– Page-level mapping and a greedy garbage collection algorithm
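The slide names the garbage collection policy but does not spell it out. As a minimal sketch, assuming the common textbook form of greedy garbage collection (reclaim the block with the fewest valid pages, since it costs the fewest page copies), victim selection could look like the following; the function name `pick_victim` and the block counts are hypothetical.

```python
def pick_victim(valid_pages):
    """Return the block id with the fewest valid pages.

    valid_pages: dict mapping block id -> number of still-valid pages.
    Ties go to the block that appears first in the dict.
    """
    return min(valid_pages, key=valid_pages.get)


# Hypothetical state: block 1 holds only 3 valid pages, so it is cheapest to reclaim.
blocks = {0: 12, 1: 3, 2: 7}
print(pick_victim(blocks))  # 1
```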
![Page 30: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/30.jpg)
Performance Comparison
• Way-first and flash-level-resource-first pallocs deliver better IOPS than the channel-first palloc
• The channel-first palloc provides shorter latencies than the flash-level-resource-first pallocs
[Figure: two bar charts, [Normalized Latency] and [Normalized IOPS], comparing CWDP, WCPD, DPCW, and PDCW on the workloads msnfs, usr, fin1, web, fin2, sql0, sql1, sql2, and sql3.]
![Page 31: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/31.jpg)
Parallelism Breakdown
• Low flash-level parallelism is observed under palloc schemes in favor of channel
• They render advanced flash command compositions difficult at runtime (due to low flash-level localities)
[Figure: [Parallelism breakdown]. Stacked bars for the 24 page allocation strategies, grouped as channel-first, die-first, plane-first, and way-first. Y-axis: the fraction of parallel data access method type (%), 0 to 100. Categories: die interleaving with multiplane write/read, plane sharing write/read, die interleaving write/read, striped legacy write/read. An annotation marks high parallelism.]
![Page 32: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/32.jpg)
Resource Utilization
• Channel resources
– Are utilized at about 43.1% on average with most parallel data access methods
• Idle time
– About 80% of the total execution time is spent idle
[Figure: three charts over the 24 page allocation strategies (CDPW to WPDC), grouped as channel-first, die-first, plane-first, and way-first pallocs. [Write Intensive -- Execution breakdown] and [Read Intensive -- Execution breakdown]: execution time fraction (%) split into idle, flash-level conflict, bus contention, bus activate, and flash cell activate. [Average Channel Utilization]: percentage of channel utilization (%) for msnfs, usr, fin1, web, fin2, sql0, sql1, sql2, and sql3.]
![Page 33: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/33.jpg)
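Optimization Point
[Figure: design space with the axes "Lower latency" and "Higher throughput". CWDP, PWCD, DPWC, and WDCP are plotted together with an IDEAL point; arrows are labeled "Flash-level parallelism", "Avoiding resource conflicts", and "Maximizing resource utilization".]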
![Page 35: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/35.jpg)
Host Interface Overview
• Serial AT Attachment (SATA)
– The most popular interface for SSDs
– 600 MB/sec (SATA 3.0)
• Non-Volatile Memory Express (NVMe)
– PCI Express based storage management protocol
– Around 1 GB/sec per lane, with up to 16 lanes (PCIe 3.0)
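The two bandwidth figures follow from the link rate and the line encoding. The sketch below is a back-of-the-envelope check, not part of the lecture: it uses the published link parameters (SATA 3.0 at 6 Gbit/s with 8b/10b encoding, PCIe 3.0 at 8 GT/s per lane with 128b/130b encoding) and ignores protocol overhead above the encoding layer; the function name `payload_mb_per_s` is hypothetical.

```python
def payload_mb_per_s(line_rate_gbps, payload_bits, total_bits, lanes=1):
    """Usable bandwidth in MB/s after line encoding, summed over all lanes."""
    raw_mb_per_s = line_rate_gbps * 1000 / 8  # Gbit/s -> MB/s
    return raw_mb_per_s * payload_bits / total_bits * lanes


sata3 = payload_mb_per_s(6.0, 8, 10)                   # 8b/10b encoding
pcie3_x1 = payload_mb_per_s(8.0, 128, 130)             # 128b/130b encoding
pcie3_x16 = payload_mb_per_s(8.0, 128, 130, lanes=16)

print(f"SATA 3.0     : {sata3:.0f} MB/s")
print(f"PCIe 3.0 x1  : {pcie3_x1:.0f} MB/s")
print(f"PCIe 3.0 x16 : {pcie3_x16:.0f} MB/s")
```

The result is 600 MB/s for SATA 3.0 and roughly 985 MB/s per PCIe 3.0 lane, so a 16-lane link offers about 26 times the SATA bandwidth before higher-level protocol overhead.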
![Page 36: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/36.jpg)
SATA
• The host connects through the Advanced Host Controller Interface (AHCI)
• Bus overheads add 1 us to each command
• Throughput is also serialized
• Designed around conventional spinning disks
• 32 entries in the device-level queue (Native Command Queuing)
[Figure: CPU connected over the CPU bus / DMI (20 Gbps) to the PCH chipset; AHCI attaches the SATA/SAS SSDs (NAND behind each SSD) over SATA / SAS links (1.5 to 6 Gbps).]
![Page 37: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/37.jpg)
Need for High-Performance Interfaces
• Storage interface as a bridge between host and storage
– Traditional SATA and SAS have been widely employed
• Storage-internal bandwidths keep increasing
– Thanks to increased resources and parallelism
• Traditional interfaces failed to deliver these very high bandwidths
• The trend moves from upgrading traditional interfaces to devising new high-performance interfaces
[Figure: host system, interface, and storage system (NVMs). More resources and more parallelism give higher bandwidth inside the storage, which makes the interface the performance bottleneck; SATA/SAS is replaced by PCIe.]
![Page 38: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/38.jpg)
NVM Express
• The host connection is Peripheral Component Interconnect Express (PCIe)
• The PCIe bus connection still requires an SSD controller chip, but it does not need a SATA/AHCI controller
• Throughput flows in parallel along each available PCIe lane
[Figure: CPU with built-in PCIe connected to an NVMe SSD (NAND behind the SSD controller) over PCIe (5 Gbps / lane); the PCH chipset is attached over the CPU bus / DMI (20 Gbps).]
![Page 39: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/39.jpg)
NVMe’s Rich Queuing Mechanism
• Traditional interfaces provide a single I/O queue with tens of entries
– Native Command Queuing (NCQ) with 32 entries
• NVMe strives to increase throughput by providing a scalable number of queues with a scalable number of entries
– Up to 64K queues with up to 64K entries
• NVMe queue configurations in host-side memory
– Pairs of Submission Queue (SQ) and Completion Queue (CQ)
– Per-core, per-process, or per-thread
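The per-core queue pairs can be pictured with a toy model. This is only an illustrative sketch, not the NVMe data structures or any driver API: the class `QueuePair`, its methods, and the `doorbell` counter are hypothetical, the queues are plain Python deques rather than ring buffers in host memory, and the device is simulated by a method call.

```python
from collections import deque


class QueuePair:
    """Toy submission/completion queue pair owned by a single core."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = deque()   # submission queue (host -> device)
        self.cq = deque()   # completion queue (device -> host)
        self.doorbell = 0   # count of commands the host has announced

    def submit(self, command):
        if len(self.sq) >= self.depth:
            raise BufferError("submission queue full")
        self.sq.append(command)
        self.doorbell += 1  # stands in for the doorbell register write

    def device_process(self):
        # Simulated device: fetch every pending command and post a completion.
        while self.sq:
            self.cq.append(("done", self.sq.popleft()))

    def reap(self):
        completed = list(self.cq)
        self.cq.clear()
        return completed


# One pair per core: neither core touches the other's queues, so no lock is shared.
queues = {core: QueuePair(depth=4) for core in ("core A", "core B")}
queues["core A"].submit("read LBA 0")
queues["core B"].submit("write LBA 8")
for pair in queues.values():
    pair.device_process()
print(queues["core A"].reap())  # [('done', 'read LBA 0')]
print(queues["core B"].reap())  # [('done', 'write LBA 8')]
```

Because each core owns its pair, submissions from different cores never contend, which is the point of the per-core, per-process, or per-thread configurations listed above.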
![Page 40: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/40.jpg)
High-level View of Comparison
• Scalable queue (for multiple cores)
• Lock-less queue management
[Figure: AHCI SSD vs. NVMe SSD. AHCI: Core A and Core B on the host share a single queue with 32 command entries (command list, command table). NVMe: each core has its own issue queue and complete queue, with up to 64K queues of 64K depth. Arrows mark I/O issue, I/O complete, and I/O interrupt.]
![Page 41: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/41.jpg)
I/O Stack Comparison
• SATA requests traverse
– Block layer
– AHCI driver
– Host bus
– AHCI host bus adapter (HBA)
• NVMe requests bypass all conventional modules and go directly to the PCIe root complex
[Figure: storage stack, AHCI connection vs. NVMe connection. User mode: application. Kernel mode: block layer and AHCI driver on the AHCI path, NVMe driver on the NVMe path. Hardware: AHCI HBA on the AHCI path, PCIe root port on the NVMe path. Device: SATA controller or NVMe controller in front of the NAND controller and NAND.]
![Page 42: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/42.jpg)
Communication Protocol
[Figure: message sequences between the host side and the SSD side, with time flowing downward. The steps are labeled in this order:]
• [I/O Write] DB-write, IO-Req, IO-Fetch, WR-DMA, SSD WR (SSD-PROC), CPL-Submit, MSI
• [I/O Read] DB-write, IO-Req, IO-Fetch, SSD RD (SSD-PROC), RD-DMA, CPL-Submit, MSI
![Page 43: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/43.jpg)
Protocol Comparison
• DB-Write
– Doorbell (DB) register based communication
– Removes all register reads, each of which consumes approximately 2000 CPU cycles
• MSI
– Software-based vector interrupt
– Ensures that no specific core becomes the IOPS bottleneck
• Less synchronization and less locking
– One DB per queue, which increases parallelism
– Removes the synchronization lock needed to issue commands
![Page 45: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/45.jpg)
Case Study #1: Physically Addressed Queueing (PAQ): Improving Parallelism in Solid State Disks
![Page 46: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/46.jpg)
Motivation
Background & Problems
PAQ
Evaluation Results
Conclusion
![Page 47: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/47.jpg)
• Observation
– SSD performance varies depending on how data accesses are parallelized
– Writes fully enjoy internal parallelism, but reads suffer from resource contention
• Problem
– Virtually addressed queuing is insufficient for scheduling incoming I/O requests
• Our solution
– Expose the physical address space to the scheduler and avoid internal resource contention
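The idea of scheduling with physical addresses can be illustrated with a toy selection step. This is not the PAQ algorithm itself, only a sketch of why physical addresses help: once the scheduler knows which chip each queued request targets, it can issue a batch in which no two requests collide on the same chip. The function name `pick_conflict_free`, the request names, and the chip labels are hypothetical.

```python
def pick_conflict_free(queue, physical_chip):
    """Select queued requests, in arrival order, that target distinct chips.

    queue         : list of request ids in arrival order
    physical_chip : dict mapping request id -> chip it will access
    """
    busy = set()
    batch = []
    for request in queue:
        chip = physical_chip[request]
        if chip not in busy:  # skip requests that would contend for a chip
            busy.add(chip)
            batch.append(request)
    return batch


queue = ["r0", "r1", "r2", "r3"]
physical_chip = {"r0": "A0", "r1": "A0", "r2": "B0", "r3": "A1"}
print(pick_conflict_free(queue, physical_chip))  # ['r0', 'r2', 'r3']
```

A scheduler that sees only virtual addresses cannot tell that r0 and r1 map to the same chip, so it would issue them together and r1 would stall.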
![Page 49: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/49.jpg)
Use-cases of High-speed SSDs
ADVANTAGE
• FASTER: 100x more throughput than a 15K RPM disk
• ENERGY EFFICIENT: requires 33% less power than an HDD
DISADVANTAGE
• EXPENSIVE: a server SSD is $30/GB; a 10K SAS HDD is about $1/GB
• LIFETIME LIMIT: NAND flash memory cells wear out with overuse
SSDs are considered for workloads rife with reads or as a cache
[Figure: HPC-Enterprise and Hybrid-SSD products. Images: Intel, thessdreview.com]
![Page 50: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/50.jpg)
Reads vs. Writes in bare NAND Flash memory
• NAND flash is biased towards reads

| PERF. METRIC | OPERATION | VALUE |
|---|---|---|
| LATENCY | WRITE | 440 ~ 5000 us |
| | READ | 25 us (80x faster, typ.) |
| | ERASE | 2500 us |
| BANDWIDTH | WRITE | 2.2 MB/sec |
| | READ | 26.7 MB/sec (13x faster) |

[SK-Hynix 32nm MLC NAND flash]
• Care needed for writes on NAND flash
– Requires an erase operation before a write
– Requires garbage collection or block merges (a set of erase, read, and write operations) for new writes
![Page 51: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/51.jpg)
Two Divergent Research Directions
• Internal: research working to improve writes
– Garbage collection scheduling
– Flash firmware mapping algorithm
– Write buffer management
• External: research developing mechanisms that capitalize on read performance (avoiding the high penalties of writes)
[Image: Thinkpads.com]
![Page 52: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/52.jpg)
Read vs. Write in an SSD
[Figure: four plots for SSD-A and SSD-B showing bandwidth (MB/s) and average response time (ms) versus transfer size (512B to 128K), each with Rand. Read, Seq. Read, Rand. Write, and Seq. Write series. Annotations: "At least 25%" and "At least 56%".]
Read performance depends on the rigid data layout and access sequence, but write sequences can be remapped and so easily reap the benefit of parallelism
![Page 53: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/53.jpg)
![Page 54: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/54.jpg)
SSD & NAND Flash Internals
[Figure: SSD internals: the HOST connects through the host interface controller to the embedded processors, which drive four channels (Channel 1 to Channel 4); each channel has a controller (CTRL) and NAND flash packages. Flash package internals: Die 0 to Die 3 behind a multiplexed interface. Die internals: Plane 0 to Plane j, each with k blocks (k*j blocks in total), a NAND flash memory array, a data register, and a cache register.]
![Page 55: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/55.jpg)
Software Stack of an SSD
[Figure: software stack with the layers HIL (containing PHY and QBM), FTL, and HAL, annotated with the virtual and physical address spaces]
• Host Interface Layer (HIL)
– Responsible for communication
– Row protocols handled by PHY
– Queue and Buffer Management (QBM) handled by APP
• Flash Translation Layer (FTL)
– Address translation between the host address space and physical addresses
• Hardware Abstraction Layer (HAL)
– Committing flash transactions to the underlying flash memory chips
[Image:micron.com]
HIL is oblivious of physical address
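As a rough illustration of the translation step performed by the FTL, the sketch below models a page-level mapping table in Python. The class name `PageLevelFTL`, the flat physical page numbering, and the append-only allocation policy are assumptions made for this example, not details taken from the slides.

```python
class PageLevelFTL:
    """Toy page-level FTL: maps logical page numbers (LPN) to physical page numbers (PPN)."""

    def __init__(self, num_physical_pages):
        self.mapping = {}        # LPN -> PPN
        self.invalid = set()     # PPNs holding stale data, to be reclaimed by GC
        self.next_free = 0       # assumption: pages are allocated in append order
        self.num_physical_pages = num_physical_pages

    def write(self, lpn):
        if self.next_free >= self.num_physical_pages:
            raise RuntimeError("no free page: garbage collection would be needed")
        old = self.mapping.get(lpn)
        if old is not None:
            self.invalid.add(old)    # out-of-place update: the old copy becomes invalid
        self.mapping[lpn] = self.next_free
        self.next_free += 1
        return self.mapping[lpn]

    def read(self, lpn):
        return self.mapping[lpn]     # raises KeyError if the page was never written


ftl = PageLevelFTL(num_physical_pages=8)
ftl.write(5)    # LPN 5 is placed on PPN 0
ftl.write(7)    # LPN 7 is placed on PPN 1
ftl.write(5)    # rewrite: LPN 5 moves to PPN 2 and PPN 0 becomes invalid
```

The host only ever sees LPNs, so a queue kept inside the HIL cannot tell which channel or die a request will eventually land on.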
![Page 56: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/56.jpg)
Conventional Scheduling
[Figure: a device-level queue of six requests (tags 1 to 6) held in virtual-address order, with their physical addresses spread over Channel 1 to Channel 4 at the package and die level. Result under conventional scheduling: 7 parallelized I/O groups and 2 die interleavings.]
I/O request scheduling is not efficient because HIL is sitting on the virtual address space
![Page 57: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/57.jpg)
Multiplane mode operation
• Plane-level parallelism can be achieved only via the multi-plane mode advanced command
• Two issues in building advanced commands at runtime
– Conventional virtual addressed queuing (VAQ) is ignorant of physical addresses
– FTL and flash firmware are oblivious of the upper device-level queue and the requests therein
[Figure: two channels (CH 1, CH 2), each with a package (Package 1, Package 2) of two dies (Die 0, Die 1) with even and odd planes; queued requests are listed by tag ID, virtual address, and physical address. Result: 2 multiplane operations.]
What if the HIL or the virtual addressed scheduler knew the physical address space?
![Page 58: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/58.jpg)
![Page 59: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/59.jpg)
High Level View of PAQ
[Figure: software stack (HIL with PHY, FTL, HAL) across the virtual and physical address spaces, with QBM relocated from the HIL to beneath the FTL]
[Image:micron.com]
Moving the QBM layer out of the HIL and beneath the FTL
QBM migration exposes physical addresses to our scheduler, PAQ (Physical Addressed Queuing)
![Page 60: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/60.jpg)
High Level View of PAQ
• Identify requests that will cause conflicts
• Build a group of requests that do not conflict with each other, called a clump
• Pack transactions based on the physical layout of the I/O requests
[Figure: software stack (HIL with PHY, FTL, QBM, HAL) across the virtual and physical address spaces]
![Page 61: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/61.jpg)
Clump Composition
• Lower-level conflicts are most costly!!
• Building clumps
1. Add transactions incurring conflicts in the lowest levels first
2. Never schedule a clump that contains die- or package-level conflicts
3. Continue adding transactions to the clump, prioritizing low-level conflicts, until no more can be added without breaking #2.
PAQ attempts to build clumps in a bottom-up, conflict-first fashion such that the lowest level with contention does not have conflicting transactions in the clump
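The clump-building rules can be sketched in Python as follows. This is a simplified reading, not the authors' implementation: it assumes a conflict means two transactions targeting the same die, and it approximates "conflict-first" by visiting the most contended dies first. The names `Txn`, `die_of`, and `build_clumps` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Txn(NamedTuple):
    tag: int
    channel: int
    package: int
    die: int


def die_of(txn):
    # Assumption: two transactions conflict when they target the same die.
    return (txn.channel, txn.package, txn.die)


def build_clumps(queue):
    pending = list(queue)
    clumps = []
    while pending:
        # Conflict-first: visit transactions on the most contended dies first.
        contention = Counter(die_of(t) for t in pending)
        ordered = sorted(pending, key=lambda t: -contention[die_of(t)])
        busy, clump = set(), []
        for t in ordered:
            if die_of(t) not in busy:    # rule 2: no die-level conflict inside a clump
                busy.add(die_of(t))
                clump.append(t)
        chosen = {t.tag for t in clump}
        pending = [t for t in pending if t.tag not in chosen]
        clumps.append(clump)
    return clumps


queue = [Txn(1, 0, 0, 0), Txn(2, 0, 0, 0), Txn(3, 1, 0, 0), Txn(4, 0, 0, 1)]
clump_tags = [[t.tag for t in clump] for clump in build_clumps(queue)]
```

In this example tags 1 and 2 target the same die, so tag 2 is deferred to a second clump while tags 1, 3, and 4 are issued together.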
![Page 62: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/62.jpg)
Physical Address Queueing
[Figure: the same six-request queue (tags 1 to 6) as on the Conventional Scheduling slide, now scheduled by physical address across Channel 1 to Channel 4. Result: 3 parallelized I/O groups and 5 die interleavings.]
![Page 63: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/63.jpg)
Plane Packing
• PAQ knows both the device-level queue and physical addresses
• It parses requests in the queue and sends each transaction in favor of multi-plane mode operations
[Figure: two channels (CH 1, CH 2), each with a package (Package 1, Package 2) of two dies (Die 0, Die 1) with even and odd planes; queued requests are listed by tag ID, virtual address, and physical address. Result: 4 multiplane operations.]
Physical Address Queuing can schedule & pack multiple transactions into one advanced command at runtime
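A minimal sketch of plane packing, assuming each die has one even and one odd plane and that any even/odd pair on the same die may share a multi-plane command. Real NAND chips impose stricter address constraints on multi-plane operations, which this example ignores. `Txn` and `pack_planes` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class Txn(NamedTuple):
    tag: int
    channel: int
    die: int
    plane: int    # plane index within the die; parity selects even or odd


def pack_planes(queue):
    # Bucket queued transactions per die, split into even-plane and odd-plane lists.
    per_die = defaultdict(lambda: ([], []))
    for txn in queue:
        per_die[(txn.channel, txn.die)][txn.plane % 2].append(txn)

    commands = []
    for even, odd in per_die.values():
        paired = min(len(even), len(odd))
        # One multi-plane command per even/odd pair on the same die.
        commands.extend(zip(even[:paired], odd[:paired]))
        # Leftover transactions are issued as single-plane commands.
        commands.extend((t,) for t in even[paired:] + odd[paired:])
    return commands


queue = [Txn(1, 0, 0, 0), Txn(2, 0, 0, 1), Txn(3, 0, 1, 0), Txn(4, 1, 0, 1), Txn(5, 0, 0, 0)]
packed_tags = [tuple(t.tag for t in cmd) for cmd in pack_planes(queue)]
```

Packing is only possible because the scheduler sees both the whole queue and the physical plane of every transaction at the same time.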
![Page 64: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/64.jpg)
![Page 65: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/65.jpg)
SSD Setup
• NAND Flash Chip
– Fine-grained NAND command
– Advanced commands
– Strong address constraints
– Intrinsic latency variation
• SSD Framework
– 8 channels, 8 flash per channel (64 total)
– Dual-die package format, 32 entry queue
– A page-level mapping algorithm
![Page 66: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/66.jpg)
Configurations & Traces
• Queuing strategies
– VAQ: Default queuing scheme (Virtual Address)
– PAQ0: PAQ, only using plane-packing
– PAQ1: PAQ, only using clumping
– PAQ2: PAQ, using both plane-packing and clumping
• Traces
– fin: online transaction
– web: search engine
– usr: shared directory
– prn: print serving
– sql: database
– msnfs: file storage servers
![Page 67: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/67.jpg)
Aggregate Performance - Bandwidth
• PAQ improves read performance by about 45% (100MB/sec) compared to the VAQ scheduler for web workloads (90% random reads)
• PAQ2 never hurts performance for any workload, whether read- or write-oriented
[IOPS – the number of host-level I/O requests per sec]
[Bandwidth KB/sec]
45% improvement
Worst performance trace: writes are intermixed with small reads
PAQ2 shows 1.41 times better performance
![Page 68: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/68.jpg)
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
– Parallelism-Aware Host Interface I/O Scheduler
– GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
![Page 69: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/69.jpg)
Case Study #2: Host Interface Assisted Garbage Collection Scheduler
1. Taking Garbage Collection Overheads off the Critical Path in SSDs, Middleware'12
2. HIOS: A Host Interface I/O Scheduler for Solid State Disks, ISCA'14
![Page 70: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/70.jpg)
Outline
• Motivation
• Worst-case latency analysis
• Background garbage collection (GC)
– Advanced GC and delayed GC
– Incremental GC
• Garbage collection scheduling
– Slack stealing
– GC overhead redistribution
![Page 71: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/71.jpg)
• Observation
– Garbage collection (GC) is the critical performance bottleneck for SSDs
• Bandwidth imposed by GC is 4x worse than normal case I/O operations
• Latency imposed by GC is 8x ~ 10x longer than normal I/O access time
– The presence of idle I/O times in workloads can be exploited by shifting garbage collections from busy periods to other periods
• Our solution
– Removing on-demand GCs from the critical path and securing free blocks in advance
– Delaying on-demand GC to the next idle period
![Page 72: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/72.jpg)
Motivation
• Solid State Drives!!
– Faster than any conventional block device
– Overwriting a page is not allowed before erasing its block, which is a set of pages
Write Cliff
Garbage Collection
Write performance of modern SSDs is significantly degraded after garbage collection begins
![Page 73: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/73.jpg)
SSDs Always Faster than Disks?
• Average and worst-case latency of 7 SSDs and 4 disks
• As we commonly know, SSDs show better average latency
• However, worst-case latencies of SSDs are much higher
• Some I/Os showing the worst-case latency may violate QoS
[Figure: average and worst-case latency of <7 SSDs> and <4 Disks>]
![Page 74: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/74.jpg)
How Often Worst-case Latencies?
• Two Samsung SSDs are written using Intel Iometer
• Performance variation = worst-case latency – average latency
• As we use SSDs more and more,
– SLC: worst-case latencies are observed repeatedly and frequently
– MLC: worst-case latencies are not only observed repeatedly but also get worse
• Worst-case latencies are frequent and get worse over time
[Figure: performance variation over time against the average latency for <SAMSUNG SLC 120GB SSD> and <SAMSUNG MLC 256GB SSD>]
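The performance variation metric defined on this slide can be computed from a latency trace as below. The sample values are made up for illustration and are not measurements from the slides.

```python
def performance_variation(latencies_ms):
    """Worst-case latency minus average latency, as defined on the slide."""
    worst = max(latencies_ms)
    average = sum(latencies_ms) / len(latencies_ms)
    return worst - average


# Hypothetical latency samples in milliseconds: three fast I/Os and one outlier.
samples_ms = [1.0, 1.0, 2.0, 12.0]
variation = performance_variation(samples_ms)   # 12.0 - 4.0 = 8.0
```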
![Page 75: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/75.jpg)
Worst-case Latency affects Throughput?
• Samsung and OCZ SSDs are written using Intel Iometer
• Worst-case latency and throughput are measured over time
• At some point in time, the Write Cliff is observed
– Worst-case latency gets significantly worse (by 40x)
– (At the same time) throughput severely degrades (by 64%)
• Worst-case latencies directly affect throughput degradation
[Figure: worst-case latency and throughput over time for <SAMSUNG 830> and <OCZ Vertex 3>]
![Page 76: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/76.jpg)
Latency Impact: Empirical Experimental Results
• 256GB MLC-Based SSD
– 128 * 2 DRAM Buffer
– Dual core, 8 channels, and 64 flash packages
– Device-level latency is captured with ULINK Drive Master
• GC impact
– Warm-up: 1MB random writes for the whole region
Pristine State Performance
![Page 77: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/77.jpg)
So far.. & Our Approach
[Without GCs]
[With GCs]
[Our Approach]
![Page 78: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/78.jpg)
Goals
• Making Garbage Collection (GC) Overheads Invisible to Users
– While using our GC strategies, applications do not experience GC overheads
• Avoiding additional GC operations
– Our GC strategies only schedule GC operations that would be invoked soon
• Compatibility with underlying FTL schemes
– No extra NVM buffer is needed, and the main address mapping policy of the FTL is unchanged
![Page 79: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/79.jpg)
Shifting Garbage Collection
• Advanced GC strategy (AGC)
– Removing on-demand GCs from the critical path and securing free blocks in advance
– 2 components: Look-ahead GC and Proactive Block Compaction
• Delayed GC strategy (DGC)
– Handling the cases where idleness does not occur frequently and AGC fails to secure free blocks
– Delaying on-demand GC to the next idle period
![Page 80: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/80.jpg)
Device-level short idleness
Long idleness imposed by host
![Page 81: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/81.jpg)
Device-level Short Idleness Utilization
• Leveraging the device-level queue and pre-arrival information
• 3~17 command tags arrive in parallel (LeCroy commercial protocol analyzer)
• Look-ahead GC (a part of AGC) is executed using short idleness
Previous I/O Expected Execution Time
![Page 82: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/82.jpg)
Long Idleness Utilization
• 38% ~ 83% of instructions experience idle periods longer than 1 sec
• DGC and Proactive Block Compaction (another part of AGC) are performed
![Page 83: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/83.jpg)
Details of AGC
• Look-ahead GC
– Predict on-demand GCs based on incoming host requests and mapping information
– Look-ahead GC is executed only if the short idle period is longer than the predicted GC latency
• Proactive Block Compaction
– Reclaiming blocks, which are fully occupied by contents, during long idle periods
[Figure: GC latency split into Valid Page Migration Time and Block Cleaning Time]
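The look-ahead condition can be sketched as a simple comparison between the available idle time and a predicted GC latency. The latency model below (one read plus one program per valid page for migration, one erase for block cleaning) and the choice of the 25 us, 440 us, and 2500 us figures from the bare NAND flash slide are assumptions for illustration; the function names are hypothetical.

```python
# Per-operation latencies in microseconds, borrowed from the bare NAND flash slide.
# Using the lower bound of the write latency range is an assumption.
READ_US = 25
PROGRAM_US = 440
ERASE_US = 2500


def predicted_gc_latency_us(valid_pages):
    migration = valid_pages * (READ_US + PROGRAM_US)   # valid page migration time
    cleaning = ERASE_US                                # block cleaning time
    return migration + cleaning


def should_run_lookahead_gc(idle_us, valid_pages):
    # Run look-ahead GC only if the short idle period is longer than the predicted GC latency.
    return idle_us > predicted_gc_latency_us(valid_pages)


latency_4_pages = predicted_gc_latency_us(4)   # 4 * 465 + 2500 = 4360 us
```

With four valid pages to migrate, a 5000 us idle window would admit the collection while a 4000 us window would not.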
![Page 84: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/84.jpg)
Details of DGC
• Update Block Replacement
– GC need not run at the same time as writes
– Skip the time-consuming tasks of GC (page migration) and serve the urgent I/O request first
– Put the on-demand page migration into the DGC list and switch to another update block
• Retroactive Block Compaction
– Resume the page migration activity during long idle time and return the update block (used for DGC)
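A minimal sketch of the delayed GC bookkeeping, assuming a small pool of spare update blocks and a FIFO DGC list. The class `DelayedGC`, its method names, and the policy of returning a cleaned block to the spare pool are illustrative assumptions, not the authors' design.

```python
from collections import deque


class DelayedGC:
    def __init__(self, spare_update_blocks):
        self.spare = list(spare_update_blocks)   # update blocks available for replacement
        self.dgc_list = deque()                  # blocks whose page migration was postponed

    def on_demand_gc(self, full_block):
        """Called when a write would trigger GC; defer the migration if possible."""
        if not self.spare:
            return None                          # no spare block: GC must run now
        self.dgc_list.append(full_block)         # postpone the costly page migration
        return self.spare.pop()                  # serve the write from another update block

    def on_long_idle(self, migrate):
        """Retroactive block compaction: finish the postponed migrations."""
        while self.dgc_list:
            block = self.dgc_list.popleft()
            migrate(block)
            self.spare.append(block)             # cleaned block is available again


dgc = DelayedGC(["U1", "U2"])
first = dgc.on_demand_gc("B0")     # deferred, write goes to "U2"
second = dgc.on_demand_gc("B1")    # deferred, write goes to "U1"
third = dgc.on_demand_gc("B2")     # no spare block left
migrated = []
dgc.on_long_idle(migrated.append)
```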
![Page 85: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/85.jpg)
Incremental Garbage Collection
• In the case of long idle periods, there is no pre-arrival information and no advantage from the device-level queue (it is empty)
• AGC and DGC employ Incremental Garbage Collection
– GC activities are split into multiple sub-collections delimited by checkpoints
– Checkpoint (CP): check whether further collection can be performed
[Figure: valid pages of a data block and its update block are migrated to a target block; checkpoints (CP) separate the sub-collections, and I/O requests arriving in the device-level queue are detected at a checkpoint]
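The checkpoint idea can be sketched as a loop that migrates a few valid pages at a time and stops as soon as the device-level queue is no longer empty. The function name, the `pages_per_step` parameter, and the callback-based queue check are assumptions for illustration.

```python
def incremental_gc(valid_pages, queue_is_empty, pages_per_step=2):
    """Migrate valid pages in sub-collections; stop at a checkpoint if I/O has arrived."""
    migrated = []
    remaining = list(valid_pages)
    while remaining:
        if not queue_is_empty():          # checkpoint: can further collection be performed?
            break
        step, remaining = remaining[:pages_per_step], remaining[pages_per_step:]
        migrated.extend(step)             # one sub-collection
    return migrated, remaining


# Hypothetical run: the queue is empty at the first two checkpoints, then a request arrives.
answers = iter([True, True, False])
done, left = incremental_gc([1, 2, 3, 4, 5, 6], lambda: next(answers))
```

The pages left over would be picked up again in the next idle period, so a long collection never blocks a newly arrived request for more than one sub-collection.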
![Page 86: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/86.jpg)
Experimental Setup
• 4-channel, 16-flash-chip, bus-level transaction SSD simulation
– 6 volumes of SSD array
• FTL implementation
– L-FTL: Log-structured block mapping FTL
– H-FTL: Superblock-style block and page hybrid mapping FTL
– P-FTL: Partial block cleaning FTL (16% more flash blocks)
• Garbage Collection Strategies
– Baseline: Block-merge type garbage collector
– AGC: Advanced GC strategy only
– DGC: Delayed GC strategy only
– AGC+DGC: putting our GC schemes together
![Page 87: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/87.jpg)
Low Write Intensive Workload of Microsoft File Server Storage
• For the low write-intensive workload, AGC successfully hides GC overheads
[Baseline GC] [AGC Only]
No GC Impact
![Page 88: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/88.jpg)
High Write-Intensive Workload of Microsoft File Server Storage
[Baseline GC] [AGC Only]
[DGC Only] [AGC+DGC]
Fail
Fail
Success
![Page 89: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/89.jpg)
Performance Comparison
[L-FTL] [H-FTL]
[AGC+DGC] [AGC+DGC Invisible]
Hybrid Mapping FTL introduces shorter WCRT as well as lower I/O blocking than L-FTL by reducing GC overheads
AGC+DGC perform all on demand GC even under write-intensive workloads
![Page 90: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/90.jpg)
Constraints of Background GC
• Background garbage collection is feasible only if the system can secure enough idle times
• May violate QoS or SLA in cases where there is an unexpected request during the background GC operations
![Page 91: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/91.jpg)
GC Overhead Distribution
• Four I/O requests (1~4) are present in order
– (Example) the four requests are given the same deadline
– GC is triggered during I/O-1 service
– I/O-1 misses its QoS, whereas the others can satisfy it
• I/O-2 to 4 have a time margin (slack) until the deadline
– We will distribute the GC overhead of I/O-1 over the others
– All I/O requests can meet their deadlines even though GC is executed!
[Figure: latency of I/O-1 to I/O-4 over time, showing flash execution, GC overhead, and the I/O deadline]
![Page 92: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/92.jpg)
What Is Needed for GC Distribution?
• (1) GC overhead estimation
– Need to know how big the GC overhead is
• (2) Slack stealing
– Need to know how much slack (how many I/O requests) is required
• (3) GC overhead distribution
– Need to segment the GC and distribute the segments over other I/O requests
[Figure: latency of I/O-1 to I/O-4 over time, showing flash execution, GC overhead, and the I/O deadline]
![Page 93: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/93.jpg)
HIOS’s GC Overhead Estimation
• The SATA interface is the best place for GC distribution
– It is aware of flash device operations, including GC
– It is also aware of I/O requests through their tags (essential information)
– Before the actual I/O request, its tag is sent to the command queue
• GC overhead estimation based on flash status and I/O tags
– GC invocation by I/O-1 can be predicted
– GC overhead (# of reads, writes, and erases) can be estimated
[Figure: latency timeline of I/O-1 to I/O-4; Tag-1 to Tag-4 are held in the command queue of the SATA I/F above the flash devices]
![Page 94: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/94.jpg)
HIOS’s Slack Stealing
• For the following I/O requests, slack is accumulated
– For each I/O request, slack is calculated using tag information
– Slack time = T_deadline – T_flash_latency
– Slack stealing continues until the GC overhead is exhausted
• In this scenario, the slack from the following I/O-2, 3, and 4 is enough for distributing I/O-1’s GC overhead
[Figure: latency timeline of I/O-1 to I/O-4; Tag-1 to Tag-4 are held in the command queue of the SATA I/F above the flash devices]
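Slack stealing can be sketched as a walk over the queued tags that accumulates slack until the estimated GC overhead is covered. The function name, the tuple layout, and the numbers in the example are hypothetical; only the formula slack = T_deadline – T_flash_latency comes from the slide.

```python
def steal_slack(gc_overhead_us, following):
    """following: list of (tag, deadline_us, flash_latency_us) for the queued requests.

    Returns the slack taken from each request and the GC overhead left uncovered.
    """
    plan = []
    remaining = gc_overhead_us
    for tag, deadline_us, flash_latency_us in following:
        if remaining <= 0:
            break                                         # GC overhead exhausted
        slack = max(0, deadline_us - flash_latency_us)    # slack = T_deadline - T_flash_latency
        taken = min(slack, remaining)
        if taken > 0:
            plan.append((tag, taken))
            remaining -= taken
    return plan, remaining


# Hypothetical example: 1000 us of GC overhead from I/O-1, three following requests.
plan, uncovered = steal_slack(1000, [(2, 800, 400), (3, 800, 300), (4, 800, 500)])
```

If `uncovered` stays above zero, the queued requests do not offer enough slack and part of the GC overhead would still fall on the critical path.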
![Page 95: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/95.jpg)
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-1)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
[Figure: step for I/O-1: latency timeline of I/O-1 to I/O-4 with I/O-1’s GC overhead marked; Tag-1 to Tag-4 in the command queue of the SATA I/F; I/O-1 data in the DRAM buffer; flash commands issued to the flash devices]
![Page 96: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/96.jpg)
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-2)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
[Figure: step for I/O-2: latency timeline of I/O-1 to I/O-4; Tag-2 to Tag-4 in the command queue of the SATA I/F; I/O-2 data in the DRAM buffer; flash commands issued to the flash devices]
![Page 97: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/97.jpg)
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-3)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
[Figure: step for I/O-3: latency timeline of I/O-1 to I/O-4; Tag-3 and Tag-4 in the command queue of the SATA I/F; I/O-3 data in the DRAM buffer; flash commands issued to the flash devices]
![Page 98: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/98.jpg)
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-4)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
• I/O-1’s GC overhead is distributed over I/O-2, 3, and 4, and all requests satisfy QoS
[Figure: step for I/O-4: latency timeline of I/O-1 to I/O-4; Tag-4 in the command queue of the SATA I/F; I/O-4 data in the DRAM buffer; flash command issued to the flash devices]
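The distribution step walked through on these slides can be sketched as assigning GC segments, in order, to the following requests without exceeding the slack of any request. The function `distribute_gc`, the segment names, and all latency and slack values are hypothetical; the greedy in-order assignment is one possible policy, not necessarily the one HIOS uses.

```python
def distribute_gc(segments, slacks):
    """segments: ordered list of (name, latency_us); slacks: dict of tag -> slack_us.

    Assigns segments to requests in queue order, never exceeding a request's slack.
    """
    assignment = {tag: [] for tag in slacks}
    budget = dict(slacks)
    tags = list(slacks)
    unassigned = []
    i = 0
    for name, latency in segments:
        # Move on to the next request once the current one has no room for this segment.
        while i < len(tags) and budget[tags[i]] < latency:
            i += 1
        if i == len(tags):
            unassigned.append(name)       # not enough slack left in the queue
            continue
        assignment[tags[i]].append(name)
        budget[tags[i]] -= latency
    return assignment, unassigned


# Hypothetical GC: migrate two valid pages (read + write each), then erase the block.
segments = [("read-0", 25), ("write-0", 440), ("read-1", 25), ("write-1", 440), ("erase", 2500)]
assignment, unassigned = distribute_gc(segments, {2: 500, 3: 500, 4: 3000})
```

Keeping the segments in their original order preserves the dependency between migrating valid pages and erasing the block afterwards.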
![Page 99: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/99.jpg)
Simulation-based Evaluation
• Simulator
– Models flash chips and associated data paths
– Implements typical SSD software stack
• Baseline configuration
– SSD: 8 channels, 2 flash chips / channel
– SATA I/F: 32 command queue entries
• Compared with four existing I/O schedulers
– Noop: schedules requests in FIFO order
– Anticipatory: considers spatial locality
– Deadline: sorts requests by logical address in ascending order
– Flash-aware: reduces GC overhead by reducing writes
• HIOS variants: GC-only management, channel-only management, or both
![Page 100: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/100.jpg)
Worst-case Latency
• Worst-case latency is normalized to the Noop scheduler
• All existing schedulers perform similarly, since they are oblivious to GC and cannot avoid its high overhead
• HIOS-1 (GC distribution only) reduces worst-case latency by 41%
• HIOS-2 (GC + channel management) is similar to HIOS-1
– Worst-case latency is caused by GC, not by channel conflicts
![Page 101: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/101.jpg)
Average Latency
• Average latency is normalized to the Noop scheduler
• All existing schedulers are similar to or slightly worse than Noop
• HIOS-1 does not affect average latency much
– GC overheads are not eliminated, only redistributed
• HIOS-2 (GC + channel management) improves average latency by 13%
– Resolving channel conflicts yields better performance
![Page 102: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/102.jpg)
Deadline Satisfaction
• A long series of writes is issued to trigger multiple GCs
• Measured: % of I/O requests that miss their deadline (30 ms)
• Under 40% usage, all I/O schedulers show a negligible miss rate
• As the number of writes increases, the miss rate of the other schedulers rises dramatically because GCs become more frequent
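The two kinds of metric used in these evaluation slides (latency normalized to Noop, and the deadline miss rate) can be computed as in the sketch below. The helper names and the sample latencies are hypothetical; only the 30 ms deadline and the normalization to Noop come from the slides.

```python
def normalized(value, baseline):
    """Express a latency relative to a baseline scheduler (Noop = 1.0)."""
    return value / baseline


def deadline_miss_rate(latencies_ms, deadline_ms=30.0):
    """Percentage of I/O requests whose latency exceeds the deadline."""
    if not latencies_ms:
        return 0.0
    missed = sum(1 for latency in latencies_ms if latency > deadline_ms)
    return 100.0 * missed / len(latencies_ms)


# Hypothetical samples: two of four requests exceed 30 ms.
rate = deadline_miss_rate([10.0, 31.0, 29.0, 45.0])

# A 41% reduction corresponds to a normalized latency of 0.59.
ratio = normalized(59.0, 100.0)
```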
![Page 103: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/103.jpg)
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
– Parallelism-Aware Host Interface I/O Scheduler
– GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
![Page 104: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/104.jpg)
What we are working on now for our heterogeneous computing research (real-implementation-based research)
• Software approach: Extending NVMMU for a cluster
• Hardware approach: Storage-based accelerator
![Page 105: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/105.jpg)
Our Real Implementation Approach (SW)
• GPU-based Storage I/O Acceleration
• Advantage:
– Ready to explore systems without the limits of the simulation infrastructure
– We can measure the real execution time of the integrated approach
• Disadvantage:
– Performance varies somewhat with the test environment
– Incompatibilities across device and software versions
– Debug…Debug…Debug.. and Debug….
[Figure: CPU, GPU, and SSD with the NVMMU.]
![Page 106: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/106.jpg)
Our Real Implementation Approach (HW)
• FPGA-based Storage I/O Acceleration
– Memory/storage backend and frontend FPGA implementation
– Multi-kernel execution model
– Scheduling for near-data processing
• Advantage:
– No idea yet, as everything is still ongoing, sorry
• Disadvantage:
– Inflexible, no choice for design exploration
– Debug…Debug…Debug.. and Debug….
– Slow.. Slow.. Slow.. And Slow
– Expensive and long procurement process for testing: there are 7 more platforms on which we failed to achieve our goals
![Page 107: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/107.jpg)
Challenges for Simulation Model
Traditional system simulator:
It assumes all data have been loaded into RAM before execution…
[Figure: traditional system simulator model (Gem5, GPGPU-sim): four CPU cores with L2/L3 caches, memory controllers, and DRAM ranks/dies.]
![Page 108: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/108.jpg)
Challenges for Simulation Model
[Figure: access latency at each level of the hierarchy]
• LD/ST unit: 1 cycle
• L1 cache: 4 cycles
• L2 cache: 12 cycles
• DRAM: 230 cycles
• SSD: 60-800 us
Extremely slow for simulation
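To put the SSD figure on the same scale as the cycle counts, the microsecond range can be converted to cycles. The sketch below assumes a 2 GHz core clock, which the slide does not state; the helper name `us_to_cycles` is likewise made up for the example.

```python
ASSUMED_CLOCK_HZ = 2_000_000_000  # assumption: 2 GHz core clock


def us_to_cycles(microseconds, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a latency in microseconds to core clock cycles."""
    return microseconds * clock_hz // 1_000_000


ssd_min_cycles = us_to_cycles(60)    # lower end of the SSD range
ssd_max_cycles = us_to_cycles(800)   # upper end of the SSD range
dram_cycles = 230                    # from the slide

# How many DRAM accesses fit into one SSD access (integer estimate).
ratio_min = ssd_min_cycles // dram_cycles
ratio_max = ssd_max_cycles // dram_cycles
```

Under that assumption a single SSD access costs 120,000 to 1,600,000 cycles, which is why simulating every SSD access cycle by cycle is so slow.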
![Page 109: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/109.jpg)
SimpleSSD
SimpleSSD: Modeling Solid State Drive for Holistic System Simulation
- A high-fidelity SSD simulation framework designed for educational purposes
- Free to download from http://SimpleSSD.camelab.org
![Page 110: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/110.jpg)
SimpleSSD overview
[Figure: host system (four CPU cores with L1 caches and an L2 cache, memory controllers, DRAM ranks/dies) attached to a multi-channel SSD (controllers, each with Die 0, Die 1, ... Die N). Highlighted: a CPU core running the application instruction ldr r1, [r0] at PC 0x04 through the registers, ALU, and MEM stage.]
Source files: dyn_inst_impl.hh/cc, o3_cpu_exec.hh/cc, base_dyn_inst.hh/cc, memhelpers.hh, cpu.hh/cc, iew_impl.hh/cc, lsq_impl.hh/cc, lsq_unit_impl.hh/cc
![Page 111: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/111.jpg)
SimpleSSD overview
[Figure: host system (four CPU cores with L1 caches and an L2 cache, memory controllers, DRAM ranks/dies) attached to a multi-channel SSD (controllers, each with Die 0, Die 1, ... Die N). Highlighted: a cache request from the core's MEM stage to the L1D cache (index/tag decoder, tag array, data array, S/A, comparator, match check).]
Source files: port_interface.hh/cc, base.hh/cc, cacheset.hh, base.hh/cc, cache_impl.hh, cache.hh/cc, mshr.hh/cc, mshr_queue.hh/cc
![Page 112: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/112.jpg)
SimpleSSD overview
[Figure: host system (four CPU cores with L1 caches and an L2 cache, memory controllers, DRAM ranks/dies) attached to a multi-channel SSD (controllers, each with Die 0, Die 1, ... Die N). Highlighted: a cache miss in the L1D cache going to the L2 cache; both caches show decoder, tag array, data array, S/A, comparator, and match check.]
Source files: port_interface.hh/cc, base.hh/cc, cacheset.hh, base.hh/cc, cache_impl.hh, cache.hh/cc, mshr.hh/cc, mshr_queue.hh/cc
![Page 113: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/113.jpg)
SimpleSSD overview
[Figure: host system (four CPU cores with L1 caches and an L2 cache, memory controllers, DRAM ranks/dies) attached to a multi-channel SSD (controllers, each with Die 0, Die 1, ... Die N). Highlighted: an L2 cache miss, the MMU and page table, a page fault, the I/O controller, the SSD, DMA, and main memory.]
Source files: dram_ctrl.hh/cc, page_table.hh/cc, multi_level_page_table.hh/cc, multi_level_page_table_impl.hh
![Page 114: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/114.jpg)
SimpleSSD overview
[Figure: host system (four CPU cores with L1 caches and an L2 cache, memory controllers, DRAM ranks/dies) attached to a multi-channel SSD (controllers, each with Die 0, Die 1, ... Die N). Highlighted: an L2 cache miss reaching the DRAM controller (arbitration engine; CMD, WR, and RD queues; sequencing engine; PHY interface) and main memory.]
Source files: dram_ctrl.hh/cc, simple_mem.hh/cc, physical.hh/cc, addr_mapper.hh/cc
![Page 115: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/115.jpg)
SimpleSSD overview
[Figure: host system (four CPU cores with L1 caches and an L2 cache, memory controllers, DRAM ranks/dies) attached to a multi-channel SSD (controllers, each with Die 0, Die 1, ... Die N), annotated with simulator classes and the functions called between them.]
Classes: LSQ_unit (CPU core), cache (L1 and L2), mem_ctrl (memory controllers), FuncPageTable, HIL, FTL, PAL
Functions: executeLoad(inst), executeStore(inst), commitLoad(inst), commitStore(inst), recvTimingReq(pkt), recvTimingResp(pkt), changeFlag(paddr, size, alloc), SSDoperation(), fetchQueue(), accessAndRespond(pkt, lat), PAL_setLatency(), setLatency()
![Page 116: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/116.jpg)
Inside Host Interface Layer (HIL)
• Target: provide universal interface for system simulator, trace generator, and RAID controller.
• Add-on mode overview:
[Figure: data movement model in add-on mode. On a page fault, the system simulator sends an I/O request (Address, Size, curTick, opType) to the Host Interface Layer, which sits above the FTL and PAM, and receives a response (Address, Size, finishTick, opType). The latency map table stores the SSD access latency as (Address, finishTick).]
![Page 117: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/117.jpg)
Inside Host Interface Layer (HIL)
• Target: provide universal interface for system simulator, trace generator, and RAID controller.
• Add-on mode overview:
[Figure: data movement model in add-on mode. On an I/O access, the system simulator looks up the latency map table (Address, finishTick) and the result is reported as SSD delay = finishTick - curTick.]
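The add-on mode bookkeeping can be sketched as below. This is an illustration under assumptions, not the SimpleSSD code: the class `LatencyMapTable`, its method names, the fixed-latency `toy_ssd` model, and the clamp to zero for accesses that arrive after `finishTick` are all made up for the example. Only the request fields and the formula SSD delay = finishTick - curTick come from the slides.

```python
class LatencyMapTable:
    """Remembers when each outstanding SSD access finishes."""

    def __init__(self, ssd_model):
        # ssd_model(address, size, cur_tick, op_type) -> finish_tick
        self.ssd_model = ssd_model
        self.finish_ticks = {}

    def on_page_fault(self, address, size, cur_tick, op_type):
        """Forward the I/O request to the SSD model and store finishTick."""
        finish_tick = self.ssd_model(address, size, cur_tick, op_type)
        self.finish_ticks[address] = finish_tick
        return finish_tick

    def ssd_delay(self, address, cur_tick):
        """SSD delay = finishTick - curTick (0 if unknown or already done)."""
        finish_tick = self.finish_ticks.get(address)
        if finish_tick is None:
            return 0
        return max(0, finish_tick - cur_tick)


def toy_ssd(address, size, cur_tick, op_type):
    """Hypothetical stand-in for HIL/FTL/PAM: every access takes 500 ticks."""
    return cur_tick + 500


table = LatencyMapTable(toy_ssd)
table.on_page_fault(address=0x1000, size=4096, cur_tick=100, op_type="read")
delay_early = table.ssd_delay(0x1000, cur_tick=250)   # access before finishTick
delay_late = table.ssd_delay(0x1000, cur_tick=900)    # access after finishTick
```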
![Page 118: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/118.jpg)
Inside Host Interface Layer (HIL)
• Target: provide universal interface for system simulator, trace generator, and RAID controller.
• Standalone mode overview:
[Figure: standalone mode, dispatch step. Trace files or a micro-benchmark fill the I/O queue; an I/O request (Address, Size, curTick, opType) is dispatched from the queue to the Host Interface Layer, which sits above the FTL and PAM.]
![Page 119: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/119.jpg)
Inside Host Interface Layer (HIL)
• Target: provide universal interface for system simulator, trace generator, and RAID controller.
• Standalone mode overview:
[Figure: standalone mode, issue step. I/Os from trace files or a micro-benchmark are inserted into the I/O queue; an I/O request (Address, Size, curTick, opType) is issued to the Host Interface Layer above the FTL and PAM; the response (Address, Size, finishTick, opType) comes back, and a new request is issued at finishTick.]
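The standalone issue loop can be sketched as follows. It is a simplified illustration that assumes one outstanding request at a time; `run_standalone` and the fixed-latency `toy_ssd` are hypothetical names, and a real run would replace `toy_ssd` with the HIL/FTL/PAM stack.

```python
from collections import deque


def run_standalone(trace, ssd_model, start_tick=0):
    """Replay a trace: issue one request, then the next one at finishTick.

    trace:     iterable of (address, size, op_type)
    ssd_model: callable (address, size, cur_tick, op_type) -> finish_tick
    Returns the list of responses (address, size, finish_tick, op_type).
    """
    io_queue = deque(trace)            # insert all requests into the I/O queue
    cur_tick = start_tick
    responses = []
    while io_queue:
        address, size, op_type = io_queue.popleft()
        finish_tick = ssd_model(address, size, cur_tick, op_type)
        responses.append((address, size, finish_tick, op_type))
        cur_tick = finish_tick         # issue the new request at finishTick
    return responses


def toy_ssd(address, size, cur_tick, op_type):
    """Hypothetical latency model: reads take 100 ticks, writes 300."""
    return cur_tick + (100 if op_type == "read" else 300)


responses = run_standalone(
    [(0, 4096, "read"), (4096, 4096, "write"), (8192, 4096, "read")],
    toy_ssd,
)
```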
![Page 120: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/120.jpg)
DEMO
Gem5FS-SimpleSSD (full system mode)
Software dependencies:
• Linux
• mercurial
• scons
• swig
• gcc
• g++
• Python 2.6 or Python 2.7
• Protobuf
Image booting dependencies:
• Device Tree Blob: vexpress.aarch32.ll_20131205.0-gem5.4cpu.dtb
• Linux Kernel: vmlinux.aarch32.ll_20131205.0-gem5
• File System: aarch32-ubuntu-natty-headless.img
Compile software:
scons -j7 build/ARM/gem5.opt
Execution command:
./build/ARM/gem5.opt --debug-flags=IdeDisk,HIL,FTLOut,PAM2,GLOBALCONFIG -d ./configs/example/fs.py --num-cpu=4 --dtb-filename=vexpress.aarch32.ll_20131205.0-gem5.4cpu.dtb --disk-image=aarch32-ubuntu-natty-headless.img --kernel=vmlinux.aarch32.ll_20131205.0-gem5 --script=run_BC.rcS --SSD=1 --SSDconfig=rev_ch16_SATA.cfg
![Page 121: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/121.jpg)
Inside Flash Translation Layer (FTL)
• Target: provide SSD services such as I/O address mapping, wear leveling, garbage collection, etc.
• Overview:
[Figure: HIL, FTL, and PAM pipeline. Host requests (LBAs) enter the I/O queue; FTLmapping() translates them through the mapping table (direct, set-assoc, or full-assoc mapping) into PPNs, which ReadTransaction() and WriteTransaction() push toward the PAM.]
![Page 122: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/122.jpg)
Inside Flash Translation Layer (FTL)
• Target: provide SSD services such as I/O address mapping, wear leveling, garbage collection, etc.
• Overview:
[Figure: HIL, FTL, and PAM pipeline. Host requests (LBAs) enter the I/O queue; FTLmapping() translates them through the mapping table (direct, set-assoc, or full-assoc mapping, set by FTLMapN and FTLMapK) into PPNs; ReadTransaction() and WriteTransaction() push them to the PAM through SendRequest(), and SetLatency() returns the latency. The free block pool is read against GC_threshold (FTLGCthreshold) for GarbageCollection(), with a MinHeap used by WearLeveling().]
• Configurable parameters:
– FTLMapN
– FTLMapK
– FTLGCthreshold
– FTLOP (over-provisioning)
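One way to read the free block pool, GC_threshold, and MinHeap labels in the figure is sketched below. This is an interpretation, not the SimpleSSD code: the class `FreeBlockPool`, the choice of erase count as the heap key, and the "fewer free blocks than the threshold" trigger are all assumptions.

```python
import heapq


class FreeBlockPool:
    """Free blocks kept in a min-heap keyed by erase count."""

    def __init__(self, block_ids, gc_threshold):
        self.heap = [(0, block) for block in block_ids]  # (erase_count, block)
        heapq.heapify(self.heap)
        self.gc_threshold = gc_threshold

    def allocate(self):
        """Hand out the least-worn free block (simple wear leveling)."""
        erase_count, block = heapq.heappop(self.heap)
        return block, erase_count

    def release(self, block, erase_count):
        """Return a block after garbage collection has erased it once more."""
        heapq.heappush(self.heap, (erase_count + 1, block))

    def needs_gc(self):
        """Assumed trigger: fewer free blocks than the configured threshold."""
        return len(self.heap) < self.gc_threshold


pool = FreeBlockPool(range(4), gc_threshold=2)
first_block, first_count = pool.allocate()   # block 0, never erased
pool.release(first_block, first_count)       # block 0 now has erase count 1
next_block, _ = pool.allocate()              # block 1 is now the least worn
```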
![Page 123: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/123.jpg)
Inside Flash Translation Layer (FTL)
• FTL mapping: supports direct mapping, set-assoc mapping, and full-assoc mapping by configuring FTLMapN and FTLMapK.
• Block layout based on N and K:
[Figure: the SSD logical blocks are divided into Data Group 0, Data Group 1, ..., Data Group (DGN); each data group contains N data blocks and K log blocks.]
![Page 124: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/124.jpg)
Inside Flash Translation Layer (FTL)
• FTL mapping: supports direct mapping, set-assoc mapping, and full-assoc mapping by configuring FTLMapN and FTLMapK.
• Set-assoc mapping (1<N<max, 1<K<max):
[Figure: each data group (DGN) holds N data blocks and K log blocks. Data blocks use a block mapping table (LBN, PBN); log blocks use a page mapping table (DGN, LPN, PPN). The logical page address consists of a data group index, a block index, and a page index.]
![Page 125: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/125.jpg)
Inside Flash Translation Layer (FTL)
• FTL mapping: supports direct mapping, set-assoc mapping, and full-assoc mapping by configuring FTLMapN and FTLMapK.
• Direct mapping (N=1, K=1):
[Figure: Data Group 0 through Data Group max each hold 1 data block and 1 log block. Data blocks use a mapping table (DGN, PBN); log blocks use a page mapping table (DGN, LPN, PPN). The logical page address consists of a data group index and a page index; there is no block index.]
![Page 126: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/126.jpg)
Inside Flash Translation Layer (FTL)
• FTL mapping: supports direct mapping, set-assoc mapping, and full-assoc mapping by configuring FTLMapN and FTLMapK.
• Full-assoc mapping (N=K=max):
[Figure: a single data group holds the max number of data blocks and the max number of log blocks. Data blocks use a block mapping table (LBN, PBN); log blocks use a page mapping table (DGN, LPN, PPN). The logical page address consists of a block index and a page index; there is no data group index.]
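The three address formats above differ only in how the logical block number is split between a data group index and a block index. The sketch below shows one possible split; the function name, the pages-per-block value, and the field order (data group index in the high bits, page index in the low bits) are assumptions, since the slides only name the fields.

```python
def split_logical_page_address(lpa, pages_per_block, n_data_blocks):
    """Split a logical page address into (data group, block, page) indices.

    n_data_blocks plays the role of FTLMapN: the number of data blocks
    per data group.
    """
    page_index = lpa % pages_per_block
    logical_block = lpa // pages_per_block
    block_index = logical_block % n_data_blocks
    data_group_index = logical_block // n_data_blocks
    return data_group_index, block_index, page_index


# Hypothetical device: 8 logical blocks of 4 pages; logical page 13.
direct = split_logical_page_address(13, pages_per_block=4, n_data_blocks=1)
set_assoc = split_logical_page_address(13, pages_per_block=4, n_data_blocks=2)
full_assoc = split_logical_page_address(13, pages_per_block=4, n_data_blocks=8)
```

With N=1 the block index is always 0 (direct mapping has no block index), and with N equal to the total block count the data group index is always 0 (full-assoc mapping has no data group index).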
![Page 127: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/127.jpg)
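Inside Page Allocation Module (PAM)
• Target: simulate SSD internal parallelism and resource conflicts of IO bus and NAND flash memory.
• Overview:
[Figure: HIL, FTL, and PAM. The PPN produced by the FTL mapping is split into CH, PKG, DIE, Blk, and Page fields; requests to DIE 0 ... DIE N behind CH0 can conflict.]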
![Page 128: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/128.jpg)
Inside Page Allocation Module (PAM)
• Target: simulate SSD internal parallelism and resource conflicts of IO bus and NAND flash memory.
• Main functions:
– PPN disassemble.
– Conflict simulation.
[Figure: PPNs from the FTL are disassembled into CH, PKG, and DIE fields; timeline scheduling on CH0 and DIE0 places the DMA phases on the channel and the MEM latency on the die.]
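A PPN disassemble step could look like the sketch below. The field order (channel in the most significant position, then package, die, block, and page) and the geometry values are assumptions for illustration; the slides name the fields but not their order or widths.

```python
from typing import NamedTuple


class FlashAddress(NamedTuple):
    channel: int
    package: int
    die: int
    block: int
    page: int


def disassemble_ppn(ppn, channels, packages, dies, blocks, pages):
    """Split a physical page number into its resource coordinates.

    packages, dies, blocks, pages are counts per parent unit.
    """
    rest, page = divmod(ppn, pages)
    rest, block = divmod(rest, blocks)
    rest, die = divmod(rest, dies)
    channel, package = divmod(rest, packages)
    if channel >= channels:
        raise ValueError("PPN is outside the configured geometry")
    return FlashAddress(channel, package, die, block, page)


# Hypothetical geometry: 2 channels x 2 packages x 2 dies x 4 blocks x 8 pages.
addr = disassemble_ppn(181, channels=2, packages=2, dies=2, blocks=4, pages=8)
```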
![Page 129: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/129.jpg)
Inside Page Allocation Module (PAM)
Simplified latency simulation model
• Traditional model: the channel carries the OpCode, Addr, tADL, Data, and CMD phases; the die performs the MEM operation
• Simplified model: pre-dma and post-dma on the channel; mem_op on the die
• Configurable parameters:
– DMA frequency
– Command & address delay
– SLC/MLC/TLC
– Read/Write
– MSB/CSB/LSB
– Page size
– Channel
– Package
– Die
![Page 130: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/130.jpg)
Inside Page Allocation Module (PAM)
Simplified conflict simulation model
• pre-dma: DMA operation in CHANNEL (transfer in metadata [+ write page]).
• mem-op: NAND flash memory island read/write operation in DIE.
• post-dma: DMA operation in CHANNEL (transfer out metadata [+ read page]).
• pre-dma/post-dma consume the CHANNEL resource; mem-op consumes the DIE resource.
• The conflict model is separated into i) CHANNEL and ii) DIE.
![Page 131: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/131.jpg)
Inside Page Allocation Module (PAM)
DMA conflicts
• IO #1 transfers in DIE 0, CHANNEL 0
• IO #2 transfers in DIE 1, CHANNEL 0
[Figure: pre-dma and post-dma conflict example. Timeline of Channel 0, Die 0, and Die 1 for IO#1 and IO#2, with mem-op (w) and mem-op (r) on the dies and the DMA conflicts marked on the channel.]
![Page 132: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/132.jpg)
Inside Page Allocation Module (PAM)
MEM conflicts
• IO #1 transfers in DIE 0, CHANNEL 0
• IO #2 transfers in DIE 0, CHANNEL 0
[Figure: pre-dma, mem-op, and post-dma conflict example. Timeline of Channel 0, Die 0, and Die 1 for IO#1 and IO#2, with mem-op (w) and mem-op (r), the DMA conflicts on the channel, and the MEM conflict on Die 0.]
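The two conflict cases can be reproduced with a small timeline scheduler. This is a sketch under assumptions, not the PAM implementation: the function names, the phase lengths, and the rule that a die stays occupied until its post-dma has finished are all made up for the example. What it keeps from the slides is the split into a CHANNEL resource (pre-dma, post-dma) and a DIE resource (mem-op).

```python
def earliest_slot(busy, ready, length):
    """Earliest start >= ready where [start, start+length) hits no busy interval."""
    start = ready
    for busy_start, busy_end in sorted(busy):
        if start + length <= busy_start:
            break                      # the slot fits before this interval
        if start < busy_end:
            start = busy_end           # overlap: wait until the interval ends
    return start


def simulate(ios, pre_dma, mem_op, post_dma):
    """ios: list of (name, channel, die, arrival_tick). Returns finish ticks."""
    channel_busy = {}                  # channel -> list of (start, end)
    die_free = {}                      # (channel, die) -> tick when die is idle
    finish = {}
    for name, channel, die, arrival in ios:
        busy = channel_busy.setdefault(channel, [])
        # MEM conflict: wait until the target die is idle.
        ready = max(arrival, die_free.get((channel, die), 0))
        # DMA conflict: pre-dma must find a free slot on the channel.
        pre_start = earliest_slot(busy, ready, pre_dma)
        busy.append((pre_start, pre_start + pre_dma))
        mem_end = pre_start + pre_dma + mem_op
        # DMA conflict: post-dma must find a free slot on the channel too.
        post_start = earliest_slot(busy, mem_end, post_dma)
        busy.append((post_start, post_start + post_dma))
        die_free[(channel, die)] = post_start + post_dma
        finish[name] = post_start + post_dma
    return finish


# Same channel, different dies: only the DMA phases conflict.
dma_case = simulate([("IO#1", 0, 0, 0), ("IO#2", 0, 1, 0)], 2, 10, 2)
# Same channel, same die: the whole operation is serialized.
mem_case = simulate([("IO#1", 0, 0, 0), ("IO#2", 0, 0, 0)], 2, 10, 2)
```

In the first case IO#2 only waits for IO#1's pre-dma and overlaps its mem-op with IO#1's; in the second it cannot start until IO#1 has released the die.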
![Page 133: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/133.jpg)
gem5FS-simpleSSD
Overview: leverage the disk interface provided by gem5.
Target: integrate simpleSSD into gem5 full system mode.
[Figure: host system with four CPU cores, L1 and L2 caches, memory controllers, and DRAM ranks/dies. In the gem5FS model, page faults and file reads/writes go to a simple disk latency calculator; in the gem5FS-simpleSSD model, the SimpleSSD simulator takes its place.]
![Page 134: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/134.jpg)
DEMO
Gem5FS-SimpleSSD
Output files:
config.ini config.json SimpleSSD.log stats.txt system.terminal
Full system execution log
![Page 135: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/135.jpg)
DEMO
SimpleSSD-standalone
Software dependencies:
• Linux
• g++
Compile software:
make
Execution command:
./ssdsim ssd_config_file microbench_config_file > SimpleSSD.log
Output files:
SimpleSSD.log
SimpleSSD runtime statistics report
![Page 136: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/136.jpg)
Evaluation Samples
Instructions per cycle (IPC) on SLC is better
Page cache (VFS) doesn’t work well
Massive I/O makes system call overheads significant
MLC is worse than TLC
![Page 137: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/137.jpg)
Evaluation Samples
CPU utilization is not impacted even when there is storage access (page cache)
CPU utilization is severely impacted by storage access. (no locality)
![Page 138: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/138.jpg)
Educational Research Tools
• OpenNVM
– http://opennvm.camelab.org
• SimpleSSD
– http://simplessd.camelab.org
• NANDFlashSim
– http://nfs.camelab.org
• Trace Repository
– http://traces.camelab.org
![Page 139: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/139.jpg)
References
• Ozone (O3): An Out-of-Order Flash Memory Controller Architecture, TC 2011
• Exploring Parallel Data Access Methods in Emerging Non-Volatile Memory Systems, TPDS
• Unleashing the Potentials of Dynamism for Page Allocation Strategies in SSDs, SIGMETRICS
• Exploring and Exploiting the Multilevel Parallelism Inside SSDs for Improved Performance and Endurance, TC
• Design Tradeoffs for SSD Performance, ATC'09
• ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices, ATC'16
![Page 140: (IIT8015) Lecture#7: SSD Architecture and System-level ...camelab.org/uploads/Main/lecture07-camel-SSD-architecture-and... · msnfs usr fin1 web fin2 sql0 sql1 sql2 sql3 CWDP WCPD](https://reader036.vdocuments.mx/reader036/viewer/2022081800/5abef4517f8b9ab02d8d927b/html5/thumbnails/140.jpg)
References
• Sprinkler: Maximizing resource utilization in many-chip solid state disks, HPCA'14
• Taking Garbage Collection Overheads off the Critical Path in SSDs, Middleware'12
• An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories, NVMSA
• HIOS: A Host Interface I/O Scheduler for Solid State Disks, ISCA'14
• NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures, PACT'15
• SimpleSSD: Modeling Solid State Drive for Holistic System Simulation, CAL 2017
• Exploiting request characteristics and internal parallelism to improve SSD performance, ICCD'15
• Performance Analysis of NVMe SSDs and their Implication on Real World Databases, SYSTOR'15