An Efficient Index Structure for NAND Flash Memory and Its Applications

Jin-Soo Kim ([email protected])
Department of Computer Science, KAIST
Outline
• μ-Tree
  – Motivation
  – Design
  – Analysis
  – Evaluations
• Current Research based on μ-Tree
  – μFS
  – μ-FTL

NVRAMOS 2008 – April 25, 2008. ([email protected])
μ-Tree
Index Structure
• A data structure that enables sublinear-time lookup (wikipedia.org): Key → Record
• Examples
  – File system: file name → file metadata (directory)
  – DBMS: primary key → database record
B+Tree (1)
Characteristics
• Disk-based index structure
• Used in most file systems and DBMSs
• f-ary balanced search tree
• O(log_f N) retrieval, insertion, and deletion
• Records are stored only in leaf nodes
[Figure: an example B+Tree with root key 15 and keys 1, 7, 10, 11, 14, 18, 19, 21 in the leaves.]
B+Tree (2)
B+Tree on Flash
[Figure: out-of-place updates of a B+Tree on flash. Updating leaf F writes a new page F1 and forces new copies of its ancestors (C1, A1); a later update to D writes D2, B2, A2. Page write sequence: D E F B C A F1 C1 A1 D2 B2 A2; each new version invalidates the old node.]
B+Tree (3)
B+Tree on Flash (cont'd)
• Node size = page size in NAND flash memory
• Wandering tree
  – A tree where any update requires updating all parent nodes up to the root
• Update cost: Cw * H
  – H: the height of the B+Tree
  – Cw: the cost of a write operation
• B+Tree updates can be cached in memory for a while
  – Journal tree in JFFS3
  – Flush out all the changes in bulk
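The Cw * H cost above can be sketched in a few lines. This is an illustrative calculation, not code from the talk; the height formula and function names are assumptions.

```python
# Hedged sketch: per-update page-write cost of a "wandering" B+Tree on
# flash versus a mu-Tree. An update to a leaf rewrites the whole
# root-to-leaf path, so it costs Cw * H; a mu-Tree packs the path into
# one page, so it costs Cw * 1.
import math

def btree_height(n_records, fanout):
    """Minimum height of an f-ary tree indexing n records."""
    return max(1, math.ceil(math.log(n_records, fanout)))

def wandering_update_cost(n_records, fanout, c_write=1):
    return c_write * btree_height(n_records, fanout)

def mu_tree_update_cost(c_write=1):
    return c_write

# One million records, fanout 512 (the talk's evaluation setup):
print(wandering_update_cost(10**6, 512))  # 3 page writes per update
print(mu_tree_update_cost())              # 1 page write per update
```

This matches the microbenchmark later in the talk, where B+Tree insertions average roughly 3.75 page writes (height 3 plus occasional splits) against about 1.07 for the μ-Tree.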
μ-Tree (1)
[Figure: the same update sequence performed on a μ-Tree. Each flash page stores an entire root-to-leaf path, so updating leaf F writes a single page containing F1, C1, and A1, and updating D writes a single page containing D2, B2, and A2; old versions are simply invalidated.]
μ-Tree (2)
μ-Tree (minimally updated Tree)
• A tree which stores all the nodes on the path from a leaf to the root in a single page
Pros
• Minimal update cost: 1 * Cw
Cons
• Smaller node size → greater height
• Lower space utilization
μ-Tree (3)
Page Layout
[Figure: how one flash page is divided among levels as the tree grows from height 1 to 4. The leaf level always occupies half of the page (2048 bytes of a 4KB page), the next level a quarter (1024 bytes), and so on; at the top, the root is given the same size as its children (512 bytes each at height 4).]
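The layout above can be written as a small formula. This is a sketch reconstructed from the byte sizes shown on the slide (2048/1024/512 within a 4KB page); the function name is illustrative.

```python
# Hedged sketch of the mu-Tree page layout: level 1 (the leaf) gets half
# the page, level 2 a quarter, level l a 1/2**l share, and the root is
# given the same share as the level just below it, so the whole path
# sums to exactly one page.
def node_size(level, height, page_size=4096):
    """Bytes reserved for a node of the given level (1 = leaf)."""
    if level == height:                     # root shares the smallest slot
        return page_size // 2 ** (height - 1)
    return page_size // 2 ** level

h = 4
print([node_size(l, h) for l in range(1, h + 1)])     # [2048, 1024, 512, 512]
print(sum(node_size(l, h) for l in range(1, h + 1)))  # 4096: one full page
```

Note the h = 1 case degenerates correctly: the single root/leaf node occupies the whole page.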
μ-Tree (4)
Height Increase
[Figure: inserting into a full tree. Leaf C splits into CL and CR, the root A overflows and splits into AL and AR, and a new root R is created above them; only the top of the page layout is reorganized, while the lower-level nodes (B, CL, CR, D) keep their size and position.]
Analysis (1)
The Cost of Operations
• When there is no garbage collection:
[Cost table lost in extraction; per the preceding slides, an update costs Cw · H in the B+Tree but only Cw in the μ-Tree, while a retrieval reads H pages in both.]
Analysis (2)

                                                  B+-Tree       μ-Tree
Max # of entries per node at the l-th level       f             f / 2^l  (f / 2^(h-1) for the root, l = h)
Total # of records indexed with a height-h tree   f^h           f^h / 2^(h(h-1)/2 + h - 1)
Height of tree needed for indexing n records      ⌈log_f n⌉     smallest h whose capacity above is ≥ n
Analysis (3)
The total number of records indexed (f = 512):

Height   B+-Tree   μ-Tree
1        512       512
2        256K      64K
3        128M      4M
4        64G       128M
5        32T       2G
6        16P       16G
7        8E        64G

[Graph: the height of the tree vs. log10(the total number of records) for B+Tree and μ-Tree, each with and without ceiling.]
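The capacity table can be reproduced from the per-level entry counts. A sketch, with illustrative function names; the loss factor simply multiplies together the halvings applied to each level.

```python
# Hedged sketch reproducing the capacity table: a height-h B+Tree holds
# up to f**h records, while the mu-Tree's shrinking nodes (f/2 entries
# in the leaf, f/4 one level up, ..., f/2**(h-1) in the root) cost it a
# factor of 2**(h*(h-1)//2 + h - 1).
def btree_capacity(height, fanout=512):
    return fanout ** height

def mu_tree_capacity(height, fanout=512):
    lost_bits = height * (height - 1) // 2 + height - 1
    return fanout ** height // 2 ** lost_bits

for h in range(1, 5):
    print(h, btree_capacity(h), mu_tree_capacity(h))
# 1 512 512
# 2 262144 65536         (256K vs 64K)
# 3 134217728 4194304    (128M vs 4M)
# 4 68719476736 134217728  (64G vs 128M)
```

With f = 512, a million records thus need height 3 in both trees, which is why the microbenchmark uses that configuration.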
Caching
Page-level cache vs. node-level cache
• Page-level cache: an LRU list of whole flash pages, keyed by page address; each cached page carries a full root-to-leaf path, including node versions that are already invalid.
• Node-level cache: an LRU list of individual nodes, each tagged with its level and page address; only the nodes actually used are kept, so the same memory holds more useful entries.
[Figure: a page-level cache holding pages 33, 54, 32, 57 vs. a node-level cache holding one level-2 node and level-1 nodes from pages 31, 33, 54, 32, 57, 55, 58; shaded entries mark pages or nodes that will be written back to flash.]
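The two granularities can be contrasted with one LRU implementation. A sketch only: the capacity, page numbers, and the assumption that the hot node is the 2KB leaf half of each page are illustrative, not from the talk.

```python
# Hedged sketch: both caches are plain LRU caches bounded in bytes;
# they differ only in what an entry is (a whole page vs. one node).
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.entries = OrderedDict()          # key -> size in bytes
    def touch(self, key, size):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark most recently used
            return
        self.entries[key] = size
        while sum(self.entries.values()) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

PAGE = 4096
page_cache = LRUCache(2 * PAGE)   # page-level: caches whole pages
node_cache = LRUCache(2 * PAGE)   # node-level: caches individual nodes

for page in (33, 54, 32, 57):
    page_cache.touch(page, PAGE)                  # invalid nodes come along
    node_cache.touch((page, "leaf"), PAGE // 2)   # only the node in use

print(len(page_cache.entries), len(node_cache.entries))  # prints: 2 4
```

With the same memory budget, the node-level cache retains all four hot nodes while the page-level cache kept only two pages, which is the effect the fs_postmark caching experiment quantifies.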
Evaluation (1)
Simulator
• MLC NAND (Samsung K9GAG08U0M-P)
  – Flash size: 64 – 256MB
  – 4KB page, 512KB block
  – Read latency: 165.6 μs
  – Write latency: 905.8 μs
  – Erase latency: 1500 μs
• Fanout f = 512
  – 4-byte key
  – 4-byte pointer
Evaluation (2)
Microbenchmarks
• Total 1 million records (height = 3)
• 10,000 retrievals, deletions, and insertions
• 64MB, no cache

Average # of reads and writes per operation:

Operation   Reads (B+Tree / μ-Tree)   Writes (B+Tree / μ-Tree)
Retrieval   3.00 / 3.00               0.00 / 0.00
Insertion   3.29 / 4.00               3.75 / 1.07
Deletion    3.30 / 4.02               3.77 / 1.08
Evaluation (3)
Traces from Real Workloads
• ReiserFS (Linux): kernel compile, postmark, mp3
• MySQL: TPC-C, mp3 ID3 tag

Trace         Retrieval   Insertion   Deletion   Initial State
fs_kernel     2,274,867   260,974     236,027    0
fs_postmark   4,617,494   574,148     391,847    0
fs_mp3        512,986     223,188     23,950     0
db_tpcc       2,283       1,419       0          4,805,268
db_mp3_io     0           5,992       4,280      0
db_mp3_rand   1,712       0           0          1,712
Evaluation (4)
Real Workload Results
• No cache
• 64MB (256MB in db_tpcc)
[Bar chart: the total elapsed time of μ-Tree relative to B+Tree (= 1), broken into read, write, and erase. Annotated reductions: fs_kernel 18%, fs_postmark 28%, fs_mp3 20%, db_tpcc 78%, db_mp3_io 42%, db_mp3_rand 0%.]
Evaluation (5)
Effects of Caching
[Bar chart: total elapsed time (seconds) on fs_postmark, broken into read, write, and erase, for B+Tree, μ-Tree with a page-level cache (P), and μ-Tree with a node-level cache (N), as the cache grows from 0KB to 128KB. Annotated reductions: 13% and 64%.]
Evaluation (6)Evaluation (6)Evaluation (6)Evaluation (6)Garbage Collection Costsg• Total 1 million records (height = 3)
– Actual tree size: 11MB (B+Tree), 22MB (μ-Tree)
• 10,000 insertions without caching
C f b ll i A # f d A # f i
5.82
3 78 4.00 3 47
5.11
3 75
Costs for garbage collection Average # of read Average # of write
3.78 4.00 3.29 3.47 3.13
1.20
3.75
1.07
3.35
1.03
B+Tree μ-Tree B+Tree μ-Tree B+Tree μ-Tree
Insertion / 48MB Insertion / 64MB Insertion / 80MB
21NVRAMOS 2008 – April 25, 2008. ([email protected])
Insertion / 48MB Insertion / 64MB Insertion / 80MB
Summary
μ-Tree
• A variant of B+Tree
• The sum of the sizes of the nodes on a path from the root to a leaf is constant (one page).
• The size and the position of a node at each level do not change after a height increase or decrease, except for the root node and its children.
• Only the direct ancestors can be stored in the upper levels, if any.
• Fewer writes, slightly more reads
μFS
μFS (1)
μFS Requirements
• A flash-aware file system for portable multimedia devices (mp3 players, digital camcorders, etc.)
• MLC NAND support
• High-performance metadata operations
  – Lookup, unlink, unlink all, random seek, truncate, …
• Real-time support
  – For audio/video recorders (MP3, MP4, SD/HD, etc.)
  – HD: 2MB/s min.
• Power-off recovery
μFS (2)
Index Structures in File Systems
• Directory
  – File name → inode number
  – Linear array, B+Tree (or variants), ...
  – Local vs. global
• File structure
  – Inode → file contents
  – FAT, block pointers, B+Tree (or variants), ...
  – Fixed vs. variable
μFS Architecture (1)
µ-Tree for Directories
• Directory tree key: MSB = 1 | parent inode # | hash(filename)
• Directory page: journaling info, the current inode # (.), the parent inode # (..), and a table of (header, inode #, filename) entries addressed by (hash, offset)
[Figure: looking up "/dir/file". A search for hash("dir") under the root inode (. = 3, .. = 3) finds the entry (#12, "dir"); a search for hash("file") under inode 12 finds the entry (#27, "file").]
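The directory key layout can be sketched as bit packing. The 31/32-bit field widths and the CRC-32 hash are assumptions for illustration; the slide only fixes the field order (MSB flag, parent inode, filename hash).

```python
# Hedged sketch of a muFS-style directory key: one integer whose most
# significant bit distinguishes directory keys (1) from file keys (0),
# followed by the parent inode number and a hash of the file name.
import zlib

INODE_BITS, HASH_BITS = 31, 32

def dir_key(parent_inode, name):
    h = zlib.crc32(name.encode()) & (2 ** HASH_BITS - 1)
    return (1 << (INODE_BITS + HASH_BITS)) | (parent_inode << HASH_BITS) | h

# Looking up "/dir/file" becomes two mu-Tree searches:
k1 = dir_key(3, "dir")    # root inode is 3 in the slide's example
k2 = dir_key(12, "file")  # "dir" resolved to inode 12
print(hex(k1), hex(k2))
```

Because the parent inode number sits in the high-order bits, all entries of one directory are contiguous in key order, so a directory scan is a range scan over the μ-Tree.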
μFS Architecture (2)
µ-Tree for Files
• File tree key: MSB = 0 | inode # | offset
• Inode page: journaling info, inode status, and extent info: a table of (offset, page address, length) entries
[Figure: seeking in "/dir/file" (inode 27) at offset 32768. The extents (0, P, 2), (2, Q, 6), (8, R, 4), (12, S, 8), … map file page 8 to flash data page R.]
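The extent lookup on the slide works out as follows. A sketch using the slide's example extents; the linear scan and function name are illustrative (a real implementation would binary-search the sorted table).

```python
# Hedged sketch of the extent lookup: extents are (start_page,
# flash_address, length) triples; a byte offset is converted to a file
# page number and matched against the extent that covers it.
PAGE_SIZE = 4096

def find_extent(extents, byte_offset):
    page = byte_offset // PAGE_SIZE
    for start, addr, length in extents:
        if start <= page < start + length:
            return addr, page - start   # flash extent + offset within it
    raise KeyError("hole or past EOF")

extents = [(0, "P", 2), (2, "Q", 6), (8, "R", 4), (12, "S", 8)]
print(find_extent(extents, 32768))   # ('R', 0): offset 32768 is file page 8
```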
μFS Architecture (3)
Logical Layout
• On-flash areas: super block, µ-bitmaps, dir & file tree, directory pages, inode pages, file data area
[Figure: a directory leaf page (type dir, p_inode# 4; entries 5 "abc", 9 "def"), an inode page (type inode, inode# 9; ACL, mode, etc.; extents 0/256/64, 1024/8192/2, 1026/8196/6), and file leaf pages.]
μFS Performance
Write Bandwidth
• Write 3GB (average file size: 4.44MB)
[Graph: % of flash write bandwidth vs. write buffer size (1K – 1024K) for μFS(SYNC), μFS(10M SYNC), μFS(Delayed SYNC), and TFS4; one annotated point: 4521.97 KB/s.]
μ-FTL
Page Mapping
Characteristics
• Each table entry maps one page
• Large memory footprint
• Efficient handling of small writes
[Figure: a page-mapping table over flash memory: a data block holds pages A–H, an extra block holds free pages; a write to logical page number (LPN) 2 stores the new version G' in a free page and marks the old page invalid.]
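The page-mapping behavior in the figure can be sketched in a few lines. Class and field names are illustrative; garbage collection of invalid pages is omitted.

```python
# Hedged sketch of a page-mapping FTL: every logical page number (LPN)
# maps to one physical page; because flash cannot update in place, an
# overwrite goes to a free page and the old copy is only marked invalid.
class PageMappingFTL:
    def __init__(self, num_pages):
        self.mapping = {}                    # LPN -> physical page number
        self.free = list(range(num_pages))   # free physical pages
        self.invalid = set()                 # stale pages awaiting erase

    def write(self, lpn, data=None):
        if lpn in self.mapping:
            self.invalid.add(self.mapping[lpn])  # out-of-place update
        ppn = self.free.pop(0)
        self.mapping[lpn] = ppn
        return ppn

ftl = PageMappingFTL(16)
ftl.write(2)   # initial write of LPN 2 -> physical page 0
ftl.write(2)   # rewrite: LPN 2 -> page 1, page 0 becomes invalid
print(ftl.mapping[2], sorted(ftl.invalid))   # prints: 1 [0]
```

The one-entry-per-page `mapping` dict is exactly the "large memory footprint" bullet: the table grows linearly with the logical capacity.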
Log Block Scheme
Log Block
• A temporary storage for small writes
• Incremental updates from the first page
[Figure: a block-mapping table over a data block (pages A–H) plus a log block with its own page-mapping table; repeated writes to LPN 2 append G', G'', G''' to consecutive pages of the log block.]
μ-FTL (1)
Index Structure in FTL
• LPN → <PBN, PPN>
  – LPN: Logical Page Number
  – PBN: Physical Block Number
  – PPN: Physical Page Number
• A table (linear array) is most widely used.
• Fixed mapping granularity: page or block
• Typical write patterns in real workloads
  – Small and random
  – Large and sequential
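The mismatch between fixed granularity and mixed write patterns motivates variable-length extents. A sketch of that idea, with a sorted list standing in for the μ-Tree that μ-FTL actually uses; all names are illustrative, and overlap handling on insert is omitted.

```python
# Hedged sketch: keep variable-length extents (LPN range -> physical
# location) in an ordered index, so one entry covers a large sequential
# write while a small random write gets its own small extent.
import bisect

class ExtentMap:
    def __init__(self):
        self.starts = []    # sorted extent start LPNs
        self.extents = {}   # start LPN -> (length, first physical page)

    def insert(self, lpn, length, ppn):
        # Simplified: assumes the new extent does not overlap old ones.
        bisect.insort(self.starts, lpn)
        self.extents[lpn] = (length, ppn)

    def lookup(self, lpn):
        i = bisect.bisect_right(self.starts, lpn) - 1
        if i >= 0:
            start = self.starts[i]
            length, ppn = self.extents[start]
            if lpn < start + length:
                return ppn + (lpn - start)
        raise KeyError(lpn)

m = ExtentMap()
m.insert(0, 1024, 5000)   # one entry for a 4MB sequential write
m.insert(4096, 1, 9000)   # one entry for a single-page random write
print(m.lookup(100), m.lookup(4096))   # prints: 5100 9000
```

Two entries here replace 1025 page-mapping entries, which is why the extent distribution slides later show large movie files collapsing into a handful of extents.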
μ-FTL (2)
Example
[Figure lost in extraction.]
μ-FTL (3)
Extent Updates
[Figure lost in extraction.]
μ-FTL Performance (1)
Garbage Collection Overhead
[Graphs: garbage collection overhead (seconds) vs. extra blocks (1% – 5%) for Logblock, FAST, Superblock, DAC, and μ-FTL on two traces: MOV_8GB (16KB+44KB+4KB) and WEB_32GB (64KB+160KB+32KB).]
μ-FTL Performance (2)
Extent Distribution (GENERAL_32GB)
[Figure: the extent map of the volume, with regions labeled installed applications, temporary Internet files, downloaded movies, system files, and unused MFT zones.]
μ-FTL Performance (3)
Extent Distribution (MOV_8GB)
[Figure lost in extraction.]
Conclusion
μ-Tree
• A flash-aware index structure
μFS
• A flash-aware file system based on μ-Tree
μ-FTL
• A flash translation layer based on μ-Tree