![Page 1: Bigtable - University of Michigan · 2018-03-28 · Changing BigTable nodes for cluster with workload Changing column family metadata, such as access control Read & Write Lookup Iterate](https://reader033.vdocuments.mx/reader033/viewer/2022041521/5e2e872cd0713b44e47eb0b9/html5/thumbnails/1.jpg)
Bigtable
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.
OSDI’06
Presenter: Yijun Hou, Yixiao Peng
Motivation
● Lots of (semi-)structured data, e.g.,
○ Web pages: URL, contents, anchor, metadata, page rank
○ Geography: country, city, satellite imagery
● Wide applicability: Google Analytics, Google Finance…
● Most commercial databases cannot handle this large a scale
Motivation: a database?
Bigtable resembles a database, but it offers more flexibility:
● Control data layout and format dynamically
● Control locality of data
● May declare data to be in-memory
Data Model
A Bigtable is simply a big dictionary
key: (row: string, column: string, timestamp: int64)
value: an uninterpreted array of bytes
Example
In the above example,
"com.cnn.www" - row key
"anchor" - column family
"anchor:cnnsi.com" and "anchor:my.look.ca" - column keys
t3, t5, t6, t8, t9 - timestamps
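The dictionary view of the data model can be sketched in a few lines of Python. This is a minimal sketch, not Bigtable's API: `TinyTable`, `put`, and `get_latest` are hypothetical names, and the cell values are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for the Bigtable map; not Google's API.
Key = Tuple[str, str, int]  # (row, column, timestamp)


class TinyTable:
    def __init__(self) -> None:
        self.cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Values are uninterpreted bytes; each timestamp is a separate version.
        self.cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        # Pick the version with the largest timestamp for this row and column.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None


table = TinyTable()
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")   # illustrative value
table.put("com.cnn.www", "contents:", 5, b"<html>v1")     # illustrative value
table.put("com.cnn.www", "contents:", 6, b"<html>v2")     # illustrative value
```

A real Bigtable keeps the map sorted by row key; the plain `dict` here only illustrates the key structure.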
API
● Metadata & Configuration
○ Creating and deleting tables
○ Changing the number of Bigtable nodes in a cluster to match the workload
○ Changing column family metadata, such as access control
● Read & Write
● Lookup
● Iterate over a subset of the data in a table
● More advanced operations
○ Atomic read-modify-write on a single row
○ Client-supplied scripts
○ Working with MapReduce
Implementation: Architecture
● Master
● Chubby: distributed lock service
● Tablet servers
● Tablets: a tablet = a row range
● Log
● GFS: stable storage for log and data files (SSTable file format)
● SSTable: persistent, ordered, immutable map of (key, value) pairs
More on SSTable
● persistent, ordered, immutable map
● data blocks + block index
○ Index is loaded into memory when SSTable is opened
● Operations:
○ Lookup: a single disk seek (search the in-memory index to find the block, then read that block from GFS)
○ Iteration
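The lookup path above can be sketched as follows. This is a simplified model under stated assumptions: `TinySSTable` is a hypothetical class, blocks are Python lists rather than GFS data, and the index stores the last key of each block.

```python
import bisect
from typing import List, Optional, Tuple

Block = List[Tuple[str, bytes]]  # sorted (key, value) pairs


class TinySSTable:
    """Hypothetical SSTable: sorted blocks plus an in-memory block index."""

    def __init__(self, items: List[Tuple[str, bytes]], block_size: int = 2) -> None:
        items = sorted(items)
        # "On-disk" blocks; in Bigtable these would live in GFS.
        self.blocks: List[Block] = [items[i:i + block_size]
                                    for i in range(0, len(items), block_size)]
        # Index loaded into memory on open: the last key of each block.
        self.index: List[str] = [block[-1][0] for block in self.blocks]
        self.block_reads = 0

    def lookup(self, key: str) -> Optional[bytes]:
        # Binary search in the in-memory index finds the only candidate block.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None  # key is larger than every stored key
        self.block_reads += 1  # stands in for the single disk seek
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None


sst = TinySSTable([("a", b"1"), ("b", b"2"), ("c", b"3"), ("d", b"4"), ("e", b"5")])
```

Because the index is held in memory, a lookup touches at most one block, which is why the slide counts it as a single seek.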
More on Chubby
● Provides a namespace of directories and files, each of which can be used as a lock
● When a client’s session expires, it loses its locks
● Tasks in Bigtable:
○ Ensure at most one master at a time
○ Store the bootstrap location of Bigtable data
○ Discover live tablet servers
○ Store Bigtable schema and metadata like access control
● Uses Paxos to keep its replicas consistent
Implementation: Architecture
● A single master:
○ Keep track of live tablet servers
○ Garbage collect GFS files
○ Assign tablets
○ Handle schema changes
● Many tablet servers:
○ Split an existing large tablet
○ Handle read/write requests
● Clients send read/write requests to tablet servers directly
Implementation: Tablet Location
[Figure: tablets and their SSTables]
Key: table ID + last row in range
Value: SSTables, log, server ID
Implementation: Tablet Location
● 2^34 tablets (2^61 bytes in 128 MB tablets)
● Efficient: cache & prefetch
● If the location is not cached in memory or is stale, the lookup takes several more round trips
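The capacity figures on this slide (2^34 tablets, 2^61 bytes) can be reproduced with a short calculation. It assumes roughly 1 KB per METADATA row and a root tablet that points to METADATA tablets, which in turn point to user tablets; both assumptions come from the Bigtable paper rather than from this slide.

```python
ROW_BYTES = 1 << 10        # ~1 KB per METADATA row (figure from the Bigtable paper)
TABLET_BYTES = 128 << 20   # 128 MB tablet size limit

# Number of location rows that fit in one 128 MB METADATA tablet.
rows_per_metadata_tablet = TABLET_BYTES // ROW_BYTES

# Root tablet -> METADATA tablets -> user tablets: two levels of fan-out.
addressable_tablets = rows_per_metadata_tablet ** 2

# Total data addressable if every user tablet holds 128 MB.
addressable_bytes = addressable_tablets * TABLET_BYTES
```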
Implementation: Tablet Assignment
● Master:
○ Keep track of live tablet servers
○ Keep track of the current assignment of tablets
○ Assign tablets by sending a load request
Implementation: Tablet Assignment
Master
Chubby servers directory: server files of all alive servers (File for server 1, File for server 2, File for server 3, ...)
Tablet servers: tablet server 1, tablet server 2, tablet server 3, ...
● A tablet is assigned to one tablet server
● Server file represents the liveness of a tablet server
Unassigned list
Common case: Assign t5
Master
Chubby servers directory: server files of all alive servers (File for server 1, File for server 2, File for server 3, ...)
Tablet servers: tablet server 1, tablet server 2, tablet server 3
Unassigned list: t5
Current assignment: Server 1: t0 t1; Server 2: t2 t3; Server 3: t4
Master: "After running the load-balancing algorithm, I should assign t5 to server 3."
Master sends a load request for t5 to tablet server 3.
Unassigned list: (empty)
Current assignment: Server 1: t0 t1; Server 2: t2 t3; Server 3: t4 t5
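The common-case walkthrough can be sketched as below. The slides do not say which load-balancing algorithm the master runs, so this sketch assumes a simple "fewest tablets wins" policy; `pick_server` and `assign_next` are hypothetical helpers.

```python
from typing import Dict, List


def pick_server(assignment: Dict[str, List[str]]) -> str:
    # Assumed policy: choose the server that currently holds the fewest tablets.
    return min(assignment, key=lambda server: len(assignment[server]))


def assign_next(unassigned: List[str], assignment: Dict[str, List[str]]) -> str:
    tablet = unassigned.pop(0)
    server = pick_server(assignment)
    # In Bigtable the master would send a load request to `server` here.
    assignment[server].append(tablet)
    return server


unassigned = ["t5"]
assignment = {"server1": ["t0", "t1"], "server2": ["t2", "t3"], "server3": ["t4"]}
chosen = assign_next(unassigned, assignment)
```

With the state shown on the slides, this policy picks server 3, matching the walkthrough.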
What if a tablet server is partitioned?
For example, server 1 loses its Chubby session and, with it, the exclusive lock on its server file.
Master periodically asks each tablet server for the status of its lock.
The master learns how to reach the servers from the files in the servers directory.
Server 1 does not respond, or it reports that it has lost its lock.
Master tries to acquire the lock on File 1. If it succeeds, it deletes File 1 and moves the tablets assigned to server 1 into the set of unassigned tablets.
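The master's reaction to a partitioned tablet server can be sketched as follows. This is a simplified model: Chubby is replaced by a plain set of file names, `reclaim_dead_servers` is a hypothetical function, and the master's attempt to acquire the server file's lock is assumed to succeed.

```python
from typing import Dict, List, Set


def reclaim_dead_servers(
    server_files: Set[str],            # files in the Chubby servers directory
    holds_lock: Dict[str, bool],       # each server's reply; missing = no response
    assignment: Dict[str, List[str]],
    unassigned: List[str],
) -> None:
    for server in sorted(server_files):
        if holds_lock.get(server, False):
            continue  # server still holds its lock, nothing to do
        # Assumption: the master acquires the file's lock successfully here.
        server_files.discard(server)                    # delete the server file
        unassigned.extend(assignment.pop(server, []))   # tablets become unassigned


server_files = {"server1", "server2", "server3"}
holds_lock = {"server2": True, "server3": True}   # server1 never answered
assignment = {"server1": ["t0", "t1"], "server2": ["t2", "t3"], "server3": ["t4", "t5"]}
unassigned: List[str] = []
reclaim_dead_servers(server_files, holds_lock, assignment, unassigned)
```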
What if master is partitioned?
The master kills itself if its Chubby session expires
What will the new master do?
● Goal: discover the current tablet assignments
● How?
○ Grab master lock in Chubby
○ Scan the servers directory to find the live servers
○ Communicate with live servers to discover all assigned tablets
○ Scan the METADATA tablets to learn the set of all tablets
○ Add unassigned tablets to a set of unassigned tablets
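The last three steps amount to a set difference, sketched below. `tablets_to_assign` is a hypothetical helper; the Chubby lock and the directory scan are assumed to have already happened, and the server replies and METADATA contents are made-up inputs.

```python
from typing import Dict, List, Set


def tablets_to_assign(reported: Dict[str, List[str]], all_tablets: Set[str]) -> Set[str]:
    # Tablets that the live servers say they are already serving.
    assigned = {tablet for tablets in reported.values() for tablet in tablets}
    # Anything listed in METADATA but served by nobody is unassigned.
    return all_tablets - assigned


reported = {"server2": ["t2", "t3"], "server3": ["t4", "t5"]}   # replies from live servers
all_tablets = {"t0", "t1", "t2", "t3", "t4", "t5"}              # from the METADATA scan
pending = tablets_to_assign(reported, all_tablets)
```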
Implementation: Tablet Serving
● Older data in SSTable, newer in memtable
● Memtable and SSTable are sorted by key
● Read:
○ Merge SSTable Files and memtable
● Write:
○ Check authorization
○ Write the mutation to the commit log
○ After commit, insert it into the memtable
● Recover:
○ Read the METADATA table to find the tablet's SSTables and log
○ Redo logged updates to reconstruct the memtable
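The read, write, and recovery paths can be sketched together. This is a toy model under stated assumptions: `TinyTablet` is a hypothetical class, SSTables and the memtable are plain dicts instead of sorted structures, and the authorization check is omitted.

```python
from typing import Dict, List, Optional, Tuple


class TinyTablet:
    """Hypothetical tablet: commit log + memtable + immutable SSTables."""

    def __init__(self, sstables: List[Dict[str, bytes]],
                 log: List[Tuple[str, bytes]]) -> None:
        self.sstables = sstables              # oldest first, never modified
        self.log = list(log)
        self.memtable: Dict[str, bytes] = {}
        for key, value in self.log:           # recovery: redo logged writes
            self.memtable[key] = value

    def write(self, key: str, value: bytes) -> None:
        self.log.append((key, value))         # log first ...
        self.memtable[key] = value            # ... then apply to the memtable

    def read(self, key: str) -> Optional[bytes]:
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # then newest SSTable first
            if key in sstable:
                return sstable[key]
        return None


tablet = TinyTablet([{"a": b"old", "b": b"1"}], [("a", b"new")])
tablet.write("c", b"2")
recovered = TinyTablet(tablet.sstables, tablet.log)   # simulate a restart
```

The restart at the end rebuilds the memtable purely from the log, which is the REDO step on the slide.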
Implementation: Compactions
● Minor compaction: convert the memtable to a new SSTable
○ To shrink memory usage
○ To reduce REDO in recovery
Implementation: Compactions
● Major/merging compaction: SSTables + memtable -> one new SSTable, run periodically
○ To reclaim resources used by deleted data
○ To reduce the number of SSTables a read has to merge
Example: {(a, 3), (b, 5)} + {delete a, update (b, 6), (c, 7)} = {(b, 6), (c, 7)}
● Garbage collection
○ Update METADATA tablet
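The (a, b, c) example above can be reproduced with a small merge function. This is a minimal sketch: `major_compact` is a hypothetical name, the inputs are plain dicts, and a deletion is represented by `None` as an assumed tombstone marker.

```python
from typing import Dict, Optional

TOMBSTONE = None  # assumed marker for a deleted key


def major_compact(sstable: Dict[str, bytes],
                  memtable: Dict[str, Optional[bytes]]) -> Dict[str, bytes]:
    merged: Dict[str, Optional[bytes]] = dict(sstable)
    merged.update(memtable)   # newer entries override older ones
    # Drop deleted keys and emit the survivors in key order.
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}


old = {"a": b"3", "b": b"5"}
new = {"a": TOMBSTONE, "b": b"6", "c": b"7"}
compacted = major_compact(old, new)
```

Dropping the tombstone for `a` is what reclaims the space held by deleted data.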
Refinement 1: Locality Group
Locality group = frequently-accessed-together column families
Clients can define locality groups
A separate SSTable for each locality group
Refinement 2: Caching for read performance
Two levels of caching used
The Scan Cache: caches the key-value pairs
useful when the same data is read repeatedly
The Block Cache: caches SSTable blocks
useful when reading data close to recently read data
Refinement 3: Bloom Filters
Tells you whether a key might exist in an SSTable
Needs little memory
No false negatives
Low false-positive rate
Looking for key 'b'
Bloom Filter
A 0-1 array
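The 0-1 array can be sketched as a tiny Bloom filter. The array size, the number of hash functions, and the use of salted SHA-256 are illustrative assumptions, not details from the slides.

```python
import hashlib
from typing import List


class BloomFilter:
    def __init__(self, num_bits: int = 64, num_hashes: int = 3) -> None:
        self.bits: List[int] = [0] * num_bits   # the 0-1 array
        self.num_hashes = num_hashes

    def _positions(self, key: str) -> List[int]:
        # Derive several bit positions by salting one hash function.
        return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % len(self.bits)
                for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for key in ["a", "c", "d"]:
    bf.add(key)
```

A read for key 'b' can skip the SSTable whenever `bf.might_contain("b")` is false; a true answer still requires checking the SSTable, since it may be a false positive.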
Refinement 4: Commit-log implementation
One log file per tablet: a lot of log files, a lot of concurrent disk seeks
A refinement: one log file per tablet server
But complicates recovery - duplicate log reads
Avoiding duplicate log reads - sort log entries by <table, row name, log sequence number>
Only sort the log during recovery
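The sort by <table, row name, log sequence number> can be sketched as below. `LogEntry` and `group_for_recovery` are hypothetical names, and the sketch groups entries per table for brevity, whereas a real recovery would read the contiguous range belonging to one tablet.

```python
from itertools import groupby
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str, int, bytes]  # (table, row, sequence number, mutation)


def group_for_recovery(log: List[LogEntry]) -> Dict[str, List[LogEntry]]:
    # Sort by <table, row name, log sequence number> so related entries are contiguous.
    ordered = sorted(log, key=lambda e: (e[0], e[1], e[2]))
    return {table: list(entries)
            for table, entries in groupby(ordered, key=lambda e: e[0])}


# One shared log with interleaved mutations for two tables (made-up data).
shared_log: List[LogEntry] = [
    ("users", "bob", 2, b"x"),
    ("pages", "com.cnn.www", 1, b"y"),
    ("users", "alice", 3, b"z"),
    ("pages", "com.cnn.www", 4, b"w"),
]
grouped = group_for_recovery(shared_log)
```

After sorting, each recovering server can read its entries in one sequential pass instead of scanning the whole shared log.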
Refinement 5: Speeding up tablet movement
Tablets need to move when, e.g., the cluster size changes
Source tablet server does minor compactions before unloading the tablet
Destination tablet server does not need to replay the log
Refinement 6: Exploiting immutability
Benefits of having immutable SSTables:
1. No synchronization of SSTable accesses
2. During split, children reuse parent's SSTables
Dealing with mutable MemTable:
Copy-on-write allows concurrent reads and writes
Evaluation
Sequential writes are not faster than random writes because both append to the commit log.

| Faster operation | Slower operation | Reason |
| --- | --- | --- |
| Write | Read | Reads access GFS for SSTables, but writes only access GFS for the log |
| Sequential read | Random read | Caching |
Our Understanding
● Pros:
○ Scalability
○ Cluster resizing without downtime
○ High throughput
● Cons:
○ No multi-row transactions
○ No SQL queries or joins
○ Not a good solution for less than 1 TB of data
Reference
https://www.youtube.com/watch?v=r1bh90_8dsg&t=630s