Google File System
Log-Structured Merge Trees
Marco Serafini
COMPSCI 590S
Lecture 9
Peculiar Requirements
• Huge files
  • Files can span multiple servers
  • Coarse granularity blocks to keep metadata manageable
• Failures
  • Many servers → many failures
• Workload
  • Append-only writes, reads mostly sequential
Design Choices
• Optimized for bandwidth not latency
• Weak consistency
  • Supports multiple concurrent appends to a file
  • Best-effort attempt to guarantee atomicity of each append
  • Minimal attempts to “fix” state after failures
  • No locks
• How to deal with weak consistency
  • Application-level mechanisms to deal with inconsistent data
• Clients cache only metadata
Implementation
• Distributed layer on top of Linux servers
• Use local Linux file system to actually store data
Master-Slave Architecture
• Master
  • Keeps file chunk metadata (e.g. mapping to chunkservers)
  • Failure detection of chunkservers
• Procedure
  • Client contacts master to get metadata (small size)
  • Client contacts chunkserver(s) to get data (large size)
  • Master is not in the “critical path” and is thus not overloaded
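The two-step procedure above can be sketched in a few lines of Python. This is only an illustration: the class names (`Master`, `Chunkserver`, `Client`), the tiny `CHUNK_SIZE`, and the in-process method calls standing in for RPCs are all assumptions made for the example, not the actual GFS interfaces.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; hypothetical and tiny so the example is easy to follow


@dataclass
class Chunkserver:
    chunks: dict = field(default_factory=dict)  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


@dataclass
class Master:
    # file name -> ordered list of (chunk handle, [chunkserver ids])
    files: dict = field(default_factory=dict)

    def lookup(self, name, chunk_index):
        return self.files[name][chunk_index]


class Client:
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers   # chunkserver id -> Chunkserver
        self.cache = {}          # (file, chunk index) -> (handle, locations)
        self.master_calls = 0    # counts metadata requests sent to the master

    def read(self, name, offset, length):
        out = b""
        while length > 0:
            index, within = divmod(offset, CHUNK_SIZE)
            key = (name, index)
            if key not in self.cache:
                # Small metadata request: the only time the master is involved.
                self.cache[key] = self.master.lookup(name, index)
                self.master_calls += 1
            handle, locations = self.cache[key]
            n = min(length, CHUNK_SIZE - within)
            # Large data request goes directly to a chunkserver.
            out += self.servers[locations[0]].read(handle, within, n)
            offset += n
            length -= n
        return out
```

A second read of the same byte range is served without any master call, because the client already holds the chunk metadata in its cache.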
Advantages of Large Chunks
• Small metadata
  • All metadata fits in memory at the master → no bottleneck
  • Clients cache lots of metadata → low load on master
• Batching when transferring data
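A back-of-the-envelope calculation shows why large chunks keep the metadata small. The numbers below are assumptions that do not appear on the slide: 64 MiB chunks and at most 64 bytes of master metadata per chunk, which are the figures reported in the GFS paper. Namespace metadata and replica locations are ignored.

```python
CHUNK_SIZE = 64 * 2**20     # assumed: 64 MiB per chunk
METADATA_PER_CHUNK = 64     # assumed: bytes of master metadata per chunk


def master_metadata_bytes(stored_bytes):
    """Rough upper bound on chunk metadata held in the master's memory."""
    chunks = -(-stored_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


one_pib = 2**50
print(master_metadata_bytes(one_pib) / 2**30, "GiB of metadata for 1 PiB of data")
```

Under these assumptions, 1 PiB of file data needs about 1 GiB of chunk metadata, which fits in the memory of a single machine.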
Master Metadata
• Persisted data
  • File and chunk namespaces
  • File to chunks mapping
  • Operation log (Write-Ahead Log)
  • Stored externally for fault tolerance
• Q: Why not simply restart master from scratch?
  • This is what MapReduce does, after all
• Non-persisted data: Location of chunks
  • Fetched at startup from chunkservers
  • Updated periodically
Operation Log
• Persists state
• Memory mapped file - use only offsets as pointers
• Log is a WAL - we will discuss it
• Trimmed using checkpoints
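The interplay between the write-ahead log and checkpoints can be sketched as follows. This is a toy under stated assumptions: the class `LoggedNamespace`, the JSON encoding, and the file names are invented for the example, and it uses ordinary file writes rather than the memory-mapped file mentioned above.

```python
import json
import os
import tempfile


class LoggedNamespace:
    """Toy master state: file name -> list of chunk handles."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "oplog.jsonl")
        self.ckpt_path = os.path.join(directory, "checkpoint.json")
        self.state = {}
        self._recover()
        self.log = open(self.log_path, "a")

    def _apply(self, op):
        self.state.setdefault(op["file"], []).append(op["chunk"])

    def _recover(self):
        # Load the latest checkpoint, then replay the operations logged after it.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def add_chunk(self, file, chunk):
        op = {"file": file, "chunk": chunk}
        self.log.write(json.dumps(op) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # log first: durable before it is applied
        self._apply(op)

    def checkpoint(self):
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)  # atomic rename of the new checkpoint
        # Trim the log. A crash between the rename and this truncation would
        # replay old operations twice; a real system would store the log
        # position inside the checkpoint to avoid that.
        self.log.close()
        self.log = open(self.log_path, "w")

    def close(self):
        self.log.close()
```

After a restart, constructing a new `LoggedNamespace` on the same directory rebuilds the state from the checkpoint plus the short log tail, so recovery time depends on the log length since the last checkpoint rather than on the full history.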
Chunkserver Replication
• Mutations are sent to all replicas
  • One replica is primary for a lease
  • Within that lease, it totally orders mutations and sends the order to backups
  • After old lease expires, master assigns new primary
• Separation of data and control flow
  • Data dissemination to all replicas (data flow)
  • Ordering through primary (control flow)
Replication Protocol
1. Client disseminates data to chunkservers
2. Client contacts primary replica for ordering
3. Primary determines order (offset of write)
  • Also persists order to disk for recovery
4. Primary sends offset to backups
5. Backups apply write and ack back to primary
6. Primary acks to client
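The six steps can be traced in a small failure-free simulation. Everything here is a simplification for illustration: `Replica`, `Primary`, and `client_append` are hypothetical names, method calls stand in for network messages, leases are not modeled, and the primary does not persist the order to disk.

```python
class Replica:
    def __init__(self):
        self.buffer = {}          # data id -> bytes pushed by a client, not yet ordered
        self.chunk = bytearray()  # contents of the chunk on this replica

    def push(self, data_id, data):
        # Step 1 (data flow): the data arrives, but is not yet written.
        self.buffer[data_id] = data

    def apply(self, data_id, offset):
        # Step 5: write the buffered data at the offset chosen by the primary.
        data = self.buffer.pop(data_id)
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\0" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        return True  # ack


class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups
        self.next_offset = 0

    def append(self, data_id):
        # Step 3 (control flow): the primary alone decides the offset.
        offset = self.next_offset
        self.next_offset += len(self.buffer[data_id])
        self.apply(data_id, offset)
        # Steps 4-5: send the offset to the backups and collect their acks.
        acks = [backup.apply(data_id, offset) for backup in self.backups]
        # Step 6: ack to the client only if every replica applied the write.
        return offset if all(acks) else None


def client_append(primary, data_id, data):
    for replica in [primary, *primary.backups]:
        replica.push(data_id, data)      # step 1
    return primary.append(data_id)       # step 2
```

Because only the primary assigns offsets, two clients appending concurrently still end up at the same positions on every replica.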
Weak Consistency
• In the presence of failures
  • There can be inconsistencies (e.g. failed backup)
  • Client simply retries the write → duplicate data
• Successful write (acknowledged back to client) is
  • Atomic: all data written (but may be later partially overwritten)
  • Consistent: same offset at all replicas
  • This is because the primary proposes a specific offset
• Result: file contains stretches of “good” data interspersed with inconsistent and/or duplicated data
Implications for Applications
• Applications must deal with inconsistency
  • Atomic file renaming after finishing writing to a file (single writer)
  • Add checksums to data to detect incomplete writes
  • Add unique record ids to detect duplication
• More difficult to program!
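One way an application might combine checksums and unique record ids is sketched below. The record layout (a magic marker, payload length, record id, and CRC32) and the function names are assumptions made for this example, not a format defined by GFS.

```python
import struct
import zlib

MAGIC = 0x5245434F                  # hypothetical marker for the start of a record
HEADER = struct.Struct("<IIQI")     # magic, payload length, record id, CRC32


def encode(record_id, payload):
    return HEADER.pack(MAGIC, len(payload), record_id, zlib.crc32(payload)) + payload


def decode_all(data):
    """Yield (record_id, payload) for each valid record, skipping duplicates."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        magic, length, record_id, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if magic != MAGIC or len(payload) < length or zlib.crc32(payload) != crc:
            pos += 1                # padding or incomplete write: resynchronize
            continue
        pos = start + length
        if record_id in seen:       # a retried append wrote this record twice
            continue
        seen.add(record_id)
        yield record_id, payload
```

A reader that scans a file this way sees each record once, even if the file contains padding, a truncated record from a failed append, and a duplicate from a client retry.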
Log-Structured Merge Trees
LSMT Data Structures
• Memtable
  • Binary tree or skiplist → sorted
  • Receives writes and serves reads
  • Persistency through a Write Ahead Log
• Log files (runs)
  • L0: dump of memtable
  • Li: merge of multiple Li-1 runs
• Goal: make disk accesses sequential!
Operations
• Writes go to memtable
• Reads
  • Search memtables and read caches (if available)
  • Search log files in reverse chronological order
  • Bloom filters – indices in log files
• Periodically dump memtable to L0
• Periodically merge from Li-1 to Li
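The write and read paths can be illustrated with a minimal in-memory sketch. The class `TinyLSM` is hypothetical: a plain dict stands in for the sorted memtable, Python lists stand in for on-disk runs, and the write-ahead log, read caches, Bloom filters, and merging are left out.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # stands in for a sorted tree or skiplist
        self.runs = []       # sorted runs, oldest first; each is a list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value              # writes only touch the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            # Dump the memtable as one sorted run (a sequential write on disk).
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):         # newest run first
            i = bisect.bisect_left(run, (key,))  # binary search within the run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Searching runs from newest to oldest is what makes an overwrite visible: the first match found is always the most recent value for that key.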
Optimizing Reads
• Binary search in each run
• Use a block index
• Bloom filter
  • Over-approximation of a set: the membership test x ∈ S can return
    • False positives
    • No false negatives
  • Much smaller than storing entire set (e.g. HashSet)
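A Bloom filter fits in a few lines. The sketch below is illustrative only: the sizes (`num_bits`, `num_hashes`) are arbitrary defaults, and deriving the hash functions from salted SHA-256 is a convenience for the example rather than what a storage engine would typically use.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

When `might_contain` returns False for a run, the read can skip that run entirely; a True answer still requires the binary search, since it may be a false positive.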
Merging
• Starting from L1, every run is related to a key partition
• Merging runs
  • Take two Li runs
  • Merge with the relevant Li+1 runs (sequential)
  • Create new run Li+1 to replace the merged one
  • If too large, create a new Li+1 run
  • Merge to Li+2 if needed (too many Li+1 runs)
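The core of the merge step, combining several sorted runs into one while keeping only the newest value for each key, might look like the sketch below. The function `merge_runs` is a hypothetical helper; key partitioning, splitting an oversized output run, and cascading merges into the next level are not shown.

```python
import heapq


def merge_runs(runs):
    """Merge sorted runs (oldest first) into one sorted run; the newest value wins."""
    # Tag every entry with -age so that, among equal keys, the entry from the
    # newest run sorts first in the merged stream.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))   # first occurrence is the newest version
    return merged
```

Because every input is already sorted, the merge reads each run once from start to end and writes the output once, which is what keeps the disk accesses sequential.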