Google File System
By Sanjay Ghemawat, Howard Gobioff, and
Shun-Tak Leung
The Google File System (GFS)
Presented by 148215N J.R.K.C Jayakody and 148201T M.M. Shareek Ahamed
Google Platform Characteristics
•100s to 1000s of PCs in cluster
•Many modes of failure for each PC:
  •App bugs, OS bugs
  •Human error
  •Disk failure, memory failure, net failure, power supply failure
  •Connector failure
•Monitoring, fault tolerance, auto-recovery essential
Google File System: Design Criteria
•Detect, tolerate, recover from failures automatically
•Large files, >= 100 MB in size
•Large, streaming reads (>= 1 MB in size)
•Concurrent appends by multiple clients (e.g., producer-consumer queues)
•Want atomicity for appends without synchronization overhead among clients
GFS: Architecture
•One master server
•Many chunkservers (100s to 1000s)
  •Spread across racks
  •Chunk: 64 MB portion of file, identified by 64-bit, globally unique ID
•Many clients accessing same and different files stored on same cluster
Master Server
•Holds all metadata:
  •Namespace (directory hierarchy)
  •Access control information (per-file)
  •Mapping from files to chunks
  •Current locations of chunks (chunkservers)
•Delegates consistency management
•Garbage collects orphaned chunks
•Migrates chunks between chunkservers
Chunkserver
•Stores 64 MB file chunks on local disk using standard Linux filesystem, each with version number and checksum
•Read/write requests specify chunk handle and byte range
•Chunks replicated on configurable number of chunkservers (default: 3)
•No caching of file data (beyond standard Linux buffer cache)
Client
•Issues control (metadata) requests to master server
•Issues data requests directly to chunkservers
•Caches metadata
•Does no caching of data
  •Streaming reads (read once) and append writes (write once) don’t benefit much from caching at client
Client Read
•Client sends master:
  •read(file name, chunk index)
•Master’s reply:
  •chunk ID, chunk version number, locations of replicas
•Client sends “closest” chunkserver w/replica:
  •read(chunk ID, byte range)
•Chunkserver replies with data
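Before it can send read(file name, chunk index) to the master, the client has to turn a byte offset into a chunk index. A minimal sketch of that arithmetic, assuming the fixed 64 MB chunk size from the architecture slide; the function names are hypothetical and not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated on the architecture slide


def chunk_index(offset):
    """Index of the chunk that holds the given byte offset of a file."""
    return offset // CHUNK_SIZE


def split_read(offset, length):
    """Split a file-level read into (chunk index, start, end) pieces.

    start and end are byte positions inside the chunk; end is exclusive.
    """
    pieces = []
    end = offset + length
    while offset < end:
        idx = chunk_index(offset)
        start = offset % CHUNK_SIZE
        stop = min(CHUNK_SIZE, start + (end - offset))
        pieces.append((idx, start, stop))
        offset += stop - start
    return pieces
```

A read that straddles a chunk boundary becomes two requests, one per chunk, and each may go to a different chunkserver.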
Client Write
•Some chunkserver is primary for each chunk
  •Master grants lease to primary (typically for 60 sec.)
  •Leases renewed using periodic heartbeat messages between master and chunkservers
•Client asks master for primary and secondary replicas for each chunk
•Client sends data to replicas
  •Pipelined: each replica forwards as it receives
Client Write (continued)
•All replicas acknowledge data write to client
•Client sends write request to primary
•Primary assigns serial number to write request, providing ordering
•Primary forwards write request with same serial number to secondaries
•Secondaries all reply to primary after completing write
•Primary replies to client
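The ordering step above can be sketched as a toy model: the primary picks a serial number, applies the write, and forwards the same serial number to every secondary. The classes below are hypothetical and keep everything in memory; networking, leases, and failure handling are left out.

```python
class Replica:
    """Toy replica that records mutations in the order it applies them."""

    def __init__(self):
        self.applied = []  # list of (serial number, data)

    def apply(self, serial, data):
        self.applied.append((serial, data))
        return True  # acknowledgement


class Primary(Replica):
    """Toy primary: chooses the serial order and forwards it to secondaries."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data)  # primary applies first
        acks = [s.apply(serial, data) for s in self.secondaries]
        return all(acks)  # reply to client once every secondary has answered
```

Because every replica applies mutations in the serial order chosen by the primary, concurrent writes from different clients end up in the same order everywhere.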
Client Record Append
•Google uses large files as queues between multiple producers and consumers
•Same control flow as for writes
•Client pushes data to replicas of last chunk of file
•Client sends request to primary
•Common case: request fits in current last chunk:
  •Primary appends data to own replica
  •Primary tells secondaries to do same at same byte offset in theirs
  •Primary replies with success to client
Client Record Append (continued)
•When data won’t fit in last chunk:
  •Primary fills current chunk with padding
  •Primary instructs other replicas to do same
  •Primary replies to client, “retry on next chunk”
•If record append fails at any replica, client retries operation
  •So replicas of same chunk may contain different data, even duplicates of all or part of record data
•What guarantee does GFS provide on success?
  •Data written at least once in atomic unit
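The primary's decision between "append here" and "pad and retry on next chunk" comes down to one comparison. A minimal sketch, assuming the primary tracks how many bytes of the last chunk are used; the function name and return shape are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


def record_append(chunk_used, record_len, chunk_size=CHUNK_SIZE):
    """Decide where a record goes.

    Returns (status, offset, new_used):
      ("ok", offset, new_used)            record fits; written at offset
      ("retry_on_next_chunk", None, size) chunk padded to its end; client retries
    """
    if record_len > chunk_size:
        raise ValueError("record larger than a chunk")
    if chunk_used + record_len <= chunk_size:
        return ("ok", chunk_used, chunk_used + record_len)
    # Not enough room: pad the rest of the chunk on every replica.
    return ("retry_on_next_chunk", None, chunk_size)
```

The offset is chosen by the primary, not the client, which is what lets many clients append concurrently without coordinating among themselves.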
GFS: Consistency Model
•Changes to data are ordered as chosen by a primary
  •All replicas will be consistent
•Record append completes at least once
  •Applications must cope with possible duplicates
Applications and Record Append Semantics
•Applications should use self-describing records and checksums when using Record Append
  •Reader can identify padding / record fragments
•If application cannot tolerate duplicated records, should include unique ID in record
  •Reader can use unique IDs to filter duplicates
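One way an application could follow this advice is sketched below. The record layout (a marker, a unique ID, a length, and a CRC32 checksum in front of the payload) is an assumption made for illustration, not a format defined by GFS.

```python
import struct
import zlib

MAGIC = b"REC1"                   # hypothetical record marker
HEADER = struct.Struct(">4sQII")  # marker, unique ID, payload length, CRC32


def encode(record_id, payload):
    """Build a self-describing record: header followed by payload."""
    return HEADER.pack(MAGIC, record_id, len(payload), zlib.crc32(payload)) + payload


def read_records(data):
    """Return valid payloads once each; skip padding, fragments, duplicates."""
    seen = set()
    out = []
    pos = 0
    while pos + HEADER.size <= len(data):
        magic, rid, length, crc = HEADER.unpack_from(data, pos)
        body = data[pos + HEADER.size : pos + HEADER.size + length]
        if magic != MAGIC or len(body) != length or zlib.crc32(body) != crc:
            pos += 1  # padding or a record fragment: move on and resynchronize
            continue
        if rid not in seen:  # filter duplicates left by retried appends
            seen.add(rid)
            out.append(body)
        pos += HEADER.size + length
    return out
```

The checksum lets the reader reject fragments and padding, and the unique ID lets it drop records that a retried append wrote more than once.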
Logging at Master
•Master has all metadata information
  •Lose it, and you’ve lost the filesystem!
•Master logs all metadata changes to disk sequentially (the operation log)
  •Replicates log entries to remote backup servers
Chunk Leases and Version Numbers
•If no outstanding lease when client requests write, master grants new one
•Chunks have version numbers
  •Stored on disk at master and chunkservers
  •Each time master grants new lease, increments version, informs all replicas
•Master can revoke leases
  •e.g., when client requests rename or snapshot of file
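The version bump on each lease grant is what lets the master recognize a replica that missed mutations while its chunkserver was down. A minimal sketch, modelling the master and each replica as plain dictionaries from chunk ID to version; all names are hypothetical.

```python
def grant_lease(master_versions, chunk_id, reachable_replicas):
    """Increment the chunk version and inform every reachable replica."""
    new_version = master_versions.get(chunk_id, 0) + 1
    master_versions[chunk_id] = new_version
    for replica in reachable_replicas:
        replica[chunk_id] = new_version
    return new_version


def is_stale(master_versions, chunk_id, replica):
    """A replica is stale if its version is behind the master's."""
    return replica.get(chunk_id, 0) < master_versions[chunk_id]
```

A replica that was unreachable during a grant keeps its old version number, so the comparison flags it when its chunkserver reports back.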
Master Operations
•Locking Operations
•Replica Placement
•Garbage Collection
•Creation, Re-replication
Locking Operations: Google Directory Structure
•Uses a lookup table that maps full path names to metadata
•Each node can be a file or a directory
Locking Operations
•To access /d1/d2/leaf
  •Need to lock /d1, /d1/d2, and /d1/d2/leaf
•Can modify a directory concurrently
  •Each operation takes a read lock on the directory
  •And a write lock on the file
•Totally ordered locking to prevent deadlocks
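The lock set for a path, and why two file creations in the same directory do not block each other, can be sketched as follows. The helper names are hypothetical, and the particular total order used here (tree depth, then name) is an assumption chosen for illustration.

```python
def locks_for(path, write=True):
    """Locks for one operation: read locks on ancestors, one lock on the leaf."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
    locks = [(p, "read") for p in prefixes[:-1]]
    locks.append((prefixes[-1], "write" if write else "read"))
    return locks


def acquisition_order(locks):
    """One fixed global order, so two operations can never deadlock."""
    return sorted(locks, key=lambda lock: (lock[0].count("/"), lock[0]))


def conflict(a, b):
    """Lock sets conflict if they share a path and at least one side writes it."""
    modes_b = dict(b)
    return any(p in modes_b and "write" in (m, modes_b[p]) for p, m in a)
```

Creating /d1/a and /d1/b at the same time only needs read locks on /d1, so the two operations proceed concurrently; an operation that write-locks /d1 itself conflicts with both.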
Replica Placement
•Several replicas for each chunk
  •Default: 3 replicas
  => 3 replicas on the same machine? (No use)
  => Spread the chunk replicas across machines and racks
•Goals:
  •Maximize data reliability and availability
•Tradeoff: network traffic (writes must cross racks)
Creation, Re-replication, Rebalancing (Master's Responsibility)
Replica creation
- Find a chunkserver with below-average disk space utilization
- Limit the number of recent creations on each chunkserver
- Spread chunks across racks
Why re-replication
- If a chunkserver goes down
- If a replica is corrupted
- If the replication goal (number of replicas) is increased
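The three creation rules can be combined into a simple chooser. This is only a sketch under stated assumptions: the server records, the max_recent threshold, and the two-pass strategy are hypothetical and are not taken from GFS itself.

```python
def place_replicas(servers, copies=3, max_recent=2):
    """Pick chunkservers for new replicas.

    servers: list of dicts with keys name, rack, used (0..1), recent (creations).
    """
    avg = sum(s["used"] for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers if s["used"] < avg and s["recent"] < max_recent),
        key=lambda s: s["used"],
    )
    chosen, racks = [], set()
    for s in candidates:  # first pass: at most one replica per rack
        if len(chosen) < copies and s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
    for s in candidates:  # second pass: fill up if there are too few racks
        if len(chosen) < copies and s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]
```

Servers that are fuller than average or that received many recent creations are skipped, and distinct racks are preferred before a rack is reused.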
Garbage Collection
•Delete is not actually a delete
•When a file is deleted
  => The file is renamed to a hidden name that carries the deletion timestamp (still readable, and the delete can be undone)
•Master scans and deletes hidden files older than 3 days
•Stale replica removal
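The lazy deletion described above can be sketched with two dictionaries, one for visible files and one for hidden ones. The class and method names are hypothetical; the 3-day grace period is the figure from the slide.

```python
GRACE_SECONDS = 3 * 24 * 3600  # hidden files older than 3 days are reclaimed


class Namespace:
    """Toy master namespace with deferred deletion."""

    def __init__(self):
        self.files = {}   # visible name -> list of chunk IDs
        self.hidden = {}  # original name -> (deletion time, list of chunk IDs)

    def delete(self, name, now):
        """Hide the file and remember when it was deleted."""
        self.hidden[name] = (now, self.files.pop(name))

    def undelete(self, name):
        """Undo a delete while the file is still hidden."""
        _, chunks = self.hidden.pop(name)
        self.files[name] = chunks

    def scan(self, now):
        """Drop expired hidden files; return the chunk IDs that became orphans."""
        expired = [n for n, (t, _) in self.hidden.items() if now - t > GRACE_SECONDS]
        orphaned = []
        for name in expired:
            orphaned.extend(self.hidden.pop(name)[1])
        return orphaned
```

Only the periodic scan actually frees anything, so an accidental delete stays reversible until the grace period runs out.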
Fault Tolerance and Diagnosis
•Fast recovery:
  - Master and chunkservers are designed to restore their state and start in seconds, regardless of how they terminated
•Chunk replication:
  - On multiple machines and racks
•Master replication:
  - Primary master and shadow masters
Performance Measurements
Performance measured on a cluster with
•1 master, 2 master replicas
•16 chunkservers
•16 clients
Machines: 100 Mbps Ethernet
Switches: connected by a 1 Gbps link
Machine Configuration
Processor: dual 1.4 GHz PIII
RAM: 2 GB
HD: 2 x 80 GB
19 servers, 16 clients
Reads
•Each client reads a randomly selected 4 MB region from a 320 GB file set; all clients read simultaneously
•4 MB x 256 times = 1 GB read per client
•1 client => limit 12.5 MB/s => measured 10 MB/s (80% of limit)
•16 clients => limit 125 MB/s => measured 94 MB/s (75% of limit)
•Why low?
  1) Multiple clients may read from the same chunkserver at the same time
  2) Network traffic
Writes
•N clients writing to N files simultaneously
•1 client => limit 12.5 MB/s => measured 6.3 MB/s (about 50% of limit)
•16 clients => limit 67 MB/s => measured 35 MB/s (about 50% of limit)
•Why low?
  •Each byte is written to 3 replicas
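The limits quoted on the Reads and Writes slides follow from the test cluster's network figures. A short worked calculation, assuming 8 bits per byte and that each chunkserver can take in at most its 100 Mbps link rate; the variable names are hypothetical.

```python
NIC_MBPS = 100     # each machine: 100 Mbps Ethernet
LINK_MBPS = 1000   # link between the two switches: 1 Gbps
CHUNKSERVERS = 16
REPLICAS = 3


def mb_per_s(mbps):
    """Convert megabits per second to megabytes per second."""
    return mbps / 8


per_client_limit = mb_per_s(NIC_MBPS)        # one client is capped by its own NIC
aggregate_read_limit = mb_per_s(LINK_MBPS)   # all reads cross the inter-switch link
# Every byte lands on 3 of the 16 chunkservers, each limited by its own NIC.
aggregate_write_limit = CHUNKSERVERS * per_client_limit / REPLICAS


def efficiency(measured, limit):
    """Fraction of the theoretical limit that was actually achieved."""
    return measured / limit
```

This gives 12.5 MB/s per client, 125 MB/s for aggregate reads, and about 67 MB/s for aggregate writes, which matches the limits on the slides.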
Record Appends
•N clients appending to a single file simultaneously
•Append rate goes down as the number of clients grows
•1 client => 125 MB/s => 6.0 MB/s
•16 clients => 125 MB/s => 4.8 MB/s
Performance (Real-world Cluster)
•Cluster A: Research and development
  •Used by over a hundred engineers
  •Typical task is initiated by a user and runs for a few hours
  •Task reads MBs to TBs of data, transforms/analyzes the data, and writes results back
•Cluster B: Production data processing
  •Typical task runs much longer than a Cluster A task
  •Continuously generates and processes multi-TB data sets
  •Human users rarely involved
•Measurements were taken after the clusters had been running for about a week
[Table: Characteristics of two GFS clusters. Cluster A: research and development; Cluster B: production data processing]
[Table: Performance metrics for two GFS clusters. Cluster A: research and development; Cluster B: production data processing]
Performance (Real-world Cluster)
•Experiment in recovery time:
  •Killed one chunkserver in Cluster B (production)
  •Chunkserver had about 15,000 chunks containing 600 GB of data
  •Took 23.2 minutes to restore all the chunks
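The recovery figures imply an effective re-replication rate, worked out below. The 1 GB = 1024 MB convention is an assumption; with 1 GB = 1000 MB the result is slightly lower.

```python
def recovery_rate_mb_s(data_gb, minutes, mb_per_gb=1024):
    """Average rate at which lost chunk data was re-replicated."""
    return data_gb * mb_per_gb / (minutes * 60)


rate = recovery_rate_mb_s(600, 23.2)   # 600 GB restored in 23.2 minutes
avg_chunk_mb = 600 * 1024 / 15_000     # average data held per chunk
```

The rate comes out at roughly 441 MB/s across the cluster, and the average chunk held about 41 MB, below the 64 MB maximum because not every chunk is full.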
Conclusion
•GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
•GFS occupies a different point in the design space
  •Component failures as the norm
  •Optimize for huge files
•GFS provides fault tolerance
  •Replicating data
  •Fast and automatic recovery
  •Chunk replication
•GFS has a simple, centralized master that does not become a bottleneck
•GFS is a successful file system
  •An important tool that enables Google to continue to innovate on its ideas
Thank you