The Google File System
![Page 1: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/1.jpg)
The Google File System

Published By: Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung (Google)
Presented By:
Manoj Samaraweera (138231B) Azeem Mumtaz (138218R)
University of Moratuwa
![Page 2: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/2.jpg)
Contents
• Distributed File Systems
• Introducing Google File System
• Design Overview
• System Interaction
• Master Operation
• Fault Tolerance and Diagnosis
• Measurements and Benchmarks
• Experience
• Related Works
• Conclusion
• References
![Page 3: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/3.jpg)
Distributed File Systems
•Enables programs to store and access remote files exactly as they do local ones
•New modes of data organization on disk or across multiple servers
• Goals
▫ Performance
▫ Scalability
▫ Reliability
▫ Availability
![Page 4: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/4.jpg)
Introducing Google File System
• Growing demand for Google data processing
• Properties
▫ A scalable distributed file system
▫ For large, distributed, data-intensive applications
▫ Fault tolerance
▫ Inexpensive commodity hardware
▫ High aggregate performance
• Design is driven by observation of workloads and the technological environment
![Page 5: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/5.jpg)
Design Assumptions
• Component failures are the norm
▫ Commodity hardware
• Files are huge by traditional standards
▫ Multi-GB files
▫ Small files must also be supported, but need not be optimized for
• Read workloads
▫ Large streaming reads
▫ Small random reads
• Write workloads
▫ Large, sequential writes that append data to files
• Multiple clients concurrently append to one file
▫ Consistency semantics
▫ Files are used as producer-consumer queues or for many-way merging
• High sustained bandwidth is more important than low latency
![Page 6: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/6.jpg)
Design Interface
• Typical file system interface
• Hierarchical directory organization
• Files are identified by pathnames
• Operations
▫Create, delete, open, close, read, write
![Page 7: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/7.jpg)
Architecture (1/2)
• Files are divided into fixed-size chunks (64 MB)
• Each chunk has a unique 64-bit chunk handle
▫ Immutable and globally unique
• Chunks are stored as Linux files
• Chunks are replicated over chunkservers; the copies are called replicas
▫ 3 replicas by default
▫ Different replication levels for different regions of the file namespace
• Single master
• Multiple chunkservers
▫ Grouped into racks
▫ Connected through switches
• Multiple clients
• Master/chunkserver coordination
▫ HeartBeat messages
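The addressing scheme above — files split into fixed-size chunks, each identified by a globally unique 64-bit handle and replicated across chunkservers — can be sketched with two maps. This is a toy model; the names and data shapes are illustrative, not GFS internals.

```python
import secrets

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

class MasterMetadata:
    """Toy sketch of the master's in-memory maps."""
    def __init__(self):
        self.file_to_chunks = {}   # path -> list of 64-bit chunk handles
        self.chunk_locations = {}  # handle -> set of chunkserver ids (replicas)

    def create_chunk(self, path, servers):
        handle = secrets.randbits(64)  # immutable, globally unique handle
        self.file_to_chunks.setdefault(path, []).append(handle)
        self.chunk_locations[handle] = set(servers)  # 3 replicas by default
        return handle

    def lookup(self, path, offset):
        """Translate (path, byte offset) -> (chunk handle, replica locations)."""
        index = offset // CHUNK_SIZE
        handle = self.file_to_chunks[path][index]
        return handle, self.chunk_locations[handle]

md = MasterMetadata()
h = md.create_chunk("/data/log", ["cs1", "cs2", "cs3"])
handle, replicas = md.lookup("/data/log", 10 * 1024 * 1024)  # inside chunk 0
```

A real client would cache the lookup result and talk to the chunkservers directly, keeping the master out of the data path.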
![Page 8: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/8.jpg)
Architecture (2/2)
![Page 9: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/9.jpg)
Single Master
• Maintains metadata
• Controls system-wide activities
▫ Chunk lease management
▫ Garbage collection
▫ Chunk migration
▫ Replication
![Page 10: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/10.jpg)
Chunk Size (1/2)
• 64 MB
• Stored as a plain Linux file on a chunkserver
• Advantages
▫ Reduces clients' interaction with the single master
▫ Clients are likely to perform many operations on a large chunk
 Reduces network overhead by keeping a persistent TCP connection with the chunkserver
▫ Reduces the size of the metadata
 Metadata can be kept in memory
▫ Lazy space allocation
![Page 11: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/11.jpg)
Chunk Size (2/2)
• Disadvantages
▫ Small files consist of a small number of chunks, which may become hot spots
▫ Solutions
 Higher replication factor
 Stagger application start times
 Allow clients to read from other clients
![Page 12: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/12.jpg)
Metadata (1/5)
• 3 major types
▫ The file and chunk namespaces
▫ File-to-chunk mappings
▫ The locations of each chunk's replicas
• Namespaces and mappings
▫ Persisted by logging mutations to an operation log stored on the master
▫ The operation log is replicated
![Page 13: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/13.jpg)
Metadata (2/5)
• Metadata is stored in memory
▫ Improves the performance of the master
▫ Makes it easy to scan the entire metadata state periodically, for:
 Chunk garbage collection
 Re-replication in the presence of chunkserver failures
 Chunk migration to balance load and disk space
• Less than 64 bytes of metadata per 64 MB chunk
• File namespace data requires < 64 bytes per file
▫ Prefix compression
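A quick back-of-envelope check of why keeping all metadata in memory is feasible, using the slide's figure of at most 64 bytes per 64 MB chunk (the petabyte-scale workload below is a hypothetical example, not a number from the deck):

```python
CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB chunks
BYTES_PER_CHUNK_META = 64          # upper bound quoted on the slide

def master_metadata_bytes(total_file_bytes):
    """Memory the master needs for per-chunk metadata alone."""
    chunks = -(-total_file_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * BYTES_PER_CHUNK_META

# 1 PB of file data needs under 1 GB of master memory for chunk metadata.
meta = master_metadata_bytes(10 ** 15)
```

The ratio is roughly one part per million: metadata stays small enough that a single machine's RAM can index the whole cluster.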
![Page 14: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/14.jpg)
Metadata (3/5)
• Chunk location information
▫ Polled from chunkservers at master startup
 Chunkservers join and leave the cluster
▫ Kept up to date via HeartBeat messages
![Page 15: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/15.jpg)
Metadata (4/5)
• Operation log
▫ Historical record of critical metadata changes
▫ Logical timeline that defines the order of concurrent operations
▫ Changes are not visible to clients until the log has been replicated and flushed to disk
▫ Flushing and replication are batched
 Reduces the impact on system throughput
![Page 16: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/16.jpg)
Metadata (5/5)
• Operation log
▫ The master recovers its file system state by replaying the operation log
▫ Checkpoints
 Keep the operation log from growing beyond a threshold
 Built in a separate thread, to avoid delaying incoming mutations
▫ Compact B-tree-like structure
 Mapped directly into memory and used for namespace lookup
 No extra parsing
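Recovery as described above — load the latest checkpoint, then replay only the log suffix written after it — can be sketched as follows. The record format here is invented for illustration; GFS's on-disk formats are not specified on the slide.

```python
def recover(checkpoint, op_log):
    """Rebuild namespace state: start from the checkpoint, then replay
    only the log records appended after the checkpoint was taken."""
    namespace = dict(checkpoint["namespace"])
    for op in op_log[checkpoint["log_position"]:]:
        if op["type"] == "create":
            namespace[op["path"]] = []
        elif op["type"] == "delete":
            namespace.pop(op["path"], None)
    return namespace

ckpt = {"namespace": {"/a": []}, "log_position": 1}
log = [{"type": "create", "path": "/a"},   # already reflected in the checkpoint
       {"type": "create", "path": "/b"},
       {"type": "delete", "path": "/a"}]
state = recover(ckpt, log)
```

Because only the suffix is replayed, recovery time is bounded by how often checkpoints are taken, not by the cluster's total history.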
![Page 17: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/17.jpg)
Consistency Model (1/3)
• Guarantees by GFS
▫ File namespace mutations (e.g. file creation) are atomic
 Namespace management and locking guarantee atomicity and correctness
 The master's operation log defines a global total order of these operations
▫ After a sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation. This is achieved by:
 Applying mutations to all replicas in the same order
 Using chunk version numbers to detect stale replicas
![Page 18: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/18.jpg)
Consistency Model (2/3)
• Relaxed consistency model
• Two types of mutations
▫ Writes
 Cause data to be written at an application-specified file offset
▫ Record appends
 Cause data to be appended atomically at least once
 Offset is chosen by GFS, not by the client
• States of a file region after a mutation
▫ Consistent
 All clients see the same data, regardless of which replica they read from
▫ Inconsistent
 Clients see different data at different times
▫ Defined
 Consistent, and all clients see what the mutation wrote in its entirety
▫ Undefined
 Consistent, but it may not reflect what any one mutation has written
![Page 19: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/19.jpg)
Consistency Model (3/3)
• Implications for applications
▫ Rely on appends rather than overwrites
▫ Checkpointing
 To verify how much data has been successfully written
▫ Write self-validating records
 Checksums to detect and remove padding
▫ Write self-identifying records
 Unique identifiers to identify and discard duplicates
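A sketch of what such self-validating, self-identifying records might look like. The framing format here is invented, not GFS's: each record carries a checksum so readers can skip padding, and a unique id so records duplicated by retried appends can be discarded.

```python
import struct
import uuid
import zlib

def encode_record(payload: bytes) -> bytes:
    """Frame: 4-byte CRC, 4-byte length, then 16-byte unique id + payload."""
    body = uuid.uuid4().bytes + payload
    return struct.pack("!II", zlib.crc32(body), len(body)) + body

def decode_records(region: bytes):
    """Scan a file region, skipping padding that fails the checksum and
    dropping records whose unique id was already seen (duplicates)."""
    seen, out, i = set(), [], 0
    while i + 8 <= len(region):
        crc, length = struct.unpack("!II", region[i:i + 8])
        body = region[i + 8:i + 8 + length]
        if length >= 16 and len(body) == length and zlib.crc32(body) == crc:
            rid, payload = body[:16], body[16:]
            if rid not in seen:          # discard a duplicated append
                seen.add(rid)
                out.append(payload)
            i += 8 + length
        else:
            i += 1                       # not a record boundary: padding
    return out

rec = encode_record(b"event-1")
# A region with padding and a duplicated record, as record append can produce:
region = rec + b"\x00" * 11 + rec + encode_record(b"event-2")
records = decode_records(region)
```

The reader recovers exactly the logical records, even though the physical file region contains padding and a duplicate.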
![Page 20: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/20.jpg)
Lease & Mutation Order
• The master uses leases to maintain a consistent mutation order among replicas
• The primary is the chunkserver that is granted a chunk lease
▫ The master delegates authority over mutations to it
▫ All other replicas are secondary replicas
• The primary defines a serial order between mutations
▫ Secondary replicas follow this order
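A toy model of the master-side lease bookkeeping: at most one unexpired lease (hence one primary) per chunk at a time. The 60-second timeout is the paper's initial lease duration; the primary-selection rule here is a simplification.

```python
import time

LEASE_SECONDS = 60  # initial lease timeout from the paper

class LeaseTable:
    """Sketch of master-side leases: one primary per chunk at a time."""
    def __init__(self):
        self.leases = {}  # chunk handle -> (primary server, expiry time)

    def grant(self, handle, replicas, now=None):
        now = time.monotonic() if now is None else now
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if now < expiry:
            return primary            # an unexpired lease stays authoritative
        primary = replicas[0]         # simplification: pick any replica
        self.leases[handle] = (primary, now + LEASE_SECONDS)
        return primary

lt = LeaseTable()
p1 = lt.grant(42, ["cs1", "cs2", "cs3"], now=0.0)
p2 = lt.grant(42, ["cs2", "cs3", "cs1"], now=10.0)   # lease still held by cs1
p3 = lt.grant(42, ["cs2", "cs3", "cs1"], now=100.0)  # expired: new primary
```

Because a lease cannot move while unexpired, all clients funneled through the current primary observe one mutation order for the chunk.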
![Page 21: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/21.jpg)
Writes (1/7)
• Step 1
▫ Client asks the master which chunkserver holds the current lease for the chunk
▫ And for the locations of the secondary replicas
![Page 22: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/22.jpg)
Writes (2/7)
• Step 2
▫ Master replies with the identities of the primary and secondary replicas
▫ Client caches this data for future mutations, until
 The primary is unreachable, or
 The primary no longer holds the lease
![Page 23: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/23.jpg)
Writes (3/7)
• Step 3
▫ Client pushes the data to all replicas
▫ Each chunkserver stores the data in an internal LRU buffer cache
![Page 24: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/24.jpg)
Writes (4/7)
• Step 4
▫ Client sends a write request to the primary
▫ Primary assigns consecutive serial numbers to mutations
 Serialization
▫ Primary applies the mutations to its own state
![Page 25: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/25.jpg)
Writes (5/7)
• Step 5
▫ Primary forwards the write request to all secondary replicas
▫ Each secondary applies mutations in the primary's serial-number order
![Page 26: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/26.jpg)
Writes (6/7)
• Step 6
▫ Secondary replicas inform the primary after completing the mutation
![Page 27: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/27.jpg)
Writes (7/7)
• Step 7
▫ Primary replies to the client
▫ Steps 3 to 7 are retried in case of errors
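The seven steps above can be sketched as client-side control flow. Everything here is a stand-in: the stub classes and method names are invented for illustration, and the real protocol pushes data along a chunkserver chain rather than point-to-point.

```python
class StubMaster:
    """Stand-in for the master: answers lease queries (steps 1-2)."""
    def find_lease_holder(self, handle):
        return "cs1", ["cs2", "cs3"]

class StubChunkserver:
    """Stand-in for a chunkserver: a data buffer plus applied mutations."""
    def __init__(self):
        self.buffer = {}       # the internal LRU buffer cache of step 3
        self.applied = []

    def push_data(self, handle, data):
        self.buffer[handle] = data

    def apply(self, handle, serial=None):
        self.applied.append((serial, self.buffer.pop(handle)))
        return len(self.applied)   # truthy acknowledgement

def gfs_write(master, chunkservers, handle, data, max_retries=3):
    """Control flow of steps 1-7 (simplified)."""
    for _ in range(max_retries):
        primary, secondaries = master.find_lease_holder(handle)     # steps 1-2
        for name in [primary] + secondaries:                        # step 3
            chunkservers[name].push_data(handle, data)
        serial = chunkservers[primary].apply(handle)                # step 4
        acks = [chunkservers[s].apply(handle, serial)               # steps 5-6
                for s in secondaries]
        if all(acks):                                               # step 7
            return True
    return False

servers = {n: StubChunkserver() for n in ("cs1", "cs2", "cs3")}
ok = gfs_write(StubMaster(), servers, handle=7, data=b"mutation")
```

Note how data delivery (step 3) is separate from commit (steps 4-6): the expensive transfer can be scheduled freely, while ordering is decided only by the primary.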
![Page 28: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/28.jpg)
Data Flow (1/2)
• Control flow and data flow are decoupled
• Data is pushed linearly along a chain of chunkservers in a pipelined fashion
▫ Uses each machine's full outbound bandwidth
• Distances between machines are accurately estimated from IP addresses
• Latency is minimized by pipelining the data transfer over TCP
![Page 29: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/29.jpg)
Data Flow (2/2)
• Ideal elapsed time for transmitting B bytes to R replicas: B/T + R·L
▫ T: network throughput
▫ L: latency between two machines
• At Google: T = 100 Mbps and L well below 1 ms, so 1 MB can ideally be distributed in about 80 ms
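Plugging numbers into B/T + R·L makes the point concrete (the three-replica case below is an assumed example; with sub-millisecond latency the B/T term dominates):

```python
def ideal_transfer_seconds(b_bytes, r_replicas, t_bps, l_seconds):
    """Ideal pipelined transfer time: B/T + R*L."""
    return b_bytes * 8 / t_bps + r_replicas * l_seconds

# 1 MB to 3 replicas at T = 100 Mbps with L = 1 ms: ~83 ms, dominated by B/T.
t = ideal_transfer_seconds(1_000_000, 3, 100e6, 1e-3)
```

Because pipelining overlaps the hops, each extra replica costs only one latency L, not another full B/T transfer.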
![Page 30: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/30.jpg)
Record Append
• In traditional writes
▫ The client specifies the offset at which the data is to be written
▫ Concurrent writes to the same region are not serialized
• In record append
▫ The client specifies only the data
▫ Otherwise similar to writes
▫ GFS appends the data to the file at least once atomically
 The chunk is padded if appending the record would exceed the maximum chunk size
 If a record append fails at any replica, the client retries the operation, which can leave duplicate records at some replicas
 File regions where the append succeeded are defined; intervening regions (padding, duplicates) are inconsistent
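The padding rule can be sketched as a toy primary-side model (the tiny chunk size is for illustration only; real chunks are 64 MB):

```python
def record_append(chunk: bytearray, record: bytes, chunk_size):
    """Append at an offset chosen by GFS; pad the chunk and signal a
    retry when the record does not fit in the remaining space."""
    if len(chunk) + len(record) > chunk_size:
        chunk.extend(b"\x00" * (chunk_size - len(chunk)))  # pad to the end
        return None                    # client retries on the next chunk
    offset = len(chunk)
    chunk.extend(record)
    return offset                      # offset chosen by GFS, not the client

chunk = bytearray()
first = record_append(chunk, b"abc", chunk_size=8)     # lands at offset 0
second = record_append(chunk, b"defgh", chunk_size=8)  # lands at offset 3
third = record_append(chunk, b"x", chunk_size=8)       # chunk full: padded
```

Padding keeps each record inside a single chunk, which is what lets the append be applied atomically at every replica.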
![Page 31: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/31.jpg)
Snapshot (1/2)
• Goals
▫ Quickly create branch copies of huge data sets
▫ Easily checkpoint the current state
• Copy-on-write technique
▫ Master receives a snapshot request
▫ Revokes outstanding leases on chunks in the files to be snapshotted
▫ Master logs the operation to disk
▫ Applies the log record to its in-memory state by duplicating the metadata for the source file or directory tree
▫ The new snapshot files point to the same chunks as the source
![Page 32: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/32.jpg)
Snapshot (2/2)
• After the snapshot operation
▫ A client sends a request to the master to find the current lease holder of a chunk C
▫ The master notices that the reference count for chunk C is > 1
▫ Master picks a new chunk handle C'
▫ Master asks each chunkserver holding a replica of C to create a new chunk C'
▫ Master grants one of the replicas a lease on the new chunk C' and replies to the client
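The copy-on-write behavior of the two snapshot slides can be condensed into a toy model: snapshotting copies only metadata and bumps reference counts, and a chunk is duplicated lazily, on the first write after the snapshot. All structures here are illustrative.

```python
class COWSnapshot:
    """Toy copy-on-write file map with reference-counted chunks."""
    def __init__(self):
        self.files = {}       # path -> list of chunk handles
        self.refcount = {}    # handle -> number of files referencing it
        self.chunks = {}      # handle -> chunk contents
        self.next_handle = 0

    def _new_chunk(self, data):
        self.next_handle += 1
        self.chunks[self.next_handle] = data
        self.refcount[self.next_handle] = 1
        return self.next_handle

    def create(self, path, data):
        self.files[path] = [self._new_chunk(data)]

    def snapshot(self, src, dst):
        self.files[dst] = list(self.files[src])   # duplicate metadata only
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, path, index, data):
        h = self.files[path][index]
        if self.refcount[h] > 1:                  # shared: copy before writing
            self.refcount[h] -= 1
            h = self._new_chunk(self.chunks[h])
            self.files[path][index] = h
        self.chunks[h] = data

fs = COWSnapshot()
fs.create("/f", b"v1")
fs.snapshot("/f", "/snap")   # instant: no chunk data is copied here
fs.write("/f", 0, b"v2")     # first post-snapshot write triggers the copy
```

The snapshot itself is nearly free; the cost of copying a chunk is paid only for chunks that are actually mutated afterwards.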
![Page 33: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/33.jpg)
Contents
• Distributed File Systems
• Introducing Google File System
• Design Overview
• System Interaction
• Master Operation
• Fault Tolerance and Diagnosis
• Measurements and Benchmarks
• Experience
• Related Works
• Conclusion
• References
![Page 34: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/34.jpg)
Master Operation
• Namespace Management and Locking
• Replica Placement
• Creation, Re-replication, Rebalancing
• Garbage Collection
• Stale Replica Detection
![Page 35: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/35.jpg)
Namespace Management and Locking
• Each master operation acquires a set of locks before it runs
• Example: creating /home/user/foo while /home/user is being snapshotted to /save/user
▫ The snapshot holds a write lock on /home/user; the creation needs a read lock on it, so the two operations are serialized
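The example above can be made concrete with a small sketch of the lock sets involved. This is a toy model of the paper's scheme: read locks on every ancestor directory of each path an operation touches, and a write lock on each leaf.

```python
def op_locks(*paths):
    """Read locks on each ancestor directory, write locks on the leaves."""
    reads, writes = set(), set()
    for path in paths:
        parts = path.strip("/").split("/")
        reads |= {"/" + "/".join(parts[:i]) for i in range(1, len(parts))}
        writes.add(path)
    return reads - writes, writes

def conflict(a, b):
    """Two operations conflict if either write-locks what the other locks."""
    (ra, wa), (rb, wb) = a, b
    return bool(wa & (rb | wb)) or bool(wb & ra)

snapshot = op_locks("/home/user", "/save/user")   # source and target paths
creation = op_locks("/home/user/foo")
clash = conflict(snapshot, creation)   # both need /home/user: serialized
```

Note that two creations in the same directory do not conflict: both take only read locks on the parent, so they can proceed concurrently.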
![Page 36: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/36.jpg)
Replica Placement
• The chunk replica placement policy serves two purposes:
▫ Maximize data reliability and availability
▫ Maximize network bandwidth utilization
![Page 37: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/37.jpg)
Creation, Re-replication, Rebalancing
• Creation
▫ Place new replicas on chunkservers with below-average disk space utilization
▫ Limit the number of "recent" creations on each chunkserver
▫ Spread replicas of a chunk across racks
• Re-replication
▫ Triggered as soon as the number of replicas falls below a user-specified goal
• Rebalancing
▫ Moves replicas for better disk space and load balancing
![Page 38: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/38.jpg)
Garbage Collection
• Mechanism
▫ Master logs the deletion immediately
▫ The file is just renamed to a hidden name
▫ Hidden files are removed if they have existed for more than three days
▫ In a regular scan of the chunk namespace, the master identifies orphaned chunks and erases their metadata
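The lazy-deletion mechanism can be sketched as follows (the hidden-name prefix and record shapes are illustrative, not GFS's actual naming):

```python
GRACE_SECONDS = 3 * 24 * 3600     # hidden files survive for three days

class Namespace:
    """Toy model of lazy deletion: rename now, reclaim in a later scan."""
    def __init__(self):
        self.files = {}           # path -> (hidden?, deletion time or None)

    def create(self, path):
        self.files[path] = (False, None)

    def delete(self, path, now):
        del self.files[path]      # the file is merely renamed, not erased
        self.files[".deleted" + path] = (True, now)

    def scan(self, now):
        """Regular scan: remove hidden files older than the grace period."""
        for path, (hidden, ts) in list(self.files.items()):
            if hidden and now - ts > GRACE_SECONDS:
                del self.files[path]   # chunks become orphaned, GC'd later

ns = Namespace()
ns.create("/data/old")
ns.delete("/data/old", now=0)
ns.scan(now=3600)
restorable = ".deleted/data/old" in ns.files   # within the grace period
ns.scan(now=GRACE_SECONDS + 1)
reclaimed = ".deleted/data/old" not in ns.files
```

The grace period makes accidental deletions recoverable, and folding reclamation into a background scan keeps deletion itself cheap and failure-tolerant.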
![Page 39: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/39.jpg)
Stale Replica Detection
•Chunk version number to distinguish between up-to-date and stale replicas.
•Master removes stale replicas in its regular garbage collection.
![Page 40: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/40.jpg)
Fault Tolerance and Diagnosis
• High availability
▫ Fast recovery
 Master and chunkservers are designed to restore their state and start in seconds
▫ Chunk replication
 Master clones existing replicas as needed to keep each chunk fully replicated
▫ Master replication
 The master state is replicated for reliability
 Operation log and checkpoints are replicated on multiple machines
 "Shadow masters" provide read-only access to the file system even when the primary master is down
![Page 41: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/41.jpg)
Fault Tolerance and Diagnosis (2)
• Data integrity
▫ Each chunkserver uses checksumming to detect corruption of stored data
▫ A chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum
▫ Checksum computation is heavily optimized for writes that append to the end of a chunk
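The block-level checksumming scheme can be sketched directly (CRC-32 is an assumption here; the slide only says "32-bit checksum"):

```python
import zlib

BLOCK = 64 * 1024   # each chunk is checksummed in 64 KB blocks

def checksum_blocks(chunk: bytes):
    """One 32-bit CRC per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, checksums):
    """A chunkserver verifies blocks before returning data to a reader."""
    return checksum_blocks(chunk) == checksums

data = bytes(200 * 1024)            # a 200 KB chunk -> four blocks
sums = checksum_blocks(data)
ok = verify(data, sums)
corrupt = bytearray(data)
corrupt[70 * 1024] ^= 0xFF          # flip one bit inside the second block
bad = verify(bytes(corrupt), sums)
```

Checksumming per 64 KB block, rather than per chunk, means a read only has to verify the few blocks it touches, and corruption is localized to a single block.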
![Page 42: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/42.jpg)
Fault Tolerance and Diagnosis (3)
• Diagnostic tools
▫ Extensive and detailed diagnostic logging for problem isolation, debugging, and performance analysis
▫ GFS servers generate diagnostic logs that record many significant events and all RPC requests and replies
![Page 43: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/43.jpg)
Measurements and Benchmarks
• Micro-benchmarks
▫ GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients
![Page 44: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/44.jpg)
Measurements and Benchmarks (2)
• Real-world clusters
• Cluster A is used regularly for research and development
• Cluster B is primarily used for production data processing
![Page 45: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/45.jpg)
Measurements and Benchmarks (3)
![Page 46: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/46.jpg)
Experience
• The biggest problems were disk- and Linux-related
▫ Many disks claimed to the Linux driver that they supported a range of IDE protocol versions, but in fact responded reliably only to the more recent ones
▫ Despite occasional problems, the availability of the Linux code has helped in exploring and understanding system behavior
![Page 47: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/47.jpg)
Related Works (1/3)
• Both GFS and AFS provide a location-independent namespace
▫ Allows data to be moved transparently for load balancing
▫ Fault tolerance
• Unlike AFS, GFS spreads a file's data across storage servers in a way more akin to xFS and Swift, in order to deliver aggregate performance and increased fault tolerance
• GFS currently uses replication for redundancy, and so consumes more raw storage than xFS or Swift
![Page 48: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/48.jpg)
Related Works (2/3)
• In contrast to systems like AFS, xFS, Frangipani, and Intermezzo, GFS does not provide any caching below the file system interface
• GFS uses a centralized approach in order to simplify the design, increase its reliability, and gain flexibility
▫ Unlike Frangipani, xFS, Minnesota's GFS, and GPFS
▫ Makes it easier to implement sophisticated chunk placement and replication policies, since the master already has most of the relevant information and controls how it changes
![Page 49: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/49.jpg)
Related Works (3/3)
• GFS delivers aggregate performance by focusing on the needs of Google's applications rather than building a POSIX-compliant file system, unlike Lustre
• The NASD architecture is based on network-attached disk drives; similarly, GFS uses commodity machines as chunkservers
• GFS chunkservers use lazily allocated fixed-size chunks, whereas NASD uses variable-length objects
• The producer-consumer queues enabled by atomic record appends address a similar problem as the distributed queues in River
▫ River uses memory-based queues distributed across machines
![Page 50: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/50.jpg)
Conclusion
• GFS demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware
• Provides fault tolerance through constant monitoring, replication of crucial data, and fast, automatic recovery
• Delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks
![Page 51: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/51.jpg)
References
• Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). ACM, New York, NY, USA, 29-43.
• Coulouris, G., Dollimore, J., and Kindberg, T. 2005. Distributed Systems: Concepts and Design (4th Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
![Page 52: Google File Systems](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546d0beaaf7959ec228b8409/html5/thumbnails/52.jpg)
Thank You