google file system

The Google File System (GFS)

By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Presented by
148215N J.R.K.C Jayakody
148201T M.M. Shareek Ahamed

Google Platform Characteristics
• 100s to 1000s of PCs in cluster
• Many modes of failure for each PC:
  • App bugs, OS bugs
  • Human error
  • Disk failure, memory failure, net failure, power supply failure
  • Connector failure
• Monitoring, fault tolerance, auto-recovery essential

Google File System: Design Criteria
• Detect, tolerate, recover from failures automatically
• Large files, >= 100 MB in size
• Large, streaming reads (>= 1 MB in size)
• Concurrent appends by multiple clients (e.g., producer-consumer queues)
• Want atomicity for appends without synchronization overhead among clients

GFS: Architecture
• One master server
• Many chunkservers (100s – 1000s)
  • Spread across racks
  • Chunk: 64 MB portion of file, identified by 64-bit, globally unique ID
• Many clients accessing same and different files stored on same cluster

[Figure: GFS architecture]

Master Server
• Holds all metadata:
  • Namespace (directory hierarchy)
  • Access control information (per-file)
  • Mapping from files to chunks
  • Current locations of chunks (chunkservers)
• Delegates consistency management
• Garbage collects orphaned chunks
• Migrates chunks between chunkservers

Chunkserver
• Stores 64 MB file chunks on local disk using standard Linux filesystem, each with version number and checksum
• Read/write requests specify chunk handle and byte range
• Chunks replicated on configurable number of chunkservers (default: 3)
• No caching of file data (beyond standard Linux buffer cache)

Client
• Issues control (metadata) requests to master server
• Issues data requests directly to chunkservers
• Caches metadata
• Does no caching of data
  • Streaming reads (read once) and append writes (write once) don’t benefit much from caching at client

Client Read
• Client sends master:
  • read(file name, chunk index)
• Master’s reply:
  • chunk ID, chunk version number, locations of replicas
• Client sends “closest” chunkserver with a replica:
  • read(chunk ID, byte range)
• Chunkserver replies with data
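The client-side step of turning a file byte range into the (chunk index, byte range) pairs used in the requests above can be sketched as follows. This is a minimal illustration that assumes only the fixed 64 MB chunk size stated on the architecture slide; the function name `locate` is hypothetical and not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated on the architecture slide


def locate(offset: int, length: int):
    """Split a file byte range into (chunk index, start within chunk, byte count) pieces.

    The chunk index is what the client would send to the master; the start and
    count form the byte range it would send to a chunkserver.
    """
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE   # which chunk of the file
        start = offset % CHUNK_SIZE    # where the range begins inside that chunk
        count = min(CHUNK_SIZE - start, end - offset)
        pieces.append((index, start, count))
        offset += count
    return pieces
```

A read that straddles a chunk boundary yields two pieces, so the client would need chunk locations for both chunk indexes.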


Client Write
• Some chunkserver is primary for each chunk
  • Master grants lease to primary (typically for 60 sec.)
  • Leases renewed using periodic heartbeat messages between master and chunkservers
• Client asks master for primary and secondary replicas for each chunk
• Client sends data to replicas
  • Pipelined: each replica forwards as it receives

Client Write (continued)
• All replicas acknowledge data write to client
• Client sends write request to primary
• Primary assigns serial number to write request, providing ordering
• Primary forwards write request with same serial number to secondaries
• Secondaries all reply to primary after completing write
• Primary replies to client
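The separation of data push from write ordering can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not GFS code: the class names `Replica` and `Primary`, the `data_id` handle, and the synchronous method calls standing in for network messages are all hypothetical.

```python
class Replica:
    """A chunk replica: buffers pushed data, then applies it in the order it is told."""

    def __init__(self):
        self.buffered = {}  # data_id -> bytes pushed by the client, not yet applied
        self.applied = []   # (serial number, data) in the order applied

    def push(self, data_id, data):
        self.buffered[data_id] = data
        return True  # acknowledgement of the data push, returned to the client

    def apply(self, serial, data_id):
        self.applied.append((serial, self.buffered.pop(data_id)))
        return True  # reply sent to the primary after completing the write


class Primary(Replica):
    """The replica holding the lease: it alone chooses the serial order."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial          # assign a serial number to the request
        self.next_serial += 1
        self.apply(serial, data_id)        # apply locally first
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        return all(acks)                   # reply to the client
```

Because only the primary hands out serial numbers, every replica applies the same writes in the same order, regardless of the order in which clients pushed the data.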

Client Record Append
• Google uses large files as queues between multiple producers and consumers
• Same control flow as for writes
• Client pushes data to replicas of last chunk of file
• Client sends request to primary
• Common case: request fits in current last chunk:
  • Primary appends data to own replica
  • Primary tells secondaries to do same at same byte offset in theirs
  • Primary replies with success to client

Client Record Append (continued)
• When data won’t fit in last chunk:
  • Primary fills current chunk with padding
  • Primary instructs other replicas to do same
  • Primary replies to client, “retry on next chunk”
• If record append fails at any replica, client retries operation
  • So replicas of same chunk may contain different data, including duplicates of all or part of record data
• What guarantee does GFS provide on success?
  • Data written at least once as an atomic unit
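The primary's decision between "append here" and "pad and retry" can be sketched as a pure function. This is a minimal model: the function name `record_append`, the status strings, and the rule that a record may not exceed one chunk are assumptions made for the illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks


def record_append(chunk_used: int, record_len: int):
    """Model one append attempt at the primary.

    Returns (status, offset, new_chunk_used). On success, offset is the byte
    offset chosen by the primary, which secondaries must also use.
    """
    if record_len > CHUNK_SIZE:
        # Assumption for this sketch: a record must fit in a single chunk.
        raise ValueError("record larger than a chunk")
    if chunk_used + record_len <= CHUNK_SIZE:
        return ("ok", chunk_used, chunk_used + record_len)
    # Does not fit: the rest of the chunk becomes padding and the client retries.
    return ("retry_on_next_chunk", None, CHUNK_SIZE)
```

A client loop would call this again against the next chunk after a retry status, which is one way padding ends up inside files.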

GFS: Consistency Model
• Changes to data are ordered as chosen by a primary
  • All replicas will be consistent
• Record append completes at least once
  • Applications must cope with possible duplicates

Applications and Record Append Semantics
• Applications should use self-describing records and checksums when using Record Append
  • Reader can identify padding / record fragments
• If application cannot tolerate duplicated records, should include unique ID in record
  • Reader can use unique IDs to filter duplicates
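One way an application might follow this advice is sketched below. The record layout (length, unique ID, CRC32, payload), the assumption that padding is zero bytes, and the byte-by-byte resynchronisation are all choices made for this illustration, not a format defined by GFS.

```python
import struct
import zlib

# Hypothetical header: payload length (4 bytes), unique record ID (8 bytes), CRC32 (4 bytes)
HEADER = struct.Struct(">IQI")


def encode(record_id: int, payload: bytes) -> bytes:
    """Build a self-describing record: header followed by payload."""
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload


def read_records(chunk: bytes):
    """Return valid payloads in order, skipping padding, fragments, and duplicate IDs."""
    seen = set()
    out = []
    pos = 0
    while pos + HEADER.size <= len(chunk):
        length, record_id, crc = HEADER.unpack_from(chunk, pos)
        body = chunk[pos + HEADER.size : pos + HEADER.size + length]
        if length > 0 and len(body) == length and zlib.crc32(body) == crc:
            if record_id not in seen:      # filter duplicates by unique ID
                seen.add(record_id)
                out.append(body)
            pos += HEADER.size + length
        else:
            pos += 1                       # padding or fragment: resynchronise
    return out
```

For example, reading `encode(1, b"alpha") + bytes(7) + encode(2, b"beta") + encode(1, b"alpha")` yields the two distinct payloads once each.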

Logging at Master
• Master has all metadata information
  • Lose it, and you’ve lost the filesystem!
• Master logs all metadata changes to disk sequentially (the operation log)
  • Replicates log entries to remote backup servers

Chunk Leases and Version Numbers
• If no outstanding lease when client requests write, master grants new one
• Chunks have version numbers
  • Stored on disk at master and chunkservers
  • Each time master grants new lease, increments version, informs all replicas
• Master can revoke leases
  • e.g., when client requests rename or snapshot of file
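The version-number bookkeeping can be sketched with plain dictionaries. The helper names `grant_lease` and `stale_replicas`, the data layout, and the way the primary is picked are hypothetical; the point is only that a replica that missed a version bump is detectable afterwards.

```python
def grant_lease(master_version: dict, replicas: dict, chunk: str, live: set) -> str:
    """Increment the chunk version, inform the reachable replicas, return the primary.

    master_version: chunk -> version recorded at the master
    replicas: server name -> {chunk -> version recorded at that chunkserver}
    live: names of the chunkservers that are currently reachable
    """
    master_version[chunk] = master_version.get(chunk, 0) + 1
    for server in live:
        replicas[server][chunk] = master_version[chunk]
    return min(live)  # arbitrary but deterministic choice of primary for this sketch


def stale_replicas(master_version: dict, replicas: dict, chunk: str):
    """Servers whose copy of the chunk has an older version than the master's."""
    return sorted(
        server
        for server, held in replicas.items()
        if chunk in held and held[chunk] < master_version[chunk]
    )
```

A chunkserver that was down while a lease was granted keeps the old version number, so its replica shows up as stale once it reports back.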

Master Operations
• Locking Operations
• Replica Placement
• Garbage Collection
• Creation, Re-replication

Locking Operations: Google Directory Structure
• Uses a lookup table to map the namespace: full path name → metadata
• Each node can be a “file” or a “directory”

Locking Operations
• To access /d1/d2/leaf
  • Need read locks on /d1 and /d1/d2, and a read or write lock on /d1/d2/leaf
• Can modify a directory concurrently, e.g., create several files in it at once
  • A read lock on a directory
  • A write lock on a file
• Totally ordered locking to prevent deadlocks
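The lock set for a path can be computed as in the sketch below. The function name `locks_for` is hypothetical; the ordering used (by depth in the namespace tree, then lexicographically) follows the rule described in the GFS paper, and acquiring locks in one global order is what rules out deadlock.

```python
def locks_for(path: str, write: bool = True):
    """Return (name, mode) pairs in the order they should be acquired.

    Every ancestor directory gets a read lock; the full path gets a write lock
    (or a read lock when write is False).
    """
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = [(p, "read") for p in prefixes[:-1]]
    locks.append((prefixes[-1], "write" if write else "read"))
    # Consistent total order: by level in the tree, then lexicographically.
    return sorted(locks, key=lambda lock: (lock[0].count("/"), lock[0]))
```

Two creations in the same directory, `/d1/d2/a` and `/d1/d2/b`, both take only read locks on `/d1/d2`, so they can proceed concurrently; their write locks are on different names.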

Replica Placement
• Several replicas for each chunk
  • Default: 3 replicas
  • 3 replicas on the same machine? (No use)
  • Spread the chunk replicas across machines and racks
• Goals:
  • Maximize data reliability and availability
• Tradeoff: network traffic (writes to a chunk must flow across racks)

Creation, Re-replication, Rebalancing (Master's Responsibility)

Replica Creation
- Find a disk with below-average disk space utilization
- Limit the number of recent creations on each chunkserver
- Spread chunks across racks

Why Re-replication
- If a chunkserver goes down
- If a replica is corrupted
- If the replication goal is increased
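The three creation criteria can be combined into a simple ranking, as in the sketch below. The function name `place_replicas`, the dictionary fields, and the exact ranking key are assumptions for illustration; the slides do not say how the master weighs the criteria against each other.

```python
def place_replicas(servers, count=3):
    """Pick chunkservers for new replicas.

    servers: list of dicts with 'name', 'rack', 'disk_used' (fraction of disk
    in use) and 'recent_creations' (number of recent replica creations).
    """
    avg = sum(s["disk_used"] for s in servers) / len(servers)
    # Prefer below-average disk use, then few recent creations, then emptier disks.
    ranked = sorted(
        servers,
        key=lambda s: (s["disk_used"] > avg, s["recent_creations"], s["disk_used"]),
    )
    chosen, racks = [], set()
    for s in ranked:  # first pass: at most one replica per rack
        if len(chosen) < count and s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
    for s in ranked:  # second pass: fill up if there are fewer racks than replicas
        if len(chosen) < count and s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]
```

With servers on three racks the result spans all three, even if that means picking a fuller disk over a second server on an already-used rack.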

Garbage Collection
• Delete is not actually a delete
• When a file is deleted
  • Hide the file under a name that includes the deletion timestamp (still readable, and the delete can be undone)
• Master scans and deletes hidden files older than 3 days
• Stale replica removal
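Lazy deletion can be modelled with two dictionaries, one for visible files and one for hidden ones. This is a sketch only: the function names, the separate `hidden` table (instead of a renamed hidden file), and passing the current time in as a parameter are assumptions made to keep the example small.

```python
GRACE_SECONDS = 3 * 24 * 3600  # three days, as on the slide


def delete(namespace: dict, hidden: dict, path: str, now: float) -> None:
    """Hide the file and remember when it was deleted; nothing is reclaimed yet."""
    hidden[path] = (now, namespace.pop(path))


def undelete(namespace: dict, hidden: dict, path: str) -> None:
    """Undo a delete while the file is still within the grace period."""
    _, metadata = hidden.pop(path)
    namespace[path] = metadata


def scan(hidden: dict, now: float):
    """Master's periodic scan: drop hidden files older than the grace period."""
    expired = sorted(p for p, (t, _) in hidden.items() if now - t > GRACE_SECONDS)
    for p in expired:
        del hidden[p]
    return expired
```

Only after `scan` removes an entry do its chunks become orphaned and eligible for collection on the chunkservers.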

Fault Tolerance and Diagnosis
• Fast recovery
  • Master and chunkservers are designed to restore their state and start in seconds regardless of termination conditions
• Chunk replication
  • On multiple machines and racks
• Master replication
  • Primary master and shadow masters

Performance Measurements

Performance measured on a cluster with
• 1 master, 2 master replicas
• 16 chunkservers
• 16 clients

Network
• Machines: 100 Mbps Ethernet
• Switches: 1 Gbps link

Machine Configuration (19 servers, 16 clients)
• Processor: dual 1.4 GHz PIII
• RAM: 2 GB
• HD: 2 x 80 GB

Reads
• Each client reads a randomly selected 4 MB region from a 320 GB file set; all clients read simultaneously
• 4 MB x 256 times = 1 GB read per client
• Results (network limit => observed):
  • 1 client: 12.5 MB/s => 10 MB/s (80% of limit)
  • 16 clients: 125 MB/s => 94 MB/s (75% of limit)
• Why low?
  1) Multiple clients may read from the same chunkserver
  2) Network traffic
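The read limits and percentages follow directly from the link speeds in the test setup. The short calculation below reproduces them; the helper names are hypothetical and the only inputs are the figures quoted on the slides.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert a link speed in Mbps to MB/s (8 bits per byte)."""
    return mbit_per_s / 8


def percent_of_limit(observed: float, limit: float) -> int:
    """Observed throughput as a rounded percentage of the network limit."""
    return round(100 * observed / limit)


client_limit = mbit_to_mbyte(100)      # one client on 100 Mbps Ethernet
aggregate_limit = mbit_to_mbyte(1000)  # all clients behind the 1 Gbps switch link

one_client = percent_of_limit(10, client_limit)
sixteen_clients = percent_of_limit(94, aggregate_limit)
```

Here `client_limit` is 12.5 MB/s and `aggregate_limit` is 125 MB/s, giving 80% for one client and 75% for sixteen.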

Writes
• N clients writing to N files simultaneously
• Results (network limit => observed):
  • 1 client: 12.5 MB/s => 6.3 MB/s (about 50% of limit)
  • 16 clients: 67 MB/s => 35 MB/s (about 50% of limit)
• Why low?
  • Each write goes to 3 replicas
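The 67 MB/s aggregate write limit can be derived from the setup: every byte is written to 3 of the 16 chunkservers, and each chunkserver can take in at most 12.5 MB/s over its 100 Mbps link. The function name below is hypothetical; the numbers come from the slides.

```python
LINK_MB_PER_S = 100 / 8  # 100 Mbps Ethernet = 12.5 MB/s per machine


def aggregate_write_limit(chunkservers: int = 16, replicas: int = 3,
                          link: float = LINK_MB_PER_S) -> float:
    """Total chunkserver input bandwidth divided by the copies written per byte."""
    return chunkservers * link / replicas


limit = aggregate_write_limit()
one_client_fraction = 6.3 / LINK_MB_PER_S
sixteen_client_fraction = 35 / limit
```

The limit comes out at about 66.7 MB/s, and both observed rates are roughly half of their respective limits.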

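Record Appends
• N clients appending to a single file simultaneously
• Append rate goes down as the number of clients grows
  • Limited by the network bandwidth of the chunkservers that store the last chunk of the file
• Results (observed):
  • 1 client: 6.0 MB/s
  • 16 clients: 4.8 MB/s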

Performance (Real-world Cluster)
• Cluster A: Research and development
  • Used by over a hundred engineers
  • Typical task is initiated by a user and runs for a few hours
  • Task reads MBs to TBs of data, transforms/analyzes the data, and writes results back
• Cluster B: Production data processing
  • Typical task runs much longer than a Cluster A task
  • Continuously generates and processes multi-TB data sets
  • Human users rarely involved
• Measurements were taken after the clusters had been running for about a week

[Table: Characteristics of two GFS clusters. Cluster A: research and development; Cluster B: production data processing]

[Table: Performance metrics for two GFS clusters. Cluster A: research and development; Cluster B: production data processing]

Performance (Real-world Cluster)
• Experiment in recovery time:
  • Killed one chunkserver in Cluster B (production)
  • Chunkserver held ≈ 15,000 chunks ≈ 600 GB of data
  • Took 23.2 minutes to restore all the chunks
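The figures above imply an effective re-replication rate, worked out below. The calculation assumes 1 GB = 1024 MB and uses only the two numbers from the slide; the variable names are illustrative.

```python
data_mb = 600 * 1024   # about 600 GB on the killed chunkserver, in MB
seconds = 23.2 * 60    # 23.2 minutes to restore all chunks

rate_mb_per_s = data_mb / seconds                 # cluster-wide restore rate
avg_chunk_mb = data_mb / 15_000                   # average size of a restored chunk
```

The restore rate works out to roughly 440 MB/s across the cluster, far above what a single 12.5 MB/s link could deliver, which indicates that many chunkservers took part in the re-replication in parallel. The average chunk size is about 41 MB, below the 64 MB maximum.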

Conclusion
• GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
• GFS occupies a different point in the design space
  • Component failures as the norm
  • Optimize for huge files
• GFS provides fault tolerance
  • Replicating data
  • Fast and automatic recovery
  • Chunk replication
• GFS has a simple, centralized master that does not become a bottleneck
• GFS is a successful file system
  • An important tool that enables Google to continue to innovate on its ideas


Thank you
