Google File System


Upload: shareek-ahamed

Post on 16-Apr-2017


Page 1: Google File System

By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

The Google File System (GFS)

Presented by
148215N J.R.K.C Jayakody
148201T M.M. Shareek Ahamed

Page 2: Google File System

Google Platform Characteristics
•100s to 1000s of PCs in cluster
•Many modes of failure for each PC:
  •App bugs, OS bugs
  •Human error
  •Disk failure, memory failure, net failure, power supply failure
  •Connector failure
•Monitoring, fault tolerance, auto-recovery essential


Page 4: Google File System

Google File System: Design Criteria
•Detect, tolerate, recover from failures automatically
•Large files, >= 100 MB in size
•Large, streaming reads (>= 1 MB in size)
•Concurrent appends by multiple clients (e.g., producer-consumer queues)
•Want atomicity for appends without synchronization overhead among clients

Page 5: Google File System

GFS: Architecture
•One master server
•Many chunk servers (100s – 1000s)
  •Spread across racks
  •Chunk: 64 MB portion of file, identified by 64-bit, globally unique ID
•Many clients accessing same and different files stored on same cluster

Page 6: Google File System

[Figure: GFS architecture]

Page 7: Google File System

Master Server
•Holds all metadata:
  •Namespace (directory hierarchy)
  •Access control information (per-file)
  •Mapping from files to chunks
  •Current locations of chunks (chunkservers)
•Delegates consistency management
•Garbage collects orphaned chunks
•Migrates chunks between chunkservers

Page 8: Google File System

Chunkserver
•Stores 64 MB file chunks on local disk using standard Linux filesystem, each with version number and checksum
•Read/write requests specify chunk handle and byte range
•Chunks replicated on configurable number of chunkservers (default: 3)
•No caching of file data (beyond standard Linux buffer cache)

Page 9: Google File System

Client
•Issues control (metadata) requests to master server
•Issues data requests directly to chunkservers
•Caches metadata
•Does no caching of data
  •Streaming reads (read once) and append writes (write once) don’t benefit much from caching at client

Page 10: Google File System

Client Read
•Client sends master:
  •read(file name, chunk index)
•Master’s reply:
  •chunk ID, chunk version number, locations of replicas
•Client sends “closest” chunkserver w/replica:
  •read(chunk ID, byte range)
•Chunkserver replies with data
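
The read request carries a chunk index rather than a byte offset, so the client first has to translate a file byte range into per-chunk pieces. A minimal sketch of that translation, assuming the fixed 64 MB chunk size from the architecture slide; the function name `chunk_requests` and the tuple layout are illustrative, not part of GFS:

```python
from typing import List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated in the slides


def chunk_requests(offset: int, length: int) -> List[Tuple[int, int, int]]:
    """Split a file byte range into (chunk_index, start_in_chunk, nbytes) pieces.

    Hypothetical helper: each piece corresponds to one read(chunk ID, byte range)
    that a client would send to a chunkserver after asking the master for the
    chunk at that index.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE          # which chunk of the file
        start = offset % CHUNK_SIZE           # where the range starts inside it
        nbytes = min(CHUNK_SIZE - start, end - offset)
        pieces.append((index, start, nbytes))
        offset += nbytes
    return pieces
```

A range that straddles a chunk boundary yields two pieces, so the client would issue two chunkserver reads, possibly to different machines.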

Page 11: Google File System

[Figure: GFS architecture]

Page 12: Google File System

Client Write
•Some chunkserver is primary for each chunk
  •Master grants lease to primary (typically for 60 sec.)
  •Leases renewed using periodic heartbeat messages between master and chunkservers
•Client asks master for primary and secondary replicas for each chunk
•Client sends data to replicas
  •Pipelined: each replica forwards as it receives

Page 13: Google File System

[Figure: Client Write]

Page 14: Google File System

Client Write
•All replicas acknowledge data write to client
•Client sends write request to primary
•Primary assigns serial number to write request, providing ordering
•Primary forwards write request with same serial number to secondaries
•Secondaries all reply to primary after completing write
•Primary replies to client
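
The ordering step can be sketched as follows. This is a toy in-memory model, not GFS code: the `Replica` and `Primary` classes are hypothetical, and the sketch leaves out the data push, leases, and failure handling. It only shows how a single serial-number counter at the primary gives every replica the same order of writes.

```python
class Replica:
    """Hypothetical in-memory replica that records writes in the order applied."""

    def __init__(self):
        self.applied = []  # list of (serial, data)

    def apply(self, serial, data):
        self.applied.append((serial, data))
        return True  # acknowledgement that the write completed


class Primary(Replica):
    """Hypothetical primary: picks the order, then forwards it to secondaries."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data):
        # Assign the next serial number; this fixes the global order.
        serial = self.next_serial
        self.next_serial += 1
        # Apply locally, then forward the same serial number to each secondary.
        self.apply(serial, data)
        acks = [s.apply(serial, data) for s in self.secondaries]
        # Reply to the client only after every secondary has answered.
        return all(acks)
```

Because only the primary hands out serial numbers, two clients writing concurrently still see their writes applied in one agreed order on every replica.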

Page 15: Google File System

Client Record Append
•Google uses large files as queues between multiple producers and consumers
•Same control flow as for writes
•Client pushes data to replicas of last chunk of file
•Client sends request to primary
•Common case: request fits in current last chunk:
  •Primary appends data to own replica
  •Primary tells secondaries to do same at same byte offset in theirs
  •Primary replies with success to client

Page 16: Google File System

Client Record Append
•When data won’t fit in last chunk:
  •Primary fills current chunk with padding
  •Primary instructs other replicas to do same
  •Primary replies to client, “retry on next chunk”
•If record append fails at any replica, client retries operation
  •So replicas of same chunk may contain different data, even duplicates of all or part of record data
•What guarantee does GFS provide on success?
  •Data written at least once in atomic unit
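
The primary's decision between the common case and the padding case can be written as a small function. This is a sketch under stated assumptions: `record_append`, its return tuple, and the status strings are made up for illustration, and only the bookkeeping of the chunk's used length is modeled.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated in the slides


def record_append(chunk_used, record_len, chunk_size=CHUNK_SIZE):
    """Decide what the primary does with an append to the last chunk.

    Returns (status, offset, new_chunk_used). Hypothetical helper:
    - ("ok", offset, used): record fits; it is written at `offset` on all replicas.
    - ("retry_on_next_chunk", None, chunk_size): chunk is padded to its end
      and the client must retry on a new chunk.
    """
    if record_len <= 0 or record_len > chunk_size:
        raise ValueError("record must be non-empty and fit in one chunk")
    if chunk_used + record_len <= chunk_size:
        # Common case: append at the current end of the chunk.
        return ("ok", chunk_used, chunk_used + record_len)
    # Record would straddle the boundary: pad the rest of the chunk instead.
    return ("retry_on_next_chunk", None, chunk_size)
```

Since the record is never split across chunks, a successful append is atomic; the price is the padding and, after retries, possible duplicates.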

Page 17: Google File System

GFS: Consistency Model
•Changes to data are ordered as chosen by a primary
  •All replicas will be consistent
•Record append completes at least once
  •Applications must cope with possible duplicates

Page 18: Google File System

Applications and Record Append Semantics
•Applications should use self-describing records and checksums when using Record Append
  •Reader can identify padding / record fragments
•If application cannot tolerate duplicated records, should include unique ID in record
  •Reader can use unique IDs to filter duplicates
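
One way an application might follow this advice is shown below. The record layout (length, CRC32, 64-bit ID, payload) and both function names are assumptions made for this sketch; GFS itself does not prescribe a format. The reader skips anything that does not parse as a valid record and drops record IDs it has already seen.

```python
import struct
import zlib

# Assumed header layout: payload length, CRC32 of payload, unique record ID.
HEADER = struct.Struct(">IIQ")


def encode_record(record_id, payload):
    """Build a self-describing record (hypothetical format)."""
    if not payload:
        raise ValueError("empty payloads are not allowed in this sketch")
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload


def read_records(data):
    """Yield each valid payload once, skipping padding, fragments, duplicates."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc, record_id = HEADER.unpack_from(data, pos)
        body_start = pos + HEADER.size
        payload = data[body_start:body_start + length]
        valid = length > 0 and len(payload) == length and zlib.crc32(payload) == crc
        if valid:
            if record_id not in seen:      # filter duplicates by unique ID
                seen.add(record_id)
                yield payload
            pos = body_start + length
        else:
            pos += 1                        # padding or fragment: resynchronise
```

Resynchronising one byte at a time is simple but slow, and a checksum match on garbage is unlikely rather than impossible; a real format would probably add a magic marker to the header.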

Page 19: Google File System

Logging at Master
•Master has all metadata information
  •Lose it, and you’ve lost the filesystem!
•Master logs all metadata-changing client requests to disk sequentially (the operation log)
•Replicates log entries to remote backup servers

Page 20: Google File System

Chunk Leases and Version Numbers
•If no outstanding lease when client requests write, master grants new one
•Chunks have version numbers
  •Stored on disk at master and chunkservers
  •Each time master grants new lease, increments version, informs all replicas
•Master can revoke leases
  •e.g., when client requests rename or snapshot of file
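
The version bump on each lease grant is what lets the master recognise replicas that missed an update. The sketch below assumes, as a simplification, that a replica unreachable at grant time keeps its old version number; the class `MasterChunkRecord` and its methods are hypothetical, and in the real system chunkservers store and report their own version numbers.

```python
class MasterChunkRecord:
    """Hypothetical master-side bookkeeping for one chunk."""

    def __init__(self, replicas):
        self.version = 0
        # Version each replica is known to hold (simplified: tracked at master).
        self.replica_versions = {name: 0 for name in replicas}

    def grant_lease(self, reachable):
        """Increment the chunk version and inform every reachable replica."""
        self.version += 1
        for name in reachable:
            self.replica_versions[name] = self.version
        return self.version

    def stale_replicas(self):
        """Replicas whose version lags the master's are stale."""
        return sorted(
            name for name, v in self.replica_versions.items() if v < self.version
        )
```

For example, if `cs2` is down while a new lease is granted, it still holds the old version when it comes back and would show up in `stale_replicas()`.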

Page 21: Google File System

Master Operations
•Locking Operations
•Replica Placement
•Garbage Collection
•Creation, Re-replication

Page 22: Google File System

Locking Operations: Google Directory Structure

GFS uses a lookup table to map full path names to metadata

Each node in the namespace can be a file or a directory

[Figure: lookup table node mapping full path name to metadata]

Page 23: Google File System

Locking Operations
•To access /d1/d2/leaf
  •Need to lock /d1, /d1/d2, and /d1/d2/leaf
•Can modify a directory concurrently
  •A read lock on a directory
  •A write lock on a file
•Totally ordered locking to prevent deadlocks
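
The lock set for a path can be computed mechanically. In this sketch `locks_for` and `acquisition_order` are hypothetical names; the ordering rule used (by depth in the namespace tree, then lexicographically) is one consistent total order and is stated here as an assumption rather than taken from the slides.

```python
def locks_for(path, write=True):
    """Return [(path, mode)]: read locks on ancestors, read/write lock on the leaf."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("path must name a file or directory below the root")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = [(p, "read") for p in prefixes[:-1]]          # ancestor directories
    locks.append((prefixes[-1], "write" if write else "read"))  # the leaf itself
    return locks


def acquisition_order(locks):
    """Sort locks by namespace depth, then by name, so all operations agree."""
    return sorted(locks, key=lambda lock: (lock[0].count("/"), lock[0]))
```

Two operations that need overlapping lock sets then always request the shared locks in the same order, which is what rules out deadlock.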

Page 24: Google File System

Replica Placement
•Several replicas for each chunk
  •Default: 3 replicas
=> Placing all 3 replicas on the same machine is of no use
=> Spread the chunk replicas across machines and racks
•Goals:
  •Maximize data reliability and availability
•Tradeoff: write traffic has to cross racks (network traffic)

Page 25: Google File System

Creation, Re-replication, Rebalancing (Master's Responsibility)

Replica Creation
- Prefer chunkservers with below-average disk space utilization
- Limit the number of recent creations on each chunkserver
- Spread chunks across racks
Why Re-replication
- If a chunkserver goes down
- If a replica is corrupted
- If the replication goal is increased
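
The three creation criteria can be combined into a simple selection routine. This is only a sketch: the `Chunkserver` fields, the `max_recent_creations` threshold, and the two-pass rack spreading are assumptions chosen for illustration, not the master's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float   # fraction of disk space used, 0.0 to 1.0
    recent_creations: int     # replicas created here recently


def choose_servers(servers, replicas=3, max_recent_creations=2):
    """Pick chunkservers for a new chunk's replicas (hypothetical heuristic)."""
    average = sum(s.disk_utilization for s in servers) / len(servers)
    # Criteria 1 and 2: below-average utilization, not too many recent creations.
    candidates = [
        s for s in servers
        if s.disk_utilization < average and s.recent_creations < max_recent_creations
    ]
    candidates.sort(key=lambda s: (s.disk_utilization, s.name))
    chosen, used_racks = [], set()
    # Criterion 3, first pass: at most one replica per rack.
    for s in candidates:
        if len(chosen) < replicas and s.rack not in used_racks:
            chosen.append(s)
            used_racks.add(s.rack)
    # Second pass: if racks ran out, fill up from the remaining candidates.
    for s in candidates:
        if len(chosen) < replicas and s not in chosen:
            chosen.append(s)
    return [s.name for s in chosen]
```

The routine may return fewer names than requested when too few servers qualify; a fuller version would relax the filters in that case.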

Page 26: Google File System

Garbage Collection
•Delete is not actually a delete
•When a file is deleted
=> The file is renamed to a hidden name that includes the deletion timestamp (still readable, and the delete can be undone)
•Master scans and deletes the hidden files older than 3 days.
•Stale replica removal
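
Lazy deletion can be modeled with two tables and a periodic scan. The `Namespace` class below is a toy model with made-up names; it keeps hidden files in a separate dictionary instead of renaming them, and it hard-codes the 3-day grace period from the slide.

```python
GRACE_SECONDS = 3 * 24 * 60 * 60  # hidden files older than 3 days are removed


class Namespace:
    """Toy model of the master's namespace with lazy deletion."""

    def __init__(self):
        self.files = {}    # visible name -> contents
        self.hidden = {}   # original name -> (deletion time, contents)

    def delete(self, name, now):
        # "Delete" only hides the file and records when it was deleted.
        self.hidden[name] = (now, self.files.pop(name))

    def undelete(self, name):
        # Until the scan reclaims it, a hidden file can be restored.
        _, contents = self.hidden.pop(name)
        self.files[name] = contents

    def scan(self, now):
        # Periodic master scan: really remove files hidden for too long.
        expired = [n for n, (t, _) in self.hidden.items() if now - t > GRACE_SECONDS]
        for name in expired:
            del self.hidden[name]
        return sorted(expired)
```

Times are plain seconds passed in by the caller, which keeps the sketch deterministic; a real master would use its own clock.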

Page 27: Google File System

Fault Tolerance and Diagnosis
•Fast recovery
  •Master and chunkservers are designed to restore their state and start in seconds regardless of termination conditions
•Chunk replication
  •On multiple machines and racks
•Master replication
  •Primary master and shadow masters

Page 28: Google File System

Performance Measurements

Performance measured on a cluster with
•1 Master, 2 master replicas
•16 Chunkservers
•16 Clients

Machine Configuration
•Processor: dual 1.4 GHz PIII
•RAM: 2 GB
•HD: 2 x 80 GB

Network
•Machines: 100 Mbps Ethernet
•Switches: 19 servers on one switch, 16 clients on the other, joined by a 1 Gbps link

Page 29: Google File System

Reads
•Each client reads a randomly selected 4 MB region from a 320 GB file set; all clients read simultaneously
•4 MB x 256 times = 1 GB read per client
•Network limit => observed read rate:
  •1 client => 12.5 MB/s => 10 MB/s (80% of limit)
  •16 clients => 125 MB/s => 94 MB/s (75% of limit)
•Why low?
  1) Multiple clients may read from the same chunkserver
  2) Network traffic
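
The limits and percentages on this slide follow from the link speeds on the measurement-setup slide. A short worked calculation, with variable names chosen for this sketch:

```python
def mbps_to_mb_per_s(mbps):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


# One client is limited by its own 100 Mbps Ethernet link.
per_client_limit = mbps_to_mb_per_s(100)

# All 16 clients together are limited by the 1 Gbps link between the switches.
aggregate_limit = mbps_to_mb_per_s(1000)

# Observed read rates from the slide, as a fraction of the limit.
one_client_efficiency = 10 / per_client_limit
sixteen_client_efficiency = 94 / aggregate_limit

print(per_client_limit, aggregate_limit)
print(round(one_client_efficiency, 2), round(sixteen_client_efficiency, 2))
```

This reproduces the 12.5 MB/s and 125 MB/s limits and the 80% and 75% figures quoted above.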

Page 30: Google File System

Writes
•N clients writing to N files simultaneously.
•Network limit => observed write rate:
  •1 client => 12.5 MB/s => 6.3 MB/s (about 50% of limit)
  •16 clients => 67 MB/s => 35 MB/s (about 50% of limit)
•Why low? Writing to 3 replicas
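
The 67 MB/s ceiling is consistent with every byte being written to 3 of the 16 chunkservers, each behind a 12.5 MB/s link. A quick check of that arithmetic, with a helper name invented for this sketch:

```python
def aggregate_write_limit(chunkservers, link_mb_per_s, replicas):
    """Upper bound on aggregate client write rate when each byte is stored
    `replicas` times across `chunkservers` machines with the given input link."""
    total_input_bandwidth = chunkservers * link_mb_per_s
    return total_input_bandwidth / replicas


limit = aggregate_write_limit(chunkservers=16, link_mb_per_s=12.5, replicas=3)
print(round(limit, 1))
```

The result rounds to the 67 MB/s shown on the slide; the observed 35 MB/s is about half of it.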

Page 31: Google File System

Record Appends
•N clients appending to a single file simultaneously.
•Append rate goes down as the number of clients grows.
•1 client => 125 MB/s => 6.0 MB/s
•16 clients => 125 MB/s => 4.8 MB/s

Page 32: Google File System

Performance (Real-world Cluster)
•Cluster A: Research and development
  •Used by over a hundred engineers.
  •Typical task is initiated by a user and runs for a few hours.
  •Task reads MBs to TBs of data, transforms/analyzes the data, and writes results back.
•Cluster B: Production data processing
  •Typical task runs much longer than a Cluster A task.
  •Continuously generates and processes multi-TB data sets.
  •Human users rarely involved.
•Measurements were taken after the clusters had run for a week.

Page 33: Google File System

Characteristics of two GFS Clusters

Cluster A: Research and development
Cluster B: Production data processing

Page 34: Google File System

Performance metrics for two GFS Clusters

Cluster A: Research and development
Cluster B: Production data processing

Page 35: Google File System

Performance (Real-world Cluster)
•Experiment in recovery time:
  •Killed one chunkserver in Cluster B (Production)
  •Chunkserver held about 15,000 chunks, about 600 GB of data
  •Took 23.2 minutes to restore all the chunks
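
The recovery figures imply an effective re-replication rate, which can be worked out directly. The helper below is illustrative; whether "600 GB" means 600 x 1000 MB or 600 x 1024 MB is not stated on the slide, so both are computed.

```python
def replication_rate_mb_per_s(total_mb, minutes):
    """Average rate needed to copy total_mb megabytes in the given time."""
    return total_mb / (minutes * 60)


rate_decimal = replication_rate_mb_per_s(600 * 1000, 23.2)  # GB taken as 1000 MB
rate_binary = replication_rate_mb_per_s(600 * 1024, 23.2)   # GB taken as 1024 MB
avg_chunk_mb = 600 * 1000 / 15000                           # average data per chunk

print(round(rate_decimal), round(rate_binary), avg_chunk_mb)
```

Either way the rate comes out at roughly 430 to 440 MB/s, in line with the roughly 440 MB/s effective replication rate reported in the original paper, and far more than a single 12.5 MB/s link could carry, so the restore must have been spread over many chunkservers in parallel.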

Page 36: Google File System

Conclusion
•GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
•GFS occupies a different point in the design space
  •Component failures as the norm
  •Optimize for huge files
•GFS provides fault tolerance
  •Replicating data
  •Fast and automatic recovery
  •Chunk replication
•GFS has a simple, centralized master that does not become a bottleneck
•GFS is a successful file system
  •An important tool that enables Google to continue to innovate on its ideas

Page 37: Google File System

Thank you
