Large Scale Sharing: GFS and PAST
Mahesh Balakrishnan
Distributed File Systems
Traditional definition: data and/or metadata stored at remote locations, accessed by clients over the network.
Various degrees of centralization: from NFS to xFS.
GFS and PAST:
Unconventional, specialized functionality
Large-scale in both data and nodes
The Google File System
Specifically designed for Google's backend needs
Web spiders append to huge files
Application data patterns:
Multiple producer – multiple consumer
Many-way merging
Design Space Coordinates
GFS vs. traditional file systems:
Commodity components
Very large files – multi-GB
Large, sequential accesses
Co-design of applications and file system
Supports small files and random reads and writes, but not efficiently
GFS Architecture
Interface:
Usual: create, delete, open, close, etc.
Special: snapshot, record append
Files are divided into fixed-size chunks
Each chunk is replicated at multiple chunkservers
A single master maintains all metadata
Master, chunkservers, and clients run as user-level processes on Linux workstations
Client File Request
Client computes the chunk index for the offset within the file
Client sends <filename, chunk index> to the master
Master returns the chunk handle and chunkserver locations
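A minimal sketch of this request path, assuming GFS's 64 MB chunk size; master.lookup and replica.read are invented stand-ins for the real client library RPCs.

```python
# Hypothetical GFS-style read path; master.lookup and replica.read are invented
# stand-ins for the real RPCs, not the actual GFS client API.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

def read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE                         # which chunk holds this offset
    handle, locations = master.lookup(filename, chunk_index)   # single round trip to the master
    replica = locations[0]                                     # client picks a replica, e.g. the closest
    return replica.read(handle, offset % CHUNK_SIZE, length)   # data flows from a chunkserver, not the master
```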
Design Choices: Master
A single master maintains all metadata …
Simple design
Global decision making for chunk replication and placement
Bottleneck?
Single point of failure?
Design Choices: Master
A single master maintains all metadata … in memory!
Fast master operations
Allows efficient background scans of the master's entire state
Memory limit?
Fault tolerance?
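A back-of-the-envelope estimate of why keeping metadata in memory is workable; the under-64-bytes-per-chunk figure is the bound reported in the GFS paper, while the 1 PB of file data is purely illustrative.

```python
CHUNK_SIZE = 64 * 1024**2            # 64 MB chunks
META_PER_CHUNK = 64                  # < 64 bytes of master metadata per chunk (GFS paper)
file_data = 1024**5                  # 1 PB of file data (illustrative)

chunks = file_data // CHUNK_SIZE     # ~16.8 million chunks
print(chunks * META_PER_CHUNK / 1024**3)  # ~1.0 GB of metadata -> fits comfortably in RAM
```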
Relaxed Consistency Model
File regions are:
Consistent: all clients see the same data
Defined: after a mutation, all clients see exactly what the mutation wrote
Ordering of concurrent mutations:
For each chunk's replica set, the master grants one replica the primary lease
The primary replica decides the order of mutations and sends it to the other replicas
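A minimal sketch of lease granting at the master, using an invented Master class; the 60-second duration is the initial chunk lease timeout described in the GFS paper.

```python
import time

LEASE_SECONDS = 60          # initial chunk lease timeout in GFS (extensible while mutations continue)

class Master:
    def __init__(self):
        self.leases = {}    # chunk handle -> (primary chunkserver, lease expiry)

    def grant_lease(self, handle, replicas):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if time.time() < expiry:
            return primary                                        # an unexpired lease stays with its primary
        primary = replicas[0]                                     # otherwise pick an up-to-date replica
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary
```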
Anatomy of a Mutation
1, 2: Client gets chunkserver locations from the master
3: Client pushes data to the replicas, in a chain
4: Client sends the write request to the primary; the primary assigns a sequence number to the write and applies it
5, 6: Primary tells the other replicas to apply the write
7: Primary replies to the client
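The same steps expressed as a sketch, with invented method names (push_data, apply_write, …) standing in for the real RPCs; the point is that data flow (step 3) is decoupled from control flow (steps 4-7).

```python
def write(master, handle, data):
    primary, secondaries = master.lease_holder(handle)         # steps 1-2: lease holder and replica locations
    for replica in [primary] + secondaries:
        replica.push_data(handle, data)                         # step 3: push data to every replica
                                                                # (GFS pipelines this along a chain)
    seq = primary.apply_write(handle, data)                     # step 4: primary picks a sequence number and applies
    errors = [s.apply(handle, seq, data) for s in secondaries]  # steps 5-6: secondaries apply in that order
    return errors                                               # step 7: primary reports any replica errors to the client
```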
Connection with Consistency Model
A secondary replica encounters an error while applying the write (step 5): the region is inconsistent.
Client code breaks a single large write into multiple small writes: the region is consistent, but undefined.
Special Functionality
Atomic record append (sketched below):
The primary appends to itself, then tells the other replicas to write at that offset
If a secondary replica fails to write the data (step 5), duplicates appear in the replicas that succeeded and padding in the ones that failed
The region is defined where the append succeeded, inconsistent where it failed
Snapshot:
Copy-on-write: chunks are copied lazily, on the same chunkserver
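A sketch of record append at the primary; the used / append / write_at / pad_to_end helpers are invented stand-ins for the real replica RPCs. It shows why the primary (not the client) chooses the offset, and why client retries leave duplicates on replicas where the first attempt succeeded.

```python
class RetryAppend(Exception):
    """Client must retry; a retried append can duplicate the record on replicas that already wrote it."""

CHUNK_SIZE = 64 * 1024 * 1024

def record_append(primary, secondaries, handle, record):
    if primary.used(handle) + len(record) > CHUNK_SIZE:
        for r in [primary] + secondaries:
            r.pad_to_end(handle)                  # pad the rest of the chunk; client retries on the next chunk
        raise RetryAppend()
    offset = primary.append(handle, record)       # the primary, not the client, chooses the offset
    if not all(s.write_at(handle, offset, record) for s in secondaries):
        raise RetryAppend()                        # a failed secondary leaves that region inconsistent
    return offset                                  # defined region at 'offset' on every replica that succeeded
```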
Master Internals
Namespace management
Replica placement
Chunk creation, re-replication, rebalancing
Garbage collection
Stale replica detection
Dealing with Faults
High availability:
Fast master and chunkserver recovery
Chunk replication
Master state replication: read-only shadow masters
Data integrity:
Each chunk is broken into 64 KB blocks, each with a 32-bit checksum
Checksums are kept in memory and logged to disk
Checksum computation is optimized for appends: only the last, partial block's checksum is updated, with no need to read and verify existing data
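A sketch of the per-block checksum scheme; zlib.crc32 stands in here for GFS's 32-bit checksum, and the block size matches the 64 KB granularity above.

```python
import zlib

BLOCK = 64 * 1024                                  # checksum granularity

def checksum_blocks(chunk_bytes):
    """One 32-bit checksum per 64 KB block (crc32 stands in for GFS's checksum)."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK]) for i in range(0, len(chunk_bytes), BLOCK)]

def verify_read(chunk_bytes, checksums, offset, length):
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):               # verify only the blocks the read touches
        if zlib.crc32(chunk_bytes[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError("checksum mismatch in block %d" % b)
    return chunk_bytes[offset:offset + length]
```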
Micro-benchmarks
Storage data for 'real' clusters
Performance
Workload Breakdown
[Figures: % of operations for a given size; % of bytes transferred for a given operation size]
GFS: Conclusion
Very application-specific: more engineering than research
PAST
Internet-based P2P global storage utility:
Strong persistence
High availability
Scalability
Security
Not a conventional file system:
Files have unique IDs
Clients can insert and retrieve files
Files are immutable
PAST Operations
Nodes have random, unique nodeIds
No searching, directory lookup, or key distribution
Supported operations:
Insert: (name, key, k, file) → fileId; stores the file on the k nodes closest to fileId in the id space
Lookup: (fileId) → file
Reclaim: (fileId, key)
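A hypothetical rendering of this interface in code; route() is a placeholder for Pastry's routing primitive, and the fileId computation is a simplification of PAST's (which also mixes in a random salt).

```python
import hashlib

def route(key, msg):
    """Placeholder for Pastry's route(key, msg) primitive."""
    raise NotImplementedError

def insert(name, owner_pubkey, k, file_bytes):
    fileId = hashlib.sha1(name.encode() + owner_pubkey).digest()  # 160-bit id; PAST also adds a random salt
    route(fileId, ("INSERT", k, file_bytes))                      # stored on the k nodes closest to fileId
    return fileId

def lookup(fileId):
    return route(fileId, ("LOOKUP",))                             # any of the k replicas (or a cache) may answer

def reclaim(fileId, owner_key):
    route(fileId, ("RECLAIM", owner_key))                         # reclaims storage; weaker than delete
```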
Pastry
P2P routing substrate
route(key, msg): routes to the node numerically closest to key in fewer than log_{2^b}(N) steps
Routing table size: (2^b − 1) · log_{2^b}(N) + 2l entries
b determines the tradeoff between per-node state and the number of routing hops
l determines failure tolerance: delivery is guaranteed unless l/2 nodes with adjacent nodeIds fail simultaneously
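A rough sketch of one routing step and of the state-size formula above (digits in base 2^b). The leaf-set handling is simplified; N is the network size from PAST's evaluation and l = 16 is an assumed leaf-set parameter.

```python
import math

b = 4                                               # digits are hex when b = 4

def shared_prefix_len(a, key):
    n = 0
    while n < len(a) and a[n] == key[n]:
        n += 1
    return n

def next_hop(routing_table, leaf_set, my_id, key):
    ids = sorted(leaf_set + [my_id], key=lambda x: int(x, 16))
    if int(ids[0], 16) <= int(key, 16) <= int(ids[-1], 16):            # key falls in the leaf set's range:
        return min(ids, key=lambda x: abs(int(x, 16) - int(key, 16)))  # deliver to the numerically closest node
    p = shared_prefix_len(my_id, key)
    return routing_table[p][int(key[p], 16)]        # otherwise forward to a node sharing one more digit with the key

N, l = 2250, 16
print((2**b - 1) * math.log(N, 2**b) + 2 * l)       # ~74 state entries per node for this configuration
```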
Routing Table of Node 10233102
[Figure: node state consists of a leaf set (|L|/2 numerically larger and |L|/2 smaller nodeIds), the routing table entries, and the |M| closest nodes]
PAST Operations / Security
Insert:
A certificate is created with the fileId, a hash of the file's content, and the replication factor, and is signed with the owner's private key
The file and certificate are routed through Pastry
The first node among the k closest accepts the file and forwards it to the other k−1
Security: smartcards
Hold the public/private key pair
Generate and verify certificates
Ensure the integrity of nodeId and fileId assignments
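A sketch of the file certificate, with an invented field layout and a hypothetical smartcard object providing sign/verify; the real PAST certificate carries additional items (e.g. a salt and a creation date).

```python
import hashlib

def make_file_certificate(smartcard, fileId, file_bytes, k):
    body = fileId + hashlib.sha1(file_bytes).digest() + bytes([k])   # fileId | content hash | replication factor
    return body, smartcard.sign(body)                                # signed by the owner's smartcard (private key)

def storing_node_checks(smartcard, body, signature, file_bytes):
    content_hash = body[20:40]
    return (smartcard.verify(body, signature)                        # certificate signature is valid
            and content_hash == hashlib.sha1(file_bytes).digest())   # and the received content matches it
```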
Storage Management
Design goals:
High global storage utilization
Graceful degradation near maximum utilization
PAST tries to:
Balance free storage space among nodes
Maintain the k-closest-nodes replication invariant
Storage load imbalance:
Variance in the number of files assigned to a node
Variance in the size distribution of inserted files
Variance in the storage capacity of PAST nodes
Storage Management
Nodes with large storage capacity hold multiple nodeIds
Replica diversion:
If node A cannot store the file, the file is stored at a leaf-set node B that is not among the k closest, and A keeps a pointer to it
What if A or B fails? The pointer is duplicated at the (k+1)-closest node
Policies for diverting and accepting replicas: thresholds t_pri and t_div on the ratio of file size to free space (see the sketch below)
File diversion:
If an insert fails, the client retries with a different fileId
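A sketch of the acceptance policy (names are mine; the default values match the thresholds used in the evaluation). A node among the k closest applies the more permissive t_pri, while a node asked to hold a diverted replica applies the stricter t_div, so diverted copies do not crowd out files for which a node is a primary store.

```python
def accepts(file_size, free_space, primary_store, t_pri=0.1, t_div=0.05):
    """Reject files that are large relative to the node's remaining free space."""
    threshold = t_pri if primary_store else t_div
    return file_size / free_space <= threshold

# If a primary node rejects: divert the replica to a leaf-set node outside the k
# closest and keep a pointer.  If that also fails, the insert aborts and the
# client retries with a new fileId (file diversion).
```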
Storage Management
Maintaining the replication invariant under failures and joins
Caching (see the sketch below):
k-way replication in PAST is for availability
Extra copies are cached to reduce client latency and network traffic
Unused disk space is utilized
GreedyDual-Size replacement policy
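A minimal sketch of GreedyDual-Size eviction for the cached copies, assuming a constant cost term c(f) = 1 so that small and recently used files are favored; the inflation-value formulation is one common way to implement the policy's aging.

```python
def on_access(cache, fileId, size, inflation):
    """cache maps fileId -> (size, H); H(f) = L + c(f)/s(f) with c(f) = 1."""
    cache[fileId] = (size, inflation + 1.0 / size)

def make_room(cache, new_size, capacity, inflation):
    used = sum(s for s, _ in cache.values())
    while cache and used + new_size > capacity:
        victim = min(cache, key=lambda f: cache[f][1])   # evict the file with the smallest H
        size, h = cache.pop(victim)
        used -= size
        inflation = h                                    # the inflation value L rises to the evicted H
    return inflation
```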
Performance
Workloads:
8 web proxy logs
A combined file-system trace
Parameters: k = 5, b = 4
Number of nodes = 2250
Node storage sizes drawn from 4 normal distributions
Without replica and file diversion:
51.1% of insertions failed
60.8% global utilization
Effect of Storage Management
Effect of t_pri (t_div = 0.05, t_pri varied):
Lower t_pri gives better utilization but more failed insertions
Effect of t_div (t_pri = 0.1, t_div varied):
Trend similar to t_pri
File and Replica Diversions
[Figures: ratio of file diversions vs. utilization; ratio of replica diversions vs. utilization]
Distribution of Insertion Failures
[Figures: web proxy log trace; file system trace]
Caching
Conclusion