Distributed File Systems
Andy Wang
COP 5611 Advanced Operating Systems
Outline
Basic concepts
NFS
Andrew File System
Replicated file systems: Ficus, Coda
Serverless file systems
Basic Distributed FS Concepts
You are here, the file’s there, what do you do about it?
Important questions:
What files can I access?
How do I name them?
How do I get the data?
How do I synchronize with others?
What files can be accessed?
Several possible choices:
Every file in the world
Every file stored in this kind of system
Every file in my local installation
Selected volumes
Selected individual files
What dictates the choice?
Why not make every file available?
Naming issues
Scaling issues
Local autonomy
Security
Network traffic
Naming Files in a Distributed System
How much transparency?
Does every user/machine/sub-network need its own namespace?
How do I find a site that stores the file that I name?
Is it implicit in the name?
Can my naming scheme scale?
Must everyone agree on my scheme?
How do I get remote files?
Fetch it over the network?
How much caching?
Replication?
What security is required for data transport?
Synchronization and Consistency
Will there be trouble if multiple sites want to update a file?
Can I get any guarantee that I always see consistent versions of data?
I.e., will I ever see old data after new?
How soon do I see new data?
NFS
Network file system
Provides distributed filing by remote access, with a high degree of transparency
Developed by Sun
NFS Characteristics
Volume-level access
RPC-based (uses XDR)
Stateless remote file access
Location (not name) transparent
Implementations for many systems
All interoperate, even non-Unix ones
Currently based on VFS
VFS/Vnode Review
VFS: Virtual File System
A common interface allowing multiple file system implementations on one system
Plugged in below user level
Files are represented by vnodes
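As a concrete illustration, here is a minimal C sketch of the vnode idea; the type and field names are invented for this example, not the actual kernel definitions:

```c
/* Illustrative sketch of the vnode abstraction (invented names).
 * Each file system supplies its own operations table; the VFS layer
 * calls through it without knowing whether the file is local (UFS)
 * or remote (NFS). */
struct vnode;

struct vnodeops {
    int (*vop_lookup)(struct vnode *dir, const char *name,
                      struct vnode **result);
    int (*vop_read)(struct vnode *vp, void *buf,
                    unsigned long len, unsigned long offset);
    int (*vop_write)(struct vnode *vp, const void *buf,
                     unsigned long len, unsigned long offset);
};

struct vnode {
    const struct vnodeops *v_ops;  /* per-file-system operations */
    void *v_data;                  /* private data: inode, NFS handle, ... */
};
```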
NFS File Handles
On clients, files are represented by vnodes
The client internally represents remote files as handles
Opaque to the client, but meaningful to the server
To name a remote file, provide its handle to the server
NFS Handle Diagram
[Diagram: a user process holds a file descriptor; at the client's VFS level this maps to a vnode, whose NFS level stores the file handle; the handle travels to the NFS server, whose VFS level maps it to a UFS inode]
How to make this work?
Could integrate it into the kernel: non-portable, non-distributable
Instead, use existing features to do the work:
VFS for the common interface
RPC for data transport
Using RPC for NFS
Must have some process at the server that answers the RPC requests: a continuously running daemon process
Must also somehow perform mounts across machine boundaries: a second daemon process handles this
NFS Processes
nfsd daemons—server daemons that accept RPC calls for NFS
rpc.mountd daemons—server daemons that handle mount requests
biod daemons—optional client daemons that can improve performance
NFS from the Client’s Side
User issues a normal file operation, like read()
It passes through the vnode interface to the client-side NFS implementation
The client-side NFS implementation formats and sends an RPC packet to perform the operation
The client blocks until the RPC returns
NFS RPC Procedures
16 RPC procedures implement NFS
Some for files, some for file systems
Including directory ops, link ops, read, write, etc.
Lookup() is the key operation, because it fetches handles
Other NFS file operations use the handle
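To make Lookup()'s role concrete, here is a hedged C sketch of path resolution by repeated lookups; nfs_fh, nfs_lookup, and resolve are illustrative stand-ins, not the real RPC stubs:

```c
#include <string.h>

/* Illustrative NFS-style handle: opaque bytes to the client,
 * meaningful only to the server. */
struct nfs_fh { unsigned char data[32]; };

/* Stand-in for the LOOKUP RPC: given a directory's handle and one
 * path component, the server returns the component's handle. */
int nfs_lookup(const struct nfs_fh *dir, const char *name,
               struct nfs_fh *out);

/* Resolve "a/b/c" by repeated lookups, starting from the mount's
 * "primal" handle; every later operation names the file by handle. */
int resolve(const struct nfs_fh *root, char *path, struct nfs_fh *out)
{
    struct nfs_fh cur = *root;
    for (char *comp = strtok(path, "/"); comp != NULL;
         comp = strtok(NULL, "/")) {
        struct nfs_fh next;
        if (nfs_lookup(&cur, comp, &next) != 0)
            return -1;          /* no such component, or stale handle */
        cur = next;
    }
    *out = cur;
    return 0;
}
```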
Mount Operations
Must mount an NFS file system on the client before you can use it
Requires local and remote operations
Local ops indicate the mount point has an NFS-type VFS at that point in the hierarchy
Remote operations go to the remote rpc.mountd
Mount provides the "primal" file handle
NFS on the Server Side
The server side is represented by the local VFS actually storing the data
Plus the rpc.mountd and nfsd daemons
NFS is stateless: servers do not keep track of clients
Each NFS operation must be self-contained (from the server's point of view)
Implications of Statelessness
Self-contained NFS RPC requests
NFS operations should be idempotent
NFS should use a stateless transport protocol (e.g., UDP)
Servers don’t worry about client crashes
Server crashes won’t leave junk
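As an illustration of what "self-contained" means here, a sketch of a stateless write request; the layout is invented for the example, not the real XDR wire format:

```c
/* Invented layout (the real protocol uses XDR encoding). A stateless
 * write names everything explicitly: which file, where, how much.
 * There is no server-side open-file state or "current offset", so
 * resending the request after a lost reply writes the same bytes to
 * the same place: the operation is idempotent. */
struct nfs_fh { unsigned char data[32]; };  /* opaque file handle */

struct nfs_write_args {
    struct nfs_fh file;        /* which file (server-interpreted) */
    unsigned long offset;      /* absolute position in the file */
    unsigned long count;       /* number of bytes to write */
    unsigned char data[8192];  /* the data itself */
};
```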
More Implications of Statelessness
Servers don't know what files clients think are open
Unlike in UFS, LFS, and most local VFS file systems
Makes it much harder to provide certain semantics
Scales nicely, though
Preserving UNIX File Operation Semantics
NFS works hard to provide identical semantics to local UFS operations
Some of this is tricky, especially given the statelessness of the server
E.g., how do you avoid discarding the pages of an unlinked file a client has open?
Sleazy NFS Tricks
Used to provide desired semantics despite statelessness of the server
E.g., if a client unlinks an open file, send a rename to the server rather than a remove
Perform the actual remove when the file is closed (see the sketch below)
Won't work if the file is removed on the server
Won't work with cooperating clients
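A sketch of the trick, with hypothetical client-side helpers (nfs_rename, nfs_remove, and file_is_open_locally stand in for the real client internals):

```c
#include <stdio.h>

/* Hypothetical stubs for the RENAME and REMOVE RPCs and for the
 * client's own open-file accounting. */
int nfs_rename(const char *from, const char *to);
int nfs_remove(const char *path);
int file_is_open_locally(const char *path);

/* If a locally open file is unlinked, rename it on the server to a
 * hidden temporary name so the bits survive; do the real remove at
 * last close. (Real clients use hidden names like ".nfsXXXX".) */
int client_unlink(const char *path)
{
    if (file_is_open_locally(path)) {
        static unsigned counter;
        char hidden[32];
        snprintf(hidden, sizeof hidden, ".nfs%08u", counter++);
        return nfs_rename(path, hidden);   /* defer the remove */
    }
    return nfs_remove(path);
}
```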
File Handles
The method clients use to identify files
Created by the server on file lookup
Must uniquely map a server file identifier to a universal identifier
File handles become invalid when the server frees or reuses the inode
The inode generation number in the handle shows when it is stale
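A minimal sketch under an assumed handle layout: the server embeds the inode's generation number in the handle, so a freed-and-reused inode makes old handles detectably stale:

```c
#include <stdbool.h>

/* How the *server* might interpret a handle (illustrative layout);
 * to the client these are just opaque bytes. */
struct srv_handle {
    unsigned long fsid;        /* which exported file system */
    unsigned long inode_num;   /* which inode within it */
    unsigned long generation;  /* bumped each time the inode is reused */
};

/* Stand-in for the server's in-core inode. */
struct inode {
    unsigned long number;
    unsigned long generation;
};

/* Stale if the inode was freed and reused since the handle was
 * issued: same inode number, different generation. */
bool handle_is_stale(const struct srv_handle *h, const struct inode *ip)
{
    return h->inode_num != ip->number || h->generation != ip->generation;
}
```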
nfsd Daemon
Handles incoming RPC requests
Often multiple nfsd daemons per site
An nfsd daemon makes kernel calls to do the real work
Allows multiple threads
biod Daemon
Does readahead for clients, to make use of the kernel file buffer cache
Only improves performance: NFS works correctly without the biod daemon
Also flushes buffered writes for clients
rpc.mountd Daemon
Runs on the server to handle VFS-level operations for NFS
Particularly remote mount requests
Provides the initial file handle for a remote volume
Also checks that incoming requests come from privileged ports (in the UDP/IP packet's source address)
rpc.lockd Daemon
The NFS server is stateless, so it does not handle file locking
rpc.lockd provides locking
It runs on both client and server
The client side catches lock requests and forwards them to the server daemon
rpc.lockd handles lock recovery when the server crashes
rpc.statd Daemon
Also runs on both client and server
Used to check the status of a machine
The server's rpc.lockd asks rpc.statd to store permanent lock information (in the file system)
And to monitor the status of the locking machine
If a client crashes, its locks are cleared from the server
Recovering Locks After a Crash
If the server crashes and recovers, its rpc.lockd contacts clients to reestablish locks
If a client crashes, rpc.statd contacts the client when it becomes available again
The client has a short grace period to revalidate its locks; then they're cleared
What Can You Cache, Cont'd?
File attributes are specially cached by NFS
Directory attributes are handled a little differently than file attributes
Especially important because many programs get and set attributes frequently
Security in NFS
NFS inherits the RPC mechanism's security
Some RPC mechanisms provide decent security; some don't
Mount security is provided by knowing which ports are permitted to mount what
The Andrew File System
A different approach to remote file access
Meant to service a large organization, such as a university campus
Scaling is a major goal
Basic Andrew Model
Files are stored permanently at file server machines
Users work from workstation machines, with their own private namespace
Andrew provides mechanisms to cache users' files from the shared namespace
User Model of AFS Use
Sit down at any AFS workstation anywhere
Log in and authenticate who I am
Access all files without regard to which workstation I'm using
The Local Namespace
Each workstation stores a few files
Mostly system programs and configuration files
Workstations are treated as generic, interchangeable entities
Virtue and Vice
Vice is the system run by the file servers: a distributed system
Virtue is the protocol client workstations use to communicate with Vice
Overall Architecture
System is viewed as a WAN composed of LANs
Each LAN has a Vice cluster server, which stores local files
But Vice makes all files available to all clients
Caching the User Files
Goal is to offload work from servers to clients
When must servers do work?
To answer requests
To move data
Whole files are cached at clients
Why Whole-file Caching?
Minimizes communication with the server
Most files are used in their entirety, anyway
Easier cache management problem
Requires substantial free disk space on workstations
Doesn't address huge-file problems
The Shared Namespace
An Andrew installation has a globally shared namespace
All clients see the files in the namespace under the same names
A high degree of name and location transparency
How do servers provide the namespace?
Files are organized into volumes
Volumes are grafted together into the overall namespace
Each file has a globally unique ID
Volumes are stored at individual servers
But a volume can be moved from server to server
Finding a File
At a high level, files have names
A directory translates a name to a unique ID
If the client knows where the volume is, it simply sends the unique ID to the appropriate server
Finding a Volume
What if you enter a new volume?
How do you find which server stores the volume?
A volume-location database is stored on each server
Once information on a volume is known, the client caches it
Moving a Volume
When a volume moves from server to server, update the database
A heavyweight distributed operation
What about clients with cached information?
The old server maintains forwarding information
This also eases server update
Handling Cached Files
Files are fetched transparently when needed
The file system traps opens and sends them to the local Venus process
The Venus Daemon
Responsible for handling a single client's cache
Caches files on open
Writes modified versions back on close
Cached files are saved locally after close
Caches directory entry translations, too
Consistency for AFS
If my workstation has a locally cached copy of a file, what if someone else changes it?
Callbacks are used to invalidate my copy
Requires servers to keep information on who caches files
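A sketch of the callback bookkeeping, with invented names; the point is that the server must remember, per file, every client that holds a cached copy, and "break" those callbacks when the file changes:

```c
/* Invented server-side structures; not Vice's actual code. */
#define MAX_CALLBACKS 64

struct file_callbacks {
    int client_ids[MAX_CALLBACKS];  /* clients promised a callback */
    int nclients;
};

/* Stand-in for the RPC telling a client its cached copy is invalid. */
void send_callback_break(int client_id);

/* When a new version of the file is stored, break every outstanding
 * callback so stale cached copies get discarded. */
void file_updated(struct file_callbacks *cb)
{
    for (int i = 0; i < cb->nclients; i++)
        send_callback_break(cb->client_ids[i]);
    cb->nclients = 0;   /* no one holds a valid cached copy now */
}
```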
Write Consistency in AFS
What if I write to my cached copy of a file?
Need to get write permission from the server, which invalidates other copies
Permission is obtained on open-for-write
Need to obtain the new data at this point
Write Consistency in AFS, Cont'd
Initially, writes go only to the local copy
On close, Venus sends the update to the server
Extra mechanisms handle failures
Storage of Andrew Files
Stored in UNIX file systems
The client cache is a directory on the local machine
Low-level names do not match Andrew names
Venus Cache Management
Venus keeps two caches: status and data
The status cache is kept in virtual memory, for fast attribute lookup
The data cache is kept on disk
Venus Process Architecture
Venus is a single user-level process, but multithreaded
It uses RPC to talk to the server
RPC is built on a low-level datagram service
AFS Security
Only the servers/Vice are trusted here
Client machines might be corrupted
No client programs run on Vice machines
Clients must authenticate themselves to servers
Encrypted transmissions
AFS File Protection
AFS supports access control lists
Each file has a list of users who can access it, and the permitted modes of access
Maintained by Vice
Used to mimic UNIX access control
AFS Read-only Replication
For volumes containing files that are used frequently but not changed often, e.g., executables
AFS allows multiple servers to store read-only copies
Replicated File Systems
NFS provides remote access
AFS provides high-quality caching
Why isn't this enough?
More precisely, when isn't this enough?
When Do You Need Replication?
For write performance
For reliability
For availability
For mobile computing
For load sharing
Optimistic replication increases these advantages
Some Replicated File Systems
Locus, Ficus, Coda, Rumor
All optimistic: few conservative file replication systems have been built
Ficus
Optimistic file replication based on a peer-to-peer model
Built in a Unix context
Meant to service a large network of workstations
Built using stackable layers
Peer-to-peer Replication
All replicas are equal
No replicas are masters or servers
All replicas can provide any service
All replicas can propagate updates to all other replicas
Client/server is the other popular model
Basic Ficus Architecture
Ficus replicates at volume granularity
A given volume can be replicated many times
Performance limits the scale
Updates are propagated as they occur, on a single best-effort basis
Consistency is achieved by periodic reconciliation
Stackable Layers in Ficus
Ficus is built out of stackable layers
Exact composition depends on what generation of system you look at
Reconciliation in Ficus
The reconciliation process runs periodically on each Ficus site, for each local volume replica
The reconciliation strategy implies an eventual consistency guarantee
The frequency of reconciliation affects how long "eventually" takes
Steps in Reconciliation
1. Get information about the state of a remote replica
2. Get information about the state of the local replica
3. Compare the two sets of information
4. Change local replica to reflect remote changes
Gossiping and Reconciliation
Reconciliation benefits from the use of gossip
For example, an update originating at A can reach B through communications between B and C
So B can get the update without talking to A directly
Benefits of Gossiping
Potentially less communication
Shares the load of sending updates
Easier recovery behavior
Handles disconnections nicely
Handles mobile computing nicely
Peer-model systems get more benefit than client/server-model systems
Reconciliation Topology
Reconciliation in Ficus is pair-wise
In the general case, which pairs of replicas should reconcile?
Reconciling all pairs is unnecessary, due to gossip
Want to minimize the number of recons, but propagate data quickly
Problems in File Reconciliation
Recognizing updates
Recognizing update conflicts
Handling conflicts
Recognizing name conflicts
Update/remove conflicts
Garbage collection
Ficus has solutions for all these problems
Recognizing Updates in Ficus
Ficus keeps per-file version vectors
Updates are detected by version vector comparisons
The data for the later version can then be propagated
Ficus propagates full files
Recognizing Update Conflicts
Concurrent updates can lead to update conflicts
Version vectors permit detection of update conflicts
Works for n-way conflicts, too
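A minimal, runnable sketch of version vector comparison (the fixed replica count is an assumption): one vector dominates if it is at least as large in every slot; if each vector is larger somewhere, the updates were concurrent and conflict:

```c
#include <stdio.h>

#define NREPLICAS 3   /* assumed fixed replica set for illustration */

enum vv_order { VV_EQUAL, VV_DOMINATES, VV_DOMINATED, VV_CONFLICT };

/* Compare two per-file version vectors: a[i] counts the updates the
 * file has received at replica i. */
enum vv_order vv_compare(const int a[NREPLICAS], const int b[NREPLICAS])
{
    int a_bigger = 0, b_bigger = 0;
    for (int i = 0; i < NREPLICAS; i++) {
        if (a[i] > b[i]) a_bigger = 1;
        if (b[i] > a[i]) b_bigger = 1;
    }
    if (a_bigger && b_bigger) return VV_CONFLICT;  /* concurrent updates */
    if (a_bigger) return VV_DOMINATES;  /* a is strictly newer: propagate a */
    if (b_bigger) return VV_DOMINATED;  /* b is strictly newer: propagate b */
    return VV_EQUAL;
}

int main(void)
{
    int mine[NREPLICAS]   = {2, 1, 0};
    int theirs[NREPLICAS] = {1, 2, 0};  /* each saw an update the other missed */
    printf("conflict? %s\n",
           vv_compare(mine, theirs) == VV_CONFLICT ? "yes" : "no");
    return 0;
}
```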
Handling Update Conflicts
Ficus uses resolver programs to handle conflicts
Resolvers work on one pair of replicas of one file
System attempts to deduce file type and call proper resolver
If all resolvers fail, notify the user
Ficus also blocks access to the file
Handling Directory Conflicts
Directory updates have very limited semantics
So directory conflicts are easier to deal with
Ficus uses in-kernel mechanisms to automatically fix most directory conflicts
How Did This Directory Get Into This State?
If we could figure out what operations were performed on each side that caused each replica to enter this state,
We could produce a merged version
But there are several possibilities
Possibility 1
1. Earth and Mars exist
2. Create Saturn at replica 1
3. Create Sedna at replica 2
The correct result is a directory containing Earth, Mars, Saturn, and Sedna
The Create/Delete Ambiguity
This is an example of a general problem with replicated data
It cannot be solved with per-file version vectors
It requires per-entry information
Ficus keeps such information
Removed files' entries must be saved for a while
Possibility 2
1. Earth, Mars, and Saturn exist
2. Delete Saturn at replica 2
3. Create Sedna at replica 2
The correct result is a directory containing Earth, Mars, and Sedna
And there are other possibilities
Recognizing Name Conflicts
Name conflicts occur when two different files are concurrently given the same name
Ficus recognizes them with its per-entry directory info
Then what?
Handle them similarly to update conflicts
Add disambiguating suffixes to the names
Internal Representation of Problem Directory
[Diagram: replica 1's directory holds entries Earth, Mars, and Saturn; replica 2's holds Earth, Mars, Saturn, and Sedna]
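A sketch of the per-entry information that resolves such ambiguities (the fields are illustrative, not Ficus's actual format): because a removed entry leaves a tombstone behind, "deleted at this replica" and "never created at this replica" are distinguishable:

```c
/* Illustrative per-entry directory metadata. Keeping a tombstone for
 * removed entries lets reconciliation distinguish "deleted here"
 * from "created at the other replica since our last exchange". */
enum entry_status { ENTRY_LIVE, ENTRY_TOMBSTONE };

struct dir_entry {
    char name[256];
    unsigned long file_id;       /* globally unique file ID */
    enum entry_status status;    /* live, or removed but remembered */
    unsigned long entry_version; /* per-entry (not per-file) version info */
};
```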
Update/remove Conflicts
Consider the case where the file "Saturn" has two replicas
1. Replica 1 receives an update
2. Replica 2 is removed
What should happen?
A matter of system semantics, basically
Ficus’ No-lost-updates Semantics
Ficus handles this problem by defining its semantics to be no-lost-updates
In other words, the update must not disappear
But the remove must happen
Put "Saturn" in the orphanage
Requires temporarily saving removed files
Removals and Hard Links
Unix and Ficus support hard links: effectively, multiple names for a file
A file's bits cannot be removed until the last hard link to the file is removed
Tricky in a distributed system
Link Example, Part III
[Diagram: replicas 1 and 2 each hold foodir with links red and blue; one replica deletes foodir/blue while the other creates a hard link to blue in bardir]
What Should Happen Here?
Clearly, the link named foodir/blue should disappear
But what version of the data should the bardir link point to?
No-lost-update semantics say it must be the update at replica 1
Garbage Collection in Ficus
Ficus cannot throw away removed things at once:
Directory entries
Updated files, for no-lost-updates
Non-updated files, due to hard links
When can Ficus reclaim the space these use?
When Can I Throw Away My Data?
Not until all links to the file disappear
This is global information, not local
Moreover, just because I know all links have disappeared doesn't mean I can throw everything away
Must wait until everyone knows
Requires two trips around the ring
Why Can't I Forget When I Know There Are No Links?
I can throw the data away: I don't need it, and nobody else does either
But I can't forget that I knew this, because not everyone knows it
For them to throw their data away, they must learn
So I must remember, for their benefit
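A runnable toy simulation of the two-trip rule on a ring (the site count and bookkeeping are invented): the first trip spreads the fact that all links are gone; the second spreads the fact that everyone knows it, after which the remembered state can finally be discarded:

```c
#include <stdio.h>
#include <stdbool.h>

#define NSITES 4   /* assumed ring of four replicas */

int main(void)
{
    /* Site 0 discovers that the last link to the file is gone. */
    bool knows_links_gone[NSITES] = { true };
    bool knows_all_know[NSITES]   = { false };

    /* Trip 1: the fact travels around the ring; each site can now
     * discard the file's data, but must remember that it did. */
    for (int i = 1; i < NSITES; i++)
        knows_links_gone[i] = knows_links_gone[i - 1];

    /* The fact has come all the way around, so site 0 now knows
     * that everyone knows; trip 2 spreads that second fact, after
     * which each site may forget the bookkeeping entirely. */
    knows_all_know[0] = knows_links_gone[NSITES - 1];
    for (int i = 1; i < NSITES; i++)
        knows_all_know[i] = knows_all_know[i - 1];

    for (int i = 0; i < NSITES; i++)
        printf("site %d: may forget bookkeeping: %s\n",
               i, knows_all_know[i] ? "yes" : "no");
    return 0;
}
```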
Coda
A different approach to optimistic replication
Inherits a lot from Andrew
Basically a client/server solution
Developed at CMU
Coda Replication Model
Files stored permanently at server machines
Client workstations download temporary replicas, not cached copies
Clients can perform updates without getting a token from the server
So concurrent updates are possible
Detecting Concurrent Updates
Workstation replicas only reconcile with their server
At recon time, they compare their state of the files with the server's state, detecting any problems
Since workstations don’t gossip, detection is easier than in Ficus
Handling Concurrent Updates
The basic strategy is similar to Ficus'
Resolver programs are called to deal with conflicts
Coda allows resolvers to deal with multiple related conflicts at once
It also has some other refinements to conflict resolution
Server Replication in Coda
Unlike Andrew, writable copies of a file can be stored at multiple servers
Servers have peer-to-peer replication
Servers have strong connectivity and crash infrequently
Thus, Coda can use simpler peer-to-peer algorithms than Ficus must
Why Is Coda Better Than AFS?
Writes don't lock the file
Writes happen quicker
More local autonomy
Less write traffic on the network
Workstations can be disconnected
Better load sharing among servers
Comparing Coda to Ficus
Coda uses simpler algorithms
Less likely to have bugs
Less likely to have performance problems
Coda doesn't allow client gossiping
Coda has built-in security
Coda's garbage collection is simpler
Serverless Network File Systems
New network technologies are much faster, with much higher bandwidth
In some cases, going over the net is quicker than going to local disk
How can we improve file systems by taking advantage of this change?
Fundamental Ideas of xFS
Peer workstations providing file service for each other
High degree of location independence
Make use of all machines' caches
Provide reliability in case of failures
xFS
Developed at Berkeley
Inherits ideas from several sources:
LFS
Zebra (RAID-like ideas)
Multiprocessor cache consistency
Built for the Network of Workstations (NOW) environment
What Does a File Server Do?
Stores file data blocks on its disks
Maintains file location information
Maintains a cache of data blocks
Manages cache consistency for its clients
xFS Must Provide These Services
In essence, every machine takes on some of the server’s responsibilities
Any data or metadata might be located at any machine
The key challenge is providing, in a distributed system, the same services a centralized server provided
Key xFS Concepts
Metadata managers
Stripe groups for data storage
Cooperative caching
Distributed cleaning processes
How Do I Locate a File in xFS?
I've got a file name, but where is it? (Assuming it's not locally cached)
The file's directory converts the name to a unique index number
Consult the metadata manager to find out where the file with that index number is stored, via the manager map
The Manager Map
A data structure that allows translation of index numbers to file managers
Not necessarily file locations
Kept by each metadata manager
A globally replicated data structure
It simply says which machine manages the file
Using the Manager Map
Look up the index number in the local map
Index numbers are clustered, so there are many fewer entries than files
Send the request to the responsible manager
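A sketch with invented types and sizes: because index numbers are assigned in clusters, the globally replicated map needs only one entry per cluster of files, not one per file:

```c
/* Invented manager-map layout. Index numbers are handed out in
 * clusters, so one entry covers a whole cluster of files; the map
 * names a manager machine, not a file location. */
#define CLUSTER_SIZE 1024U  /* assumed index numbers per cluster */
#define NCLUSTERS    4096U

/* manager_map[c] = ID of the machine managing cluster c; this small
 * array is what gets replicated at every site. */
static int manager_map[NCLUSTERS];

int manager_for(unsigned long index_number)
{
    return manager_map[index_number / CLUSTER_SIZE];
}
```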
What Does the Manager Do?
The manager keeps two types of information:
1. imap information
2. caching information
If some other site has the file in its cache, tell the requester to go to that site
Always use a cache before disk, even if the cache is remote
What if No One Caches the Block?
The metadata manager for this file must then consult its imap
The imap tells which disks store the data block
Files are striped across disks on multiple machines
Typically, a single block is on one disk
Writing Data
xFS uses RAID-like methods to store data
RAID performs badly for small writes
So xFS avoids small writes by using LFS-style operations
Batch writes until you have a full stripe's worth
Stripe Groups
A set of disks that cooperatively store data in RAID fashion
xFS uses a single parity disk per group
An alternative to striping all data across all disks
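A runnable sketch of the parity arithmetic (block size and group width are assumptions): the parity block is the byte-wise XOR of the data blocks, so any single lost block can be rebuilt from the survivors:

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed block size */
#define GROUP_WIDTH 4     /* assumed data disks per stripe group */

/* Parity is the byte-wise XOR of the data blocks in the stripe. */
void compute_parity(unsigned char data[GROUP_WIDTH][BLOCK_SIZE],
                    unsigned char parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < GROUP_WIDTH; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];
}

int main(void)
{
    static unsigned char data[GROUP_WIDTH][BLOCK_SIZE];
    static unsigned char parity[BLOCK_SIZE];
    data[2][0] = 0x5a;                 /* some data to protect */
    compute_parity(data, parity);

    /* "Lose" disk 2, then rebuild its first byte from the others. */
    unsigned char rebuilt = parity[0];
    for (int d = 0; d < GROUP_WIDTH; d++)
        if (d != 2)
            rebuilt ^= data[d][0];
    printf("rebuilt byte: 0x%02x\n", rebuilt);   /* prints 0x5a */
    return 0;
}
```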
Cooperative Caching
Each site's cache can service requests from all other sites
Working from the assumption that network access is quicker than disk access
Metadata managers are used to keep track of where data is cached
So remote cache access takes 3 network hops
Getting a Block from a Remote Cache
[Diagram: (1) the client sends a block request to the metadata server, which holds the manager map and cache consistency state; (2) the metadata server forwards the request to the caching site; (3) the caching site returns the block from its Unix cache to the client]
Providing Cache Consistency
Per-block token consistency
To write a block, a client requests the token from the metadata server
The metadata server retrieves the token from whoever has it, and invalidates other caches
The writing site keeps the token
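A sketch of the token protocol with invented names: the metadata server serializes writers by moving a single per-block token and invalidating other cached copies as it goes:

```c
/* Invented per-block token state kept at the metadata server. One
 * writer token exists per block; readers' cached copies are
 * invalidated when the token moves. */
#define NO_HOLDER (-1)
#define MAX_SITES 32

struct block_token {
    int holder;                 /* site currently allowed to write */
    int cachers[MAX_SITES];     /* sites with read copies cached */
    int ncachers;
};

void invalidate_cache(int site);   /* stand-in RPCs */
void recall_token(int site);

/* A client asks to write: pull the token from the current holder,
 * invalidate all other cached copies, then grant the token. The
 * writing site keeps it until someone else asks. */
void grant_write_token(struct block_token *t, int requester)
{
    if (t->holder != NO_HOLDER && t->holder != requester)
        recall_token(t->holder);
    for (int i = 0; i < t->ncachers; i++)
        if (t->cachers[i] != requester)
            invalidate_cache(t->cachers[i]);
    t->ncachers = 0;
    t->holder = requester;
}
```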
Which Sites Should Manage Which Files?
Could randomly assign an equal number of file index groups to each site
Better if the site using a file also manages it
In particular, if the most frequent writer manages it
Can reduce network traffic by ~50%
Cleaning Up
File data (and metadata) is stored in log structures spread across machines
A distributed cleaning method is required
Each machine stores info on its usage of stripe groups
Each cleans up its own mess