Distributed File Systems
Andy Wang
COP 5611 Advanced Operating Systems
Outline
Basic concepts
NFS
Andrew File System
Replicated file systems: Ficus, Coda
Serverless file systems
Basic Distributed FS Concepts
You are here, the file’s there, what do you do about it?
Important questions:
What files can I access?
How do I name them?
How do I get the data?
How do I synchronize with others?
What files can be accessed?
Several possible choices:
Every file in the world
Every file stored in this kind of system
Every file in my local installation
Selected volumes
Selected individual files
What dictates the choice?
Why not make every file available?
Naming issues
Scaling issues
Local autonomy
Security
Network traffic
Naming Files in a Distributed System
How much transparency?
Does every user/machine/sub-network need its own namespace?
How do I find a site that stores the file that I name?
Is it implicit in the name?
Can my naming scheme scale?
Must everyone agree on my scheme?
How do I get remote files?
Fetch it over the network?
How much caching?
Replication?
What security is required for data transport?
Synchronization and Consistency
Will there be trouble if multiple sites want to update a file?
Can I get any guarantee that I always see consistent versions of data?
I.e., will I ever see old data after new?
How soon do I see new data?
NFS
Network file system
Provides distributed filing by remote access, with a high degree of transparency
Developed by Sun
NFS Characteristics
Volume-level access
RPC-based (uses XDR)
Stateless remote file access
Location (not name) transparent
Implementations for many systems
All interoperate, even non-Unix ones
Currently based on VFS
VFS/Vnode Review
VFS: Virtual File System
A common interface allowing multiple file system implementations on one system
Plugged in below user level
Files are represented by vnodes
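As a concrete illustration, here is a minimal C sketch of the vnode idea; the type and field names are invented for this example, not the actual kernel definitions:

```c
/* Illustrative sketch of the vnode abstraction (invented names).
 * Each file system supplies its own operations table; the VFS layer
 * calls through it without knowing whether the file is local (UFS)
 * or remote (NFS). */
struct vnode;

struct vnodeops {
    int (*vop_lookup)(struct vnode *dir, const char *name,
                      struct vnode **result);
    int (*vop_read)(struct vnode *vp, void *buf,
                    unsigned long len, unsigned long offset);
    int (*vop_write)(struct vnode *vp, const void *buf,
                     unsigned long len, unsigned long offset);
};

struct vnode {
    const struct vnodeops *v_ops;  /* per-file-system operations */
    void *v_data;                  /* private data: inode, NFS handle, ... */
};
```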
NFS File Handles
On clients, files are represented by vnodes
The client internally represents remote files as handles
Opaque to the client, but meaningful to the server
To name a remote file, provide its handle to the server
NFS Handle Diagram
[Diagram: a user process holds a file descriptor; at the client's VFS level this maps to a vnode, whose NFS level stores the file handle; the handle travels to the NFS server, whose VFS level maps it to a UFS inode]
How to make this work?
Could integrate it into the kernel: non-portable, non-distributable
Instead, use existing features to do the work:
VFS for the common interface
RPC for data transport
Using RPC for NFS
Must have some process at the server that answers the RPC requests: a continuously running daemon process
Must also somehow perform mounts across machine boundaries: a second daemon process handles this
NFS Processes
nfsd daemons—server daemons that accept RPC calls for NFS
rpc.mountd daemons—server daemons that handle mount requests
biod daemons—optional client daemons that can improve performance
NFS from the Client’s Side
User issues a normal file operation, like read()
It passes through the vnode interface to the client-side NFS implementation
The client-side NFS implementation formats and sends an RPC packet to perform the operation
The client blocks until the RPC returns
NFS RPC Procedures
16 RPC procedures implement NFS
Some for files, some for file systems
Including directory ops, link ops, read, write, etc.
Lookup() is the key operation, because it fetches handles
Other NFS file operations use the handle
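To make Lookup()'s role concrete, here is a hedged C sketch of path resolution by repeated lookups; nfs_fh, nfs_lookup, and resolve are illustrative stand-ins, not the real RPC stubs:

```c
#include <string.h>

/* Illustrative NFS-style handle: opaque bytes to the client,
 * meaningful only to the server. */
struct nfs_fh { unsigned char data[32]; };

/* Stand-in for the LOOKUP RPC: given a directory's handle and one
 * path component, the server returns the component's handle. */
int nfs_lookup(const struct nfs_fh *dir, const char *name,
               struct nfs_fh *out);

/* Resolve "a/b/c" by repeated lookups, starting from the mount's
 * "primal" handle; every later operation names the file by handle. */
int resolve(const struct nfs_fh *root, char *path, struct nfs_fh *out)
{
    struct nfs_fh cur = *root;
    for (char *comp = strtok(path, "/"); comp != NULL;
         comp = strtok(NULL, "/")) {
        struct nfs_fh next;
        if (nfs_lookup(&cur, comp, &next) != 0)
            return -1;          /* no such component, or stale handle */
        cur = next;
    }
    *out = cur;
    return 0;
}
```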
Mount Operations
Must mount an NFS file system on the client before you can use it
Requires local and remote operations
Local ops indicate the mount point has an NFS-type VFS at that point in the hierarchy
Remote operations go to the remote rpc.mountd
Mount provides the "primal" file handle
NFS on the Server Side
The server side is represented by the local VFS actually storing the data
Plus the rpc.mountd and nfsd daemons
NFS is stateless: servers do not keep track of clients
Each NFS operation must be self-contained (from the server's point of view)
Implications of Statelessness
Self-contained NFS RPC requests
NFS operations should be idempotent
NFS should use a stateless transport protocol (e.g., UDP)
Servers don’t worry about client crashes
Server crashes won’t leave junk
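As an illustration of what "self-contained" means here, a sketch of a stateless write request; the layout is invented for the example, not the real XDR wire format:

```c
/* Invented layout (the real protocol uses XDR encoding). A stateless
 * write names everything explicitly: which file, where, how much.
 * There is no server-side open-file state or "current offset", so
 * resending the request after a lost reply writes the same bytes to
 * the same place: the operation is idempotent. */
struct nfs_fh { unsigned char data[32]; };  /* opaque file handle */

struct nfs_write_args {
    struct nfs_fh file;        /* which file (server-interpreted) */
    unsigned long offset;      /* absolute position in the file */
    unsigned long count;       /* number of bytes to write */
    unsigned char data[8192];  /* the data itself */
};
```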
More Implications of Statelessness
Servers don't know what files clients think are open
Unlike in UFS, LFS, and most local VFS file systems
Makes it much harder to provide certain semantics
Scales nicely, though
Preserving UNIX File Operation Semantics
NFS works hard to provide identical semantics to local UFS operations
Some of this is tricky, especially given the statelessness of the server
E.g., how do you avoid discarding the pages of an unlinked file a client has open?
Sleazy NFS Tricks
Used to provide desired semantics despite statelessness of the server
E.g., if a client unlinks an open file, send a rename to the server rather than a remove
Perform the actual remove when the file is closed (see the sketch below)
Won't work if the file is removed on the server
Won't work with cooperating clients
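A sketch of the trick, with hypothetical client-side helpers (nfs_rename, nfs_remove, and file_is_open_locally stand in for the real client internals):

```c
#include <stdio.h>

/* Hypothetical stubs for the RENAME and REMOVE RPCs and for the
 * client's own open-file accounting. */
int nfs_rename(const char *from, const char *to);
int nfs_remove(const char *path);
int file_is_open_locally(const char *path);

/* If a locally open file is unlinked, rename it on the server to a
 * hidden temporary name so the bits survive; do the real remove at
 * last close. (Real clients use hidden names like ".nfsXXXX".) */
int client_unlink(const char *path)
{
    if (file_is_open_locally(path)) {
        static unsigned counter;
        char hidden[32];
        snprintf(hidden, sizeof hidden, ".nfs%08u", counter++);
        return nfs_rename(path, hidden);   /* defer the remove */
    }
    return nfs_remove(path);
}
```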
File Handles
The method clients use to identify files
Created by the server on file lookup
Must uniquely map a server file identifier to a universal identifier
File handles become invalid when the server frees or reuses the inode
The inode generation number in the handle shows when it is stale
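A minimal sketch under an assumed handle layout: the server embeds the inode's generation number in the handle, so a freed-and-reused inode makes old handles detectably stale:

```c
#include <stdbool.h>

/* How the *server* might interpret a handle (illustrative layout);
 * to the client these are just opaque bytes. */
struct srv_handle {
    unsigned long fsid;        /* which exported file system */
    unsigned long inode_num;   /* which inode within it */
    unsigned long generation;  /* bumped each time the inode is reused */
};

/* Stand-in for the server's in-core inode. */
struct inode {
    unsigned long number;
    unsigned long generation;
};

/* Stale if the inode was freed and reused since the handle was
 * issued: same inode number, different generation. */
bool handle_is_stale(const struct srv_handle *h, const struct inode *ip)
{
    return h->inode_num != ip->number || h->generation != ip->generation;
}
```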
nfsd Daemon
Handles incoming RPC requests
Often multiple nfsd daemons per site
An nfsd daemon makes kernel calls to do the real work
Allows multiple threads
biod Daemon
Does readahead for clients, to make use of the kernel file buffer cache
Only improves performance: NFS works correctly without the biod daemon
Also flushes buffered writes for clients
rpc.mountd Daemon
Runs on the server to handle VFS-level operations for NFS
Particularly remote mount requests
Provides the initial file handle for a remote volume
Also checks that incoming requests come from privileged ports (in the UDP/IP packet's source address)
rpc.lockd Daemon
The NFS server is stateless, so it does not handle file locking
rpc.lockd provides locking
It runs on both client and server
The client side catches lock requests and forwards them to the server daemon
rpc.lockd handles lock recovery when the server crashes
rpc.statd Daemon
Also runs on both client and server
Used to check the status of a machine
The server's rpc.lockd asks rpc.statd to store permanent lock information (in the file system)
And to monitor the status of the locking machine
If a client crashes, its locks are cleared from the server
Recovering Locks After a Crash
If the server crashes and recovers, its rpc.lockd contacts clients to reestablish locks
If a client crashes, rpc.statd contacts the client when it becomes available again
The client has a short grace period to revalidate its locks; then they're cleared
What Can You Cache, Cont'd?
File attributes are specially cached by NFS
Directory attributes are handled a little differently than file attributes
Especially important because many programs get and set attributes frequently
Security in NFS
NFS inherits the RPC mechanism's security
Some RPC mechanisms provide decent security; some don't
Mount security is provided by knowing which ports are permitted to mount what
The Andrew File System
A different approach to remote file access
Meant to service a large organization, such as a university campus
Scaling is a major goal
Basic Andrew Model
Files are stored permanently at file server machines
Users work from workstation machines, with their own private namespace
Andrew provides mechanisms to cache users' files from the shared namespace
User Model of AFS Use
Sit down at any AFS workstation anywhere
Log in and authenticate who I am
Access all files without regard to which workstation I'm using
The Local Namespace
Each workstation stores a few files
Mostly system programs and configuration files
Workstations are treated as generic, interchangeable entities
Virtue and Vice
Vice is the system run by the file servers: a distributed system
Virtue is the protocol client workstations use to communicate with Vice
Overall Architecture
System is viewed as a WAN composed of LANs
Each LAN has a Vice cluster server, which stores local files
But Vice makes all files available to all clients
Caching the User Files
Goal is to offload work from servers to clients
When must servers do work?
To answer requests
To move data
Whole files are cached at clients
Why Whole-file Caching?
Minimizes communication with the server
Most files are used in their entirety, anyway
Easier cache management problem
Requires substantial free disk space on workstations
Doesn't address huge-file problems
The Shared Namespace
An Andrew installation has a globally shared namespace
All clients see the files in the namespace under the same names
A high degree of name and location transparency
How do servers provide the namespace?
Files are organized into volumes
Volumes are grafted together into the overall namespace
Each file has a globally unique ID
Volumes are stored at individual servers
But a volume can be moved from server to server
Finding a File
At a high level, files have names
A directory translates a name to a unique ID
If the client knows where the volume is, it simply sends the unique ID to the appropriate server
Finding a Volume
What if you enter a new volume?
How do you find which server stores the volume?
A volume-location database is stored on each server
Once information on a volume is known, the client caches it
Moving a Volume
When a volume moves from server to server, update the database
A heavyweight distributed operation
What about clients with cached information?
The old server maintains forwarding information
This also eases server update
Handling Cached Files
Files are fetched transparently when needed
The file system traps opens and sends them to the local Venus process
The Venus Daemon
Responsible for handling a single client's cache
Caches files on open
Writes modified versions back on close
Cached files are saved locally after close
Caches directory entry translations, too
Consistency for AFS
If my workstation has a locally cached copy of a file, what if someone else changes it?
Callbacks are used to invalidate my copy
Requires servers to keep information on who caches files
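A sketch of the callback bookkeeping, with invented names; the point is that the server must remember, per file, every client that holds a cached copy, and "break" those callbacks when the file changes:

```c
/* Invented server-side structures; not Vice's actual code. */
#define MAX_CALLBACKS 64

struct file_callbacks {
    int client_ids[MAX_CALLBACKS];  /* clients promised a callback */
    int nclients;
};

/* Stand-in for the RPC telling a client its cached copy is invalid. */
void send_callback_break(int client_id);

/* When a new version of the file is stored, break every outstanding
 * callback so stale cached copies get discarded. */
void file_updated(struct file_callbacks *cb)
{
    for (int i = 0; i < cb->nclients; i++)
        send_callback_break(cb->client_ids[i]);
    cb->nclients = 0;   /* no one holds a valid cached copy now */
}
```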
Write Consistency in AFS
What if I write to my cached copy of a file?
Need to get write permission from the server, which invalidates other copies
Permission is obtained on open-for-write
Need to obtain the new data at this point
Write Consistency in AFS, Cont'd
Initially, writes go only to the local copy
On close, Venus sends the update to the server
Extra mechanisms handle failures
Storage of Andrew Files
Stored in UNIX file systems
The client cache is a directory on the local machine
Low-level names do not match Andrew names
Venus Cache Management
Venus keeps two caches: status and data
The status cache is kept in virtual memory, for fast attribute lookup
The data cache is kept on disk
Venus Process Architecture
Venus is a single user-level process, but multithreaded
It uses RPC to talk to the server
RPC is built on a low-level datagram service
AFS Security
Only the servers/Vice are trusted here
Client machines might be corrupted
No client programs run on Vice machines
Clients must authenticate themselves to servers
Encrypted transmissions
AFS File Protection
AFS supports access control lists
Each file has a list of users who can access it, and the permitted modes of access
Maintained by Vice
Used to mimic UNIX access control
AFS Read-only Replication
For volumes containing files that are used frequently but not changed often, e.g., executables
AFS allows multiple servers to store read-only copies
Replicated File Systems
NFS provides remote access
AFS provides high-quality caching
Why isn't this enough?
More precisely, when isn't this enough?
When Do You Need Replication?
For write performance
For reliability
For availability
For mobile computing
For load sharing
Optimistic replication increases these advantages
Some Replicated File Systems
Locus, Ficus, Coda, Rumor
All optimistic: few conservative file replication systems have been built
Ficus
Optimistic file replication based on a peer-to-peer model
Built in a Unix context
Meant to service a large network of workstations
Built using stackable layers
Peer-to-peer Replication
All replicas are equal
No replicas are masters or servers
All replicas can provide any service
All replicas can propagate updates to all other replicas
Client/server is the other popular model
Basic Ficus Architecture
Ficus replicates at volume granularity
A given volume can be replicated many times
Performance limits the scale
Updates are propagated as they occur, on a single best-effort basis
Consistency is achieved by periodic reconciliation
Stackable Layers in Ficus
Ficus is built out of stackable layers
Exact composition depends on what generation of system you look at
Reconciliation in Ficus
The reconciliation process runs periodically on each Ficus site, for each local volume replica
The reconciliation strategy implies an eventual consistency guarantee
The frequency of reconciliation affects how long "eventually" takes
Steps in Reconciliation
1. Get information about the state of a remote replica
2. Get information about the state of the local replica
3. Compare the two sets of information
4. Change local replica to reflect remote changes
Gossiping and Reconciliation
Reconciliation benefits from the use of gossip
For example, an update originating at A can reach B through communications between B and C
So B can get the update without talking to A directly
Benefits of Gossiping
Potentially less communication
Shares the load of sending updates
Easier recovery behavior
Handles disconnections nicely
Handles mobile computing nicely
Peer-model systems get more benefit than client/server-model systems
Reconciliation Topology
Reconciliation in Ficus is pair-wise
In the general case, which pairs of replicas should reconcile?
Reconciling all pairs is unnecessary, due to gossip
Want to minimize the number of recons, but propagate data quickly
Problems in File Reconciliation
Recognizing updates
Recognizing update conflicts
Handling conflicts
Recognizing name conflicts
Update/remove conflicts
Garbage collection
Ficus has solutions for all these problems
Recognizing Updates in Ficus
Ficus keeps per-file version vectors
Updates are detected by version vector comparisons
The data for the later version can then be propagated
Ficus propagates full files
Recognizing Update Conflicts
Concurrent updates can lead to update conflicts
Version vectors permit detection of update conflicts
Works for n-way conflicts, too
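A minimal, runnable sketch of version vector comparison (the fixed replica count is an assumption): one vector dominates if it is at least as large in every slot; if each vector is larger somewhere, the updates were concurrent and conflict:

```c
#include <stdio.h>

#define NREPLICAS 3   /* assumed fixed replica set for illustration */

enum vv_order { VV_EQUAL, VV_DOMINATES, VV_DOMINATED, VV_CONFLICT };

/* Compare two per-file version vectors: a[i] counts the updates the
 * file has received at replica i. */
enum vv_order vv_compare(const int a[NREPLICAS], const int b[NREPLICAS])
{
    int a_bigger = 0, b_bigger = 0;
    for (int i = 0; i < NREPLICAS; i++) {
        if (a[i] > b[i]) a_bigger = 1;
        if (b[i] > a[i]) b_bigger = 1;
    }
    if (a_bigger && b_bigger) return VV_CONFLICT;  /* concurrent updates */
    if (a_bigger) return VV_DOMINATES;  /* a is strictly newer: propagate a */
    if (b_bigger) return VV_DOMINATED;  /* b is strictly newer: propagate b */
    return VV_EQUAL;
}

int main(void)
{
    int mine[NREPLICAS]   = {2, 1, 0};
    int theirs[NREPLICAS] = {1, 2, 0};  /* each saw an update the other missed */
    printf("conflict? %s\n",
           vv_compare(mine, theirs) == VV_CONFLICT ? "yes" : "no");
    return 0;
}
```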
Handling Update Conflicts
Ficus uses resolver programs to handle conflicts
Resolvers work on one pair of replicas of one file
System attempts to deduce file type and call proper resolver
If all resolvers fail, notify the user
Ficus also blocks access to the file
Handling Directory Conflicts
Directory updates have very limited semantics
So directory conflicts are easier to deal with
Ficus uses in-kernel mechanisms to automatically fix most directory conflicts
How Did This Directory Get Into This State?
If we could figure out what operations were performed on each side that caused each replica to enter this state,
We could produce a merged version
But there are several possibilities
Possibility 1
1. Earth and Mars exist
2. Create Saturn at replica 1
3. Create Sedna at replica 2
The correct result is a directory containing Earth, Mars, Saturn, and Sedna
The Create/Delete Ambiguity
This is an example of a general problem with replicated data
It cannot be solved with per-file version vectors
It requires per-entry information
Ficus keeps such information
Removed files' entries must be saved for a while
Possibility 2
1. Earth, Mars, and Saturn exist
2. Delete Saturn at replica 2
3. Create Sedna at replica 2
The correct result is a directory containing Earth, Mars, and Sedna
And there are other possibilities
Recognizing Name Conflicts
Name conflicts occur when two different files are concurrently given the same name
Ficus recognizes them with its per-entry directory info
Then what?
Handle them similarly to update conflicts
Add disambiguating suffixes to the names
Internal Representation of Problem Directory
[Diagram: replica 1's directory holds entries Earth, Mars, and Saturn; replica 2's holds Earth, Mars, Saturn, and Sedna]
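A sketch of the per-entry information that resolves such ambiguities (the fields are illustrative, not Ficus's actual format): because a removed entry leaves a tombstone behind, "deleted at this replica" and "never created at this replica" are distinguishable:

```c
/* Illustrative per-entry directory metadata. Keeping a tombstone for
 * removed entries lets reconciliation distinguish "deleted here"
 * from "created at the other replica since our last exchange". */
enum entry_status { ENTRY_LIVE, ENTRY_TOMBSTONE };

struct dir_entry {
    char name[256];
    unsigned long file_id;       /* globally unique file ID */
    enum entry_status status;    /* live, or removed but remembered */
    unsigned long entry_version; /* per-entry (not per-file) version info */
};
```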
Update/remove Conflicts
Consider the case where the file "Saturn" has two replicas
1. Replica 1 receives an update
2. Replica 2 is removed
What should happen?
A matter of system semantics, basically
Ficus’ No-lost-updates Semantics
Ficus handles this problem by defining its semantics to be no-lost-updates
In other words, the update must not disappear
But the remove must happen
Put "Saturn" in the orphanage
Requires temporarily saving removed files
Removals and Hard Links
Unix and Ficus support hard links: effectively, multiple names for a file
A file's bits cannot be removed until the last hard link to the file is removed
Tricky in a distributed system
Link Example, Part III
[Diagram: replicas 1 and 2 each hold foodir with links red and blue; one replica deletes foodir/blue while the other creates a hard link to blue in bardir]
What Should Happen Here?
Clearly, the link named foodir/blue should disappear
But what version of the data should the bardir link point to?
No-lost-update semantics say it must be the update at replica 1
Garbage Collection in Ficus
Ficus cannot throw away removed things at once:
Directory entries
Updated files, for no-lost-updates
Non-updated files, due to hard links
When can Ficus reclaim the space these use?
When Can I Throw Away My Data?
Not until all links to the file disappear
This is global information, not local
Moreover, just because I know all links have disappeared doesn't mean I can throw everything away
Must wait until everyone knows
Requires two trips around the ring
Why Can't I Forget When I Know There Are No Links?
I can throw the data away: I don't need it, and nobody else does either
But I can't forget that I knew this, because not everyone knows it
For them to throw their data away, they must learn
So I must remember, for their benefit
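A runnable toy simulation of the two-trip rule on a ring (the site count and bookkeeping are invented): the first trip spreads the fact that all links are gone; the second spreads the fact that everyone knows it, after which the remembered state can finally be discarded:

```c
#include <stdio.h>
#include <stdbool.h>

#define NSITES 4   /* assumed ring of four replicas */

int main(void)
{
    /* Site 0 discovers that the last link to the file is gone. */
    bool knows_links_gone[NSITES] = { true };
    bool knows_all_know[NSITES]   = { false };

    /* Trip 1: the fact travels around the ring; each site can now
     * discard the file's data, but must remember that it did. */
    for (int i = 1; i < NSITES; i++)
        knows_links_gone[i] = knows_links_gone[i - 1];

    /* The fact has come all the way around, so site 0 now knows
     * that everyone knows; trip 2 spreads that second fact, after
     * which each site may forget the bookkeeping entirely. */
    knows_all_know[0] = knows_links_gone[NSITES - 1];
    for (int i = 1; i < NSITES; i++)
        knows_all_know[i] = knows_all_know[i - 1];

    for (int i = 0; i < NSITES; i++)
        printf("site %d: may forget bookkeeping: %s\n",
               i, knows_all_know[i] ? "yes" : "no");
    return 0;
}
```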
Coda
A different approach to optimistic replication
Inherits a lot from Andrew
Basically a client/server solution
Developed at CMU
Coda Replication Model
Files stored permanently at server machines
Client workstations download temporary replicas, not cached copies
Clients can perform updates without getting a token from the server
So concurrent updates are possible
Detecting Concurrent Updates
Workstation replicas only reconcile with their server
At recon time, they compare their state of the files with the server's state, detecting any problems
Since workstations don’t gossip, detection is easier than in Ficus
Handling Concurrent Updates
The basic strategy is similar to Ficus'
Resolver programs are called to deal with conflicts
Coda allows resolvers to deal with multiple related conflicts at once
It also has some other refinements to conflict resolution
Server Replication in Coda
Unlike Andrew, writable copies of a file can be stored at multiple servers
Servers have peer-to-peer replication
Servers have strong connectivity and crash infrequently
Thus, Coda can use simpler peer-to-peer algorithms than Ficus must
Why Is Coda Better Than AFS?
Writes don't lock the file
Writes happen quicker
More local autonomy
Less write traffic on the network
Workstations can be disconnected
Better load sharing among servers
Comparing Coda to Ficus
Coda uses simpler algorithms
Less likely to have bugs
Less likely to have performance problems
Coda doesn't allow client gossiping
Coda has built-in security
Coda's garbage collection is simpler
Serverless Network File Systems
New network technologies are much faster, with much higher bandwidth
In some cases, going over the net is quicker than going to local disk
How can we improve file systems by taking advantage of this change?
Fundamental Ideas of xFS
Peer workstations providing file service for each other
High degree of location independence
Make use of all machines' caches
Provide reliability in case of failures
xFS
Developed at Berkeley
Inherits ideas from several sources:
LFS
Zebra (RAID-like ideas)
Multiprocessor cache consistency
Built for the Network of Workstations (NOW) environment
What Does a File Server Do?
Stores file data blocks on its disks
Maintains file location information
Maintains a cache of data blocks
Manages cache consistency for its clients
xFS Must Provide These Services
In essence, every machine takes on some of the server’s responsibilities
Any data or metadata might be located at any machine
The key challenge is providing, in a distributed system, the same services a centralized server provided
Key xFS Concepts
Metadata managers
Stripe groups for data storage
Cooperative caching
Distributed cleaning processes
How Do I Locate a File in xFS?
I've got a file name, but where is it? (Assuming it's not locally cached)
The file's directory converts the name to a unique index number
Consult the metadata manager to find out where the file with that index number is stored, via the manager map
The Manager Map
A data structure that allows translation of index numbers to file managers
Not necessarily file locations
Kept by each metadata manager
A globally replicated data structure
It simply says which machine manages the file
Using the Manager Map
Look up the index number in the local map
Index numbers are clustered, so there are many fewer entries than files
Send the request to the responsible manager
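A sketch with invented types and sizes: because index numbers are assigned in clusters, the globally replicated map needs only one entry per cluster of files, not one per file:

```c
/* Invented manager-map layout. Index numbers are handed out in
 * clusters, so one entry covers a whole cluster of files; the map
 * names a manager machine, not a file location. */
#define CLUSTER_SIZE 1024U  /* assumed index numbers per cluster */
#define NCLUSTERS    4096U

/* manager_map[c] = ID of the machine managing cluster c; this small
 * array is what gets replicated at every site. */
static int manager_map[NCLUSTERS];

int manager_for(unsigned long index_number)
{
    return manager_map[index_number / CLUSTER_SIZE];
}
```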
What Does the Manager Do?
The manager keeps two types of information:
1. imap information
2. caching information
If some other site has the file in its cache, tell the requester to go to that site
Always use a cache before disk, even if the cache is remote
What if No One Caches the Block?
The metadata manager for this file must then consult its imap
The imap tells which disks store the data block
Files are striped across disks on multiple machines
Typically, a single block is on one disk
Writing Data
xFS uses RAID-like methods to store data
RAID performs badly for small writes
So xFS avoids small writes by using LFS-style operations
Batch writes until you have a full stripe's worth
Stripe Groups
A set of disks that cooperatively store data in RAID fashion
xFS uses a single parity disk per group
An alternative to striping all data across all disks
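A runnable sketch of the parity arithmetic (block size and group width are assumptions): the parity block is the byte-wise XOR of the data blocks, so any single lost block can be rebuilt from the survivors:

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed block size */
#define GROUP_WIDTH 4     /* assumed data disks per stripe group */

/* Parity is the byte-wise XOR of the data blocks in the stripe. */
void compute_parity(unsigned char data[GROUP_WIDTH][BLOCK_SIZE],
                    unsigned char parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < GROUP_WIDTH; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];
}

int main(void)
{
    static unsigned char data[GROUP_WIDTH][BLOCK_SIZE];
    static unsigned char parity[BLOCK_SIZE];
    data[2][0] = 0x5a;                 /* some data to protect */
    compute_parity(data, parity);

    /* "Lose" disk 2, then rebuild its first byte from the others. */
    unsigned char rebuilt = parity[0];
    for (int d = 0; d < GROUP_WIDTH; d++)
        if (d != 2)
            rebuilt ^= data[d][0];
    printf("rebuilt byte: 0x%02x\n", rebuilt);   /* prints 0x5a */
    return 0;
}
```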
Cooperative Caching
Each site's cache can service requests from all other sites
Working from the assumption that network access is quicker than disk access
Metadata managers are used to keep track of where data is cached
So remote cache access takes 3 network hops
Getting a Block from a Remote Cache
[Diagram: (1) the client sends a block request to the metadata server, which holds the manager map and cache consistency state; (2) the metadata server forwards the request to the caching site; (3) the caching site returns the block from its Unix cache to the client]
Providing Cache Consistency
Per-block token consistency
To write a block, a client requests the token from the metadata server
The metadata server retrieves the token from whoever has it, and invalidates other caches
The writing site keeps the token
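A sketch of the token protocol with invented names: the metadata server serializes writers by moving a single per-block token and invalidating other cached copies as it goes:

```c
/* Invented per-block token state kept at the metadata server. One
 * writer token exists per block; readers' cached copies are
 * invalidated when the token moves. */
#define NO_HOLDER (-1)
#define MAX_SITES 32

struct block_token {
    int holder;                 /* site currently allowed to write */
    int cachers[MAX_SITES];     /* sites with read copies cached */
    int ncachers;
};

void invalidate_cache(int site);   /* stand-in RPCs */
void recall_token(int site);

/* A client asks to write: pull the token from the current holder,
 * invalidate all other cached copies, then grant the token. The
 * writing site keeps it until someone else asks. */
void grant_write_token(struct block_token *t, int requester)
{
    if (t->holder != NO_HOLDER && t->holder != requester)
        recall_token(t->holder);
    for (int i = 0; i < t->ncachers; i++)
        if (t->cachers[i] != requester)
            invalidate_cache(t->cachers[i]);
    t->ncachers = 0;
    t->holder = requester;
}
```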
Which Sites Should Manage Which Files?
Could randomly assign an equal number of file index groups to each site
Better if the site using a file also manages it
In particular, if the most frequent writer manages it
Can reduce network traffic by ~50%
Cleaning Up
File data (and metadata) is stored in log structures spread across machines
A distributed cleaning method is required
Each machine stores info on its usage of stripe groups
Each cleans up its own mess