
Posted on 20-Dec-2015







CS 843 - Distributed Computing Systems Chapter 8: Distributed File Systems

Chin-Chih Chang, chang@cs.twsu.edu

From Coulouris, Dollimore and Kindberg

Distributed Systems: Concepts and Design

Edition 3, © Addison-Wesley 2001



Distributed File System

-- File Service Architecture

-- Sun NFS

-- Andrew File System

-- Advances and Summary





Introduction

Distributed file systems:

-- share information in the form of files throughout an intranet

-- provide access to files stored at a server in a way similar to files on a local disk



Figure 8.1 Storage systems and their properties

Properties compared include sharing and persistence. Examples of each type of storage system:

Main memory: RAM

File system: UNIX file system

Distributed file system: Sun NFS

Web: Web server

Distributed shared memory: Ivy (Ch. 16)

Remote objects (RMI/ORB): CORBA

Persistent object store: CORBA Persistent Object Service

Persistent distributed object store: PerDiS, Khazana






Introduction (cont…)

Characteristics of file systems

-- files contain data and attributes

-- a file system is responsible for storing and managing large numbers of files, and for controlling access to them

-- in non-distributed systems, a file system is built as a layered set of modules



Figure 8.2 File system modules

Directory module: relates file names to file IDs

File module: relates file IDs to particular files

Access control module: checks permission for operation requested

File access module: reads or writes file data or attributes

Block module: accesses and allocates disk blocks

Device module: disk I/O and buffering



Figure 8.3 File attribute record structure

File length

Creation timestamp

Read timestamp

Write timestamp

Attribute timestamp

Reference count

Owner

File type

Access control list



Figure 8.4 UNIX file system operations

filedes = open(name, mode)
filedes = creat(name, mode)
Opens an existing file with the given name. Creates a new file with the given name. Both operations deliver a file descriptor referencing the open file. The mode is read, write or both.

status = close(filedes): Closes the open file filedes.

count = read(filedes, buffer, n)
count = write(filedes, buffer, n)
Transfers n bytes from the file referenced by filedes to buffer. Transfers n bytes to the file referenced by filedes from buffer. Both operations deliver the number of bytes actually transferred and advance the read-write pointer.

pos = lseek(filedes, offset, whence): Moves the read-write pointer to offset (relative or absolute, depending on whence).

status = unlink(name): Removes the file name from the directory structure. If the file has no other names, it is deleted.

status = link(name1, name2): Adds a new name (name2) for a file (name1).

status = stat(name, buffer): Gets the file attributes for file name into buffer.
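To make the semantics of Figure 8.4 concrete, here is a minimal C sketch, assuming a POSIX system; the function names write_then_seek_and_read and link_count_with_alias are hypothetical. It shows that write() and read() deliver the number of bytes transferred and advance the read-write pointer, that lseek() repositions that pointer, and that link() adds a second name which stat() reports in the file's link count.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates path containing "hello, world", reopens it, moves the read-write
   pointer with lseek() and reads the rest into out (NUL-terminated).
   Returns the count delivered by read(), or -1 on error. */
long write_then_seek_and_read(const char *path, char *out, size_t outsz)
{
    static const char msg[] = "hello, world";
    assert(outsz > 0);

    int fd = creat(path, 0644);               /* new (or truncated) file, write-only */
    if (fd < 0) return -1;
    ssize_t n = write(fd, msg, strlen(msg));  /* advances the read-write pointer */
    close(fd);
    if (n != (ssize_t)strlen(msg)) return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (lseek(fd, 7, SEEK_SET) != 7) {        /* absolute position: skip "hello, " */
        close(fd);
        return -1;
    }
    ssize_t got = read(fd, out, outsz - 1);   /* stops at end of file */
    close(fd);
    if (got < 0) return -1;
    out[got] = '\0';
    return (long)got;
}

/* Adds alias as a second name for path with link(), reports the link count
   seen by stat(), then removes alias with unlink(); path still names the file. */
long link_count_with_alias(const char *path, const char *alias)
{
    struct stat st;
    unlink(alias);                            /* ignore failure: alias may not exist yet */
    if (link(path, alias) != 0) return -1;
    if (stat(path, &st) != 0) {
        unlink(alias);
        return -1;
    }
    long nlink = (long)st.st_nlink;
    unlink(alias);
    return nlink;
}
```

For example, write_then_seek_and_read("/tmp/demo.txt", buf, sizeof buf) should return 5 with buf holding "world", because lseek() skips the first 7 bytes of "hello, world".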



Introduction (cont…)

Distributed file system requirements

-- Transparency

-- Concurrent file updates

-- File replication

-- Hardware and OS heterogeneity

-- Fault tolerance






Introduction (cont…)


Access Transparency

Location Transparency

Mobility Transparency

Performance Transparency

Scaling Transparency



Introduction (cont…)

Case studies

-- Abstract file service architecture model

-- Sun NFS (Network File System): key interfaces published; access transparency; support for hardware and OS heterogeneity

-- Andrew File System (from CMU): supports information sharing on a large scale by minimizing client-server communication



File Service Architecture

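Flat file service

-- operations on the contents of files

-- files are identified by unique file identifiers (UFIDs)

Directory service

-- maps text names to UFIDs

-- acts as a client of the flat file service

Client module

-- integrates and extends the operations of the flat file service and the directory service under a single application programming interface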



Figure 8.5 File service architecture

(Figure: the client module runs on the client computer; the flat file service and the directory service run on the server computer.)



Figure 8.6 Flat file service interface

Read(FileId, i, n) -> Data (throws BadPosition)
If 1 ≤ i ≤ Length(File): reads a sequence of up to n items from a file starting at item i and returns it in Data.

Write(FileId, i, Data) (throws BadPosition)
If 1 ≤ i ≤ Length(File)+1: writes a sequence of Data to a file, starting at item i, extending the file if necessary.

Create() -> FileId: Creates a new file of length 0 and delivers a UFID for it.

Delete(FileId): Removes the file from the file store.

GetAttributes(FileId) -> Attr: Returns the file attributes for the file.

SetAttributes(FileId, Attr): Sets the file attributes (only the attributes set by the owner; those maintained by the system, shaded in Figure 8.3, cannot be set).
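The interface in Figure 8.6 can be illustrated with a toy, in-memory flat file service in C. This is a sketch under assumptions of our own: items are bytes, files live in a fixed-size table, a FileId is simply the table index, and the ffs_* names are hypothetical. It enforces the BadPosition rules as stated: Read needs 1 ≤ i ≤ Length(File), while Write also accepts i = Length(File)+1 so that it can extend the file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy store: items are bytes, FileIds are table indexes. */
#define FFS_MAX_FILES 16
#define FFS_MAX_LEN   4096

enum { FFS_OK = 0, FFS_BAD_POSITION = -1, FFS_BAD_FILE = -2 };

typedef struct {
    int in_use;
    size_t len;                          /* Length(File), in items */
    unsigned char data[FFS_MAX_LEN];
} FfsFile;

static FfsFile store[FFS_MAX_FILES];

static FfsFile *ffs_get(int fid)
{
    if (fid < 0 || fid >= FFS_MAX_FILES || !store[fid].in_use) return NULL;
    return &store[fid];
}

/* Create() -> FileId: a new file of length 0. */
int ffs_create(void)
{
    for (int fid = 0; fid < FFS_MAX_FILES; fid++) {
        if (!store[fid].in_use) {
            store[fid].in_use = 1;
            store[fid].len = 0;
            return fid;
        }
    }
    return FFS_BAD_FILE;                 /* store is full */
}

/* Delete(FileId) */
int ffs_delete(int fid)
{
    FfsFile *f = ffs_get(fid);
    if (f == NULL) return FFS_BAD_FILE;
    f->in_use = 0;
    return FFS_OK;
}

/* Read(FileId, i, n) -> Data: requires 1 <= i <= Length(File);
   copies up to n items starting at item i and returns how many. */
long ffs_read(int fid, size_t i, size_t n, unsigned char *out)
{
    FfsFile *f = ffs_get(fid);
    if (f == NULL) return FFS_BAD_FILE;
    if (i < 1 || i > f->len) return FFS_BAD_POSITION;
    assert(out != NULL);
    size_t avail = f->len - (i - 1);
    if (n > avail) n = avail;
    memcpy(out, f->data + (i - 1), n);
    return (long)n;
}

/* Write(FileId, i, Data): requires 1 <= i <= Length(File)+1 and extends the
   file if necessary. Exceeding the toy capacity is also reported as BadPosition. */
int ffs_write(int fid, size_t i, const unsigned char *data, size_t n)
{
    FfsFile *f = ffs_get(fid);
    if (f == NULL) return FFS_BAD_FILE;
    if (i < 1 || i > f->len + 1 || n > FFS_MAX_LEN - (i - 1)) return FFS_BAD_POSITION;
    assert(data != NULL);
    memcpy(f->data + (i - 1), data, n);
    if ((i - 1) + n > f->len) f->len = (i - 1) + n;
    return FFS_OK;
}
```

Because Write names its position explicitly, repeating ffs_write(f, 4, data, 2) leaves the file unchanged the second time; that repeatability is what allows at-least-once RPC semantics.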



File Service Architecture (cont…)

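Comparison with the UNIX file system interface

-- functionally equivalent

-- no open and close operations

-- Read and Write specify an explicit starting position instead of using a read-write pointer

-- operations (except Create) are idempotent, allowing at-least-once RPC semantics

-- the server can be stateless

Access control

-- performed at the server; with no open operation, access rights are checked on every request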
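Since the flat file service has no open or close and keeps no read-write pointer, the UNIX-style pointer has to live in the client module. The C sketch below assumes a hypothetical server_read() standing in for the remote Read(FileId, i, n) call, with the file contents passed in directly. The client keeps the pointer and sends an explicit position with each request, so repeating a request returns the same data.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the server's stateless Read(FileId, i, n): the file
   contents are passed in directly, items are bytes and i is 1-based.
   Returns the number of items copied, or -1 for BadPosition. */
static long server_read(const char *file, size_t i, size_t n, char *out)
{
    size_t len = strlen(file);
    if (i < 1 || i > len) return -1;
    size_t avail = len - (i - 1);
    if (n > avail) n = avail;
    memcpy(out, file + (i - 1), n);
    return (long)n;
}

/* State kept by the client module, never by the server. */
typedef struct {
    const char *file;   /* stands in for the FileId */
    size_t pos;         /* 1-based read-write pointer */
} OpenFile;

/* UNIX-style read(): the client supplies the position explicitly on every
   request and advances its own pointer by the number of items returned. */
long client_read(OpenFile *of, char *buf, size_t n)
{
    assert(of != NULL && of->pos >= 1);
    long got = server_read(of->file, of->pos, n, buf);
    if (got < 0) return 0;   /* BadPosition past the end: report end of file, as read() does */
    of->pos += (size_t)got;
    return got;
}
```

Reading "abcdef" four bytes at a time yields 4, then 2, then 0, matching what a program using the UNIX read() would see.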



Figure 8.7 Directory service interface

Lookup(Dir, Name) -> FileId (throws NotFound)
Locates the text name in the directory and returns the relevant UFID. If Name is not in the directory, throws an exception.

AddName(Dir, Name, File) (throws NameDuplicate)
If Name is not in the directory, adds (Name, File) to the directory and updates the file's attribute record. If Name is already in the directory, throws an exception.

UnName(Dir, Name) (throws NotFound)
If Name is in the directory, the entry containing Name is removed from the directory. If Name is not in the directory, throws an exception.

GetNames(Dir, Pattern) -> NameSeq: Returns all the text names in the directory that match the regular expression Pattern.
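A toy C version of the directory service in Figure 8.7 is sketched below, assuming a fixed-size table of (Name, UFID) pairs and hypothetical dir_* function names; error codes stand in for the NotFound and NameDuplicate exceptions. GetNames uses POSIX extended regular expressions via regcomp() and regexec(), and the attribute-record update performed by AddName is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical toy directory: a fixed table of (Name, UFID) pairs. */
#define DIR_MAX_ENTRIES 32
#define DIR_NAME_LEN    64

enum { DIR_OK = 0, DIR_NOT_FOUND = -1, DIR_NAME_DUPLICATE = -2, DIR_FULL = -3, DIR_BAD_ARG = -4 };

typedef struct { int used; int fid; char name[DIR_NAME_LEN]; } DirEntry;
typedef struct { DirEntry e[DIR_MAX_ENTRIES]; } Directory;

static DirEntry *dir_find(Directory *d, const char *name)
{
    assert(d != NULL && name != NULL);
    for (int k = 0; k < DIR_MAX_ENTRIES; k++)
        if (d->e[k].used && strcmp(d->e[k].name, name) == 0) return &d->e[k];
    return NULL;
}

/* Lookup(Dir, Name) -> FileId; NotFound if absent */
int dir_lookup(Directory *d, const char *name, int *fid)
{
    DirEntry *ent = dir_find(d, name);
    if (ent == NULL) return DIR_NOT_FOUND;
    *fid = ent->fid;
    return DIR_OK;
}

/* AddName(Dir, Name, File); NameDuplicate if Name is already present */
int dir_add_name(Directory *d, const char *name, int fid)
{
    if (strlen(name) >= DIR_NAME_LEN) return DIR_BAD_ARG;
    if (dir_find(d, name) != NULL) return DIR_NAME_DUPLICATE;
    for (int k = 0; k < DIR_MAX_ENTRIES; k++) {
        if (!d->e[k].used) {
            d->e[k].used = 1;
            d->e[k].fid = fid;
            strcpy(d->e[k].name, name);   /* length checked above */
            return DIR_OK;
        }
    }
    return DIR_FULL;
}

/* UnName(Dir, Name); NotFound if absent */
int dir_un_name(Directory *d, const char *name)
{
    DirEntry *ent = dir_find(d, name);
    if (ent == NULL) return DIR_NOT_FOUND;
    ent->used = 0;
    return DIR_OK;
}

/* GetNames(Dir, Pattern) -> NameSeq: stores up to max matching names in out
   and returns how many were stored. */
int dir_get_names(Directory *d, const char *pattern, const char **out, int max)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return DIR_BAD_ARG;
    int count = 0;
    for (int k = 0; k < DIR_MAX_ENTRIES && count < max; k++)
        if (d->e[k].used && regexec(&re, d->e[k].name, 0, NULL, 0) == 0)
            out[count++] = d->e[k].name;
    regfree(&re);
    return count;
}
```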



File Service Architecture (cont…)

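Hierarchic file system

-- tree structure

-- can be implemented by the client module, using the flat file and directory services

-- the UNIX file naming scheme is not a strict hierarchy, because link() lets a file have more than one name

File grouping

-- a file group can be moved between servers

-- unique file group ID = IP address of the creating host + date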



SUN NFS architecture

(Figure: client computer and server computer, each running a UNIX kernel with a virtual file system; system calls on the client are directed to local or remote files.)




SUN NFS (cont…)

The NFS protocol is OS-independent

NFS server and client modules reside in the kernel and communicate using RPC

NFS provides access transparency, achieved by a virtual file system layer

-- File handle: file system ID + i-node number of file + i-node generation number
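The role of the i-node generation number can be sketched in C as follows. The field widths and the toy i-node table are assumptions for illustration only (clients treat real NFS file handles as opaque). When an i-node is reused for a new file its generation number is incremented, so a handle still carrying the old number is rejected as stale rather than silently naming the new file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only: the field widths are assumptions. */
typedef struct {
    uint32_t fs_id;       /* identifies the file system */
    uint32_t inode;       /* i-node number of the file in that file system */
    uint32_t generation;  /* i-node generation number */
} FileHandle;

/* Toy server-side i-node table entry. */
typedef struct {
    uint32_t generation;
    int allocated;
} Inode;

enum { FH_OK = 0, FH_STALE = -1 };

/* Server-side check of a handle received in a request. */
int check_handle(const FileHandle *fh, const Inode *table, uint32_t ninodes, uint32_t local_fs_id)
{
    assert(fh != NULL && table != NULL);
    if (fh->fs_id != local_fs_id || fh->inode >= ninodes) return FH_STALE;
    const Inode *ino = &table[fh->inode];
    if (!ino->allocated || ino->generation != fh->generation) return FH_STALE;
    return FH_OK;
}

/* Reusing an i-node for a new file increments its generation number. */
void reuse_inode(Inode *ino)
{
    ino->generation++;
    ino->allocated = 1;
}
```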



SUN NFS (cont…)

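The NFS client module supplies an interface suitable for use by conventional application programs

Access control and authentication are checked by the server

-- the NFS server is stateless, so the user's identity is checked against the file's access permissions on every request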



Figure 8.9 NFS server operations (simplified) – 1

lookup(dirfh, name) -> fh, attr: Returns file handle and attributes for the file name in the directory dirfh.

create(dirfh, name, attr) -> newfh, attr: Creates a new file name in directory dirfh with attributes attr and returns the new file handle and attributes.

remove(dirfh, name) -> status: Removes file name from directory dirfh.

getattr(fh) -> attr: Returns file attributes of file fh. (Similar to the UNIX stat system call.)

setattr(fh, attr) -> attr: Sets the attributes (mode, user id, group id, size, access time and modify time of a file). Setting the size to 0 truncates the file.

read(fh, offset, count) -> attr, data: Returns up to count bytes of data from a file starting at offset. Also returns the latest attributes of the file.

write(fh, offset, count, data) -> attr: Writes count bytes of data to a file starting at offset. Returns the attributes of the file after the write has taken place.

rename(dirfh, name, todirfh, toname) -> status: Changes the name of file name in directory dirfh to toname in directory todirfh.

link(newdirfh, newname, dirfh, name) -> status: Creates an entry newname in the directory newdirfh which refers to file name in the directory dirfh.

Continues on next slide ...



Figure 8.9 NFS server operations (simplified) – 2

symlink(newdirfh, newname, string) -> status: Creates an entry newname in the directory newdirfh of type symbolic link with the value string. The server does not interpret the string but makes a symbolic link file to hold it.

readlink(fh) -> string: Returns the string that is associated with the symbolic link file identified by fh.

mkdir(dirfh, name, attr) -> newfh, attr: Creates a new directory name with attributes attr and returns the new file handle and attributes.

rmdir(dirfh, name) -> status: Removes the empty directory name from the parent directory dirfh. Fails if the directory is not empty.

readdir(dirfh, cookie, count) -> entries: Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, a file handle, and an opaque pointer to the next directory entry, called a cookie. The cookie is used in subsequent readdir calls to start reading from the following entry. If the value of cookie is 0, reads from the first entry in the directory.

statfs(fh) -> fsstats: Returns file system information (such as block size, number of free blocks and so on) for the file system containing a file fh.
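The cookie protocol of readdir can be sketched as a client loop in C. The server side is a hypothetical stand-in over a fixed list of names; file handles are omitted, and the cost of an entry against the count byte budget is assumed to be strlen(name) + 1. The client starts with cookie 0 and keeps passing back the cookie of the last entry it received until an empty batch comes back.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical server directory: names only, file handles omitted. */
static const char *dir_names[] = { ".", "..", "a.out", "notes.txt", "src" };
#define DIR_SIZE (sizeof dir_names / sizeof dir_names[0])

typedef struct {
    const char *name;
    size_t cookie;      /* opaque to the client: where the next call resumes */
} DirEntry;

/* Stand-in for readdir(dirfh, cookie, count): returns at most max entries
   whose assumed sizes (strlen(name) + 1) fit within count bytes. */
static int server_readdir(size_t cookie, size_t count, DirEntry *out, int max)
{
    int n = 0;
    size_t used = 0;
    for (size_t i = cookie; i < DIR_SIZE && n < max; i++) {
        size_t cost = strlen(dir_names[i]) + 1;
        if (used + cost > count) break;
        used += cost;
        out[n].name = dir_names[i];
        out[n].cookie = i + 1;
        n++;
    }
    return n;
}

/* Client loop: start at cookie 0 and resume from the last entry's cookie.
   If count is too small for even one entry, the listing is empty. */
int list_all(size_t count, const char **names, int max)
{
    DirEntry batch[8];
    size_t cookie = 0;
    int total = 0;
    assert(names != NULL && max >= 0);
    for (;;) {
        int n = server_readdir(cookie, count, batch, 8);
        if (n == 0) break;
        for (int k = 0; k < n && total < max; k++) names[total++] = batch[k].name;
        cookie = batch[n - 1].cookie;
    }
    return total;
}
```

With a 12-byte budget, list_all(12, names, 8) needs three non-empty batches (3 entries, then 1, then 1) and returns all 5 names.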



SUN NFS (cont…)

Mount service

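-- pathname translation: in UNIX, the kernel translates a file pathname into an i-node reference; in NFS, translation is performed iteratively by the client, using one lookup request per pathname component

-- automounter: mounts a remote directory dynamically when an empty mount point is first referenced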



Figure 8.10 Local and remote file systems accessible on an NFS client

(Figure: the client's file tree and the root trees of Server 1 and Server 2. Server 1's exported subtree contains big, bob and jon; Server 2's contains jim, jane, joe and ann.)

Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.



SUN NFS (cont…)


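Caching

-- server caching: caches recently read disk blocks

. writes are handled by write-through, or by write-commit (data written to disk when a commit operation is received)

-- client caching: caches the results of read, write, …

. a timestamp-based method is used to validate cached blocks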
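In one common formulation of the timestamp-based check, a cached entry is valid at time T if (T - Tc < t) or (Tm_client = Tm_server), where Tc is the time the entry was last validated, t is a freshness interval, and Tm is the file's modification time as recorded at the client and at the server. The C sketch below assumes a hypothetical getattr_mtime() standing in for a getattr call; it shows that the server is contacted only once the freshness interval has expired.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Client-side record for one cached block (or file). */
typedef struct {
    time_t tc;         /* Tc: time the entry was last validated */
    time_t tm_client;  /* Tm_client: server modification time when it was cached */
} CacheEntry;

/* Hypothetical stand-in for getattr(fh): the server's modification time is a
   global here, and every call is counted as a round trip to the server. */
static time_t server_mtime = 0;
static int getattr_calls = 0;

static time_t getattr_mtime(void)
{
    getattr_calls++;
    return server_mtime;
}

/* Valid at time `now` if (now - Tc < t) or (Tm_client == Tm_server). The second
   test needs a getattr call, so it is made only when the first test fails. */
bool cache_entry_valid(CacheEntry *e, time_t now, time_t t)
{
    assert(e != NULL && t >= 0);
    if (now - e->tc < t) return true;      /* recently validated: no server call */
    if (e->tm_client == getattr_mtime()) {
        e->tc = now;                       /* unchanged at the server: restart the interval */
        return true;
    }
    return false;                          /* modified elsewhere: cached data must be refetched */
}
```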

Other optimizations



SUN NFS (cont…)


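Performance

-- Measurements by Sandberg (1987) showed good performance for file access, except for:

. frequent use of the getattr call (to fetch timestamps for cache validation)

. relatively poor performance of the write operation (because of write-through at the server)

-- Measurements by Sun using the LADDIS benchmark show that NFS offers an effective solution to the distributed storage needs of intranets of most sizes and types of use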



SUN NFS Summary

Follows the abstract model in Section 8.2

Good location and access transparency

Supports heterogeneous hardware and OSs

Server implementation is stateless

Performance enhanced by caching

No support for migration of files or file systems



Andrew File System

Provides transparent access to remote shared files

Compatible with NFS

Designed to perform well with larger numbers of active users



Andrew File System (cont…)

Two unusual design characteristics:

-- whole-file serving

.entire contents of files are transmitted to client computers

-- whole-file caching

.locally cached copies usually remain valid for long periods

.files in regular use are normally retained in cache



Andrew File System (cont…)

Design is based on some observations:

-- most files are small (less than 10 KB)

-- read operations are much more common than writes

-- sequential access is common; random access is rare

-- most files are read and written by only one user; when a file is shared, usually only one user modifies it

-- files are referenced in bursts



Figure 8.11 Distribution of processes in the Andrew File System

(Figure: workstations and servers, each running a UNIX kernel; the Vice server process runs on the servers.)



Figure 8.12 File name space seen by clients of AFS

(Figure: the client's tree below / (root), containing tmp, bin, vmunix, . . . and cmu.)






Figure 8.13 System call interception in AFS

(Figure: UNIX file system calls enter the UNIX kernel; operations on local files go to the UNIX file system, while non-local file operations are passed to Venus.)




Figure 8.14 Implementation of file system calls in AFS

open(FileName, mode)
UNIX kernel: If FileName refers to a file in shared file space, pass the request to Venus.
Venus: Check list of files in local cache. If not present or there is no valid callback promise, send a request for the file to the Vice server that is custodian of the volume containing the file.
Vice: Transfer a copy of the file and a callback promise to the workstation. Log the callback promise.
Venus: Place the copy of the file in the local file system, enter its local name in the local cache list and return the local name to UNIX.
UNIX kernel: Open the local file and return the file descriptor to the application.

read(FileDescriptor, Buffer, length)
UNIX kernel: Perform a normal UNIX read operation on the local copy.

write(FileDescriptor, Buffer, length)
UNIX kernel: Perform a normal UNIX write operation on the local copy.

close(FileDescriptor)
UNIX kernel: Close the local copy and notify Venus that the file has been closed.
Venus: If the local copy has been changed, send a copy to the Vice server that is the custodian of the file.
Vice: Replace the file contents and send a callback to all other clients holding callback promises on the file.



Andrew File System (cont…)

Cache consistency

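-- callback promise: a token issued by the Vice server that is custodian of a file, guaranteeing that it will notify the Venus process when any other client modifies the file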



-- update semantics

best approximation to one-copy file semantics

no mechanism for the control of concurrent updates
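The callback mechanism of Figure 8.14 can be sketched for a single file and a single client in C. Everything here is a hypothetical simplification: vice_fetch() stands in for the Fetch call, vice_store() for another client's close of a modified copy, and there is no real network or disk cache. Venus uses its cached copy as long as the callback promise is valid and refetches only after a BreakCallback has cancelled it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, much-simplified Venus cache entry for a single file. */
typedef enum { CALLBACK_VALID, CALLBACK_CANCELLED } CallbackState;

typedef struct {
    bool cached;          /* a copy is in the local cache */
    CallbackState cb;     /* state of the callback promise stored with it */
    char data[64];
} VenusEntry;

/* Stand-in for the Vice server: one file's contents plus a count of Fetch calls. */
static char vice_copy[64] = "v1";
static int fetch_calls = 0;

/* Fetch(fid) -> attr, data: transfers the file and records a callback promise. */
static void vice_fetch(VenusEntry *e)
{
    fetch_calls++;
    snprintf(e->data, sizeof e->data, "%s", vice_copy);
    e->cached = true;
    e->cb = CALLBACK_VALID;
}

/* open(): use the cached copy only while a valid callback promise is held. */
const char *venus_open(VenusEntry *e)
{
    if (!e->cached || e->cb == CALLBACK_CANCELLED) vice_fetch(e);
    return e->data;
}

/* BreakCallback(fid): made by Vice to a Venus process; cancels the promise. */
void venus_break_callback(VenusEntry *e)
{
    e->cb = CALLBACK_CANCELLED;
}

/* Another client closes a modified copy: Vice replaces the contents and sends
   a callback to every other client holding a callback promise on the file. */
void vice_store(const char *newdata, VenusEntry *holders[], int nholders)
{
    snprintf(vice_copy, sizeof vice_copy, "%s", newdata);
    for (int k = 0; k < nholders; k++) {
        assert(holders[k] != NULL);
        venus_break_callback(holders[k]);
    }
}
```

The second venus_open() of an unchanged file costs no server call; only a BreakCallback forces a new Fetch, which is how AFS minimizes client-server communication.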



Figure 8.15 The main components of the Vice service interface

Fetch(fid) -> attr, data: Returns the attributes (status) and, optionally, the contents of the file identified by fid and records a callback promise on it.

Store(fid, attr, data): Updates the attributes and (optionally) the contents of a specified file.

Create() -> fid: Creates a new file and records a callback promise on it.

Remove(fid): Deletes the specified file.

SetLock(fid, mode): Sets a lock on the specified file or directory. The mode of the lock may be shared or exclusive. Locks that are not removed expire after 30 minutes.

ReleaseLock(fid): Unlocks the specified file or directory.

RemoveCallback(fid): Informs the server that a Venus process has flushed a file from its cache.

BreakCallback(fid): This call is made by a Vice server to a Venus process. It cancels the callback promise on the relevant file.
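The expiry rule for SetLock can be sketched as a server-side lock record in C. This is a hypothetical simplification: one lock record per file, shared locks counted rather than tracked per client, a single timer per file, and the current time passed in as a parameter so that expiry can be shown without waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define LOCK_TIMEOUT_SECS (30 * 60)   /* locks not removed expire after 30 minutes */

typedef enum { UNLOCKED, SHARED, EXCLUSIVE } LockMode;

typedef struct {
    LockMode mode;
    int holders;       /* number of clients holding the lock */
    time_t set_at;     /* when the lock was last granted */
} FileLock;

/* Hypothetical server-side SetLock(fid, mode) for one file. */
bool set_lock(FileLock *l, LockMode mode, time_t now)
{
    assert(mode == SHARED || mode == EXCLUSIVE);
    if (l->mode != UNLOCKED && now - l->set_at >= LOCK_TIMEOUT_SECS) {
        l->mode = UNLOCKED;               /* never released: treat as expired */
        l->holders = 0;
    }
    if (l->mode == UNLOCKED) {
        l->mode = mode;
        l->holders = 1;
        l->set_at = now;
        return true;
    }
    if (l->mode == SHARED && mode == SHARED) {
        l->holders++;
        l->set_at = now;                  /* simplification: one timer per file, not per holder */
        return true;
    }
    return false;                         /* conflicts with the current lock */
}

/* ReleaseLock(fid) */
void release_lock(FileLock *l)
{
    if (l->holders > 0 && --l->holders == 0) l->mode = UNLOCKED;
}
```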



Andrew File System (cont…)

UNIX kernel modification

Location database

Bulk transfers

Partial file caching


Wide-area support



Recent Advances

NFS enhancements

achieving one-copy update semantics

-- Spritely NFS

-- Not Quite NFS

-- WebNFS

-- NFS version 4



Recent Advances (cont…)

AFS enhancements

-- DCE/DFS (www.opengroup.org)

-- Improvements in storage organization


New design approaches

-- xFS

-- Frangipani




Summary

Key design issues

-- caching, consistency, failure tolerance, high throughput, scalability

NFS

-- stateless protocol, high performance

AFS

-- good scalability, whole-file serving and caching


Copyright © George Coulouris, Jean Dollimore, Tim Kindberg 2001. This material is made available for private study and for direct use by individual teachers. It may not be included in any product or employed in any service without the written permission of the authors.

Viewing: These slides must be viewed in slide show mode.

Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley 2001.


Distributed Systems Course

Distributed File Systems

Chapter 2 Revision: Failure model

Chapter 8:
8.1 Introduction
8.2 File service architecture
8.3 Sun Network File System (NFS)
[8.4 Andrew File System (personal study)]
8.5 Recent advances
8.6 Summary



Learning objectives

Understand the requirements that affect the design of distributed services

NFS: understand how a relatively simple, widely-used service is designed
– Obtain a knowledge of file systems, both local and networked
– Caching as an essential design technique
– Remote interfaces are not the same as APIs
– Security requires special consideration

Recent advances: appreciate the ongoing research that often leads to major advances




Chapter 2 Revision: Failure model


Figure 2.11

Class of failure (affects): description

Fail-stop (process): Process halts and remains halted. Other processes may detect this state.

Crash (process): Process halts and remains halted. Other processes may not be able to detect this state.

Omission (channel): A message inserted in an outgoing message buffer never arrives at the other end's incoming message buffer.

Send-omission (process): A process completes a send, but the message is not put in its outgoing message buffer.

Receive-omission (process): A message is put in a process's incoming message buffer, but that process does not receive it.

Arbitrary (Byzantine) (process or channel): Process/channel exhibits arbitrary behaviour: it may send/transmit arbitrary messages at arbitrary times, commit omissions; a process may stop or take an incorrect step.



Storage systems and their properties


In the first generation of distributed systems (1974-95), file systems (e.g. NFS) were the only networked storage systems.

With the advent of distributed object systems (CORBA, Java) and the web, the picture has become more complex.



Figure 8.1

Storage systems and their properties

Properties compared include sharing and persistence. Examples of each type of storage system:

Main memory: RAM

File system: UNIX file system

Distributed file system: Sun NFS

Web: Web server

Distributed shared memory: Ivy (Ch. 16)

Remote objects (RMI/ORB): CORBA

Persistent object store: CORBA Persistent Object Service

Persistent distributed object store: PerDiS, Khazana

Types of consistency between copies:
1 - strict one-copy consistency
√ - approximate consistency
X - no automatic consistency



What is a file system? 1

Persistent stored data sets

Hierarchic name space visible to all processes

API with the following characteristics:
– access and update operations on persistently stored data sets
– sequential access model (with additional random facilities)

Sharing of data between users, with access control

Concurrent access:
– certainly for read-only access
– what about updates?

Other features:
– mountable file stores
– more? ...




What is a file system? 2


filedes = open(name, mode)
filedes = creat(name, mode)
Opens an existing file with the given name. Creates a new file with the given name. Both operations deliver a file descriptor referencing the open file. The mode is read, write or both.

status = close(filedes): Closes the open file filedes.

count = read(filedes, buffer, n)
count = write(filedes, buffer, n)
Transfers n bytes from the file referenced by filedes to buffer. Transfers n bytes to the file referenced by filedes from buffer. Both operations deliver the number of bytes actually transferred and advance the read-write pointer.

pos = lseek(filedes, offset, whence): Moves the read-write pointer to offset (relative or absolute, depending on whence).

status = unlink(name): Removes the file name from the directory structure. If the file has no other names, it is deleted.

status = link(name1, name2): Adds a new name (name2) for a file (name1).

status = stat(name, buffer): Gets the file attributes for file name into buffer.

Figure 8.4 UNIX file system operations

Class Exercise A

Write a simple C program to copy a file using the UNIX file system operations shown in Figure 8.4.

void copyfile(char *oldfile, char *newfile)
{
    /* you write this part, using open(), creat(), read(), write() */
}


Note: remember that read() returns 0 when you attempt to read beyond the end of the file.
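One possible answer to Class Exercise A, as a minimal C sketch assuming a POSIX system. The signature is adjusted to return an int status, the new file's permissions (0644) are an assumption, and a short write is simply treated as an error. The loop relies on read() returning 0 at end of file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFSIZE 1024

/* Copies oldfile to newfile. Returns 0 on success, -1 on error. */
int copyfile(const char *oldfile, const char *newfile)
{
    char buf[BUFSIZE];
    assert(oldfile != NULL && newfile != NULL);

    int in = open(oldfile, O_RDONLY);
    if (in < 0) return -1;
    int out = creat(newfile, 0644);             /* assumed permissions: rw-r--r-- */
    if (out < 0) {
        close(in);
        return -1;
    }

    ssize_t n;
    while ((n = read(in, buf, BUFSIZE)) > 0) {  /* read() returns 0 at end of file */
        if (write(out, buf, (size_t)n) != n) {  /* short write treated as an error */
            n = -1;
            break;
        }
    }
    close(in);
    close(out);
    return n < 0 ? -1 : 0;
}
```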



What is a file system? 4

Figure 8.3 File attribute record structure

Updated by system:
File length
Creation timestamp
Read timestamp
Write timestamp
Attribute timestamp
Reference count

Updated by owner:
Owner
File type
Access control list (e.g. for UNIX: rw-rw-r--)




File service requirements

Transparency

Access: Same operations

Location: Same name space after relocation of files or processes

Mobility: Automatic relocation of files is possible

Performance: Satisfactory performance across a specified range of system loads

Scaling: Service can be expanded to meet additional loads

Concurrency properties

File-level or record-level locking

Other forms of concurrency control to minimise contention

Replication properties

File service maintains multiple identical copies of files

• Load-sharing between servers makes service more scalable

• Local access has better response (lower latency)

• Fault tolerance

Full replication is difficult to implement.

Caching (of all or part of a file) gives most of the benefits (except fault tolerance)

Heterogeneity properties

Service can be accessed by clients running on (almost) any OS or hardware platform.

Design must be compatible with the file systems of different OSes

Service interfaces must be open - precise specifications of APIs are published.

Fault tolerance

Service must continue to operate even when clients make errors or crash.

• at-most-once semantics

• at-least-once semantics (requires idempotent operations)

Service must resume after a server machine crashes.

If the service is replicated, it can continue to operate even during a server crash.

Consistency

Unix offers one-copy update semantics for operations on local files - caching is completely transparent.

Difficult to achieve the same for distributed file systems while maintaining good performance and scalability.

Security

Must maintain access control and privacy as for local files.

• based on identity of user making request

• identities of remote users must be authenticated

• privacy requires secure communication

Service interfaces are open to all processes not excluded by a firewall.

• vulnerable to impersonation and other attacks

Efficiency

Goal for distributed file systems is usually performance comparable to local file system.



Model file service architecture

(Figure 8.5: the client module runs on the client computer; the flat file service and the directory service run on the server computer.)




Server operations for the model file service

Flat file service

Read(FileId, i, n) -> Data (i: position of first byte)

Write(FileId, i, Data) (i: position of first byte)

Create() -> FileId

Delete(FileId)

GetAttributes(FileId) -> Attr

SetAttributes(FileId, Attr)

FileId: a unique identifier for files anywhere in the network, similar to the remote object references described in Section 4.3.3.

Directory service

Lookup(Dir, Name) -> FileId

AddName(Dir, Name, File)

UnName(Dir, Name)

GetNames(Dir, Pattern) -> NameSeq

Pathname lookup

Pathnames such as '/usr/bin/tar' are resolved by iterative calls to lookup(), one call for each component of the path, starting with the ID of the root directory '/' which is known in every client.
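Iterative pathname lookup can be sketched in C against a toy directory table. The table contents and the names lookup() and resolve() are hypothetical, and UFID 1 is assumed to be the root directory; the counter shows that resolving '/usr/bin/tar' takes one lookup per component, three in all, each of which would be a separate request to the directory service.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy directory service: each row maps (directory UFID, name)
   to a UFID. UFID 1 is assumed to be the root directory known to every client. */
enum { ROOT_UFID = 1, LOOKUP_NOT_FOUND = -1 };

typedef struct { int dir; const char *name; int fid; } NameEntry;

static const NameEntry table[] = {
    { 1, "usr", 2 }, { 2, "bin", 3 }, { 3, "tar", 4 }, { 1, "etc", 5 },
};

/* Stand-in for Lookup(Dir, Name) -> FileId. */
static int lookup(int dir, const char *name)
{
    for (size_t k = 0; k < sizeof table / sizeof table[0]; k++)
        if (table[k].dir == dir && strcmp(table[k].name, name) == 0) return table[k].fid;
    return LOOKUP_NOT_FOUND;
}

/* Resolves an absolute pathname with one lookup() per component, starting at
   the root; *calls counts the lookups made. */
int resolve(const char *path, int *calls)
{
    char buf[256];
    assert(path != NULL && calls != NULL);
    if (path[0] != '/' || strlen(path) >= sizeof buf) return LOOKUP_NOT_FOUND;
    strcpy(buf, path);                       /* strtok() modifies its argument */
    int ufid = ROOT_UFID;
    for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        (*calls)++;
        ufid = lookup(ufid, comp);
        if (ufid == LOOKUP_NOT_FOUND) return LOOKUP_NOT_FOUND;
    }
    return ufid;
}
```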



Class Exercise B

Show how each file operation of the program that you wrote in Class Exercise A would be executed using the operations of the Model File Service in Figures 8.6 and 8.7.

Figures 8.6 and 8.7




File Group

A collection of files that can be located on any server or moved between servers while maintaining the same names.
– Similar to a UNIX filesystem
– Helps with distributing the load of file serving between several servers
– File groups have identifiers which are unique throughout the system (and hence for an open system, they must be globally unique); used to refer to file groups and files

To construct a globally unique ID we use some unique attribute of the machine on which it is created, e.g. IP number, even though the file group may move subsequently.

File Group ID: IP address (32 bits) + date (16 bits)
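The 48-bit file group ID can be packed into a 64-bit integer as sketched below in C, with the 32-bit IPv4 address of the creating host in the high bits and the 16-bit date in the low bits. How the date is encoded in 16 bits is not specified here, so the date argument is treated as an opaque 16-bit value; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Packs a 48-bit file group ID: IPv4 address of the creating host in bits
   47..16, a 16-bit date (encoding left unspecified) in bits 15..0. */
uint64_t make_group_id(uint32_t ipv4, uint16_t date)
{
    return ((uint64_t)ipv4 << 16) | date;
}

/* The ID is fixed at creation, so these recover where and when the group
   was created, not where it is stored now. */
uint32_t group_id_ip(uint64_t gid)
{
    assert(gid >> 48 == 0);   /* only 48 bits are used */
    return (uint32_t)(gid >> 16);
}

uint16_t group_id_date(uint64_t gid)
{
    return (uint16_t)(gid & 0xFFFFu);
}
```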




Case Study: Sun NFS

An industry standard for file sharing on local networks since the 1980s

An open standard with clear and simple interfaces

Closely follows the abstract file service model defined above

Supports many of the design requirements already mentioned:
– transparency
– heterogeneity
– efficiency
– fault tolerance

Limited achievement of:
– concurrency
– replication
– consistency
– security




NFS architecture

[Figure 8.8: NFS architecture. A client computer and a server computer, each with a UNIX kernel and a virtual file system. Application system calls for operations on local files are handled locally; operations on remote files are sent to the server computer as remote operations.]




NFS architecture: does the implementation have to be in the system kernel?

No:
– there are examples of NFS clients and servers that run at application level as libraries or processes (e.g. early Windows and MacOS implementations, current PocketPC, etc.)

But for a UNIX implementation there are advantages:
– Binary code compatibility: there is no need to recompile applications, since standard system calls that access remote files can be routed through the NFS client module by the kernel
– Shared cache of recently-used blocks at the client
– A kernel-level server can access i-nodes and file blocks directly, although a privileged (root) application program could do almost the same
– Security of the encryption key used for authentication
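To illustrate how the kernel can route standard system calls to either the local file system or the NFS client without applications noticing, here is a much-simplified sketch of a virtual file system dispatch table. The structure vfs_ops and all functions below are hypothetical illustrations, not real kernel interfaces; the two read functions only write a marker string instead of doing real I/O.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical VFS model: each mounted filesystem carries a table of
   operations, and the system-call path calls through it without knowing
   whether the file is local or remote. */
struct vfs_ops {
    const char *kind;
    int (*read)(int file, long offset, char *buf, int count);
};

static int local_read(int file, long offset, char *buf, int count)
{
    (void)file; (void)offset;
    return snprintf(buf, (size_t)count, "local");   /* would read local disk blocks */
}

static int nfs_client_read(int file, long offset, char *buf, int count)
{
    (void)file; (void)offset;
    return snprintf(buf, (size_t)count, "remote");  /* would send a read RPC to the server */
}

const struct vfs_ops local_fs = { "local", local_read };
const struct vfs_ops nfs_fs   = { "nfs",   nfs_client_read };

/* Same code path for both kinds of file: dispatch through the ops table. */
int vfs_read(const struct vfs_ops *fs, int file, long offset, char *buf, int count)
{
    return fs->read(file, offset, buf, count);
}
```

Because the application only ever calls the equivalent of vfs_read, the same binary works unchanged whether the file turns out to be local or remote.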



• read(fh, offset, count) -> attr, data
• write(fh, offset, count, data) -> attr
• create(dirfh, name, attr) -> newfh, attr
• remove(dirfh, name) -> status
• getattr(fh) -> attr
• setattr(fh, attr) -> attr
• lookup(dirfh, name) -> fh, attr
• rename(dirfh, name, todirfh, toname) -> status
• link(newdirfh, newname, dirfh, name) -> status
• readdir(dirfh, cookie, count) -> entries
• symlink(newdirfh, newname, string) -> status
• readlink(fh) -> string
• mkdir(dirfh, name, attr) -> newfh, attr
• rmdir(dirfh, name) -> status
• statfs(fh) -> fsstats

NFS server operations (simplified)

fh = file handle: | Filesystem identifier | i-node number | i-node generation |


Model flat file service

Read(FileId, i, n) -> Data
Write(FileId, i, Data)
Create() -> FileId
Delete(FileId)
GetAttributes(FileId) -> Attr
SetAttributes(FileId, Attr)

Model directory service

Lookup(Dir, Name) -> FileId
AddName(Dir, Name, File)
UnName(Dir, Name)
GetNames(Dir, Pattern) -> NameSeq

Figure 8.9
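The i-node generation field in the file handle lets a stateless server detect handles that refer to a file that no longer exists. A sketch, assuming 32-bit fields (clients treat the handle as opaque) and a hypothetical inode_info record that the server has already located using the filesystem identifier and i-node number:

```c
#include <stdint.h>

/* The three fields of an NFS file handle as listed in Figure 8.9. Field
   widths are assumptions; clients treat the whole handle as opaque. */
struct nfs_fh {
    uint32_t fsid;        /* filesystem identifier */
    uint32_t inode;       /* i-node number within that filesystem */
    uint32_t generation;  /* i-node generation number */
};

/* Hypothetical server-side record for the i-node named by fsid + inode. */
struct inode_info {
    uint32_t generation;  /* incremented each time this i-node is reused */
    int in_use;
};

enum fh_status { FH_OK, FH_STALE };

/* I-node numbers are reused after a file is removed, so a matching i-node
   number alone does not prove the handle still names the same file. */
enum fh_status check_handle(const struct nfs_fh *fh, const struct inode_info *ino)
{
    if (!ino->in_use || ino->generation != fh->generation)
        return FH_STALE;
    return FH_OK;
}
```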



NFS access control and authentication

Stateless server, so the user's identity and access rights must be checked by the server on each request.
– In the local file system they are checked only on open()

Every client request is accompanied by the userID and groupID
– not shown in Figure 8.9 because they are inserted by the RPC system

The server is exposed to impostor attacks unless the userID and groupID are protected by encryption

Kerberos has been integrated with NFS to provide a stronger and more comprehensive security solution
– Kerberos is described in Chapter 7. Integration of NFS with Kerberos is covered later in this chapter.
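Because the server is stateless, a permission check like the one below would run on every read or write request rather than once at open(). This sketch assumes a simplified UNIX owner/group/other mode-bit rule (root and supplementary groups are ignored); cred, file_attr and access_allowed are hypothetical names. It also shows why impostor attacks are possible: the server simply trusts whatever userID and groupID arrive with the request.

```c
#include <stdint.h>

/* Credentials as inserted by the RPC layer into each request (hypothetical
   layout), and the file attributes the server holds. */
struct cred { uint32_t uid, gid; };
struct file_attr { uint32_t owner_uid, owner_gid; uint16_t mode; };

#define PERM_READ  4
#define PERM_WRITE 2

/* Simplified UNIX rule: pick the owner, group or other permission bits and
   check that every requested permission is granted. Root is not special-cased. */
int access_allowed(const struct cred *c, const struct file_attr *a, int want)
{
    int bits;
    if (c->uid == a->owner_uid)
        bits = (a->mode >> 6) & 7;
    else if (c->gid == a->owner_gid)
        bits = (a->mode >> 3) & 7;
    else
        bits = a->mode & 7;
    return (bits & want) == want;
}
```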




Mount service

Mount operation:

mount(remotehost, remotedirectory, localdirectory)

Server maintains a table of clients who have mounted filesystems at that server

Each client maintains a table of mounted file systems holding:

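<IP address, port number, file handle>

Hard versus soft mounts
– hard mount: the user-level process is suspended and the client retries the request until the server responds
– soft mount: the NFS client module returns a failure indication to the process after a small number of retries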
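A sketch of the client's mount table as a C array of <IP address, port number, file handle> entries, with a helper that finds which mounted remote filesystem a local pathname falls under. Longest-prefix matching on the pathname is a simplification chosen for illustration; the types, the 32-bit stand-in for the file handle, and find_mount are all assumptions.

```c
#include <stdint.h>
#include <string.h>

/* One entry per remote mount (hypothetical layout). */
struct mount_entry {
    const char *local_dir;   /* mount point, e.g. "/usr/students" */
    uint32_t    server_ip;   /* IP address of the server */
    uint16_t    port;        /* port number of the NFS server */
    uint32_t    root_fh;     /* stand-in for the file handle of the mounted directory */
};

/* Return the entry whose mount point is the longest whole-component prefix
   of path, or NULL if the path is local. */
const struct mount_entry *find_mount(const struct mount_entry *tab, int n, const char *path)
{
    const struct mount_entry *best = NULL;
    size_t best_len = 0;
    for (int k = 0; k < n; k++) {
        size_t len = strlen(tab[k].local_dir);
        if (strncmp(path, tab[k].local_dir, len) == 0 &&
            (path[len] == '/' || path[len] == '\0') && len > best_len) {
            best = &tab[k];
            best_len = len;
        }
    }
    return best;
}
```

With mounts like those of Figure 8.10, a path such as "/usr/students/jon" would select the entry for Server 1, while "/usr/studentsX" would not match because the prefix must end at a component boundary.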




Local and remote file systems accessible on an NFS client

[Figure: directory trees of the client, Server 1 and Server 2, each with its own (root). Server 1 holds the users big, bob and jon; Server 2 holds the users jim, jane, joe and ann.]

Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.

Figure 8.10



NFS optimization - server caching

Similar to UNIX file caching for local files:
– pages (blocks) from disk are held in a main memory buffer cache until the space is required for newer pages. Read-ahead and delayed-write optimizations are used.
– For local files, writes are deferred to the next sync event (at 30-second intervals)
– This works well in the local context, where files are always accessed through the local cache, but in the remote case it doesn't offer the necessary synchronization guarantees to clients.

NFS v3 servers offer two strategies for updating the disk:
– write-through - altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk.
– delayed commit - pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed.
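The two NFS v3 update strategies can be contrasted with a toy model of a single cached page at the server. This is an illustrative sketch only: server_page, server_write and server_commit are hypothetical, disk[] stands in for stable storage, and a real server tracks many pages per file.

```c
#include <string.h>

enum write_mode { WRITE_THROUGH, DELAYED_COMMIT };

/* One cached page at the server; disk[] stands in for stable storage. */
struct server_page {
    char cache[64];
    char disk[64];
    int  dirty;          /* cache holds data not yet on disk */
};

/* Handle a write() RPC. Under write-through the page reaches disk before the
   reply, so the client knows the data is durable when the RPC returns. */
void server_write(struct server_page *p, enum write_mode mode, const char *data)
{
    strncpy(p->cache, data, sizeof p->cache - 1);
    p->cache[sizeof p->cache - 1] = '\0';
    if (mode == WRITE_THROUGH) {
        memcpy(p->disk, p->cache, sizeof p->disk);
        p->dirty = 0;
    } else {
        p->dirty = 1;    /* held only in the cache until commit() */
    }
}

/* Handle a commit() RPC, which the client issues when the file is closed. */
void server_commit(struct server_page *p)
{
    if (p->dirty) {
        memcpy(p->disk, p->cache, sizeof p->disk);
        p->dirty = 0;
    }
}
```

Delayed commit saves one disk write per write() RPC, at the cost that data acknowledged before commit() can be lost if the server crashes.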




NFS optimization - client caching

Server caching does nothing to reduce RPC traffic between client and server
– further optimization is essential to reduce server load in large networks
– the NFS client module caches the results of read, write, getattr, lookup and readdir operations
– synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file

Timestamp-based validity check: reduces inconsistency, but doesn't eliminate it
– validity condition for cache entries at the client:

$(T - T_c < t) \lor (Tm_{client} = Tm_{server})$

– $t$ is configurable (per file) but is typically set to 3 seconds for files and 30 seconds for directories
– it remains difficult to write distributed applications that share files with NFS

where:
$t$ = freshness guarantee
$T_c$ = time when the cache entry was last validated
$Tm$ = time when the block was last updated at the server
$T$ = current time
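The validity condition can be evaluated in two steps, so that the server is contacted only when the first clause fails. A minimal sketch with times in whole seconds; cache_entry, check_freshness and revalidate are hypothetical names, and the getattr() call itself is left to the caller.

```c
#include <stdbool.h>
#include <time.h>

/* Client-side cache entry (hypothetical layout); times in whole seconds. */
struct cache_entry {
    time_t tc;         /* Tc: when the entry was last validated */
    time_t tm_client;  /* Tm recorded at the client for the cached block */
};

enum cache_check { CACHE_FRESH, CACHE_CHECK_SERVER };

/* First clause, T - Tc < t: the entry is trusted without contacting the server. */
enum cache_check check_freshness(const struct cache_entry *e, time_t now, time_t t)
{
    return (now - e->tc < t) ? CACHE_FRESH : CACHE_CHECK_SERVER;
}

/* Second clause, Tm_client == Tm_server: called with the Tm value returned by a
   getattr() RPC. On success Tc is reset; on failure the block must be refetched. */
bool revalidate(struct cache_entry *e, time_t now, time_t tm_server)
{
    if (e->tm_client == tm_server) {
        e->tc = now;
        return true;
    }
    return false;
}
```

With t = 3, an entry validated at time 100 is used without any RPC at times 100 to 102; at time 103 the client must issue a getattr() and compare Tm values.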



Other NFS optimizations

Sun RPC runs over UDP by default (can use TCP if required)

Uses UNIX BSD Fast File System with 8-kbyte blocks

read() and write() calls can be of any size (negotiated between client and server)

the guaranteed freshness interval t is set adaptively for individual files to reduce the getattr() calls needed to update Tm

file attribute information (including Tm) is piggybacked in replies to all file requests
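The slides do not say how t is adapted. One plausible heuristic, assumed here purely for illustration, is to make t grow with the time since the file was last modified, clamped between a minimum and a maximum; the divisor 10 and the bounds t_min and t_max are assumptions.

```c
#include <time.h>

/* Assumed heuristic (not from the slides): t grows with the time since the
   file was last modified (now - Tm), clamped to [t_min, t_max]. Recently
   changed files are checked often; long-unchanged files rarely. */
time_t adaptive_freshness(time_t now, time_t tm, time_t t_min, time_t t_max)
{
    time_t t = (now - tm) / 10;
    if (t < t_min)
        t = t_min;
    if (t > t_max)
        t = t_max;
    return t;
}
```

With t_min = 3 seconds, a file modified a few seconds ago gets the minimum interval, while a file untouched for hours is capped at t_max, so it generates far fewer getattr() calls.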




NFS summary 1

An excellent example of a simple, robust, high-performance distributed service.

Achievement of transparencies (See section 1.4.7):

Access: Excellent; the API is the UNIX system call interface for both local and remote files.

Location: Not guaranteed but normally achieved; naming of filesystems is controlled by client mount operations, but transparency can be ensured by an appropriate system configuration.

Concurrency: Limited but adequate for most purposes; when read-write files are shared concurrently between clients, consistency is not perfect.

Replication: Limited to read-only file systems; for writable files, the Sun Network Information Service (NIS) runs over NFS and is used to replicate essential system files, see Chapter 14.




NFS summary 2

Achievement of transparencies (continued):

Failure: Limited but effective; service is suspended if a server fails. Recovery from failures is aided by the simple stateless design.

Mobility: Hardly achieved; relocation of files is not possible, relocation of filesystems is possible, but requires updates to client configurations.

Performance: Good; multiprocessor servers achieve very high performance, but for a single filesystem it's not possible to go beyond the throughput of a multiprocessor server.

Scaling: Good; filesystems (file groups) may be subdivided and allocated to separate servers. Ultimately, the performance limit is determined by the load on the server holding the most heavily-used filesystem (file group).




Recent advances in file services

NFS enhancements
– WebNFS: the NFS server implements a web-like service on a well-known port. Requests use a 'public file handle' and a pathname-capable variant of lookup(). This enables applications to access NFS servers directly, e.g. to read a portion of a large file.
– One-copy update semantics (Spritely NFS, NQNFS): include an open() operation and maintain tables of open files at servers, which are used to prevent multiple writers and to generate callbacks to clients notifying them of updates. Performance was improved by a reduction in getattr() traffic.

Improvements in disk storage organisation
– RAID: improves performance and reliability by striping data redundantly across several disk drives
– Log-structured file storage: updated pages are stored contiguously in memory and committed to disk in large contiguous blocks (~1 Mbyte). File maps are modified whenever an update occurs. Garbage collection recovers disk space.
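The log-structured idea can be illustrated with a toy store: updated pages are appended to an in-memory segment, the segment is written to the log in one contiguous transfer when full, and a file map records where the latest copy of each page lives. Sizes are tiny stand-ins for the ~1 Mbyte segments, all names are hypothetical, and garbage collection of superseded copies and log-capacity checks are omitted.

```c
#include <string.h>

/* Toy sizes; assumes fewer than LOG_PAGES pages are ever written. */
#define LFS_PAGE        16
#define SEG_PAGES        4
#define LOG_PAGES       64
#define MAX_FILE_PAGES   8

static char log_disk[LOG_PAGES][LFS_PAGE];  /* the on-disk log */
static int  log_tail = 0;                   /* first free page in the log */
static char segment[SEG_PAGES][LFS_PAGE];   /* in-memory segment buffer */
static int  seg_used = 0;                   /* pages buffered so far */
static int  file_map[MAX_FILE_PAGES];       /* page number -> log address */

/* Commit the buffered pages to the log as one contiguous run. */
void flush_segment(void)
{
    for (int k = 0; k < seg_used; k++)
        memcpy(log_disk[log_tail + k], segment[k], LFS_PAGE);
    log_tail += seg_used;
    seg_used = 0;
}

/* Updating a page never overwrites it in place: the new copy is appended and
   the file map is changed to point at it, leaving the old copy as garbage. */
void lfs_write(int page, const char *data)
{
    if (seg_used == SEG_PAGES)
        flush_segment();
    strncpy(segment[seg_used], data, LFS_PAGE - 1);
    segment[seg_used][LFS_PAGE - 1] = '\0';
    file_map[page] = log_tail + seg_used;   /* address this copy will occupy */
    seg_used++;
}

/* Latest copy of a page, possibly still in the segment buffer.
   Assumes the page has been written at least once. */
const char *lfs_read(int page)
{
    int addr = file_map[page];
    return addr >= log_tail ? segment[addr - log_tail] : log_disk[addr];
}
```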




New design approaches 1

Distribute file data across several servers
– Exploits high-speed networks (ATM, Gigabit Ethernet)
– Layered approach, lowest level is like a 'distributed virtual disk'
– Achieves scalability even for a single heavily-used file

'Serverless' architecture
– Exploits processing and disk resources in all available network nodes
– Service is distributed at the level of individual files

Examples:
– xFS (section 8.5): experimental implementation demonstrated a substantial performance gain over NFS and AFS
– Frangipani (section 8.5): performance similar to local UNIX file access
– Tiger Video File System (see Chapter 15)
– Peer-to-peer systems: Napster, OceanStore (UCB), Farsite (MSR), Publius (AT&T research) - see web for documentation on these very recent systems



New design approaches 2

Replicated read-write files
– High availability
– Disconnected working: re-integration after disconnection is a major problem if conflicting updates have occurred
– Examples: Bayou system (Section 14.4.2), Coda system (Section 14.4.3)




Sun NFS is an excellent example of a distributed service designed to meet many important design requirements

Effective client caching can produce file service performance equal to or better than local file systems

Consistency versus update semantics versus fault tolerance remains an issue

Most client and server failures can be masked

Superior scalability can be achieved with whole-file serving (Andrew FS) or the distributed virtual disk approach


Future requirements:
– support for mobile users, disconnected operation, automatic re-integration (Cf. Coda file system, Chapter 14)
– support for data streaming and quality of service (Cf. Tiger file system, Chapter 15)



Exercise A solution

Write a simple C program to copy a file using the UNIX file system operations shown in Figure 8.4.

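#include <fcntl.h>    /* open(), creat(), O_RDONLY */
#include <stdio.h>    /* printf() */
#include <unistd.h>   /* read(), write(), close() */

#define BUFSIZE 1024
#define READ O_RDONLY
#define FILEMODE 0644

void copyfile(char* oldfile, char* newfile)
{
    char buf[BUFSIZE];
    int n, fdold, fdnew;

    if ((fdold = open(oldfile, READ)) >= 0) {
        if ((fdnew = creat(newfile, FILEMODE)) < 0) {
            printf("Copyfile: couldn't create file: %s\n", newfile);
            close(fdold);
            return;
        }
        /* copy until read() returns 0 (end of file) or -1 (error) */
        while ((n = read(fdold, buf, BUFSIZE)) > 0)
            if (write(fdnew, buf, n) != n) break;
        close(fdold);
        close(fdnew);
    } else printf("Copyfile: couldn't open file: %s\n", oldfile);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        printf("usage: %s oldfile newfile\n", argv[0]);
        return 1;
    }
    copyfile(argv[1], argv[2]);
    return 0;
}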




Show how each file operation of the program that you wrote in Class Exercise A would be executed using the operations of the Model File Service in Figures 8.6 and 8.7.

if((fdold = open(oldfile, READ))>=0) {
    fdnew = creat(newfile, FILEMODE);
    while (n>0) {
        n = read(fdold, buf, BUFSIZE);
        if(write(fdnew, buf, n) < 0) break;
    }
    close(fdold); close(fdnew);

Exercise B solution

Server operations for: copyfile("/usr/include/glob.h", "/foo")

fdold = open("/usr/include/glob.h", READ)
Client module actions:
– FileId = Lookup(Root, "usr") - remote invocation
– FileId = Lookup(FileId, "include") - remote invocation
– FileId = Lookup(FileId, "glob.h") - remote invocation
– client module makes an entry in its openfiles table with file = FileId, mode = READ, and RWpointer = 0. It returns the table row number as the value for fdold.

fdnew = creat("/foo", FILEMODE)
Client module actions:
– FileId = Create() - remote invocation
– AddName(Root, "foo", FileId) - remote invocation
– SetAttributes(FileId, attributes) - remote invocation
– client module makes an entry in its openfiles table with file = FileId, mode = WRITE, and RWpointer = 0. It returns the table row number as the value for fdnew.

n = read(fdold, buf, BUFSIZE)
Client module actions:
– Read(openfiles[fdold].file, openfiles[fdold].RWpointer, BUFSIZE) - remote invocation
– increment the RWpointer in the openfiles table by the number of bytes actually returned (at most BUFSIZE), assign the resulting array of data to buf, and return that number as n
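The client module actions for open() and read() can be sketched in C. This is an illustrative sketch only: Read_stub stands in for the remote Read(FileId, i, n) operation by reading from a fixed in-memory string, and openfiles, client_open and client_read are hypothetical names. RWpointer advances by the number of bytes actually returned, so a short read at end of file leaves it at the file length.

```c
#include <string.h>

/* Hypothetical client module state for Exercise B. */
#define MAX_OPEN 16
enum open_mode { MODE_READ, MODE_WRITE };

struct open_file {
    long file;            /* FileId obtained from Lookup or Create */
    enum open_mode mode;
    long rwpointer;       /* position of the next byte to read or write */
    int in_use;
};
static struct open_file openfiles[MAX_OPEN];

/* Contents of the one "remote" file the stub knows about. */
static const char remote_data[] = "0123456789";

/* Stand-in for the remote Read(FileId, i, n): copies at most n bytes
   starting at position i and returns how many were copied. */
static long Read_stub(long file, long i, char *out, long n)
{
    long len = (long)sizeof remote_data - 1;
    (void)file;                      /* only one file in this stub */
    if (i >= len)
        return 0;
    if (n > len - i)
        n = len - i;
    memcpy(out, remote_data + i, (size_t)n);
    return n;
}

/* open(): record FileId, mode and RWpointer = 0; the row number is the fd. */
int client_open(long file, enum open_mode m)
{
    for (int fd = 0; fd < MAX_OPEN; fd++) {
        if (!openfiles[fd].in_use) {
            openfiles[fd] = (struct open_file){ file, m, 0, 1 };
            return fd;
        }
    }
    return -1;                       /* table full */
}

/* read(fd, buf, n): one Read call, then advance RWpointer by the bytes returned. */
long client_read(int fd, char *buf, long n)
{
    if (fd < 0 || fd >= MAX_OPEN || !openfiles[fd].in_use)
        return -1;
    struct open_file *of = &openfiles[fd];
    long got = Read_stub(of->file, of->rwpointer, buf, n);
    of->rwpointer += got;
    return got;
}
```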