CS 519: Lecture 4 I/O, Disks, and File Systems


Page 1: CS 519: Lecture 4

CS 519: Lecture 4

I/O, Disks, and File Systems

Page 2: CS 519: Lecture 4

2Computer Science, Rutgers CS 519: Operating System Theory

I/O Devices

So far we have talked about how to abstract and manage CPU and memory

Computation “inside” computer is useful only if some results are communicated “outside” of the computer

I/O devices are the computer's interface to the outside world (I/O = Input/Output)

Example devices: display, keyboard, mouse, speakers, network interface, and disk

Page 3: CS 519: Lecture 4

3Computer Science, Rutgers CS 519: Operating System Theory

Basic Computer Structure

[Diagram: CPU and memory connected by the memory bus (system bus); a bridge connects it to the I/O bus, which hosts the disk and the NIC.]

Page 4: CS 519: Lecture 4

4Computer Science, Rutgers CS 519: Operating System Theory

OS: Abstractions and Access Calls

OS must virtualize a wide range of devices into a few simple abstractions:

Storage: hard drives, tapes, CD-ROM

Networking: Ethernet, radio, serial line

Multimedia: DVD, camera, microphones

Operating system should provide consistent calls to access the abstractions

Otherwise, programming is too hard

Page 5: CS 519: Lecture 4

5Computer Science, Rutgers CS 519: Operating System Theory

User/OS Interface

Same interface is used to access devices (like disks and network cards) and more abstract resources (like files)

4 main calls:

open()

close()

read()

write()

Semantics depend on the type of the device (block, char, net)

Page 6: CS 519: Lecture 4

6Computer Science, Rutgers CS 519: Operating System Theory

Unix I/O Calls

fileHandle = open(pathName, flags, mode)

A file handle is a small integer, valid only within a single process, to operate on the device or file

pathname: a name in the file system. In Unix, devices are put under /dev. E.g. /dev/ttya is the first serial port, /dev/sda the first SCSI drive

flags: blocking or non-blocking …

mode: read only, read/write, append …

errorCode = close(fileHandle)

Kernel will free the data structures associated with the device

Page 7: CS 519: Lecture 4

7Computer Science, Rutgers CS 519: Operating System Theory

Unix I/O Calls

byteCount = read(fileHandle, buf, count)

Read at most count bytes from the device and put them in the byte buffer buf. Bytes are placed starting from the 0th byte of buf.

Kernel can give the process fewer bytes, user process must check the byteCount to see how many were actually returned.

A negative byteCount signals an error (value is the error type)

byteCount = write(fileHandle, buf, count)

Write at most count bytes from the buffer buf

actual number written returned in byteCount

a negative byteCount signals an error
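A minimal C sketch of these two calls on a Unix-like system (the file path is just an illustration): it copies a file to standard output, and the inner loop shows why the caller must check byteCount, since both read() and write() may transfer fewer bytes than requested.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        int fh = open("/etc/hosts", O_RDONLY);      /* illustrative path */
        if (fh < 0) { perror("open"); return 1; }

        ssize_t byteCount;
        while ((byteCount = read(fh, buf, sizeof buf)) > 0) {
            /* write() may also accept fewer bytes than asked; retry the rest */
            ssize_t off = 0;
            while (off < byteCount) {
                ssize_t n = write(STDOUT_FILENO, buf + off, byteCount - off);
                if (n < 0) { perror("write"); close(fh); return 1; }
                off += n;
            }
        }
        if (byteCount < 0)
            perror("read");                          /* negative return signals an error */
        close(fh);
        return 0;
    }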

Page 8: CS 519: Lecture 4

8Computer Science, Rutgers CS 519: Operating System Theory

I/O Semantics

From this basic interface, two different dimensions to how I/O is processed:

blocking vs. non-blocking vs. asynchronous

buffered vs. unbuffered

The OS tries to support as many of these dimensions as possible for each device

The semantics are specified in the open() system call

Page 9: CS 519: Lecture 4

9Computer Science, Rutgers CS 519: Operating System Theory

Blocking vs. Non-blocking vs. Asynchronous I/O

Blocking – process is blocked until all bytes in the count field are read or written

E.g., for a network device, if the user wrote 1000 bytes, the OS would unblock the process only after all 1000 bytes have been accepted by the device.

+ Easy to use and understand

- If the device just can't perform the operation (e.g., you unplug the cable), what to do? Give up and return the number of bytes successfully transferred.

Non-blocking – the OS only reads or writes as many bytes as is possible without blocking the process

+ Returns quickly

- More work for the programmer (but really good for robust programs)
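A sketch of the non-blocking case in C, assuming a POSIX system (the FIFO path is hypothetical): the descriptor is opened with O_NONBLOCK, so a read() that cannot make progress returns immediately with EAGAIN/EWOULDBLOCK instead of blocking, and the program is free to do other work and retry.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* O_NONBLOCK requests non-blocking semantics on this descriptor */
        int fh = open("/tmp/myfifo", O_RDONLY | O_NONBLOCK);   /* hypothetical FIFO */
        if (fh < 0) { perror("open"); return 1; }

        char buf[512];
        ssize_t n = read(fh, buf, sizeof buf);
        if (n > 0)
            printf("got %zd bytes\n", n);
        else if (n == 0)
            printf("end of stream\n");
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            printf("no data right now; do other work and retry later\n");
        else
            perror("read");

        close(fh);
        return 0;
    }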

Page 10: CS 519: Lecture 4

10Computer Science, Rutgers CS 519: Operating System Theory

Blocking vs. Non-blocking vs. Asynchronous I/O

Asynchronous – similar to non-blocking I/O. The I/O call returns immediately, without waiting for the operation to complete. The I/O subsystem signals the process when the I/O is done. Same advantages and disadvantages as non-blocking I/O.

Difference between non-blocking and asynchronous I/O: a non-blocking read() returns immediately with whatever data available; an asynchronous read() requests a transfer that will be performed in its entirety, but that will complete at some future time.
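To make the contrast concrete, here is a sketch of the asynchronous style using the POSIX AIO interface from <aio.h> (the file path is illustrative; older Linux/glibc systems may need this linked with -lrt): the whole transfer is requested up front, the process keeps running, and it later polls for completion and collects the result.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fh = open("/etc/hosts", O_RDONLY);       /* illustrative file */
        if (fh < 0) { perror("open"); return 1; }

        static char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fh;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

        /* The whole transfer is now in flight; the process can do other work. */
        while (aio_error(&cb) == EINPROGRESS)
            ;  /* real code would compute something useful or sleep here */

        ssize_t n = aio_return(&cb);                 /* bytes transferred, or -1 */
        printf("asynchronous read completed: %zd bytes\n", n);
        close(fh);
        return 0;
    }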

Page 11: CS 519: Lecture 4

11Computer Science, Rutgers CS 519: Operating System Theory

Buffered vs. Unbuffered I/O

Sometimes we want the ease of programming of blocking I/O without the long waits if the buffers on the device are small.

Buffered I/O allows the kernel to make a copy of the data and adjust to different device speeds.

write(): allows the process to write bytes and continue processing

read(): as device signals data is ready, kernel places data in the buffer. When process calls read(), the kernel just makes a copy.

Why not use buffered I/O?

-- Extra copy overhead

-- Delays sending data

Page 12: CS 519: Lecture 4

12Computer Science, Rutgers CS 519: Operating System Theory

Getting Back to Device Types

Most OSs have three device types (in terms of transfer modes):

Character devices: used for serial-line types of devices (e.g., USB port)

Block devices: used for mass storage (e.g., disks and CD-ROM)

Network devices: used for network interfaces (e.g., Ethernet card)

What you can expect from the read/write calls changes with each device type

Page 13: CS 519: Lecture 4

13Computer Science, Rutgers CS 519: Operating System Theory

Character Devices

Device is represented by the OS as an ordered stream of bytes

bytes sent out to the device by the write system call

bytes read from the device by the read system call

Byte stream has no “start”, just open and start reading/writing

Page 14: CS 519: Lecture 4

14Computer Science, Rutgers CS 519: Operating System Theory

Block Devices

OS presents device as a large array of blocks

Each block has a fixed size (1KB - 8KB is typical)

User can read/write only in fixed-size blocks

Unlike other devices, block devices support random access

We can read or write anywhere in the device without having to ‘read all the bytes first’

Page 15: CS 519: Lecture 4

15Computer Science, Rutgers CS 519: Operating System Theory

Network Devices

Like block-based I/O devices, but each write call either sends the entire block (packet), up to some maximum fixed size, or none.

On the receiver, the read call returns all the bytes in the block, or none.

Page 16: CS 519: Lecture 4

16Computer Science, Rutgers CS 519: Operating System Theory

Random Access: The File Pointer

For random access in block devices, OS adds a concept called the file pointer

A file pointer is associated with each open file or device, if the device is a block device

The next read or write operates at the position in the device pointed to by the file pointer

The file pointer points to bytes, not blocks

Page 17: CS 519: Lecture 4

17Computer Science, Rutgers CS 519: Operating System Theory

The Seek Call

To set the file pointer:

absoluteOffset = lseek(fileHandle, offset, from);

from specifies if the offset is absolute, from byte 0, or relative to the current file pointer position

The absolute offset is returned; negative numbers signal error codes

For devices, the offset should be an integral number of bytes.

Page 18: CS 519: Lecture 4

18Computer Science, Rutgers CS 519: Operating System Theory

Block Device Example

You want to read the 10th block of a disk

Each disk block is 4096 bytes long:

fh = open("/dev/sda", flags, mode);

pos = lseek(fh, 4096*9, 0);

if (pos < 0) error;

bytesRead = read(fh, buf, 4096);

if (bytesRead < 0) error;
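The same example written out as a complete C program; the flags the slide leaves blank are filled in here under the assumption that the device is opened read-only (reading /dev/sda directly normally requires root privileges).

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    int main(void) {
        char buf[BLOCK_SIZE];

        int fh = open("/dev/sda", O_RDONLY);         /* block device, read-only */
        if (fh < 0) { perror("open"); return 1; }

        /* The 10th block starts at byte offset 9 * 4096 */
        off_t pos = lseek(fh, (off_t)BLOCK_SIZE * 9, SEEK_SET);
        if (pos < 0) { perror("lseek"); close(fh); return 1; }

        ssize_t bytesRead = read(fh, buf, BLOCK_SIZE);
        if (bytesRead < 0) { perror("read"); close(fh); return 1; }

        printf("read %zd bytes starting at offset %lld\n",
               bytesRead, (long long)pos);
        close(fh);
        return 0;
    }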

Page 19: CS 519: Lecture 4

19Computer Science, Rutgers CS 519: Operating System Theory

Getting and Setting Device-Specific Info

Unix has an I/O control system call: ErrorCode = ioctl(fileHandle, request, object);

request is a numeric command to the device

Can also pass an optional, arbitrary object to a device

The meaning of the command and the type of the object are device-specific
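As one concrete illustration (Linux/BSD terminals), the TIOCGWINSZ request asks a terminal device for its window size; both the request code and the struct winsize object it fills in are device-specific, exactly as described above.

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        struct winsize ws;

        /* request = TIOCGWINSZ, object = &ws; only meaningful for terminals */
        if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) < 0) {
            perror("ioctl");                 /* e.g., stdout is not a terminal */
            return 1;
        }
        printf("terminal is %u rows x %u columns\n",
               (unsigned)ws.ws_row, (unsigned)ws.ws_col);
        return 0;
    }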

Page 20: CS 519: Lecture 4

20Computer Science, Rutgers CS 519: Operating System Theory

Programmed I/O vs. DMA

Programmed I/O is ok for sending commands, receiving status, and communication of a small amount of data

Inefficient for large amounts of data: keeps the CPU busy during the transfer

Programmed I/O memory operations are slow

Direct Memory Access: device reads/writes directly from/to memory

Memory-to-device transfers are typically initiated by the CPU

Device-to-memory transfers can be initiated by either the device or the CPU

Page 21: CS 519: Lecture 4

21Computer Science, Rutgers CS 519: Operating System Theory

Direct Memory Access

Used to avoid programmed I/O for large data movement

Requires DMA controller

Bypasses CPU to transfer data directly between I/O device and memory

Page 22: CS 519: Lecture 4

22Computer Science, Rutgers CS 519: Operating System Theory

Programmed I/O vs. DMA

[Diagram: three configurations of CPU, memory, and disk on an interconnect: programmed I/O, DMA, and DMA with device memory.]

Problems?

Page 23: CS 519: Lecture 4

23Computer Science, Rutgers CS 519: Operating System Theory

Six Steps to Perform DMA Transfer

Source: SGG

Page 24: CS 519: Lecture 4

24Computer Science, Rutgers CS 519: Operating System Theory

Life Cycle of a Blocking I/O Request


Naïve processing of a blocking request: the device driver is executed by a dedicated kernel thread; only one I/O can be processed at a time.

A more sophisticated approach would not block the device driver and would not require a dedicated kernel thread.

Source: SGG

Page 25: CS 519: Lecture 4

25Computer Science, Rutgers CS 519: Operating System Theory

Life Cycle of a Blocking I/O Request

Source: SGG

Page 26: CS 519: Lecture 4

26Computer Science, Rutgers CS 519: Operating System Theory

Performance

I/O is a major factor in system performance: it demands CPU time to execute device-driver and kernel I/O code

State save/restore due to interrupts

Data copying

Disk I/O is extremely slow

Page 27: CS 519: Lecture 4

27Computer Science, Rutgers CS 519: Operating System Theory

Improving Performance

Reduce number of context switches

Reduce data copying

Reduce interrupts by using large transfers, smart controllers, polling

Use DMA

Balance CPU, memory, bus, and I/O performance for highest throughput

Page 28: CS 519: Lecture 4

28Computer Science, Rutgers CS 519: Operating System Theory

Device Driver

OS module controlling an I/O device

Hides the device specifics from the layers above in the kernel by supporting a common API

UNIX: block or character device

Block: device communicates with the CPU/memory in fixed-size blocks

Character/stream: stream of bytes

Translates logical I/O into device I/O, e.g., logical disk blocks into {head, track, sector}

Performs data buffering and scheduling of I/O operations

Structure

Several synchronous entry points: device initialization, queueing of I/O requests, state control, read/write

An asynchronous entry point to handle interrupts

Page 29: CS 519: Lecture 4

29Computer Science, Rutgers CS 519: Operating System Theory

Some Common Entry Points for UNIX Device Drivers

Attach: attach a new device to the system.

Close: note the device is not in use.

Halt: prepare for system shutdown.

Init: initialize driver globals at load or boot time.

Intr: handle device interrupt.

Ioctl: implement control operations.

Mmap: implement memory-mapping.

Open: connect a process to a device.

Read: character-mode input.

Size: return logical size of block device.

Start: initialize driver at load or boot time.

Write: character-mode output.
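One way to picture a driver's structure is as a table of function pointers that the kernel dispatches through. The sketch below is purely illustrative (every name and signature in it is made up, not taken from any particular UNIX), but real systems use the same idea, for example Linux's struct file_operations.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical driver entry-point table; all names here are illustrative. */
    struct dev_driver_ops {
        int  (*init)(void);                                       /* load/boot-time setup  */
        int  (*open)(int minor);                                  /* connect a process     */
        int  (*close)(int minor);                                 /* note device unused    */
        long (*read)(int minor, char *buf, size_t count);         /* character-mode input  */
        long (*write)(int minor, const char *buf, size_t count);  /* character-mode output */
        int  (*ioctl)(int minor, unsigned request, void *arg);    /* control operations    */
        void (*intr)(int minor);                                  /* asynchronous entry: interrupt handler */
    };

    /* A toy driver that behaves like /dev/null: reads return EOF, writes are discarded. */
    static long null_read(int minor, char *buf, size_t count)        { (void)minor; (void)buf; (void)count; return 0; }
    static long null_write(int minor, const char *buf, size_t count) { (void)minor; (void)buf; return (long)count; }

    static const struct dev_driver_ops null_driver = {
        .read  = null_read,
        .write = null_write,
        /* unimplemented entry points stay NULL */
    };

    int main(void) {
        char buf[8];
        long n = null_driver.read ? null_driver.read(0, buf, sizeof buf) : -1;
        printf("read entry point returned %ld\n", n);
        return 0;
    }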

Page 30: CS 519: Lecture 4

30Computer Science, Rutgers CS 519: Operating System Theory

User to Driver Control Flow

[Diagram: user-level read, write, and ioctl calls cross into the kernel. Special (device) files go directly to the block or character device driver; ordinary files go through the file system. Block devices are reached through the buffer cache, character devices through a character queue, before the respective drivers.]

Page 31: CS 519: Lecture 4

31Computer Science, Rutgers CS 519: Operating System Theory

Buffer Cache

When an I/O request is made for a block, the buffer cache is checked first

If block is missing from the cache, it is read into the buffer cache from the device

Exploits locality of reference as any other cache

Replacement policies similar to those for VM, but here exact LRU is feasible because every access goes through the kernel

UNIX

Historically, UNIX has a buffer cache for the disk which does not share buffers with character/stream devices

Adds overhead in a path that has become increasingly common: disk → NIC

Page 32: CS 519: Lecture 4

32Computer Science, Rutgers CS 519: Operating System Theory

Disks

Seek time: time to move the disk head to the desired track

Rotational delay: time to reach desired sector once head is over the desired track

Transfer rate: rate data read/write to disk

Some typical parameters:

Seek: ~2-10ms

Rotational delay: ~3ms for 10000 rpm

Transfer rate: 200 MB/s

[Diagram: disk platter showing tracks and sectors.]
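A rough worked example with the numbers above (the 5 ms seek is just a midpoint of the quoted range): reading one 4 KB block costs about 5 ms (seek) + 3 ms (rotational delay) + 4 KB / 200 MB/s ≈ 0.02 ms (transfer), roughly 8 ms in total, so mechanical positioning dominates and the transfer itself is nearly free.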

Page 33: CS 519: Lecture 4

33Computer Science, Rutgers CS 519: Operating System Theory

Disk Scheduling

Disks are at least four orders of magnitude slower than main memory

The performance of disk I/O is vital for the performance of the computer system as a whole

Access time (seek time + rotational delay) >> transfer time for a sector

Therefore the order in which sectors are read matters a lot

Disk scheduling: usually based on the position of the requested sector rather than on the priority of the requesting process

Possibly reorder the stream of read/write requests to improve performance

Page 34: CS 519: Lecture 4

34Computer Science, Rutgers CS 519: Operating System Theory

Disk Scheduling (Cont.)

Several algorithms exist to schedule the servicing of disk I/O requests.

We illustrate them with a request queue (tracks 0-199).

98, 183, 37, 122, 14, 124, 65, 67

Head pointer 53

Page 35: CS 519: Lecture 4

35Computer Science, Rutgers CS 519: Operating System Theory

FCFS

Illustration shows total head movement of 640 cylinders.

Source: SGG

Page 36: CS 519: Lecture 4

36Computer Science, Rutgers CS 519: Operating System Theory

SSTF

Selects the request with the minimum seek time from the current head position.

SSTF scheduling is a form of SJF scheduling; may cause starvation of some requests.

Illustration shows total head movement of 236 cylinders.
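A small C sketch that reproduces the FCFS and SSTF totals for the example queue (98, 183, 37, 122, 14, 124, 65, 67, head at cylinder 53): it sums head movement in arrival order for FCFS and greedily picks the closest pending request for SSTF.

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 8

    static int fcfs(const int *req, int head) {
        int total = 0;
        for (int i = 0; i < N; i++) {            /* serve in arrival order */
            total += abs(req[i] - head);
            head = req[i];
        }
        return total;
    }

    static int sstf(const int *req, int head) {
        int served[N] = {0}, total = 0;
        for (int i = 0; i < N; i++) {            /* always pick the closest pending request */
            int best = -1, bestDist = INT_MAX;
            for (int j = 0; j < N; j++)
                if (!served[j] && abs(req[j] - head) < bestDist) {
                    bestDist = abs(req[j] - head);
                    best = j;
                }
            served[best] = 1;
            total += bestDist;
            head = req[best];
        }
        return total;
    }

    int main(void) {
        int req[N] = {98, 183, 37, 122, 14, 124, 65, 67};
        printf("FCFS: %d cylinders\n", fcfs(req, 53));   /* 640, as on the FCFS slide */
        printf("SSTF: %d cylinders\n", sstf(req, 53));   /* 236, as on this slide     */
        return 0;
    }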

Page 37: CS 519: Lecture 4

37Computer Science, Rutgers CS 519: Operating System Theory

SSTF (Cont.)

Source: SGG

Page 38: CS 519: Lecture 4

38Computer Science, Rutgers CS 519: Operating System Theory

SCAN

The disk arm starts at one end of the disk, and moves toward the other end, servicing requests until it gets to the other end of the disk, where the head movement is reversed and servicing continues.

Sometimes called the elevator algorithm.

Illustration shows total head movement of 208 cylinders.

Page 39: CS 519: Lecture 4

39Computer Science, Rutgers CS 519: Operating System Theory

SCAN (Cont.)

Source: SGG

Page 40: CS 519: Lecture 4

40Computer Science, Rutgers CS 519: Operating System Theory

C-SCAN

Provides a more uniform wait time than SCAN.

The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip.

Treats the cylinders as a circular list that wraps around from the last cylinder to the first one.

Page 41: CS 519: Lecture 4

41Computer Science, Rutgers CS 519: Operating System Theory

C-SCAN (Cont.)

Source: SGG

Page 42: CS 519: Lecture 4

42Computer Science, Rutgers CS 519: Operating System Theory

C-LOOK

Version of C-SCAN

Arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk.

Page 43: CS 519: Lecture 4

43Computer Science, Rutgers CS 519: Operating System Theory

C-LOOK (Cont.)

Source: SGG

Page 44: CS 519: Lecture 4

44Computer Science, Rutgers CS 519: Operating System Theory

Disk Scheduling Policies

Shortest-service-time-first (SSTF): pick the request that requires the least movement of the head

SCAN (back and forth over disk): good service distribution

C-SCAN (one way with fast return): lower service variability

Problem with SSTF, SCAN, and C-SCAN: arm may not move for long time (due to rapid-fire accesses to same track)

N-step SCAN: scan N requests at a time by breaking the request queue into segments of size at most N and cycling through them

FSCAN: uses two sub-queues; during a scan, one queue is serviced while new requests are added to the other

Page 45: CS 519: Lecture 4

45Computer Science, Rutgers CS 519: Operating System Theory

Disk Management

Low-level formatting, or physical formatting — Dividing a disk into sectors that the disk controller can read and write.

To use a disk to hold files, the operating system still needs to record its own data structures on the disk.

Partition the disk into one or more groups of cylinders.

Logical formatting or “making a file system”.

Boot block initializes system.

The bootstrap is stored in ROM.

Bootstrap loader program.

Methods such as sector sparing used to handle bad blocks.

Page 46: CS 519: Lecture 4

46Computer Science, Rutgers CS 519: Operating System Theory

Swap Space Management

Virtual memory uses disk space as an extension of main memory.

Swap space is necessary for pages that have been written and then replaced from memory.

Swap space can be carved out of the normal file system, or, more commonly, it can be in a separate disk partition.

Swap space management

4.3BSD allocates swap space when process starts; holds text segment (the program) and data segment. (Swap and heap pages are created in main memory first.)

Kernel uses swap maps to track swap space use.

Solaris 2 allocates swap space only when a page is forced out of physical memory, not when the virtual memory page is first created.

Page 47: CS 519: Lecture 4

47Computer Science, Rutgers CS 519: Operating System Theory

Disk Reliability

Several improvements in disk-use techniques involve the use of multiple disks working cooperatively.

RAID is one important technique currently in common use.

Page 48: CS 519: Lecture 4

48Computer Science, Rutgers CS 519: Operating System Theory

RAID

Redundant Array of Inexpensive Disks (RAID)

A set of physical disk drives viewed by the OS as a single logical drive

Replace large-capacity disks with multiple smaller-capacity drives to improve the I/O performance (at lower price)

Data are distributed across physical drives in a way that enables simultaneous access to data from multiple drives

Redundant disk capacity is used to compensate for the increase in the probability of failure due to multiple drives

Improve availability because no single point of failure

Six levels of RAID representing different design alternatives

Page 49: CS 519: Lecture 4

49Computer Science, Rutgers CS 519: Operating System Theory

RAID Level 0

Does not include redundancy

Data is striped across the available disks

Total storage space across all disks is divided into strips

Strips are mapped round-robin to consecutive disks

A set of consecutive strips that maps exactly one strip to each disk in the array is called a stripe

Can you see how this improves the disk I/O bandwidth?

What access pattern gives the best performance?

[Diagram: strips 0-3 laid out round-robin across four disks form stripe 0; strips 4-7 form the next stripe.]

Page 50: CS 519: Lecture 4

50Computer Science, Rutgers CS 519: Operating System Theory

RAID Level 1

Redundancy achieved by duplicating all the data

Every disk has a mirror disk that stores exactly the same data

A read can be serviced by either of the two disks that contain the requested data (improved performance over RAID 0 if reads dominate)

A write request must be done on both disks but can be done in parallel

Recovery is simple but cost is high

[Diagram: mirrored layout; each strip (e.g., strips 0, 1, 8, 9) is stored identically on a disk and its mirror.]

Page 51: CS 519: Lecture 4

51Computer Science, Rutgers CS 519: Operating System Theory

RAID Levels 2 and 3

[Diagram: bit strips b0, b1, b2 and parity P(b); a failed disk is rebuilt as X2(i) = P(i) ⊕ X1(i) ⊕ X0(i)]

Parallel access: all disks participate in every I/O request

Small strips (1 bit) since size of each read/write = # of disks * strip size

RAID 2: 1-bit strips and error-correcting code. ECC is calculated across corresponding bits on data disks and stored on O(log(# data disks)) ECC disks

Hamming code: can correct single-bit errors and detect double-bit errors

Example configurations data disks/ECC disks: 4/3, 10/4, 32/7

Less expensive than RAID 1 but still high overhead – not needed in most environments

RAID 3: 1-bit strips and a single redundant disk for parity bits

P(i) = X2(i) ⊕ X1(i) ⊕ X0(i)

On a failure, data can be reconstructed. Only tolerates one failure at a time
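A small C sketch of the parity idea (the data blocks are arbitrary byte strings): the parity strip is the bitwise XOR of the data strips, and XOR-ing the parity with the surviving strips reconstructs the failed one, which is exactly the P(i) = X2(i) ⊕ X1(i) ⊕ X0(i) relation above.

    #include <stdio.h>
    #include <string.h>

    #define STRIP 8   /* tiny strip size, just for illustration */

    int main(void) {
        unsigned char x0[STRIP] = "disk A!", x1[STRIP] = "disk B!", x2[STRIP] = "disk C!";
        unsigned char p[STRIP], rebuilt[STRIP];

        /* Parity: P(i) = X2(i) ^ X1(i) ^ X0(i) for every bit/byte position i */
        for (int i = 0; i < STRIP; i++)
            p[i] = x2[i] ^ x1[i] ^ x0[i];

        /* Suppose the disk holding X1 fails: rebuild it from parity and the survivors */
        for (int i = 0; i < STRIP; i++)
            rebuilt[i] = p[i] ^ x2[i] ^ x0[i];

        printf("reconstruction %s\n",
               memcmp(rebuilt, x1, STRIP) == 0 ? "succeeded" : "failed");
        return 0;
    }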

Page 52: CS 519: Lecture 4

52Computer Science, Rutgers CS 519: Operating System Theory

RAID Levels 4 and 5

RAID 4: large strips with a parity strip, like RAID 3

Independent access - each disk operates independently, so multiple I/O request can be satisfied in parallel

Independent access → a small write = 2 reads + 2 writes

Example: if write performed only on strip 0:

P'(i) = X2(i) ⊕ X1(i) ⊕ X0'(i)
      = X2(i) ⊕ X1(i) ⊕ X0'(i) ⊕ X0(i) ⊕ X0(i)
      = P(i) ⊕ X0'(i) ⊕ X0(i)

Parity disk can become bottleneck

RAID 5: like RAID 4 but parity strips are distributed across all disks

[Diagram: RAID 5 layout; strips 0-2 with parity P(0-2) form one stripe and strips 3-5 with parity P(3-5) the next, with the parity strip rotated across the disks.]

Page 53: CS 519: Lecture 4

53Computer Science, Rutgers CS 519: Operating System Theory

Cost of Disk Storage

Main memory is much more expensive than disk storage

The cost/MB of hard disk storage is competitive with magnetic tape if only one tape is used per drive

The cheapest tape drives and the cheapest disk drives have had about the same storage capacity over the years

Page 54: CS 519: Lecture 4

54Computer Science, Rutgers CS 519: Operating System Theory

Cost of DRAM

Source: SGG

Page 55: CS 519: Lecture 4

55Computer Science, Rutgers CS 519: Operating System Theory

Cost of Disks

Source: SGG

Page 56: CS 519: Lecture 4

56Computer Science, Rutgers CS 519: Operating System Theory

Cost of Tapes

Source: SGG

Page 57: CS 519: Lecture 4

57Computer Science, Rutgers CS 519: Operating System Theory

File System

File system is an abstraction of the disk

File → tracks/sectors

File Control Block stores mapping info (+ protection, timestamps, size, etc)

To a user process

A file looks like a contiguous block of bytes (Unix)

A file system provides a coherent view of a group of files

A file system provides protection

API: create, open, delete, read, write files

Performance: throughput vs. response time

Reliability: minimize the potential for lost or destroyed data

E.g., RAID could be implemented in the OS (disk device driver)

Page 58: CS 519: Lecture 4

58Computer Science, Rutgers CS 519: Operating System Theory

File API

To read or write, need to open

open() returns a handle to the opened file

OS associates a (per-process) data structure with the handle

This data structure maintains current “cursor” position in the stream of bytes in the file

Read and write takes place from the current position

Can specify a different location explicitly

When done, should close the file

Page 59: CS 519: Lecture 4

59Computer Science, Rutgers CS 519: Operating System Theory

In-Memory File System Structures

Source: SGG

Page 60: CS 519: Lecture 4

60Computer Science, Rutgers CS 519: Operating System Theory

Files vs. Disk

Files → Disk

???

Page 61: CS 519: Lecture 4

61Computer Science, Rutgers CS 519: Operating System Theory

Files vs. Disk

Files → Disk

Contiguous layout

What's the problem with this mapping function? What's the potential benefit of this mapping function?

Page 62: CS 519: Lecture 4

62Computer Science, Rutgers CS 519: Operating System Theory

Files vs. Disk

Files → Disk

What’s the problem with this mapping function?

Page 63: CS 519: Lecture 4

63Computer Science, Rutgers CS 519: Operating System Theory

UNIX File

i-nodes

Page 64: CS 519: Lecture 4

64Computer Science, Rutgers CS 519: Operating System Theory

UNIX File Control Block (I-Node)

Source: SGG

Page 65: CS 519: Lecture 4

65Computer Science, Rutgers CS 519: Operating System Theory

De-fragmentation

Want index-based organization of disk blocks of a file for efficient random access and no fragmentation

Want sequential layout of disk blocks for efficient sequential access

How to reconcile?

Page 66: CS 519: Lecture 4

66Computer Science, Rutgers CS 519: Operating System Theory

De-fragmentation (cont’d)

Base structure is index-based

Optimize for sequential access

De-fragmentation: move the blocks around to simulate actual sequential layout of files

Group allocation of blocks: group tracks together (cylinders). Try to allocate all blocks of a file from a single cylinder group so that they are close together. This style of grouped allocation was first proposed for the BSD Fast File System and later incorporated in ext2 (Linux).

Extents: on each write that extends a file, allocate a chunk of consecutive blocks. Some modern systems use extents, e.g. VERITAS (supported in many systems like Linux and Solaris), the first commercial journaling file system. Ext4 can use them also (extents are not the default option, though).

Page 67: CS 519: Lecture 4

67Computer Science, Rutgers CS 519: Operating System Theory

Free Space Management

No policy issues here – just mechanism

Bitmap: one bit for each block on the disk

Good to find a contiguous group of free blocks

Files are often accessed sequentially

For 1TB disk and 4KB blocks, 32MB for the bitmap

Chained free portions: pointer to the next one

Not so good for sequential access (hard to find sequential blocks of appropriate size)

Index: treats free space as a file
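A sketch of the bitmap mechanism in C (the disk size and helper names are made up for the illustration): one bit per block, with a simple first-fit scan; the same scan could be extended to look for runs of consecutive free blocks, which is what makes bitmaps attractive for sequential allocation.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 1024                       /* illustrative disk size in blocks */
    static uint8_t bitmap[NBLOCKS / 8];        /* 1 bit per block; 0 = free, 1 = used */

    static int  block_is_used(int b) { return (bitmap[b / 8] >> (b % 8)) & 1; }
    static void set_used(int b)      { bitmap[b / 8] |=  (uint8_t)(1u << (b % 8)); }
    static void set_free(int b)      { bitmap[b / 8] &= (uint8_t)~(1u << (b % 8)); }

    /* First-fit allocation: return the first free block, or -1 if the disk is full. */
    static int alloc_block(void) {
        for (int b = 0; b < NBLOCKS; b++)
            if (!block_is_used(b)) { set_used(b); return b; }
        return -1;
    }

    int main(void) {
        memset(bitmap, 0, sizeof bitmap);
        int a = alloc_block(), b = alloc_block();
        printf("allocated blocks %d and %d\n", a, b);                       /* 0 and 1 */
        set_free(a);
        printf("after freeing, next allocation gets %d\n", alloc_block());  /* 0 again */
        return 0;
    }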

Page 68: CS 519: Lecture 4

68Computer Science, Rutgers CS 519: Operating System Theory

File System

OK, we have files

How can we name them?

How can we organize them?

Page 69: CS 519: Lecture 4

69Computer Science, Rutgers CS 519: Operating System Theory

Tree-Structured Directories

Source: SGG

Page 70: CS 519: Lecture 4

70Computer Science, Rutgers CS 519: Operating System Theory

File Naming

Each file has an associated human-readable name

E.g., usr, bin, mid-term.pdf, design.pdf

File name must be globally unique

Otherwise how would the system know which file we are referring to?

OS must maintain a mapping between a file name and the set of blocks belonging to the file

Mappings are kept in directories

Page 71: CS 519: Lecture 4

71Computer Science, Rutgers CS 519: Operating System Theory

Unix File System

Ordinary files (uninterpreted)

Directories

Directory is differentiated from ordinary file by bit in i-node

File of files: consists of records (directory entries), each of which contains info about a file and a pointer to its i-node

Organized as a rooted tree

Pathnames (relative and absolute)

Contains links to parent, itself

Multiple links to files can exist: hard (points to the actual file data) or symbolic (symbolic path to a hard link). Both types of links can be created with the ln utility. Removing a symbolic link does not affect the file data, whereas removing the last hard link to a file will remove the data.

Page 72: CS 519: Lecture 4

72Computer Science, Rutgers CS 519: Operating System Theory

Storage Organization

Info stored in the superblock (SB): size of the file system, number of free blocks, list of free blocks, index to the next free block, size of the I-node list, number of free I-nodes, list of free I-nodes, index to the next free I-node, locks for the free-block and free-I-node lists, and a flag to indicate a modification to the SB

I-node contains: owner, type (directory, file, device), last modified time, last accessed time, last I-node modified time, access permissions, number of links to the file, size, and block pointers
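An illustrative C struct along the lines of the field list above; the names, types, and the number of direct pointers are invented for the sketch (real UNIX i-nodes differ in layout and combine direct with single-, double-, and triple-indirect block pointers).

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define NDIRECT 12   /* illustrative number of direct block pointers */

    /* Hypothetical on-disk i-node layout, following the list of fields above. */
    struct inode_sketch {
        uint32_t owner_uid;          /* owner */
        uint32_t group_gid;
        uint16_t type;               /* directory, regular file, device, ... */
        uint16_t permissions;        /* access permission bits */
        uint16_t nlinks;             /* number of links to the file */
        uint64_t size;               /* file size in bytes */
        time_t   mtime;              /* last modified time */
        time_t   atime;              /* last accessed time */
        time_t   ctime;              /* last i-node modified time */
        uint32_t direct[NDIRECT];    /* direct block pointers */
        uint32_t single_indirect;    /* block containing further block pointers */
        uint32_t double_indirect;
        uint32_t triple_indirect;
    };

    int main(void) {
        printf("sketch i-node occupies %zu bytes\n", sizeof(struct inode_sketch));
        return 0;
    }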

Page 73: CS 519: Lecture 4

73Computer Science, Rutgers CS 519: Operating System Theory

Unix File System (Cont’d)

Tree-structured file hierarchies

Mounted on existing space by using mount

No hard links between different file systems

Page 74: CS 519: Lecture 4

74Computer Science, Rutgers CS 519: Operating System Theory

File Naming

Each file has a unique name

User visible (external) name must be symbolic

In a hierarchical file system, unique external names are given as pathnames (path from the root to the file)

Internal names: i-node in UNIX - an index into an array of file descriptors/headers for a volume

Directory: translation from external to internal name

May have more than one external name for a single internal name

Information about file is split between the directory and the file descriptor: name, type, size, location on disk, owner, permissions, date created, date last modified, date last access, link count

Page 75: CS 519: Lecture 4

75Computer Science, Rutgers CS 519: Operating System Theory

Name Space

In UNIX, "devices are files"

E.g., /dev/cdrom, /dev/tape

User process accesses devices by accessing corresponding file

[Diagram: name-space tree rooted at /, containing usr, A, and B, with C and D below.]

Page 76: CS 519: Lecture 4

76Computer Science, Rutgers CS 519: Operating System Theory

File System Buffer Cache

[Diagram: the application reads/writes files; the OS translates files to disk blocks and maintains a buffer cache; the hardware level controls disk accesses (reads/writes blocks).]

Any problems?

Page 77: CS 519: Lecture 4

77Computer Science, Rutgers CS 519: Operating System Theory

File System Buffer Cache

Disks are "stable" while memory is volatile

What happens if you buffer a write and the machine crashes before the write has been saved to disk?

Can use write-through but write performance will suffer

In UNIX

Use un-buffered I/O when writing i-nodes or pointer blocks

Use buffered I/O for other writes and force sync every 30 seconds

Will talk more about this in a few slides

What about replacement?

How can we further improve performance?

Page 78: CS 519: Lecture 4

78Computer Science, Rutgers CS 519: Operating System Theory

Application-Controlled Caching

[Diagram: the same layered picture as before, but the application now also supplies the replacement policy for the buffer cache that the OS maintains; the hardware level still controls disk accesses (reads/writes blocks).]

Page 79: CS 519: Lecture 4

79Computer Science, Rutgers CS 519: Operating System Theory

Application-Controlled File Caching

Two-level block replacement: responsibility is split between kernel and user level

A global allocation policy performed by the kernel which decides which process will give up a block

A block replacement policy decided by the user:

Kernel provides the candidate block as a hint to the process

The process can overrule the kernel’s choice by suggesting an alternative block

The suggested block is replaced by the kernel

Example of an alternative replacement policy: most-recently-used (MRU)

Page 80: CS 519: Lecture 4

80Computer Science, Rutgers CS 519: Operating System Theory

Sound Kernel-User Cooperation

Oblivious processes should do no worse than under LRU

Foolish processes should not hurt other processes

Smart processes should perform better than LRU whenever possible and they should never perform worse

If kernel selects block A and user chooses B instead, the kernel swaps the position of A and B in the LRU list, and creates a “placeholder” tagged with B to point to A (kernel’s choice)

If the user process misses on B (i.e. it made a bad choice), and B is found in the placeholder, then the block pointed to by the placeholder is chosen (prevents hurting other processes)

Source: P. Cao et al. “Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling”. ACM TOCS, 1996.

Page 81: CS 519: Lecture 4

81Computer Science, Rutgers CS 519: Operating System Theory

File Sharing

Can multiple processes open the same file at the same time?

What happens if two or more processes write to the same file?

What happens if two or more processes try to create the same file at the same time?

What happens if a process deletes a file when another has it opened?

Page 82: CS 519: Lecture 4

82Computer Science, Rutgers CS 519: Operating System Theory

File Sharing (Cont’d)

Several possibilities for file sharing semantics

Unix semantics: file associated with a single physical image

Writes by one user are seen immediately by others who also have the file open.

One sharing mode allows file pointer to be shared.

Session semantics: file may be associated temporarily with several images at the same time

Writes by one user are not immediately seen by others who also have the file open.

Once a file is closed, the changes made to it are visible only in sessions starting later.

Immutable-file semantics: file declared as shared cannot be written.

Page 83: CS 519: Lecture 4

83Computer Science, Rutgers CS 519: Operating System Theory

File System Consistency on Crashes

File system almost always uses a buffer/disk cache for performance reasons

Two copies of a disk block (buffer cache, disk) → consistency problem if the system crashes before all the modified blocks are written back to disk

This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks

Example: if the directory block (contains pointer to i-node) is written back before the i-node of new file and the system crashes, the directory structure will be inconsistent

Similarly, if the free list is updated before the i-node and the system crashes, the free list will be incorrect

Page 84: CS 519: Lecture 4

84

File System Metadata Consistency Problem: Examples

Example 1: create a new file

Two updates: (1) allocate a free I-node; (2) create an entry in the directory

(1) and (2) must be write-through (expensive) or (1) must be written-back before (2)

If (2) is written back first and a crash occurs before (1) is written back, the directory structure is inconsistent and cannot be recovered

Example 2: write a new block to a file

Two updates: (1) allocate a free block; (2) update the address table of the I-node

(1) and (2) must be write-through or (1) must be written-back before (2)

If (2) is written back first and a crash occurs before (1) is written back, the I-node structure is inconsistent and cannot be recovered

Page 85: CS 519: Lecture 4

85Computer Science, Rutgers CS 519: Operating System Theory

More on File System Consistency

Utility programs for checking block and directory consistency

One approach to reduce inconsistency: Write critical blocks from the buffer cache to disk immediately. Data blocks can be written to disk periodically: sync

A more elaborate solution: use logs of metadata operations (“journaling”) to implement transactions. The idea is to write all metadata operations associated with a system call to a log. After these operations are logged, they are considered “committed” and the call can return. Meanwhile, the log is replayed across the actual file system structures. As changes are made, a pointer is updated to indicate which actions have been completed. When an entire transaction is completed, it is removed from the log. If the system crashes, it knows how to recover by looking at the log.

Page 86: CS 519: Lecture 4

86Computer Science, Rutgers CS 519: Operating System Theory

Protection Mechanisms

Files are OS objects: unique names and a finite set of operations that processes can perform on them

Protection domain defines a set of {object, rights} pairs, where a right is the permission to perform one of the operations

At every instant in time, each process runs in some protection domain

In Unix, a protection domain is {uid, gid}

Protection domain in Unix is switched when running a program with SETUID/SETGID set or when the process enters the kernel mode by issuing a system call

How to store info about all the protection domains?

Page 87: CS 519: Lecture 4

87Computer Science, Rutgers CS 519: Operating System Theory

Protection Mechanisms (cont’d)

Access Control List (ACL): associate with each object a list of all the protection domains that may access the object and how

In Unix, an ACL defines three protection domains: owner, group, and others (see the sketch at the end of this slide)

Capability List (C-list): associate with each process a list of objects that may be accessed along with the operations

C-list implementation issues: where/how to store them (hardware, kernel, encrypted in user space) and how to revoke them
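To make the Unix ACL case concrete, here is a small C sketch (the path is illustrative, and the check is simplified: real kernels also consider supplementary groups and root override): stat() exposes the file's owner, group, and permission bits, and the process's effective uid/gid select which of the three domains applies.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "/etc/hosts";         /* illustrative file */
        struct stat st;
        if (stat(path, &st) < 0) { perror("stat"); return 1; }

        int can_read;
        if (geteuid() == st.st_uid)              /* owner domain  */
            can_read = (st.st_mode & S_IRUSR) != 0;
        else if (getegid() == st.st_gid)         /* group domain  */
            can_read = (st.st_mode & S_IRGRP) != 0;
        else                                     /* others domain */
            can_read = (st.st_mode & S_IROTH) != 0;

        printf("%s: mode %o, %sreadable by this process's domain\n",
               path, (unsigned)(st.st_mode & 0777), can_read ? "" : "not ");
        return 0;
    }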

Page 88: CS 519: Lecture 4

88Computer Science, Rutgers CS 519: Operating System Theory

Protection Mechanisms (cont’d)

Most systems use a combination of access control lists and capabilities. Example: in Unix, an access list is checked when first opening a file. After that, the system relies on kernel information (the per-process file table) that is established during the open call. This obviates the need for further protection checks.

Page 89: CS 519: Lecture 4

89Computer Science, Rutgers CS 519: Operating System Theory

Log-Structured File System (LFS)

As memory gets larger, buffer cache size increases → increasing the fraction of read requests that are satisfied from the buffer cache with no disk access

In the future, most disk accesses will be writes

But writes are usually done in small chunks in most file systems (control data, for instance) which makes the file system highly inefficient

LFS idea: structure the entire disk as a log

Periodically, or when required, all the pending writes being buffered in memory are collected and written as a single contiguous segment at the end of the log

Source: M. Rosenblum and J. Ousterhout. “The design and implementation of a log-structured file system”. ACM TOCS, 1992.

Page 90: CS 519: Lecture 4

90Computer Science, Rutgers CS 519: Operating System Theory

LFS segment

Contains i-nodes, directory blocks, and data blocks, all mixed together

Each segment starts with a segment summary

Segment size: 512KB - 1MB

Two key issues: How to retrieve information from the log?

How to manage the free space on disk?

Page 91: CS 519: Lecture 4

91Computer Science, Rutgers CS 519: Operating System Theory

File Location in LFS

The i-node contains the disk addresses of the file blocks as in standard UNIX

But there is no fixed location for the i-node

An i-node map is used to maintain the current location of each i-node

i-node map blocks can also be scattered but a fixed checkpoint region on the disk identifies the location of all the i-node map blocks

Usually i-node map blocks are cached in main memory most of the time, thus disk accesses for them are rare

Page 92: CS 519: Lecture 4

92Computer Science, Rutgers CS 519: Operating System Theory

Segment Cleaning in LFS

LFS disk is divided into segments that are written sequentially

Live data must be copied out of a segment before the segment can be re-written

The process of copying data out of a segment: cleaning

A separate cleaner thread moves along the log, removes old segments from the end and puts live data into memory for rewriting in the next segment

As a result, an LFS disk appears like a big circular buffer with the writer thread adding new segments to the front and the cleaner thread removing old segments from the end

Bookkeeping is not trivial: i-node must be updated when blocks are moved to the current segment