074.784 Operating Systems Design and Implementation
Peter Graham
Spring 2007 (January – April)
04.784 OSDI Overview 1
Logistics
• Instructor: Peter Graham
  - E2-572 EITC (474-8837), [email protected]
• Lectures: Wednesday 10:00 – 12:00
  - E2-461 EITC
• Course Homepage: www.cs.umanitoba.ca/~comp7840
Course Overview
• Goals:
  - Deeper understanding of OS design and implementation principles: OS/architecture interface/interaction
  - Current trends in OS research, with a focus on support for pervasive computing
• Structure:
  - Review basic material: OSs, pervasive concepts
  - Read and discuss papers on advanced issues
  - Write a survey paper on an OS topic
  - Significant project: a taste of hands-on work
[Slide annotations beside the Structure list: "me", "you"]
Topics
• OS Review: processes, threads, and synchronization; resource management; virtual memory; I/O and file systems
• Pervasive Computing Introduction: motivation, key issues, OS challenges
• Current research topics
  - Based on selected paper presentations
Course Details
• Prerequisites
  - Undergraduate OS and architecture courses
  - Good programming skills (in C/C++ and UNIX)
• What to expect
  - Reading and critical analysis of others’ work
  - Implementation project (with evaluation)
  - Write a survey paper – no “double dipping”
• Hence (evaluation):
  - Paper presentation 30%
  - Implementation project 30%
  - Survey paper 40%
Project
• Goals
  - Learn to design, implement/simulate, and evaluate an OS component related to pervasive computing
  - Improve systems programming skills
• Structure
  - Individual work (with a significant coding effort)
  - Include a brief final project report
OS Mechanisms and Policies
What is an operating system?
• A software layer between the hardware and the application programs/users which provides a virtual machine interface: easy and safe
• A resource manager that allows programs/users to share the hardware resources: fairly and efficiently
[Figure: layers, top to bottom: application (user), operating system, hardware]
How does an OS work?
• Receives requests from the application: system calls
• Satisfies the requests: may issue commands to hardware
• Handles hardware interrupts: may upcall the application
• OS complexity: synchronous calls + asynchronous events
[Figure: the application (user) sits above the OS and exchanges system calls and upcalls with it; the OS sits above the hardware and exchanges commands and interrupts with it; the OS has a H/W independent part and a H/W dependent part]
Mechanism and policy
• Mechanisms: data structures and operations that implement an abstraction (e.g. the file buffer cache)
• Policies: the procedures that guide the selection of a certain course of action from among alternatives (e.g. the replacement policy for the buffer cache)
• Traditional OS is rigid: mechanism is bundled together with policy
[Figure: layers, top to bottom: application (user), operating system: mechanism+policy, hardware]
Mechanism-policy split
• Single policy often not the best for all cases
• So, separate mechanisms from policies:
  - OS provides the mechanism + some policy
  - Applications contribute to the policy
• Flexibility + efficiency: require new OS structures and/or new OS interfaces
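One way to picture the split is a cache whose bookkeeping (the mechanism) is fixed while victim selection (the policy) is a function supplied by the caller. The following is only a sketch under assumptions: the names `struct cache`, `victim_fn`, `victim_fifo`, `victim_lru`, and the 4-slot size are hypothetical and do not come from the slides or from any real kernel.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4   /* hypothetical cache size for the sketch */

/* Mechanism: a fixed-size block cache that records, per slot, the block
 * it holds, when it was loaded, and when it was last used. */
struct slot {
    int block;            /* cached block number, -1 if the slot is empty */
    unsigned loaded_at;   /* logical time the block was brought in */
    unsigned used_at;     /* logical time of the most recent access */
};

/* Policy: given the slots, return the index of the slot to evict. */
typedef size_t (*victim_fn)(const struct slot *slots, size_t n);

struct cache {
    struct slot slots[CACHE_SLOTS];
    unsigned clock;       /* logical time, advanced on every access */
    victim_fn choose;     /* policy supplied by the application */
};

/* Example policy 1: evict the block that was loaded earliest (FIFO). */
size_t victim_fifo(const struct slot *s, size_t n) {
    size_t v = 0;
    for (size_t i = 1; i < n; i++)
        if (s[i].loaded_at < s[v].loaded_at) v = i;
    return v;
}

/* Example policy 2: evict the least recently used block (LRU). */
size_t victim_lru(const struct slot *s, size_t n) {
    size_t v = 0;
    for (size_t i = 1; i < n; i++)
        if (s[i].used_at < s[v].used_at) v = i;
    return v;
}

void cache_init(struct cache *c, victim_fn policy) {
    assert(policy != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        c->slots[i] = (struct slot){ .block = -1 };
    c->clock = 0;
    c->choose = policy;
}

/* Access a block; returns 1 on a hit and 0 on a miss. */
int cache_access(struct cache *c, int block) {
    size_t free_slot = CACHE_SLOTS;
    c->clock++;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slots[i].block == block) {
            c->slots[i].used_at = c->clock;
            return 1;
        }
        if (c->slots[i].block == -1 && free_slot == CACHE_SLOTS)
            free_slot = i;
    }
    /* Miss: use a free slot if there is one, otherwise ask the policy. */
    size_t v = (free_slot < CACHE_SLOTS) ? free_slot
                                         : c->choose(c->slots, CACHE_SLOTS);
    c->slots[v] = (struct slot){ block, c->clock, c->clock };
    return 0;
}
```

The same `cache_access` code runs unchanged whether it was initialised with `victim_fifo` or `victim_lru`; only the hit rate differs.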
System abstraction: processes
• A process is a system abstraction: an illusion of being the only job in the system.
[Figure: layers: user (run application), operating system (process), hardware (computer); between user and OS: create, kill processes, inter-process comm.; between OS and hardware: multiplex resources]
Processes: mechanism and policy
• Mechanism:
  - Creation, destruction, suspension, context switch, signaling, IPC, etc.
• Policy:
  - Minor policy questions:
      Who can create/destroy/suspend processes?
      How many active processes can each user have?
  - Major policy questions:
      How to share system resources between multiple processes?
      Typically broken into a number of orthogonal policies for individual resources such as CPU, memory, and disk.
Processor abstraction: threads
• A thread is a processor abstraction: an illusion of having 1 processor per execution context
  - One or more threads per process
[Figure: layers: application (execution context), operating system (thread), hardware (processor); between application and OS: create, kill, synch.; between OS and hardware: context switch]
Threads: mechanism and policy
• Mechanism:
  - Creation, destruction, suspension, context switch, signaling, synchronization, etc.
• Policy:
  - How to share the CPU between threads from different processes?
  - How to share the CPU between threads from the same process?
Memory abstraction: virtual memory
• Virtual memory is a memory abstraction: an illusion of large contiguous memory, typically more memory than is physically available
[Figure: layers: application (address space), operating system (virtual memory), hardware (physical memory); virtual addresses above the OS, physical addresses below]
Virtual memory: mechanism
• Virtual-to-physical memory mapping, page-fault, etc.
• Done with hardware support (DAT/MMU)
[Figure: virtual address spaces of processes p1 and p2, linked to physical memory by v-to-p memory mappings]
Virtual memory: policy
• How to multiplex a virtual memory that is larger than the physical memory onto what is available?
• How to share physical memory between multiple processes?
Storage abstraction: file system
• A file system is a storage abstraction: an illusion of structured storage space
[Figure: layers: application/user (e.g. copy file1 file2), operating system (files, directories), hardware (disk); naming, protection, and operations on files above the OS, operations on disk blocks below]
File System
• Mechanism:
  - File creation, deletion, read, write, file-block-to-disk-block mapping, file buffer cache, etc.
• Policy:
  - Sharing vs. protection?
  - Which block to allocate?
  - File buffer cache management?
Communication abstraction: messaging
• Message passing is a communication abstraction: an illusion of reliable (sometimes ordered) transport
[Figure: layers: application (sockets), operating system (TCP/IP protocols), hardware (network interface); naming and messages above the OS, network packets below]
Message Passing
• Mechanism:
  - Send, receive, buffering, retransmission, etc.
• Policy:
  - Congestion control and routing
  - Multiplexing multiple connections onto a single NIC
Multiprocessors
[Figure: two CPUs, each with its own cache, attached to Memory by a memory bus; an I/O bus connects the disk and the network interface]
UMA Multiprocessors: OS issues
• Processes
  - How to divide processors among multiple processes? Time sharing vs. space sharing
• Threads
  - New synchronization mechanisms
  - How to schedule threads of a single process on its allocated processors?
  - Affinity scheduling?
OS Structure
Traditional OS structure
• Monolithic/layered systems
  - One/N layers all executed in “kernel-mode”
  - Good performance but rigid
[Figure: a user process makes system calls into the OS kernel, which contains the file system and the memory system and runs on the hardware]
Micro-kernel OS
• Client-server model, IPC between clients and servers
• The micro-kernel provides protected communication
• Some OS functions implemented as user-level servers
• Flexible, but efficiency is the problem
• Easy to extend for distributed systems
[Figure: a client process, a file server, and a memory server run in user mode and communicate by IPC through the micro-kernel, which runs on the hardware]
Extensible OS kernel
• User processes can load customized OS services into the kernel
• Good performance but protection and scalability become problems
[Figure: user-mode processes above an extensible kernel that holds a default memory service and a loaded "my memory service"; the kernel runs on the hardware]
Virtual Machines
• Old concept which is heavily revived today
• The real hardware is “cloned” into several identical virtual machines
• OS functionality built on top of the virtual machine
[Figure: labels: user, OS on virtual machine, exokernel (allocate resource), hardware]
Processes, Threads, and Synchronization
Execution mode
• Most processors support at least two modes of execution for protection reasons
  - Privileged: kernel-mode
  - Non-privileged: user-mode
• The portion of the OS that executes in kernel-mode is called the kernel
  - Can freely access hardware resources
  - Protected from interference by user programs
• Code running in kernel-mode can do anything: no protection
• User code executes in user-mode
• OS functionality that does not need direct access to hardware may also run in user-mode
Interrupts and traps
• Interrupt: an asynchronous event
  - External events (not related to the processor state) which occur independently of the instruction execution in the processor
  - Can be masked (specifically or not)
  - e.g. I/O completion interrupt
• Trap: a synchronous event
  - Conditionally or unconditionally caused by the execution of the current instruction
  - e.g. floating point error
• Interrupt and trap events are predefined
• Each interrupt and trap has an associated interrupt vector
• Interrupt vector specifies the handler that should be called when the event occurs (i.e. points to the handler)
• Interrupts and traps force the processor to save the current state of execution and transfer control to the handler
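The vector idea can be sketched in software as a table of function pointers indexed by event number. This is an illustration only: the table size, the handler names, and `dispatch` are hypothetical, and the sketch leaves out the state saving that real hardware performs before transferring control.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VECTORS 8   /* hypothetical number of interrupt/trap sources */

typedef void (*handler_fn)(int vector);

static handler_fn vector_table[NUM_VECTORS];  /* one handler slot per event */
int last_handled = -1;                        /* for demonstration only */

/* Two stand-in handlers; real ones would service the device or the fault. */
void io_done_handler(int vector)  { last_handled = vector; }
void fp_error_handler(int vector) { last_handled = vector; }

/* Point a vector at its handler. */
void install_handler(int vector, handler_fn h) {
    assert(vector >= 0 && vector < NUM_VECTORS);
    vector_table[vector] = h;
}

/* Software analogue of what happens on an event: index the table with
 * the event number and call the handler found there.
 * Returns 0 if a handler ran, -1 if none was installed. */
int dispatch(int vector) {
    if (vector < 0 || vector >= NUM_VECTORS || vector_table[vector] == NULL)
        return -1;
    vector_table[vector](vector);
    return 0;
}
```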
A process
• An “instantiation” of a program
• System abstraction: the set of resources required for executing a program
  - Execution context(s)
  - Address space
  - File handles, communication endpoints, etc.
  - Register contents (i.e. process execution “state”)
• Historically, all of the above “lumped” into a single abstraction
• More recently, split into several abstractions
  - Threads, address space, protection domain, etc.
OS process management
• Supports user creation/destruction of processes and inter-process communication (IPC)
• Allocates resources to processes according to specific policies
• Interleaves the execution of multiple processes to increase system utilization and permit effective sharing of resources among several users
Process image
• The physical representation of a process in the OS
• Requires a process control data structure (the “PCB”, Process Control Block)
  - Identification: process, parent process, user
  - Control: scheduling (state, priority), resources (memory, opened files), IPC
  - Execution contexts: threads
• An address space consisting of code, data, and stack segments
User mode
• When running in user-mode, a process can only access its virtual memory and processor resources (registers) directly
• All other resources can only be accessed indirectly through the kernel by “calling the system”
System call
• A system call is a call because it looks like a procedure call
• In actuality, it’s a software trap
• Why is a system call a “trap” instead of a procedure call?
  - How it is done
  - You end up running OS code that is not part of the user program
System calls in a monolithic OS
[Figure: in user mode, read(…) executes a trap; the PC and PSW are saved and the interrupt vector for the trap instruction leads to the code for the read system call in kernel mode; iret returns to user mode]
Process creation
• How to create a process? Use a system call (of course)!
• In UNIX, a process can create another process using the fork() system call
    int pid = fork()
• The creating process is called the parent and the new process is called the child
• The child process is created as a copy of the parent process (process image and process control structure) except for the identification and scheduling state
• Parent and child processes run in two different address spaces: by default no memory sharing
• Process creation is expensive because of this copying
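A minimal sketch of the points above, assuming a POSIX system: the child changes its copy of a variable and exits, and the parent then observes that its own copy is untouched. The names `run_child`, `parent_copy`, and `shared_looking` are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_looking = 1;   /* copied into the child, not shared */

/* Fork a child that modifies its copy of the variable and exits with
 * `code`. Returns the child's exit status, or -1 on error. */
int run_child(int code) {
    assert(code >= 0 && code <= 255);
    pid_t pid = fork();
    if (pid < 0)
        return -1;                 /* fork failed */
    if (pid == 0) {                /* child: runs in its own address space */
        shared_looking = 2;        /* changes only the child's copy */
        _exit(code);
    }
    int status;                    /* parent: wait for the child to finish */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* The parent's view of the variable after the child has run. */
int parent_copy(void) { return shared_looking; }
```

Calling `run_child(7)` returns 7 in the parent, and `parent_copy()` still returns 1 afterwards, because the write happened in the child's address space.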
Process creation using fork()
• The UNIX shell is a command-line interpreter whose basic purpose is to allow users to run applications on a UNIX system
    cmd arg1 arg2 ... argN

    while (TRUE) {
        get_command(cmd, arguments);
        if (fork() != 0) {      /* parent */
            wait(&status);
        } else {                /* child */
            exec(cmd, arguments);
        }
    }
Inter-process communication
• Most operating systems provide several abstractions for inter-process communication: message passing, shared memory, etc.
• Communication requires synchronization between processes (i.e. data must be produced before it is consumed)
• Synchronization can be implicit (message passing) or may have to be explicit (shared memory)
• Explicit synchronization can be provided by the OS (semaphores, monitors, etc.) or can be achieved exclusively in user-mode (if processes share memory)
  - Ugly and tedious
Threads
• Why limit ourselves to a single execution context?
  - For example, have to use select() to deal with multiple outstanding events. Having multiple execution contexts is more natural.
  - Nice fit with multiprocessor systems
• Multiple execution contexts → threads
• All the threads of a process share the same address space and the same resources
• Each thread contains
  - An execution state: running, ready, etc.
  - An execution context: PC, SP, other registers
  - A per-thread stack
Process address space revisited
[Figure: (a) single-threaded address space: OS, Code, (Global) Data, Heap, and one Stack; (b) multi-threaded address space: OS, Code, (Global) Data, Heap, and two Stacks]
Threads vs. processes
• Why multiple threads?
  - Can’t we use multiple processes to do whatever we can do with multiple threads?
  - Of course, we need to be able to share memory (and other resources) between multiple processes
  - But this sharing is already supported
• Operations on threads (creation, termination, scheduling, etc.) are cheaper than the corresponding operations on processes
  - This is because thread operations do not involve manipulations of other resources associated with processes (especially memory)
• Inter-thread communication is supported through shared memory without kernel intervention
Thread state diagram
[Figure: states ready, running, blocked, and suspended (swapped out). Thread scheduling transitions: dispatch, timeout, wait for event, event occurred. Process scheduling transitions: suspend, activate.]
Thread switching
• Typically referred to as a context switch
• Context switching is the act of taking a thread off of the processor and replacing it with another one that is waiting to run
• A context switch takes place when
  - Time quota allocated to the executing thread expires
  - The executing thread performs a blocking system call
  - A memory fault occurs due to a page miss
  - Etc.
• How to do a context switch?
Thread implementation
• Kernel-level threads (lightweight processes)
  - Kernel sees multiple execution contexts
  - Thread management done by the kernel
• User-level threads
  - Implemented as a thread library which contains the code for thread creation, termination, scheduling and switching
  - Kernel sees one execution context and is unaware of thread activity
Threads: user- vs. kernel-level
• Advantages of user-level threads
  - Performance: low-cost thread operations (do not require crossing protection domains)
  - Flexibility: scheduling can be application specific
  - Portability: user-level thread library easy to port
• Disadvantages of user-level threads
  - If a user-level thread is blocked in the kernel, the entire process (all threads of that process) is blocked
  - Cannot take advantage of multiprocessing (the kernel assigns one process to only one processor)
Synchronization
• Why synchronization?
• Problem
  - Threads (or processes) must (sometimes) share data
  - Data integrity must be maintained
• Example
  - Transfer $10 from account B to account A
      A ← A + 10
      B ← B - 10
  - We don’t want another thread to be able to read A and B between the previous two statements
Some terminology
• Critical section: a section of code which reads or writes shared data
• Race condition: potential for interleaved execution of a critical section by multiple threads
  - Results are non-deterministic
• Mutual exclusion: synchronization mechanism to avoid race conditions by ensuring exclusive execution of critical sections
• Deadlock: permanent blocking of threads
• Starvation: execution but no progress
Requirements for mutex
• No assumptions on hardware: speed, # of processors
• Execution of the critical section (CS) takes a finite time
• A thread/process not in the CS cannot prevent other threads/processes from entering the CS
• Entering the CS cannot be delayed indefinitely: no deadlock or starvation
Synchronization primitives
• Most common primitives
  - Mutexes/locks
  - Condition variables
  - Semaphores
Mutual Exclusion
• Mutual exclusion ≡ want to be the only thread modifying a set of data items
  - Can look at it as exclusive access to data items or to a piece of code
• Have three components: Acquire, Release, Waiting
  - Acquire/release operations often termed Lock/Unlock
• Example: transferring $10 from B to A
    Function Transfer (Amount, A, B)
        Lock(Transfer_Lock)
        A ← A + 10
        B ← B - 10
        Unlock(Transfer_Lock)
• The same transfer with one lock per data item:
        Lock(A)
        Lock(B)
        A ← A + 10
        B ← B - 10
        Unlock(B)
        Unlock(A)
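The Transfer pseudocode maps directly onto a pthread mutex. The sketch below assumes a single global lock and two global balances; `transfer`, `total`, `run_transfers`, and the starting balances are hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static long account_a = 1000, account_b = 1000;
static pthread_mutex_t transfer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Move `amount` from B to A inside one critical section. */
void transfer(long amount) {
    pthread_mutex_lock(&transfer_lock);
    account_a += amount;
    account_b -= amount;
    pthread_mutex_unlock(&transfer_lock);
}

/* Read both balances under the lock, so no half-finished transfer is seen. */
long total(void) {
    pthread_mutex_lock(&transfer_lock);
    long sum = account_a + account_b;
    pthread_mutex_unlock(&transfer_lock);
    return sum;
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++)
        transfer(10);
    return NULL;
}

/* Run n (at most 8) worker threads; returns the final balance of A,
 * or -1 if a thread could not be created. */
long run_transfers(int n) {
    pthread_t tid[8];
    assert(n >= 0 && n <= 8);
    for (int i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
            return -1;
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return account_a;
}
```

Because every update and every read of the pair happens under `transfer_lock`, the sum of the two balances is the same before and after any number of concurrent transfers.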
What to do while waiting?
• Spinning
  - Waiting threads keep testing a location until it changes value
  - Not very efficient in uniprocessor systems
• Blocking
  - OS or run-time (RT) system de-schedules waiting threads
• Spinning vs. blocking becomes an issue in multiprocessor systems
Deadlock
[Figure, repeated over three slides as an animation: "Lock A" and "Lock B" requests on two resources, A and B]
Deadlock (cont’d)
• Deadlock can occur whenever multiple parties are competing for exclusive access to multiple resources
• How can we avoid deadlocks?
  - Deadlock prevention
      How? See a textbook …
      Expensive
      What to do when we discover a deadlock is about to happen?
  - Deadlock detection and recovery
      How to detect? How to recover?
      Potentially expensive
  - Impose strict ordering on locks
      e.g., if need to lock both A and B, always lock A first, then lock B
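The strict-ordering rule can be sketched with one mutex per account and a fixed global order. Ordering by address is an assumption made here for illustration (any total order on the locks would do), and `struct account`, `lock_pair`, and `transfer_ordered` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Lock two distinct accounts in a fixed global order (here: by address),
 * so two threads can never each hold one lock while waiting for the other. */
static void lock_pair(struct account *x, struct account *y) {
    if ((uintptr_t)x < (uintptr_t)y) {
        pthread_mutex_lock(&x->lock);
        pthread_mutex_lock(&y->lock);
    } else {
        pthread_mutex_lock(&y->lock);
        pthread_mutex_lock(&x->lock);
    }
}

/* Unlock order does not matter for deadlock avoidance. */
static void unlock_pair(struct account *x, struct account *y) {
    pthread_mutex_unlock(&x->lock);
    pthread_mutex_unlock(&y->lock);
}

/* Transfer `amount` from `from` to `to`; the accounts must be distinct. */
void transfer_ordered(struct account *from, struct account *to, long amount) {
    assert(from != to);
    lock_pair(from, to);
    from->balance -= amount;
    to->balance += amount;
    unlock_pair(from, to);
}
```

A transfer from A to B and a simultaneous transfer from B to A both acquire the same lock first, so neither can end up holding one lock while waiting for the other.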
Semaphores
• Synchronized counting variables
• Formally, a semaphore consists of:
  - An integer value
  - Two operations: P() and V()
• P()
  - While value = 0, sleep
  - Decrement value and return
• V()
  - Increments value
  - If there are any threads sleeping while waiting for value to become non-zero, wake up at least 1 thread
• Used around critical sections to implement “locks” (or …)
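P() and V() as described can be sketched on top of a pthread mutex and condition variable. This is one possible construction, not necessarily how any particular OS implements semaphores; `struct semaphore`, `sem_setup`, `sem_P`, `sem_V`, and the counter demonstration are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct semaphore {
    int value;
    pthread_mutex_t lock;      /* protects value */
    pthread_cond_t nonzero;    /* signalled when value becomes non-zero */
};

void sem_setup(struct semaphore *s, int initial) {
    assert(initial >= 0);
    s->value = initial;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
}

/* P(): while value = 0, sleep; then decrement value and return. */
void sem_P(struct semaphore *s) {
    pthread_mutex_lock(&s->lock);
    while (s->value == 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->value--;
    pthread_mutex_unlock(&s->lock);
}

/* V(): increment value and wake one sleeping thread, if any. */
void sem_V(struct semaphore *s) {
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}

/* Demonstration: a semaphore initialised to 1, used as a lock. */
static struct semaphore mutex_sem;
static long counter;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        sem_P(&mutex_sem);
        counter++;             /* critical section */
        sem_V(&mutex_sem);
    }
    return NULL;
}

/* Two threads increment the counter 10000 times each; returns the total. */
long run_counter_demo(void) {
    pthread_t t1, t2;
    sem_setup(&mutex_sem, 1);
    counter = 0;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

The `while` loop in `sem_P` rechecks the value after every wakeup, which matches the "while value = 0, sleep" wording and tolerates spurious wakeups.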
Condition variables
• A condition variable is always associated with:
  - A condition
  - A lock
• Typically used to wait for the condition to take on a given value
• Three operations:
  - cond_wait(lock, cond_var)
  - cond_signal(cond_var)
  - cond_broadcast(cond_var)
Condition variables
• cond_wait(lock, cond_var)
  - Release the lock
  - Sleep on cond_var
  - When wakened by the system, re-acquire the lock and return
• cond_signal(cond_var)
  - If at least 1 thread is sleeping on cond_var, wake 1 up
  - Otherwise, no effect
• cond_broadcast(cond_var)
  - If at least 1 thread is sleeping on cond_var, wake everyone up
  - Otherwise, no effect
Condition variables
• Condition variables are implemented using locks
• Implementation is tricky because it involves multiple locks and a scheduling queue
• Implemented in the OS or run-time thread systems because they involve scheduling operations
  - Sleep/Wake
POSIX threads (pthreads)
• Thread creation and termination
    pthread_create(&tid, NULL, start_fn, arg);
    pthread_exit(status);
• Thread join
    pthread_join(tid, &status);
• Mutual exclusion
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
• Condition variable
    pthread_cond_wait(&c, &lock);
    pthread_cond_signal(&c);
Memory Management
Memory hierarchy
[Figure: hierarchy, top to bottom: Registers, Cache, Memory]
• Question: What if we want to support programs that require more memory than is available in the system?
Memory hierarchy (2)
[Figure: hierarchy, top to bottom: Registers, Cache, Memory, Virtual Memory]
• Answer: Pretend we had something bigger → Virtual Memory
Virtual memory: paging
• A page is a cacheable unit of virtual memory
• The OS controls the mapping between pages of VM and “real” memory
  - More flexible (at a cost)
[Figure: two pairs, Cache/Memory and Memory/VM; a page of VM is held in a frame of Memory]
Two views of memory
• View from the hardware: physical memory
• View from the software: what the program sees
• Memory management in the OS coordinates these two views
  - Consistency: all address spaces can look “basically the same”
  - Relocation: processes can be loaded at any physical address
  - Protection: a process cannot maliciously access memory belonging to another process
  - Sharing: may allow sharing of physical memory (must implement control)
Virtual Memory
• Virtual memory is the OS abstraction that gives the programmer the illusion of an address space that may be larger than the physical address space
• Virtual memory can be implemented using either paging or segmentation, but paging is presently most common
• Virtual memory is motivated by both
  - Convenience: the programmer does not have to deal with the fact that individual machines may have very different amounts of physical memory, or with the sharing of memory among many users
  - Fragmentation in multi-programming environments
Hardware translation
• Translation from logical to physical can be done in software but without protection
• Hardware support is needed to ensure protection
• Simplest solution with two registers: base and size
[Figure: Processor → translation box (MMU) → Physical memory]
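The base-and-size scheme amounts to one comparison and one addition per access. A minimal sketch, assuming 32-bit addresses and hypothetical names (`struct relocation`, `translate_base_size`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct relocation {
    uint32_t base;   /* physical address where the process is loaded */
    uint32_t size;   /* length of the process's address space in bytes */
};

/* Translate a logical address. Returns 0 on success, or -1 on a
 * protection violation (logical address outside [0, size)). */
int translate_base_size(const struct relocation *r, uint32_t logical,
                        uint32_t *physical) {
    assert(r != NULL && physical != NULL);
    if (logical >= r->size)
        return -1;               /* hardware would raise a trap here */
    *physical = r->base + logical;
    return 0;
}
```

With base 0x4000 and size 0x1000, logical address 0x10 maps to physical 0x4010, while logical address 0x1000 is rejected.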
Paging hardware
• Pages are of fixed size
• The physical memory corresponding to a page is called a page frame
• Translation done through a page table indexed by page number
• Each entry in a page table contains the physical frame number that the virtual page is mapped to and the state of the page in memory
• State: valid/invalid, access permission, reference bit, modified bit, caching
• Paging is transparent to the programmer
[Figure: a virtual address is split into page # and offset; the page # indexes the page table, and the result is combined (+) with the offset to give the physical address]
Address translation
[Figure: the CPU issues virtual address (p, d); p indexes the page table to obtain frame f; physical address (f, d) goes to Memory]
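The (p, d) → (f, d) translation in the figure can be sketched as follows. The 4 KiB page size, the 16-page table, and the names `struct pte` and `translate` are assumptions made for the example; real page table entries carry more state bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                    /* assumed 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16u                    /* tiny address space for the sketch */

struct pte {
    uint32_t frame;      /* physical frame number f */
    unsigned valid : 1;  /* 1 if the page is present in memory */
};

/* Split the virtual address into page number p and offset d, look up
 * frame f in the page table, and form the physical address from (f, d).
 * Returns 0 on success, -1 where real hardware would raise a page fault. */
int translate(const struct pte table[NUM_PAGES], uint32_t vaddr,
              uint32_t *paddr) {
    uint32_t p = vaddr >> PAGE_SHIFT;
    uint32_t d = vaddr & (PAGE_SIZE - 1);
    assert(paddr != NULL);
    if (p >= NUM_PAGES || !table[p].valid)
        return -1;
    *paddr = (table[p].frame << PAGE_SHIFT) | d;
    return 0;
}
```

For example, if page 2 maps to frame 7, virtual address 0x2ABC translates to physical address 0x7ABC: the offset 0xABC is carried over unchanged.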
Translation Lookaside Buffers
• Translation on every memory access: must be fast
• What to do? Caching, of course …
  - Why does caching work? That is, we still have to look up the page table entry and use it to do translation, right?
  - Same as normal memory cache: cache is smaller so can spend more $$ to make it faster
• Cache for page table entries is called the Translation Lookaside Buffer (TLB)
  - Typically fully associative
  - No more than 64 entries
• Each TLB entry contains a page number and the corresponding PT entry
• On each memory access, we look for the page → frame mapping in the TLB
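A fully associative lookup compares the page number against every entry. Hardware does the comparisons in parallel; the loop below is only a software model. The 4-entry size, the round-robin fill, and the names `struct tlb`, `tlb_lookup`, and `tlb_fill` are assumptions for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 4u    /* real TLBs are larger; 4 keeps the sketch small */

struct tlb_entry {
    uint32_t page;        /* virtual page number (the tag) */
    uint32_t frame;       /* cached translation */
    unsigned valid : 1;
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    unsigned next;            /* next slot to overwrite (round-robin) */
    unsigned hits, misses;    /* statistics */
};

/* Fully associative lookup: compare the page number with every entry.
 * Returns 1 and sets *frame on a hit, 0 on a miss. */
int tlb_lookup(struct tlb *t, uint32_t page, uint32_t *frame) {
    assert(t != NULL && frame != NULL);
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].page == page) {
            *frame = t->e[i].frame;
            t->hits++;
            return 1;
        }
    }
    t->misses++;
    return 0;
}

/* After a miss, install the translation fetched from the page table,
 * overwriting entries in round-robin order. */
void tlb_fill(struct tlb *t, uint32_t page, uint32_t frame) {
    t->e[t->next] = (struct tlb_entry){ page, frame, 1 };
    t->next = (t->next + 1) % TLB_ENTRIES;
}
```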
Address translation
[Figure: the CPU issues virtual address (p, d); p is looked up in the TLB, whose p/f entries yield frame f; physical address (f, d) goes to Memory]
TLB miss
• What if the TLB does not contain the appropriate PT entry?
  - TLB miss
  - Evict an existing entry if there are no free ones
      Replacement policy?
  - Bring in the missing entry from the PT
• TLB misses can be handled in hardware or software
  - Software allows application to assist in replacement decisions
Where to store address space?
• Address space may be larger than physical memory
• Where do we keep it?
• Where do we keep the page table?
Where to store address space?
• On the next device down our storage hierarchy, of course …
[Figure: hierarchy: Memory, VM, Disk]
Where to store page table?
• In memory, of course …
[Figure: memory layout: OS (holding the P0 Page Table and P1 Page Table), Code, Globals, Stack, Heap]
• Interestingly, use memory to “enlarge” view of memory, leaving LESS physical memory
• This kind of overhead is common
• Got to know what the right trade-off is
• Have to understand common application characteristics
• Have to be common enough!
Page table structure
• Page table can become huge
• What to do?
  - Two-level PT: saves memory but requires two lookups per access
  - Page the page tables
  - Inverted page tables (one entry per page frame in physical memory): translation through hash tables
[Figure: a Master PT points to 2nd-level PTs; in the OS segment, the Kernel PT is non-page-able while the P0 PT and P1 PT are page-able]
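For the two-level case, the two lookups come from splitting the virtual address into two indices and an offset. The 10/10/12-bit split of a 32-bit address used below is an assumption for illustration, and `struct va_parts`, `split_va`, and `join_va` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a 32-bit virtual address: 10-bit master index,
 * 10-bit second-level index, 12-bit offset. */
struct va_parts {
    uint32_t master;   /* index into the master page table */
    uint32_t second;   /* index into the selected second-level table */
    uint32_t offset;   /* byte offset within the page */
};

/* Break a virtual address into its three fields. */
struct va_parts split_va(uint32_t vaddr) {
    struct va_parts p;
    p.master = vaddr >> 22;
    p.second = (vaddr >> 12) & 0x3FFu;
    p.offset = vaddr & 0xFFFu;
    return p;
}

/* Inverse of split_va, useful for checking the field boundaries. */
uint32_t join_va(struct va_parts p) {
    assert(p.master < 1024u && p.second < 1024u && p.offset < 4096u);
    return (p.master << 22) | (p.second << 12) | p.offset;
}
```

Under this layout, each second-level table covers 4 MiB of address space, so a sparse address space needs only the second-level tables for regions it actually uses.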
How to deal with VM > RAM?
• If address space of each process is ≤ size of physical memory, then no problem
  - Still useful to deal with fragmentation
• When VM larger than physical memory
  - Part stored in memory
  - Part stored on disk
• How do we make this work?
Demand paging
• To start a process (program), just load the code page where the process will start executing
• As the process references memory (instructions or data) outside of the loaded page, bring pages in as necessary
• How to represent the fact that a page of VM is not yet in memory?
[Figure: VM holds pages A, B, C (pages 0, 1, 2); the page table maps page 0 to frame 1 with valid bit v and marks pages 1 and 2 invalid (i); Memory holds A; Disk holds B and C]
Page fault
• What happens when a process references a page marked as invalid in the page table?
  - Page fault trap
  - Check that reference is valid
  - Find a free memory frame
  - Read desired page from disk
  - Change valid bit of page to v
  - Restart instruction that was interrupted by the trap
• Is it easy to restart an instruction?
• What happens if there is no free frame?
Page fault (2)
• So, what can happen on a memory access?
  - TLB miss → read page table entry
  - TLB miss → read kernel page table entry
  - Page fault for necessary page of process page table
  - All frames are used → need to evict a page → modify a process page table entry
      TLB miss → read kernel page table entry
      Page fault for necessary page of process page table
      Uh oh, how deep can this go?
  - Read in needed page, modify page table entry, fill TLB
Cost of handling a page fault
• Trap, check page table, find free memory frame (or find victim) … about 200 - 600 μs
• Disk seek and read … about 10 ms
• Memory access … about 100 ns
• Page fault degrades performance by a factor of ~100,000!
  - And this doesn’t even count all the additional things that can happen along the way
• Better not have too many page faults!
• If want no more than 10% degradation, can only have 1 page fault for every 1,000,000 memory accesses
• OS had better do a great job of managing the movement of data between secondary storage and main memory
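The 1-in-1,000,000 figure follows from a weighted average of the two access costs. The sketch below spells out that arithmetic; the function names are hypothetical, and the model counts only the ~10 ms disk time for a fault, as the slide's round numbers do.

```c
#include <assert.h>

/* Effective access time in nanoseconds for a page-fault rate p
 * (faults per memory access):  (1 - p) * mem_ns + p * fault_ns. */
double effective_access_ns(double p, double mem_ns, double fault_ns) {
    assert(p >= 0.0 && p <= 1.0);
    return (1.0 - p) * mem_ns + p * fault_ns;
}

/* Largest fault rate that keeps the slowdown within `degradation`
 * (e.g. 0.10 for 10%). Solving
 *   (1 - p) * mem_ns + p * fault_ns <= (1 + degradation) * mem_ns
 * for p gives  p <= degradation * mem_ns / (fault_ns - mem_ns). */
double max_fault_rate(double degradation, double mem_ns, double fault_ns) {
    assert(fault_ns > mem_ns);
    return degradation * mem_ns / (fault_ns - mem_ns);
}
```

With 100 ns per memory access and 10 ms (10,000,000 ns) per fault, `max_fault_rate(0.10, 100.0, 1e7)` is about 1e-6, i.e. roughly one fault per million accesses, and at that rate the effective access time is about 110 ns.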
Page replacement
• What if there’s no free frame left on a page fault?
  - Free a frame that’s currently being used
  - Select the frame to be replaced (victim)
  - Write victim back to disk
  - Change page table to reflect that victim is now invalid
  - Read the desired page into the newly freed frame
  - Change page table to reflect that new page is now valid
  - Restart the faulting instruction
• Optimization: do not need to write victim back if it has not been modified (need dirty bit per page).
Page replacement (2)
• Highly motivated to find a good replacement policy
  - That is, when evicting a page, how do we choose the best victim in order to minimize the page fault rate?
• Is there an optimal replacement algorithm?
• If yes, what is the optimal page replacement algorithm?
• Let’s look at an example:
  - Suppose we have 3 memory frames and are running a program that has the following reference pattern
      7, 0, 1, 2, 0, 3, 0, 4, 2, 3
  - Suppose we know the reference pattern in advance ...
Page replacement (3)
• Suppose we know the access pattern in advance
    7, 0, 1, 2, 0, 3, 0, 4, 2, 3
• Optimal algorithm is to replace the page that will not be used for the longest period of time
• What’s the problem with this algorithm?
• Realistic policies try to predict future behavior on the basis of past behavior
  - Works because of locality
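The optimal policy can be simulated when the whole reference string is known, which is exactly what a real OS lacks. The sketch below counts faults for a given string and frame count; `opt_faults`, `next_use`, and the `MAX_FRAMES` limit are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 16   /* upper bound on frames for this sketch */

/* Index of the next use of `page` at or after position `from`,
 * or n if the page is never referenced again. */
static size_t next_use(const int *refs, size_t n, size_t from, int page) {
    for (size_t i = from; i < n; i++)
        if (refs[i] == page) return i;
    return n;
}

/* Count page faults under the optimal policy: on a fault with no free
 * frame, evict the page whose next use lies farthest in the future. */
size_t opt_faults(const int *refs, size_t n, size_t nframes) {
    int frames[MAX_FRAMES];
    size_t used = 0, faults = 0;
    assert(nframes > 0 && nframes <= MAX_FRAMES);
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < used && frames[j] != refs[i]; j++)
            ;
        if (j < used) continue;              /* hit */
        faults++;
        if (used < nframes) {                /* a free frame exists */
            frames[used++] = refs[i];
            continue;
        }
        size_t victim = 0, farthest = 0;
        for (j = 0; j < used; j++) {
            size_t nu = next_use(refs, n, i + 1, frames[j]);
            if (nu > farthest) { farthest = nu; victim = j; }
        }
        frames[victim] = refs[i];
    }
    return faults;
}
```

On the reference pattern 7, 0, 1, 2, 0, 3, 0, 4, 2, 3 with 3 frames, this simulation reports 6 page faults.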
FIFO
• First-in, First-out
  - Be fair, let every page live in memory for about the same amount of time, then toss it.
• What’s the problem?
  - Is this compatible with what we know about behavior of programs?
• How does it do on our example?
    7, 0, 1, 2, 0, 3, 0, 4, 2, 3
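A small simulation makes the example concrete. This is a sketch with hypothetical names (`fifo_faults`, `MAX_FRAMES`); it counts faults only and does not model disk traffic.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 16   /* upper bound on frames for this sketch */

/* Count page faults under FIFO replacement. */
size_t fifo_faults(const int *refs, size_t n, size_t nframes) {
    int frames[MAX_FRAMES];
    size_t used = 0, oldest = 0, faults = 0;
    assert(nframes > 0 && nframes <= MAX_FRAMES);
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < used && frames[j] != refs[i]; j++)
            ;
        if (j < used) continue;          /* hit: FIFO ignores the access */
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];    /* fill frames in arrival order */
        } else {
            frames[oldest] = refs[i];    /* replace the oldest arrival */
            oldest = (oldest + 1) % nframes;
        }
    }
    return faults;
}
```

On the example pattern with 3 frames, this simulation reports 9 page faults: FIFO evicts page 0 just before it is referenced again, because arrival order says nothing about recent use.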
LRU
• Least Recently Used
  - On access to a page, timestamp it
  - When need to evict a page, choose the one with the oldest timestamp
  - What’s the motivation here?
• Is LRU optimal?
  - In practice, LRU is quite good for most programs
• Is it easy to implement?
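The timestamp scheme described above can be simulated directly. The sketch uses the position in the reference string as the timestamp; `lru_faults` and `MAX_FRAMES` are hypothetical names. Stamping every access is cheap in a simulation but costly on real hardware, which is one reason approximations are used in practice.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 16   /* upper bound on frames for this sketch */

/* Count page faults under LRU replacement. */
size_t lru_faults(const int *refs, size_t n, size_t nframes) {
    int frames[MAX_FRAMES];
    size_t stamp[MAX_FRAMES];   /* time of last access, per frame */
    size_t used = 0, faults = 0;
    assert(nframes > 0 && nframes <= MAX_FRAMES);
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < used && frames[j] != refs[i]; j++)
            ;
        if (j == used) {                 /* miss */
            faults++;
            if (used < nframes) {
                j = used++;              /* take a free frame */
            } else {
                j = 0;                   /* evict the oldest timestamp */
                for (size_t k = 1; k < used; k++)
                    if (stamp[k] < stamp[j]) j = k;
            }
            frames[j] = refs[i];
        }
        stamp[j] = i;                    /* timestamp on every access */
    }
    return faults;
}
```

On the reference pattern 7, 0, 1, 2, 0, 3, 0, 4, 2, 3 with 3 frames, this simulation reports 8 page faults.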
Not frequently used case
• Have a reference bit and software counter for each page frame
• At each clock interrupt, the OS adds the reference bit of each frame to its counter and then clears the reference bit
• When need to evict a page, choose frame with lowest counter
• What’s the problem?
  - Doesn’t forget anything, no sense of time: hard to evict a page that was referenced a lot sometime in the past but is no longer relevant to the computation
  - Updating counters is expensive, especially since memory is getting rather large these days
• Can be improved with an aging scheme: counters are shifted right before adding the reference bit and the reference bit is added to the leftmost bit (rather than to the rightmost one)
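The aging update is a shift and an OR per frame. A minimal sketch, assuming 8-bit counters and hypothetical names (`struct frame_info`, `age_tick`, `age_victim`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct frame_info {
    uint8_t age;         /* 8-bit aging counter (width is an assumption) */
    uint8_t referenced;  /* reference bit, set by hardware on access */
};

/* Run at each clock interrupt: shift every counter right, add the
 * reference bit at the leftmost position, then clear the bit. */
void age_tick(struct frame_info *f, size_t n) {
    for (size_t i = 0; i < n; i++) {
        f[i].age = (uint8_t)((f[i].age >> 1) | (f[i].referenced ? 0x80u : 0u));
        f[i].referenced = 0;
    }
}

/* Victim: the frame with the lowest counter. */
size_t age_victim(const struct frame_info *f, size_t n) {
    size_t v = 0;
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (f[i].age < f[v].age) v = i;
    return v;
}
```

Because each tick halves the old history, a reference in the most recent interval outweighs any number of references in earlier intervals, which gives the counter the sense of time that plain counting lacks.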
Clock (second-chance)
• Arrange physical pages in a circle, with a clock hand
• Hardware keeps 1 used bit per frame. Sets used bit on memory reference to a frame.
  - If bit is not set, hasn’t been used for a while
• On page fault:
  - Advance clock hand
  - Check used bit
      If 1, has been used recently, clear and go on
      If 0, this is our victim
• Can we always find a victim?
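The page-fault steps can be sketched as a loop over an array of used bits. The names `struct clock_state` and `clock_pick_victim` are hypothetical, and the sketch assumes no bits are set concurrently while the hand sweeps.

```c
#include <assert.h>
#include <stddef.h>

struct clock_state {
    unsigned char *used;  /* one used bit per frame, set on reference */
    size_t nframes;
    size_t hand;          /* current position of the clock hand */
};

/* On a page fault: advance the hand, clearing used bits, until a frame
 * with used == 0 is found. Under the no-concurrent-setting assumption the
 * loop ends within nframes + 1 steps, because every frame the hand
 * passes has its bit cleared. */
size_t clock_pick_victim(struct clock_state *c) {
    assert(c != NULL && c->nframes > 0);
    for (;;) {
        c->hand = (c->hand + 1) % c->nframes;   /* advance clock hand */
        if (c->used[c->hand])
            c->used[c->hand] = 0;               /* second chance */
        else
            return c->hand;                     /* our victim */
    }
}
```

Even when every used bit is 1, the hand clears them all on its first lap and picks the first frame it revisits, so in this model a victim is always found.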
Nth-chance
• Similar to clock algorithm, except
• Maintain a counter as well as a used bit
• On page fault:
  - Advance clock hand
  - Check used bit
      If 1, clear and set counter to 0
      If 0, increment counter; if counter < N, go on; otherwise, this is our victim
• Why?
  - N larger → better approximation of LRU
• What’s the problem if N is too large?
Multi-programming environment
• Why?
  - Better utilization of resources (CPU, disks, memory, etc.)
• Problems?
  - Mechanism: TLB?
  - Fairness?
  - Over-commitment of memory
• What’s the potential problem?
  - Each process needs its working set to perform well
  - If too many processes are running, can have thrashing
Support for multiple processes
• More than one address space can be loaded in memory
• A register points to the current page table
• OS updates the register when context switching between threads from different processes
• Most TLBs can cache entries from more than one PT
  - Store the process id to distinguish between virtual addresses belonging to different processes
• If TLB caches entries from only one PT then it must be flushed at process switch time
Sharing
[Figure: virtual address spaces of processes p1 and p2, linked to physical memory by v-to-p memory mappings]
Input / Output (I/O)
I/O Devices
• So far we have talked about how to abstract and manage the CPU and memory
• Computation “inside” a computer is useful only if some results are communicated “outside” of the computer
• I/O devices are the computer’s interface to the outside world (I/O ≡ Input/Output)
  - Example devices: display, keyboard, mouse, speakers, network interface, and disk
Basic Computer Structure
[Figure: CPU and Memory on a memory bus; an I/O bus connects the disk and the network interface]
Intel SR440BX Motherboard
[Figure: CPU; System Bus & MMU/AGP/PCI Controller; I/O Bus with IDE Disk Controller and USB Controller; another I/O Bus with Serial & Parallel Ports and Keyboard & Mouse]
CPU and I/O Device Communication
• CPU/Memory ⇒ I/O devices
• How does the CPU communicate with I/O devices?
  - Send/receive messages?
  - Memory map
      Each I/O device assigned a portion of the physical address space
• CPU → I/O device
  - CPU writes to locations in this area to "talk" to I/O device
• I/O device → CPU
  - Polling: CPU repeatedly checks location(s) in portion of address space assigned to device
  - Interrupt: device sends an interrupt (on an interrupt line) to get the attention of the CPU
• CPU writing to (or reading from) the address range of a device is called programmed I/O
Programmed I/O vs. DMA (1)
• Programmed I/O is O.K. for sending commands, receiving status, and communication of a small amount of data
• Inefficient for large amounts of data, however
  - Keeps CPU busy (doing useless work) during the transfer
  - Programmed I/O ≡ memory operations → slow
• Direct Memory Access
  - Device reads/writes directly from/to memory
  - Memory → device typically initiated from CPU
  - Device → memory can be initiated by either the device or the CPU
Programmed I/O vs. DMA (2)
[Figure: three panels, each showing CPU, Memory, and Disk on an Interconnect: Programmed I/O; DMA Device → Memory; DMA Memory → Device]
04.784 OSDI Overview 101
Device Drivers
OS module controlling an I/O device
Hides the device specifics from the higher layers in the OS/kernel
  Supports a common API
  UNIX: block or character device
    Block: device communicates with the CPU/memory in fixed-size blocks
    Character: stream of bytes
Translates logical I/O into device I/O
  E.g. logical disk blocks into {cylinder, head, & sector}
Performs data buffering and scheduling of I/O operations
Structure
  Several synchronous entry points (system calls): device initialization, queue I/O requests, state control, read/write
  An asynchronous entry point to handle interrupts
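The logical-block to {cylinder, head, sector} translation can be sketched with the conventional arithmetic below. The geometry (16 heads, 63 sectors per track) and the function names are assumptions for illustration; a real driver would take the geometry from the device.

```python
# Hypothetical disk geometry, chosen only for illustration.
HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63


def lba_to_chs(lba, heads=HEADS_PER_CYLINDER, sectors=SECTORS_PER_TRACK):
    """Translate a logical block number into (cylinder, head, sector).

    Cylinders and heads count from 0; sectors conventionally count from 1.
    """
    if lba < 0:
        raise ValueError("logical block number must be non-negative")
    cylinder = lba // (heads * sectors)
    head = (lba // sectors) % heads
    sector = lba % sectors + 1
    return cylinder, head, sector


def chs_to_lba(cylinder, head, sector, heads=HEADS_PER_CYLINDER, sectors=SECTORS_PER_TRACK):
    """Inverse translation, used here as a consistency check."""
    return (cylinder * heads + head) * sectors + (sector - 1)
```

For example, with this geometry lba_to_chs(63) is the first sector on the second head of cylinder 0, because one track holds exactly 63 sectors.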
04.784 OSDI Overview 102
I/O Buffering
I/O Transfer – DMA
  After an I/O request is placed, the source/destination of the I/O transfer (i.e. a buffer) must be “page-fixed”/“pinned” in memory
  To allow the user process to continue (when possible), data is often copied from user address space to kernel buffers which are also pinned in memory
    OK for write, not for read (no concurrency since waiting for input)
    Copying is expensive (and long block time for read)
      This is the motivation for “asynchronous I/O”
Devices are typically slow compared to CPU
  How do we speed up accesses? Caching, of course …
I/O buffering
  Buffer cache: a buffer in main memory for block devices
  Character queue: follows the producer/consumer model (characters in the queue are read once)
04.784 OSDI Overview 103
Buffer Cache
When an I/O request is made for a block, the buffer cache is checked first
If the data is missing from the cache, it is read into the buffer cache from the device
Exploits locality of reference as does any other cache
Replacement policies similar to those for VM
UNIX
  Historically, UNIX has a buffer cache for the disk which does not share buffers with character/stream devices
  Unfortunately adds overhead in a path that has become increasingly common: disk → NIC
    E.g. file service
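A minimal sketch of the check-cache-first behaviour, assuming LRU as the replacement policy (the slides only say the policies resemble those for VM). BufferCache and the read_block callback are hypothetical names, not a real kernel interface.

```python
from collections import OrderedDict


class BufferCache:
    """Block cache with LRU replacement (one possible policy, assumed here)."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block  # callable: block number -> data (the "device")
        self.blocks = OrderedDict()   # oldest entry first, most recently used last
        self.hits = 0
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:
            # Hit: no device access, just refresh the block's recency.
            self.hits += 1
            self.blocks.move_to_end(block_no)
            return self.blocks[block_no]
        # Miss: read from the device into the cache, evicting if over capacity.
        self.misses += 1
        data = self.read_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used block
        return data
```

With a capacity of two blocks, the access sequence 1, 2, 1, 3, 2 produces one hit and four device reads, since reading block 3 evicts block 2 just before it is needed again.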
04.784 OSDI Overview 104
File Systems
File system is an abstraction of the disk
  Track/sector → files
To a user process
  A file looks like a contiguous block of bytes
  A file system provides a coherent view of a group of files
    Typically also provides protection
API: create, delete, read, write files
Performance: throughput vs. response time
Reliability: goal is to minimize the potential for lost or destroyed data
  E.g. RAID could be implemented in the OS as part of the disk device driver
04.784 OSDI Overview 105
Unix File System
Ordinary files (uninterpreted byte streams)
Directories
  “File of files”
  Organized as a rooted “tree” (actually a DAG)
  Pathnames (relative and absolute)
  Contains links to parent and itself as well as contained files
  Multiple links to files can exist
    Link - hard OR symbolic
Typically tree-structured file hierarchies
  Mounted on existing space by using ‘mount’
  No hard links between different file systems (symbolic links may cross them)
04.784 OSDI Overview 106
Unix File System (2)
[Figure: directory tree under the root directory / with bin, usr, lib, and tmp; libc.a is reachable as /usr/lib/libc.a or /lib/libc.a]
Basically a tree, but links convert it to a DAG (no cycles!)
04.784 OSDI Overview 107
UNIX File System (3)
[Figure: mapping file systems to disks; labels: root, swap, bin, usr, usr2; logical file system; file systems; logical disks; physical disks]
04.784 OSDI Overview 108
File Naming
Each file has a unique name
User visible (external) name must be symbolic
  In a hierarchical file system, unique external names are given as pathnames (path from the root to the file)
Internal names: i-node in UNIX - an index into a persistent array of file descriptors/headers for a specific partition
Directory: translation from external to internal names
  May have more than one external name for a single internal name (i.e. “name service”)
Information about a file is split between the directory and the file descriptor: size, location on disk, owner, permissions, date created, date last modified, date last access, link count
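The directory's role as a translator from external names to internal names can be sketched as a walk over a toy i-node table. The table contents, the i-node numbers, and the resolve function are invented for illustration and do not reflect any on-disk format.

```python
ROOT_INO = 2  # hypothetical i-node number for the root directory

# Toy i-node table: internal name (i-node number) -> file header.
# Directories map external names to i-node numbers; names live only in directories.
inodes = {
    2: {"type": "dir", "entries": {"usr": 3, "lib": 4}},
    3: {"type": "dir", "entries": {"lib": 6}},
    4: {"type": "dir", "entries": {"libc.a": 5}},
    6: {"type": "dir", "entries": {"libc.a": 5}},
    5: {"type": "file", "size": 1024, "link_count": 2},
}


def resolve(path):
    """Translate an absolute pathname into an i-node number, one component at a time."""
    if not path.startswith("/"):
        raise ValueError("this sketch handles absolute pathnames only")
    ino = ROOT_INO
    for name in (part for part in path.split("/") if part):
        node = inodes[ino]
        if node["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in node["entries"]:
            raise FileNotFoundError(path)
        ino = node["entries"][name]
    return ino
```

Here /usr/lib/libc.a and /lib/libc.a are two external names for the same internal name, which is why the file's header records a link count of 2.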
04.784 OSDI Overview 109
Name Space
In UNIX, “devices are also treated as files”
  E.g. /dev/cdrom, /dev/fd0
  User process accesses devices by accessing the corresponding file in /dev
    Normally hidden from higher level Unix programs
[Figure: directory tree with root / containing dev, A, and B; dev contains ttyX and CDROM]
04.784 OSDI Overview 110
File Allocation
Contiguous: a contiguous set of blocks is allocated to a file at the time of file creation
  Good for sequential files
  File size must be known at the time of file creation
  External fragmentation – like memory allocation when giving a contiguous block to each job
Hmm, so what do we do?
  Use a disk block table (remember the page table?)
  Use indexed allocation to avoid the problem
    No/little fragmentation
    Very flexible - no need to know sizes a priori and can change size dynamically
04.784 OSDI Overview 111
Free Space Management
No policy issues here – just mechanism
Bitmap: one bit for each block on the disk
  Good to find a contiguous group of free blocks
    Files are often accessed sequentially
  Small enough to be kept in memory and therefore fast!
Chained free portions: pointer to the next one
  Not so good for sequential access but very flexible
Index: treats free space as a file from which allocations are made to create/expand other files
  What is the difference in representation between a file that contains useful data and one that does not (i.e. contains free space)? Nothing!
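A minimal sketch of the bitmap approach, showing why it suits finding a contiguous group of free blocks. The BlockBitmap class and its first-fit search are assumptions for illustration; the slides do not prescribe a search strategy.

```python
class BlockBitmap:
    """One bit per disk block; a set bit means the block is in use."""

    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.bits = bytearray((n_blocks + 7) // 8)  # 8 blocks tracked per byte

    def is_used(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def set_used(self, block, used=True):
        if used:
            self.bits[block // 8] |= 1 << (block % 8)
        else:
            self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF

    def allocate_run(self, length):
        """First fit: mark `length` contiguous free blocks used and return the first.

        Returns None when no run of that length exists. `length` must be >= 1.
        """
        run_start, run_len = 0, 0
        for block in range(self.n_blocks):
            if self.is_used(block):
                run_start, run_len = block + 1, 0
                continue
            run_len += 1
            if run_len == length:
                for b in range(run_start, run_start + length):
                    self.set_used(b)
                return run_start
        return None
```

The space cost is the point of the "small enough to be kept in memory" remark: at one bit per block, a disk of 16 blocks needs 2 bytes, and the ratio stays at 1/8 byte per block for any disk size.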
04.784 OSDI Overview 112
UNIX I-nodes
[Figure: a directory entry (file name) refers to an i-node holding mode, link count, uid, gid, size, times, disk block pointers (disk block 1, 2, 3, ..., 10, 11, 12), and single, double, and triple indirect pointers leading to data blocks]
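How a logical block number within a file maps onto direct and indirect pointers can be sketched as follows. The figures used here (12 direct pointers, 4096-byte blocks, 4-byte block pointers) are assumptions for illustration; actual values vary between UNIX file systems.

```python
# Assumed parameters; real file systems differ.
N_DIRECT = 12
BLOCK_SIZE = 4096
POINTER_SIZE = 4
PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # pointers held by one indirect block


def locate(n, n_direct=N_DIRECT, ppb=PTRS_PER_BLOCK):
    """Return which pointer chain reaches logical block n of a file.

    The result is a tuple: the level name followed by the index to follow
    at each level of indirection.
    """
    if n < 0:
        raise ValueError("block number must be non-negative")
    if n < n_direct:
        return ("direct", n)
    n -= n_direct
    if n < ppb:
        return ("single", n)
    n -= ppb
    if n < ppb ** 2:
        return ("double", n // ppb, n % ppb)
    n -= ppb ** 2
    if n < ppb ** 3:
        return ("triple", n // ppb ** 2, (n // ppb) % ppb, n % ppb)
    raise ValueError("block number exceeds the maximum file size")


def max_file_blocks(n_direct=N_DIRECT, ppb=PTRS_PER_BLOCK):
    """Largest file, in blocks, that this pointer layout can address."""
    return n_direct + ppb + ppb ** 2 + ppb ** 3
```

Small files are reached through direct pointers alone, so the common case costs no extra disk reads; only large files pay for one, two, or three levels of indirection.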
04.784 OSDI Overview 113
UNIX I-nodes (2)
[Figure: in memory, the file descriptor tables of process i, process j, and process k (with parent and child marked) refer to entries in the open file descriptor table; each entry holds R/W pointers and an i-node pointer to the i-nodes of active files; the i-nodes themselves reside on disk(s)]
04.784 OSDI Overview 114
File System Buffer Cache
[Figure: layered view]
application: read/write files
OS: translate file to disk blocks
  maintains ... buffer cache ...
  controls disk accesses: read/write blocks
hardware:
04.784 OSDI Overview 115
File System Buffer Cache
Disks are “stable” while memory is volatile
  What happens if you buffer a write and the machine crashes before the write has been saved to disk?
  Can use write-through but write performance will suffer
    Greater write traffic to slow disk
In UNIX
  Use unbuffered I/O when writing i-nodes or pointer blocks
  Use buffered I/O for other writes and force sync every 30 seconds
What about replacement?
How can we further improve performance?
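The trade-off between write-through and buffered writes with a periodic sync can be sketched with a toy model that counts disk writes. WriteCache and its crash method are hypothetical; the dictionary named disk merely stands in for stable storage.

```python
class WriteCache:
    """Toy write cache: write-through or write-back (buffered) with explicit sync."""

    def __init__(self, write_through):
        self.write_through = write_through
        self.disk = {}        # block number -> data that reached "stable" storage
        self.cache = {}       # block number -> newest data, held in volatile memory
        self.dirty = set()    # blocks modified in the cache but not yet on disk
        self.disk_writes = 0

    def _write_to_disk(self, block_no):
        self.disk[block_no] = self.cache[block_no]
        self.disk_writes += 1

    def write(self, block_no, data):
        self.cache[block_no] = data
        if self.write_through:
            self._write_to_disk(block_no)  # every write goes to the slow disk
        else:
            self.dirty.add(block_no)       # defer until the next sync

    def sync(self):
        """Flush all dirty blocks, as a periodic sync would."""
        for block_no in sorted(self.dirty):
            self._write_to_disk(block_no)
        self.dirty.clear()

    def crash(self):
        """Model a crash: everything in volatile memory is lost."""
        self.cache.clear()
        self.dirty.clear()
```

Five writes to one block cost five disk writes under write-through but only one after a sync under buffering; the price is that anything written since the last sync disappears if crash() happens first.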
04.784 OSDI Overview 116
File System Consistency
File systems almost always use a buffer/disk cache for performance reasons
Two copies of a disk block (in the buffer cache and on disk) → consistency problem if the system crashes before all the modified blocks are written back to disk
This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks
  This is why we have utility programs for checking block and directory consistency and making repairs after system crashes
Write back critical blocks from the buffer cache to the disk immediately
Data blocks are also written back periodically: sync
04.784 OSDI Overview 117
More on File System Consistency
To maintain file system consistency the ordering of updates from the buffer cache to the disk is critical
Example: if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent
An elaborate solution: use dependencies between blocks containing control data in the buffer cache to specify the ordering of updates
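One way to turn block dependencies into a safe write-back order is a topological sort, sketched below with Python's standard graphlib module (Python 3.9 or later). The block names and the dependency on a free-list block are invented for illustration, and real systems that track such dependencies are considerably more involved.

```python
from graphlib import TopologicalSorter


def flush_order(must_be_written_first):
    """Return an order in which dirty blocks can be written back safely.

    `must_be_written_first` maps each block to the set of blocks that must
    reach the disk before it does.
    """
    return list(TopologicalSorter(must_be_written_first).static_order())


# Hypothetical dirty control blocks after creating a file:
# the directory block must not reach the disk before the i-node block it names.
dependencies = {
    "directory_block": {"inode_block"},
    "inode_block": {"free_list_block"},
    "free_list_block": set(),
}
order = flush_order(dependencies)
```

If the system crashes partway through writing in this order, the disk may hold an i-node that no directory names yet, but never a directory entry pointing at an i-node that was not written.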
04.784 OSDI Overview 118
Elements of storage management
[Figure: elements of storage management, labels: users, file structure, records, directory management, access control, access methods, file manipulation, buffering, block caches, file allocation, free space management, disk scheduling, controller caches]
04.784 OSDI Overview 119
Protection Mechanisms (1)
Files are OS objects, with unique names and a finite set of operations that processes can perform on them
A protection domain is a set of {object, rights} pairs, where rights is the permission to perform one of the operations
At each instant in time, each process runs in some protection domain
In Unix, a protection domain is identified by {uid, gid}
The protection domain in Unix is switched when running a program with SETUID/SETGID set or when the process enters kernel mode by issuing a system call
Fundamental Issue: How to manage all the protection domains?
04.784 OSDI Overview 120
Protection Mechanisms (2)
Access Control List (ACL): associates with each object a list of all the protection domains that may access the object and what they may do
  In Unix the ACL concept for files is reduced to three protection domains: owner, group and others
    Much smaller and therefore easier to manage
Capability List (C-list): associates with each process a list of objects that may be accessed along with the operations permitted on them
  C-list implementation issues: where/how to store the capabilities (hardware, kernel, encrypted in user space) and how to revoke them and control their distribution to other processes
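The two approaches can be contrasted in a short sketch: the reduced Unix ACL is checked against the object's mode bits, while a C-list is checked against what the process holds. Both functions are simplified illustrations with hypothetical names; the Unix check ignores the superuser and any extended ACLs.

```python
def unix_may_access(mode, file_uid, file_gid, uid, gids, want):
    """Check the three-domain Unix ACL stored with the object.

    `mode` holds the nine rwx permission bits, `gids` is the set of groups the
    process belongs to, and `want` is 'r', 'w', or 'x'. Exactly one of the
    owner, group, or others classes applies, checked in that order.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if uid == file_uid:
        shift = 6      # owner bits
    elif file_gid in gids:
        shift = 3      # group bits
    else:
        shift = 0      # others bits
    return bool((mode >> shift) & bit)


def clist_may_access(c_list, obj, operation):
    """Check a capability list stored with the process: object -> permitted operations."""
    return operation in c_list.get(obj, set())
```

The lookups run in opposite directions: the ACL check starts from the file and asks which domain the caller falls into, while the C-list check starts from the process and asks whether it holds a capability for the object.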