CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support


Page 1: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

CMPT 431

Dr. Alexandra Fedorova

Lecture IV: OS Support

Page 2: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

2CMPT 431 © A. Fedorova

Outline

• Continue discussing OS support for threads and processes

• Alternative distributed systems architectures inspired by limitations of threads

• Support for IPC• Scalable synchronization

Page 3: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

3CMPT 431 © A. Fedorova

Process/Thread Support: Good Enough?

• Many computer scientists observed limited scalability of MT and MP architectures

Performance of a threaded web server

M. Welsh, SOSP ‘01

Page 4: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

4CMPT 431 © A. Fedorova

Alternative Web Services Architectures

• Alternative architectures for web services that rely less heavily on threads/processes:
– Single-Process Event-Driven (SPED)
– Asymmetric Multiprocess Event-Driven (AMPED)
– Staged Event-Driven Architecture (SEDA)

Page 5: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

5CMPT 431 © A. Fedorova

Web Services Architecture. Case Study: A Web server

• Sequence of actions at the web server

• Each step can block:
– Socket read/accept can block on network I/O
– File find/read can block for disk I/O
– Send can block on TCP buffer queue

• How do servers overlap blocking and computation?

V. Pai, USENIX ‘99

Page 6: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

6CMPT 431 © A. Fedorova

Multiprocess (MP) or Multithreaded (MT) Architecture: A Review

MP

• One process performs all steps for a request
• I/O and computation overlap naturally
• The OS switches to a new process when a process blocks

MT

• One thread performs all steps for a request
• I/O and computation overlap is possible using kernel threads (provided by modern OSs)

V. Pai, USENIX ‘99

Page 7: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

7CMPT 431 © A. Fedorova

Single-Process Event-Driven Architecture (SPED)

• A single process executes processing steps for all requests
• Uses non-blocking network and disk I/O system calls
• Uses the select system call to check on the status of those operations
• Problem #1: many OSs do not provide non-blocking system calls for disk I/O
• Problem #2: those that do, do not integrate them with select – cannot check for completion of network and disk I/O simultaneously

V. Pai, USENIX ‘99
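To make the SPED pattern concrete, here is a minimal sketch in C (not from the lecture; the helper name ready_read is hypothetical): mark a descriptor non-blocking, then use select to test readiness before reading, so the single process never blocks.

```c
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/* Read from fd only if select() says it is ready; otherwise return 0
 * so the event loop can go service other requests instead. */
int ready_read(int fd, char *buf, int len) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK); /* non-blocking I/O */
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    struct timeval tv = {0, 0};          /* poll only: select() must not block */
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return 0;                        /* not ready: do other work */
    return (int)read(fd, buf, len);      /* ready: this read will not block */
}
```

This is exactly where Problem #1 bites: on many OSs a disk file descriptor would still block in read regardless of O_NONBLOCK.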

Page 8: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

8CMPT 431 © A. Fedorova

Asymmetric Multiprocess Event Driven Architecture (AMPED)

• AMPED = MP + SPED
• Use SPED architecture for I/O operations with a non-blocking interface: socket read/write, accept
• Use MP architecture for I/O operations without a non-blocking interface: file read/write:
– mmap the file
– Use mincore to check if the file is in memory
– If not, spawn a helper process to bring the file into memory
– Communicate with the helper process via IPC

V. Pai, USENIX ‘99

• Flash – a web server implemented using AMPED (V. Pai, et al., USENIX ‘99)
• Matches or exceeds performance of existing web servers by up to 50%
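The mmap/mincore check above can be sketched in C (a sketch, not Flash's actual code; the function name is hypothetical, and it assumes the region checked fits in one page so a one-entry residency vector suffices):

```c
#include <stdio.h>
#include <sys/mman.h>

/* Return 1 if the file's first page is resident in memory (serve it
 * directly), 0 if not (AMPED would hand it to a helper process to
 * page it in), or -1 on error. */
int file_is_resident(FILE *f, long size) {
    unsigned char vec[1];                 /* one residency bit per page */
    void *m = mmap(NULL, (size_t)size, PROT_READ, MAP_SHARED, fileno(f), 0);
    if (m == MAP_FAILED)
        return -1;
    int resident = (mincore(m, (size_t)size, vec) == 0) && (vec[0] & 1);
    munmap(m, (size_t)size);
    return resident;
}
```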

Page 9: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

9CMPT 431 © A. Fedorova

Staged Event-Driven Architecture

• Observation: AMPED is good, but it is not easy to control application resources. E.g., which event to process first?

• SEDA: Create a stage for each logical step of processing; Manage each stage separately

• There is a queue of events for each stage, so you can tell how each stage is loaded

• Each stage can be processed by several (a small number of) threads
• Adaptive load shedding – manage queues to control load

– E.g., if the stage that involves disk I/O is the bottleneck, drop the queued up requests or reject new requests

• Dynamic control – adjust the number of threads per stage based on demand

M. Welsh, SOSP ‘01

Page 10: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

10CMPT 431 © A. Fedorova

Outline

• Continue discussing OS support for threads and processes

• Alternative distributed systems architectures inspired by limitations of threads

• Support for IPC
• Support for scalable synchronization
• Distributed operating systems

Page 11: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

11CMPT 431 © A. Fedorova

OS Support for Inter-Process Communication (IPC)

• Cooperating processes or threads need to communicate
• Threads share an address space, so they communicate via shared memory
• What about processes? They do not share an address space. They communicate via:
– Unix pipes
– Memory-mapped files
– Inter-process shared memory

Page 12: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

12CMPT 431 © A. Fedorova

Unix Pipes

A pipe is a communication channel between two processes

Using pipe in a shell:

prompt% cat log_file | grep “May 16”

[Diagram: cat writes into one end of the pipe; grep reads from the other]

Pipes can also be created using the pipe() system call
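A minimal sketch of that system call in C (the helper name pipe_roundtrip is illustrative): pipe() yields two file descriptors, fd[0] for reading and fd[1] for writing — the same channel the shell wires up between cat and grep.

```c
#include <string.h>
#include <unistd.h>

/* Create a pipe, push a message through it, and return the number of
 * bytes received on the read end. */
int pipe_roundtrip(const char *msg, char *out, int outlen) {
    int fd[2];
    if (pipe(fd) != 0)
        return -1;
    write(fd[1], msg, strlen(msg));        /* producer side (cat's role)  */
    int n = (int)read(fd[0], out, outlen); /* consumer side (grep's role) */
    close(fd[0]);
    close(fd[1]);
    return n;
}
```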

Page 13: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

13CMPT 431 © A. Fedorova

Implementation of Pipes

• In Solaris: a data structure containing two vnodes, a lock, and a buffer

[Diagram: the pipe structure – a lock, a buffer, and two fnodes, each pointing to a vnode]

• To the user, each end of the pipe is represented by a file descriptor

• The user reads/writes the pipe by reading/writing the file descriptor

• The OS blocks the process reading from an empty pipe

• The OS blocks a process writing into a full pipe (when the buffer is full)

Page 14: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

14CMPT 431 © A. Fedorova

Memory-mapped Files

[Diagram: the same file mapped into the address spaces of process A and process B]
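The idea can be sketched in C (a sketch; the helper name is hypothetical): map a file MAP_SHARED, fork, and observe that a write made through the child's mapping is visible through the parent's mapping — two address spaces communicating through one file.

```c
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one byte of a file MAP_SHARED, fork, let the child write through
 * its mapping, and check that the parent (a separate process after
 * fork) observes the write. Returns 1 on success.
 * fd: any writable file at least 1 byte long. */
int mapped_file_ipc(int fd) {
    char *m = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                    /* child: write through the mapping */
        m[0] = 'X';
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    int seen = (m[0] == 'X');          /* parent: sees it via the shared page */
    munmap(m, 1);
    return seen;
}
```

With MAP_PRIVATE instead of MAP_SHARED, the parent would not see the child's write — that flag is the whole design choice here.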

Page 15: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

15CMPT 431 © A. Fedorova

Inter-process Shared Memory

• Inter-process shared memory: a piece of physical memory set up to be shared among processes

• Allocate inter-process shared memory using shmget
• Attach to it (get permission to use it) via shmat
• Disadvantage: shared memory is not cleaned up automatically when processes exit; it must be cleaned up explicitly

Page 16: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

16CMPT 431 © A. Fedorova

Performance of IPC

• IPC involves inter-process context switching
• This is the expensive kind of context switch, because it involves switching address spaces
• The cost of a context switch determines the cost of IPC – it largely depends on the hardware

Page 17: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

17CMPT 431 © A. Fedorova

Outline

• Continue discussing OS support for threads and processes

• Alternative distributed systems architectures inspired by limitations of threads

• Support for IPC
• Support for scalable synchronization

Page 18: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

18CMPT 431 © A. Fedorova

Synchronization

Unsynchronized Access

Thread 1: perform a withdrawal

if (account_balance >= amount) {
    account_balance -= amount;
}

Thread 2: subtract the service fee

if (account_balance >= service_fee) {
    account_balance -= service_fee;
}

The account balance may change between Thread 1’s check and its update!!!

Synchronized Access

lock_acquire(account_balance_lock);
if (account_balance >= amount) {
    account_balance -= amount;
}
lock_release(account_balance_lock);

lock_acquire(account_balance_lock);
if (account_balance >= service_fee) {
    account_balance -= service_fee;
}
lock_release(account_balance_lock);
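The synchronized version above, written with POSIX mutexes (a sketch; lock_acquire/lock_release on the slide correspond roughly to pthread_mutex_lock/unlock):

```c
#include <pthread.h>

int account_balance = 100;
pthread_mutex_t account_balance_lock = PTHREAD_MUTEX_INITIALIZER;

/* The check and the update form one atomic critical section. */
void withdraw(int amount) {
    pthread_mutex_lock(&account_balance_lock);
    if (account_balance >= amount)       /* no other thread can change   */
        account_balance -= amount;       /* the balance between these    */
    pthread_mutex_unlock(&account_balance_lock);
}
```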

Page 19: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

19CMPT 431 © A. Fedorova

Synchronization Primitives (SP)

• Synchronization primitives provide atomic access to a critical section
• Types of synchronization primitives:
– mutex
– semaphore
– lock
– condition variable
– etc.
• Synchronization primitives are provided by the OS
• Can also be implemented by a library (e.g., pthreads) or by the application
• Hardware provides special atomic instructions for implementing synchronization primitives (test-and-set, compare-and-swap, etc.)

Page 20: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

20CMPT 431 © A. Fedorova

Implementation of SP

• Performance of applications that use SPs is determined by the implementation of the SP
• An SP must be scalable – it must continue to perform well as the number of contending threads increases

• We will look at several implementations of locks to understand how to create a scalable implementation

Page 21: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

21CMPT 431 © A. Fedorova

What should you do if you can’t get a lock?

• Keep trying
– “spin” or “busy-wait”
– Good if delays are short
• Give up the processor
– Good if delays are long
– Always good on a uniprocessor
• Systems usually use a combination:
– Spin for a while, then give up the processor

• We will focus on multiprocessors, so we’ll look at spinlock implementations

© Herlihy-Shavit 2007

Page 22: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

22CMPT 431 © A. Fedorova

A Shared Memory Multiprocessor

[Diagram: three processors, each with its own cache, connected by a bus to shared memory]

© Herlihy-Shavit 2007

Page 23: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

23CMPT 431 © A. Fedorova

Basic Spinlock

[Diagram: threads spin on the lock, enter the critical section (CS) one at a time, and reset the lock upon exit]

• The lock suffers from contention
• Sequential bottleneck – no parallelism

© Herlihy-Shavit 2007

Page 24: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

24CMPT 431 © A. Fedorova

Review: Test-and-Set

• We have a boolean value in memory• Test-and-set (TAS)

– Swap true with prior value– Return value tells if prior value was true or false

• Can reset just by writing false

© Herlihy-Shavit 2007

Page 25: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

25CMPT 431 © A. Fedorova

TAS

• Provided by the hardware
• Example on SPARC: ldstub (load-store unsigned byte), an assembly instruction that loads a byte from memory into a register and atomically writes the value 0xFF into the addressed byte
• TAS can also be implemented in a high-level language.
• Example in Java (swap old and new values):

public class AtomicBoolean {
    boolean value;

    public synchronized boolean getAndSet(boolean newValue) {
        boolean prior = value;
        value = newValue;
        return prior;
    }
}

© Herlihy-Shavit 2007

Page 26: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

26CMPT 431 © A. Fedorova

TAS Locks

• The value of the TAS’ed memory shows the lock state:
– Lock is free: value is false
– Lock is taken: value is true
• Acquire the lock by calling TAS:
– If the result is false, you win
– If the result is true, you lose
• Release the lock by writing false
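The same lock can be sketched in C with GCC/Clang atomic builtins (an assumption of this sketch: the compiler supports __atomic_test_and_set, which plays the role of ldstub or getAndSet):

```c
/* 0 = free (false), nonzero = taken (true); volatile so spinners re-read. */
typedef struct { volatile char state; } tas_lock;

void tas_acquire(tas_lock *l) {
    /* __atomic_test_and_set returns the prior value: false means we won. */
    while (__atomic_test_and_set(&l->state, __ATOMIC_ACQUIRE))
        ;                                 /* result was true: lost, spin */
}

void tas_release(tas_lock *l) {
    __atomic_clear(&l->state, __ATOMIC_RELEASE);  /* write false */
}
```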

Page 27: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

27CMPT 431 © A. Fedorova

TAS Lock in SPARC Assembly

spin_lock:
busy_loop:
        ldstub  [%o0], %o1   ! load old value into %o1, atomically write 0xFF at the address in %o0
        tst     %o1          ! test whether %o1 equals zero
        bne     busy_loop    ! if %o1 is not zero (old value was true), spin
        nop                  ! delay slot for branch
        retl
        nop                  ! delay slot for branch

Page 28: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

28CMPT 431 © A. Fedorova

TAS Lock in Java

class TASlock {
    // Initialize lock state to false (unlocked)
    AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        // While the lock is taken (true), spin
        while (state.getAndSet(true)) {}
    }

    void unlock() {
        // Release the lock – set state to false
        state.set(false);
    }
}

© Herlihy-Shavit 2007

Page 29: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

29CMPT 431 © A. Fedorova

Performance of TAS Lock

• Experiment:
– N threads on a multiprocessor
– Increment a shared counter 1 million times (total)
– A thread acquires a lock before incrementing the counter
– Each thread does 1,000,000/N increments
• N does not exceed the number of processors, so there is no thread-switching overhead

• How long should it take?• How long does it take?

Page 30: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

30CMPT 431 © A. Fedorova

Expected performance

[Figure: expected total time vs. number of threads – the ideal curve is flat: no speedup, because there is no parallelism. Thread 1 and Thread 2 alternate lock_acquire / increment / lock_release, so execution is the same as sequential execution]

© Herlihy-Shavit 2007

Page 31: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

31CMPT 431 © A. Fedorova

Actual Performance

[Figure: total time vs. number of threads – the TAS lock is much worse than ideal]

© Herlihy-Shavit 2007

Page 32: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

32CMPT 431 © A. Fedorova

Reasons for Bad TAS Lock Performance

• Has to do with cache behaviour on the multiprocessor system

• TAS causes a lot of invalidation misses– This hurts performance

• To understand what this means, let’s review how caches work

Page 33: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

33CMPT 431 © A. Fedorova

Processor Issues Load Request

[Diagram: a processor issues a load over the bus and memory supplies the data into its cache]

© Herlihy-Shavit 2007

Page 34: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

34CMPT 431 © A. Fedorova

Another Processor Issues Load Request

[Diagram: a second processor broadcasts “I want data” on the bus; the first cache answers “I got data” and supplies its copy]

© Herlihy-Shavit 2007

Page 35: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

35CMPT 431 © A. Fedorova

Processor Modifies Data

[Diagram: one processor modifies the data in its cache – now the other cached copies are invalid]

© Herlihy-Shavit 2007

Page 36: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

36CMPT 431 © A. Fedorova

Send Invalidation Message to Others

[Diagram: the writing cache broadcasts “Invalidate!” on the bus, and the other caches lose read permission. Memory need not be updated yet: other caches can provide valid data]

© Herlihy-Shavit 2007

Page 37: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

37CMPT 431 © A. Fedorova

Processor Asks for Data

[Diagram: an invalidated processor asks for the data again (“I want data”); the modified copy is supplied by the owning cache]

© Herlihy-Shavit 2007

Page 38: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

38CMPT 431 © A. Fedorova

Multiprocessor Caches: Summary

• Simultaneous reads and writes of shared data:
– make cached copies invalid
• Invalidation is bad for performance
• On the next data request:
– data must be fetched from another cache
• This slows down performance

Page 39: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

39CMPT 431 © A. Fedorova

What This Has to Do with TAS Locks

• Recall that TAS lock had bad performance

• Invalidations were the cause

[Figure: total time vs. number of threads – TAS lock vs. ideal]

• Here is why:
• All spinners do load/store in a loop
• They all read/write the same location
• They cause lots of invalidations

Page 40: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

40CMPT 431 © A. Fedorova

A Solution: Test-And-Test-And-Set Lock

• Wait until the lock “looks” free
– Spin on the local cache
– No bus use while the lock is busy

class TTASlock {
    AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        while (true) {
            // Wait until the lock looks free. We read the lock instead
            // of TASing it, so we avoid repeated invalidations.
            while (state.get()) {}
            // Now try to acquire it.
            if (!state.getAndSet(true)) return;
        }
    }
}

© Herlihy-Shavit 2007

Page 41: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

41CMPT 431 © A. Fedorova

TTAS Lock Performance

[Figure: total time vs. number of threads – the TTAS lock is better than TAS, but still far from ideal]

© Herlihy-Shavit 2007

Page 42: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

42CMPT 431 © A. Fedorova

The Problem with TTAS Lock

• When the lock is released:
– Everyone tries to acquire it
– Everyone does TAS
– There is a storm of invalidations
• Only one processor can use the bus at a time
• So all processors queue up, waiting for the bus, so they can perform the TAS

Page 43: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

43CMPT 431 © A. Fedorova

A Solution: TTAS Lock with Backoff

• Intuition: if I fail to get the lock, there must be contention
• So I should back off before trying again
• Introduce a random “sleep” delay before trying to acquire the lock again

© Herlihy-Shavit 2007
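A sketch of TTAS acquire with randomized exponential backoff in C (illustrative; it assumes GCC/Clang atomic builtins, and the window cap of 1024 is exactly the platform-sensitive tuning knob the next slide cautions about):

```c
#include <sched.h>
#include <stdlib.h>

typedef struct { volatile char state; } ttas_lock;

void backoff_acquire(ttas_lock *l) {
    int limit = 1;                           /* current backoff window */
    for (;;) {
        while (l->state)                     /* TTAS: spin on cached value */
            ;
        if (!__atomic_test_and_set(&l->state, __ATOMIC_ACQUIRE))
            return;                          /* won the race */
        int delay = rand() % limit + 1;      /* random delay in the window */
        while (delay--)
            sched_yield();                   /* "sleep" before retrying */
        if (limit < 1024)
            limit *= 2;                      /* contention: back off further */
    }
}

void backoff_release(ttas_lock *l) {
    __atomic_clear(&l->state, __ATOMIC_RELEASE);
}
```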

Page 44: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

44CMPT 431 © A. Fedorova

TTAS Lock with Backoff: Performance

[Figure: total time vs. number of threads – the backoff lock outperforms both the TAS and TTAS locks, approaching ideal]

© Herlihy-Shavit 2007

Page 45: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

45CMPT 431 © A. Fedorova

Backoff Locks

• Better performance than TAS and TTAS
• Caveats:
– Performance is sensitive to the choice of the delay parameter
– The delay parameter depends on the number of processors and their speed
– Easy to tune for one platform
– Difficult to write an implementation that works well across multiple platforms

© Herlihy-Shavit 2007

Page 46: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

46CMPT 431 © A. Fedorova

An Idea

• Avoid useless invalidations
– by keeping a queue of threads
• Each thread
– notifies the next in line
– without bothering the others

© Herlihy-Shavit 2007

Page 47: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

47CMPT 431 © A. Fedorova

Anderson Queue Lock

[Diagram: an array of flags (T F F F F F F F) – locations on which threads spin, one per thread – and a “next” pointer to the next unused spin location]

• Acquire with getAndIncrement: atomically get the value of “next” and increment the “next” pointer
• If the flag at that location was TRUE, the lock is acquired

© Herlihy-Shavit 2007
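A C11 sketch of this array-based queue lock (illustrative: N_SLOTS is an assumed bound on thread count, and a real implementation would pad each flag to its own cache line so waiters never share one):

```c
#include <stdatomic.h>

#define N_SLOTS 8                       /* must be >= number of threads */

typedef struct {
    volatile int flags[N_SLOTS];        /* one spin location per thread; */
    atomic_int next;                    /* flags[0] starts out true      */
} anderson_lock;

int anderson_acquire(anderson_lock *l) {
    /* getAndIncrement: atomically take the next unused spin location. */
    int slot = atomic_fetch_add(&l->next, 1) % N_SLOTS;
    while (!l->flags[slot])             /* spin on my own location only */
        ;
    return slot;                        /* caller passes this to release */
}

void anderson_release(anderson_lock *l, int slot) {
    l->flags[slot] = 0;                     /* recycle my slot          */
    l->flags[(slot + 1) % N_SLOTS] = 1;     /* notify next in line only */
}
```

Releasing touches only the successor's flag, so only that one waiter's cache line is invalidated.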

Page 48: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

48CMPT 431 © A. Fedorova

Acquiring a Held Lock

[Diagram: with the lock held, an acquiring thread does getAndIncrement, takes the next slot, and spins on its own flag; when the holder releases, it sets that flag to T and the waiter acquires the lock]

© Herlihy-Shavit 2007

Page 49: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

49CMPT 431 © A. Fedorova

Anderson Lock: Performance

[Figure: total time vs. number of threads – the Anderson lock is almost ideal, compared against the TAS and TTAS locks]

Almost ideal. We avoid all unnecessary invalidations. Portable – no tunable parameters.

© Herlihy-Shavit 2007

Page 50: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

50CMPT 431 © A. Fedorova

Scalable Synchronization: Summary

• Making synchronization primitives scalable is tricky
• Performance is tied to the hardware architecture
• We looked at these spinlocks:
– TAS – poor performance due to invalidations
– TTAS – avoids constant invalidations, but suffers a storm of invalidations on lock release
– TTAS with backoff – eliminates the storm of invalidations on release
– Anderson queue lock – completely eliminates all useless invalidations
• One could think of other optimizations…
• For more information, look at the references in the syllabus

Page 51: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

51CMPT 431 © A. Fedorova

Transactional Memory

• Programming with locks is tough
• Yet everyone has to do synchronization – multithreaded programming is driven by the multicore revolution

• Transactional memory: concurrent programming without locks

Page 52: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

52CMPT 431 © A. Fedorova

Coarse vs. Fine Synchronization

int update_shared_counters(int *counters, int n_counters)
{
    int i;

    coarse_lock_acquire(counters_lock);
    for (i = 0; i < n_counters; i++) {
        fine_lock_acquire(counter_locks[i]);
        counters[i]++;
        fine_lock_release(counter_locks[i]);
    }
    coarse_lock_release(counters_lock);
}

Coarse locks are easy to program, but perform poorly.
Fine locks perform well, but are difficult to program.

Page 53: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

53CMPT 431 © A. Fedorova

Transactional Memory To the Rescue!

• Can we have the best of both worlds?
– Good performance
– Ease of programming
• The answer is:
– Transactional Memory (TM)

Page 54: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

54CMPT 431 © A. Fedorova

Transactional Memory (TM)

• Programming model:
– Extension to the language
– Runtime and/or hardware support
• Lets you do synchronization without locks
• Performance of fine-grained locks
• Ease of programming of coarse-grained locks

Page 55: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

55CMPT 431 © A. Fedorova

Transactional Memory vs. Locks

int update_shared_counters(int *counters, int n_counters)
{
    int i;

    ATOMIC_BEGIN();
    for (i = 0; i < n_counters; i++) {
        counters[i]++;
    }
    ATOMIC_END();
}

The transactional section replaces both the coarse and the fine locks of the previous slide:
• Looks like a coarse-grained lock
• Acts like a fine-grained lock
• Performance degrades only if there is a conflict

Page 56: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

56CMPT 431 © A. Fedorova

The Backend of TM

[Diagram: transaction 1 – read A, write B, read B, write A, write D; transaction 2 – read C, write C, read E, write E, read D. The conflicting access to D makes one transaction abort and restart]

Page 57: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

57CMPT 431 © A. Fedorova

State of TM

• Still evolving
– More work is needed to make it usable and well-performing
• It is very real
– Sun’s new Rock processor has TM support
– Intel is very active

Page 58: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

58CMPT 431 © A. Fedorova

OS Support For Distributed Systems: Summary (I)

• Networking
– Access to network devices
– Implementation of network protocols: TCP, UDP, IP
• Processes and Threads (because many DS components use MP/MT architectures). Must ensure:
– Good load balance
– Good response time
– Minimized context switches
– We looked at how the Solaris time-sharing scheduler does this

Page 59: CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

59CMPT 431 © A. Fedorova

OS Support For Distributed Systems: Summary (II)

• Inter-process communication
– Pipes
– Memory-mapped files
– Inter-process shared memory

• Scalable Synchronization