Operating Systems (234123) – Spring 2013: Scheduling
Dan Tsafrir (18/3/2013, 8/4/2013)



BATCH (NON-PREEMPTIVE) SCHEDULERS

What is it?

Consider a different context…
• In this course, we typically consider the OS of our own computers
• We need to change the mindset for some of the topics in today's lecture
  – So that they make sense and have meaning
• New context:
  – "Batch scheduling" on "supercomputers"
  – We will explain both of these terms next

Supercomputers
• Comprised of 100s to 100,000s of cores
  – Many multicores connected with a high-speed network
• Used by 100s of scientific users
  – (Physicists, chemists…)
• Users submit "jobs" (= programs)
  – Jobs can be serial (one core) or parallel (many cores)
  – A parallel job simultaneously uses N cores to solve a single scientific problem quickly (ideally N times faster)
  – N is the "size" of the job (determined by the submitting user)
  – While the program is running, the cores communicate with each other and exchange information
  – Jobs can run from a few seconds to 10s of hours
  – Jobs can have many sizes (often a power of 2)

Think of jobs as rectangles in the cores × time plane
[Figure: one job drawn as a rectangle in the cores × time plane; its size (number of cores) spans the cores axis and its runtime spans the time axis]
• Jobs can be "wide" or "narrow" (size)
• Jobs can be "short" or "long" (runtime)

This is how a schedule might look
[Figure: many job rectangles packed into the cores × time plane]

Batch scheduling
• It is important that all the processes of a job run together (as they repeatedly communicate)
  – Each process runs on a different core
• Jobs are often tailored to use the entire physical memory of each individual multicore
  – So they cannot really share cores with other jobs via multiprogramming
• Thus, supercomputers use "batch scheduling"
  – When a job is scheduled to run, it gets its own cores
  – Cores are dedicated (not shared)
  – Each job runs to completion (until it terminates)
  – Only afterwards are the cores allocated to other waiting jobs
  – "Non-preemptive" (see later on)

METRICS TO EVALUATE PERFORMANCE OF BATCH SCHEDULERS

Wait time, response time, slowdown, utilization, throughput

Average wait time & response time
• Average wait time
  – The "wait time" of a job is the interval between the time the job is submitted and the time the job starts to run
    • waitTime = startTime – submitTime
  – The shorter the average wait time, the better the performance
• Average response time
  – The "response time" of a job is the interval between the time the job is submitted and the time the job terminates
    • responseTime = terminationTime – submitTime
    • responseTime = waitTime + runTime
  – The shorter the average response time, the better the performance
• Wait vs. response
  – Users typically care more about response time (they wait for their job to end)
  – But schedulers can only affect wait time

Connection between average wait & response
• Claim
  – For batch schedulers, the difference between the average wait time and the average response time is a constant
  – The constant is the average runtime of all the jobs
• Proof
  – For each job i (i = 1, 2, 3, …, N)
    • Let Wi be the wait time of job i
    • Let Ri be the runtime of job i
    • Let Ti be the response time of job i (Ti = Wi + Ri)
  – With this notation we have
    • avgResponse = (1/N)·Σ Ti = (1/N)·Σ (Wi + Ri) = (1/N)·Σ Wi + (1/N)·Σ Ri = avgWait + avgRuntime
  – The average runtime is a given; it stays the same regardless of the scheduler!

Average slowdown (expansion factor)
• The "slowdown" of a job is
  – The ratio between its response time and its runtime
  – slowdown = responseTime / runTime = Ti / Ri
  – slowdown = (waitTime + runTime) / runTime = (Wi + Ri) / Ri
• As with wait & response, we aspire to minimize the average slowdown
  – slowdown = 1 means the job was scheduled immediately
  – The greater the slowdown, the longer the job was delayed
• Examples (see the sketch below)
  – If Ri = 1 minute and Wi = 1 minute
    • The job has a slowdown of (Wi + Ri)/Ri = (1+1)/1 = 2
    • Namely, the job was slowed by a factor of 2 relative to its runtime
  – If Ri = 1 hour and Wi = 1 minute, then the slowdown is much smaller
    • (Wi + Ri)/Ri = (60+1)/60 ≈ 1.02
    • The delay was insignificant relative to the job's runtime
• The slowdown metric is also called
  – "Expansion factor"
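To make the three per-job metrics concrete, here is a minimal C sketch (mine, not the lecture's); the job_t fields are illustrative names, and the sample numbers reproduce the two slowdown examples above.

#include <stdio.h>

/* Illustrative job record; all times are in seconds. */
typedef struct {
    double submit_time;      /* when the job was submitted        */
    double start_time;       /* when the job first started to run */
    double termination_time; /* when the job finished             */
} job_t;

double wait_time(const job_t *j)     { return j->start_time - j->submit_time; }
double run_time(const job_t *j)      { return j->termination_time - j->start_time; }
double response_time(const job_t *j) { return j->termination_time - j->submit_time; }
double slowdown(const job_t *j)      { return response_time(j) / run_time(j); }

int main(void)
{
    /* The two slide examples: Ri = 1 min, Wi = 1 min; and Ri = 60 min, Wi = 1 min. */
    job_t a = { 0, 60, 120 };    /* waits 1 minute, runs 1 minute   */
    job_t b = { 0, 60, 3660 };   /* waits 1 minute, runs 60 minutes */
    printf("a: slowdown = %.2f\n", slowdown(&a));  /* 2.00  */
    printf("b: slowdown = %.2f\n", slowdown(&b));  /* ~1.02 */
    return 0;
}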

Utilization & throughput
• Utilization (aspire to maximize)
  – Percentage of time the resource (the CPU in our case) is busy
  – In this example, the utilization is:
    • (3×7 − 6) × 100 / (3×7) ≈ 71%
    • (3 cores over 7 time units, of which 6 core-time units are idle; see the sketch below for the general computation)
[Figure: a schedule of several jobs on 3 cores over the time axis 0…7, with a few idle holes]
• Throughput (aspire to maximize)
  – How much work is done in one time unit
  – Examples
    • Typical hard disk throughput: 50 MB/second (sequential access)
    • Database server throughput: transactions per second
    • Supercomputer throughput: job completions per second
• Depends on your point of view
  – Users typically care about wait time, response time, slowdown
  – The owners of the system typically care about utilization & throughput
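A small C sketch of the utilization computation (my own illustration). The individual job areas passed to utilization() are made-up numbers, chosen only so that 3 cores observed for 7 time units with 15 busy core-units reproduce the slide's ~71%.

#include <stdio.h>

/* Utilization = (busy core-time) / (number of cores * observed interval), in percent. */
double utilization(const double *job_areas, int njobs, int ncores, double interval)
{
    double busy = 0;
    for (int i = 0; i < njobs; i++)
        busy += job_areas[i];            /* each area = job size * job runtime */
    return 100.0 * busy / (ncores * interval);
}

int main(void)
{
    /* Hypothetical job areas adding up to 15 busy core-units. */
    double areas[] = { 4, 5, 6 };
    printf("utilization = %.1f%%\n", utilization(areas, 3, 3, 7.0));  /* ~71.4% */
    return 0;
}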

BATCH SCHEDULING EXAMPLES

FCFS, EASY, backfilling, RR, SJF

FCFS (First-Come First-Served) scheduling
• Jobs are scheduled by their arrival time
  – Jobs are scheduled in their arrival order
  – If there are enough free cores, a newly arriving job starts to run immediately
  – Otherwise it waits until enough cores are freed
• Pros:
  – Easy to implement (FIFO wait queue)
  – Perceived as most fair
• Cons:
  – Creates fragmentation (unutilized cores)
  – Small/short jobs might wait for a long, long while…
[Figure: an FCFS schedule in the cores × time plane; the numbers on the jobs indicate their arrival order]

EASY (= FCFS + backfilling) scheduling
• The "backfilling" optimization
  – A short waiting job can jump over the head of the queue
  – Provided it doesn't delay the execution of the head of the queue
• EASY algorithm: whenever a job arrives or terminates:
  1. Try to start the job @ the head of the wait queue (FCFS)
  2. Then, iterate over the rest of the waiting jobs (in FCFS order) and try to backfill them
[Figure: the same workload scheduled twice in the cores × time plane, once with plain FCFS and once with FCFS + backfilling ("EASY"), showing how backfilled jobs fill the FCFS holes]

EASY (= FCFS + backfilling) scheduling
• Pros
  – Better utilization (less fragmentation)
  – Narrow/short jobs have a better chance to run quicker
• Con:
  – Must know runtimes in advance
    • To know the width of the holes
    • To know if backfill candidates are short enough to fit the holes
[Figure: the same FCFS vs. FCFS + backfilling ("EASY") comparison as on the previous slide]

EASY (= FCFS + backfilling) scheduling
• Backfilling mandates that users estimate how long their job will run
  – Upon job submission
• If a job tries to overrun its estimate
  – It is killed by the system
  – This provides an incentive to supply accurate estimates
    • A shorter estimate => a better chance to backfill
    • Too short => the job will be killed
  – (A sketch of the backfilling test appears below)
• EASY (or FCFS) are very popular
  – Most supercomputer schedulers use them by default
• EASY stands for
  – Extensible Argonne Scheduling sYstem
  – Developed @ Argonne National Laboratory (USA) circa 1995
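The backfilling test itself can be boiled down to one predicate. The following C sketch is a simplification of mine, not the lecture's code; it assumes we have already computed the "shadow time" at which the head-of-queue job will be able to start and the number of "extra" cores that will still be free at that moment.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative job descriptor; 'est_runtime' is the user-supplied estimate. */
typedef struct {
    int    size;         /* number of cores the job needs     */
    double est_runtime;  /* user's runtime estimate (seconds) */
} job_t;

/* EASY-style backfill check for one candidate job.
 *   now         - current time
 *   free_now    - cores that are free right now
 *   shadow_time - earliest time at which the head of the queue can start
 *   extra       - cores that remain unused even once the head job starts
 * A candidate may be backfilled if it fits now AND either
 *   (a) it is estimated to finish before the head job's start time, or
 *   (b) it only uses cores the head job will not need anyway.
 */
bool can_backfill(const job_t *cand, double now, int free_now,
                  double shadow_time, int extra)
{
    if (cand->size > free_now)
        return false;                            /* does not fit right now  */
    if (now + cand->est_runtime <= shadow_time)
        return true;                             /* ends before shadow time */
    return cand->size <= extra;                  /* uses only leftover cores */
}

int main(void)
{
    job_t cand = { 4, 30 };                      /* needs 4 cores for ~30 s */
    printf("%s\n", can_backfill(&cand, 0, 6, 100, 2) ? "backfill" : "wait");
    return 0;
}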

SJF (Shortest-Job First) scheduling
• Instead of
  – Ordering jobs (or processes) by their arrival time (FCFS)
• Order them by
  – Their (typically estimated) runtime
• Not fair
  – Might cause "starvation" (whereby some job waits forever)
• But "optimal" for performance
  – As we will see later on, using some theoretical computations
• NOTE: limits of job-scheduling theoretical computations (called "queuing theory")
  – Hard (impossible?) to do them for arbitrary parallel workloads
  – Therefore, for theoretical computations
    • We will assume jobs are serial (job = process)
  – The intuition from the serial case still typically applies to the parallel case

Average wait time example: FCFS vs. SJF
• Assume
  – Processes P1, P2, P3 arrive together in the very same second
  – Their runtimes are 24, 3, and 3, respectively
  – FCFS orders them by their index: P1, P2, P3
  – Whereas SJF orders them by their runtime: P2, P3, P1
• Then
  – The average wait time under FCFS is
    • (0 + 24 + 27) / 3 = 17
  – The average wait time under SJF is
    • (0 + 3 + 6) / 3 = 3
• SJF seems better than FCFS
  – In terms of optimizing the average wait time metric (the averages are reproduced by the code sketch below)
[Figure: two Gantt charts; FCFS runs P1 (24), then P2 (3), then P3 (3), while SJF runs P2 (3), then P3 (3), then P1 (24)]
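The two averages can be reproduced with a few lines of C; this is only a check of the arithmetic above, with the runtimes hard-coded.

#include <stdio.h>

/* Average wait time when processes run back-to-back in the given order,
 * all having arrived at time 0. */
static double avg_wait(const double *runtimes, int n)
{
    double t = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += t;       /* process i waits until all earlier ones finish */
        t += runtimes[i];
    }
    return total_wait / n;
}

int main(void)
{
    double fcfs[] = { 24, 3, 3 };  /* P1, P2, P3 in arrival order */
    double sjf[]  = { 3, 3, 24 };  /* P2, P3, P1 in runtime order */
    printf("FCFS: %.0f\n", avg_wait(fcfs, 3)); /* 17 */
    printf("SJF : %.0f\n", avg_wait(sjf, 3));  /* 3  */
    return 0;
}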

"Convoy effect"
• The slowing down of all (possibly very short) processes on account of the system currently servicing a very long process
  – As we've seen in the previous slide
• Does EASY suffer from the convoy effect?
  – It might, but certainly less than FCFS
  – There are often holes in which short/narrow jobs can fit, and many of them indeed manage to start immediately
• Does SJF suffer from the convoy effect?
  – It might (when?), but certainly less than FCFS

Optimality of SJF for average wait time
• Claim
  – Given a 1-core system whereby all jobs are serial (= processes)
  – If all processes arrive together and their runtimes are known
  – Then the average wait time of SJF is equal to or smaller than the average wait time of any other batch scheduler S
• Proof outline
  – Assume the scheduling order under S is: P(1), P(2), …, P(n)
  – If S is different from SJF, then there exist two adjacent processes P(i), P(i+1) such that R(i) = P(i).runtime > P(i+1).runtime = R(i+1)
  – If we swap the scheduling order of P(i) and P(i+1) under S, then
    • We've increased the wait time of P(i) by R(i+1)
    • We've decreased the wait time of P(i+1) by R(i)
    • No other process's wait time changes, so the total wait changes by R(i+1) − R(i)
  – And since R(i) > R(i+1), the overall average is reduced
  – We repeat the above until we reach SJF

Fairer variants of SJF
• Motivation: we don't want to starve jobs
• SJBF (Shortest-Job Backfilled First)
  – Exactly like EASY in terms of servicing the head of the wait queue in FCFS order (and not allowing anyone to delay it)
  – However, the backfilling traversal is done in SJF order
• LXF (Largest eXpansion Factor)
  – Recall that the "slowdown" or "expansion factor" metric of a job is defined to be:
    • slowdown = (waitTime + runtime) / runtime
  – LXF is identical to EASY, but instead of ordering the wait queue in FCFS order, it orders jobs by their current slowdown (greater slowdown means higher priority)
  – The backfilling activity is done "against" the job with the largest current slowdown (= the head of the LXF wait queue)
  – Note that every scheduling decision (when jobs arrive/finish) requires a re-computation of the slowdowns (because the wait times have changed); see the sketch below
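A minimal sketch (assumed, not taken from any real scheduler) of how the LXF wait queue can be kept ordered: recompute every waiting job's current slowdown and sort, largest first. The struct and function names are illustrative.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double wait_time;    /* how long the job has been waiting so far */
    double est_runtime;  /* user's runtime estimate                  */
} waiting_job_t;

/* Current ("eventual") slowdown if the job were to start right now. */
static double current_slowdown(const waiting_job_t *j)
{
    return (j->wait_time + j->est_runtime) / j->est_runtime;
}

/* qsort comparator: larger slowdown first (= higher priority under LXF). */
static int lxf_cmp(const void *a, const void *b)
{
    double sa = current_slowdown(a), sb = current_slowdown(b);
    return (sa < sb) - (sa > sb);
}

/* Re-sort the wait queue; called on every arrival/termination, because
 * every job's wait time (and hence slowdown) has changed by then. */
static void lxf_reorder(waiting_job_t *queue, int n)
{
    qsort(queue, n, sizeof queue[0], lxf_cmp);
}

int main(void)
{
    waiting_job_t q[] = { { 100, 1000 }, { 100, 10 }, { 5, 50 } };
    lxf_reorder(q, 3);
    for (int i = 0; i < 3; i++)
        printf("slowdown %.2f\n", current_slowdown(&q[i]));  /* 11.00, 1.10, 1.10 */
    return 0;
}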

PREEMPTIVE SCHEDULERS

RR, selfish RR, negative feedback, multi-level priority queue

Reminder
• In the previous lecture
  – We've talked about several process states
• In this lecture
  – We only focus on two:
    • Ready & running
• Namely, we assume that
  – Processes only consume CPU
  – They never do I/O
  – They are always either
    • Running, or
    • Ready to run
[Figure: the process state diagram with Ready, Running, and Waiting; "schedule" moves a process from Ready to Running and "preempt" moves it back (short-term transitions), while "I/O, page fault" moves Running to Waiting and "I/O done" moves Waiting back to Ready (longer-term transitions)]

Preemption
• The act of suspending one job (process) in favor of another
  – Even though it is not finished yet
• Exercise
  – Assume a one-core system
  – Assume two processes, each requiring 10 hours of CPU time
  – Does it make sense to do preemption (say, every few milliseconds)?
• When would we want a scheduler to be preemptive?
  – When responsiveness to the user matters (they actively wait for the output of the program), and
  – When some jobs/processes are much shorter than others
• Examples
  – Two processes: one needs 10 hours of CPU time, the other needs 10 seconds (and a user is waiting for it to complete)
  – Two processes: one needs 10 hours of CPU time, the other is a word processor (like MS Word or Emacs) that has just been awakened because the user pressed a key

Quantum
• The maximal amount of time a process is allowed to run before it is preempted
  – 10s to 100s of milliseconds in general-purpose OSes (like Linux or Windows)
• The quantum is oftentimes set per-process
  – Processes that behave differently get different quanta, e.g., in Solaris
    • A CPU-bound process gets long quanta but with low priority
    • Whereas an I/O-bound process gets short quanta with high priority
  – In Linux, the process's "nice" value affects the quantum duration
    • "nice" is a system call and a shell utility (see man)

Performance metrics for preemptive schedulers
• Wait time (aspire to minimize)
  – Same as in batch (non-preemptive) systems
• Response time (or "turnaround time"; aspire to minimize)
  – Like in batch systems, stands for
    • The time from process submission to process completion
  – But unlike in batch systems
    • responseTime ≠ waitTime + runTime
  – Instead,
    • responseTime ≥ waitTime + runTime
  – Because
    • Processes can be preempted, and context switches have a price
• Overhead (aspire to minimize); comprised of
  – How long a context switch takes, and how often context switches occur
• Utilization & throughput (aspire to maximize)
  – Same as in batch (non-preemptive) systems

RR (Round-Robin) scheduling
• Processes are arranged in a cyclic ready-queue
  – The head process runs until its quantum is exhausted
  – The head process is then preempted (suspended)
  – The scheduler resumes the next process in the circular list
  – When we've cycled through all the processes in the run-list (and we reach the head process again), we say that the current "epoch" is over, and the next epoch begins
• Requires a timer interrupt
  – Typically a periodic interrupt (fires every few milliseconds)
  – Upon receiving the interrupt, the OS checks if it's time to preempt
• Features
  – For a small enough quantum, it is as if each of the N processes advances at 1/N of the speed of the core
  – With a huge quantum (infinity), RR resembles FCFS
  – (A minimal simulation of these mechanics appears below)

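A compact C simulation of the quantum/epoch mechanics described above (an abstraction, not OS code): each pass over the cyclic list gives every unfinished process at most one quantum. The process runtimes and the quantum are made-up numbers.

#include <stdio.h>

#define NPROC   3
#define QUANTUM 2   /* time units per turn (illustrative) */

int main(void)
{
    double remaining[NPROC] = { 5, 3, 8 };   /* CPU time each process still needs */
    int    left = NPROC;                     /* processes not yet finished        */
    double now = 0;

    while (left > 0) {                       /* each outer iteration = one epoch  */
        for (int i = 0; i < NPROC; i++) {
            if (remaining[i] <= 0)
                continue;                    /* already terminated                */
            double slice = remaining[i] < QUANTUM ? remaining[i] : QUANTUM;
            now          += slice;           /* process i runs for one quantum    */
            remaining[i] -= slice;
            if (remaining[i] <= 0) {
                printf("P%d done at t=%.0f\n", i + 1, now);
                left--;
            }
        }
    }
    return 0;
}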

RR in a parallel system?
• Called "gang scheduling"
  – Time is divided into slots (seconds, or preferably minutes)
  – Every job has a native time slot
  – The algorithm attempts to fill holes in time slots by assigning to them jobs from other native slots (called "alternative slots")
    • Challenge: keeping contiguous chunks of free cores
    • Uses a "buddy system" algorithm
  – Why might gang scheduling be useful?
  – Supported by most commercial supercomputer schedulers
    • But is rarely used
    • Since memory often has to be "swapped out" upon context switches, which might take a very, very long time

Price of preemption: example
• Assume
  – One core, a 1-second quantum
  – 10 processes, each requiring 100 seconds of CPU time
• Assuming a context switch has no overhead (takes zero time)
  – Then the "makespan" (time to completion of all processes) is
    • 10 x 100 = 1000 seconds
  – Both for FCFS and for RR
• Assume a context switch takes 1 second
  – The makespan of FCFS is
    • 10 (proc) x 100 (sec) + 9 (ctx-sw) x 1 (sec) = 1,009 seconds
  – Whereas the makespan of RR is
    • 10 (proc) x 100 (quanta) x 1 (sec) + (10 x 100 − 1) (ctx-sw) x 1 (sec) = 1,999 seconds
• The shorter the quantum
  – The higher the overhead price
• (The arithmetic is reproduced in the sketch below)

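The arithmetic can be checked with a short C computation. It follows the assumptions of the example: a 1-second quantum, a 1-second switch, and the convention that no switch is charged after the very last quantum.

#include <stdio.h>

int main(void)
{
    const double nproc = 10, cpu_per_proc = 100;  /* seconds of CPU each       */
    const double quantum = 1, ctx = 1;            /* quantum & switch cost (s) */

    double compute   = nproc * cpu_per_proc;            /* 1000 s of useful work */
    double fcfs_sw   = nproc - 1;                        /* 9 switches            */
    double rr_quanta = nproc * (cpu_per_proc / quantum); /* 1000 quanta           */
    double rr_sw     = rr_quanta - 1;                    /* 999 switches          */

    printf("FCFS makespan: %.0f s\n", compute + fcfs_sw * ctx); /* 1009 */
    printf("RR   makespan: %.0f s\n", compute + rr_sw   * ctx); /* 1999 */
    return 0;
}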

Given a preemptive schedule, we can find a batch schedule that is at least as good
• Claim
  – Let avgResp(X) be the average response time under algorithm X
  – Assume a single-core system, and that all processes arrive together
  – Assume X is a preemptive algorithm with a context-switch price of 0
  – Then there exists a non-preemptive algorithm Y such that avgResp(Y) <= avgResp(X)
• Proof outline
  1. Let Pk be the last preempted process to finish computing
  2. Compress all of Pk's quanta to the "right" (assume time progresses left to right), such that the last quantum stays where it is and all the other quanta move right towards it
     (Pk's response time didn't change, and the response times of the other processes didn't become longer)
  3. Go to 1 (until no preempted processes are found)

[Figure: a timeline in which Pk's quanta are interleaved with another process Pr, and the same timeline after Pk's quanta have been compressed to the right so that they run back-to-back just before Pk's original completion point]

Corollary
• Based on (1) the previous slide and (2) the proof about SJF optimality from a few slides ago
  – SJF is also optimal relative to preemptive schedulers (that meet our assumptions, i.e., all processes arrive together)

Connection between RR and SJF
• Claim
  – Assume:
    • A 1-core system
    • All processes arrive together (and only use CPU, no I/O)
    • The quantum is identical for all processes
  – Then
    • avgResponse(RR) <= 2 x avgResponse(SJF)
• Notation
  – Processes P1, P2, …, PN
  – Ri = runtime of Pi
  – Wi = wait time of Pi
  – Ti = Wi + Ri = response time of Pi
  – delay(i,k) = the delay Pi caused Pk under the scheduler in question
  – delay(i,i) = Ri
  – Note that Tk = Σ_{i=1…N} delay(i,k)

Connection between RR and SJF
• Proof outline
  – For any scheduling algorithm A, N·avgResponse(A) is always
    • Σ_{k=1…N} Tk = Σ_{k=1…N} Σ_{i=1…N} delay_A(i,k)
      = Σ_{i=1…N} Ri + Σ_{1≤i<j≤N} [ delay_A(i,j) + delay_A(j,i) ]
  – For A = SJF, we have
    • Σ_{k} Tk = Σ_{i} Ri + Σ_{1≤i<j≤N} min(Ri, Rj)
    • (because, given two processes, the shorter delays the longer, and not vice versa)
  – And for A = RR, we have
    • Σ_{k} Tk = Σ_{i} Ri + Σ_{1≤i<j≤N} 2·min(Ri, Rj)
    • (assume Pi is the shorter of the two; then Pi delays Pj by Ri, and, since it's a perfect RR, Pj also delays Pi by Ri)
  – Hence N·avgResponse(RR) = Σ_{i} Ri + 2·Σ_{i<j} min(Ri, Rj) ≤ 2·( Σ_{i} Ri + Σ_{i<j} min(Ri, Rj) ) = 2·N·avgResponse(SJF), which proves the claim

SRTF (Shortest-Remaining-Time First)
• Assume different jobs may arrive at different times
• SJF is not optimal in this setting
  – As it is not preemptive
  – And a short job might arrive while a very long job is running
• SRTF is just like SJF
  – But it is allowed to use preemption
  – Hence it is optimal (assuming a zero context-switch price)
• Whenever a new job arrives or an old job terminates
  – SRTF schedules the job with the shortest remaining time
  – Thereby making an optimal decision
• (A small simulation sketch follows)

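A small single-core SRTF simulation in C (my own sketch, with the zero context-switch price assumed above): at every point it runs the arrived job with the shortest remaining time, re-deciding whenever a job arrives or completes. The three-job workload is illustrative.

#include <stdio.h>

typedef struct { double arrival, runtime, remaining, finish; } job_t;

#define N 3

int main(void)
{
    /* Illustrative workload: a long job arrives first, short ones later. */
    job_t jobs[N] = {
        { 0, 20, 20, 0 },
        { 2,  3,  3, 0 },
        { 4,  1,  1, 0 },
    };
    double now = 0;
    int done = 0;

    while (done < N) {
        /* Pick the arrived, unfinished job with the shortest remaining time. */
        int best = -1;
        for (int i = 0; i < N; i++)
            if (jobs[i].arrival <= now && jobs[i].remaining > 0 &&
                (best < 0 || jobs[i].remaining < jobs[best].remaining))
                best = i;

        if (best < 0) {                    /* CPU idle: jump to the next arrival */
            double next = 1e18;
            for (int i = 0; i < N; i++)
                if (jobs[i].remaining > 0 && jobs[i].arrival < next)
                    next = jobs[i].arrival;
            now = next;
            continue;
        }

        /* Run 'best' until it finishes or a new job arrives (possible preemption). */
        double slice = jobs[best].remaining;
        for (int i = 0; i < N; i++)
            if (jobs[i].arrival > now && jobs[i].arrival - now < slice)
                slice = jobs[i].arrival - now;

        now += slice;
        jobs[best].remaining -= slice;
        if (jobs[best].remaining <= 0) {
            jobs[best].finish = now;
            done++;
            printf("job %d: response time = %.0f\n",
                   best, jobs[best].finish - jobs[best].arrival);
        }
    }
    return 0;
}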

Selfish RR
• New processes wait in a FIFO queue
  – Not yet scheduled
• Older processes are scheduled using RR
• New processes get scheduled when
  1. No ready-to-run "old" processes exist
  2. "Aging" is applied to new processes (some per-process counter is increased over time); when the counter passes a certain threshold, the "new" process becomes "old" and is transferred to the RR queue
• Fast aging
  – The algorithm resembles RR
• Slow aging
  – The algorithm resembles FCFS

Scheduling using priorities
• Every process is assigned a priority
  – That reflects how "important" it is
  – It can change over time
• Processes with higher priority are favored
  – Scheduled before processes with lower priority
• For example, in SJF
  – The priority is the runtime (smaller => better)

Negative feedback principle
• All general-purpose OSes (Linux, Windows, …) embody a negative-feedback policy in their scheduler
  – Running reduces the priority to run more
  – Not running increases the priority to run
• I/O-bound vs. CPU-bound
  – I/O-bound processes (that seldom use the CPU) get higher priority
  – This is why editors are responsive even when they run in the presence of CPU-bound processes like
    • while (1) { sqrt( time() ); }
• How about a video with a high frame rate? Or a 3D game?
  – Negative feedback doesn't help them (they consume lots of CPU)
  – We need other ways to identify/prioritize them (very hard to do securely)

Multi-level priority queue
• Several RR queues
  – Each is associated with a priority
  – Higher-priority queues are at the "top"
  – Lower-priority queues are at the "bottom"
• Processes migrate between queues = have a dynamic priority
  – "Important" processes (e.g., I/O-bound) move up
  – "Unimportant" processes (e.g., CPU-bound) move down
• Variants of the multi-level priority queue are used by all general-purpose OSes
  – Priority is greatly affected by how much CPU the process consumes
    • I/O-bound move up; CPU-bound move down
  – Some OSes allocate short quanta to the higher-priority queues
  – Some don't
  – Some do the opposite
• (A toy data-structure sketch appears below)

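A toy C sketch of the data structure (my own illustration, not any particular OS): an array of per-priority RR queues plus helpers that move a task one level up or down.

#include <stdio.h>
#include <stdlib.h>

#define NLEVELS 8            /* priority levels; 0 = highest (illustrative) */

typedef struct task {
    struct task *next;       /* link inside the per-level RR queue */
    int level;               /* current priority level             */
    int id;
} task_t;

typedef struct {
    task_t *head[NLEVELS];   /* one FIFO (round-robin) queue per level */
    task_t *tail[NLEVELS];
} mlpq_t;

static void mlpq_enqueue(mlpq_t *q, task_t *t, int level)
{
    t->level = level;
    t->next  = NULL;
    if (q->tail[level]) q->tail[level]->next = t;
    else                q->head[level]       = t;
    q->tail[level] = t;
}

/* Pick the next task to run: the front of the highest non-empty queue. */
static task_t *mlpq_pick(mlpq_t *q)
{
    for (int lvl = 0; lvl < NLEVELS; lvl++) {
        task_t *t = q->head[lvl];
        if (t) {
            q->head[lvl] = t->next;
            if (!q->head[lvl]) q->tail[lvl] = NULL;
            return t;
        }
    }
    return NULL;             /* nothing runnable */
}

/* Dynamic priority: CPU-bound tasks drift down, I/O-bound tasks drift up. */
static void mlpq_demote(mlpq_t *q, task_t *t)   /* used its whole quantum  */
{
    mlpq_enqueue(q, t, t->level < NLEVELS - 1 ? t->level + 1 : t->level);
}

static void mlpq_promote(mlpq_t *q, task_t *t)  /* blocked early for I/O   */
{
    mlpq_enqueue(q, t, t->level > 0 ? t->level - 1 : 0);
}

int main(void)
{
    mlpq_t q = {0};
    task_t a = { .id = 1 }, b = { .id = 2 };
    mlpq_enqueue(&q, &a, 3);                 /* both start at a middle level */
    mlpq_enqueue(&q, &b, 3);
    task_t *t = mlpq_pick(&q);               /* picks a (same level, FIFO)   */
    mlpq_demote(&q, t);                      /* a used its whole quantum     */
    t = mlpq_pick(&q);                       /* picks b                      */
    mlpq_promote(&q, t);                     /* b blocked early (I/O-bound)  */
    printf("front task now: %d\n", mlpq_pick(&q)->id);  /* prints 2 */
    return 0;
}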

THE LINUX <= 2.4 SCHEDULER: A COMPREHENSIVE EXAMPLE

A summary can be found in Chapter 6 of: http://www.cs.technion.ac.il/~dan/papers/MscThesis.pdf

The <= 2.4 scheduler maintenance period
[Figure: a timeline of Linux kernel versions marking the period during which the <= 2.4 scheduler was maintained ("from here" … "until here"), followed by its successors, the O(1) scheduler and then CFS]

Prologue
• For the remainder of this presentation
  – All definitions relate to the <= 2.4 scheduler
  – Not necessarily to other schedulers
• Ignoring…
  – The description ignores the so-called "real-time" tasks
    • SCHED_RR, SCHED_FIFO (scheduling policies defined by POSIX)
    • They have different scheduling semantics than those outlined later
• Focusing…
  – Instead, we focus on the default scheduling policy
    • SCHED_OTHER
    • (Which will never run if there exists a runnable SCHED_RR or SCHED_FIFO task)

Definitions: task
• Within the Linux kernel
  – Every process is called a "task"
  – Every thread (to be defined later on) is also called a "task"

Definitions: epoch
• (As mentioned earlier for RR…)
• Every runnable task gets allocated a quantum
  – CPU time the task is allowed to consume before it is stopped by the OS
• When the quanta of all runnable tasks become zero
  – A new epoch starts, namely
  – Additional running time is allocated to all tasks
    • (Runnable or not)

Definitions: priorities
• Task's priority
  – Every task is associated with an integer
  – A higher value indicates a higher priority to run
  – Every task has two different kinds of priorities…
• Task's static priority component
  – Doesn't change with time
  – Unless the user invokes the nice() system call
  – Determines the max quantum for this task
• Task's dynamic priority component
  – The (i) remaining time for this task to run and (ii) its current priority
  – Decreases over time (while the task is assigned a CPU and is running)
  – When it reaches zero, the OS forces the task to yield the CPU
  – Reinitialized at the start of every epoch according to the static component

Definitions: HZ, resolution, and ticks
• HZ
  – Linux gets a timer interrupt HZ times a second
  – Namely, it gets an interrupt every 1/HZ seconds
  – (HZ = 100 for x86 / Linux 2.4)
• Tick
  – The time that elapses between two consecutive timer interrupts
  – A tick duration is therefore 1/HZ
  – Namely, tick = 10 milliseconds by default
• Scheduler timing resolution
  – The OS measures the passage of time by counting ticks
  – The unit of the dynamic priority component is 'ticks'

Definitions: per-task scheduling info
• Every task is associated with a task_struct
• Every task_struct has 5 fields used by the scheduler
  1. nice
  2. counter
  3. processor
  4. need_resched
  5. mm

Definitions: task's nice (kernel vs. user)
• The static component
  – In the kernel, initialized to 20 by default
  – Can be changed by nice() and sched_setscheduler()
    • To any value between 1 … 40
• User's nice (parameter to the nice system call)
  – Between -20 … 19
  – A smaller value indicates a higher priority (< 0 requires superuser)
• Kernel's nice (field in the task_struct used by the scheduler)
  – As noted, between 1 … 40
  – A higher value indicates a higher priority (in contrast to the user's nice)
• Conversion
  – Kernel's nice = 20 − user's nice

Definitions: task's counter
• The dynamic component (time to run in the epoch & priority)
  – Upon task creation
    • child.counter = parent.counter/2; parent.counter -= child.counter;
  – Upon a new epoch
    • task.counter = task.counter/2 + NICE_TO_TICKS(task.nice)
      = half of the dynamic component + convert_to_ticks(static component)
    • (Denote α = NICE_TO_TICKS(task.nice); we will use this a few slides ahead)
• NICE_TO_TICKS
  – In 2.4, scales 20 (= DEF_PRIORITY) to the number of ticks comprising roughly 50 ms
  – Namely, it scales 20 to 5 ticks, plus one (recall that each tick is 10 ms by default):
    • #define NICE_TO_TICKS(nice) ( (nice)/4 + 1 )
    • (The +1 makes sure the value is always > 0)
• The quantum range is therefore (see the check below)
  – ( 1/4 + 1 =) 1 tick = 10 ms (min)
  – (20/4 + 1 =) 6 ticks = 60 ms (default)
  – (40/4 + 1 =) 11 ticks = 110 ms (max)

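The quantum range can be verified directly from the macro quoted above; this tiny C program assumes HZ = 100 (10 ms ticks), as stated on the previous slide.

#include <stdio.h>

#define HZ 100                                /* Linux 2.4 / x86 default       */
#define NICE_TO_TICKS(nice) ((nice) / 4 + 1)  /* macro quoted from the slide   */

int main(void)
{
    int nices[] = { 1, 20, 40 };              /* min, default, max kernel nice */
    for (int i = 0; i < 3; i++) {
        int ticks = NICE_TO_TICKS(nices[i]);
        printf("kernel nice %2d -> %2d ticks = %3d ms\n",
               nices[i], ticks, ticks * 1000 / HZ);   /* 10, 60, 110 ms */
    }
    return 0;
}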

Definitions: processor, need_resched, mm
• Task's processor
  – The logical ID of the core upon which the task executed most recently
  – If the task is currently running
    • 'processor' = the logical ID of the core upon which the task executes now
• Task's need_resched
  – A boolean checked by the kernel just before switching back to user mode
  – If set, search for a "better" task than the one currently running
  – If such a task is found, context-switch to it
  – Since this flag is checked only for the currently running task
    • It is usually easier to think of it as a per-core rather than a per-task variable
• Task's mm
  – A pointer to the task's "memory address space" (more details later on)

The scheduler is comprised of 4 functions
1. goodness()
  – Given a task, returns how desirable it is
  – Tasks are compared by this value to decide which will run next
2. schedule()
  – The actual implementation of the scheduling algorithm
  – Uses goodness() to decide which task will run next on a given core
3. __wake_up_common()
  – Wakes up a task when the event it has been waiting for has happened
  – The event may be, e.g., the completion of an I/O operation
4. reschedule_idle()
  – Given a task, checks whether it can be scheduled on some core
  – Preferably on an idle one, but if there aren't any, by preempting a less desirable task (according to goodness())
  – Used both by __wake_up_common() and by schedule()

How desirable is a task on a given core?

int goodness(task t, cpu this_cpu) {   // bigger = more desirable
    g = t.counter
    if( g == 0 )
        // exhausted quantum, wait until next epoch
        return 0
    if( t.processor == this_cpu )
        // try to avoid migration between cores
        g += PROC_CHANGE_BONUS
    if( t.mm == this_cpu.current_task.mm )
        // prioritize threads sharing the same address space,
        // as the context switch would be cheaper
        g += SAME_ADDRESS_SPACE_BONUS
    return g
}

Try to schedule a task on some core

void reschedule_idle(task t) {
    next_cpu = NIL
    if( t.processor is idle )                // t's most recent core is idle
        next_cpu = t.processor
    else if( there exists an idle cpu )      // some other core is idle
        next_cpu = least recently active idle cpu
    else {                                   // no core is idle; is t more desirable
                                             // than a currently running task?
        threshold = PREEMPTION_THRESHOLD
        foreach cpu c in [all cpus] {        // find c where t is most desirable
            gdiff = goodness(t, c) - goodness(c.current_task, c)
            if( gdiff > threshold ) {
                threshold = gdiff
                next_cpu  = c
            }
        }
    }
    if( next_cpu != NIL ) {                  // found a core for t
        prev_need = next_cpu.current_task.need_resched
        next_cpu.current_task.need_resched = true
        if( (prev_need == false) && (next_cpu != this_cpu) )
            interrupt next_cpu
    }
}

Wake up blocked task(s)

void __wake_up_common(wait_queue q) {
    // blocked tasks residing in q wait for an event that has just happened,
    // so try to reschedule all of them...
    foreach task t in [q] {
        remove t from q
        add t to ready-to-run list
        reschedule_idle(t)
    }
}

The heart of the scheduler

void schedule(cpu this_cpu) {
    // called when need_resched of this_cpu is on, when switching from
    // kernel mode back to user mode. need_resched can be set by, e.g., the
    // tick handler, or by I/O device drivers that initiate a slow operation
    // (and hence move the associated tasks to a wait_queue)
    prev = this_cpu.current_task

START:
    if( prev's state is runnable ) {
        next   = prev
        next_g = goodness(prev, this_cpu)
    } else {
        next_g = -1
    }
    foreach task t in [runnable && not executing] {
        // search for 'next' = the next task to run on this_cpu
        cur_g = goodness(t, this_cpu)
        if( cur_g > next_g ) {
            next   = t
            next_g = cur_g
        }
    }

The heart of the scheduler – continued

    // ...continuing the function from the previous slide...
    if( next_g == -1 ) {          // no ready tasks
        end function              // schedules the "idle task" (halts this_cpu)
    } else if( next_g == 0 ) {    // all quanta exhausted => start a new epoch
        foreach task t
            t.counter = t.counter/2 + NICE_TO_TICKS(t.nice)
        goto START
    } else if( next != prev ) {
        next.processor    = this_cpu
        next.need_resched = false  // 'next' will run next
        context_switch(prev, next)
        if( prev is still runnable )
            reschedule_idle(prev)  // perhaps on another core
    }
    // 'next' (which may be equal to 'prev') will run next...
}

Definitions: task's counter (cont.)
• Recall that
  – task.counter = task.counter/2 + NICE_TO_TICKS(task.nice)
    = half of the dynamic component + convert_to_ticks(static component)
  – and that we denoted α = NICE_TO_TICKS(task.nice)
• Claim
  – The counter value of an I/O-bound task will quickly converge to 2α
  – (Prove it yourselves; see the numerical check below)
• Corollary
  – By default, an I/O-bound task will have a counter of 12 ticks (= 120 ms)
  – (As long as it remains I/O-bound and consumes negligible CPU time)
• Notice
  – As noted earlier, this is why text editors remain responsive
  – Even in the face of heavy CPU-bound background tasks
  – When awakened by an event (e.g., a key press), they usually get the CPU immediately

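The convergence claim is easy to check numerically: an I/O-bound task barely touches its counter, so every new epoch computes roughly counter = counter/2 + α, whose fixed point is 2α. A quick C check with the default α = 6 ticks (real arithmetic is used for clarity; the kernel works in whole ticks):

#include <stdio.h>

int main(void)
{
    const double alpha = 6;   /* NICE_TO_TICKS(20) = 6 ticks by default */
    double counter = 0;       /* a freshly created task, say            */

    for (int epoch = 1; epoch <= 8; epoch++) {
        counter = counter / 2 + alpha;       /* counter is barely consumed */
        printf("epoch %d: counter = %.3f ticks\n", epoch, counter);
    }
    /* converges to 2*alpha = 12 ticks = 120 ms, as claimed */
    return 0;
}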

Aftermath
• The <= 2.4 scheduler is in fact a multi-level priority queue
  – Every priority level (integer) is (logically) an RR queue
  – When the counter is decreased => the task moves between queues
• Drawback
  – It is a linear scheduler (schedule() iterates over all runnable tasks)
  – Indeed, the "O(1)" scheduler (the successor of <= 2.4) is O(1)
  – The CFS scheduler (the successor of O(1)) is O(log n)

Price of linearity – motivating the O(1) scheduler
[Figure: the measured duration of schedule() in cycles as a function of the number of runnable tasks in the system, shown separately for "CPU 1 of 4" and "CPU 2 of 4"; the duration grows with the number of runnable tasks]