

Page 3: A concrete treatment of tasks - Imperial College London

A concrete treatment of tasks

• Previously we looked at decomposing programs into tasks

– Dependencies: task A must execute before task B

– Scheduling: find an order of execution that respects dependencies

– ASAP scheduling: As Soon As Possible

• Now we’ll look at a formal task-based system : Cilk

– Very influential academic project started in the early 90s

– Good combination of theory and practical results

– Eventually bought by Intel, now incorporated into Intel compiler

• http://software.intel.com/en-us/articles/intel-cilk-plus/

– Basic concepts used in lots of other libraries

• Intel TBB, Microsoft TPL

• Some of the structure from this lecture is adapted from Leiserson and Prokop, “A

Minicourse on Multithreaded Programming”, 1998.

HPCE / dt10 / 2013 / 4.3


Page 7: A concrete treatment of tasks - Imperial College London

The Cilk Language

• Cilk is a faithful extension of C

– If you delete Cilk keywords from a program it will still execute as C

– Serial Elision principle: remove the keywords, it becomes serial

• Two fundamental operations in Cilk

– spawn : indicate a function call that may operate in parallel

– sync : wait until all spawned functions have completed

cilk int Fib(int n)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

int Fib(int n)
{
    if(n<2)
        return n;
    int x=Fib(n-1);
    int y=Fib(n-2);
    return x+y;
}

HPCE / dt10 / 2013 / 4.7


Page 14: A concrete treatment of tasks - Imperial College London

Cilk programs as a DAG

• The pattern of spawn and sync commands defines a graph

– The graph contains dependencies between different functions

– spawn command creates a new task with an out-bound link

– sync command creates inbound link from spawned tasks

cilk int Fib(n=3)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=2)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=1)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=1)
{
    if(n<2)
        return n;
    ...
}

cilk int Fib(n=0)
{
    if(n<2)
        return n;
    ...
}

HPCE / dt10 / 2013 / 4.14

Page 15: A concrete treatment of tasks - Imperial College London

cilk int Fib(int n)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

HPCE / dt10 / 2013 / 4.15



Page 18: A concrete treatment of tasks - Imperial College London

Steps within a function execute sequentially

Independent functions may execute in parallel

HPCE / dt10 / 2013 / 4.18


Page 20: A concrete treatment of tasks - Imperial College London

Total Work : T1 - total time required to execute all tasks

Critical path : T∞ - longest path through all tasks

assume each step takes unit time : total work = 35; critical path = 16

HPCE / dt10 / 2013 / 4.20
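The same work/span accounting can be done recursively for the Fib DAG, here charging one unit of time per task rather than per step. A plain-C sketch; the helper names `work` and `span` are illustrative, not part of Cilk:

```c
/* Illustrative helpers, not part of Cilk: charge one unit of time
 * per task invocation in the Fib DAG. */
int work(int n)   /* T1: every task in the DAG contributes */
{
    if (n < 2) return 1;
    return 1 + work(n - 1) + work(n - 2);
}

int span(int n)   /* T-infinity: the two spawned children run in
                     parallel, so only the longer chain counts */
{
    if (n < 2) return 1;
    int a = span(n - 1);
    int b = span(n - 2);
    return 1 + (a > b ? a : b);
}
```

Note that `work(n)` grows exponentially while `span(n)` grows only linearly, so the Fib DAG offers plenty of parallelism.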



Page 24: A concrete treatment of tasks - Imperial College London

Best case and worst-case times

• Define three times: T1, TP, T∞

– T1 : Time to execute on one processor (Total Work)

– TP : Time to execute on P processors

– T∞ : Time to execute on infinite processors (Critical Path)

– T1 / TP : Speedup with P processors

• Can establish an ordering on the times

– T1 / P ≤ TP - Maximum speedup with P processors is P

– TP ≥ T∞ - Finite processors are no faster than infinite

• Can talk about scalability

– if T1 / TP = O(P) then Linear speedup (perfect scaling)

– We always want linear speedup – can we achieve it?

HPCE / dt10 / 2013 / 4.24
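The two orderings above combine into a ceiling on achievable speedup: T1 / TP can exceed neither P nor T1 / T∞. A one-function sketch (the name `speedup_bound` is illustrative):

```c
/* Illustrative: combine T1/P <= TP (at most P-fold speedup) and
 * TP >= T-infinity (critical path) into one speedup ceiling. */
double speedup_bound(double t1, double t_inf, double p)
{
    double by_span = t1 / t_inf;   /* cap imposed by the critical path */
    return (by_span < p) ? by_span : p;
}
```

For instance, with T1 = 35 and T∞ = 16, no number of processors can deliver more than 35/16 ≈ 2.19x.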


Page 27: A concrete treatment of tasks - Imperial College London

Greedy Schedulers

• A Greedy Scheduler executes work using an ASAP approach

– Each “time step” launch all tasks with no dependencies

– The notion of a time-step is deliberately context dependent

• When executing with P processors we have two types of step

– complete step : There are P or more tasks ready to execute

– incomplete step : There are fewer than P tasks ready to execute

• A greedy scheduler always achieves TP ≤ T1 / P + T∞

– Best case is easy to visualise

• we do all work in TP complete steps

– Worst case is a bit more difficult

• Steps on critical path execute in incomplete steps

• Last step on critical path frees up all remaining work for complete steps

HPCE / dt10 / 2013 / 4.27
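The greedy bound can be evaluated directly for the example DAG. A sketch (`greedy_bound` is an illustrative name):

```c
/* Greedy-scheduler guarantee: TP <= T1/P + T_inf. Complete steps can
 * consume at most T1/P time in total, and incomplete steps at most
 * T_inf, since each one makes progress along the critical path. */
double greedy_bound(double t1, double t_inf, double p)
{
    return t1 / p + t_inf;
}
```

With the unit-cost example (T1 = 35, T∞ = 16), four processors are guaranteed TP ≤ 35/4 + 16 = 24.75 steps.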



Page 31: A concrete treatment of tasks - Imperial College London

Linear Scaling and Greedy Schedulers

• Previous equations assume zero-cost scheduling

– Some overhead involved in tracking tasks that can be run

– Some overhead in scheduling ready tasks to a processor

• Define critical overhead : c

– Smallest c such that TP ≤ T1 / P + c×T∞

– Covers the cost of tracking dependencies on critical path

• Linear scaling if there is usually much more work than CPUs

– Average parallelism : P̄ = T1 / T∞

– Assumption of parallel slackness : P̄ / P >> c

– Therefore: T1 / P >> c × T∞

– And so: TP ≈ T1 / P (linear speedup)

• Assumption of parallel slackness implies linear speedup

HPCE / dt10 / 2013 / 4.31


Page 33: A concrete treatment of tasks - Imperial College London

Is that a reasonable assumption?

• Central idea is that most steps are complete

– All processors are occupied most of the time

– Does computation look like that?

• Recall Gustafson’s law and the finite-difference example

– T1 = O(n²); T∞ = O(n)

– P̄ = T1 / T∞ = O(n)

– Assuming c is not too high we should get linear scaling

• Recall the circuit placement example

– Each placed node potentially adds M more nodes to execute

• For lots of stuff the assumption is broadly true

HPCE / dt10 / 2013 / 4.33


Page 36: A concrete treatment of tasks - Imperial College London

Work-first rule

• Define work overhead : c1 = T1 / TS

– TS : Time to run serial version of program (serial elision)

– Cost of dynamic scheduling vs static scheduling on one CPU

• What is the importance of c1 vs c ?

– Substitute into previous defn (TP ≤ T1 / P + c×T∞)

– TP ≤ c1 TS / P + c×T∞

– Now re-introduce assumption of parallel slackness (P̄ / P >> c)

• T1 / (T∞ × P) >> c

• T1 / P >> c T∞

• c1 TS / P >> c T∞

– Therefore: TP ≈ c1 TS / P

• Work-first rule: minimise c1 rather than c

HPCE / dt10 / 2013 / 4.36
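The end result can be packaged as a back-of-envelope estimate: once slackness holds, only the work overhead c1 matters. A sketch with illustrative names:

```c
/* Work overhead c1 = T1 / TS (dynamic vs static scheduling on one
 * CPU), and the slackness-regime running-time estimate
 * TP ~ c1 * TS / P (the c * T_inf term is assumed negligible). */
double work_overhead(double t1, double ts)
{
    return t1 / ts;
}

double tp_estimate(double t1, double ts, double p)
{
    return work_overhead(t1, ts) * ts / p;   /* algebraically T1 / P */
}
```

With the unit-cost figures from the next slide (T1 = 35, TS = 22), c1 ≈ 1.6: each unit of useful serial work carries roughly 60% scheduling overhead, which is exactly what the work-first rule says to minimise.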

Page 37: A concrete treatment of tasks - Imperial College London

Total Work : T1 - total time required for Cilk on one processor (red+green)

Serial Work : TS - total time required for the serial elision (green only)

assume each step takes unit time : total work = 35; serial work = 22

HPCE / dt10 / 2013 / 4.37

Page 38: A concrete treatment of tasks - Imperial College London

Interpreting the work-first rule

• The work-first rule appears in many guises

– What are c1 and c in practice?

• Multi-core CPUs and OSs support traditional threads

– c1 : How much time to swap between two threads on a CPU?

– c : How much time to create a new thread?

• GPUs support hundreds of parallel threads

– c1 : Nano-second scheduling of threads in a kernel

– c : Milli-second cost to manage kernels from the CPU

• Intel TBB supports thousands of tasks

– c1 : Agglomeration of loop iterations to reduce overheads

– c : Hierarchical task based scheduler (based on Cilk)

• Bear this principle in mind as we look at real systems

HPCE / dt10 / 2013 / 4.38

Page 39: A concrete treatment of tasks - Imperial College London

An example: Matrix-Matrix Multiply

• Recursive decomposition of matrix-matrix multiplication (MMM)

– Sub-divide matrix into quadrants

– Perform calculations using operations on quadrants

– Split matrices until we’re down to 1x1 matrices (scalars)

• Yes, technically we could use Strassen’s Algorithm, but we’ll ignore that for now

• Standard MMM is O(n³)

– What is the big-O complexity of this algorithm?

[Figure: quadrant decomposition of C = C + A×B into 2×2 block sub-products]

HPCE / dt10 / 2013 / 4.39

Page 40: A concrete treatment of tasks - Imperial College London

// Some sort of matrix - the details are not important
typedef ... matrix;

// Return a sub-matrix of A
// BT=0 -> top, BT=1 -> bottom; LR=0 -> left, LR=1 -> right
// The returned matrix is a _view_ on the matrix. Modifications
// to the quad affect matrix A too
matrix quad(matrix A, int BT, int LR);

// Perform a classic n^3 matrix multiply-add
// DST = DST + A * B
void multiply_add_dense(matrix DST, matrix A, matrix B);


HPCE / dt10 / 2013 / 4.40

Page 41: A concrete treatment of tasks - Imperial College London

void multiply_add_recursive(matrix DST, matrix A, matrix B)
{
    if((DST.cols <= 4) || (A.cols<=4) || (DST.rows<=4)){
        multiply_add_dense(DST, A, B);
    }else{
        multiply_add_recursive(quad(DST,0,0), quad(A,0,0), quad(B,0,0));
        multiply_add_recursive(quad(DST,0,1), quad(A,0,0), quad(B,0,1));
        multiply_add_recursive(quad(DST,1,0), quad(A,1,0), quad(B,0,0));
        multiply_add_recursive(quad(DST,1,1), quad(A,1,0), quad(B,0,1));
        multiply_add_recursive(quad(DST,0,0), quad(A,0,1), quad(B,1,0));
        multiply_add_recursive(quad(DST,0,1), quad(A,0,1), quad(B,1,1));
        multiply_add_recursive(quad(DST,1,0), quad(A,1,1), quad(B,1,0));
        multiply_add_recursive(quad(DST,1,1), quad(A,1,1), quad(B,1,1));
    }
}


HPCE / dt10 / 2013 / 4.41
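The slides keep `matrix`, `quad`, and `multiply_add_dense` abstract. A minimal self-contained sketch of one way to realise them, assuming a row-major buffer with an explicit stride so that `quad` returns an aliasing view; every detail below is my own assumption, not the course's actual code:

```c
/* Hypothetical realisation of the slides' abstract matrix type:
 * a view (pointer + dimensions + row stride) into a shared buffer,
 * so sub-matrices alias the parent rather than copying it. */
typedef struct {
    double *data;    /* top-left element of this view       */
    int rows, cols;  /* dimensions of this view             */
    int stride;      /* row stride of the underlying buffer */
} matrix;

/* BT=0 -> top, BT=1 -> bottom; LR=0 -> left, LR=1 -> right.
 * Assumes rows and cols are even (true for power-of-two sizes). */
matrix quad(matrix A, int BT, int LR)
{
    matrix q = A;
    q.rows = A.rows / 2;
    q.cols = A.cols / 2;
    q.data = A.data + BT * q.rows * A.stride + LR * q.cols;
    return q;
}

/* Classic O(n^3) triple loop: DST = DST + A * B */
void multiply_add_dense(matrix DST, matrix A, matrix B)
{
    for (int i = 0; i < DST.rows; i++)
        for (int j = 0; j < DST.cols; j++)
            for (int k = 0; k < A.cols; k++)
                DST.data[i * DST.stride + j] +=
                    A.data[i * A.stride + k] * B.data[k * B.stride + j];
}
```

Because `quad` only adjusts the pointer and dimensions, the recursion costs no copying; it also means the DST quadrants written by the two groups of four recursive calls overlap, which matters for the parallel version on the next slide.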

Page 42: A concrete treatment of tasks - Imperial College London

cilk void multiply_add_recursive(matrix DST, matrix A, matrix B)
{
    if((DST.cols <= 4) || (A.cols<=4) || (DST.rows<=4)){
        multiply_add_dense(DST, A, B);
    }else{
        spawn multiply_add_recursive(quad(DST,0,0), quad(A,0,0), quad(B,0,0));
        spawn multiply_add_recursive(quad(DST,0,1), quad(A,0,0), quad(B,0,1));
        spawn multiply_add_recursive(quad(DST,1,0), quad(A,1,0), quad(B,0,0));
        spawn multiply_add_recursive(quad(DST,1,1), quad(A,1,0), quad(B,0,1));
        sync;
        spawn multiply_add_recursive(quad(DST,0,0), quad(A,0,1), quad(B,1,0));
        spawn multiply_add_recursive(quad(DST,0,1), quad(A,0,1), quad(B,1,1));
        spawn multiply_add_recursive(quad(DST,1,0), quad(A,1,1), quad(B,1,0));
        spawn multiply_add_recursive(quad(DST,1,1), quad(A,1,1), quad(B,1,1));
        sync;
    }
}


HPCE / dt10 / 2013 / 4.42

Page 43: A concrete treatment of tasks - Imperial College London

See the course home-page for the MMM C and Cilk

programs, and some info on trying Cilk:

http://cas.ee.ic.ac.uk/people/dt10/teaching/2012/hpce/

HPCE / dt10 / 2013 / 4.43