A Concrete Treatment of Tasks - Imperial College London
TRANSCRIPT
A concrete treatment of tasks
• Previously we looked at decomposing programs into tasks
– Dependencies: task A must execute before task B
– Scheduling: find an order of execution that respects the dependencies
– ASAP scheduling: As Soon As Possible
• Now we’ll look at a formal task-based system: Cilk
– Very influential academic project started in the early 90s
– Good combination of theory and practical results
– Eventually bought by Intel, now incorporated into the Intel compiler
• http://software.intel.com/en-us/articles/intel-cilk-plus/
– Basic concepts are used in lots of other libraries
• Intel TBB, Microsoft TPL
• Some of the structure of this lecture is adapted from Leiserson and Prokop, “A Minicourse on Multithreaded Programming”, 1998.
HPCE / dt10 / 2013 / 4.3
The Cilk Language
• Cilk is a faithful extension of C
– If you delete the Cilk keywords from a program it will still execute as C
– Serial elision principle: remove the keywords and it becomes a serial program
• Two fundamental operations in Cilk
– spawn : indicate a function call that may operate in parallel
– sync : wait until all spawned functions have completed

Cilk version:

cilk int Fib(int n)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

Serial elision:

int Fib(int n)
{
    if(n<2)
        return n;
    int x=Fib(n-1);
    int y=Fib(n-2);
    return x+y;
}

HPCE / dt10 / 2013 / 4.7
Cilk programs as a DAG (Directed Acyclic Graph)
• The pattern of spawn and sync commands defines a graph
– The graph contains dependencies between different functions
– The spawn command creates a new task with an out-bound link
– The sync command creates an in-bound link from each spawned task

Evaluating Fib(3) creates the following instances (parameter values shown in place of n):

cilk int Fib(n=3)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=2)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=1)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=1)
{
    if(n<2)
        return n;
    ...
}

cilk int Fib(n=0)
{
    if(n<2)
        return n;
    ...
}

HPCE / dt10 / 2013 / 4.14
Steps within a function execute sequentially
Independent functions may execute in parallel
HPCE / dt10 / 2013 / 4.19
Total Work : T1 - total time required to execute all tasks
Critical path : T∞ - longest path through all tasks
assume each step takes unit time : total work = 35; critical path = 16
HPCE / dt10 / 2013 / 4.20
Best-case and worst-case times
• Define three times: T1, TP, T∞
– T1 : time to execute on one processor (Total Work)
– TP : time to execute on P processors
– T∞ : time to execute on infinitely many processors (Critical Path)
– T1 / TP : speedup with P processors
• Can establish an ordering on the times
– T1 / P ≤ TP : the maximum speedup with P processors is P
– TP ≥ T∞ : finitely many processors are no faster than infinitely many
• Can talk about scalability
– If T1 / TP = Θ(P) then we have linear speedup (perfect scaling)
– We always want linear speedup – can we achieve it?
HPCE / dt10 / 2013 / 4.24
Greedy Schedulers
• A greedy scheduler executes work using an ASAP approach
– Each “time step”, launch the tasks that have no outstanding dependencies
– The notion of a time step is deliberately context dependent
• When executing with P processors we have two types of step
– complete step : there are P or more tasks ready to execute
– incomplete step : there are fewer than P tasks ready to execute
• A greedy scheduler always achieves TP ≤ T1 / P + T∞
– The best case is easy to visualise
• every step is complete, so we do all the work in T1 / P steps
– The worst case is a bit more difficult
• steps on the critical path execute in incomplete steps
• the last step on the critical path frees up all remaining work for complete steps
HPCE / dt10 / 2013 / 4.27
Linear Scaling and Greedy Schedulers
• Previous equations assume zero-cost scheduling
– Some overhead is involved in tracking which tasks can be run
– Some overhead in scheduling ready tasks onto a processor
• Define the critical overhead : c∞
– The smallest c∞ such that TP ≤ T1 / P + c∞ × T∞
– Covers the cost of tracking dependencies on the critical path
• Linear scaling if there is usually much more work than CPUs
– Average parallelism : P̄ = T1 / T∞
– Assumption of parallel slackness : P̄ / P >> c∞
– Therefore: T1 / P >> c∞ × T∞
– And so: TP ≈ T1 / P (linear speedup)
• Assumption of parallel slackness implies linear speedup
HPCE / dt10 / 2013 / 4.31
Is that a reasonable assumption?
• The central idea is that most steps are complete
– All processors are occupied most of the time
– Does computation look like that?
• Recall Gustafson’s law and the finite-difference example
– T1 = O(n²); T∞ = O(n)
– P̄ = T1 / T∞ = O(n)
– Assuming c∞ is not too high we should get linear scaling
• Recall the circuit placement example
– Each placed node potentially adds M more nodes to execute
• For lots of computations the assumption is broadly true
HPCE / dt10 / 2013 / 4.33
Work-first rule
• Define the work overhead : c1 = T1 / TS
– TS : time to run the serial version of the program (the serial elision)
– The cost of dynamic scheduling vs static scheduling on one CPU
• What is the importance of c1 vs c∞ ?
– Substitute into the previous bound (TP ≤ T1 / P + c∞ × T∞)
– TP ≤ c1 TS / P + c∞ × T∞
– Now re-introduce the assumption of parallel slackness (P̄ / P >> c∞)
• T1 / (T∞ × P) >> c∞
• T1 / P >> c∞ T∞
• c1 TS / P >> c∞ T∞
– Therefore: TP ≈ c1 TS / P
• Work-first rule: minimise c1 rather than c∞
HPCE / dt10 / 2013 / 4.36
Total work : T1 - total time required for Cilk on one processor (red + green)
Serial work : TS - total time required for the serial elision (green only)
Assume each step takes unit time : total work = 35; serial work = 22
HPCE / dt10 / 2013 / 4.37
Interpreting the work-first rule
• The work-first rule appears in many guises
– What are c1 and c∞ in practice?
• Multi-core CPUs and OSs support traditional threads
– c1 : how much time does it take to swap between two threads on a CPU?
– c∞ : how much time does it take to create a new thread?
• GPUs support hundreds of parallel threads
– c1 : nano-second scheduling of threads within a kernel
– c∞ : milli-second cost of managing kernels from the CPU
• Intel TBB supports thousands of tasks
– c1 : agglomeration of loop iterations to reduce per-task overheads
– c∞ : hierarchical task-based scheduler (based on Cilk)
• Bear this principle in mind as we look at real systems
HPCE / dt10 / 2013 / 4.38
An example: Matrix-Matrix Multiply
• Recursive decomposition of matrix-matrix multiplication (MMM)
– Sub-divide each matrix into quadrants
– Perform the calculation using operations on the quadrants
– Split the matrices until we’re down to 1x1 matrices (scalars)
• Yes, technically we could use Strassen’s algorithm, but we’ll ignore that for now
• Standard MMM is O(n³)
– What is the big-O complexity of this algorithm?
[Figure: quadrant decomposition of the matrices A, B and C]
HPCE / dt10 / 2013 / 4.39
// Some sort of matrix - the details are not important
typedef ... matrix;

// Return a sub-matrix of A
// BT=0 -> top, BT=1 -> bottom; LR=0 -> left, LR=1 -> right
// The returned matrix is a _view_ on the matrix. Modifications
// to the quad affect matrix A too
matrix quad(matrix A, int BT, int LR);

// Perform a classic n^3 matrix multiply-add
// DST = DST + A * B
void multiply_add_dense(matrix DST, matrix A, matrix B);
HPCE / dt10 / 2013 / 4.40
void multiply_add_recursive(matrix DST, matrix A, matrix B)
{
    if((DST.cols <= 4) || (A.cols<=4) || (DST.rows<=4)){
        multiply_add_dense(DST, A, B);
    }else{
        multiply_add_recursive(quad(DST,0,0), quad(A,0,0), quad(B,0,0));
        multiply_add_recursive(quad(DST,0,1), quad(A,0,0), quad(B,0,1));
        multiply_add_recursive(quad(DST,1,0), quad(A,1,0), quad(B,0,0));
        multiply_add_recursive(quad(DST,1,1), quad(A,1,0), quad(B,0,1));
        multiply_add_recursive(quad(DST,0,0), quad(A,0,1), quad(B,1,0));
        multiply_add_recursive(quad(DST,0,1), quad(A,0,1), quad(B,1,1));
        multiply_add_recursive(quad(DST,1,0), quad(A,1,1), quad(B,1,0));
        multiply_add_recursive(quad(DST,1,1), quad(A,1,1), quad(B,1,1));
    }
}
HPCE / dt10 / 2013 / 4.41
cilk void multiply_add_recursive(matrix DST, matrix A, matrix B)
{
    if((DST.cols <= 4) || (A.cols<=4) || (DST.rows<=4)){
        multiply_add_dense(DST, A, B);
    }else{
        spawn multiply_add_recursive(quad(DST,0,0), quad(A,0,0), quad(B,0,0));
        spawn multiply_add_recursive(quad(DST,0,1), quad(A,0,0), quad(B,0,1));
        spawn multiply_add_recursive(quad(DST,1,0), quad(A,1,0), quad(B,0,0));
        spawn multiply_add_recursive(quad(DST,1,1), quad(A,1,0), quad(B,0,1));
        sync;
        spawn multiply_add_recursive(quad(DST,0,0), quad(A,0,1), quad(B,1,0));
        spawn multiply_add_recursive(quad(DST,0,1), quad(A,0,1), quad(B,1,1));
        spawn multiply_add_recursive(quad(DST,1,0), quad(A,1,1), quad(B,1,0));
        spawn multiply_add_recursive(quad(DST,1,1), quad(A,1,1), quad(B,1,1));
        sync;
    }
}
HPCE / dt10 / 2013 / 4.42
See the course home-page for the MMM C and Cilk
programs, and some info on trying Cilk:
http://cas.ee.ic.ac.uk/people/dt10/teaching/2012/hpce/
HPCE / dt10 / 2013 / 4.43