A Concrete Treatment of Tasks - Imperial College London
TRANSCRIPT
A concrete treatment of tasks
• Previously we looked at decomposing programs into tasks
– Dependencies: task A must execute before task B
– Scheduling: find an order of execution that respects the dependencies
– ASAP scheduling: As Soon As Possible
• Now we’ll look at a formal task-based system: Cilk
– Very influential academic project started in the early 90s
– Good combination of theory and practical results
– Eventually bought by Intel, now incorporated into the Intel compiler
• http://software.intel.com/en-us/articles/intel-cilk-plus/
– Basic concepts are used in lots of other libraries
• Intel TBB, Microsoft TPL
• Some of the structure of this lecture is adapted from Leiserson and Prokop, “A Minicourse on Multithreaded Programming”, 1998.
HPCE / dt10 / 2013 / 4.3
The Cilk Language
• Cilk is a faithful extension of C
– If you delete the Cilk keywords from a program it will still execute as C
– Serial elision principle: remove the keywords and it becomes a serial program
• Two fundamental operations in Cilk
– spawn : indicate a function call that may operate in parallel
– sync : wait until all spawned functions have completed

Cilk version:

cilk int Fib(int n)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

Serial elision:

int Fib(int n)
{
    if(n<2)
        return n;
    int x=Fib(n-1);
    int y=Fib(n-2);
    return x+y;
}

HPCE / dt10 / 2013 / 4.7
Cilk programs as a DAG (Directed Acyclic Graph)
• The pattern of spawn and sync commands defines a graph
– The graph contains dependencies between different functions
– The spawn command creates a new task with an out-bound link
– The sync command creates an in-bound link from each spawned task

Evaluating Fib(3) creates the following instances (parameter values shown in place of n):

cilk int Fib(n=3)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=2)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=1)
{
    if(n<2)
        return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}

cilk int Fib(n=1)
{
    if(n<2)
        return n;
    ...
}

cilk int Fib(n=0)
{
    if(n<2)
        return n;
    ...
}

HPCE / dt10 / 2013 / 4.14
Steps within a function execute sequentially
Independent functions may execute in parallel
HPCE / dt10 / 2013 / 4.19
Total Work : T1 - total time required to execute all tasks
Critical path : T∞ - longest path through all tasks
assume each step takes unit time : total work = 35; critical path = 16
HPCE / dt10 / 2013 / 4.20
Best-case and worst-case times
• Define three times: T1, TP, T∞
– T1 : time to execute on one processor (Total Work)
– TP : time to execute on P processors
– T∞ : time to execute on infinitely many processors (Critical Path)
– T1 / TP : speedup with P processors
• Can establish an ordering on the times
– T1 / P ≤ TP : the maximum speedup with P processors is P
– TP ≥ T∞ : finitely many processors are no faster than infinitely many
• Can talk about scalability
– If T1 / TP = Θ(P) then we have linear speedup (perfect scaling)
– We always want linear speedup – can we achieve it?
HPCE / dt10 / 2013 / 4.24
Greedy Schedulers
• A greedy scheduler executes work using an ASAP approach
– Each “time step”, launch the tasks that have no outstanding dependencies
– The notion of a time step is deliberately context dependent
• When executing with P processors we have two types of step
– complete step : there are P or more tasks ready to execute
– incomplete step : there are fewer than P tasks ready to execute
• A greedy scheduler always achieves TP ≤ T1 / P + T∞
– The best case is easy to visualise
• every step is complete, so we do all the work in T1 / P steps
– The worst case is a bit more difficult
• steps on the critical path execute in incomplete steps
• the last step on the critical path frees up all remaining work for complete steps
HPCE / dt10 / 2013 / 4.27
Linear Scaling and Greedy Schedulers
• Previous equations assume zero-cost scheduling
– Some overhead is involved in tracking which tasks can be run
– Some overhead in scheduling ready tasks onto a processor
• Define the critical overhead : c∞
– The smallest c∞ such that TP ≤ T1 / P + c∞ × T∞
– Covers the cost of tracking dependencies on the critical path
• Linear scaling if there is usually much more work than CPUs
– Average parallelism : P̄ = T1 / T∞
– Assumption of parallel slackness : P̄ / P >> c∞
– Therefore: T1 / P >> c∞ × T∞
– And so: TP ≈ T1 / P (linear speedup)
• Assumption of parallel slackness implies linear speedup
HPCE / dt10 / 2013 / 4.31
Is that a reasonable assumption?
• The central idea is that most steps are complete
– All processors are occupied most of the time
– Does computation look like that?
• Recall Gustafson’s law and the finite-difference example
– T1 = O(n²); T∞ = O(n)
– P̄ = T1 / T∞ = O(n)
– Assuming c∞ is not too high we should get linear scaling
• Recall the circuit placement example
– Each placed node potentially adds M more nodes to execute
• For lots of computations the assumption is broadly true
HPCE / dt10 / 2013 / 4.33
Work-first rule
• Define the work overhead : c1 = T1 / TS
– TS : time to run the serial version of the program (the serial elision)
– The cost of dynamic scheduling vs static scheduling on one CPU
• What is the importance of c1 vs c∞ ?
– Substitute into the previous bound (TP ≤ T1 / P + c∞ × T∞)
– TP ≤ c1 TS / P + c∞ × T∞
– Now re-introduce the assumption of parallel slackness (P̄ / P >> c∞)
• T1 / (T∞ × P) >> c∞
• T1 / P >> c∞ T∞
• c1 TS / P >> c∞ T∞
– Therefore: TP ≈ c1 TS / P
• Work-first rule: minimise c1 rather than c∞
HPCE / dt10 / 2013 / 4.36
Total work : T1 - total time required for Cilk on one processor (red + green)
Serial work : TS - total time required for the serial elision (green only)
Assume each step takes unit time : total work = 35; serial work = 22
HPCE / dt10 / 2013 / 4.37
Interpreting the work-first rule
• The work-first rule appears in many guises
– What are c1 and c∞ in practice?
• Multi-core CPUs and OSs support traditional threads
– c1 : how much time does it take to swap between two threads on a CPU?
– c∞ : how much time does it take to create a new thread?
• GPUs support hundreds of parallel threads
– c1 : nano-second scheduling of threads within a kernel
– c∞ : milli-second cost of managing kernels from the CPU
• Intel TBB supports thousands of tasks
– c1 : agglomeration of loop iterations to reduce per-task overheads
– c∞ : hierarchical task-based scheduler (based on Cilk)
• Bear this principle in mind as we look at real systems
HPCE / dt10 / 2013 / 4.38
An example: Matrix-Matrix Multiply
• Recursive decomposition of matrix-matrix multiplication (MMM)
– Sub-divide each matrix into quadrants
– Perform the calculation using operations on the quadrants
– Split the matrices until we’re down to 1x1 matrices (scalars)
• Yes, technically we could use Strassen’s algorithm, but we’ll ignore that for now
• Standard MMM is O(n³)
– What is the big-O complexity of this algorithm?
[Figure: quadrant decomposition of the matrices A, B and C]
HPCE / dt10 / 2013 / 4.39
// Some sort of matrix - the details are not important
typedef ... matrix;

// Return a sub-matrix of A
// BT=0 -> top, BT=1 -> bottom; LR=0 -> left, LR=1 -> right
// The returned matrix is a _view_ on the matrix. Modifications
// to the quad affect matrix A too
matrix quad(matrix A, int BT, int LR);

// Perform a classic n^3 matrix multiply-add
// DST = DST + A * B
void multiply_add_dense(matrix DST, matrix A, matrix B);
HPCE / dt10 / 2013 / 4.40
void multiply_add_recursive(matrix DST, matrix A, matrix B)
{
    if((DST.cols <= 4) || (A.cols<=4) || (DST.rows<=4)){
        multiply_add_dense(DST, A, B);
    }else{
        multiply_add_recursive(quad(DST,0,0), quad(A,0,0), quad(B,0,0));
        multiply_add_recursive(quad(DST,0,1), quad(A,0,0), quad(B,0,1));
        multiply_add_recursive(quad(DST,1,0), quad(A,1,0), quad(B,0,0));
        multiply_add_recursive(quad(DST,1,1), quad(A,1,0), quad(B,0,1));
        multiply_add_recursive(quad(DST,0,0), quad(A,0,1), quad(B,1,0));
        multiply_add_recursive(quad(DST,0,1), quad(A,0,1), quad(B,1,1));
        multiply_add_recursive(quad(DST,1,0), quad(A,1,1), quad(B,1,0));
        multiply_add_recursive(quad(DST,1,1), quad(A,1,1), quad(B,1,1));
    }
}
HPCE / dt10 / 2013 / 4.41
cilk void multiply_add_recursive(matrix DST, matrix A, matrix B)
{
    if((DST.cols <= 4) || (A.cols<=4) || (DST.rows<=4)){
        multiply_add_dense(DST, A, B);
    }else{
        spawn multiply_add_recursive(quad(DST,0,0), quad(A,0,0), quad(B,0,0));
        spawn multiply_add_recursive(quad(DST,0,1), quad(A,0,0), quad(B,0,1));
        spawn multiply_add_recursive(quad(DST,1,0), quad(A,1,0), quad(B,0,0));
        spawn multiply_add_recursive(quad(DST,1,1), quad(A,1,0), quad(B,0,1));
        sync;
        spawn multiply_add_recursive(quad(DST,0,0), quad(A,0,1), quad(B,1,0));
        spawn multiply_add_recursive(quad(DST,0,1), quad(A,0,1), quad(B,1,1));
        spawn multiply_add_recursive(quad(DST,1,0), quad(A,1,1), quad(B,1,0));
        spawn multiply_add_recursive(quad(DST,1,1), quad(A,1,1), quad(B,1,1));
        sync;
    }
}
HPCE / dt10 / 2013 / 4.42
See the course home-page for the MMM C and Cilk
programs, and some info on trying Cilk:
http://cas.ee.ic.ac.uk/people/dt10/teaching/2012/hpce/
HPCE / dt10 / 2013 / 4.43