Basic Cilk Programming
Multithreaded Programming in Cilk, LECTURE 1
Adapted from Charles E. Leiserson, Supercomputing Technologies Research Group,
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Multi-core processors
[Diagram: P processors, each with a private cache ($), connected through a network to shared memory and I/O: a MIMD shared-memory machine.]
Cilk Overview
• Cilk extends the C language with just a handful of keywords: cilk, spawn, sync.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.
Fibonacci

C elision:

    int fib (int n) {
      if (n<2) return n;
      else {
        int x, y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }

Cilk code:

    cilk int fib (int n) {
      if (n<2) return n;
      else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }
Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
Basic Cilk Keywords

    cilk int fib (int n) {
      if (n<2) return n;
      else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

cilk — identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn — the named child Cilk procedure can execute in parallel with the parent caller.
sync — control cannot pass this point until all spawned children have returned.
Dynamic Multithreading
The computation dag unfolds dynamically.
Example: fib(4) — the runtime is “processor oblivious.”
[Diagram: the computation dag of fib(4), with procedure instances labeled 4, 3, 2, 2, 1, 1, 1, 0, 0.]
Multithreaded Computation
• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.
[Diagram: a multithreaded dag running from an initial thread to a final thread, with spawn, return, and continue edges marked.]
Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span*

LOWER BOUNDS
• TP ≥ T1/P
• TP ≥ T∞

*Also called critical-path length or computational depth.
Speedup
Definition: T1/TP = speedup on P processors.
If T1/TP = Θ(P) ≤ P, we have linear speedup;
  = P, we have perfect linear speedup;
  > P, we have superlinear speedup,
which is not possible in our model, because of the lower bound TP ≥ T1/P.
July 13, 2006
Parallelism
Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is
T1/T∞ = parallelism
  = the average amount of work per step along the span.
Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125
Using many more than 2 processors makes little sense.
[Diagram: the fib(4) dag with its 17 unit-time threads; a critical path of 8 threads, numbered 1 through 8, is highlighted.]
Parallelizing Vector Addition

C:

    void vadd (real *A, real *B, int n) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    }

Parallelization strategy:
1. Convert loops to recursion.

    void vadd (real *A, real *B, int n) {
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
      } else {
        vadd (A, B, n/2);
        vadd (A+n/2, B+n/2, n-n/2);
      }
    }
2. Insert Cilk keywords.

Cilk:

    cilk void vadd (real *A, real *B, int n) {
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
      } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
      }
    }

Side benefit: divide-and-conquer is generally good for caches!
Vector Addition

    cilk void vadd (real *A, real *B, int n) {
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
      } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
      }
    }
Vector Addition Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
Another Parallelization

C:

    void vadd1 (real *A, real *B, int n) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    }
    void vadd (real *A, real *B, int n) {
      int j;
      for (j=0; j<n; j+=BASE) {
        vadd1(A+j, B+j, min(BASE, n-j));
      }
    }

Cilk:

    cilk void vadd1 (real *A, real *B, int n) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    }
    cilk void vadd (real *A, real *B, int n) {
      int j;
      for (j=0; j<n; j+=BASE) {
        spawn vadd1(A+j, B+j, min(BASE, n-j));
      }
      sync;
    }

Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(n)
Parallelism: T1/T∞ = Θ(1)
Optimal Choice of BASE
To add two vectors of length n using an optimal choice of BASE to maximize parallelism:
Work: T1 = Θ(n)
Span: T∞ = Θ(BASE + n/BASE)
Choosing BASE = √n ⇒ T∞ = Θ(√n)
Parallelism: T1/T∞ = Θ(√n)
Weird! Don’t we want to remove recursion?
Parallel Programming = Sequential Program + Decomposition + Mapping + Communication and Synchronization
Scheduling
• Cilk allows the programmer to express potential parallelism in an application.
• The Cilk scheduler maps Cilk threads onto processors dynamically at runtime.
• Since on-line schedulers are complicated, we'll illustrate the ideas with an off-line scheduler.
[Diagram: the shared-memory multiprocessor again — P processors with caches ($), network, memory, and I/O.]
Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.
• Complete step: ≥ P threads ready; run any P.
• Incomplete step: < P threads ready; run all of them.
(Illustrated with P = 3.)
Greedy-Scheduling Theorem
Theorem [Graham '68 & Brent '75]. Any greedy scheduler achieves
TP ≤ T1/P + T∞.
Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■
Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} (lower bounds), we have
TP ≤ T1/P + T∞
  ≤ 2·max{T1/P, T∞}
  ≤ 2TP*. ■
Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy-Scheduling Theorem gives us
TP ≤ T1/P + T∞
  ≈ T1/P.
Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity (T1/T∞)/P is called the parallel slackness.
Lessons
• Work and span can predict performance on large machines better than running times on small machines can.
• Focus on improving parallelism (i.e., maximize T1/T∞). This will let you use larger processor counts effectively.
Cilk Performance
• Cilk's “work-stealing” scheduler achieves
  – TP = T1/P + O(T∞) expected time (provably);
  – TP ≈ T1/P + T∞ time (empirically).
• Near-perfect linear speedup if P ≪ T1/T∞.
• Instrumentation in Cilk allows the user to determine accurate measures of T1 and T∞.
• The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
Cilk's Work-Stealing Scheduler
Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack: spawning pushes work onto the bottom, and returning pops from the bottom.
When a processor runs out of work, it steals a thread from the top of a random victim's deque.
[Diagram sequence: four processors spawning, returning, and stealing from one another's deques.]
Performance of Work-Stealing
Theorem: Cilk's work-stealing scheduler achieves an expected running time of
TP ≤ T1/P + O(T∞)
on P processors.
Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is
(T1 + O(PT∞))/P = T1/P + O(T∞). ■
Space Bounds
Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most SP ≤ P·S1.
Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■
[Diagram: P = 3 processors, each working on a distinct leaf of the spawn tree; each root-to-leaf path uses at most S1 space.]
Linguistic Implications
Code like the following executes properly without any risk of blowing out memory:

    for (i=1; i<1000000000; i++) {
      spawn foo(i);
    }
    sync;

MORAL: Better to steal parents than children!
Summary
• Cilk is simple: cilk, spawn, sync
• Recursion, recursion, recursion, …
• Work & span, work & span, work & span, …
Sudoku
• A game where you fill in a grid with numbers:
  – A number cannot appear more than once in any column.
  – A number cannot appear more than once in any row.
  – A number cannot appear more than once in any “region.”
• Typically presented with a 9×9 grid, but for simplicity we'll consider a 4×4 grid.
[Figure: a 4×4 Sudoku puzzle with 11 open positions; three solution steps are shown, each filling a cell because (1) 1 is the only number missing in its column, (2) 3 already appears in its region, and (3) 3 is the only number missing in its row.]
Sudoku Algorithm
• The two-dimensional Sudoku grid is flattened into a vector:
  – Unsolved locations are filled with zeros.
  – The first two rows of the initial 4×4 puzzle are shown below.
  – The current working location is loc=0, and the subgrid size parameter is size=3.
  – Initially call spawn solve(size=3, grid, loc=0).

    grid:  3 0 0 4 0 0 0 2 …

• The first location already has a solution, so move to the next location:
  – Recursively call spawn solve(size=3, grid, loc=loc+1).

    grid:  3 0 0 4 0 0 0 2 …
Exhaustive Search
• The next location [loc=1] has no solution (‘0’ in the current cell), so:
  – Create 4 new grids and try each of the 4 possibilities (1, 2, 3, 4) concurrently.
  – Note: the search goes much faster if the guess is first tested to see if it is legal.
  – Spawn a new search tree for each guess k: call spawn solve(size=3, grid[k], loc=loc+1).

Source: Mattson and Keutzer, UCB CS294

new grids:
    3 1 0 4 0 0 0 2 …
    3 2 0 4 0 0 0 2 …
    3 3 0 4 0 0 0 2 …   ← illegal, since 3 already appears in the same row
    3 4 0 4 0 0 0 2 …   ← illegal, since 4 already appears in the same row
Cilk Sudoku solution (part 1 of 3)

    cilk int solve(int size, int* grid, int loc)
    {
      int i, k, solved, numGrids, solution[MAX_NUM];
      int* myGrid[MAX_NUM];
      int numNumbers = size*size;
      int gridLen = numNumbers*numNumbers;

      if (loc == gridLen) {
        /* maximum depth; reached the end of the puzzle */
        return check_solution(size, grid);
      }

      /* if this node has a solution (given by the puzzle) at this */
      /* location, move to the next node location */
      if (grid[loc] != 0) {
        solved = spawn solve(size, grid, loc+1);
        sync;
        return solved;
      }
Cilk Sudoku solution (part 2 of 3)

      /* try each number (unique to row, col, subgrid) */
      numGrids = 0;
      for (i = 0, k = 0; i < MAX_NUM; i++) {
        k = next_guess(size, k, loc, grid);
        if (k == 0) break;  /* no more legal guesses at this location */

        /* need a new grid to work with */
        myGrid[i] = new_grid(size, grid);
        myGrid[i][loc] = k;
        solution[i] = spawn solve(size, myGrid[i], loc+1);
        numGrids += 1;
      }
      sync;
Cilk Sudoku solution (part 3 of 3)

      /* check to see if there is a solution */
      solved = 0;
      for (i = 0; i < numGrids; i++) {
        if (solution[i] == 1) {
          int n;
          /* found a solution; copy the result to the parent */
          for (n = loc; n < gridLen; n++) {
            grid[n] = (myGrid[i])[n];
          }
          solved = 1;
        }
        free(myGrid[i]);
      }
      return solved;
    }