cilk plus - engineering school class web sites · 2015. 1. 19. · cilk plus ∙ the “cilk”...


Cilk Plus

∙  The “Cilk” part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The “Plus” part supports vector parallelism.)

∙  Developed originally by Cilk Arts, an MIT spin-off, which was acquired by Intel in July 2009.

∙  Based on the award-winning Cilk multithreaded language developed at MIT.

∙  Features a provably efficient work-stealing scheduler.

∙  Provides a hyperobject library for parallelizing code with non-local variables.

∙  Includes the Cilkscreen race detector and Cilkview scalability analyzer.


Nested Parallelism in Cilk

    uint64_t fib(uint64_t n) {
        if (n < 2) {
            return n;
        } else {
            uint64_t x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x + y);
        }
    }

cilk_spawn: The named child function may execute in parallel with the parent caller.

cilk_sync: Control cannot pass this point until all spawned children have returned.

Cilk keywords grant permission for parallel execution. They do not command parallel execution.


Loop Parallelism in Cilk

The iterations of a cilk_for loop execute in parallel.

Example: In-place matrix transpose

    // indices run from 0, not 1
    cilk_for (int i=1; i<n; ++i) {
        for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

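$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\qquad
A^{T} = \begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{n1} \\
a_{12} & a_{22} & \cdots & a_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{nn}
\end{pmatrix}
$$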
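A minimal C++ sketch of the same transpose, written against the serialization (cilk_for defined as for) so that it compiles without a Cilk Plus compiler. The name transpose_in_place, the Matrix alias, and the use of std::vector are assumptions made for illustration; they do not come from the slides. Outer iteration i touches only the pairs (i, j) and (j, i) with j < i, so no two outer iterations touch the same element, which is what makes the outer loop a candidate for cilk_for.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serialization: without a Cilk Plus compiler, cilk_for is just for.
#define cilk_for for

// Hypothetical alias for a square matrix stored row by row.
using Matrix = std::vector<std::vector<double>>;

// In-place transpose of a square matrix (assumes every row has A.size() entries).
void transpose_in_place(Matrix& A) {
    const std::size_t n = A.size();
    // Row 0 has no elements below the diagonal, so the loop starts at i = 1.
    cilk_for (std::size_t i = 1; i < n; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            std::swap(A[i][j], A[j][i]);
        }
    }
}
```

Only the outer loop is marked parallel here, mirroring the slide; the inner loop stays serial.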


Serial Semantics

Cilk source:

    uint64_t fib(uint64_t n) {
        if (n < 2) {
            return n;
        } else {
            uint64_t x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x + y);
        }
    }

Serialization:

    uint64_t fib(uint64_t n) {
        if (n < 2) {
            return n;
        } else {
            uint64_t x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x + y);
        }
    }

The serialization of a Cilk program is always a legal interpretation of the program’s semantics.

To obtain the serialization:

    #define cilk_for for
    #define cilk_spawn
    #define cilk_sync

Remember, Cilk keywords grant permission for parallel execution. They do not command parallel execution.
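The three macro definitions can be tried directly. The sketch below is the Cilk fib from the slides with the keywords defined away, so it builds as ordinary C++ with any compiler; it assumes no Cilk header is included (with a real Cilk Plus compiler the macros would be dropped and the keywords left to the toolchain).

```cpp
#include <cstdint>

// Serialization macros from the slide: the keywords vanish, leaving serial C++.
#define cilk_for for
#define cilk_spawn
#define cilk_sync

std::uint64_t fib(std::uint64_t n) {
    if (n < 2) {
        return n;
    } else {
        std::uint64_t x, y;
        x = cilk_spawn fib(n - 1);  // expands to: x = fib(n - 1);
        y = fib(n - 2);
        cilk_sync;                  // expands to an empty statement
        return x + y;
    }
}
```

Because the serialization is a legal interpretation of the program, its results can serve as the reference for regression tests of the parallel build.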


Scheduling

The Cilk concurrency platform allows the programmer to express logical parallelism in an application.

The Cilk scheduler maps the executing program onto the processor cores dynamically at runtime.

Cilk’s work-stealing scheduler is provably efficient.


[Figure: a shared-memory multicore machine. Labels: Memory, I/O, Network, and three processors (P), each with its own cache ($).]


Cilk Platform

[Figure: the Cilk platform workflow, with six numbered steps (1 to 6). Labeled components: Cilk++ source (the fib example), Serialization (the same code without cilk_spawn and cilk_sync), Compiler, Hyperobject Library, Linker, Binary, Runtime System, Conventional Regression Tests, Reliable Single-Threaded Code, Cilkscreen Race Detector, Parallel Regression Tests, Reliable Multi-Threaded Code, Cilkview Scalability Analyzer, Parallel Performance.]

Execution Model

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

Example: fib(4)

The computation dag unfolds dynamically.

“Processor oblivious”

[Figure: the computation dag for fib(4). Each invocation is labeled with its argument: 4; 3 and 2; 2, 1, 1, and 0; 1 and 0.]


Computation Dag

●  A parallel instruction stream is a dag G = (V, E).

●  Each vertex v ∈ V is a strand: a sequence of instructions not containing a call, spawn, sync, or return (or thrown exception).

●  An edge e ∈ E is a spawn, call, return, or continue edge.

●  Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer.

[Figure: an example computation dag. Labels: initial strand, final strand, strand, spawn edge, call edge, return edge, continue edge.]
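The divide-and-conquer conversion of a cilk_for loop can be sketched as follows. This is an illustrative reconstruction, not the code a Cilk compiler actually generates: loop_dc, its grain parameter, and the use of std::function are hypothetical, and the keywords are defined away so the sketch runs serially with any C++ compiler.

```cpp
#include <cstddef>
#include <functional>

// Serialization macros so the sketch builds without a Cilk Plus compiler.
#define cilk_spawn
#define cilk_sync

// Hypothetical helper: runs body(i) for every i in [lo, hi).
// Ranges larger than grain are split in half; the left half is spawned
// and the right half runs in the continuation.
void loop_dc(std::size_t lo, std::size_t hi,
             const std::function<void(std::size_t)>& body,
             std::size_t grain = 1) {
    if (grain == 0) grain = 1;  // guard: a zero grain would never reach the base case
    if (hi <= lo) return;       // empty range
    if (hi - lo <= grain) {
        for (std::size_t i = lo; i < hi; ++i) body(i);  // serial base case
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    cilk_spawn loop_dc(lo, mid, body, grain);  // may run in parallel with the next call
    loop_dc(mid, hi, body, grain);
    cilk_sync;                                 // both halves are done past this point
}
```

With this shape the recursion is only logarithmically deep in the number of iterations, so the loop control adds little to the span compared with spawning the iterations one after another.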


How Much Parallelism?

Assuming that each strand executes in unit time, what is the parallelism of this computation?


Amdahl’s “Law”

Gene M. Amdahl

If 50% of your application is parallel and 50% is serial, you can’t get more than a factor of 2 speedup, no matter how many processors it runs on.

In general, if a fraction α of an application must be run serially, the speedup can be at most 1/α.


Quantifying Parallelism

What is the parallelism of this computation?

Amdahl’s Law says that since the serial fraction is 3/18 = 1/6, the speedup is upper-bounded by 6.

Performance Measures

TP = execution time on P processors

T1 = work = 18

T∞ = span* = 9

* Also called critical-path length or computational depth.

WORK LAW ∙ TP ≥ T1/P

SPAN LAW ∙ TP ≥ T∞
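As a worked instance of the WORK LAW and SPAN LAW, take the example computation with work 18 and span 9 (T1, TP, and T∞ are written with subscripts below). Combining the two laws gives a lower bound on the running time for each processor count:

```latex
\begin{align*}
T_P &\ge \max\{\, T_1/P,\; T_\infty \,\} \\
P = 1:\quad T_1 &\ge \max\{18,\ 9\} = 18 \\
P = 2:\quad T_2 &\ge \max\{9,\ 9\} = 9 \\
P = 3:\quad T_3 &\ge \max\{6,\ 9\} = 9
\end{align*}
```

From P = 2 onward the SPAN LAW is the binding constraint, so in this model extra processors cannot push the running time below 9.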

Series Composition

[Figure: A followed by B in series.]

Work: T1(A∪B) = T1(A) + T1(B)

Span: T∞(A∪B) = T∞(A) + T∞(B)

Parallel Composition

[Figure: A and B side by side in parallel.]

Work: T1(A∪B) = T1(A) + T1(B)

Span: T∞(A∪B) = max{T∞(A), T∞(B)}

Speedup

Definition. T1/TP = speedup on P processors.

●  If T1/TP < P, we have sublinear speedup.

●  If T1/TP = P, we have (perfect) linear speedup.

●  If T1/TP > P, we have superlinear speedup, which is not possible in this simple performance model, because of the WORK LAW TP ≥ T1/P.


Parallelism

Because the SPAN LAW dictates that TP ≥ T∞, the maximum possible speedup given T1 and T∞ is

T1/T∞ = parallelism = the average amount of work per step along the span = 18/9 = 2.
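Work, span, and parallelism can be computed mechanically from a dag of unit-time strands: work is the number of strands, and span is the number of strands on a longest path. The sketch below assumes a hypothetical representation (successor lists, vertices numbered in topological order, at least one strand); none of these names come from the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical dag representation: succ[u] lists the successors of strand u.
// Assumes every edge u -> v satisfies u < v (topological numbering)
// and that each strand takes unit time.
using Dag = std::vector<std::vector<std::size_t>>;

struct WorkSpan {
    std::size_t work;  // T1: total number of strands
    std::size_t span;  // T-infinity: strands on a longest path
};

WorkSpan work_and_span(const Dag& succ) {
    const std::size_t n = succ.size();
    std::vector<std::size_t> depth(n, 1);  // longest path ending at each strand
    std::size_t span = 0;
    for (std::size_t u = 0; u < n; ++u) {
        span = std::max(span, depth[u]);   // depth[u] is final: all predecessors are < u
        for (std::size_t v : succ[u]) {
            depth[v] = std::max(depth[v], depth[u] + 1);
        }
    }
    return {n, span};
}

// Parallelism T1 / T-infinity; meaningful only for a non-empty dag.
double parallelism(const WorkSpan& ws) {
    return static_cast<double>(ws.work) / static_cast<double>(ws.span);
}
```

For a dag in which one strand fans out to three strands that join in a final strand, this gives work 5, span 3, and parallelism 5/3.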

Example: fib(4)

Assume for simplicity that each strand in fib(4) takes unit time to execute.

Work: T1 = 17

Span: T∞ = 8

Parallelism: T1/T∞ = 2.125

[Figure: the fib(4) computation dag, with the strands along the critical path numbered 1 through 8.]

Using many more than 2 processors can yield only marginal performance gains.
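The figures 17 and 8 can be reproduced with the series and parallel composition rules. The sketch below rests on an assumed strand model, inferred because it matches the slide's numbers: a call with n < 2 is a single unit-time strand, and any other call consists of one strand that spawns fib(n-1), one strand that calls fib(n-2), and one strand after the sync. The names Cost, series, parallel, and fib_cost are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

struct Cost {
    std::uint64_t work;  // T1
    std::uint64_t span;  // T-infinity
};

// Series composition: work adds, span adds.
Cost series(Cost a, Cost b) { return {a.work + b.work, a.span + b.span}; }

// Parallel composition: work adds, span is the larger of the two.
Cost parallel(Cost a, Cost b) { return {a.work + b.work, std::max(a.span, b.span)}; }

// Work and span of fib(n) under the assumed unit-time strand model.
Cost fib_cost(unsigned n) {
    const Cost strand{1, 1};
    if (n < 2) return strand;                                    // base case: one strand
    const Cost spawned = fib_cost(n - 1);                        // the spawned child
    const Cost continuation = series(strand, fib_cost(n - 2));   // strand that calls fib(n-2)
    // first strand, then child in parallel with continuation, then the strand after the sync
    return series(series(strand, parallel(spawned, continuation)), strand);
}
```

Under this model fib_cost(4) yields work 17 and span 8, so the parallelism is 17/8 = 2.125.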

Greedy Scheduling

IDEA: Do as much as possible on every step.

Definition. A strand is ready if all its predecessors have executed.

Complete step
●  ≥ P strands ready.
●  Run any P.

Incomplete step
●  < P strands ready.
●  Run all of them.

[Figure: an example dag scheduled greedily with P = 3.]

Analysis of Greedy

Theorem [G68, B75, EZL89]. Any greedy scheduler achieves

TP ≤ T1/P + T∞.

Proof.

∙  # complete steps ≤ T1/P, since each complete step performs P work.

∙  # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■
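A greedy scheduler is simple enough to simulate, which makes the bound easy to check on small dags. The sketch below is an illustration under stated assumptions, not Cilk's scheduler (Cilk uses work stealing): strands take unit time, the dag is acyclic and given as hypothetical successor lists, and on each step the simulator runs min(P, number of ready strands) strands.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of steps a greedy scheduler needs on P processors.
// succ[u] lists the successors of strand u; the dag is assumed acyclic.
std::size_t greedy_steps(const std::vector<std::vector<std::size_t>>& succ,
                         std::size_t P) {
    if (P == 0) return 0;  // no processors: nothing can run
    const std::size_t n = succ.size();

    // pending[v] = number of predecessors of v that have not executed yet.
    std::vector<std::size_t> pending(n, 0);
    for (const auto& out : succ)
        for (std::size_t v : out) ++pending[v];

    std::vector<std::size_t> ready;
    for (std::size_t v = 0; v < n; ++v)
        if (pending[v] == 0) ready.push_back(v);

    std::size_t steps = 0;
    while (!ready.empty()) {
        // Complete step: run any P ready strands. Incomplete step: run all of them.
        const std::size_t k = std::min(P, ready.size());
        const std::vector<std::size_t> running(
            ready.end() - static_cast<std::ptrdiff_t>(k), ready.end());
        ready.resize(ready.size() - k);
        for (std::size_t u : running)
            for (std::size_t v : succ[u])
                if (--pending[v] == 0) ready.push_back(v);  // v has become ready
        ++steps;
    }
    return steps;
}
```

On a dag where one strand fans out to three strands that join in a final strand (work 5, span 3), the simulation with P = 2 takes 4 steps, which lies between the lower bound max{5/2, 3} = 3 and the upper bound 5/2 + 3 = 5.5.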


Optimality of Greedy

Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.

Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} by the WORK and SPAN LAWS, we have

TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP* . ■


Linear Speedup

Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever T1/T∞ ≫ P.

Proof. Since T1/T∞ ≫ P is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us

TP ≤ T1/P + T∞ ≈ T1/P .

Thus, the speedup is T1/TP ≈ P. ■

Definition. The quantity T1/(P·T∞), the parallelism divided by the number of processors, is called the parallel slackness.
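The role of parallel slackness can be made quantitative by rewriting the greedy bound in terms of it. In the derivation below, the symbol s for the slackness is introduced only for this illustration, and T1, TP, and T∞ are written with subscripts:

```latex
\begin{align*}
s &= \frac{T_1}{P\,T_\infty} \\
T_P &\le \frac{T_1}{P} + T_\infty
     = \frac{T_1}{P}\left(1 + \frac{P\,T_\infty}{T_1}\right)
     = \frac{T_1}{P}\left(1 + \frac{1}{s}\right) \\
\frac{T_1}{T_P} &\ge P \cdot \frac{s}{s+1}
\end{align*}
```

With a slackness of s = 10, for example, a greedy scheduler is guaranteed at least 10/11, or about 91%, of perfect linear speedup.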


Cilk Performance

●  Cilk’s work-stealing scheduler achieves
    ■  TP = T1/P + O(T∞) expected time (provably);
    ■  TP ≈ T1/P + T∞ time (empirically).

●  Near-perfect linear speedup as long as P ≪ T1/T∞ .

●  Instrumentation in Cilkview allows you to measure T1 and T∞ .