the implementation of the cilk-5 multithreaded language (frigo, leiserson, and randall)

24
UNIVERSITY NIVERSITY OF OF M M ASSACHUSETTS, ASSACHUSETTS, A MHERST MHERST Department of Computer Science Department of Computer Science The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall) Alistair Dundas Department of Computer Science University of Massachusetts

Upload: urbano

Post on 12-Jan-2016

38 views

Category:

Documents


2 download

DESCRIPTION

The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall). Alistair Dundas Department of Computer Science University of Massachusetts. Outline. What is Cilk? Cilk example: the Fibonacci algorithm. The work-first principle. Work Stealing. The T.H.E. Protocol. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science

The Implementation of the Cilk-5 Multithreaded

Language(Frigo, Leiserson, and

Randall)Alistair DundasDepartment of Computer Science

University of Massachusetts

Page 2: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 2

Outline

What is Cilk? Cilk example: the Fibonacci

algorithm. The work-first principle. Work Stealing. The T.H.E. Protocol. Empirical results. Summary and questions.

Page 3: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 3

What Is Cilk?

Extension of C for parallel programming. Designed for SMP machines with support

for shared memory. Benefits:

Provably efficient work stealing scheduler. Clean programming model. Benefits over normal thread programming:

discussion topic! Specifically: Source to source compiler

generating C.

Page 4: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 4

Example: Fibonacci Algorithm

int main (int argc, char *argv[]){ int n, result; n = atoi(argv[1]); result = fib(n); printf(“Result:%d\n”, result); return 0;}

int fib (int n){ if (n<2) return n; else { int x, y; x = fib (n-1); y = fib (n-2); return (x+y); }}

Page 5: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 5

Example: Fibonacci In Parallel

cilk int main (int argc, char *argv[]){ int n, result; n = atoi(argv[1]); result = spawn fib(n); sync; printf(“Result:%d\n”, result); return 0;}

cilk int fib (int n){ if (n<2) return n; else { int x, y; x = spawn fib (n-1); y = spawn fib (n-2); sync; return (x+y); }}

Page 6: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 6

Source to Source Compiler

Page 7: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 7

The Work First Principle

Work is the amount of time needed to execute the computation serially.

Critical path length is the execution time on an infinite number of processors.

The Work-First Principle: Minimize scheduling overhead borne by work at the expense of increasing the critical path.

Page 8: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 8

Theory: The Work First Principle

Where TP is the time on P processors: TP = T1/P + O(T) (1)

Making critical path overhead explicit: TP <= T1/P + cT (2)

Define average parallelism (max speedup): PAVERAGE = T1/T

Define parallel slackness: PAVERAGE/P

Page 9: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 9

The Work First Principle (cont)

Assumption of parallel slackness: PAVERAGE/P ≫ c

Combining these with the inequality, we get: TP ≈ T1/P

Define work overhead: c1 = T1/TS

TP ≈ c1TS/P Conclusion: Minimize work overhead.

Page 10: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 10

Work Stealing Algorithm

Each worker keeps a ready deque (double ended queue) of procedure instances waiting to run.

Workers treat the deque as a stack, pushing and popping procedure calls on to the end.

When workers have no more work, they steal from the front of another workers’ deque. Parents are stolen before children.

This is implemented using two versions of each procedure: a fast clone, and a slow clone.

Page 11: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 11

Fast Clone

Run fast clone when a procedure is spawned.

Little support for parallelism. Whenever a call is made, save complete

state, and push on to end of deque. When call returns, check to see if

procedure was stolen. If stolen, return immediately. If not stolen, carry on execution.

Since children are never stolen, sync is a no op.

Page 12: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 12

Fast Clone Example

cilk int fib (int n){ if (n<2) return n; else { int x, y; x = spawn fib (n-1); y = spawn fib (n-2); sync; return (x+y); }}

Page 13: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 13

1 int fib (int n)

2 {

3 fib.frame *f; frame pointer

4 f = alloc(sizeof(*f)); allocate frame

5 f->sig = fib.sig; initialize frame

6 if (n!2) {

7 free(f, sizeof(*f)); free frame

8 return n;

9 }

10 else { … }

Fast Clone Example

Page 14: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 14

Fast Clone Example11 int x, y; 12 f->entry = 1; save PC13 f->n = n; save live vars 14 *T = f; store frame pointer 15 push(); push frame 16 x = fib (n-1); do C call 17 if (pop(x) == FAILURE) pop frame 18 return 0; procedure stolen 19 < … > second spawn 20 ; sync is free! 21 free(f, sizeof(*f)); free frame 22 return (x+y); 23 } }

Page 15: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 15

Slow Clone

Slow clone used when a procedure is stolen.

Similar to fast clone, but also supports concurrent execution.

It restores program counter and procedure state using copy stored on deque.

Calling sync makes call to runtime system for check on children’s status.

Page 16: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 16

The T.H.E. Protocol

Deques held in shared memory. Workers operate at the end, thiefs at the front.

We must prevent race conditions where a thief and victim try to access the same procedure frame.

Locking deques would be expensive for workers.

The T.H.E Protocol removes overhead of the common case, where there is no conflict.

Page 17: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 17

The T.H.E. Protocol

Assumes only reads and writes are atomic. Head of the deque is H, tail is T, and (T ≥ H)

Only thief can change H. Only worker can change T.

To steal thiefs must get the lock L. At most two processors operating on deque.

Three cases of interaction: Two or more items on deque – each gets one. One item on deque – either worker or thief gets

frame, but not both. No items on deque – both worker and thief fail.

Page 18: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 18

One item on deque case

Both thief and worker assume they can get a procedure frame and change H or T.

If both thief and worker try to steal frame, one or both of them will discover (H > T), depending on instruction order.

If thief discovers (H > T) it backs off and restores H.

If worker discovers (H > T) it restores T, and then tries for the lock. Inside lock, procedure can be safely popped if still there.

Page 19: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 19

Empirical Results

On an eight processor Sun SMP, achieved average speed up of 6.2 from elison (serial C non-threaded versions).

Assumptions of work-first seem sound: Applications tested all showed high

amounts of “average parallelism”. Work overhead small for most programs.

Least speed up is where overhead is greatest.

Page 20: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 20

Summary

Extension of C for parallel programming.

Aims to simplify parallelization. Main idea is to prevent overhead for

workers rather than focus on critical path.

Empirical results show speed up average of 6.2 on an 8 processor machine.

Page 21: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 21

My Questions

A cilk spawn is always just a C call. Who starts the threads, and how many are there?

Why use Cilk rather than use threads directly?

What about using Cilk on a bewoulf cluster?

Are their test programs representative of SMP applications?

Page 22: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 22

Other Extentions

Inlets – a wrapper around spawned procedure returns.

Abort – terminates work no longer needed (e.g. in parallel search).

Locking facilities for access to shared data.

Page 23: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 23

T.H.E. Protocol: The Worker/Victim

push()

{ T++; }

pop() { T--; if (H > T) { T++; lock(L); T--; if (H > T) { T++; unlock(L); return FAILURE; } unlock(L); } return SUCCESS;}

steal() { lock(L); H++; if (H > T) { H--; unlock(L); return FAILURE; } unlock(L); return SUCCESS;}

Page 24: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

UUNIVERSITYNIVERSITY OFOF M MASSACHUSETTS, ASSACHUSETTS, AAMHERST MHERST – – Department of Computer ScienceDepartment of Computer Science 24

Fibonacci Illustration