# 15-210: Parallelism in the Real World


15-210: Parallelism in the Real World

- Types of parallelism
- Parallel thinking
- Nested parallelism
- Examples (Cilk, OpenMP, Java Fork/Join)
- Concurrency

Cray-1 (1976): the world's most expensive love seat

Data center: hundreds of thousands of computers

Since 2005: multicore computers

Xeon Phi: Knights Landing. 72 cores, x86, 500 GB/s of memory bandwidth. Summer 2015.



Moore's Law

Moore's Law and Performance

Andrew Chien, 2008

64-core blade servers ($6K, shared memory), x 4

1024 CUDA cores


Parallel Hardware

Many forms of parallelism:
- Supercomputers: large scale, shared memory
- Clusters and data centers: large scale, distributed memory
- Multicores: tightly coupled, smaller scale
- GPUs, on-chip vector units
- Instruction-level parallelism

Parallelism is important in the real world.

Key Challenge: Software (How to Write Parallel Code?)

At a high level, it is a two-step process:
1. Design a work-efficient, low-span parallel algorithm
2. Implement it on the target hardware

In reality, each system requires different code because programming systems are immature, so it takes huge effort to generate efficient parallel code.
- Example: Quicksort in MPI is 1700 lines of code, and about the same in CUDA
- Implementing one parallel algorithm can be a whole thesis
- Take 15-418 (Parallel Computer Architecture and Programming)


The 15-210 Approach

Enable parallel thinking by raising the abstraction level:

I. Parallel thinking: applicable to many machine models and programming languages.
II. Reason about correctness and efficiency of algorithms and data structures.

Parallel Thinking

Recognizing true dependences: unteach sequential programming.

Parallel algorithm-design techniques:
- Operations on aggregates: map/reduce/scan
- Divide & conquer, contraction
- Viewing the computation as a DAG (based on dependences)

Cost model based on work and span


Quicksort from Aho-Hopcroft-Ullman (1974):

    procedure QUICKSORT(S):
      if S contains at most one element then return S
      else begin
        choose an element a randomly from S;
        let S1, S2 and S3 be the sequences of elements in S
          less than, equal to, and greater than a, respectively;
        return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
      end
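The AHU pseudocode above translates almost line for line into runnable code. A minimal C++ sketch (function name and use of `std::vector` are illustrative, not from the slides):

```cpp
#include <vector>
#include <cstdlib>

// Three-way quicksort mirroring the AHU pseudocode: partition S into the
// elements less than, equal to, and greater than a random pivot a, then
// recur on the "less" and "greater" parts and concatenate.
std::vector<int> quicksort(const std::vector<int>& S) {
  if (S.size() <= 1) return S;            // at most one element: done
  int a = S[std::rand() % S.size()];      // choose a random element a
  std::vector<int> S1, S2, S3;
  for (int x : S) {
    if (x < a) S1.push_back(x);
    else if (x == a) S2.push_back(x);
    else S3.push_back(x);
  }
  std::vector<int> out = quicksort(S1);   // QUICKSORT(S1)
  out.insert(out.end(), S2.begin(), S2.end());        // followed by S2
  std::vector<int> R = quicksort(S3);     // followed by QUICKSORT(S3)
  out.insert(out.end(), R.begin(), R.end());
  return out;
}
```

Note how, as the lecture emphasizes, nothing in this formulation forces an order on the two recursive calls: they are independent and could run in parallel.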

Quicksort from Sedgewick (2003):

    public void quickSort(int[] a, int left, int right) {
      int i = left - 1, j = right;
      if (right <= left) return;
      while (true) {
        while (a[++i] < a[right]);
        while (a[right] < a[--j]) if (j == left) break;
        if (i >= j) break;
        swap(a, i, j);
      }
      swap(a, i, right);
      quickSort(a, left, i - 1);
      quickSort(a, i + 1, right);
    }

Styles of Parallel Programming

- Data parallelism / bulk synchronous / SPMD
- Nested parallelism: what we covered
- Message passing
- Futures (other pipelined parallelism)
- General concurrency

Nested Parallelism

Nested parallelism = arbitrary nesting of parallel loops + fork-join.
- Assumes no synchronization among parallel tasks except at join points.
- Deterministic if there are no race conditions.

Advantages:
- Good schedulers are known
- Easy to understand, debug, and analyze costs
- Purely functional or imperative: either works


Series-Parallel DAGs

Dependence graphs of nested parallel computations are series-parallel.

Two tasks are parallel if neither is reachable from the other. A data race occurs if two parallel tasks access the same location and at least one of the accesses is a write.


Nested Parallelism: parallel loops

PASL/C++:

    parallel_for (cntr, 0, n, [&] (i) {B[i] = A[i]+1;})

Cilk/C++:

    cilk_for (i=0; i < n; ++i) {B[i] = A[i]+1;}

Microsoft TPL (C#, F#):

    Parallel.ForEach(A, x => x+1);

NESL, Parallel Haskell:

    B = {x + 1 : x in A}

OpenMP:

    #pragma omp for
    for (i=0; i < n; i++) {B[i] = A[i] + 1;}

Nested Parallelism: fork-join

Dates back to the 60s; used in dialects of Algol and Pascal:

    cobegin { S1; S2; }

Java fork-join framework:

    coinvoke(f1,f2)

Microsoft TPL (C#, F#):

    Parallel.Invoke(f1,f2)

OpenMP (C++, C, Fortran, ...):

    #pragma omp sections
    {
      #pragma omp section
      S1;
      #pragma omp section
      S2;
    }


Nested Parallelism: fork-join (continued)

PASL/C++:

    fork2 ([&] (), [&] ())

Cilk/C++:

    spawn S1;
    S2;
    sync;

Various functional languages:

    (exp1 || exp2)

Dialects of ML and Lisp:

    plet x = exp1
         y = exp2
    in exp3 end

Cilk vs. what we've covered

ML:

    val (a,b) = par(fn () => f(x), fn () => g(y))

Pseudocode:

    (a,b) = (f(x) || g(y))

Cilk:

    cilk_spawn a = f(x);
    b = g(y);
    cilk_sync;

ML:

    S = map f A

Pseudocode:

    S = < f(x) : x in A >

Cilk:

    cilk_for (int i = 0; i < n; i++) S[i] = f(A[i])

ML:

    S = tabulate f n

Pseudocode:

    S = < f(i) : 0 <= i < n >

Example Cilk:

    int fib (int n) {
      if (n < 2) return n;
      int x = cilk_spawn fib(n-1);
      int y = fib(n-2);
      cilk_sync;
      return x + y;
    }
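The same fork-join fib pattern can be expressed in standard C++11 without Cilk, using `std::future` for the spawn and `get()` for the sync. A sketch (not the course's code; intended only to show the correspondence):

```cpp
#include <future>

// Fork-join fib: "spawn" fib(n-1) as an asynchronous task, compute
// fib(n-2) in the current thread, then "sync" by waiting on the future.
long fib(long n) {
  if (n < 2) return n;
  std::future<long> x = std::async(fib, n - 1);  // spawn (may run in parallel)
  long y = fib(n - 2);                           // continuation in this thread
  return x.get() + y;                            // sync: join the spawned call
}
```

The default launch policy lets the runtime decide whether to actually run the spawned call in another thread, which is in the same spirit as Cilk's "a spawn is a permission to run in parallel, not an obligation".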
Example OpenMP: Numerical Integration

The C code for approximating pi
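The code on this slide did not survive the transcript. Based on the speaker notes below (map each x to 4/(1+x^2) and sum; `num_steps` controls the accuracy), a reconstruction of the standard midpoint-rule version might look like:

```cpp
// Approximate pi by numerically integrating 4/(1+x^2) over [0,1]
// with the midpoint rule. num_steps controls the accuracy.
// Reconstruction from the speaker notes; not the slide's exact code.
double approx_pi(long num_steps) {
  double step = 1.0 / (double)num_steps;
  double sum = 0.0;
  for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;     // midpoint of interval i
    sum += 4.0 / (1.0 + x * x);      // the "map" step, done sequentially
  }
  return step * sum;                 // the "reduce" step, folded into the loop
}
```

As the notes point out, the loop sequentializes what is mathematically a map and a reduce.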

The C/OpenMP code for approximating pi
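Again reconstructing from the speaker notes (the loop is declared parallel, `x` is private to each iteration, and `sum` is a `+` reduction), the OpenMP variant might look like the following sketch. Without `-fopenmp` the pragma is ignored and the code runs sequentially with the same result:

```cpp
// OpenMP version: the for loop is parallel, x is private (declared inside
// the loop body, so each iteration has its own copy), and sum is combined
// across threads with a "+" reduction. Reconstruction, not the slide's code.
double approx_pi_omp(long num_steps) {
  double step = 1.0 / (double)num_steps;
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;
    sum += 4.0 / (1.0 + x * x);
  }
  return step * sum;
}
```

Note the contrast the lecture draws: OpenMP can only reduce over a fixed set of built-in operations, unlike the higher-order reduce used in this course.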

Example: Java Fork/Join

    class Fib extends FJTask {
      volatile int result; // serves as arg and result
      int n;
      Fib(int _n) { n = _n; }
      public void run() {
        if (n <= 1) result = n;
        else {
          Fib f1 = new Fib(n-1);
          Fib f2 = new Fib(n-2);
          coInvoke(f1, f2);
          result = f1.result + f2.result;
        }
      }
    }
Cost Model (General)

Compositional:
- Work: total number of operations; costs are added across parallel calls.
- Span: depth/critical path of the computation; the maximum span is taken across forked calls.

Parallelism = Work/Span: approximately the number of processors that can be effectively used.
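The composition rules (work adds across parallel calls, span takes the max) can themselves be written as a small recursive cost calculator. A sketch for the fork-join fib recursion, charging unit cost per call (`Cost` and `fibCost` are illustrative names, not from the slides):

```cpp
#include <algorithm>

struct Cost { long work; long span; };

// Work/span of the fork-join fib recursion under a unit-cost-per-call model:
// the two recursive calls run in parallel, so work adds and span maxes.
Cost fibCost(int n) {
  if (n < 2) return {1, 1};            // base case: one unit of work and depth
  Cost a = fibCost(n - 1);
  Cost b = fibCost(n - 2);
  return { a.work + b.work + 1,                 // work: sum across forked calls
           std::max(a.span, b.span) + 1 };      // span: max across forked calls
}
```

For example, the span grows only linearly in n while the work grows exponentially, so the parallelism W/D of fib is enormous.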

Combining costs for a parallel for: for `pfor (i=0; i<n; i++) f(i)`, the work is the sum of the work of the iterations and the span is the maximum span over the iterations.

Why Work and Span

Simple measures that give us a good sense of efficiency (work) and scalability (span):
- We can schedule in O(W/P + D) time on P processors, which is within a constant factor of optimal.

Goals in designing an algorithm:
- Work should be about the same as the sequential running time. When it matches asymptotically we say the algorithm is work efficient.
- Parallelism (W/D) should be polynomial; O(n^(1/2)) is probably good enough.


How do the problems do on a modern multicore?

Speedups on 32 cores (T1 = parallel code on one core, Tseq = sequential code):

| Benchmark            | T1/T32 | Tseq/T32 |
|----------------------|--------|----------|
| Sort                 | 31.6   | 31.6     |
| Duplicate Removal    | 24.6   | 21.6     |
| Min Spanning Tree    | 17     | 11.2     |
| Max Independ. Set    | 14.5   | 10       |
| Spanning Forest      | 18.45  | 9        |
| Breadth First Search | 19.3   | 14.5     |
| Delaunay Triang.     | 20.8   | 15       |
| Triangle Ray Inter.  | 19.3   | 17       |
| Nearest Neighbors    | 12.25  | 11.7     |
| Sparse MxV           | 17.5   | 17       |
| Nbody                | 21.8   | 18       |
| Suffix Array         | 17     | 15       |

Styles of Parallel Programming

- Data parallelism / bulk synchronous / SPMD
- Nested parallelism: what we covered
- Message passing
- Futures (other pipelined parallelism)
- General concurrency

Parallelism vs. Concurrency

- Parallelism: using multiple processors/cores running at the same time. Serial semantics, parallel execution. A property of the hardware/software platform.
- Concurrency: non-determinacy due to interleaving threads. Concurrent semantics, parallel or serial execution. A property of the application.


|                     | Concurrency: sequential  | Concurrency: concurrent |
|---------------------|--------------------------|-------------------------|
| Parallelism: serial | Traditional programming  | Traditional OS          |
| Parallelism: parallel | Deterministic parallelism | General parallelism   |

Concurrency: Stack Example 1

    struct link {int v; link* next;};

    struct stack {
      link* headPtr;
      void push(link* a) {
        a->next = headPtr;
        headPtr = a;
      }
      link* pop() {
        link* h = headPtr;
        if (headPtr != NULL)
          headPtr = headPtr->next;
        return h;
      }
    };


CAS

    bool CAS(ptr* addr, ptr a, ptr b) {
      atomic {
        if (*addr == a) {
          *addr = b;
          return 1;
        } else return 0;
      }
    }

A built-in instruction on most processors. On x86: CMPXCHG8B (8-byte version), CMPXCHG16B (16-byte version).
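In standard C++ the CAS primitive above corresponds to `std::atomic`'s compare-exchange operations, which compile down to instructions like CMPXCHG on x86. A minimal sketch (the wrapper name `cas` is illustrative):

```cpp
#include <atomic>

// CAS on an atomic int: if *addr == expected, set *addr = desired and
// return true; otherwise return false. compare_exchange_strong also
// writes the observed value back into `expected`, which we discard here.
bool cas(std::atomic<int>* addr, int expected, int desired) {
  return addr->compare_exchange_strong(expected, desired);
}
```

In CAS-loop code like the stack below, `compare_exchange_weak` is typically preferred, since the loop retries anyway and the weak form can be cheaper on some architectures.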

Concurrency: Stack Example 2

    struct stack {
      link* headPtr;
      void push(link* a) {
        link* h;
        do {
          h = headPtr;
          a->next = h;
        } while (!CAS(&headPtr, h, a));
      }
      link* pop() {
        link* h; link* nxt;
        do {
          h = headPtr;
          if (h == NULL) return NULL;
          nxt = h->next;
        } while (!CAS(&headPtr, h, nxt));
        return h;
      }
    };

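A runnable C++11 rendering of this CAS-based stack (a Treiber stack), with `std::atomic` standing in for the slide's `CAS`. This is a sketch: the names are illustrative, and it still suffers from the ABA problem discussed next, which single-threaded use does not expose:

```cpp
#include <atomic>
#include <cstddef>

struct Link { int v; Link* next; };

// Lock-free stack: push and pop retry their CAS until the head has not
// changed between the read and the update.
struct Stack {
  std::atomic<Link*> headPtr{nullptr};
  void push(Link* a) {
    Link* h;
    do {
      h = headPtr.load();
      a->next = h;                          // link new node in front of head
    } while (!headPtr.compare_exchange_weak(h, a));
  }
  Link* pop() {
    Link* h; Link* nxt;
    do {
      h = headPtr.load();
      if (h == nullptr) return nullptr;     // empty stack
      nxt = h->next;
    } while (!headPtr.compare_exchange_weak(h, nxt));
    return h;
  }
};
```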

Concurrency: Stack Example 2, the ABA problem

Consider a stack holding A -> B -> C, with:

    P1: x = s.pop(); y = s.pop(); s.push(x);
    P2: z = s.pop();

interleaved as:

    P2: h = headPtr;
    P2: nxt = h->next;
    P1: everything
    P2: CAS(&headPtr, h, nxt)

P2's CAS succeeds because A is back on top (the same pointer), even though the stack underneath has changed: before, A -> B -> C; after, B -> C, where B was already popped by P1. This can be fixed with a counter and 2CAS, but...

Concurrency: Stack Example 3

    struct link {int v; link* next;};

    struct stack {
      link* headPtr;
      void push(link* a) {
        atomic {
          a->next = headPtr;
          headPtr = a;
        }
      }
      link* pop() {
        atomic {
          link* h = headPtr;
          if (headPtr != NULL)
            headPtr = headPtr->next;
          return h;
        }
      }
    };

Concurrency: Stack Example 3

    void swapTop(stack s) {
      link* x = s.pop();
      link* y = s.pop();
      s.push(x);
      s.push(y);
    }

Queues are trickier than stacks.

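The `atomic { ... }` blocks in Example 3 are not real C++; one way to realize them, roughly what Java's `synchronized` provides (per the lecture notes), is a mutex per stack. A sketch under that assumption (names are illustrative):

```cpp
#include <mutex>

struct Link2 { int v; Link2* next; };

// Mutex-based stack: the lock_guard plays the role of the atomic { }
// blocks, making each push/pop execute without interleaving.
struct LockedStack {
  Link2* headPtr = nullptr;
  std::mutex m;
  void push(Link2* a) {
    std::lock_guard<std::mutex> g(m);
    a->next = headPtr;
    headPtr = a;
  }
  Link2* pop() {
    std::lock_guard<std::mutex> g(m);
    Link2* h = headPtr;
    if (headPtr != nullptr) headPtr = headPtr->next;
    return h;
  }
};
```

Note that even with atomic push and pop, a composite operation like `swapTop` is not atomic: another thread can interleave between its pops and pushes.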

Styles of Parallel Programming

- Data parallelism / bulk synchronous / SPMD
- Nested parallelism: what we covered
- Message passing
- Futures (other pipelined parallelism)
- General concurrency

Futures: Example

    fun quickSort S = let
      fun qs([], rest) = rest
        | qs(a::T, rest) = let
            val L1 = filter (fn b => b < a) T
            val L2 = filter (fn b => b >= a) T
          in qs(L1, a::(qs(L2, rest))) end
      fun filter f [] = []
        | filter f (h::T) =
            if f(h) then h::filter f T
            else filter f T
    in qs(S, []) end

Futures: Example

    fun quickSort S = let
      fun qs([], rest) = rest
        | qs(a::T, rest) = let
            val L1 = filter (fn b => b < a) T
            val L2 = filter (fn b => b >= a) T
          in qs(L1, future(a::(qs(L2, rest)))) end
      fun filter f [] = []
        | filter f (h::T) =
            if f(h) then future(h::filter f T)
            else filter f T
    in qs(S, []) end
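The essence of a future is that the consumer can proceed before the producer has finished, synchronizing only when the value is actually demanded. A tiny C++ sketch of that idea using `std::future` (the names `slow_square` and `demo` are illustrative, not from the slides):

```cpp
#include <future>

int slow_square(int x) { return x * x; }

// The future starts slow_square running; demo can do other work before
// demanding the result with get(), which blocks only if it isn't ready.
int demo() {
  std::future<int> f = std::async(std::launch::async, slow_square, 6);
  // ... other work could overlap with slow_square here ...
  return f.get() + 1;   // demand the value: synchronize with the producer
}
```

In the quicksort above, wrapping the result list in futures creates exactly this pipelining: the consumer of the sorted list can start walking its prefix while the suffix is still being sorted.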

Quicksort: Nested Parallel

- Work = O(n log n)
- Span = O(log^2 n), with parallel partition and append

Quicksort: Futures

- Work = O(n log n)
- Span = O(n)

Styles of Parallel Programming

- Data parallelism / bulk synchronous / SPMD
- Nested parallelism: what we covered
- Message passing
- Futures (other pipelined parallelism)
- General concurrency

Xeon 7500

Question: how do we get nested parallel programs to work well on a fixed set of processors? Programs say nothing about processors.

Answer: good schedulers.

Greedy Schedules

"Speedup versus Efficiency in Parallel Systems", Eager, Zahorjan and Lazowska, 1989.

For any greedy schedule:

- Efficiency = W / (P * T_P) >= W / (W + P * D)
- Parallel time = T_P <= W/P + D

Types of Schedulers

- Breadth-first schedulers
- Work-stealing schedulers
- Depth-first schedulers

Breadth-First Schedules

The most naive schedule; used by most implementations of P-threads. Can create O(n^3) tasks, with bad space usage and bad locality.

Work Stealing

Each processor keeps a local work queue (a deque):
- push new jobs on the new end
- pop jobs from the new end
- if a processor runs out of work, it steals from another queue's old end

Each processor tends to execute a sequential part of the computation, so work stealing tends to schedule sequential blocks of tasks.
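The deque discipline described above can be sketched in a few lines: the owner pushes and pops at the "new" (back) end, while thieves steal from the "old" (front) end. A sequential sketch only, with no synchronization (a real work-stealing deque needs careful concurrent data structures):

```cpp
#include <deque>

// One processor's local work queue. Jobs are represented by ints here.
struct WorkQueue {
  std::deque<int> jobs;
  void push(int job) { jobs.push_back(job); }  // owner: push on the new end
  bool pop(int* job) {                         // owner: pop from the new end
    if (jobs.empty()) return false;
    *job = jobs.back(); jobs.pop_back();
    return true;
  }
  bool steal(int* job) {                       // thief: take from the old end
    if (jobs.empty()) return false;
    *job = jobs.front(); jobs.pop_front();
    return true;
  }
};
```

Because thieves take the oldest jobs, which tend to be the largest unexplored subtrees of the computation, each processor ends up running long sequential stretches, which is what gives work stealing its good locality.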

Work Stealing Theory

For strict computations (Blumofe and Leiserson, 1999):
- Number of steals = O(PD)
- Space = O(P S1), where S1 is the sequential space

Acar, Blelloch and Blumofe, 2003:
- Number of cache misses on distributed caches = M1 + O(CPD), where M1 = sequential misses and C = cache size

Work Stealing Practice

- Used in the Cilk scheduler
  - Small overheads, because the common case of pushing/popping from the local queue can be made fast (with good data structures and compiler help)
  - No contention on a global queue
  - Good distributed-cache behavior
  - Can indeed require O(S1 P) memory
- Used in the X10 scheduler, and others

Parallel Depth-First Schedules (P-DFS)

List scheduling based on a depth-first ordering. For example, a 10-node DAG numbered in depth-first order can run as the 2-processor schedule: 1; 2,6; 3,4; 5,7; 8; 9; 10.

For strict computations, a shared stack implements a P-DFS.

Premature tasks in P-DFS

A running task is premature if there is an earlier task in the sequential (depth-first) order that is not yet complete.

P-DFS Theory

Blelloch, Gibbons, Matias, 1999. For any computation:
- Premature nodes at any time = O(PD)
- Space = S1 + O(PD)

Blelloch and Gibbons, 2004:
- With a shared cache of size C1 + O(PD), we have Mp = M1

P-DFS Practice

- Experimentally uses less memory than work stealing and performs better on a shared cache
- Requires some coarsening to reduce overheads

Conclusions

- Lots of parallel languages
- Lots of parallel programming styles
- High-level ideas cross between languages and styles
- Concurrency and parallelism are different
- Scheduling is important

Speaker notes:

- The Cray-1 had multiple CPUs. The data center on the right is a Google data center located in Denmark. Why Denmark? The climate there is quite cool: placing a data center in Denmark solves the cooling problem of the data center and the heating problem of Denmark. Two birds with one stone.
- The AHU quicksort is mathematical language, with no mention of specific implementation details.
- The Sedgewick quicksort is highly specific to a sequential computer. We overengineered algorithms to fit particular architectures.
- The mathematical formula for the pi approximation makes clear that this is a map and a reduce. Note how the map and reduce are completely lost in the C code: perhaps the most important point is that the sum is now being done sequentially, not via a reduce. As we saw in a prior example, we have sequentialized a perfectly parallel computation. You map each value of x to 4.0/(1.0 + x*x), and then you add all these values up. The num_steps variable controls the accuracy of the approximation.
- OpenMP is a way to recover the lost parallelism from this code by adding OpenMP constructs. Specifically, we say that the for loop is parallel. In addition, we have to tell the compiler that the variables i and x are private, meaning each iteration of the loop has its own copy. All other variables, such as sum, are shared; sum is a reduction with plus, and we tell the compiler so. In OpenMP you can only reduce over a fixed set of operations; you cannot use higher-order functions as we often do in this course.
- Why is concurrency hard? The stack is the most basic algorithm: we teach it to our freshmen in the third week. All of us realize it is not safe to do concurrently.
- The problem with the counter-and-2CAS fix for ABA: 2CAS is not available on most machines, and the counter is assumed to be unbounded.
- The atomic blocks in Stack Example 3 correspond to `synchronized` in Java.