15-210: Parallelism in the Real World
Types of parallelism, Parallel Thinking, Nested Parallelism, Examples (Cilk, OpenMP, Java Fork/Join), Concurrency


  • 15-210: Parallelism in the Real World
    Types of parallelism
    Parallel Thinking
    Nested Parallelism
    Examples (Cilk, OpenMP, Java Fork/Join)
    Concurrency

  • Cray-1 (1976): the world's most expensive love seat

  • Data Center: hundreds of thousands of computers

  • Since 2005: multicore computers

  • Xeon Phi: Knights Landing
    72 cores, x86, 500 GB/s of memory bandwidth
    Summer 2015


  • Moore's Law

  • Moore's Law and Performance

  • Andrew Chien, 2008

  • 64-core blade servers ($6K) (shared memory), x 4

  • 1024 CUDA cores


  • Parallel Hardware
    Many forms of parallelism:
    Supercomputers: large-scale, shared memory
    Clusters and data centers: large-scale, distributed memory
    Multicores: tightly coupled, smaller scale
    GPUs, on-chip vector units
    Instruction-level parallelism
    Parallelism is important in the real world.

  • Key Challenge: Software (How to Write Parallel Code?)
    At a high level, it is a two-step process:
    1. Design a work-efficient, low-span parallel algorithm
    2. Implement it on the target hardware
    In reality, each system required different code because programming systems are immature.
    Huge effort to generate efficient parallel code.
    Example: Quicksort in MPI is 1700 lines of code, and about the same in CUDA.
    Implementing one parallel algorithm: a whole thesis.
    Take 15-418 (Parallel Computer Architecture and Programming).

  • 15-210 Approach
    Enable parallel thinking by raising the abstraction level.
    I. Parallel thinking: applicable to many machine models and programming languages
    II. Reason about correctness and efficiency of algorithms and data structures.

  • Parallel Thinking
    Recognizing true dependences: unteach sequential programming.
    Parallel algorithm-design techniques:
      Operations on aggregates: map/reduce/scan
      Divide and conquer, contraction
      Viewing the computation as a DAG (based on dependences)
    Cost model based on work and span
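    As an aside (not part of the original slides), the "operations on aggregates" style can be written directly with C++17 parallel algorithms; this is one concrete way the map/reduce abstraction maps onto a real library. A minimal sketch (compile with -std=c++17; GCC's libstdc++ typically also needs TBB for parallel execution):

    // Sketch: express a computation as a map followed by a reduce.
    // The dependence structure is exposed by the aggregate operations
    // rather than by a hand-written sequential loop.
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<double> a(1000000, 2.0), b(a.size());

        // map: b[i] = a[i] * a[i]
        std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                       [](double x) { return x * x; });

        // reduce: sum of all b[i]
        double total = std::reduce(std::execution::par, b.begin(), b.end(), 0.0);

        std::printf("sum of squares = %f\n", total);
    }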

  • Quicksort from Aho-Hopcroft-Ullman (1974)

    procedure QUICKSORT(S):
      if S contains at most one element then return S
      else begin
        choose an element a randomly from S;
        let S1, S2 and S3 be the sequences of elements in S
          less than, equal to, and greater than a, respectively;
        return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
      end

  • Quicksort from Sedgewick (2003)

    public void quickSort(int[] a, int left, int right) {
      if (right <= left) return;
      int i = left - 1, j = right;
      for (;;) {
        while (a[++i] < a[right]);
        while (a[right] < a[--j])
          if (j == left) break;
        if (i >= j) break;
        swap(a, i, j);
      }
      swap(a, i, right);
      quickSort(a, left, i - 1);
      quickSort(a, i + 1, right);
    }

  • Styles of Parallel Programming
    Data parallelism / bulk synchronous / SPMD
    Nested parallelism: what we covered
    Message passing
    Futures (other pipelined parallelism)
    General concurrency

  • Nested Parallelism
    Nested parallelism = arbitrary nesting of parallel loops + fork-join.
    Assumes no synchronization among parallel tasks except at join points.
    Deterministic if there are no race conditions.

    Advantages:
      Good schedulers are known
      Easy to understand, debug, and analyze costs
      Purely functional or imperative: either works

  • Serial Parallel DAGs
    Dependence graphs of nested parallel computations are series-parallel.

    Two tasks are parallel if neither is reachable from the other.
    Two parallel tasks are involved in a data race if they access the same location and at least one access is a write.

  • Nested Parallelism: parallel loops
    PASL/C++:               parallel_for(cntr, 0, n, [&] (i) { B[i] = A[i] + 1; })
    Cilk/C++:               cilk_for (i = 0; i < n; ++i) { B[i] = A[i] + 1; }
    Microsoft TPL (C#, F#): Parallel.ForEach(A, x => x + 1);
    Nesl, Parallel Haskell: B = {x + 1 : x in A}
    OpenMP:                 #pragma omp for
                            for (i = 0; i < n; i++) { B[i] = A[i] + 1; }
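    For reference, a complete compilable version of the OpenMP variant above (my sketch, not from the slides; the array names A and B match the slide):

    // Sketch: the slide's parallel loop wrapped in a full program.
    // Assumed invocation: g++ -fopenmp loop.cpp
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1000000;
        std::vector<int> A(n, 1), B(n);

        // Each iteration is independent, so the loop can be split across threads.
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            B[i] = A[i] + 1;
        }

        std::printf("B[0] = %d, B[n-1] = %d\n", B[0], B[n - 1]);
    }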

  • Nested Parallelism: fork-join
    Dates back to the 60s. Used in dialects of Algol, Pascal:
      cobegin { S1; S2; }
    Java fork-join framework:
      coInvoke(f1, f2)
    Microsoft TPL (C#, F#):
      Parallel.Invoke(f1, f2)
    OpenMP (C++, C, Fortran, ...):
      #pragma omp sections
      {
        #pragma omp section
          S1;
        #pragma omp section
          S2;
      }
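    A minimal compilable fork-join using the OpenMP sections shown above (my illustration, not from the slides):

    // Sketch: the two sections may run on different threads;
    // the implicit barrier at the end of the sections block is the "join".
    #include <cstdio>

    static void f1() { std::printf("task 1\n"); }
    static void f2() { std::printf("task 2\n"); }

    int main() {
        #pragma omp parallel sections
        {
            #pragma omp section
            f1();
            #pragma omp section
            f2();
        }
        std::printf("joined\n");
    }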

  • Nested Parallelism: fork-join
    PASL/C++:
      fork2([&] { S1; }, [&] { S2; })

    Cilk/C++:
      cilk_spawn S1;
      S2;
      cilk_sync;

    Various functional languages:
      (exp1 || exp2)

    Dialects of ML and Lisp:
      plet x = exp1
           y = exp2
      in exp3 end

  • Cilk vs. what we've covered
    Fork-join:
      ML:         val (a,b) = par(fn () => f(x), fn () => g(y))
      Pseudocode: (a,b) = (f(x) || g(y))
      Cilk:       cilk_spawn a = f(x); b = g(y); cilk_sync;
    Map:
      ML:         S = map f A
      Pseudocode: S = ⟨ f(x) : x in A ⟩
      Cilk:       cilk_for (int i = 0; i < n; i++) S[i] = f(A[i])

  • Cilk vs. what we've covered
    Tabulate:
      ML:         S = tabulate f n
      Pseudocode: S = ⟨ f(i) : 0 <= i < n ⟩
      Cilk:       cilk_for (int i = 0; i < n; i++) S[i] = f(i)
  • Example Cilk

    int fib(int n) {
      if (n < 2) return n;
      int x = cilk_spawn fib(n - 1);
      int y = fib(n - 2);
      cilk_sync;
      return x + y;
    }
  • Example OpenMP: Numerical Integration

  • The C code for approximating pi
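    The code on this slide did not survive extraction. Based on the speaker notes at the end (midpoint-rule integration of 4/(1+x^2) over [0,1], with num_steps controlling accuracy), a minimal sequential sketch is:

    // Sketch reconstructed from the speaker notes, not the original slide:
    // approximate pi = integral from 0 to 1 of 4/(1+x^2) dx by the midpoint rule.
    #include <cstdio>

    static long num_steps = 100000;   // controls the accuracy of the approximation

    int main() {
        double step = 1.0 / (double)num_steps;
        double sum = 0.0;
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;   // midpoint of strip i
            sum += 4.0 / (1.0 + x * x);    // the "map": evaluate the integrand
        }
        double pi = step * sum;            // the "reduce": accumulate the sum
        std::printf("pi ~= %.10f\n", pi);
    }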

  • The C/OpenMP code for approximating pi
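    Again the slide's code is missing. The speaker notes say the for loop is marked parallel, i and x are made private, and sum becomes a + reduction; a sketch along those lines:

    // Sketch following the speaker notes (not the original slide):
    // parallelize the integration loop with OpenMP.
    #include <cstdio>

    static long num_steps = 100000;

    int main() {
        double step = 1.0 / (double)num_steps;
        double sum = 0.0;
        double x;

        // i is private automatically (loop variable); x is made private so each
        // thread has its own copy; sum is combined with a + reduction at the join.
        #pragma omp parallel for private(x) reduction(+:sum)
        for (long i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }

        std::printf("pi ~= %.10f\n", step * sum);
    }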

  • Example: Java Fork/Join

    class Fib extends FJTask {
      volatile int result; // serves as arg and result
      int n;
      Fib(int _n) { n = _n; }
      public void run() {
        if (n <= 1) result = n;
        else {
          Fib f1 = new Fib(n - 1);
          Fib f2 = new Fib(n - 2);
          coInvoke(f1, f2);
          result = f1.result + f2.result;
        }
      }
    }
  • Cost Model (General)
    Compositional:

    Work: total number of operations
      Costs are added across parallel calls.

    Span: depth / critical path of the computation
      Maximum span is taken across forked calls.

    Parallelism = Work / Span
      Approximately the number of processors that can be used effectively.

  • Combining costs for a parallel for:
      pfor (i = 0; i < n; i++) f(i)
    Work: the sum over the iterations of the work of f(i) (plus the loop overhead)
    Span: the maximum over the iterations of the span of f(i) (plus the loop overhead)
  • Why Work and Span
    Simple measures that give a good sense of efficiency (work) and scalability (span).
    Can schedule in O(W/P + D) time on P processors.
    This is within a constant factor of optimal.

    Goals in designing an algorithm:
      Work should be about the same as the sequential running time. When it matches asymptotically, we say the algorithm is work efficient.
      Parallelism (W/D) should be polynomial. O(n^(1/2)) is probably good enough.
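    For concreteness (illustrative numbers, not from the slides): suppose sorting n = 10^7 keys takes W = n lg n ≈ 2.3 x 10^8 operations with span D ≈ 10^4. On P = 32 processors a greedy schedule needs about W/P + D ≈ 7.2 x 10^6 + 10^4 steps, so the span term is negligible and the speedup is close to 32. The parallelism W/D ≈ 2.3 x 10^4 is far larger than the number of processors, which is exactly what the polynomial-parallelism goal asks for.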

  • How do the problems do on a modern multicore?

    Benchmark               T1/T32    Tseq/T32
    Sort                    31.6      31.6
    Duplicate Removal       24.6      21.6
    Min Spanning Tree       17        11.2
    Max Independ. Set       14.5      10
    Spanning Forest         18.45     9
    Breadth First Search    19.3      14.5
    Delaunay Triang.        20.8      15
    Triangle Ray Inter.     19.3      17
    Nearest Neighbors       12.25     11.7
    Sparse MxV              17.5      17
    Nbody                   21.8      18
    Suffix Array            17        15

  • Styles of Parallel Programming
    Data parallelism / bulk synchronous / SPMD
    Nested parallelism: what we covered
    Message passing
    Futures (other pipelined parallelism)
    General concurrency

  • Parallelism vs. Concurrency
    Parallelism: using multiple processors/cores running at the same time. Serial semantics, parallel execution. Property of the hardware/software platform.
    Concurrency: non-determinacy due to interleaving threads. Concurrent semantics, parallel or serial execution. Property of the application.

                               Concurrency
                               sequential                  concurrent
    Parallelism    serial      Traditional programming     Traditional OS
                   parallel    Deterministic parallelism   General parallelism

  • Concurrency: Stack Example 1

    struct link { int v; link* next; };

    struct stack {
      link* headPtr;
      void push(link* a) {
        a->next = headPtr;
        headPtr = a;
      }
      link* pop() {
        link* h = headPtr;
        if (headPtr != NULL) headPtr = headPtr->next;
        return h;
      }
    };

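    The original deck stepped through an interleaving with a diagram (lost in extraction); the speaker notes make the point that this textbook push/pop is not safe to run concurrently. A small illustration of why (my sketch, not the slide's): two unsynchronized pushes can read the same head and one update is lost.

    // Illustration: two threads racing on the unsynchronized stack above.
    //   Thread 1: a->next = headPtr;   // reads old head H
    //   Thread 2: b->next = headPtr;   // also reads old head H
    //   Thread 1: headPtr = a;         // stack is a -> H
    //   Thread 2: headPtr = b;         // stack is b -> H, a is lost
    // This program has a deliberate data race; the final count is often < 2*N.
    #include <thread>
    #include <cstdio>

    struct link { int v; link* next; };
    struct stack {
        link* headPtr = nullptr;
        void push(link* a) { a->next = headPtr; headPtr = a; }   // racy!
    };

    int main() {
        const int N = 100000;
        stack s;
        auto worker = [&s](int base) {
            for (int i = 0; i < N; i++) s.push(new link{base + i, nullptr});
        };
        std::thread t1(worker, 0), t2(worker, N);
        t1.join(); t2.join();

        int count = 0;
        for (link* p = s.headPtr; p != nullptr; p = p->next) count++;
        std::printf("pushed %d, found %d\n", 2 * N, count);
    }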

  • CAS

    bool CAS(ptr* addr, ptr a, ptr b) {
      atomic {
        if (*addr == a) {
          *addr = b;
          return 1;
        } else
          return 0;
      }
    }

    A built-in instruction on most processors:
      CMPXCHG8B: 8-byte version for x86
      CMPXCHG16B: 16-byte version
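    In portable C++ the same primitive is exposed as compare_exchange on std::atomic (an aside, not from the slides):

    // Sketch: the CAS pseudocode above corresponds to compare_exchange_strong.
    // It atomically checks that x still holds `expected` and, if so, writes
    // `desired`; otherwise it loads the current value back into `expected`.
    #include <atomic>
    #include <cstdio>

    int main() {
        std::atomic<int> x{5};
        int expected = 5;
        bool ok = x.compare_exchange_strong(expected, 7);   // CAS(&x, 5, 7)
        std::printf("ok=%d x=%d\n", ok, x.load());          // ok=1 x=7

        expected = 5;                                       // stale expectation
        ok = x.compare_exchange_strong(expected, 9);        // fails, expected := 7
        std::printf("ok=%d expected=%d\n", ok, expected);   // ok=0 expected=7
    }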

  • Concurrency: Stack Example 2

    struct stack {
      link* headPtr;
      void push(link* a) {
        link* h;
        do {
          h = headPtr;
          a->next = h;
        } while (!CAS(&headPtr, h, a));
      }
      link* pop() {
        link* h;
        link* nxt;
        do {
          h = headPtr;
          if (h == NULL) return NULL;
          nxt = h->next;
        } while (!CAS(&headPtr, h, nxt));
        return h;
      }
    };


  • Concurrency: Stack Example 2 (the ABA problem)

    P1: x = s.pop(); y = s.pop(); s.push(x);
    P2: z = s.pop();

    Before: A B C    After: B C

    Problematic interleaving:
      P2: h = headPtr;
      P2: nxt = h->next;
      P1: everything
      P2: CAS(&headPtr, h, nxt)

    Can be fixed with a counter and 2CAS, but ...
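    The fix the slide alludes to, a version counter updated together with the pointer via a double-width CAS, can be sketched as follows. This is my illustration of the idea, not code from the lecture, and as the speaker notes point out the counter is bounded in practice and 2CAS is not available everywhere.

    // Sketch: pair the head pointer with a counter so a recycled pointer value
    // no longer looks "unchanged" to CAS. On x86-64 this 16-byte
    // compare_exchange can map to CMPXCHG16B (compile with -mcx16);
    // otherwise it may fall back to a lock.
    #include <atomic>
    #include <cstdint>

    struct Link { int v; Link* next; };

    struct TaggedHead {
        Link*    ptr;
        uint64_t count;   // incremented on every successful update
    };

    struct Stack {
        std::atomic<TaggedHead> head{TaggedHead{nullptr, 0}};

        void push(Link* a) {
            TaggedHead h = head.load();
            do {
                a->next = h.ptr;
            } while (!head.compare_exchange_weak(h, TaggedHead{a, h.count + 1}));
        }

        Link* pop() {
            TaggedHead h = head.load();
            do {
                if (h.ptr == nullptr) return nullptr;
            } while (!head.compare_exchange_weak(h, TaggedHead{h.ptr->next, h.count + 1}));
            return h.ptr;
        }
    };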

  • Concurrency: Stack Example 3

    struct link { int v; link* next; };

    struct stack {
      link* headPtr;
      void push(link* a) {
        atomic {
          a->next = headPtr;
          headPtr = a;
        }
      }
      link* pop() {
        atomic {
          link* h = headPtr;
          if (headPtr != NULL) headPtr = headPtr->next;
          return h;
        }
      }
    };

  • Concurrency: Stack Example 3

    void swapTop(stack s) {
      link* x = s.pop();
      link* y = s.pop();
      s.push(x);
      s.push(y);
    }

    Queues are trickier than stacks.

  • Styles of Parallel Programming
    Data parallelism / bulk synchronous / SPMD
    Nested parallelism: what we covered
    Message passing
    Futures (other pipelined parallelism)
    General concurrency

  • Futures: Example

    fun quickSort S =
    let
      fun filter(f, []) = []
        | filter(f, h::T) =
          if f(h) then h::filter(f, T)
          else filter(f, T)
      fun qs([], rest) = rest
        | qs(a::T, rest) =
          let
            val L1 = filter(fn b => b < a, T)
            val L2 = filter(fn b => b >= a, T)
          in
            qs(L1, a::qs(L2, rest))
          end
    in
      qs(S, [])
    end

  • Futures: Example

    fun quickSort S =
    let
      fun filter(f, []) = []
        | filter(f, h::T) =
          if f(h) then future(h::filter(f, T))
          else filter(f, T)
      fun qs([], rest) = rest
        | qs(a::T, rest) =
          let
            val L1 = filter(fn b => b < a, T)
            val L2 = filter(fn b => b >= a, T)
          in
            qs(L1, future(a::qs(L2, rest)))
          end
    in
      qs(S, [])
    end

  • Quicksort: Nested Parallel
    Parallel partition and append
    Work = O(n log n)
    Span = O(lg² n)

  • Quicksort: Futures
    Work = O(n log n)
    Span = O(n)

  • Styles of Parallel Programming
    Data parallelism / bulk synchronous / SPMD
    Nested parallelism: what we covered
    Message passing
    Futures (other pipelined parallelism)
    General concurrency

  • Xeon 7500

  • Question
    How do we get nested parallel programs to work well on a fixed set of processors? Programs say nothing about processors.

    Answer: good schedulers.

  • Greedy Schedules
    "Speedup versus Efficiency in Parallel Systems", Eager, Zahorjan and Lazowska, 1989

    For any greedy schedule:

      Parallel time:  T_P <= W/P + D

      Efficiency:     W / (P * T_P) >= W / (W + P*D)

  • Types of Schedulers
    Breadth-first schedulers
    Work-stealing schedulers
    Depth-first schedulers

  • Breadth-First Schedules
    Most naive schedule. Used by most implementations of P-threads.
    O(n³) tasks
    Bad space usage, bad locality

  • Work Stealing
    Each processor keeps a local work queue:
      push new jobs on the "new" end
      pop jobs from the "new" end
      if a processor runs out of work, it steals from the "old" end of another processor's queue
    Each processor tends to execute a sequential part of the computation.
    [Diagram: P1 ... P4 with local work queues, "old" and "new" ends]
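    A deliberately simplified, lock-based sketch of the queue discipline described above (my illustration; real Cilk-style deques are lock-free and far more careful):

    // Sketch: per-worker deques where the owner pushes/pops at the "new"
    // (back) end and thieves steal from the "old" (front) end. A mutex per
    // deque keeps the sketch simple; production work stealers avoid locks
    // on the owner's fast path.
    #include <deque>
    #include <mutex>
    #include <optional>
    #include <functional>

    using Task = std::function<void()>;

    struct WorkQueue {
        std::deque<Task> tasks;
        std::mutex m;

        void push(Task t) {                     // owner: new end
            std::lock_guard<std::mutex> g(m);
            tasks.push_back(std::move(t));
        }
        std::optional<Task> pop() {             // owner: new end
            std::lock_guard<std::mutex> g(m);
            if (tasks.empty()) return std::nullopt;
            Task t = std::move(tasks.back());
            tasks.pop_back();
            return t;
        }
        std::optional<Task> steal() {           // thief: old end
            std::lock_guard<std::mutex> g(m);
            if (tasks.empty()) return std::nullopt;
            Task t = std::move(tasks.front());
            tasks.pop_front();
            return t;
        }
    };

    // A worker loop pops from its own queue and, when that fails, tries to
    // steal from the other queues.
    std::optional<Task> find_work(WorkQueue* qs, size_t n, size_t self) {
        if (auto t = qs[self].pop()) return t;
        for (size_t i = 0; i < n; ++i)
            if (i != self)
                if (auto t = qs[i].steal()) return t;
        return std::nullopt;
    }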

  • Work Stealing
    Tends to schedule sequential blocks of tasks.
    [Diagram: DAG with stolen subcomputations marked "= steal"]

  • Work Stealing Theory
    For strict computations (Blumofe and Leiserson, 1999):
      # of steals = O(P D)
      Space = O(P S1), where S1 is the sequential space
    Acar, Blelloch and Blumofe, 2003:
      # of cache misses on distributed caches = M1 + O(C P D),
      where M1 = sequential misses and C = cache size

  • Work Stealing Practice
    Used in the Cilk scheduler:
      Small overheads because the common case of pushing/popping from the local queue can be made fast (with good data structures and compiler help).
      No contention on a global queue
      Has good distributed-cache behavior
      Can indeed require O(S1 P) memory
    Used in the X10 scheduler, and others

  • Parallel Depth-First Schedules (P-DFS)
    List scheduling based on a depth-first ordering.
    [Diagram: a 10-node DAG; the 2-processor schedule runs 1; then 2,6; then 3,4; then 5,7; then 8; 9; 10]
    For strict computations, a shared stack implements a P-DFS.

  • Premature task in P-DFS
    A running task is premature if there is an earlier sequential task that is not complete.
    [Diagram: the same 10-node DAG and 2-processor schedule, with premature nodes marked]

  • P-DFS Theory
    Blelloch, Gibbons, Matias, 1999. For any computation:
      Premature nodes at any time = O(P D)
      Space = S1 + O(P D)
    Blelloch and Gibbons, 2004:
      With a shared cache of size C1 + O(P D), we have Mp = M1.

  • P-DFS Practice
    Experimentally uses less memory than work stealing and performs better on a shared cache.
    Requires some coarsening to reduce overheads.

  • Conclusions
    Lots of parallel languages
    Lots of parallel programming styles
    High-level ideas cross between languages and styles
    Concurrency and parallelism are different
    Scheduling is important

    Speaker notes:

    The Cray-1 had multiple CPUs.

    The data center on the right is a Google data center located in Denmark. Why Denmark? The climate there is quite cool: placing a data center in Denmark solves the cooling problem of the data center and the heating problem of Denmark. Two birds with one stone.

    [AHU quicksort] Mathematical language; no mention of specific implementation details.

    [Sedgewick quicksort] Highly specific to a sequential computer. We over-engineered algorithms to fit particular architectures.

    The mathematical formula makes clear that this is a map and a reduce. Note how the map and reduce are completely lost in the C code. Perhaps the most important point is that the sum is now being done sequentially, not via a reduce. As we saw in a prior example, we have sequentialized a perfectly parallel computation.

    This is really a map and a reduce: you map each value of x to 4/(1.0 + x*x) and then you add all these values up. The num_steps here controls the accuracy of the approximation.

    OpenMP is a way to recover the lost parallelism from this code. We do this by adding OpenMP constructs. Specifically, we say that the for loop is parallel. In addition, we have to tell the compiler that the variables i and x are private, which means each iteration of the loop has its own copy. All the other variables, such as sum, are shared; sum is a reduction with a plus, and we tell the compiler so. In OpenMP you can only reduce over a fixed set of operations; you cannot use higher-order functions as we often do in this course. In summary, we have ...

    Why is concurrency hard? This stack is the most basic algorithm: we teach it to our freshmen in the third week. All of us realize it is not safe to do concurrently.

    Problem with the counter and 2CAS fix: 2CAS is not available on most machines, and the counter is assumed to be unbounded.

    [Stack Example 3] "atomic" here corresponds to synchronized in Java.