Analysis of Cilk

Page 1: Analysis of Cilk

Analysis of Cilk

Page 2: Analysis of Cilk

A Formal Model for Cilk

A thread: maximal sequence of instructions in a procedure instance (at runtime!) not containing spawn, sync, return

For a given computation, define a dag:

• Threads are vertices
• Continuation edges within procedures
• Spawn edges
• Initial & final threads (in main)
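To make the dag concrete, here is the usual Cilk fib procedure (a sketch, not taken from these slides) with comments marking where one thread ends and the next begins; continuation edges link threads 1 → 2 → 3 → 4 within this instance, and each spawn adds a spawn edge to the child's initial thread:

cilk int fib(int n) {
    int x, y;
    if (n < 2) return n;
    x = spawn fib(n - 1);    /* thread 1 (test + call setup) ends at this spawn */
    y = spawn fib(n - 2);    /* thread 2 (between the two spawns) ends here */
    sync;                    /* thread 3 (between the 2nd spawn and the sync) ends here */
    return x + y;            /* thread 4: add the results; final thread of this instance */
}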

Page 3: Analysis of Cilk

Work & Critical Path

Threads are sequential: work = running time

Define TP = running time on P processors

Then T1 = work in the computation

And T∞ = critical-path length, longest path in the dag

Page 4: Analysis of Cilk

Lower Bounds on TP

TP ≥ T1 / P (no miracles in the model, but they do happen occasionally)

TP ≥ T∞ (dependencies limit parallelism)

Speedup is T1 / TP

(Asymptotic) linear speedup means Θ(P)

Parallelism is T1 / T∞ (average work available at every step along critical path)
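A quick worked example with hypothetical numbers (not from the slides) of how these quantities interact:

T1 = 10^8, T∞ = 10^5  ⇒  parallelism T1 / T∞ = 10^3

P = 100:   TP ≥ max(T1 / P, T∞) = 10^6, so the best possible speedup is 100 (linear)
P = 10^4:  TP ≥ T∞ = 10^5, so the speedup can never exceed T1 / T∞ = 10^3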

Page 5: Analysis of Cilk

Greedy Schedulers

• Execute at most P threads in every step
• Choice of threads is arbitrary (greedy)
• At most T1 / P complete steps (everybody busy)
• At most T∞ incomplete steps:
  • All ready threads (in-degree = 0) are executed
  • Therefore, the critical-path length is reduced by 1

Theorem: TP ≤ T1 / P + T∞

Linear speedup when P = O(T1 / T∞)
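Combining the two step counts gives the theorem (a one-line proof sketch of the argument above):

TP ≤ (number of complete steps) + (number of incomplete steps) ≤ T1 / P + T∞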

Page 6: Analysis of Cilk

Cilk Guarantees:

TP ≤ T1 / P + O(T∞) expected running time

Randomized greedy scheduler

The developers claim that TP ≈ T1 / P + T∞ in practice

This implies near-perfect speedup when P << T1 / T∞

Page 7: Analysis of Cilk

But Which Greedy Schedule to Choose?

Busy-leaves property: every living leaf in the dag has some processor working on it

Busy leaves controls space consumption; one can show SP = O(P S1) (at most P leaves are live at a time, and each leaf-to-root path uses at most S1 space)

Without busy leaves, worst case is SP = Θ(T1), not hard to reach

A processor that spawns a procedure executes it immediately; but another processor may steal the caller & execute it

Page 8: Analysis of Cilk

Work Stealing

Idle processors search for work on other processors at random

When a busy victim is found, the thief steals the topmost activation frame on the victim's stack (in practice, the stack is maintained as a deque)

Why work stealing?

• Almost no overhead when everybody is busy
• Most of the overhead is incurred by thieves
• Work-first principle: little impact on the T1 / P term
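A minimal sketch in plain C (the names, the single mutex, and the fixed capacity are all illustrative; the real Cilk runtime avoids locking on the owner's common path) of such a per-worker deque: the owner pushes and pops activation frames at the bottom, and thieves steal the oldest frame from the top:

#include <pthread.h>

typedef struct frame frame_t;            /* an activation frame (opaque here) */

typedef struct {
    frame_t *slots[1024];                /* no overflow check in this sketch */
    int top, bottom;                     /* thieves take from top, owner works at bottom */
    pthread_mutex_t lock;
} deque_t;

void push_bottom(deque_t *d, frame_t *f) {       /* owner: on spawn */
    pthread_mutex_lock(&d->lock);
    d->slots[d->bottom++] = f;
    pthread_mutex_unlock(&d->lock);
}

frame_t *pop_bottom(deque_t *d) {                /* owner: resume the most recent frame */
    frame_t *f = NULL;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) f = d->slots[--d->bottom];
    pthread_mutex_unlock(&d->lock);
    return f;
}

frame_t *steal_top(deque_t *d) {                 /* thief: steal the oldest frame */
    frame_t *f = NULL;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) f = d->slots[d->top++];
    pthread_mutex_unlock(&d->lock);
    return f;
}

Stealing the oldest frame tends to move a large chunk of work at once, which is one reason steals stay rare when the computation has enough parallelism.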

Page 9: Analysis of Cilk

Some Overhead Remains

To achieve portability, the spawn stack is maintained as an explicit data structure

Thieves steal from this deque

Stealing directly from the call stack could be implemented, but it is more complex and nonportable

Main consequence:

• Spawns are more expensive than function calls
• Must do significant work at the bottom of recursions to hide this overhead
• The same is true for normal function calls, but to a lesser extent
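For example, a common way to hide spawn overhead at the bottom of a recursion is to switch to a serial base case below a cutoff; a sketch in Cilk-5 syntax (CUTOFF and do_serial_work are hypothetical placeholders):

#define CUTOFF 64                        /* hypothetical tuning constant */

void do_serial_work(int *a, int n);      /* hypothetical plain-C routine: no spawn overhead */

cilk void recurse(int *a, int n) {
    if (n <= CUTOFF) {
        do_serial_work(a, n);            /* serial base case */
        return;
    }
    spawn recurse(a, n / 2);
    spawn recurse(a + n / 2, n - n / 2);
    sync;
}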

Page 10: Analysis of Cilk

More Cilk Features

Page 11: Analysis of Cilk

Inlets

x += spawn fib(n-1) is equivalent to using an explicit inlet:

cilk int fib(int n) {
    int x = 0;
    inlet void summer(int result) {
        x += result;
    }
    if (n < 2) return n;
    summer( spawn fib(n-1) );
    summer( spawn fib(n-2) );
    sync;
    return x;
}

Page 12: Analysis of Cilk

Inlet Semantics

Inlet: an inner function that is called when a child completes its execution

An inlet is a thread in a procedure (so spawn and sync are not allowed inside it)

All the threads of a procedure instance are executed atomically with respect to one another (i.e., not concurrently)

Easy to reason about correctness

x += spawn fib(n-1) is an implicit inlet

Page 13: Analysis of Cilk

Aborting Work

An abort statement in an inlet aborts already-spawned children of a procedure

Useful for aborting speculative searches

Semantics:

• Children may not abort instantly
• Aborted children do not return values, so don't rely on those values (e.g., the x in x = spawn fib(n-1) may never be written)
• abort does not prevent future spawns; be careful with sequences of spawns
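A sketch of a speculative search that uses abort (the problem, searching an array for a key, and all the names are illustrative, not from the slides); note that, as warned above, the abort does not prevent the second spawn from being launched:

cilk int search(int *a, int n, int key) {
    int found = 0;
    inlet void catcher(int result) {
        if (result) {
            found = 1;
            abort;                   /* abort any child that is still running */
        }
    }
    if (n == 1) return (a[0] == key);
    catcher( spawn search(a, n / 2, key) );
    catcher( spawn search(a + n / 2, n - n / 2, key) );  /* may still be spawned after an abort */
    sync;
    return found;
}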

Page 14: Analysis of Cilk

The SYNCHED Built-In Variable

True only if no children are currently executing

False if some children may be executing now

Useful for avoiding space and work overheads that exist only to shorten the critical path, when there is no need for them (no children are running)
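A sketch of the typical buffer-reuse pattern (the names and the pattern itself are illustrative): pay for a fresh scratch buffer only when children might still be using the old one:

#include <stdlib.h>

cilk void work(double *buf, int n);              /* hypothetical child procedure */

cilk void step(double *scratch, int n) {
    double *buf;
    if (SYNCHED)
        buf = scratch;                           /* no child can be running: safe to reuse */
    else
        buf = malloc(n * sizeof(double));        /* children may still touch scratch: allocate */
    spawn work(buf, n);
    sync;
    if (buf != scratch) free(buf);
}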

Page 15: Analysis of Cilk

Cilk’s Memory Model

Memory operations of two threads are guaranteed to be ordered only if there is a dependence path between them (ancestor-descendant relationship)

Unordered threads may see inconsistent views of memory

Page 16: Analysis of Cilk

Locks

Mutual-exclusion variables

Memory operations that a thread performs before releasing a lock are seen by other threads after they acquire the lock

Using locks invalidates all the performance guarantees that Cilk provides

In short, Cilk supports locks but don’t use them unless you must
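For completeness, a sketch that assumes the Cilk-5 lock API (Cilk_lockvar, Cilk_lock_init, Cilk_lock, Cilk_unlock; check the manual for the exact names); per the warning above, using the lock voids the performance guarantees:

Cilk_lockvar counter_lock;               /* must be initialized before use */
int counter = 0;

cilk void bump(void) {
    Cilk_lock(counter_lock);             /* acquire */
    counter++;                           /* updates in the critical section are ordered */
    Cilk_unlock(counter_lock);           /* release: later acquirers see the update */
}

cilk int example(void) {                 /* illustrative driver */
    Cilk_lock_init(counter_lock);
    spawn bump();
    spawn bump();
    sync;
    return counter;                      /* == 2 */
}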

Page 17: Analysis of Cilk

Useful but Obsolete

Cilk as a library

• Can call Cilk procedures from C, C++, Fortran
• Necessary for building general-purpose C libraries

Cilk on clusters with distributed memory

• The programmer sees the same shared-memory model
• Used an interesting memory-consistency protocol to support the shared-memory view
• Was performance ever good enough?

Page 18: Analysis of Cilk

Some Open Problems

Perhaps good enough for a thesis

Page 19: Analysis of Cilk

Open Issues in Cilk

Theoretical question about the distributed-memory version: is performance monotone in the size of local caches?

Cilk as a library: resurrect it

Distributed-memory version: resurrect it; is it fast enough? Can you make it faster?

Page 20: Analysis of Cilk

Parallel Merge Sort in Cilk

Page 21: Analysis of Cilk

Parallel Merge Sort

merge_sort(A, n)
    if (n == 1) return
    spawn merge_sort(A, n/2)
    spawn merge_sort(A + n/2, n - n/2)
    sync
    merge(A, n/2, n - n/2)

Page 22: Analysis of Cilk

Can’t Merge In Place!

merge_sort(A, T, n, TorA)                       // TorA selects whether the sorted result ends up in A or in T
    if (n == 1) { T[0] = A[0]; return }
    spawn merge_sort(A,       T,       n/2,     !TorA)
    spawn merge_sort(A + n/2, T + n/2, n - n/2, !TorA)
    sync
    if (TorA == A) merge(A, T, n/2, n - n/2)    // the halves are in T; merge them into A
    if (TorA == T) merge(T, A, n/2, n - n/2)    // the halves are in A; merge them into T

Page 23: Analysis of Cilk

Analysis

Merging uses two pointers; repeatedly move the smaller element into the sorted output array

T1(n) = 2T1(n/2) + Θ(n)

T1(n) = Θ(n log n)

We fill the output element by element, so the merge itself adds Θ(n) to the critical path:

T∞(n) = T∞(n/2) + Θ(n)

T∞(n) = Θ(n)

Not very parallel . . .
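For reference, a sketch in plain C of the sequential two-pointer merge described above, using the merge(destination, source, n1, n2) convention of the out-of-place version: source holds two sorted runs of lengths n1 and n2 back to back, and destination receives the merged result:

void merge(int *dest, const int *src, int n1, int n2) {
    const int *a = src, *b = src + n1;           /* the two sorted runs */
    int i = 0, j = 0, k = 0;
    while (i < n1 && j < n2)                     /* move the smaller element */
        dest[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n1) dest[k++] = a[i++];           /* copy whatever remains */
    while (j < n2) dest[k++] = b[j++];
}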

Page 24: Analysis of Cilk

Parallel Merging

p_merge(A, n, B, m, C)                           // C is the output
    swap A and B (and n and m) if A is the smaller array, so that n ≥ m
    if (n + m == 1) { C[0] = A[0]; return }
    if (n == 1) { /* implies m == 1 */ merge the two elements; return }
    locate A[n/2] between B[j] and B[j+1] by binary search
    spawn p_merge(A,       n/2,     B,     j,     C)
    spawn p_merge(A + n/2, n - n/2, B + j, m - j, C + n/2 + j)
    sync
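A more explicit sketch of the same procedure in Cilk-5 syntax (int keys assumed; this is a reconstruction of the pseudocode above, not the course's reference code):

cilk void p_merge(int *A, int n, int *B, int m, int *C) {
    int *tp; int ts;
    int lo, hi, mid, j;

    if (n < m) {                                 /* make A the larger array */
        tp = A; A = B; B = tp;
        ts = n; n = m; m = ts;
    }
    if (n + m == 1) { C[0] = A[0]; return; }
    if (n == 1) {                                /* n >= m and n+m >= 2, so m == 1 */
        if (A[0] <= B[0]) { C[0] = A[0]; C[1] = B[0]; }
        else              { C[0] = B[0]; C[1] = A[0]; }
        return;
    }
    lo = 0; hi = m;                              /* binary search: smallest j with A[n/2] <= B[j] */
    while (lo < hi) {
        mid = (lo + hi) / 2;
        if (B[mid] < A[n/2]) lo = mid + 1; else hi = mid;
    }
    j = lo;

    spawn p_merge(A, n/2, B, j, C);
    spawn p_merge(A + n/2, n - n/2, B + j, m - j, C + n/2 + j);
    sync;
}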

Page 25: Analysis of Cilk

Analysis of Parallel Merging

When we merge n elements, both recursive calls merge at most 3n/4 elements

T∞(n) ≤ T∞(3n/4) + Θ(log n), so T∞(n) = Θ(log² n)

Critical path is short! But the analysis of the work is more complex (extra work due to the binary searches)

T1(n) = T1(αn) + T1((1-α)n) + Θ(log n), where ¼ ≤ α ≤ ¾

T1(n) = Θ(n) using substitution (nontrivial)

Critical path for parallel merge sort: T∞(n) = T∞(n/2) + Θ(log² n) = Θ(log³ n)

Page 26: Analysis of Cilk

Analysis of Parallel Merge Sort

T∞(n) = T∞(n/2) + Θ(log² n) = Θ(log³ n)

T1(n) = Θ(n) for the parallel merge, so the sort's total work remains Θ(n log n)

Impact of the extra work in practice?

Can find the median of 2 sorted arrays of total size n in Θ(log n) time; this leads to parallel merging and merge-sorting with shorter critical paths


Parallelizing an algorithm can be nontrivial!

Page 28: Analysis of Cilk

Cache-Efficient Sorting

Page 29: Analysis of Cilk

Caches

Store recently-used data

Not really LRU:

• Usually 1-, 2-, or 4-way set associative
• But up to 128-way set associative

Data transferred in blocks called cache lines

Write through or write back

Temporal locality: use same data again soon

Spatial locality: use nearby data soon

Page 30: Analysis of Cilk

Cache Misses in Merge Sort

Assume cache-line size = 1, LRU, write back

Assume the cache holds M words

When n ≤ M/2, exactly n read misses (plus the corresponding write-backs)

When n > M, at least n cache misses (cover all cases in the proof!)

Therefore, the number of cache misses is Θ(n log(n/M)) = Θ(n (log n − log M))

We can do much better

Page 31: Analysis of Cilk

The Key Idea

Merge M/2 sorted runs into one, not 2 into 1

Keep one element from each run in a heap, together with a run label

Extract the min, move it to the sorted output, and insert another element from the same run into the heap

Reading from the sorted runs & writing the sorted output may evict elements of the heap from the cache, but this cost is O(n) cache misses

Θ(n log_M n) = Θ(n log n / log M) cache misses in total
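A sequential sketch in plain C (int keys, in-memory runs, and a fixed maximum number of runs are assumptions) of the k-way merge at the heart of this idea: keep one element per run in a min-heap tagged with its run number, repeatedly extract the minimum, and refill from the same run:

#include <stddef.h>

typedef struct { int key; int run; } heap_item;

static void sift_down(heap_item *h, size_t size, size_t i) {
    for (;;) {
        size_t l = 2*i + 1, r = 2*i + 2, min = i;
        if (l < size && h[l].key < h[min].key) min = l;
        if (r < size && h[r].key < h[min].key) min = r;
        if (min == i) return;
        heap_item t = h[i]; h[i] = h[min]; h[min] = t;
        i = min;
    }
}

/* runs[r] points to run r, which holds len[r] sorted elements; out receives them all. */
void kway_merge(int **runs, const size_t *len, size_t k, int *out) {
    heap_item heap[4096];                        /* assumes k <= 4096 runs (k = M/2 fits in cache) */
    size_t pos[4096];                            /* next unread index in each run */
    size_t size = 0, o = 0, i;

    for (i = 0; i < k; i++) {                    /* one element from each nonempty run */
        pos[i] = 0;
        if (len[i] > 0) {
            heap[size].key = runs[i][0];
            heap[size].run = (int)i;
            size++;
            pos[i] = 1;
        }
    }
    for (i = size; i-- > 0; ) sift_down(heap, size, i);    /* heapify */

    while (size > 0) {
        heap_item m = heap[0];
        size_t r = (size_t)m.run;
        out[o++] = m.key;                        /* extract the min into the sorted output */
        if (pos[r] < len[r]) {                   /* refill from the same run */
            heap[0].key = runs[r][pos[r]++];
            heap[0].run = (int)r;
        } else {                                 /* run exhausted: shrink the heap */
            heap[0] = heap[--size];
        }
        if (size > 0) sift_down(heap, size, 0);
    }
}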

Page 32: Analysis of Cilk

This is Poly-Merge Sort

Optimal in terms of cache misses

Can adapt to long cache lines, sorting on disks, etc.

Originally invented for sorting on tapes, on a machine with several tape drives

Often, Θ(n log n / log M) is really Θ(n) in practice

Example:

• 32 KB cache, 4+4-byte elements
• 4192-way merges
• Can sort 64 MB of data in 1 merge, 256 GB in 2 merges
• But more merges with long cache lines

Page 33: Analysis of Cilk

From Quick Sort To Sample Sort

Normal quick sort incurs the same number of cache misses as normal (two-way) merge sort

Key idea:

• Choose a large random sample, Θ(M) elements
• Sort the sample
• Classify all the elements using binary searches
• Determine the size of each interval
• Partition
• Recursively sort the intervals

The cache-miss count is probably similar to merge sort's
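A sequential sketch in plain C (int keys assumed; rand() and qsort stand in for better choices) of the sampling and classification steps:

#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* interval[i] receives the index of the sample interval that a[i] falls into. */
void classify(const int *a, size_t n, size_t sample_size, int *interval) {
    int *sample = malloc(sample_size * sizeof(int));
    size_t i;

    for (i = 0; i < sample_size; i++)            /* large random sample of the input */
        sample[i] = a[rand() % n];
    qsort(sample, sample_size, sizeof(int), cmp_int);

    for (i = 0; i < n; i++) {                    /* binary search per element */
        size_t lo = 0, hi = sample_size;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (sample[mid] <= a[i]) lo = mid + 1; else hi = mid;
        }
        interval[i] = (int)lo;                   /* intervals 0 .. sample_size */
    }
    free(sample);
}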

Page 34: Analysis of Cilk

Distributed-Memory Sorting

Page 35: Analysis of Cilk

Issues in Sample Sort

Main idea: Partition input into P intervals, classify elements, send elements in ith interval to processor i, sort locally

Most of the communication in one global all-to-all phase

Load balancing: Intervals must be similar in size

How do we sort the sample?

Page 36: Analysis of Cilk

Balancing the Load

Select a random sample of sP elements (OK even if every processor selects s of them)

The probability of an interval larger than cn/P grows linearly with n and shrinks exponentially with s; a large s virtually ensures uniform intervals

Example:

• n = 10^9, s = 256
• Pr[max interval > 2n/P] < 10^-8

Page 37: Analysis of Cilk

Sorting the Sample

Can’t do it recursively!

Sending the sample to one processor:

• Θ(sP + n/P) communication at that processor
• Θ(sP log(sP) + (n/P) log(n/P)) work
• Not scalable, but OK for small P

Or: use a different algorithm that works well for small n/P, e.g., radix sort

Page 38: Analysis of Cilk

Distributed-Memory Radix Sort

Sort blocks of r bits from least significant to most significant; use a stable sort

Counting sort of one block:

• Sequentially: count occurrences using a 2^r-entry array, compute prefix sums, and use them as pointers
• In parallel: every processor counts occurrences, then pipelined parallel prefix sums, then send each element to its destination processor

Θ((b/r)(2^r + n/P)) work & communication per processor, where b is the number of bits per key
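A sequential sketch in plain C (unsigned 32-bit keys assumed) of one counting-sort pass over an r-bit block, the building block that the parallel version distributes across processors:

#include <stdlib.h>

/* Stably sort n keys by the r-bit digit starting at bit `shift`, from in[] to out[]. */
void counting_pass(const unsigned *in, unsigned *out, size_t n,
                   unsigned shift, unsigned r) {
    size_t buckets = (size_t)1 << r;
    size_t *count = calloc(buckets, sizeof(size_t));
    unsigned mask = (unsigned)(buckets - 1);
    size_t i, sum = 0;

    for (i = 0; i < n; i++)                      /* count occurrences */
        count[(in[i] >> shift) & mask]++;
    for (i = 0; i < buckets; i++) {              /* exclusive prefix sums = output pointers */
        size_t c = count[i];
        count[i] = sum;
        sum += c;
    }
    for (i = 0; i < n; i++)                      /* scatter, preserving input order */
        out[count[(in[i] >> shift) & mask]++] = in[i];

    free(count);
}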

Page 39: Analysis of Cilk

Odds and Ends

Page 40: Analysis of Cilk

CPU Utilization Issues

Avoid conditionals

• In sorting algorithms, the compiler & processor cannot predict the outcome of comparisons, so the pipeline stalls
• Example: partitioning in qsort without conditionals (see the sketch below)

Avoid stalling the pipeline

• Even without conditionals, using a computed value too soon stalls the pipeline
• Same example
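A sketch in plain C of one branch-free way to partition (illustrative, not necessarily the trick the course has in mind): turn each comparison into an arithmetic value, write the element to both output arrays, and advance only one index, so the loop contains no data-dependent branch:

#include <stddef.h>

/* low[] and high[] must each have room for n elements; only the first *nlow / *nhigh matter. */
void partition_branchless(const int *in, size_t n, int pivot,
                          int *low, size_t *nlow, int *high, size_t *nhigh) {
    size_t nl = 0, nh = 0, i;
    for (i = 0; i < n; i++) {
        int v = in[i];
        size_t is_high = (size_t)(v > pivot);    /* 0 or 1: a value, not a branch */
        low[nl]  = v;                            /* write to both sides ...        */
        high[nh] = v;
        nl += 1 - is_high;                       /* ... but advance only one index */
        nh += is_high;
    }
    *nlow = nl;
    *nhigh = nh;
}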

Page 41: Analysis of Cilk

Dirty Tricks

Exploit uniform input distributions approximately (quick sort, radix sort)

• Fix mistakes by bubbling

To avoid conditionals when fixing mistakes:

• Split the array into small blocks
• Use and's to check for mistakes in a block
• Fix a block only if it contains mistakes
• Some conditionals remain, but not many
• Compute the probability of mistakes to optimize

Page 42: Analysis of Cilk

The Exercise

Get sort.cilk; more instructions inside

Convert the sequential merge sort into a parallel merge sort

• Make it as fast as possible, as long as it is a parallel merge sort (e.g., make the bottom of the recursion fast)

Convert the fast sort into the fastest parallel sort you can

Submit the files in your home directory, plus one page with the output on 1 and 2 processors and a possible explanation (on one side of the page)