
Comparing and Optimising Parallel Haskell Implementations on Multicore

Jost Berthold, Simon Marlow, Abyd Al Zain, Kevin Hammond

The Parallel Haskell Landscape

• Research into parallelism using Haskell has been ongoing since the late 1980s
  – semi-implicit, deterministic programming model: par :: a -> b -> b
  – Strategies package up larger parallel computation patterns, separating the algorithm from the parallelism
  – the GUM implementation ran on clusters or multiprocessors, using PVM
  – successful: linear speedups on large clusters

• Another Parallel Haskell variant: Eden
  – more explicit than par: the programming model says where the evaluation happens
  – also able to express parallel computation skeletons, e.g. parMap
  – implementation based on GHC; runs on clusters and multiprocessors using PVM-based communication
  – multiple heaps, not virtually shared as in GUM (a simpler implementation)

• Several other Parallel/Distributed Haskell dialects, mostly research prototypes, all based on distributed heaps (some virtually shared)

The Parallel Haskell Landscape

• Recently (2005) shared-memory parallelism was added to GHC
  – single shared heap
  – programming models supported:
    • pure: par and Strategies; soon, Data Parallel Haskell
    • impure, non-deterministic: Concurrent Haskell, STM
  – widely available, high-quality implementation
  – very lightweight concurrency (we win concurrency benchmarks)
  – parallel GC added recently

• This work:
  – compare distributed and shared-heap models
  – analyse the performance of the shared-heap implementation
    • implement execution profiling
    • make improvements to the runtime

Shared vs. Distributed heaps

• Why a shared heap?
  – no communication overhead, hence easier to program
  – good for fine-grained tasks with plenty of communication and sharing

• Why a distributed heap?
  – parallel GC is much easier
  – no cache-coherency overhead
  – no mutexes

The GpH programming model

• par :: a -> b -> b
  – stores a pointer to a in a spark pool
  – an idle CPU takes a spark from the spark pool and turns it into a thread

• seq :: a -> b -> b
  – used for sequential ordering

parMap :: (a -> b) -> [a] -> [b]
parMap f [] = []
parMap f (x:xs) =
  let y  = f x
      ys = parMap f xs
  in  y `par` (ys `seq` y:ys)

sumEuler :: Int -> Int
sumEuler n = sum (map phi [1..n])

phi :: Int -> Int
phi n = length (filter (relprime n) [1..(n-1)])

sumEuler :: Int -> Int
sumEuler n = sum (parMap phi [1..n])

phi :: Int -> Int
phi n = length (filter (relprime n) [1..(n-1)])

sumEuler :: Int -> Int
sumEuler n = parChunkFoldMap (+) phi [1..n]

phi :: Int -> Int
phi n = length (filter (relprime n) [1..(n-1)])

parChunkFoldMap :: (b -> b -> b) -> (a -> b) -> [a] -> b
parChunkFoldMap f g xs =
  foldl1 f (map (foldl1 f . map g) (splitAtN c xs)
              `using` parList rnf)
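The chunking helper splitAtN and the chunk size c are used but not defined on the slide. A minimal sketch of what they might look like (both the definition and the particular chunk size are assumptions, not the authors' code):

-- Assumed helper: chop a list into chunks of at most n elements,
-- so that each spark evaluates a whole chunk rather than one element.
splitAtN :: Int -> [a] -> [[a]]
splitAtN _ [] = []
splitAtN n xs = ys : splitAtN n zs
  where (ys, zs) = splitAt n xs

-- An assumed fixed chunk size (in practice chosen by experiment).
c :: Int
c = 1000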

sumEuler benchmark

sumEuler execution profile

1. Standard GHC, 8 CPUs (2 x quad-core)
2. Eden using PVM, 8 CPUs (2 x quad-core)

Analysis (1)

• The shared-heap implementation was spending a lot of time at the GC barrier.

• It turned out that the GC barrier had a bug: it was stopping one CPU at a time. We fixed that.

• Also, reducing the number of barriers by increasing the size of the young generations helps a bit (see the command-line sketch below).
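For reference, the young-generation (allocation area) size and the number of cores are controlled by GHC's runtime-system flags. A hedged sketch of how the "5MB young generation" configuration in the next profile could be invoked (the program name SumEuler.hs is hypothetical):

# compile with the threaded runtime (newer GHCs also need -rtsopts)
ghc -O2 -threaded --make SumEuler.hs

# run on 8 cores with a 5MB allocation area (young generation) per core
./SumEuler +RTS -N8 -A5m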

sumEuler execution profile (2)

1. Standard GHC, including the fix for the GC barrier and a 5MB young generation
2. Eden using PVM, 8 CPUs (2 x quad-core)

Analysis (2)

• Some of the gaps are due to poor load-balancing.

• The existing load-balancing strategy was based on pushing spare work to idle CPUs
  – there could be a long delay between a CPU becoming idle and receiving work from another CPU.

• We implemented lock-free work-stealing queues for load-balancing of sparks (an illustrative sketch follows below).
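The real spark queues are lock-free, array-based deques implemented in C inside the GHC runtime system. The following is only an illustrative Haskell sketch of the work-stealing discipline (the owner pushes and pops at one end, thieves steal from the other); the names SparkPool, pushSpark, popSpark and stealSpark are invented for the example, and an IORef-based sequence stands in for the lock-free structure:

import Data.IORef
import Data.Sequence (Seq, (|>), ViewL(..), ViewR(..), viewl, viewr)
import qualified Data.Sequence as Seq

-- Illustrative per-CPU spark pool (not the RTS implementation).
newtype SparkPool a = SparkPool (IORef (Seq a))

newPool :: IO (SparkPool a)
newPool = fmap SparkPool (newIORef Seq.empty)

-- Owner adds a freshly created spark at the "young" end of its own pool.
pushSpark :: SparkPool a -> a -> IO ()
pushSpark (SparkPool ref) x = atomicModifyIORef' ref (\q -> (q |> x, ()))

-- Owner takes the most recently pushed spark (good locality).
popSpark :: SparkPool a -> IO (Maybe a)
popSpark (SparkPool ref) = atomicModifyIORef' ref $ \q ->
  case viewr q of
    EmptyR  -> (q, Nothing)
    q' :> x -> (q', Just x)

-- An idle CPU steals the oldest spark from another pool, so it gets work
-- without waiting for the busy CPU to push anything to it.
stealSpark :: SparkPool a -> IO (Maybe a)
stealSpark (SparkPool ref) = atomicModifyIORef' ref $ \q ->
  case viewl q of
    EmptyL  -> (q, Nothing)
    x :< q' -> (q', Just x)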

sumEuler execution profile (3)

1. Standard GHC + GC barrier fixes + work-stealing
2. Eden using PVM, 8 CPUs (2 x quad-core)

Analysis (3)

• High priority: implement per-CPU GC
  – each CPU has a local heap that can be collected independently of the other CPUs
  – plus a single shared global heap, collected much less frequently using stop-the-world
  – e.g. Concurrent Caml, Manticore

• Lower the overhead of spark activation by having a dedicated thread to run sparks.
  – This will make the implementation less sensitive to granularity: there is less need to group work into “chunks”, making it easier for programmers to get a speedup.

Matrix multiplication

• Using strategies, we can parallelise matrix multiply either elementwise, by grouping rows or columns, or blockwise (a sketch of a row-wise version follows below).

• In Eden, the matrix data is communicated between the processing elements, but no PE keeps a complete copy of the matrix.
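As a concrete illustration of the row-wise strategy mentioned above, here is a minimal sketch, not the benchmark code itself: each row of the result is computed as one parallel task. It is written against the current parallel package (parList rdeepseq); the Strategies API of the talk's era would spell the same thing parList rnf.

import Control.Parallel.Strategies
import Data.List (transpose)

type Matrix = [[Double]]

-- Row-wise parallel matrix multiplication: one spark per result row.
multiply :: Matrix -> Matrix -> Matrix
multiply a b = map row a `using` parList rdeepseq
  where
    bt     = transpose b                            -- columns of b as rows
    row ar = [ sum (zipWith (*) ar bc) | bc <- bt ] -- one row of the result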

Matrix multiplication

1. Standard GHC, 8 CPUs (2 x quad-core)

2. Standard GHC + GC barrier fix + work-stealing

3. Eden

Analysis (4)

• The distributed memory implementation suffers due to communication overhead.

• Also, the distributed-memory algorithm is more complex, because it tries to avoid copying the input data.

• We still have a way to go, though: GHC achieves a 5.6× speedup on 8 CPUs.

Further Challenges

• Work duplication
  – GHC doesn’t prevent multiple threads from duplicating a computation; instead it tries to discover duplicated work in progress and halt one of the threads.
  – preventing duplication up-front is expensive: extra memory operations (black holes), or even atomic instructions
  – we found that in some cases work duplication really does affect scaling
  – so we want up-front prevention for some computations

Further Challenges

• Space leak in par
  – “par e1 e2” stores a pointer to e1 in the spark pool before evaluating e2
  – typically e1 and e2 share some computation
  – if we don’t have enough processors, we might not evaluate e1 in parallel
  – how do we know when we can discard that entry from the spark pool? If we don’t ever discard entries from the spark pool, we have a space leak.
  – “when e2 has completed” doesn’t work, e.g. for parMap
  – “when e1 is evaluated” also doesn’t work: e1 itself isn’t shared, but it refers to shared computations
  – “when e1 is disjoint from the program’s live data” is too hard to determine
  – workaround: use only “par x e2” where x is shared with e2 (see the example below).
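A minimal illustration of the workaround (the functions f and g are hypothetical placeholders, just to make the sketch self-contained):

import Control.Parallel (par)

-- Hypothetical expensive functions, standing in for shared computations.
f, g :: Integer -> Integer
f n = sum [1..n]
g n = sum [1..2*n]

-- Leak-prone shape: the sparked expression is an anonymous thunk that the
-- rest of the program does not necessarily reuse, so its spark-pool entry
-- may never be recognisably dead.
leaky :: Integer -> Integer
leaky n = f n `par` (g n + f n)

-- Workaround from the slide: spark only a let-bound value that the main
-- expression also uses, so the spark-pool entry dies exactly when y does.
sharedSpark :: Integer -> Integer
sharedSpark n = let y = f n in y `par` (g n + y)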

Conclusions

• The tradeoff between distributed and shared heaps is a complex one
  – a distributed heap can give better performance
  – but it is harder to program against: the programmer must think about communication
  – we believe a shared heap is the better model in the short term, but as we need to scale to larger numbers of cores or to NUMA architectures, a distributed or hybrid model will become necessary.

• We have made significant improvements to the performance of parallel programs in GHC
  – and identified several further areas for improvement
  – GHC 6.10.1 (released next week) contains some of these improvements; download it and try it out!
