Page 1: Multicore Programming in Haskell Now!

© 2009 Galois, Inc. All rights reserved.

Multicore Haskell Now!

Don Stewart | DEFUN | Edinburgh, Scotland | Sep 2009

Haskell and Parallelism: Why?

• Language reasons:

– Purity, laziness and types mean you can find more parallelism in your code

– No specified execution order

– Speculation and parallelism are safe.

• Purity provides inherently more parallelism

• High level: more productivity than, say, C++

Haskell and Parallelism

• Statically typed and heavily optimized: more performance than, say, Python or Erlang.

• Custom multicore runtime: high-performance threads are a primary concern. Thanks, Simon Marlow!

• Mature: 20 year code base, long term industrial use, massive library system

• Demonstrated performance

The Goal

• Parallelism: exploit parallel computing hardware to improve performance

• Concurrency: logically independent tasks as a structuring technique

• Improve performance of programs by using multiple cores at the same time

• Improve performance by hiding latency for IO-heavy programs

Overview

• Background + Refresh

• Toolchain

• GHC runtime architecture

• The Kit

– Sparks and parallel strategies

– Threads and shared memory

– Transactional memory

– Data parallelism

• Debugging and profiling

• Garbage collection

Source for this talk

• Slides and source on the blog, along with links to papers for further reading

– http://donsbot.wordpress.com

Syntax refresh

main = print (take 1000 primes)

primes = sieve [2..]
  where
    sieve (p:xs) =
      p : sieve [ x | x <- xs, x `mod` p > 0]

Syntax refresh

main :: IO ()
main = print (take 1000 primes)

primes :: [Int]
primes = sieve [2..]
  where
    sieve :: [Int] -> [Int]
    sieve (p:xs) =
      p : sieve [ x | x <- xs, x `mod` p > 0]

Compiling Haskell programs

$ ghc -O2 --make A.hs

[1 of 1] Compiling Main ( A.hs, A.o )

Linking A …

$ ./A

[2,3,5,7,11,13,17,19,23, … 7883,7901,7907,7919]

Compiling parallel Haskell programs

$ ghc -O2 --make -threaded Foo.hs

[1 of 1] Compiling Main ( Foo.hs, Foo.o )

Linking Foo …

$ ./Foo +RTS -N8

Add the -threaded flag for parallel programs.

Specify at runtime (here with +RTS -N8) how many real (OS) threads to map Haskell's logical threads to.

In this talk, “thread” means Haskell's cheap logical threads, not those 8 OS threads.

IO is kept separate

In Haskell, side-effecting code is tagged statically, via its type.

getChar :: IO Char

putChar :: Char → IO ()

Such side-effecting code can only interact with other side-effecting code. It can't mess with pure code. Checked statically.

Imperative programming (default sequentialisation and side effects) is off by default :-)

Haskellers control effects by trapping them in the IO box
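To make the IO/pure split concrete, here is a small sketch (the names `shout` and `greet` are illustrative, not from the talk). The pure function can be used anywhere; the `IO` action can only be run from other IO code such as `main`.

```haskell
import Data.Char (toUpper)

-- Pure: the type says there are no side effects, so the result
-- depends only on the argument.
shout :: String -> String
shout = map toUpper

-- Effectful: IO in the result type marks this as an action that does output.
greet :: String -> IO ()
greet name = putStrLn ("Hello, " ++ shout name)

main :: IO ()
main = greet "edinburgh"

-- Using greet where a pure String is expected is rejected at compile time:
--   bad :: String -> String
--   bad s = greet s   -- type error: IO () does not match String
```

Running `main` prints `Hello, EDINBURGH`.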

The Toolchain

Toolchain

• GHC 6.10.x or 6.12.x

• Haskell Platform 2009.2.0.2

– http://hackage.haskell.org/platform/

• Dual core x86-64 laptop running Linux

• GHC HEAD branch (6.12) is even better

– Sparks cheaper

– GC parallelism tuned

Warm up lap

• Check your machines are working

• Find out how many cores are available

• Fire up ghc

• Make sure you've got a threaded runtime

Warm up lap: 01.hs

import GHC.Conc
import System.Info
import Text.Printf
import Data.Version

main = do
    printf "Compiled with %s-%s on %s/%s\n"
        compilerName
        (showVersion compilerVersion)
        os arch
    printf "Running with %d OS threads\n" numCapabilities

Warm up lap: 01.hs

$ ghc -O2 --make -threaded 01.hs

[1 of 1] Compiling Main ( 01.hs, 01.o )

Linking 01 …

$ ./01 +RTS -N2

Compiled with ghc-6.10 on linux/x86_64

Running with 2 OS threads

The GHC Runtime

The GHC Runtime

• Multiple virtual CPUs

– Each virtual CPU has a pool of OS threads

– CPU-local spark pools for additional work

• Lightweight Haskell threads map onto OS threads: many to one (many Haskell threads share each OS thread).

• Automatic thread migration and load balancing

• Parallel, generational GC

• Transactional memory and MVars.

• We will use all of these

Runtime Settings

Standard flags when compiling and running parallel programs

• Compile with

– -threaded -O2

• Run with

– +RTS -N2

– +RTS -N4

– ...

1. Implicit Parallelism: Sparks and Strategies

The `par` combinator

Lack of side effects makes parallelism easy, right?

f x y = (x * y) + (y ^ 2)

• We could just evaluate every sub-expression in parallel

• It is always safe to speculate on pure code

But that creates far too many parallel tasks to execute

So in Haskell, the strategy is to give the user control over which expressions are sensible to run in parallel

Semi-implicit parallelism

• Haskell gives us “parallel annotations”.

• Annotations on code that hint when parallelism is useful

– Very cheap post-hoc/ad-hoc parallelism

• Multicore programming without explicit:

– Threads

– Locks

– Communication

• Often good speedups with very little effort

Provided by: the parallel library

http://hackage.haskell.org/packages/parallel

$ ghc-pkg list parallel

/usr/lib/ghc-6.10.4/./package.conf:

parallel-1.1.0.1

import Control.Parallel

$ cabal unpack parallel

Ships with the Haskell Platform.

The `par` combinator

All parallelism built up from the `par` combinator:

a `par` b

• Creates a spark for 'a'

• Runtime sees chance to convert spark into a thread

• Which in turn may get run in parallel, on another core

• 'b' is returned

• No restrictions on what you can annotate

What `par` guarantees

• `par` doesn't guarantee a new Haskell thread

• It “hints” that it would be good to evaluate the argument in parallel

• The runtime is free to decide

– Depending on workload

– Depending on cost of the value

• This allows `par` to be very cheap

• So we can use it almost anywhere

• To overapproximate the parallelism in our code

The `pseq` combinator

We also need a way to say “do it in this thread first”

And the second function, pseq:

pseq :: a → b → b

It says:

• “evaluate 'a' in the current thread, then return b”

• Ensures work is run in the right thread

Putting it together

Together we can parallelise expressions:

f `par` e `pseq` f + e

• One spark created for 'f'

• 'f' spark converted to a thread and executed

• 'e' evaluated in current thread in parallel with 'f'
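A minimal runnable version of the f `par` e `pseq` f + e pattern, assuming the `parallel` package's `Control.Parallel` module; `sumTo` and `both` are hypothetical stand-ins for real work. Compile with -threaded and run with +RTS -N2 to give the spark somewhere to go.

```haskell
import Control.Parallel (par, pseq)

-- A hypothetical, deliberately expensive pure computation.
sumTo :: Integer -> Integer
sumTo n = sum [1 .. n]

-- The pattern from the slide: spark f, evaluate e here, then combine.
-- par and pseq are both infixr 0, so the unparenthesised form
--   f `par` e `pseq` f + e
-- parses exactly as written below.
both :: Integer -> Integer -> Integer
both a b = f `par` (e `pseq` (f + e))
  where
    f = sumTo a
    e = sumTo b

main :: IO ()
main = print (both 20000000 10000000)
```

The result is the same with or without the annotations; only the evaluation schedule changes.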

Simple sparks

02.hs

$ ghc-6.11.20090228 02.hs --make -threaded -O2

$ time ./02

1405006117752879898543142606244511569936384000008189

./02 2.00s user 0.01s system 99% cpu 2.015 total

$ time ./02 +RTS -N2

1405006117752879898543142606244511569936384000008189

./02 +RTS -N2 2.14s user 0.03s system 140% cpu 1.542 total

Cautions

• Note: GHC 6.10 won't convert programs with only one spark

– Heuristic tweak

• Don't “accidentally parallelize”:

– f `par` f + e

– Here the main thread may start evaluating 'f' itself, causing the spark to fizzle

• `pseq` lets us methodically prevent accidents

• Need roughly the same amount of work in each thread

• ghc 6.12: use ThreadScope to determine this

Reading runtime output

• Add the -sstderr flag to the program:

– ./02 +RTS -N2 -sstderr

• And we get:

7,904 bytes maximum residency (1 sample(s))

2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 2052 collections, 0 parallel, 0.19s, 0.18s elapsed

Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

Parallel GC work balance: nan (0 / 0, ideal 2)

SPARKS: 2 (2 converted, 0 pruned)

%GC time 7.9% (10.8% elapsed)

Productivity 92.1% of total user, 144.6% of total elapsed

ThreadScope output

• ThreadScope isn't live yet, but it already helps us think about spark code. Try it out!

Increasing the parallelism : 03.hs

• Push the sparks down from the top level, into the recursion

• Parfib!! 03.hs

• Single core:

• $ time ./03 43 +RTS

parfib 43 = 433494437

./03 43 +RTS 22.42s user 0.05s system 97% cpu 23.087 total
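The 03.hs source isn't reproduced in this transcript; the following is a sketch of what a naive parfib typically looks like, assuming `Control.Parallel` from the `parallel` package. It sparks one recursive call at every level of the recursion, which would explain spark counts like the ones reported on the next slides.

```haskell
import Control.Parallel (par, pseq)
import System.Environment (getArgs)

-- Naive parallel fib: spark one recursive call at *every* level.
-- This creates roughly one spark per call, most far too small to pay off.
parfib :: Int -> Integer
parfib n
  | n < 2     = fromIntegral n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parfib (n - 1)
    y = parfib (n - 2)

main :: IO ()
main = do
  [arg] <- getArgs
  let n = read arg
  putStrLn ("parfib " ++ show n ++ " = " ++ show (parfib n))
```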

Increasing the parallelism : 03.hs

• Push the sparks down from the top level, into the recursion

• Parfib!! 03.hs

• $ time ./03 43 +RTS -N2

parfib 43 = 433494437

./03 43 +RTS -N2 27.21s user 0.27s system 136% cpu 20.072 total

• Only a little faster... what went wrong?

Check what the runtime says

• ./03 43 +RTS -N2 -sstderr 24.74s user 0.40s system 120% cpu 20.806 total

...

SPARKS: 701498971 (116 converted, 447756454 pruned)

...

• Seems like an awful lot of sparks

• N.B. Spark statistics are only available in GHC >= 6.11

Still not using all the hardware

• Key trick:

– Push sparks into the recursion

– But have a cutoff for when the costs are too high.

• `par` is cheap (and getting cheaper!), but not free.

Not too fine grained: 04.hs

• Use thresholds for sparking in the recursion

• $ time ./04 43 11 +RTS -N2

parfib 43 = 433494437

./04 43 17 +RTS -N2 -sstderr 8.05s user 0.03s system 190% cpu 4.239 total
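A sketch of the thresholded version, under the assumption (not confirmed by the transcript) that 04.hs takes the problem size and the cutoff as its two command-line arguments. Below the cutoff it switches to plain sequential `fib`, so each spark carries a worthwhile amount of work.

```haskell
import Control.Parallel (par, pseq)
import System.Environment (getArgs)

-- Plain sequential fib, used below the cutoff.
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Spark only while the subproblem is still large; at or below the
-- threshold t, fall back to the sequential version.
parfib :: Int -> Int -> Integer
parfib t n
  | n < 2 || n <= t = fib n
  | otherwise       = x `par` (y `pseq` (x + y))
  where
    x = parfib t (n - 1)
    y = parfib t (n - 2)

main :: IO ()
main = do
  [a, b] <- getArgs
  let n = read a
      t = read b
  putStrLn ("parfib " ++ show n ++ " = " ++ show (parfib t n))
```

Raising the threshold trades parallelism for fewer, larger sparks; the right value is found by measuring.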

Laziness, NFData, GC

• Laziness plays a role in parallel Haskell

– It enables, but can also confuse, parallelism.

• When designing parallel algorithms we often need to say precisely where data is evaluated

• Laziness can cause unevaluated data to migrate between threads, ending up evaluated in an unintended thread

• The NFData class solves this for us.

NFData

class NFData a where
    -- | Reduces its argument to (head) normal form
    rnf :: Strategy a

• type Strategy a = a → ()

• For different types, we state how to strictly evaluate them (ensuring all work is done in the thread we expect).

NFData instances

instance NFData a => NFData [a] where
    rnf [] = ()
    rnf (x:xs) = rnf x `seq` rnf xs

instance (NFData a, NFData b) => NFData (a,b) where
    rnf (x,y) = rnf x `seq` rnf y

instance NFData Int
instance NFData Integer
instance NFData Float
instance NFData Double
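In later releases `NFData` moved out of `parallel` into the `deepseq` package (`Control.DeepSeq`), where `rnf :: a -> ()` plays the same role. A sketch assuming that package plus `parallel`: an instance for a hypothetical `Tree` type written in the same style as the instances above, and `force` used so that a spark builds the whole tree rather than only its outermost constructor.

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Parallel (par, pseq)

-- A hypothetical user-defined type.
data Tree = Leaf | Node Tree Int Tree

-- Same shape as the instances on the slide: evaluate every field fully.
instance NFData Tree where
  rnf Leaf         = ()
  rnf (Node l x r) = rnf l `seq` rnf x `seq` rnf r

-- Build a balanced tree of the given depth (lazily, by default).
build :: Int -> Tree
build 0 = Leaf
build d = Node (build (d - 1)) d (build (d - 1))

size :: Tree -> Int
size Leaf         = 0
size (Node l _ r) = size l + 1 + size r

main :: IO ()
main = do
  -- force makes whoever evaluates t1 (here, the spark) build the whole tree.
  let t1 = force (build 20)
      t2 = build 18
  print (t1 `par` (t2 `pseq` (size t1 + size t2)))
```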

Parallel list sort : 05.hs

• Evaluate a list of randoms in the main thread

• Parallel sort to depth 'n'

• Sequential sort after that.

• Merge back up
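The steps above, sketched as a parallel merge sort under stated assumptions: `psort` and `merge` are hypothetical names, the input is a deterministic stand-in for 05.hs's random list, and `force` (from `deepseq`) makes each half be sorted completely inside its own spark instead of being handed back as a thunk.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Parallel (par, pseq)
import Data.List (sort)

-- Merge two sorted lists into one sorted list.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys

-- Sort in parallel down to depth d, sort sequentially below that,
-- then merge the halves back up.
psort :: (Ord a, NFData a) => Int -> [a] -> [a]
psort d xs
  | d <= 0 || length xs < 2 = sort xs
  | otherwise = left `par` (right `pseq` merge left right)
  where
    (lo, hi) = splitAt (length xs `div` 2) xs
    -- force: sort each half completely in whichever thread evaluates it
    left  = force (psort (d - 1) lo)
    right = force (psort (d - 1) hi)

main :: IO ()
main = do
  -- Stand-in input, fully evaluated in the main thread before sorting.
  let xs = force [ (i * 7919) `mod` 100003 | i <- [1 .. 200000 :: Int] ]
  xs `pseq` print (sum (take 10 (psort 4 xs)))
```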

Parallel list sort

$ ghc -O2 -threaded --make 05.hs

./05 700000 10 +RTS -N2 -sstderr 5.99s user 0.21s system 120% cpu 5.130 total

Faster, but not stellar.

Stats explain why:

%GC time 47.5% (46.1% elapsed)

Parallel GC work balance: 1.49 (53828478 / 36153924, ideal 2)

SPARKS: 986 (986 converted, 0 pruned)

Garbage collection

• The GHC garbage collector is a parallel stop-the-world collector

• Stopping-the-world means running no threads

• You don't want to do that very often

• Check your GC stats (-sstderr) and bring the GC percentage down by enlarging the allocation area (-A400M) or the suggested heap size (-H400M).

Up the allocation area

• ./05 700000 10 +RTS -N2 -sstderr -A300M 5.67s user 0.61s system 131% cpu 4.762 total

– %GC time 19.6% (21.9% elapsed)

• Allocation heavy recursive programs are hard with sparks.

Clean code is now dirty

• Parallel Haskell code gets covered in `par` grit

parMap _ [] = []

parMap f (x:xs) = let r = f x in r `par` r : parMap f xs

• We can parallelize the spine of the list, but the elements are still serial.

• Need nested strategies for parallelising types

Parallel strategies

• We might want to, say, force list elements:

parMap _ [] = []
parMap f (x:xs) = let r = f x
                  in forceList r `par` r : parMap f xs

• Can't write a custom version for every type

• So let's parameterize the functions by the strategy

• Using NFData

Strategies

Adding composability and modularity:

parList :: Strategy a -> Strategy [a]
parList strat [] = ()
parList strat (x:xs) = strat x `par` parList strat xs

A strategy that applies another strategy to each element in parallel

Cute combinators

• We can abstract out the common patterns

using :: a -> Strategy a -> a

using x s = s x `seq` x

demanding :: a -> Done -> a

demanding = flip seq

sparking :: a -> Done -> a

sparking = flip Parallel.par

Strategies

From these two building blocks, a range of strategies for common parallelisation tasks is built up:

parMap :: NFData b => (a -> b) -> [a] -> [b]
parMap f xs = map f xs `using` parList rnf

Default parallelisation strategies driven by the shape of the data

Reuse!

Very flexible, light programming model: just annotate your code speculating on what is worthwhile to evaluate.
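The slides use the parallel-1.x API, where a strategy returns `Done`. In parallel-3.x, `Strategy a` is `a -> Eval a` and `rdeepseq` takes over the role of `rnf`; the library's `parMap` takes the element strategy as its first argument (as in `parMap rnf` on the binary-trees slide). A sketch under that newer API, with `work` as a hypothetical per-element job:

```haskell
import Control.Parallel.Strategies (parList, parMap, rdeepseq, using)

-- A hypothetical, deliberately slow job to run on each element.
work :: Int -> Integer
work n = sum [ fromIntegral (i * i) | i <- [1 .. n] ]

main :: IO ()
main = do
  let inputs = [100000, 200000 .. 1000000]
      -- Algorithm (map work) kept separate from how it is evaluated:
      rs1 = map work inputs `using` parList rdeepseq
      -- The same thing via parMap, which takes the element strategy first:
      rs2 = parMap rdeepseq work inputs
  print (rs1 == rs2)
```

Either way the strategy can be swapped or deleted without touching `work`, which is the modularity the slide is after.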

The binary-trees benchmark: 06.hs

• Binary-trees: allocate and traverse many trees

• Top level:

– let vs = parMap rnf $ depth' maxN [minN,minN+2..maxN]

• ./06 20 52.78s user 0.39s system 98% cpu 54.016 total

• ./06 20 +RTS -N2 -A500M -sstderr 31.41s user 1.09s system 168% cpu 19.284 total

Strategies

Programming Model

• Deterministic:

– Same results with parallel and sequential programs

– No races, no errors

– Good for reasoning: erase the `par` and get the original program

• Cheap: sprinkle `par` as you like, then measure and refine

• Measurement is much easier with ThreadScope

• Strategies: high level combinators for common patterns

Spark queues

• How does it work?

– -N4 gives us 4 heavy OS threads

– The runtime multiplexes many Haskell threads onto them

– Generated with forkIO or par

– ~One OS thread (“worker thread”) per CPU

– Worker threads may migrate

– Each CPU has a spark pool. `par` adds your thunk to the current CPU's list of work

– Idle worker threads turn a spark into a Haskell thread

– Haskell threads keep stealing sparks from other CPUs' pools

Sparks and Strategies: Summary

Cheap to annotate programs with `par` and `pseq`

• Fine-grained parallelism

• Sparks need to be cheap

• Work-stealing thread pool in the runtime, underneath

• Relies on purity: no side effects to get in the way

• Takes practice to learn where `par` is beneficial

• A good tool to have in the kit

Sparks and Strategies: Exercises

• Write a Haskell program with a top level parMap that computes several tasks in parallel, showing absolute speedups

• Write a recursive function that uses sparks and speeds up.

• Add a depth limit to a recursive function to gain further improvements

• Use a parallel strategy for the same code.

• Fold a binary tree in parallel.

2. Explicit Parallelism: Threads and Shared Memory

Explicit concurrency with threads

For stateful or imperative programs, we need explicit threads, not speculative sparks.

forkIO :: IO () → IO ThreadId

• Takes a block of code to run, and executes it in a new Haskell thread

Concurrent programming with threads: 07.hs

import Control.Concurrent
import System.Directory

main = do
    forkIO (writeFile "xyz" "thread was here")
    v <- doesFileExist "xyz"
    print v
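As written, 07.hs is racy: `doesFileExist` may run before the forked thread has written the file, so it can print either `True` or `False`. That nondeterminism is the point of the next slide; if you need a deterministic answer, one way is to make the main thread wait on an `MVar` used as a completion signal. A sketch (`writeThenCheck` is a hypothetical name; `System.Directory` comes from the `directory` package):

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import System.Directory (doesFileExist)

-- Fork the write, wait for the child to signal completion, then check.
writeThenCheck :: FilePath -> IO Bool
writeThenCheck path = do
  done <- newEmptyMVar
  _ <- forkIO (writeFile path "thread was here" >> putMVar done ())
  takeMVar done               -- block until the child has finished writing
  doesFileExist path

main :: IO ()
main = writeThenCheck "xyz" >>= print
```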

Programming model

• Threads are preemptively scheduled

• Non-deterministic scheduling: random interleaving

• When the main thread terminates, all threads terminate (“daemonic threads”)

• Threads may be preempted when they allocate memory

• Communicate via messages or shared memory

Asynchronous Exceptions: 08.hs 09.hs

• We need to communicate with threads somehow.

• One simple way is via asynchronous messages.

– import Control.Exception

• Just throw messages at each other, catching them and handling them as you see fit.

• Good technique to know

• Good for writing fault tolerant code
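A small sketch of throwing an asynchronous exception at another thread, using only `base`: `killThread` throws `ThreadKilled` to the worker, and `forkFinally` reports how the worker ended. The helper names `outcome` and `runAndKill` are illustrative, not from the talk's 08.hs/09.hs.

```haskell
import Control.Concurrent (forkFinally, killThread, newEmptyMVar, putMVar, takeMVar, threadDelay)
import Control.Exception (AsyncException (ThreadKilled), SomeException, fromException)

-- Describe how a thread ended, as reported by forkFinally.
outcome :: Either SomeException () -> String
outcome (Right ()) = "finished"
outcome (Left e) = case fromException e of
  Just ThreadKilled -> "killed"
  _                 -> "failed: " ++ show e

-- Fork a worker that would sleep for 10 seconds, then interrupt it
-- with an asynchronous exception (killThread throws ThreadKilled).
runAndKill :: IO String
runAndKill = do
  done <- newEmptyMVar
  tid  <- forkFinally (threadDelay 10000000) (putMVar done . outcome)
  killThread tid          -- returns once the exception has been delivered
  takeMVar done           -- wait for the worker's report

main :: IO ()
main = runAndKill >>= putStrLn
```

`forkFinally` installs its handler with exceptions masked, so the report is sent even if the kill arrives before the worker has started running.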

Shared Memory: MVars

• We need to communicate between threads

• We need threads to wait on results

• Haskell is pure – variables are immutable, so sharing values is safe

• Use shared, mutable synchronizing variables to communicate

Synchronization achieved via MVars or STM

Shared Memory: MVars

• import Control.Concurrent.MVar

• MVars are boxes. They are either full or empty

– putMVar :: MVar a → a → IO ()

– takeMVar :: MVar a → IO a

• “put” on a full MVar causes the thread to sleep until the MVar is empty

• “take” on an empty MVar blocks until it is full.

• The runtime will wake you up when you're needed

Putting things in their boxes

do box <- newEmptyMVar
   forkIO (f `pseq` putMVar box f)
   e `pseq` return ()
   f' <- takeMVar box
   print (e + f')
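The snippet above leaves `e` and `f` abstract. A self-contained sketch that wraps it in a function, assuming `pseq` from the `parallel` package; `addInParallel` is a hypothetical name and the factorials are just stand-ins for expensive values.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Parallel (pseq)

-- Evaluate f in a forked thread and e in the current one, then add them.
-- pseq makes each thread evaluate its value (to WHNF, which for Integer
-- means fully) before passing it on, so no unevaluated thunk crosses the MVar.
addInParallel :: Integer -> Integer -> IO Integer
addInParallel f e = do
  box <- newEmptyMVar
  _ <- forkIO (f `pseq` putMVar box f)
  e `pseq` return ()
  f' <- takeMVar box
  return (e + f')

main :: IO ()
main = do
  r <- addInParallel (product [1 .. 20000]) (product [1 .. 15000])
  print (r `mod` 1000003)   -- print a small residue instead of a huge number
```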

Forking tasks and communicating: 10.hs

• Here we create explicit Haskell threads, and set up shared memory for them to communicate

• Lower level than using sparks. More control

$ time ./10 +RTS -N2 -stderr

93326215443944152681...

./10 +RTS -N2 -stderr 2.32s user 0.06s system 146% cpu 1.627 total

Hiding IO Latency

• When you have some expensive IO action, fork a thread for the work

• And return to the user for more work

• Works well for hiding disk and network latency

• Transparently scales: just add more cores and the Haskell threads will go there.

Waking up threads: 10a.hs

• Thread-ring: most threads are asleep waiting on a message

• Split into pools onto each core

• Don't want to migrate threads between cores!

• $ time ./10a +RTS -N2 -qm -qw -RTS 50000000

./10a +RTS -N2 -qm -qw -RTS 50000000 10.38s user 0.02s system 99% cpu 10.423 total

• $ time ./10a +RTS -N2 -RTS 50000000

^C

./10a +RTS -N2 -RTS 50000000 28.23s user 41.22s system 84% cpu 1:21.78 total

Interactive compression: 11.hs

• Fork a thread for each job to compress

• Immediately return to the user for more work

• Toss threads onto idle cores

• Threads are also lightweight. We can create a million of them: 12.hs
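The 12.hs source isn't shown here; this is a sketch of the same idea under the assumption that each thread only needs to run once and report in. A shared `IORef` countdown lets the last thread to finish wake the parent, so no thread stays blocked (`forkMany` is a hypothetical name).

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, when)
import Data.IORef (atomicModifyIORef', newIORef)

-- Fork n cheap Haskell threads. Each decrements a shared counter; whichever
-- thread brings it to zero wakes the parent, so the parent knows all n ran.
forkMany :: Int -> IO ()
forkMany n
  | n <= 0    = return ()
  | otherwise = do
      remaining <- newIORef n
      done      <- newEmptyMVar
      forM_ [1 .. n] $ \_ -> forkIO $ do
        left <- atomicModifyIORef' remaining (\k -> (k - 1, k - 1))
        when (left == 0) (putMVar done ())
      takeMVar done

main :: IO ()
main = do
  forkMany 1000000
  putStrLn "all 1000000 threads have run"
```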

Messages via MVars: 13.hs

• We can send messages, and have threads sleep waiting for them

• If multiple threads are waiting on an MVar, only one of them is woken (GHC serves blocked threads in first-in, first-out order)

• Note how the logic changes such that we have to:

– Poll for messages, rather than being interrupted

– Possible messages are statically constrained
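A sketch of the message-passing style described above (the `Msg` type and `worker` loop are hypothetical, not 13.hs itself): the worker sleeps in `takeMVar` until a message arrives, and the `Msg` type fixes the set of possible messages at compile time.

```haskell
import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar)

-- The set of possible messages is fixed by this (hypothetical) type.
data Msg = Add Int | Stop

-- Worker loop: block in takeMVar until a message arrives, handle it, repeat.
-- On Stop it sends its running total back and exits.
worker :: MVar Msg -> MVar Int -> Int -> IO ()
worker inbox reply total = do
  msg <- takeMVar inbox
  case msg of
    Add k -> worker inbox reply $! total + k   -- keep the total evaluated
    Stop  -> putMVar reply total

-- Send a list of numbers to a fresh worker and collect its total.
sumViaWorker :: [Int] -> IO Int
sumViaWorker xs = do
  inbox <- newEmptyMVar
  reply <- newEmptyMVar
  _ <- forkIO (worker inbox reply 0)
  -- each put waits until the worker has taken the previous message
  mapM_ (putMVar inbox . Add) xs
  putMVar inbox Stop
  takeMVar reply

main :: IO ()
main = sumViaWorker [1 .. 100] >>= print   -- prints 5050
```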

Safely modifying an MVar

• Sets of take/put have a race condition if an exception is thrown

• We may not put the value back in the MVar

• Instead we should use modifyMVar

– modifyMVar :: MVar a → (a → IO (a, b)) → IO b

• Safely captures the take; f; put pattern.
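A sketch of `modifyMVar` as a safe take-update-put, assuming a hypothetical ticket counter shared by several threads (`nextTicket` and `ticketRun` are illustrative names). The new value is forced before it goes back in the box, in line with the laziness caution later in the talk.

```haskell
import Control.Concurrent (MVar, forkIO, modifyMVar, newEmptyMVar, newMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)

-- Take a ticket: read the counter, store the incremented value, return the old one.
-- modifyMVar restores the original contents if the update throws.
nextTicket :: MVar Int -> IO Int
nextTicket counter = modifyMVar counter $ \n ->
  let n' = n + 1
  in n' `seq` return (n', n)   -- force the new value before putting it back

-- Fork some threads that each take k tickets; return the final counter value.
ticketRun :: Int -> Int -> IO Int
ticketRun threads k = do
  counter <- newMVar 0
  done    <- newEmptyMVar
  forM_ [1 .. threads] $ \_ -> forkIO $ do
    replicateM_ k (nextTicket counter)
    putMVar done ()
  replicateM_ threads (takeMVar done)
  takeMVar counter

main :: IO ()
main = ticketRun 10 1000 >>= print
```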

Shared Memory: Chans: 14.hs

• Chans: good for unbounded numbers of shared messages

• Send and receive messages of a pipe-like structure

• Can be converted to a lazy list, representing all future messages!
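A sketch of a `Chan` consumed as a lazy list via `getChanContents` (from `base`'s `Control.Concurrent.Chan`); `firstSquares` is a hypothetical name. Each list element is read from the channel only when it is demanded.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (getChanContents, newChan, writeChan)

-- A producer thread writes the first n squares into an unbounded channel.
-- The consumer reads the channel as a lazy list of all future messages;
-- each element is fetched (blocking if necessary) only when demanded.
firstSquares :: Int -> IO [Int]
firstSquares n = do
  ch <- newChan
  _  <- forkIO (mapM_ (writeChan ch) [ i * i | i <- [1 .. n] ])
  msgs <- getChanContents ch
  return (take n msgs)   -- demanding more than n elements would block forever

main :: IO ()
main = firstSquares 5 >>= print   -- prints [1,4,9,16,25]
```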

Care with Laziness

• As with `par`, laziness has a role with MVars and forkIO

• Possible to pass unevaluated actions between threads

• Work might also be migrated away from where you want it done

• strict-concurrency package has strict variables only

• putMVar mv $! x + 1

Useful patterns: 15.hs

• A parM combinator

– Generic work queue with children threads

– Synchronizes the computed message output

– Like parMap but for IO

– Children threads migrate around the cores
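The 15.hs source isn't reproduced here, so this is only a sketch of a parMap-for-IO combinator in the same spirit, using one child thread per element rather than a fixed-size work queue. `parMapIO` and `tryAny` are hypothetical names; `evaluate` forces each result in the child thread.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, evaluate, throwIO, try)

-- Run an action and capture any exception instead of letting the thread die.
tryAny :: IO a -> IO (Either SomeException a)
tryAny = try

-- One child thread per element, results returned in input order. Each child
-- always fills its MVar (value or exception), so the parent cannot block
-- forever on a child that crashed; a child's exception is rethrown here.
parMapIO :: (a -> IO b) -> [a] -> IO [b]
parMapIO f xs = do
  boxes <- mapM spawn xs
  mapM collect boxes
  where
    spawn x = do
      box <- newEmptyMVar
      -- evaluate forces the result (to WHNF) in the child, not later in the parent
      _ <- forkIO (tryAny (f x >>= evaluate) >>= putMVar box)
      return box
    collect box = takeMVar box >>= either throwIO return

main :: IO ()
main = parMapIO (\n -> return (n * n)) [1 .. 10 :: Int] >>= print
```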

IORef

• For programs with only a few threads, atomicModifyIORef can be used for very simple synchronization

• In Data.IORef

• May be quite a bit faster than MVars
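A sketch of the `IORef` alternative, using `atomicModifyIORef'` (the strict variant in `Data.IORef`) for a counter shared by a few threads; `countWith` is a hypothetical name and the `MVar` is used only to wait for the children to finish.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Several threads bump a shared counter with atomicModifyIORef'.
-- The primed variant also forces the new value, avoiding a chain of thunks.
countWith :: Int -> Int -> IO Int
countWith threads k = do
  ref  <- newIORef (0 :: Int)
  done <- newEmptyMVar
  forM_ [1 .. threads] $ \_ -> forkIO $ do
    replicateM_ k (atomicModifyIORef' ref (\n -> (n + 1, ())))
    putMVar done ()
  replicateM_ threads (takeMVar done)
  readIORef ref

main :: IO ()
main = countWith 4 100000 >>= print   -- prints 400000
```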

Explicit concurrency: Exercises

• Write a program that forks a number of child threads, each of which returns a result to the parent via a shared Chan; the parent waits for all of them to finish before exiting.

• Write a program that downloads webpages in parallel (via the download-curl package).

Transactional Memory

MVars can deadlock

MVar programs can deadlock if one thread is waiting for a value from another thread that will never appear.

Haskell lets us write lock-free synchronization via software transactional memory

Higher level than MVars, much safer, composable, but a bit slower.

Comparing the performance of concurrent linked-list implementations in Haskell (Martin Sulzmann, Edmund S. L. Lam, Simon Marlow) DAMP 2009

Continuing theme: multiple levels of resolution

Software Transactional Memory

• Each atomic block appears to run in complete isolation

• Runtime publishes modifications to shared variables to all threads, or,

• Restarts the transaction that suffered contention

• You have the illusion you're the only thread

STM

• STM added to Haskell in 2005 (MVars in 1995, from Id).

• Used in real systems (ones I work on)

• A composable, safe synchronization abstraction

• An optimistic model

– Transactions run inside atomic blocks assuming no conflicts

– System checks consistency at the end of the transaction

– Retry if conflicts

– Requires control of side effects (handled in the type system)

The stm package

• http://hackage.haskell.org/packages/stm

• $ ghc-pkg list stm

/usr/lib/ghc-6.10.4/./package.conf:

stm-2.1.1.2

• import Control.Concurrent.STM

• $ cabal unpack stm

• In the Haskell Platform

STM

data STM a

atomically :: STM a → IO a

retry :: STM a

orElse :: STM a → STM a → STM a

• We use 'STM a' to build up atomic blocks.

• Transaction code can only run inside atomic blocks

• Inside atomic blocks it appears as if no other threads are running (notion of isolation)

• However, the system uses logs and rollback to handle conflicts

• 'orElse' lets us compose atomic blocks into larger pieces

Transaction variables

TVars replace MVars, and are used inside atomic blocks (they don't have MVar semantics, though).

data TVar a

newTVar :: a → STM (TVar a)

readTVar :: TVar a → STM a

writeTVar :: TVar a → a → STM ()

Actions always succeed: implemented by logging and rollback when there are conflicts, so no deadlocks!

Type system ensures only side effects in transactions are reads and writes to TVars

Atomic bank transfers

import Control.Concurrent.STM

transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to amount =
    atomically $ do
        balance <- readTVar from
        if balance < amount
            then retry
            else do
                writeTVar from (balance - amount)
                tobalance <- readTVar to
                writeTVar to (tobalance + amount)

Extended example: GameInventory: 16.hs

• Multiplayer game, where players' possessions are their state

• Transactional transfers of those possessions

• Restrict shared state to TVars, reducing the load on the system watching for contentions.

• No deadlock!

Safety

• For it to be possible to roll back transactions, atomic blocks can't have visible side effects

• Enforced by the type system

– In the STM monad, you can guarantee atomic safety

• atomically :: STM a -> IO a

• No way to do IO.

– Only pure code

– Exceptions

– Non termination

– Transactional effects

retry: where the magic is

• How does the runtime know when to wake up an atomic section?

• It blocks the thread until something changes in one of its transaction variables

• Automatically waits until we can make progress!
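A sketch showing `retry` waking up on its own, assuming the `stm` package. `transfer` here is written in `STM` (rather than `IO` as on the bank-transfer slide) so it could be composed further, and `demo` is a hypothetical driver: the transfer blocks until another thread's deposit commits, then completes.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

-- Same shape as the bank transfer: retry while funds are short.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  if balance < amount
    then retry                 -- sleep until a TVar read so far (here: from) changes
    else do
      modifyTVar' from (subtract amount)
      modifyTVar' to (+ amount)

-- Start a transfer that cannot succeed yet, then deposit from another thread.
-- The blocked transaction is re-run only after the deposit commits.
demo :: IO (Int, Int)
demo = do
  a <- newTVarIO 0
  b <- newTVarIO 0
  _ <- forkIO (threadDelay 100000 >> atomically (modifyTVar' a (+ 50)))
  atomically (transfer a b 30)   -- waits for the deposit above
  (,) <$> readTVarIO a <*> readTVarIO b

main :: IO ()
main = demo >>= print   -- prints (20,30)
```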

OrElse: trying alternatives

• Don't always just want to retry forever

• Sometimes we need to try something else

– orElse :: STM a → STM a → STM a

• Compose two atomic sections into one

• If the first calls retry, run the second instead.
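A sketch of `orElse` choosing between two withdrawals, assuming the `stm` package (`withdraw`, `payFrom` and `demo` are hypothetical names). The first alternative calls `retry` through `check` when funds are short, so its effects are discarded and the second alternative runs instead.

```haskell
import Control.Concurrent.STM

-- Withdraw from one account, retrying (blocking) if funds are short.
withdraw :: TVar Int -> Int -> STM ()
withdraw acc n = do
  bal <- readTVar acc
  check (bal >= n)             -- check False behaves like retry
  writeTVar acc (bal - n)

-- Try the primary account; if that transaction retries, try the backup.
-- Returns which account paid.
payFrom :: TVar Int -> TVar Int -> Int -> STM String
payFrom primary backup n =
  (withdraw primary n >> return "primary")
    `orElse` (withdraw backup n >> return "backup")

demo :: IO (String, (Int, Int))
demo = do
  p <- newTVarIO 10
  b <- newTVarIO 100
  who <- atomically (payFrom p b 25)
  balances <- (,) <$> readTVarIO p <*> readTVarIO b
  return (who, balances)

main :: IO ()
main = demo >>= print   -- prints ("backup",(10,75))
```

If both alternatives retry, the combined transaction blocks until one of the TVars either side read changes.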

Treating the world as a transaction

• You can actually run IO actions from STM

– GHC.Conc.unsafeIOToSTM :: IO a → STM a

• If you can fulfil the proof obligations...

• Useful for say, lifting transactional database actions into transactions in Haskell.

• Mostly we'll try to return a value to the IO monad from the transaction

TMVars

• Like MVars, but instead of blocking explicitly, cause a retry on the thread

• Either full or empty

• Will be woken up once progress can be made

• takeTMVar :: TMVar a → STM a

• putTMVar :: TMVar a → a → STM ()
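A sketch of two threads passing a counter back and forth through a pair of `TMVar`s, assuming the `stm` package (`echoPlusOne` and `pingPong` are hypothetical names). `takeTMVar` on an empty `TMVar` retries, so each side sleeps until the other has put a value.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever)

-- Thread B: repeatedly take a number, add one, and send it back.
echoPlusOne :: TMVar Int -> TMVar Int -> IO ()
echoPlusOne inbox outbox = forever $ do
  n <- atomically (takeTMVar inbox)
  atomically (putTMVar outbox (n + 1))

-- Thread A: send the current value, wait for the reply, repeat.
pingPong :: Int -> IO Int
pingPong rounds = do
  toB <- newEmptyTMVarIO
  toA <- newEmptyTMVarIO
  _ <- forkIO (echoPlusOne toB toA)
  let go k n
        | k <= 0    = return n
        | otherwise = do
            atomically (putTMVar toB n)
            n' <- atomically (takeTMVar toA)
            go (k - 1) n'
  go rounds 0

main :: IO ()
main = pingPong 1000 >>= print   -- prints 1000
```

When `pingPong` returns, the echo thread is left blocked; the runtime detects that it can never wake and quietly ends it.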

Summary of benefits

• STM composes easily!

• Just looks like monadic code

• Even when there are atomic sections involved

• No deadlocks.

• Lock safe code when composed is still lock safe

• Progress: keep your transactions short

Exercises

• Write a program using two threads to pass messages back and forth via TMVars (think about how you will indicate messages via IO)

• Write a thread pool using STM instead of MVars and Chans. 'n' threads should be forked, taking work from a queue, and returning evaluated work.

Data Parallelism: Briefly

Data Parallel Haskell

We can write a lot of parallel programs with the last two techniques, but:

• par/seq are very light, but granularity is hard

• forkIO/MVar/STM are more precise, but more complex

• Trade offs between abstraction and precision

The third way to parallel Haskell programs:

• nested data parallelism

Data Parallel Haskell

Simple idea:

Do the same thing in parallel to every element of a large collection

If your program can be expressed this way, then,

• No explicit threads or communication

• Clear cost model (unlike `par`)

• Good locality, easy partitioning
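For contrast, here is "the same thing to every element" written by hand with sparks rather than DPH: a sketch using `par` and `pseq`, which `GHC.Conc` in `base` exports. It creates one spark per element, so for a cheap `f` the sparks are too fine-grained to pay off. This is the granularity problem with par/seq, which DPH's chunking is meant to avoid. `parMapSpark` is a hypothetical name.

```haskell
import GHC.Conc (par, pseq)

-- One spark per element: y is sparked for parallel evaluation (to WHNF),
-- while this thread goes on to build the rest of the list.
parMapSpark :: (a -> b) -> [a] -> [b]
parMapSpark _ [] = []
parMapSpark f (x : xs) = y `par` (ys `pseq` (y : ys))
  where
    y = f x
    ys = parMapSpark f xs
```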

Page 94: Multicore Programming in Haskell Now!

Parallel Arrays

• Adds parallel array syntax:

– [: e :]

– Along with many parallel “combinators”: mapP, filterP, zipP, foldP, …

– Very high level approach

• Parallel comprehensions

– Actually have parallel semantics

• DPH is oriented towards large array programming

Page 95: Multicore Programming in Haskell Now!

import Data.Array.Parallel

sumsq :: [: Float :] -> Float

sumsq a = sumP [: x * x | x <- a :]

dotp :: [:Float:] -> [:Float:] -> Float

dotp v w = sumP (zipWithP (*) v w)

Similar functions for map, zip, append, filter, length etc.

How such an operation is executed:

• Break the array into N chunks (for N cores)

• Run a sequential loop that applies 'f' to each element of a chunk

• Run that loop on each core

• Combine the results
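The four steps can be imitated by hand for sumsq, as a rough sketch of the idea rather than of DPH's implementation: lists stand in for parallel arrays, and each chunk's sequential loop is sparked with `par` from `GHC.Conc`. Compile with `-threaded` and run with `+RTS -N` for the sparks to use several cores. `chunksOf` and `parSumSq` are hypothetical helpers.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Step 1: split the input into chunks of (at most) k elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = c : chunksOf k rest
  where
    (c, rest) = splitAt k xs

-- Steps 2-4: a strict sequential loop per chunk; each partial sum is sparked
-- with `par` so an idle core may evaluate it, then the partials are combined.
parSumSq :: Int -> [Double] -> Double
parSumSq nChunks xs = combine partials
  where
    n = max 1 nChunks
    size = max 1 ((length xs + n - 1) `div` n)
    partials = map (foldl' (\acc x -> acc + x * x) 0) (chunksOf size xs)
    combine [] = 0
    combine (p : ps) = p `par` (rest `pseq` (p + rest))
      where
        rest = combine ps
```

With N chunks there are only N sparks, each doing a sizeable loop, which is the coarse granularity the slide's scheme aims for.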

Page 96: Multicore Programming in Haskell Now!

Cons of flat data parallelism

While simple, the downside is that a single parallel loop drives the whole program.

Not very compositional.

No rich data structures, just flat things.

So how about nested data parallelism?

Page 97: Multicore Programming in Haskell Now!

Nested Data Parallelism

Simple idea:

Do the same thing in parallel to every element of a large collection

plus

Each thing you do may in turn be a nested parallel computation

Page 98: Multicore Programming in Haskell Now!

Nested Data Parallelism

If your program can be expressed this way, then,

• No explicit threads or communication

• Clear cost model (unlike `par`)

• Good locality, easy partitioning

• Breakthrough:

Flattening: a compiler transformation to systematically transform any nested data parallel program into a flat one
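A toy model of the representation flattening relies on, with lists standing in for arrays: a nested array becomes a flat data array plus a segment descriptor holding each sub-array's length. A map over the nested structure then becomes a single map over the flat data, which can be split evenly across cores however ragged the sub-arrays are. The names `Segmented`, `flatten`, `unflatten`, `mapS` and `sumS` are hypothetical; this is not the vectoriser's actual code.

```haskell
-- Hypothetical model: a nested array as a segment descriptor plus flat data.
data Segmented a = Segmented
  { segLengths :: [Int] -- length of each inner array
  , segData :: [a] -- all elements, concatenated
  }
  deriving (Show, Eq)

flatten :: [[a]] -> Segmented a
flatten xss = Segmented (map length xss) (concat xss)

unflatten :: Segmented a -> [[a]]
unflatten (Segmented ls xs) = go ls xs
  where
    go [] _ = []
    go (n : ns) ys = let (seg, rest) = splitAt n ys in seg : go ns rest

-- A nested map becomes one flat map over the data; the descriptor is unchanged.
mapS :: (a -> b) -> Segmented a -> Segmented b
mapS f (Segmented ls xs) = Segmented ls (map f xs)

-- Segmented sum: one total per segment, using the descriptor to find segments.
sumS :: Num a => Segmented a -> [a]
sumS = map sum . unflatten
```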

Page 99: Multicore Programming in Haskell Now!

import Data.Array.Parallel

Nested data-parallel programming, via the vectoriser:

type Vector = [: Float :]
type Matrix = [: Vector :]

matMul :: Matrix -> Vector -> Vector
matMul m v = [: vecMul r v | r <- m :]

Data parallel functions (vecMul) inside data parallel functions
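As a sequential reference for what matMul computes, here is a list-based model. It assumes `vecMul` is a dot product like `dotp` on the earlier slide, since its definition is not shown, and uses ordinary lists in place of `[: :]` arrays, so nothing here runs in parallel.

```haskell
-- Lists stand in for DPH's [: :] arrays; this model runs sequentially.
type Vector = [Float]
type Matrix = [Vector]

-- Assumption: vecMul is a dot product (the slide does not define it).
vecMul :: Vector -> Vector -> Float
vecMul r v = sum (zipWith (*) r v)

-- Outer level: one vecMul per row. Inner level: each vecMul is itself a
-- whole-vector operation, which is what makes the DPH version nested.
matMul :: Matrix -> Vector -> Vector
matMul m v = [vecMul r v | r <- m]
```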

Page 100: Multicore Programming in Haskell Now!

The vectorizer

• GHC gets significantly smarter

– Implements a vectorizer

– Flattens nested data, changing representations, automatically

– A project to add a GPU backend is well advanced

• See Roman and Manuel's talks today at 4.45

– In the Haskell Implementors Workshop

– “Implementing Data Parallel Haskell”

– “Running Haskell Array Computations on a GPU”

Page 101: Multicore Programming in Haskell Now!

Summary of DPH

Page 102: Multicore Programming in Haskell Now!

Small example: vect.hs

• Uses the dph libraries

– dph-prim-par

– dph

sumSq :: Int -> Int

sumSq n = I.sumP (mapP (\x -> x * x) (enumFromToP 1 n))

Requires -fvectorise

Page 103: Multicore Programming in Haskell Now!

Example: sumsq

• $ ghc -O2 -threaded --make vect.hs -package dph-par -package dph-prim-par-0.3

• $ time ./vect 100000000 +RTS -N2

N = 100000000: 2585/4813 2585/4813 2585/4813

./vect 100000000 +RTS -N2 2.81s user 2.22s system 178% cpu 2.814 total

Page 104: Multicore Programming in Haskell Now!

Notes

• Still in “technology preview”

• Significantly better in GHC 6.12

– More programs actually speed up

• Latest status at:

– http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

Page 105: Multicore Programming in Haskell Now!

Runtime Tweakings

Page 106: Multicore Programming in Haskell Now!

Fine Tuning Runtime Settings

• -C<secs> Context-switch interval in seconds

• -A<size> Sets the minimum allocation area size (default 256k), e.g. -A1m, -A10k

• -M<size> Sets the maximum heap size (default unlimited), e.g. -M256k, -M1G

• -g<n> Use <n> OS threads for GC

• -qm Don't automatically migrate threads between CPUs

• -qw Migrate a thread to the current CPU when it is woken up

• -e<size> Size of spark pools (default 100)

• GC threads

– -g

– -qm Don't automatically migrate threads between CPUs


Page 107: Multicore Programming in Haskell Now!

Summary

Page 108: Multicore Programming in Haskell Now!

Multicore Haskell Now

• Sophisticated, fast runtime

• Sparks and parallel strategies

• Explicit threads

• Messages and MVars for shared memory

• Transactional memory

• Data parallel arrays

• All in GHC 6.10, even better in GHC 6.12

• http://hackage.haskell.org/platform

Page 109: Multicore Programming in Haskell Now!

Thanks

This talk made possible by:

Simon Peyton Jones

Satnam Singh

Manuel Chakravarty

Gabriele Keller

Roman Leshchinskiy

Bryan O'Sullivan

Simon Marlow

Tim Harris

Phil Trinder

Kevin Hammond

Martin Sulzmann

John Goerzen

Read their papers or visit haskell.org for the full story!

Page 110: Multicore Programming in Haskell Now!

References

• http://donsbot.wordpress.com/2009/09/03/parallel-programming-in-haskell-a-reading-list/

• “Real World Haskell”, O'Sullivan, Goerzen, Stewart, O'Reilly, 2008. Ch. 24, 25, 28.

• “A Tutorial on Parallel and Concurrent Programming in Haskell”, Peyton Jones, Singh, 2008.

• “Runtime Support for Multicore Haskell”, Marlow, Peyton Jones, Singh, 2009.

• “Parallel Performance Tuning for Haskell”, Jones, Marlow, Singh, 2009.

Page 111: Multicore Programming in Haskell Now!

References

• “Harnessing the Multicores: Nested Data Parallelism in Haskell”, Peyton Jones, Leshchinskiy, Keller, Chakravarty, 2008.

• “Haskell on a Shared-Memory Multiprocessor”, Harris, Marlow, Peyton Jones, 2005.

• “Algorithm + Strategy = Parallelism”, Trinder, Hammond, Loidl, Peyton Jones, 1998.

• “Concurrent Haskell”, Peyton Jones, Gordon, Finne, 1996.

• “Tackling the Awkward Squad”, Peyton Jones, 2001.

