loop fusion in haskell

90
Loop fusion in Haskell Roman Leshchinskiy Programming Languages and Systems University of New South Wales

Upload: don-stewart

Post on 16-Nov-2014

939 views

Category:

Documents


2 download

DESCRIPTION

http://unlines.wordpress.com/2009/10/22/talk-on-loop-fusion-in-haskell/by Roman LeshchinskiyRoman gave a talk about loop fusion in Haskell at FP-Syd, the Sydney Functional Programming group. It covered stream fusion and fusion for distributed types which are two of the optimisations that make Data Parallel Haskell fast.Original PDF: http://www.cse.unsw.edu.au/~rl/talks/fp-syd-fusion.pdf

TRANSCRIPT

Page 1: Loop Fusion in Haskell

Loop fusion in Haskell

Roman Leshchinskiy

Programming Languages and SystemsUniversity of New South Wales

Page 2: Loop Fusion in Haskell

What is this about?

What I do

Data Parallel Haskell

compiles nested data-parallel programs to flat data-parallel ones

lots of arrays and collective operations involved

What other people do

array programs with lots of collective operations

Page 3: Loop Fusion in Haskell

What is this about?

What I do

Data Parallel Haskell

compiles nested data-parallel programs to flat data-parallel ones

lots of arrays and collective operations involved

What other people do

array programs with lots of collective operations

zipWith (-)(zipWith (*)

(zipWith (-) (replicate_s segd as1) xs)(zipWith (-) (replicate_s segd bs1) ys))

(zipWith (*)(zipWith (-) (replicate_s segd bs2) ys)(zipWith (-) (replicate_s segd as2) xs))

Page 4: Loop Fusion in Haskell

What is this about?

What I do

Data Parallel Haskell

compiles nested data-parallel programs to flat data-parallel ones

lots of arrays and collective operations involved

What other people do

array programs with lots of collective operations

return . foldl’ hash 5381. map toLower. filter isAlpha =<< readFile f

Page 5: Loop Fusion in Haskell

What is this about?

What I do

Data Parallel Haskell

compiles nested data-parallel programs to flat data-parallel ones

lots of arrays and collective operations involved

What other people do

array programs with lots of collective operations

What everybody wants

no temporary arrays

fused loops

C-like speed

Page 6: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)

Page 7: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)

RULES

"map/map" map f (map g xs) = map (f . g) xs

Page 8: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)

RULES

"map/map" map f (map g xs) = map (f . g) xs

Page 9: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)

RULES

"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)

= filter (λ x → f x && g x) xs

Page 10: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)

RULES

"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)

= filter (λ x → f x && g x) xs

Page 11: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)

RULES

"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)

= filter (λ x → f x && g x) xs"map/filter" map f (filter g xs) = mapFilter f g xs

Page 12: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)

RULES

"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)

= filter (λ x → f x && g x) xs"map/filter" map f (filter g xs) = mapFilter f g xs"map/mapFilter" map f (mapFilter g h xs)

= mapFilter (f . g) h xs"mapFilter/filter" mapFilter f g (filter h xs)

= mapFilter (f λ x → g x && h x) xs...

Page 13: Loop Fusion in Haskell

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)

RULES

"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)

= filter (λ x → f x && g x) xs"map/filter" map f (filter g xs) = mapFilter f g xs"map/mapFilter" map f (mapFilter g h xs)

= mapFilter (f . g) h xs"mapFilter/filter" mapFilter f g (filter h xs)

= mapFilter (f λ x → g x && h x) xs... BAD

IDEA

Page 14: Loop Fusion in Haskell

The challenge

use a constant number of rewrite rules

don’t require new rules for new combinators

make adding new combinators easy

fuse everything!

don’t require specialised compiler support

handle both sequential and parallel loops

Page 15: Loop Fusion in Haskell

Sequential loops

Page 16: Loop Fusion in Haskell

Streams

data Step s a = Yield a s| Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)

Page 17: Loop Fusion in Haskell

Streams

data Step s a = Yield a s| Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)

stepper

state

Page 18: Loop Fusion in Haskell

Streams

data Step s a = Yield a s| Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)

sumS :: Num a ⇒ Stream a → asumS (Stream step s) = go 0 swhere go z s = case step s of

Yield x s’ → go (z+x) s’Done → z

Page 19: Loop Fusion in Haskell

Streams

data Step s a = Yield a s| Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)

stream :: Array a → Stream astream arr = Stream step 0where step i | i < length arr = Yield (arr ! i) (i+1)

| otherwise = Done

Page 20: Loop Fusion in Haskell

Streams

data Step s a = Yield a s| Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)

mapS :: (a → b) → Stream a → Stream bmapS f (Stream step s) = Stream step’ swhere step’ s = case step s of

Yield x s’ → Yield (f x) s’Done → Done

Page 21: Loop Fusion in Haskell

Streams

data Step s a = Yield a s| Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)

unstream :: Stream a → Array aunstream (Stream step s) = <allocate, fill and freeze>

Page 22: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 23: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 24: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 25: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)

= sumS . stream . unstream . mapS f . stream

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 26: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)

= sumS . stream . unstream . mapS f . stream

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 27: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)

= sumS . stream . unstream . mapS f . stream

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 28: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)

= sumS . mapS f . stream

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

Page 29: Loop Fusion in Haskell

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → asum = sumS . stream

map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)

= sumS . mapS f . stream

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

LetGHC

doth

ere

st

Page 30: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = sumS (mapS square ( stream xs))

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 31: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = sumS (mapS square ( stream xs))

stream :: Array a → Stream astream arr = Stream step 0where step i | i < length arr = Yield (arr ! i) (i+1)

| otherwise = Done

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

inline

Page 32: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = sumS ( mapS square (Stream step1 0))wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 33: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = sumS ( mapS square (Stream step1 0))wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

mapS :: (a → b) → Stream a → bmapS f (Stream step s) = Stream step’ swhere step’ s = case step s of

Yield x s’ → Yield (f x) s’Done → Done

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

inline

Page 34: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = sumS (Stream step2 0)wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 35: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = sumS (Stream step2 0)wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

sumS :: Num a ⇒ Stream a → asumS (Stream step s) = go 0 swhere go z s = case step s of

Yield x s’ → go (z+x) s’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

inline

Page 36: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

go z i = case step2 i of

Yield x i’ → go (z+x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 37: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

go z i = case step2 i of

Yield x i’ → go (z+x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

inline

Page 38: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

go z i = case (case step1 i ofYield x i’ → Yield (square x) i’Done → Done) of

Yield x i’ → go (z+x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 39: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

go z i = case (case step1 i of

Yield x i’ → Yield (square x) i’

Done → Done) ofYield x i’ → go (z+x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

case of case

Page 40: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

go z i = case step1 i of

Yield x i’ → go (z + square x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 41: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

go z i = case step1 i of

Yield x i’ → go (z + square x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

inline

Page 42: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherego z i = case (case i < length xs of

True → Yield (xs ! i) (i+1)False → Done) of

Yield x i’ → go (z + square x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 43: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherego z i = case (case i < length xs of

True → Yield (xs ! i) (i+1)False → Done) of

Yield x i’ → go (z + square x) i’Done → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

case of case

Page 44: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherego z i = case i < length xs of

True → go (z + square (xs ! i)) (i+1)False → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 45: Loop Fusion in Haskell

Optimising stream operations

sumsq xs = go 0 0wherego z i = case i < length xs of

True → go (z + square (xs ! i)) (i+1)False → z

optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

Page 46: Loop Fusion in Haskell

Why does it work?

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

go z i = case step2 i of

Yield x i’ → go (z+x) i’Done → z

Page 47: Loop Fusion in Haskell

Why does it work?

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

go z i = case step2 i of

Yield x i’ → go (z+x) i’Done → z

non-recursive

Page 48: Loop Fusion in Haskell

Why does it work?

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

go z i = case step2 i of

Yield x i’ → go (z+x) i’Done → z

non-recursive

non-recursive

Page 49: Loop Fusion in Haskell

Why does it work?

sumsq xs = go 0 0wherestep1 i = case i < length xs of

True → Yield (xs ! i) (i+1)False → Done

step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done

go z i = case step2 i of

Yield x i’ → go (z+x) i’Done → z

non-recursive

non-recursive

recursive

Page 50: Loop Fusion in Haskell

A slight problem

filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of

Yield x s’| f x → Yield x s’| otherwise → step s’

Done → Done

Page 51: Loop Fusion in Haskell

A slight problem

filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of

Yield x s’| f x → Yield x s’| otherwise → step s’

Done → Done

recursive

Page 52: Loop Fusion in Haskell

Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s| Skip s| Done

Page 53: Loop Fusion in Haskell

Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s| Skip s| Done

Page 54: Loop Fusion in Haskell

Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s| Skip s| Done

filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of

Yield x s’| f x → Yield x s’| otherwise → Skip s’

Skip s’ → Skip s’Done → Done

Page 55: Loop Fusion in Haskell

Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s| Skip s| Done

filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of

Yield x s’| f x → Yield x s’| otherwise → Skip s’

Skip s’ → Skip s’Done → Done

non-recursive

Page 56: Loop Fusion in Haskell

Stream fusion - summary

encode loops by streams

implement array operations in terms of streams

eliminate stream/unstream pairs (temporaries)

stream producers are non-recursive

standard optimisations remove overhead (loop fusion)

Standard optimisations: inlining, case-of-case, worker/wrappertransformation, SpecConstr, LiberateCase, specialisation ...

Page 57: Loop Fusion in Haskell

Stream fusion - summary

encode loops by streams

implement array operations in terms of streams

eliminate stream/unstream pairs (temporaries)

stream producers are non-recursive

standard optimisations remove overhead (loop fusion)

Standard optimisations: inlining, case-of-case, worker/wrappertransformation, SpecConstr, LiberateCase, specialisation ...

Page 58: Loop Fusion in Haskell

Parallel loops

Page 59: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

Page 60: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

mapP :: (a → b) → Array a → Array bmapP f xs = <split xs across workers>

<map f over each chunk><collect local results>

Page 61: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

mapP :: (a → b) → Array a → Array bmapP f xs = <split xs across workers>

<map f over each chunk><collect local results>

f is sequential

Page 62: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

mapP :: (a → b) → Array a → Array bmapP f xs = <split xs across workers>

<map f over each chunk><collect local results>

sumP :: Num a ⇒ Array a → asumP xs = <split xs across workers>

<sum each chunk><reduce local sums>

Page 63: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

sumsqP = sumP . mapP square

Page 64: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

sumsqP xs = <split xs across workers><map square over each chunk><collect local results><split results across workers><sum each chunk><reduce local sums>

Page 65: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

sumsqP xs = <split xs across workers><map square over each chunk><collect local results><split results across workers><sum each chunk><reduce local sums>

Page 66: Loop Fusion in Haskell

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

sumsqP xs = <split xs across workers><map square over each chunk><collect local results><split results across workers><sum each chunk><reduce local sums>

Page 67: Loop Fusion in Haskell

Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)

data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double

splitD distribute an array across threadsjoinD collect thread-local chunks

mapD execute a sequential operation in each threadsumD compute sum of local values

Page 68: Loop Fusion in Haskell

Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)

data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double

splitD distribute an array across threadsjoinD collect thread-local chunks

mapD execute a sequential operation in each threadsumD compute sum of local values

Page 69: Loop Fusion in Haskell

Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)

data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double

splitD distribute an array across threadsjoinD collect thread-local chunks

mapD execute a sequential operation in each threadsumD compute sum of local values

splitD :: Array a → Dist (Array a)joinD :: Dist (Array a) → Array a

Page 70: Loop Fusion in Haskell

Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)

data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double

splitD distribute an array across threadsjoinD collect thread-local chunks

mapD execute a sequential operation in each threadsumD compute sum of local values

splitD :: Array a → Dist (Array a)joinD :: Dist (Array a) → Array amapD :: (a → b) → Dist a → Dist bsumD :: Num a ⇒ Dist a → a

Page 71: Loop Fusion in Haskell

Programming with distributed types

mapP f xs = <split xs across workers><map f over each chunk><collect local results>

Page 72: Loop Fusion in Haskell

Programming with distributed types

mapP f = joinD -- collect. mapD (map f) -- map f over chunks. splitD -- split

Page 73: Loop Fusion in Haskell

Programming with distributed types

mapP f = joinD -- collect. mapD (map f) -- map f over chunks. splitD -- split

sumP xs = <split xs across workers><sum each chunk><reduce local sums>

Page 74: Loop Fusion in Haskell

Programming with distributed types

mapP f = joinD -- collect. mapD (map f) -- map f over chunks. splitD -- split

sumP = sumD -- reduce. mapD sum -- sum each chunk. splitD -- split

Page 75: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumP . mapP square

Page 76: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumD -- reduce. mapD sum -- sum each chunk. splitD -- split. joinD -- collect. mapD (map square) -- map square over chunks. splitD -- split

Page 77: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumD -- reduce. mapD sum -- sum each chunk. splitD -- split. joinD -- collect. mapD (map square) -- map square over chunks. splitD -- split

RULES

splitD (joinD xs) = xs

Page 78: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumD -- reduce. mapD sum -- sum each chunk. mapD (map square) -- map square over chunks. splitD -- split

RULES

splitD (joinD xs) = xs

Page 79: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumD -- reduce. mapD sum -- sum each chunk. mapD (map square) -- map square over chunks. splitD -- split

RULES

splitD (joinD xs) = xsmapD f (mapD g xs) = mapD (f . g) xs

Page 80: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumD -- reduce

. mapD (sum . map square) -- work

. splitD -- split

RULES

splitD (joinD xs) = xsmapD f (mapD g xs) = mapD (f . g) xs

Page 81: Loop Fusion in Haskell

Fusing distributed types

sumsqP = sumD -- reduce

. mapD (sum . map square) -- work

. splitD -- split

RULES

splitD (joinD xs) = xsmapD f (mapD g xs) = mapD (f . g) xs

stream fusion

Page 82: Loop Fusion in Haskell

Distributed types on multicores

data Dist a a is distributed across threads

splitD distribute xs across threadsjoinD collect thread-local chunksmapD execute a sequential operation in each thread

splitD/joinD eliminate communicationmapD/mapD eliminate synchronisation

Page 83: Loop Fusion in Haskell

Distributed types on clusters

data Dist a a is distributed across nodes

splitD scatterjoinD gathermapD execute operation on each node

splitD/joinD eliminate communicationmapD/mapD eliminate synchronisation

Page 84: Loop Fusion in Haskell

Distributed types on GPUs

data Dist a a is in GPU memory

splitD CPU −→ GPU transferjoinD GPU −→ CPU transfermapD execute kernel on the GPU

splitD/joinD eliminate memory transfers (communication)mapD/mapD fuse kernels (synchronisation)

Page 85: Loop Fusion in Haskell

Distribured types – summary

encode parallel loops as split/work/join

eliminate unnecessary split/join pairs

fuse sequential work (stream fusion)

very general mechanism for fusing parallel computations

applicable to a wide range of architectures

again, no specialised compiler support

Page 86: Loop Fusion in Haskell

Obligatory benchmark1

10

100

1000

1 2 4 8

sumsq, Haskell sumsq, C dotp, Haskelldotp, C smvm, Haskell smvm, C

1

10

100

1000

10000

1 2 4 8 16 32 64

Runtime @ greyarea

sumsq, Haskell sumsq, C dotp, Haskelldotp, C smvm, Haskell smvm, C

Page 87: Loop Fusion in Haskell

Parting thoughts

it’s nice, it’s easy to use, it works

high-level functional programs compiled to highly efficient code

even parallel ones!

rewrite rules + great optimiser = win

DPH doesn’t require any special-purpose optimisations

try this in an imperative language...

Stream fusion: dph, bytestring, vector, uvector

Distributed types: dph

Page 88: Loop Fusion in Haskell

Parting thoughts

it’s nice, it’s easy to use, it works

high-level functional programs compiled to highly efficient code

even parallel ones!

rewrite rules + great optimiser = win

DPH doesn’t require any special-purpose optimisations

try this in an imperative language...

Stream fusion: dph, bytestring, vector, uvector

Distributed types: dph

don’t

Page 89: Loop Fusion in Haskell

Parting thoughts

it’s nice, it’s easy to use, it works

high-level functional programs compiled to highly efficient code

even parallel ones!

rewrite rules + great optimiser = win

DPH doesn’t require any special-purpose optimisations

try this in an imperative language...

Stream fusion: dph, bytestring, vector, uvector

Distributed types: dph

don’t

Page 90: Loop Fusion in Haskell