Data Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Upload: silvia-moore

Post on 19-Jan-2016


Data Parallelism

Companion slides for The Art of Multiprocessor Programming

by Maurice Herlihy & Nir Shavit

Ever Wonder …

When did the term multicore become popular?

“A multi-core processor is a single computing component with two or more independent actual central processing units, which are the units that read and execute program instructions.” (Wikipedia)

Let’s Ask Google Ngram!

[Figure: usage of “multicore” in books by publication year]

This part since 2000 is obvious …

???

multicore cable, multicore fiber

but we digress …

WordCount

alpha 8
bravo 3
charlie 9
zulu 1

easy to do sequentially … what about in parallel?

MapReduce

chapter 1, chapter 2, …, chapter k

split text among mapping threads

Map Phase

a mapping thread per chapter

each mapping thread: must count words!

Map Phase

chapter 1

alpha 9, juliet 2, alpha 1, tango 4

… each mapper thread produces a stream … of key-value pairs …

key: word
value: local count

Mapper Class

    import java.util.Map;
    import java.util.concurrent.RecursiveTask;

    abstract class Mapper<IN, K, V> extends RecursiveTask<Map<K, V>> {
      IN input;
      public void setInput(IN anInput) {
        input = anInput;
      }
    }

IN (input): document fragment
K (key): individual word
V (value): local count
RecursiveTask: a task that runs in parallel with other tasks
Map<K, V>: produces a map: word → count
setInput(): initialize input: which document fragment?

WordCount Mapper

    class WordCountMapper extends mapreduce.Mapper<List<String>, String, Long> {
      …
    }

List<String>: document fragment is a list of words
String, Long: map each word … to its count in the fragment

WordCount Mapper

the compute() method constructs the local word count

      // compute() is protected in RecursiveTask, so the override cannot be less visible
      protected Map<String,Long> compute() {
        // create a map to hold the output
        Map<String,Long> map = new HashMap<>();
        // examine each word in the document fragment
        for (String word : input) {
          // increment that word’s count in the map
          map.merge(word, 1L, (x, y) -> x + y);
        }
        // when the local count is complete, return the map
        return map;
      }
    }  // end of WordCountMapper
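The mapper above can be exercised end to end with a small sketch. The names `SketchMapper`, `SketchWordCountMapper` and `MapperDemo` are hypothetical stand-ins for the slides' `mapreduce.Mapper` and `WordCountMapper`, and running the task on the common fork/join pool is an assumption; the slides do not show how mappers are launched.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical stand-in for the slides' mapreduce.Mapper base class.
abstract class SketchMapper<IN, K, V> extends RecursiveTask<Map<K, V>> {
    protected IN input;
    public void setInput(IN anInput) { input = anInput; }
}

// Counts how often each word occurs in one document fragment.
class SketchWordCountMapper extends SketchMapper<List<String>, String, Long> {
    @Override
    protected Map<String, Long> compute() {
        Map<String, Long> map = new HashMap<>();
        for (String word : input) {
            map.merge(word, 1L, Long::sum);  // add 1 to the word's local count
        }
        return map;
    }
}

class MapperDemo {
    // Runs a single mapper on a single fragment (assumed launch mechanism).
    static Map<String, Long> countFragment(List<String> fragment) {
        SketchWordCountMapper mapper = new SketchWordCountMapper();
        mapper.setInput(fragment);
        return ForkJoinPool.commonPool().invoke(mapper);
    }

    public static void main(String[] args) {
        System.out.println(countFragment(Arrays.asList("alpha", "bravo", "alpha")));
    }
}
```

In a full MapReduce run, one such task would be forked per fragment and the resulting maps handed to the reducers.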

Reduce Phase

a reducer thread merges mapper outputs

mapper outputs:
alpha 2, juliet 1, tango 1
alpha 1, foxtrot 1, papa 1, tango 1
alpha 1, oscar 1, bravo 2, …

merged output:
alpha 4, bravo 2, …, zulu 1

the reducer task produces a stream … of key-value pairs …

key: word
value: word count

Reducer Class

    abstract class Reducer<K, V, OUT> extends RecursiveTask<OUT> {
      K key;
      List<V> valueList;
      public void setInput(K aKey, List<V> aList) {
        key = aKey;
        valueList = aList;
      }
    }

K key: each reducer is given a single key (word)
List<V> valueList: and a list of associated values (word count per fragment)
OUT: it produces a single summary value (the total count for that word)
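The slides do not show a concrete reducer for WordCount, so the following is only a sketch of what one might look like. `SketchReducer`, `SketchWordCountReducer` and `ReducerDemo` are hypothetical names, and summing the per-fragment counts in a plain loop is an assumption that follows from the annotations above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical stand-in for the slides' Reducer base class.
abstract class SketchReducer<K, V, OUT> extends RecursiveTask<OUT> {
    protected K key;
    protected List<V> valueList;
    public void setInput(K aKey, List<V> aList) {
        key = aKey;
        valueList = aList;
    }
}

// Sums the per-fragment counts for one word.
class SketchWordCountReducer extends SketchReducer<String, Long, Long> {
    @Override
    protected Long compute() {
        long total = 0;
        for (long count : valueList) {
            total += count;
        }
        return total;
    }
}

class ReducerDemo {
    static long totalFor(String word, List<Long> perFragmentCounts) {
        SketchWordCountReducer reducer = new SketchWordCountReducer();
        reducer.setInput(word, perFragmentCounts);
        return ForkJoinPool.commonPool().invoke(reducer);
    }

    public static void main(String[] args) {
        // Counts for "alpha" from three fragments: 2 + 1 + 1
        System.out.println(totalFor("alpha", Arrays.asList(2L, 1L, 1L)));  // prints 4
    }
}
```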

WordCount

normalizing a document’s word count gives a fingerprint vector

fingerprint vector entries: 0.037, 0.002, 0.045, 0.000

Document Fingerprint

a fingerprint is a point in a high-dimensional space
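A minimal sketch of the normalization step, assuming that "normalizing" means dividing each word's count by the total number of words in the document (the slides do not spell out the formula). `Fingerprint` and `normalize` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class Fingerprint {
    // Assumed normalization: each count divided by the document's total word count.
    static Map<String, Double> normalize(Map<String, Long> counts) {
        double total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        Map<String, Double> fingerprint = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            fingerprint.put(e.getKey(), e.getValue() / total);
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("alpha", 3L);
        counts.put("bravo", 1L);
        System.out.println(normalize(counts));
    }
}
```

Each distinct word is one dimension, which is why the fingerprint lives in a high-dimensional space.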

Clustering

romance novels
tango lyrics
Usenix Proceedings

k-means

Find k clusters from raw data

each vector is closer to those in the same cluster … than to those in different clusters.

MapReduce

thread 1, thread 2, …, thread k

split points among mapping threads

k-means

Reducer picks k “centers” at random

Reduce Phase

reducer sends key-value pairs to mappers

key: cluster number
value: center point

0 c0
1 c1
2 c2

Mappers

Each mapper uses centers to assign each vector to a cluster

0 c0
1 c1
2 c2

mapper sends key-value stream to reducer

key: point
value: cluster ID

p0 2
p1 1
p2 1
p3 0
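The assignment step each mapper performs can be sketched as follows. The slides do not name a distance measure, so squared Euclidean distance is an assumption here, and `NearestCenter`, `distanceSquared` and `assign` are hypothetical names.

```java
class NearestCenter {
    // Squared Euclidean distance between two vectors of equal length (assumed metric).
    static double distanceSquared(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the cluster ID (index into centers) of the center closest to the point.
    static int assign(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distanceSquared(point, centers[c]) < distanceSquared(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = { {0, 0}, {10, 10}, {0, 10} };
        System.out.println(assign(new double[] {9, 8}, centers));  // prints 1
    }
}
```

A mapper would call `assign` once per point in its share of the data and emit the (point, cluster ID) pair.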

Back at the Reducer

The reducer merges the streams … and assembles clusters …

C0 = {…}
C1 = {…}
C2 = {…}

The reducer computes new centers based on new clusters …

0 c0’
1 c1’
2 c2’

Once is Not Enough

reducer sends new centers to mappers

process ends when centers become stable
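The full assign / recompute / repeat cycle can be sketched sequentially on one-dimensional points. This is a simplification under stated assumptions: the initial centers are passed in rather than picked at random (to keep the run reproducible), "stable" is taken to mean the centers are exactly unchanged, and `KMeans1D`, `nearest` and `run` are hypothetical names.

```java
import java.util.Arrays;

class KMeans1D {
    // Index of the center closest to p.
    static int nearest(double p, double[] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Repeats the map and reduce steps until the centers stop changing
    // or maxRounds is reached.
    static double[] run(double[] points, double[] initialCenters, int maxRounds) {
        double[] centers = initialCenters.clone();
        for (int round = 0; round < maxRounds; round++) {
            double[] sum = new double[centers.length];
            int[] size = new int[centers.length];
            for (double p : points) {              // "map": assign each point to a cluster
                int c = nearest(p, centers);
                sum[c] += p;
                size[c]++;
            }
            double[] next = centers.clone();       // "reduce": compute new centers
            for (int c = 0; c < centers.length; c++) {
                if (size[c] > 0) {
                    next[c] = sum[c] / size[c];
                }
            }
            if (Arrays.equals(next, centers)) {    // centers are stable: stop
                break;
            }
            centers = next;
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] result = run(new double[] {1, 2, 3, 10, 11, 12}, new double[] {0, 5}, 100);
        System.out.println(Arrays.toString(result));  // prints [2.0, 11.0]
    }
}
```

In the parallel version the inner loop over points is what gets split among the mapping threads; the recomputation of centers stays with the reducer.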

To Recapitulate

We saw two problems …
wordcount & k-means …
with similar parallel solutions

Map part is parallel …
Reduce part is sequential.

abstraction

Map Function

(k1, v1) → list(k2, v2)

wordcount: (doc, contents) → (word, count)
k-means: (cluster ID, center) → (point, cluster ID)

Reduce Function

(k2, list(v2)) → list(v2)

wordcount: (word, list of counts) → count
k-means: (cluster ID, list of points) → new cluster center
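The two signatures can be written down as Java generics with a small sequential driver that groups the map output by key before calling reduce. This is only an illustration of the types, not a parallel implementation; `MiniMapReduce`, `run` and `wordCount` are hypothetical names, and splitting documents on whitespace is an assumption.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class MiniMapReduce {
    // mapFn:    (k1, v1)       -> list(k2, v2)
    // reduceFn: (k2, list(v2)) -> list(v2)
    static <K1, V1, K2, V2> Map<K2, List<V2>> run(
            Map<K1, V1> inputs,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, List<V2>> reduceFn) {
        // Group every emitted value under its key.
        Map<K2, List<V2>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K1, V1> in : inputs.entrySet()) {
            for (Map.Entry<K2, V2> out : mapFn.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce each key's list of values.
        Map<K2, List<V2>> result = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> g : grouped.entrySet()) {
            result.put(g.getKey(), reduceFn.apply(g.getKey(), g.getValue()));
        }
        return result;
    }

    // WordCount: (doc, contents) -> (word, 1) pairs; (word, counts) -> total.
    static Map<String, List<Long>> wordCount(Map<String, String> docs) {
        return run(docs,
            (String doc, String contents) -> {
                List<Map.Entry<String, Long>> pairs = new ArrayList<>();
                for (String w : contents.split("\\s+")) {
                    pairs.add(new SimpleEntry<>(w, 1L));
                }
                return pairs;
            },
            (String word, List<Long> counts) -> {
                long total = 0;
                for (long c : counts) {
                    total += c;
                }
                List<Long> out = new ArrayList<>();
                out.add(total);
                return out;
            });
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "alpha bravo alpha");
        docs.put("d2", "alpha");
        System.out.println(wordCount(docs));  // prints {alpha=[3], bravo=[1]}
    }
}
```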

Example

Distributed Grep
Map: line of document
Reduce: copy line to display

Example

URL Access Frequency
Map: (URL, local count)
Reduce: (URL, total count)

Example

Reverse web link graph
Map: (target link, source page)
Reduce: (target link, list of source pages)
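As an illustration of the last example, the reverse web link graph can be sketched sequentially with Java collectors: grouping by target plays the role of the shuffle, and collecting the sources into a list plays the role of reduce. Representing a link as a two-element `String[]` of {source page, target link} is an assumption, and `ReverseLinks` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ReverseLinks {
    // Each link is {source page, target link}; the result maps
    // each target link to the list of pages that link to it.
    static Map<String, List<String>> reverse(List<String[]> links) {
        return links.stream()
            .collect(Collectors.groupingBy(
                (String[] link) -> link[1],
                Collectors.mapping((String[] link) -> link[0], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> links = Arrays.asList(
            new String[] {"a", "x"},
            new String[] {"b", "x"},
            new String[] {"a", "y"});
        System.out.println(reverse(links));
    }
}
```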

Other Examples

histogram
matrix multiplication
PageRank
Betweenness centrality

Distributed MapReduce

Google, Hadoop, etc. …
Communication by message
Fault-tolerance important

Multicore MapReduce

Phoenix, Phoenix++, Metis …
Communication by shared memory objects
Fault-tolerance unimportant

Costs

key-value layout
memory allocation
mechanism overhead
cache pressure
static vs dynamic

Part Two

Data Streams

Streams

data source → transformation → transformation → data consumer

sequence of transformations
sometimes in parallel
no relation to I/O streams

transformations given by mathematical functions

transformation creates new stream
no modifications or side-effects
correctness easier?

Functional Programming

functions map old state to new state
old state never changed
no complex side-effects
elegant, easier proofs of correctness

Oh, Really?

“Functional languages are unnatural to use; but so are knives and forks, diplomatic protocols, double-entry bookkeeping, and a host of other things modern civilization has found useful.”

Jim Morris, 1982

Haiku

esthetically pleasing
only works on those who understand Haiku

Karate

esthetically pleasing
works even on those who do not understand Karate

Jim Morris’s Question

Is functional programming more like Haiku or Karate?

1981: Haiku
Today: Karate

Laziness

1,2,3,… → [x → x+1] → [x → 2x] → sum

No computation until absolutely necessary

add 1 to each element: no computation
double each element: no computation
sum: Σ 2(x_i + 1) is terminal

Laziness

1,2,3,… → [x → x+1] → [x → 2x] → collect in List

move to container is terminal

Laziness

Laziness permits optimizations

1,2,3,… → [x → 2(x+1)] → collect in List
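Laziness can be observed directly in Java streams by counting how often a mapping function is called. In this sketch the counter `calls` and the names `LazyDemo` and `pipeline` are hypothetical; the side effect in the first `map` exists only to make the laziness visible and is not something the slides recommend.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyDemo {
    // Counts invocations of the "add 1" step (for demonstration only).
    static final AtomicInteger calls = new AtomicInteger();

    // Builds the pipeline x -> x+1 -> 2x but does not run it.
    static Stream<Integer> pipeline() {
        return Stream.of(1, 2, 3)
            .map(x -> { calls.incrementAndGet(); return x + 1; })
            .map(x -> 2 * x);
    }

    public static void main(String[] args) {
        Stream<Integer> s = pipeline();
        System.out.println(calls.get());  // prints 0: no element has been processed yet
        int sum = s.mapToInt(Integer::intValue).sum();  // terminal operation
        System.out.println(sum);          // prints 18
        System.out.println(calls.get());  // prints 3
    }
}
```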

Laziness

1 1 2 3 5 8 13 21 34 …

Laziness permits infinite streams …

    Stream<Integer> fib = new FibStream();
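The slides do not show the `FibStream` class, so the sketch below swaps in a different technique: `Stream.iterate` over a pair of consecutive values. `FibDemo`, `fib` and `firstN` are hypothetical names. The stream is unbounded, and `limit` decides how much of it is ever computed.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FibDemo {
    // Infinite stream of Fibonacci numbers; each state is the pair {current, next}.
    static Stream<Long> fib() {
        return Stream.iterate(new long[] {1, 1}, p -> new long[] {p[1], p[0] + p[1]})
                     .map(p -> p[0]);
    }

    // Materializes only the first n elements.
    static List<Long> firstN(int n) {
        return fib().limit(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstN(9));  // prints [1, 1, 2, 3, 5, 8, 13, 21, 34]
    }
}
```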

Unbounded Random Stream

    Stream<Double> randomDoubleStream() {
      return Stream.generate(
        () -> random.nextDouble()
      );
    }

Stream<Double>: unbounded stream of double-precision random numbers
Stream.generate: stream that generates new elements on the fly
() -> random.nextDouble(): function to call when generating a new element; example of a Java lambda expression (anonymous method)
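A self-contained version of the random stream, with a bounded sample taken from it. The `random` field is not shown on the slides, so declaring it as a `java.util.Random` is an assumption, and `RandomStreamDemo` and `sample` are hypothetical names.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RandomStreamDemo {
    // Assumed source of randomness; the slides do not show this declaration.
    static final Random random = new Random();

    // Unbounded stream: a new element is generated each time one is requested.
    static Stream<Double> randomDoubleStream() {
        return Stream.generate(() -> random.nextDouble());
    }

    // limit(n) truncates the unbounded stream so that collect can terminate.
    static List<Double> sample(int n) {
        return randomDoubleStream().limit(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sample(5));
    }
}
```

Without `limit`, a terminal operation such as `collect` on this stream would never finish.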

WordCount

readFile puts each word from the document into a List

    List<String> readFile(String fileName) {
      …
      return reader
        .lines()
        .map(String::toLowerCase)
        .flatMap(s -> pattern.splitAsStream(s))
        .collect(Collectors.toList());
    }

how a stream program looks: each line creates a new stream

…: open the file, create a FileReader (wrapped in a BufferedReader, which provides lines())
.lines(): turn the reader into a stream of lines, each line a string
.map: creates a new stream by applying a function to each stream element; here, converts to lower case
.flatMap: replaces one stream element with multiple stream elements; here, splits line into words
.collect: puts stream elements in a container; here, in a List; terminal operation

No loops
No conditionals
No mutable objects
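To try the pipeline without a file, the sketch below runs the same stages over a `BufferedReader` built from a string. The elided file-opening code is left out on purpose. The `\W+` separator pattern and the extra `filter` that drops empty strings are assumptions not shown on the slides, and `ReadWords` and `readWords` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ReadWords {
    // Assumed word separator: any run of non-word characters.
    static final Pattern pattern = Pattern.compile("\\W+");

    static List<String> readWords(BufferedReader reader) {
        return reader
            .lines()                                 // stream of lines
            .map(String::toLowerCase)                // lower-case each line
            .flatMap(s -> pattern.splitAsStream(s))  // one line -> many words
            .filter(w -> !w.isEmpty())               // added: skip empty fragments
            .collect(Collectors.toList());           // terminal: gather into a List
    }

    public static void main(String[] args) {
        BufferedReader reader =
            new BufferedReader(new StringReader("Alpha bravo,\nALPHA charlie"));
        System.out.println(readWords(reader));  // prints [alpha, bravo, alpha, charlie]
    }
}
```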

WordCount

now let’s count the words

    Map<String,Long> map = text       // start with list of words
      .stream()                       // turn List into a stream
      .collect(                       // put stream into a container (Map): word → count
        Collectors.groupingBy(
          Function.identity(),        // each element’s key is that element
          Collectors.counting()));    // each element’s value is the number of times it appears

No loops
No conditionals
No mutable objects
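A runnable version of the counting step, together with a variant that uses `parallelStream()` on the same collector. The parallel variant is an addition not shown on this slide; it produces the same map, but whether it is faster depends on the input size and the machine. `StreamWordCount`, `count` and `countParallel` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamWordCount {
    // Sequential count: each word maps to the number of times it appears.
    static Map<String, Long> count(List<String> text) {
        return text.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Same pipeline on a parallel stream; the collector merges the partial maps.
    static Map<String, Long> countParallel(List<String> text) {
        return text.parallelStream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("alpha", "bravo", "alpha", "zulu");
        System.out.println(count(text));
        System.out.println(countParallel(text).equals(count(text)));  // prints true
    }
}
```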

Page 92: Data Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit TexPoint fonts used in EMF. Read the TexPoint

k-Means

Art of Multiprocessor Programming 92

class Point { Point(double x, double y) {…} Point plus(Point other) {…} Point scale(double x) {…} static Point barycenter( List<Point> cluster ) {…}}

barycenter is stream-based!

Barycenter

Stream Barycenter

Point barycenter(List<Point> cluster) {
  double numPoints = cluster.size();
  Optional<Point> sum = cluster
    .stream()
    .reduce(Point::plus);
  return sum.get()
    .scale(1 / numPoints);
}

cluster size

turn List into Stream

Reduce

stream.reduce(+):

()       → ∅ (empty)
(a)      → a
(a,b)    → a+b
(a,b,c)  → (a+b)+c
etc. …

terminal operation
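
The cases in the table can be checked with a small sketch. Integer addition stands in for the generic "+", and the class and method names (`ReduceSketch`, `sum`, `sumWithIdentity`) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class ReduceSketch {
    // One-argument reduce: the result is an Optional,
    // which is empty when the stream has no elements.
    static Optional<Integer> sum(List<Integer> values) {
        return values.stream().reduce(Integer::sum);
    }

    // Two-argument reduce: an identity value removes the need for Optional.
    static int sumWithIdentity(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sum(Collections.<Integer>emptyList())); // () case
        System.out.println(sum(Arrays.asList(7)));                 // (a) case
        System.out.println(sum(Arrays.asList(1, 2, 3)));           // (a,b,c) case
        System.out.println(sumWithIdentity(Collections.<Integer>emptyList()));
    }
}
```

Because reduce is a terminal operation, each call consumes its stream; a fresh stream is created for every call here.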

sum points in cluster

Optional because sum might be empty!

extract sum and divide by # points
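
A self-contained version of the barycenter computation might look like the following sketch. The slides elide the bodies of `plus` and `scale`, so the implementations below are assumptions (plain 2-D vector arithmetic). The wrapper class `BarycenterSketch` and the explicit error for an empty cluster are also illustrative choices; the slide code calls `sum.get()` directly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class BarycenterSketch {
    // Immutable 2-D point; method bodies are assumed, not taken from the slides.
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        Point plus(Point other) { return new Point(x + other.x, y + other.y); }
        Point scale(double s) { return new Point(x * s, y * s); }
    }

    static Point barycenter(List<Point> cluster) {
        double numPoints = cluster.size();            // cluster size
        Optional<Point> sum = cluster
            .stream()                                 // turn List into Stream
            .reduce(Point::plus);                     // sum points in cluster
        // The Optional is empty for an empty cluster; fail with a clear message.
        return sum
            .orElseThrow(() -> new IllegalArgumentException("empty cluster"))
            .scale(1 / numPoints);                    // divide by # points
    }

    public static void main(String[] args) {
        Point c = barycenter(Arrays.asList(new Point(0, 0), new Point(2, 4)));
        System.out.printf("(%.1f, %.1f)%n", c.x, c.y);
    }
}
```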

k-Means

List<Point> points = readFile("cluster.dat");
centers = randomDistinctCenters(points);
double convergence = 1.0;
while (convergence > EPSILON) {
  …
}

read points from file

pick random centers

keep going until centers are stable

Compute New Clusters

while (convergence > EPSILON) {
  Map<Integer,List<Point>> clusters = points
    .stream()
    .collect(
      Collectors.groupingBy(
        p -> closestCenter(centers, p)
      )
    );
}

turn list of points into a Stream

put each point in a map
key is closest center
value is list of points with that center

k-Means Cont’d

while (convergence > EPSILON) {
  ...
  Map<Integer, Point> newCenters = clusters
    .entrySet()
    .stream()
    .collect(
      Collectors.toMap(
        e -> e.getKey(),
        e -> Point.barycenter(e.getValue())
      )
    );
  convergence = distance(centers, newCenters);
  centers = newCenters;
}

compute new centers map: ID → center

turn map into a Stream of (ID, list of points) pairs

turn stream into a map: ID → barycenter

new Key is still the cluster ID

new value is the barycenter computed earlier

If centers have moved, start again with the new centers
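
Putting the two halves of the loop together gives the sketch below. Several pieces are assumptions, because the slides do not show them: the bodies of `closestCenter` and `distance` (here, nearest center by Euclidean distance and the largest movement of any center), the value of `EPSILON`, and the choice to pass initial centers as a parameter instead of calling `readFile` and `randomDistinctCenters`. The local copy `current` exists only because a lambda may capture only effectively final locals.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class KMeansSketch {
    static final double EPSILON = 1e-9; // assumed threshold

    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        Point plus(Point o) { return new Point(x + o.x, y + o.y); }
        Point scale(double s) { return new Point(x * s, y * s); }
        double distanceTo(Point o) { return Math.hypot(x - o.x, y - o.y); }
        static Point barycenter(List<Point> cluster) {
            return cluster.stream().reduce(Point::plus).get()
                          .scale(1.0 / cluster.size());
        }
    }

    // Assumed helper: ID of the center nearest to p.
    static Integer closestCenter(Map<Integer, Point> centers, Point p) {
        return centers.entrySet().stream()
            .min(Comparator.comparingDouble(
                (Map.Entry<Integer, Point> e) -> e.getValue().distanceTo(p)))
            .get().getKey();
    }

    // Assumed helper: largest distance any center moved in this round.
    static double distance(Map<Integer, Point> oldC, Map<Integer, Point> newC) {
        return newC.entrySet().stream()
            .mapToDouble(e -> e.getValue().distanceTo(oldC.get(e.getKey())))
            .max().orElse(0.0);
    }

    static Map<Integer, Point> kMeans(List<Point> points, Map<Integer, Point> initial) {
        Map<Integer, Point> centers = initial;
        double convergence = 1.0;
        while (convergence > EPSILON) {
            final Map<Integer, Point> current = centers;
            // key: closest center ID, value: points with that center
            Map<Integer, List<Point>> clusters = points.stream()
                .collect(Collectors.groupingBy(p -> closestCenter(current, p)));
            // key: same cluster ID, value: barycenter of the cluster
            Map<Integer, Point> newCenters = clusters.entrySet().stream()
                .collect(Collectors.toMap(
                    e -> e.getKey(),
                    e -> Point.barycenter(e.getValue())));
            convergence = distance(centers, newCenters);
            centers = newCenters;
        }
        return centers;
    }

    public static void main(String[] args) {
        // Hypothetical data: two well-separated pairs of points.
        List<Point> points = Arrays.asList(
            new Point(0, 0), new Point(0, 2), new Point(10, 0), new Point(10, 2));
        Map<Integer, Point> initial = new HashMap<>();
        initial.put(0, new Point(0, 0));
        initial.put(1, new Point(10, 0));
        kMeans(points, initial).forEach(
            (id, c) -> System.out.printf("%d -> (%.1f, %.1f)%n", id, c.x, c.y));
    }
}
```

A center that attracts no points drops out of `newCenters` in this sketch, as it would in the slide code.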

Functional k-Means

many fewer lines of code

easier to read (really!)

easier to reason about

easier to optimize

Parallelism?

Arrays.asList("Arlington", "Berkeley", "Clarendon", "Dartmouth", "Exeter")
  .stream()
  .forEach(s -> System.out.printf("%s\n", s));

So far, Streams are sequential

make List of strings and turn them into a stream

forEach applies a method to each element (not functional)

prints

Arlington
Berkeley
Clarendon
Dartmouth
Exeter

Arrays.asList("Arlington", "Berkeley", "Clarendon", "Dartmouth", "Exeter")
  .parallelStream()
  .forEach(s -> System.out.printf("%s\n", s));

turn List into a parallel stream

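prints the names in an order that can differ from run to run, for example (each name on its own line)

Arlington, Berkeley, Clarendon, Dartmouth, Exeter
or
Dartmouth, Arlington, Clarendon, Berkeley, Exeter
or
Dartmouth, Berkeley, Exeter, Arlington, Clarendon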

Arrays.asList("Arlington", "Berkeley", "Clarendon", "Dartmouth", "Exeter")
  .stream()
  .parallel()
  .forEach(s -> System.out.printf("%s\n", s));

can turn stream into a parallel stream
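
The sketch below contrasts the parallel `forEach` from the slides with two standard alternatives that keep the list order: `forEachOrdered` and `collect`. These alternatives are not in the slides and are shown only as one possible way to get deterministic results. The class name `ParallelSketch` and the helper `upperCased` are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ParallelSketch {
    static final List<String> STREETS = Arrays.asList(
        "Arlington", "Berkeley", "Clarendon", "Dartmouth", "Exeter");

    // collect() on an ordered parallel stream assembles results
    // in encounter order, even though elements are processed concurrently.
    static List<String> upperCased(List<String> names) {
        return names
            .parallelStream()
            .map(String::toUpperCase)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Order of these lines may vary from run to run.
        STREETS.parallelStream().forEach(s -> System.out.printf("%s%n", s));

        // forEachOrdered respects the list order, at some cost in parallelism.
        STREETS.stream().parallel().forEachOrdered(s -> System.out.printf("%s%n", s));

        System.out.println(upperCased(STREETS));
    }
}
```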

Pitfalls

list.stream().forEach(s -> list.add(0));

lambda (function) must not modify source!

source.parallelStream()
  .forEach(s -> target.add(s));

possible exception (or lost elements) if target is not thread-safe
order added is non-deterministic
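
Both pitfalls can be illustrated with a short sketch. The names `PitfallSketch`, `copyInParallel`, and `modifyingSourceFails` are assumptions, and replacing `target.add(s)` with `collect` is one common remedy, not something the slides prescribe. The exception type shown is what the JDK's `ArrayList` reports on a best-effort basis; other collections may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

final class PitfallSketch {
    // Instead of adding to a shared target from forEach, let the stream
    // build the result: no shared mutable state, encounter order is kept.
    static List<String> copyInParallel(List<String> source) {
        return source
            .parallelStream()
            .collect(Collectors.toList());
    }

    // A lambda that modifies its own source interferes with the traversal.
    // With an ArrayList source this is detected and reported as an exception.
    static boolean modifyingSourceFails() {
        List<Integer> list = new ArrayList<>(Arrays.asList(1, 2, 3));
        try {
            list.stream().forEach(s -> list.add(0));
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(copyInParallel(Arrays.asList("a", "b", "c")));
        System.out.println(modifyingSourceFails());
    }
}
```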

Conclusions

Streams support

functional programming

data parallelism

compiler optimizations