Shooting the Rapids: Getting the Best from Java 8 Streams
Kirk Pepperdine @kcpeppe · Maurice Naftalin @mauricenaftalin
Devoxx Belgium, Nov. 2015

TRANSCRIPT

Page 1: Shooting the Rapids

Shooting the Rapids: Getting the Best from Java 8 Streams

Kirk Pepperdine @kcpeppe
Maurice Naftalin @mauricenaftalin

Devoxx Belgium, Nov. 2015

Page 2: Shooting the Rapids

About Kirk

• Specialises in performance tuning
  • speaks frequently about performance
  • author of a performance tuning workshop
• Co-founder, performance diagnostic tooling
• Java Champion (since 2006)


Page 4: Shooting the Rapids

About Maurice

• Co-author and author (book covers shown on the slide)
• Java Champion
• JavaOne Rock Star

Page 8: Shooting the Rapids

Subjects Covered in this Talk

• Background – lambdas and streams
• Performance of our example
• Effect of parallelizing
• Splitting input data efficiently
• When to go parallel
• Parallel streams in the real world

Page 9: Shooting the Rapids

Benchmark Alert

Page 10: Shooting the Rapids

What is a Lambda?

Predicate<Matcher> matches = new Predicate<Matcher>() {
    @Override
    public boolean test(Matcher matcher) {
        return matcher.find();
    }
};

becomes

Predicate<Matcher> matches = matcher -> matcher.find();

A lambda is a function from arguments to result
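A runnable illustration of the two forms side by side (a minimal sketch; the sample input line is taken from the GC-log example later in the talk, the class name is invented):

import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LambdaDemo {
    public static void main(String[] args) {
        Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");

        // Anonymous-class form
        Predicate<Matcher> matchesOld = new Predicate<Matcher>() {
            @Override
            public boolean test(Matcher matcher) {
                return matcher.find();
            }
        };

        // Lambda form: same behaviour, far less ceremony
        Predicate<Matcher> matches = matcher -> matcher.find();

        String line = "2.869: Application time: 1.0001540 seconds";
        System.out.println(matchesOld.test(stoppedTimePattern.matcher(line)));  // true
        System.out.println(matches.test(stoppedTimePattern.matcher(line)));     // true
    }
}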

Page 17: Shooting the Rapids

Example: Processing GC Logfile

⋮
2.869: Application time: 1.0001540 seconds
5.342: Application time: 0.0801231 seconds
8.382: Application time: 1.1013574 seconds
⋮

DoubleSummaryStatistics {count=3, sum=2.181635, min=0.080123, average=0.727212, max=1.101357}

Page 19: Shooting the Rapids

Old School Code

DoubleSummaryStatistics summary = new DoubleSummaryStatistics();
Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");
String logRecord;
while ((logRecord = logFileReader.readLine()) != null) {
    Matcher matcher = stoppedTimePattern.matcher(logRecord);
    if (matcher.find()) {
        double value = Double.parseDouble(matcher.group(1));
        summary.accept(value);
    }
}

Page 20: Shooting the Rapids

Old School Code


Let’s look at the features in this code

Page 21: Shooting the Rapids

Data Source

while ((logRecord = logFileReader.readLine()) != null) { … }

Page 22: Shooting the Rapids

Map to Matcher

Matcher matcher = stoppedTimePattern.matcher(logRecord);

Page 23: Shooting the Rapids

Filter

if (matcher.find()) { … }

Page 24: Shooting the Rapids

Map to Double

double value = Double.parseDouble(matcher.group(1));

Page 25: Shooting the Rapids

Collect Results (Reduce)

summary.accept(value);

Page 26: Shooting the Rapids

Java 8 Streams

• A sequence of values, “in motion”
• Source and intermediate operations set the stream up lazily
• A terminal operation “pulls” values eagerly down the stream

collection.stream()
    .intermediateOp
     ⋮
    .intermediateOp
    .terminalOp
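A small demonstration of that laziness (an illustrative sketch, not from the slides): nothing flows through the pipeline until the terminal operation asks for values.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("one", "two", "three");

        // Building the pipeline prints nothing: source and intermediate ops are lazy
        Stream<String> pipeline = words.stream()
                .peek(w -> System.out.println("processing " + w))
                .map(String::toUpperCase);

        System.out.println("pipeline built, nothing processed yet");

        // The terminal operation pulls values down the stream; only now does peek run
        long count = pipeline.filter(w -> w.length() > 3).count();
        System.out.println(count);  // 1 ("THREE")
    }
}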

Page 27: Shooting the Rapids

Stream Sources

• New method Collection.stream()
• Many other sources (a few shown in action below):
  • Arrays.stream(Object[])
  • Stream.of(Object...)
  • Stream.iterate(Object, UnaryOperator)
  • Files.lines()
  • BufferedReader.lines()
  • Random.ints()
  • JarFile.stream()
  • …
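A few of these sources in action (illustrative only):

import java.util.Random;
import java.util.stream.Stream;

public class SourcesDemo {
    public static void main(String[] args) {
        Stream.of("a", "b", "c").forEach(System.out::println);               // varargs source
        Stream.iterate(1, n -> n * 2).limit(5).forEach(System.out::println); // 1 2 4 8 16
        new Random(42).ints(3, 0, 10).forEach(System.out::println);          // 3 bounded ints
    }
}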

Page 28: Shooting the Rapids

Imperative to Stream

DoubleSummaryStatistics statistics =
    Files.lines(new File("gc.log").toPath())
        .map(stoppedTimePattern::matcher)
        .filter(Matcher::find)
        .map(matcher -> matcher.group(1))
        .mapToDouble(Double::parseDouble)
        .summaryStatistics();

Page 29: Shooting the Rapids

Stream Source

Files.lines(new File("gc.log").toPath())

Page 30: Shooting the Rapids

Intermediate Operations

.map(stoppedTimePattern::matcher)
.filter(Matcher::find)
.map(matcher -> matcher.group(1))
.mapToDouble(Double::parseDouble)

Page 31: Shooting the Rapids

Method References

stoppedTimePattern::matcher    Matcher::find    Double::parseDouble

Page 32: Shooting the Rapids

Terminal Operation

.summaryStatistics();
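Putting the whole pipeline together as a self-contained program (a minimal sketch; the gc.log path, the try-with-resources, and the exception handling are assumptions the slides leave out):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.DoubleSummaryStatistics;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class StoppedTimeStats {
    public static void main(String[] args) throws IOException {
        Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");
        // Close the underlying file by putting the stream in try-with-resources
        try (Stream<String> lines = Files.lines(Paths.get("gc.log"))) {
            DoubleSummaryStatistics statistics = lines
                    .map(stoppedTimePattern::matcher)
                    .filter(Matcher::find)
                    .map(matcher -> matcher.group(1))
                    .mapToDouble(Double::parseDouble)
                    .summaryStatistics();
            System.out.println(statistics);
        }
    }
}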

Page 33: Shooting the Rapids

Visualising Sequential Streams

Source → Map → Filter → Reduction
(Intermediate Operations → Terminal Operation)

“Values in Motion”

[Animation across Pages 33–36: values x0…x3 leave the source one at a time and flow through the map and filter stages towards the reduction; values that pass the filter (✔) reach the terminal operation, values that fail (❌) are dropped]

Page 37: Shooting the Rapids

How Does It Perform?

Old School:  13.3 secs
Sequential:  13.8 secs

– Should be the same workload
– Stream code is cleaner, easier to read

24M line file, MacBook Pro, Haswell i7, 4 cores, hyperthreaded, Java 9.0

Page 38: Shooting the Rapids

Can We Do Better?

• We might be able to, if the workload is parallelizable:
  • split the stream into many segments
  • process each segment
  • combine the results
• These requirements exactly match the Fork/Join workflow

Page 39: Shooting the Rapids

Visualizing Parallel Streams

[Animation across Pages 39–42: the values x0…x3 are split into segments, each segment flows down its own pipeline in parallel, and the partial results (the y values) are combined at the end]

Page 43: Shooting the Rapids

Splitting Stream Sources

• A stream source is described by a Spliterator
  • it can both iterate over the data and – where possible – split it (see the sketch below)
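A quick look at splitting in isolation (illustrative, not from the slides): an array-backed list's spliterator can hand off roughly half of its elements in one trySplit() call.

import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        Spliterator<Integer> rest = data.spliterator();
        // trySplit() returns a new Spliterator covering a prefix of the elements
        // and leaves this one covering the remainder
        Spliterator<Integer> prefix = rest.trySplit();

        prefix.forEachRemaining(x -> System.out.print(x + " "));  // 1 2 3 4
        System.out.println();
        rest.forEachRemaining(x -> System.out.print(x + " "));    // 5 6 7 8
    }
}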

Page 44: Shooting the Rapids

Splitting the Data

[Animation across Pages 44–52: the stream source is split repeatedly into smaller and smaller segments]

Page 53: Shooting the Rapids

Parallel Streams

DoubleSummaryStatistics statistics =
    Files.lines(new File("gc.log").toPath())
        .parallel()
        .map(stoppedTimePattern::matcher)
        .filter(Matcher::find)
        .map(matcher -> matcher.group(1))
        .mapToDouble(Double::parseDouble)
        .summaryStatistics();

Page 54: Shooting the Rapids

About Fork/Join

• Introduced in Java 7
  • draws from a common pool of ForkJoinWorkerThreads
  • default pool size == HW cores – 1 (checked in the sketch below)
  • assumes the workload will be CPU bound
• On its own, not an easy coding idiom
  • parallel streams provide an abstraction layer
  • the Spliterator defines how to split the stream
  • framework code submits sub-tasks to the common Fork/Join pool
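A quick way to see those defaults on your own machine (illustrative):

import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {
    public static void main(String[] args) {
        System.out.println("Available processors:    "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        // The default can be overridden with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
    }
}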

Page 55: Shooting the Rapids

How Does That Perform?

Old School:  13.3 secs
Sequential:  13.8 secs
Parallel:     9.5 secs

– 1.45x faster
– but not 8x faster (????)

24M lines, 2.8GHz 8-core i7, 16GB, OS X, Java 9.0

Page 56: Shooting the Rapids

In Fact!!!!

• Different benchmarks yield a mixed bag of results:
  • some were better
  • some were the same
  • some were worse!

Page 57: Shooting the Rapids

Open Questions

• Under what conditions are things better?
  • or worse?
• When should we parallelize?
  • and when is serial better?

Page 58: Shooting the Rapids

Open Questions


Answer depends upon where the bottleneck is

Page 59: Shooting the Rapids

Where is Our Bottleneck?

• I/O operations
  • not a surprise – we're reading from a file
• Java 9 uses FileChannelLinesSpliterator
  • 2x better than Java 8's implementation

76.0%    0 + 5941    sun.nio.ch.FileDispatcherImpl.pread0

Page 60: Shooting the Rapids

Poorly Splitting Sources

• Some sources split worse than others
  • LinkedList vs ArrayList (see the comparison sketch below)
• Streaming I/O is problematic
  • more threads == more pressure on the contended resource
  • thrashing and other ill effects
• Workload size doesn't cover the overheads
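A naive comparison of the two sources (an illustrative sketch with invented sizes, not the speakers' benchmark; use JMH for real measurements):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.stream.IntStream;

public class SplitSourceComparison {
    static long timeParallelSum(List<Integer> data) {
        long start = System.nanoTime();
        long sum = data.parallelStream().mapToLong(Integer::longValue).sum();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        IntStream.range(0, 5_000_000).forEach(i -> {
            arrayList.add(i);
            linkedList.add(i);
        });

        // ArrayList splits by index, cheaply and evenly; LinkedList has to walk
        // its nodes to split, so it tends to benefit far less from going parallel
        System.out.println("ArrayList:  " + timeParallelSum(arrayList)  / 1_000_000 + " ms");
        System.out.println("LinkedList: " + timeParallelSum(linkedList) / 1_000_000 + " ms");
    }
}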

Page 61: Shooting the Rapids

Streaming I/O Bottleneck

[Animation across Pages 61–62: all of the parallel pipelines pull values x0…x3 from a single streaming I/O source, which becomes the point of contention (❌)]

Page 63: Shooting the Rapids

LineSpliterator

[Animation across Pages 63–68: the log file ("2.869: Applicati … seconds\n", "5.342: … nds\n", "8.382: … nds\n", "9.337: App … nds\n") is exposed through a MappedByteBuffer; to split, the spliterator's coverage is cut at the midpoint of the buffer and then advanced to the next '\n', so that each half covers only whole lines – one half is handed to the new spliterator, the other is retained by the original]

Included in JDK 9 as FileChannelLinesSpliterator
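A much-simplified sketch of the same idea over an in-memory CharSequence (not the JDK's implementation): split at the midpoint of the covered range, then advance to the next newline so every segment holds whole lines.

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

public class MidpointLineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final CharSequence text;
    private int pos;          // next character to read
    private final int end;    // exclusive end of this spliterator's coverage

    MidpointLineSpliterator(CharSequence text, int pos, int end) {
        super(end - pos, ORDERED | NONNULL);   // size is only an estimate here
        this.text = text;
        this.pos = pos;
        this.end = end;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (pos >= end) return false;
        int nl = indexOfNewline(pos);
        int lineEnd = (nl < 0 || nl >= end) ? end : nl;
        action.accept(text.subSequence(pos, lineEnd).toString());
        pos = lineEnd + 1;
        return true;
    }

    @Override
    public Spliterator<String> trySplit() {
        int mid = pos + (end - pos) / 2;
        int nl = indexOfNewline(mid);                  // advance to a line boundary
        if (nl < 0 || nl + 1 >= end) return null;      // no whole line to give away
        Spliterator<String> prefix = new MidpointLineSpliterator(text, pos, nl + 1);
        pos = nl + 1;                                  // this spliterator keeps the rest
        return prefix;
    }

    private int indexOfNewline(int from) {
        for (int i = from; i < text.length(); i++)
            if (text.charAt(i) == '\n') return i;
        return -1;
    }

    public static void main(String[] args) {
        String log = "2.869: Application time: 1.0001540 seconds\n"
                   + "5.342: Application time: 0.0801231 seconds\n"
                   + "8.382: Application time: 1.1013574 seconds\n";
        long count = StreamSupport
                .stream(new MidpointLineSpliterator(log, 0, log.length()), true)
                .count();
        System.out.println(count);   // 3
    }
}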

Page 69: Shooting the Rapids

In-memory Comparison

• Read GC log into an ArrayList prior to processing

Page 70: Shooting the Rapids

In-memory Comparison

Old School:  9.4 secs
Sequential:  9.9 secs
Parallel:    2.7 secs

– 4.25x faster
– better, but still not 8x faster

24M lines, 2.8GHz 8-core i7, 16GB, OS X, JDK 9.0

Page 71: Shooting the Rapids

Justifying the Overhead

CPNQ performance model:

C – number of submitters
P – number of CPUs
N – number of elements
Q – cost of the operation

Cost of the intermediate operations is N * Q
Overhead of setting up the F/J framework is ~100µs

Page 72: Shooting the Rapids

Amortizing Setup Costs

• N*Q needs to be large
  • Q can often only be estimated
  • N may only be known at run time
  • rule of thumb: N > 10,000 (see the rough check below)
• P is the number of processors
  • P == number of cores when CPU bound
  • P < number of cores otherwise
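A rough back-of-the-envelope check of the rule of thumb (all numbers here are illustrative assumptions, not measurements from the talk):

public class NQCheck {
    public static void main(String[] args) {
        long n = 10_000;            // number of elements (rule-of-thumb threshold)
        double qMicros = 0.1;       // assumed per-element cost in microseconds
        double setupMicros = 100.0; // approximate fork/join setup overhead from the slide

        double workMicros = n * qMicros;
        System.out.printf("N*Q = %.0f us vs. setup ~ %.0f us%n", workMicros, setupMicros);
        System.out.println(workMicros > 10 * setupMicros
                ? "work dwarfs the setup cost: parallelism may pay off"
                : "setup cost is significant: parallelism may not pay off");
    }
}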

Page 73: Shooting the Rapids

Other Gotchas

• Frequent hand-offs place pressure on thread schedulers
  • the effect is magnified when a hypervisor is involved
  • an estimated 80,000 cycles to hand off data between threads
  • you can do a lot of processing in 80,000 cycles
• Too many threads place pressure on thread schedulers
  • responsible for other ill effects (TTSP)
• Too few threads may leave hardware under-utilized

Page 74: Shooting the Rapids

Simulated Server Environment

ExecutorService threadPool = Executors.newFixedThreadPool(10);

threadPool.execute(() -> {
    try {
        long timer = System.currentTimeMillis();
        value = Files.lines(new File("gc.log").toPath())
            .parallel()
            .map(applicationStoppedTimePattern::matcher)
            .filter(Matcher::find)
            .map(matcher -> matcher.group(2))
            .mapToDouble(Double::parseDouble)
            .summaryStatistics()
            .getSum();
    } catch (Exception ex) {}
});
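One frequently used workaround for this contention (a sketch of a well-known trick, not something the slides show, and it relies on fork/join implementation behaviour rather than a documented guarantee): submit the terminal operation to a dedicated ForkJoinPool, so the parallel stream's sub-tasks run in that pool instead of competing for the shared common pool.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class DedicatedPoolDemo {
    public static void main(String[] args) throws Exception {
        Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");
        ForkJoinPool pool = new ForkJoinPool(4);   // pool size chosen arbitrarily
        try {
            Callable<Double> work = () -> {
                try (Stream<String> lines = Files.lines(Paths.get("gc.log"))) {
                    // Because this runs on one of pool's worker threads, the
                    // parallel stream forks its sub-tasks into that pool
                    return lines.parallel()
                                .map(stoppedTimePattern::matcher)
                                .filter(Matcher::find)
                                .mapToDouble(m -> Double.parseDouble(m.group(1)))
                                .sum();
                }
            };
            System.out.println(pool.submit(work).get());
        } finally {
            pool.shutdown();
        }
    }
}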

Page 75: Shooting the Rapids

Work Flow and Results

• The first task to arrive will consume all the ForkJoinWorkerThreads
  • downstream tasks wait for a ForkJoinWorkerThread
  • downstream tasks start intermixing with the initial task
• The initial task collects dead time as it competes for threads
• All other tasks collect dead time as they either compete or wait for a ForkJoinWorkerThread

Page 76: Shooting the Rapids

Work Flow and Results

System is stressed beyond capacity

Page 77: Shooting the Rapids

Intermediate Operation Bottleneck

68.6%   1384 +  0    java.util.regex.Pattern$Curly.match
26.6%    521 + 15    java.util.stream.ReferencePipeline$3$1.accept

Page 78: Shooting the Rapids

Intermediate Operation Bottleneck

• The bottleneck is in pattern matching
  • but the streaming infrastructure isn't far behind!

68.6%   1384 +  0    java.util.regex.Pattern$Curly.match
26.6%    521 + 15    java.util.stream.ReferencePipeline$3$1.accept

Page 79: Shooting the Rapids

Tragedy of the Commons

Garrett Hardin, ecologist (1968): Imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock. But every animal added to the total degrades the commons a small amount.


Page 81: Shooting the Rapids

Tragedy of the Commons

You have a finite amount of hardware
– it might be in your best interest to grab it all
– but if everyone behaves the same way…


Page 83: Shooting the Rapids

Simulated Server Environment

• Submit 10 tasks to Fork/Join (via an Executor thread pool)
  • the first result comes out in 32 seconds
  • compared to 9.5 seconds for an individually submitted task
  • high system time reflects that the task is I/O bound

Page 84: Shooting the Rapids

In-Memory Variation

• Preload the log file
• Submit 10 tasks to Fork/Join (via an Executor thread pool)
  • the first result comes out in 23 seconds
  • compared to 4.5 seconds for an individually submitted task
  • the task is CPU bound

Page 87: Shooting the Rapids

Conclusions

Sequential stream performance is comparable to imperative code

Going parallel is worthwhile IF
– the task is suitable
  – expensive enough to amortize the setup costs
  – no inter-task communication needed
– the data source is suitable
– the environment is suitable

Need to monitor the JDK to understand bottlenecks
– the Fork/Join pool is not well instrumented

Page 88: Shooting the Rapids

Resources

http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
