shooting the rapids

Shooting the Rapids: Getting the Best from Java 8

Streams

Kirk Pepperdine @kcpeppeMaurice Naftalin @mauricenaftalin

Devoxx Belgium, Nov. 2015

• Specialises in performance tuning• speaks frequently about performance• author of performance tuning workshop

• Co-founder• performance diagnostic tooling

• Java Champion (since 2006)

About Kirk

About Maurice

About Maurice

Co-author Author

About Maurice

Co-author Author

JavaChampion

JavaOneRock Star

Subjects Covered in this Talk

• Background – lambdas and streams• Performance of our example• Effect of parallelizing• Splitting input data efficiently• When to go parallel• Parallel streams in the real world

Benchmark Alert

Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};

What is a Lambda?

matchermatcher.find()



Predicate<Matcher> matches =

What is a Lambda?





What is a Lambda?

matcher


matcher.find()matcher

matcher.find()



What is a Lambda?

matcherPredicate<Matcher> matches =

matcher.find()matcher

matcher.find()



What is a Lambda?


matcher.find()

->




What is a Lambda?

matcherPredicate<Matcher> matches = matcher.find()->




What is a Lambda?


A lambda is a function from arguments to result

matcher.find()->


Example: Processing GC Logfile

⋮2.869: Application time: 1.0001540 seconds5.342: Application time: 0.0801231 seconds8.382: Application time: 1.1013574 seconds⋮

Example: Processing GC Logfile

⋮2.869: Application time: 1.0001540 seconds5.342: Application time: 0.0801231 seconds8.382: Application time: 1.1013574 seconds⋮

DoubleSummaryStatistics {count=3, sum=2.181635, min=0.080123, average=0.727212, max=1.101357}

Old School Code

DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {

Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {

double value = Double.parseDouble( matcher.group(1)); summary.add( value);

} }

Old School Code




} }

Let’s look at the features in this code

Data Source




} }

Map to Matcher




} }

Filter




} }

Map to Double




} }

Collect Results (Reduce)




} }

Java 8 Streams

• A sequence of values, “in motion”• source and intermediate operations set the stream up lazily• a terminal operation “pulls” values eagerly down the stream

collection.stream() .intermediateOp

⋮ .intermediateOp .terminalOp

Stream Sources

• New method Collection.stream() • Many other sources:

• Arrays.stream(Object[]) • Streams.of(Object...) • Stream.iterate(Object,UnaryOperator) • Files.lines() • BufferedReader.lines() • Random.ints() • JarFile.stream()

• …

Imperative to Stream

DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())

.map(stoppedTimePattern::matcher)

.filter(Matcher::find)

.map(matcher -> matcher.group(1))

.mapToDouble(Double::parseDouble)

.summaryStatistics();

Stream Source







Intermediate Operations







Method References







Terminal Operation







Visualising Sequential Streams

x2x0 x1 x3x0 x1 x2 x3

Source Map Filter Reduction


Terminal Operation

“Values in Motion”


x2x0 x1 x3x1 x2 x3 ✔



Terminal Operation



x2x0 x1 x3 x1x2 x3 ❌✔



Terminal Operation



x2x0 x1 x3 x1x2x3 ❌✔



Terminal Operation


Old School: 13.3 secsSequential: 13.8 secs

- Should be the same workload- Stream code is cleaner, easier to read

How Does It Perform?

24M line file, MacBook Pro, Haswell i7, 4 cores, hyperthreaded, Java 9.0

Can We Do Better?

• We might be able to if the workload is parallelizable• split stream into many segments• process each segment• combine results

• Requirements exactly match Fork/Join workflow

x2

Visualizing Parallel Streams

x0x1

x3

x0

x1x2x3

x2


x0

x1

x3

x0

x1

x2x3

x2


x1

x3

x0

x1

x3

✔

❌

x2


x1 y3

x0

x1

x3

✔

❌

Splitting Stream Sources

• Stream source is a Spliterator• can both iterate over data and – where possible – split it

Splitting the Data

Parallel Streams


.parallel() .map(stoppedTimePattern::matcher) .filter(Matcher::find) .map(matcher -> matcher.group(1)).mapToDouble(Double::parseDouble).summaryStatistics();

About Fork/Join

• Introduced in Java 7• draws from a common pool of ForkJoinWorkerThread

• default pool size == HW cores – 1• assumes workload will be CPU bound

• On its own, not an easy coding idiom• parallel streams provide an abstraction layer

• Spliterator defines how to split stream• framework code submits sub-tasks to the common Fork/Join pool

Old School: 13.3 secsSequential: 13.8 secsParallel: 9.5 secs

- 1.45x faster- but not 8x faster (????)

How Does That Perform?

24M lines, 2.8GHz 8-core i7, 16GB, OS X, Java 9.0

In Fact!!!!

• Different benchmarks yield a mixed bag of results• some were better• some were the same• some were worse!

Open Questions

• Under what conditions are things better• or worse

• When should we parallelize• and when is serial better

Open Questions

• Under what conditions are things better• or worse

• When should we parallelize• and when is serial better

Answer depends upon where the bottleneck is

Where is Our Bottleneck?

• I/O operations• not a surprise, we’re reading from a file

• Java 9 uses FileChannelLineSpliterator• 2x better than Java 8’s implementation

76.0% 0 + 5941 sun.nio.ch.FileDispatcherImpl.pread0

Poorly Splitting Sources

• Some sources split worse than others• LinkedList vs ArrayList

• Streaming I/O is problematic• more threads == more pressure on contended resource

• thrashing and other ill effects

• Workload size doesn’t cover the overheads

Streaming I/O Bottleneck

x2x0 x1 x3x0 x1 x2 x3

Streaming I/O Bottleneck

✔

❌x2x1x0 x1 x3

5.342: … nds

LineSpliterator

2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n

spliterator coverage

5.342: … nds

LineSpliterator



MappedByteBuffer

5.342: … nds

LineSpliterator



MappedByteBuffer mid

5.342: … nds

LineSpliterator


spliterator coveragenew spliterator coverage


5.342: … nds

LineSpliterator


spliterator coveragenew spliterator coverage


Included in JDK9 as FileChannelLinesSpliterator

In-memory Comparison

• Read GC log into an ArrayList prior to processing

Old School: 9.4 secsSequential: 9.9 secsParallel: 2.7 secs

- 4.25x faster- better but still not 8x faster

In-memory Comparison

24M lines, 2.8GHz 8 core i7, 16GB, OS X, JDK 9.0

Justifying the Overhead

CPNQ performance model:

C - number of submittersP - number of CPUsN - number of elementsQ - cost of the operation

cost of intermediate operations is N * Qoverhead of setting up F/J framework is ~100µs

Amortizing Setup Costs

• N*Q needs to be large• Q can often only be estimated• N may only be known at run time

• Rule of thumb, N > 10,000

• P is the number of processors• P == number for cores for CPU bound• P < number of cores otherwise

Other Gotchas

• Frequent hand-offs place pressure on thread schedulers• effect is magnified when a hypervisor is involved• estimated 80,000 cycles to handoff data between threads

• you can do a lot of processing in 80,000 cycles

• Too many threads places pressure on thread schedulers• responsible for other ill effects (TTSP)• too few threads may leave hardware under-utilized

Simulated Server Environment

ExecutorService threadPool = Executors.newFixedThreadPool(10);

threadPool.execute(() -> { try { long timer = System.currentTimeMillis(); value = Files.lines( new File(“gc.log").toPath()).parallel() .map(applicationStoppedTimePattern::matcher) .filter(Matcher::find) .map( matcher -> matcher.group(2)) .mapToDouble(Double::parseDouble) .summaryStatistics().getSum(); } catch (Exception ex) {}});

Work Flow and Results• First task to arrive will consume all ForkJoinWorkerThread

• downstream tasks wait for a ForkJoinWorkerThread• downstream tasks start intermixing with initial task

• Initial task collects dead time as it competes for threads• all other tasks collect dead time as they either

• compete or wait for a ForkJoinWorkerThread

Work Flow and Results• First task to arrive will consume all ForkJoinWorkerThread

• downstream tasks wait for a ForkJoinWorkerThread• downstream tasks start intermixing with initial task

• Initial task collects dead time as it competes for threads• all other tasks collect dead time as they either

• compete or wait for a ForkJoinWorkerThread

System is stressed beyond capacity

Intermediate Operation Bottleneck

68.6% 1384 + 0 java.util.regex.Pattern$Curly.match26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept

Intermediate Operation Bottleneck

• Bottleneck is in pattern matching• but, streaming infrastructure isn’t far behind!

68.6% 1384 + 0 java.util.regex.Pattern$Curly.match26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept

Tragedy of the Commons

Garrett Hardin, ecologist (1968):Imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock. But every animal added to the total degrades the commons a small amount.


You have a finite amount of hardware– it might be in your best interest to grab it all– but if everyone behaves the same way…

Simulated Server Environment

Simulated Server Environment• Submit 10 tasks to Fork-Join (via Executor thread-pool)

• first result comes out in 32 seconds

• compared to 9.5 seconds for individually submitted task• high system time reflects task is I/O bounded

In-Memory Variation

In-Memory Variation• Preload log file

In-Memory Variation• Preload log file• Submit 10 tasks to Fork-Join (via Executor thread-pool)

• first result comes out in 23 seconds• compared to 4.5 seconds for individually submitted task

• task is CPU bound

Conclusions

Sequential stream performance comparable to imperative codeGoing parallel is worthwhile IF- task is suitable

- expensive enough to amortize setup costs

- no inter-task communication needed

- data source is suitable

- environment is suitable

Need to monitor JDK to understanding bottlenecks- Fork/Join pool is not well instrumented

Resourceshttp://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html

http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html

shooting the rapids

Software