shooting the rapids
TRANSCRIPT
Shooting the Rapids: Getting the Best from Java 8
Streams
Kirk Pepperdine @kcpeppeMaurice Naftalin @mauricenaftalin
Devoxx Belgium, Nov. 2015
• Specialises in performance tuning• speaks frequently about performance• author of performance tuning workshop
• Co-founder• performance diagnostic tooling
• Java Champion (since 2006)
About Kirk
• Specialises in performance tuning• speaks frequently about performance• author of performance tuning workshop
• Co-founder• performance diagnostic tooling
• Java Champion (since 2006)
About Kirk
About Maurice
About Maurice
About Maurice
Co-author Author
About Maurice
Co-author Author
JavaChampion
JavaOneRock Star
Subjects Covered in this Talk
• Background – lambdas and streams• Performance of our example• Effect of parallelizing• Splitting input data efficiently• When to go parallel• Parallel streams in the real world
Benchmark Alert
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
What is a Lambda?
matchermatcher.find()
matchermatcher.find()
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
Predicate<Matcher> matches =
What is a Lambda?
matchermatcher.find()
matchermatcher.find()
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
Predicate<Matcher> matches =
What is a Lambda?
matcher
Predicate<Matcher> matches =
matcher.find()matcher
matcher.find()
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
Predicate<Matcher> matches =
What is a Lambda?
matcherPredicate<Matcher> matches =
matcher.find()matcher
matcher.find()
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
Predicate<Matcher> matches =
What is a Lambda?
matcherPredicate<Matcher> matches =
matcher.find()
->
matchermatcher.find()
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
Predicate<Matcher> matches =
What is a Lambda?
matcherPredicate<Matcher> matches = matcher.find()->
matchermatcher.find()
Predicate<Matcher> matches = new Predicate<Matcher>() { @Override public boolean test(Matcher matcher) { return matcher.find(); }};
Predicate<Matcher> matches =
What is a Lambda?
matcherPredicate<Matcher> matches =
A lambda is a function from arguments to result
matcher.find()->
matchermatcher.find()
Example: Processing GC Logfile
⋮2.869: Application time: 1.0001540 seconds5.342: Application time: 0.0801231 seconds8.382: Application time: 1.1013574 seconds⋮
Example: Processing GC Logfile
⋮2.869: Application time: 1.0001540 seconds5.342: Application time: 0.0801231 seconds8.382: Application time: 1.1013574 seconds⋮
DoubleSummaryStatistics {count=3, sum=2.181635, min=0.080123, average=0.727212, max=1.101357}
Old School Code
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Old School Code
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Let’s look at the features in this code
Data Source
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Map to Matcher
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Filter
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Map to Double
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Collect Results (Reduce)
DoubleSummaryStatistic summary = new DoubleSummaryStatistic();Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");while ( ( logRecord = logFileReader.readLine()) != null) {
Matcher matcher = stoppedTimePattern.matcher(logRecord); if ( matcher.find()) {
double value = Double.parseDouble( matcher.group(1)); summary.add( value);
} }
Java 8 Streams
• A sequence of values, “in motion”• source and intermediate operations set the stream up lazily• a terminal operation “pulls” values eagerly down the stream
collection.stream() .intermediateOp
⋮ .intermediateOp .terminalOp
Stream Sources
• New method Collection.stream() • Many other sources:
• Arrays.stream(Object[]) • Streams.of(Object...) • Stream.iterate(Object,UnaryOperator) • Files.lines() • BufferedReader.lines() • Random.ints() • JarFile.stream()
• …
Imperative to Stream
DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())
.map(stoppedTimePattern::matcher)
.filter(Matcher::find)
.map(matcher -> matcher.group(1))
.mapToDouble(Double::parseDouble)
.summaryStatistics();
Stream Source
DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())
.map(stoppedTimePattern::matcher)
.filter(Matcher::find)
.map(matcher -> matcher.group(1))
.mapToDouble(Double::parseDouble)
.summaryStatistics();
Intermediate Operations
DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())
.map(stoppedTimePattern::matcher)
.filter(Matcher::find)
.map(matcher -> matcher.group(1))
.mapToDouble(Double::parseDouble)
.summaryStatistics();
Method References
DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())
.map(stoppedTimePattern::matcher)
.filter(Matcher::find)
.map(matcher -> matcher.group(1))
.mapToDouble(Double::parseDouble)
.summaryStatistics();
Terminal Operation
DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())
.map(stoppedTimePattern::matcher)
.filter(Matcher::find)
.map(matcher -> matcher.group(1))
.mapToDouble(Double::parseDouble)
.summaryStatistics();
Visualising Sequential Streams
x2x0 x1 x3x0 x1 x2 x3
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Visualising Sequential Streams
x2x0 x1 x3x1 x2 x3 ✔
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Visualising Sequential Streams
x2x0 x1 x3 x1x2 x3 ❌✔
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Visualising Sequential Streams
x2x0 x1 x3 x1x2x3 ❌✔
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Old School: 13.3 secsSequential: 13.8 secs
- Should be the same workload- Stream code is cleaner, easier to read
How Does It Perform?
24M line file, MacBook Pro, Haswell i7, 4 cores, hyperthreaded, Java 9.0
Can We Do Better?
• We might be able to if the workload is parallelizable• split stream into many segments• process each segment• combine results
• Requirements exactly match Fork/Join workflow
x2
Visualizing Parallel Streams
x0x1
x3
x0
x1x2x3
x2
Visualizing Parallel Streams
x0
x1
x3
x0
x1
x2x3
x2
Visualizing Parallel Streams
x1
x3
x0
x1
x3
✔
❌
x2
Visualizing Parallel Streams
x1 y3
x0
x1
x3
✔
❌
Splitting Stream Sources
• Stream source is a Spliterator• can both iterate over data and – where possible – split it
Splitting the Data
Splitting the Data
Splitting the Data
Splitting the Data
Splitting the Data
Splitting the Data
Splitting the Data
Splitting the Data
Splitting the Data
Parallel Streams
DoubleSummaryStatistics statistics = Files.lines(new File(“gc.log”).toPath())
.parallel() .map(stoppedTimePattern::matcher) .filter(Matcher::find) .map(matcher -> matcher.group(1)).mapToDouble(Double::parseDouble).summaryStatistics();
About Fork/Join
• Introduced in Java 7• draws from a common pool of ForkJoinWorkerThread
• default pool size == HW cores – 1• assumes workload will be CPU bound
• On its own, not an easy coding idiom• parallel streams provide an abstraction layer
• Spliterator defines how to split stream• framework code submits sub-tasks to the common Fork/Join pool
Old School: 13.3 secsSequential: 13.8 secsParallel: 9.5 secs
- 1.45x faster- but not 8x faster (????)
How Does That Perform?
24M lines, 2.8GHz 8-core i7, 16GB, OS X, Java 9.0
In Fact!!!!
• Different benchmarks yield a mixed bag of results• some were better• some were the same• some were worse!
Open Questions
• Under what conditions are things better• or worse
• When should we parallelize• and when is serial better
Open Questions
• Under what conditions are things better• or worse
• When should we parallelize• and when is serial better
Answer depends upon where the bottleneck is
Where is Our Bottleneck?
• I/O operations• not a surprise, we’re reading from a file
• Java 9 uses FileChannelLineSpliterator• 2x better than Java 8’s implementation
76.0% 0 + 5941 sun.nio.ch.FileDispatcherImpl.pread0
Poorly Splitting Sources
• Some sources split worse than others• LinkedList vs ArrayList
• Streaming I/O is problematic• more threads == more pressure on contended resource
• thrashing and other ill effects
• Workload size doesn’t cover the overheads
Streaming I/O Bottleneck
x2x0 x1 x3x0 x1 x2 x3
Streaming I/O Bottleneck
✔
❌x2x1x0 x1 x3
5.342: … nds
LineSpliterator
2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n
spliterator coverage
5.342: … nds
LineSpliterator
2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n
spliterator coverage
MappedByteBuffer
5.342: … nds
LineSpliterator
2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n
spliterator coverage
MappedByteBuffer mid
5.342: … nds
LineSpliterator
2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n
spliterator coverage
MappedByteBuffer mid
5.342: … nds
LineSpliterator
2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n
spliterator coveragenew spliterator coverage
MappedByteBuffer mid
5.342: … nds
LineSpliterator
2.869: Applicati … seconds \n 8.382: … nds 9.337:App … nds\n \n \n
spliterator coveragenew spliterator coverage
MappedByteBuffer mid
Included in JDK9 as FileChannelLinesSpliterator
In-memory Comparison
• Read GC log into an ArrayList prior to processing
Old School: 9.4 secsSequential: 9.9 secsParallel: 2.7 secs
- 4.25x faster- better but still not 8x faster
In-memory Comparison
24M lines, 2.8GHz 8 core i7, 16GB, OS X, JDK 9.0
Justifying the Overhead
CPNQ performance model:
C - number of submittersP - number of CPUsN - number of elementsQ - cost of the operation
cost of intermediate operations is N * Qoverhead of setting up F/J framework is ~100µs
Amortizing Setup Costs
• N*Q needs to be large• Q can often only be estimated• N may only be known at run time
• Rule of thumb, N > 10,000
• P is the number of processors• P == number for cores for CPU bound• P < number of cores otherwise
Other Gotchas
• Frequent hand-offs place pressure on thread schedulers• effect is magnified when a hypervisor is involved• estimated 80,000 cycles to handoff data between threads
• you can do a lot of processing in 80,000 cycles
• Too many threads places pressure on thread schedulers• responsible for other ill effects (TTSP)• too few threads may leave hardware under-utilized
Simulated Server Environment
ExecutorService threadPool = Executors.newFixedThreadPool(10);
threadPool.execute(() -> { try { long timer = System.currentTimeMillis(); value = Files.lines( new File(“gc.log").toPath()).parallel() .map(applicationStoppedTimePattern::matcher) .filter(Matcher::find) .map( matcher -> matcher.group(2)) .mapToDouble(Double::parseDouble) .summaryStatistics().getSum(); } catch (Exception ex) {}});
Work Flow and Results• First task to arrive will consume all ForkJoinWorkerThread
• downstream tasks wait for a ForkJoinWorkerThread• downstream tasks start intermixing with initial task
• Initial task collects dead time as it competes for threads• all other tasks collect dead time as they either
• compete or wait for a ForkJoinWorkerThread
Work Flow and Results• First task to arrive will consume all ForkJoinWorkerThread
• downstream tasks wait for a ForkJoinWorkerThread• downstream tasks start intermixing with initial task
• Initial task collects dead time as it competes for threads• all other tasks collect dead time as they either
• compete or wait for a ForkJoinWorkerThread
System is stressed beyond capacity
Intermediate Operation Bottleneck
68.6% 1384 + 0 java.util.regex.Pattern$Curly.match26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept
Intermediate Operation Bottleneck
• Bottleneck is in pattern matching• but, streaming infrastructure isn’t far behind!
68.6% 1384 + 0 java.util.regex.Pattern$Curly.match26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept
Tragedy of the Commons
Garrett Hardin, ecologist (1968):Imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock. But every animal added to the total degrades the commons a small amount.
Tragedy of the Commons
Tragedy of the Commons
You have a finite amount of hardware– it might be in your best interest to grab it all– but if everyone behaves the same way…
Simulated Server Environment
Simulated Server Environment• Submit 10 tasks to Fork-Join (via Executor thread-pool)
• first result comes out in 32 seconds
• compared to 9.5 seconds for individually submitted task• high system time reflects task is I/O bounded
In-Memory Variation
In-Memory Variation• Preload log file
In-Memory Variation• Preload log file• Submit 10 tasks to Fork-Join (via Executor thread-pool)
• first result comes out in 23 seconds• compared to 4.5 seconds for individually submitted task
• task is CPU bound
Conclusions
Sequential stream performance comparable to imperative codeGoing parallel is worthwhile IF- task is suitable
- expensive enough to amortize setup costs
- no inter-task communication needed
- data source is suitable
- environment is suitable
Need to monitor JDK to understanding bottlenecks- Fork/Join pool is not well instrumented
Resourceshttp://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
Resourceshttp://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html