Apache Big Data Europe 2016
TRANSCRIPT
© 2015 IBM Corporation
A Java™ Implementer's Guide to Better Apache Spark™ Performance
Tim Ellison, IBM Runtimes Team, Hursley, UK
tellison / @tpellison
Apache Spark is a fast, general-purpose cluster computing platform
[Spark stack: SQL | Streaming | Machine Learning | Graph, layered over Core, with Data Frames and Machine Learning Pipelines]

Low-level language optimizations: run the existing code quicker!

Higher-level platform optimizations: write quicker code!
Apache Spark Platform Optimizations
A computing platform wants to offer high-level APIs to understand the “intent” of an application and optimize its implementation
– The Apache Spark platform optimizes its use of the Scala/Java platform
– The Scala/Java platform optimizes its use of machine hardware capabilities
IBM Java SE Platform Optimizations
IBM Runtime Technology Centre
– Quality engineering, security fixes, additional testing, customer feedback
– Production requirements: performance, reliability, scalability, monitoring and management

Java SE stack: AWT / Swing / Java2D, XML / Crypto / CORBA, core libraries, and the “J9” virtual machine and JIT compiler

Deep integration with hardware and software innovation
Heuristics optimised for workload scenarios
Low-level Profiling of Apache Spark Workloads
Tools like Health Center and tprof show us how the JVM is performing and its effect on the CPU.
A few things we can do with the JVM to enhance the performance of Apache Spark!
1) JIT compiler enhancements, and writing JIT-friendly code
2) Faster IO – networking and storage
3) Offloading tasks to co-processors
Adaptive Compilation: trading off effort and benefit
1) The VM searches the JAR, loads and verifies bytecodes, converts them to an internal representation (IR), and interprets the bytecodes directly.
2) After many invocations (or via sampling) code gets compiled at ‘cold’ or ‘warm’ level. Modest, local optimizations performed.
3) A low overhead sampling thread is used to identify frequently used methods within the application.
4) Methods may get recompiled at ‘hot’ or ‘scorching’ levels for more intensive optimizations.
5) Transition to ‘scorching’ goes through a temporary profiling step, to produce application specific, globally optimized code.
[Diagram: compilation levels — interpreter → cold → warm → hot → profiling → scorching]
Java's bytecodes are compiled as required and based on runtime profiling – dynamic compilation targeted at the machine's capabilities and the application's demands.
The JIT takes a holistic view of the application, looking for global optimizations based on actual usage patterns, and speculative assumptions.
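The warm-up behaviour described in the steps above can be seen with a small, self-contained Java sketch (generic JVM behaviour, not IBM-specific; the thresholds and timings are illustrative and will vary by VM):

```java
// Sketch: the same method is first interpreted, then JIT-compiled once it
// crosses the VM's invocation threshold, so later calls run much faster.
public class WarmupDemo {
    // Small, frequently called method -- a typical JIT compilation candidate.
    static int dot(int[] a, int[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static long time(int[] a, int[] b) {
        long t0 = System.nanoTime();
        int r = dot(a, b);
        long t1 = System.nanoTime();
        if (r == Integer.MIN_VALUE) System.out.println(r); // keep result live
        return t1 - t0;
    }

    public static void main(String[] args) {
        int[] a = new int[1024], b = new int[1024];
        for (int i = 0; i < 1024; i++) { a[i] = i; b[i] = 2; }

        long first = time(a, b);                       // mostly interpreted
        for (int i = 0; i < 200_000; i++) dot(a, b);   // cross the threshold
        long later = time(a, b);                       // typically compiled now

        System.out.println("first=" + first + "ns later=" + later + "ns");
    }
}
```

The absolute numbers are machine-dependent; the point is only that the hot method gets progressively cheaper as the adaptive compiler promotes it.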
JIT Optimization Techniques

[Diagram: at compile time, language → IR generators feed, via connectors, an optimizer and CPU-specific code generators; a profiler feeds runtime information back into the optimizer]

Compilation strategies: cold, warm, hot, scorching, profiling, FSD, small, custom, AOT.

Local and global optimization techniques: block hoisting, inliner, value propagation, CSE, loop versioner, PRE, escape analysis, redundant monitor elimination, cold block outlining, loop unroller, copy propagation, optimal store placement, simplifier, dead tree removal, switch analyzer, async check removal.

Optimizer:
• Multiple optimization strategies for different code quality / compile-time trade-offs
• Spend compile time where it makes the biggest difference
• Easy to adopt the latest optimizations
• Embodies many years of Java experience
JIT Optimizations are Limited to Preserving Semantic Equivalence
Compilers thrive on proving properties of code
– Uncertainty makes code slower

Problem: Java has some inherent uncertainty
– virtual calls
– many small methods
– dynamic class loading

Problem: Scala has even more inherent uncertainty
– field accesses become method calls
– many layers of indirection to facilitate multiple inheritance
– frequent use of interface calls

The JIT can speculate to overcome some of these
Compiler Techniques to Cope with Uncertainty
Challenge: dynamic class loading means application shape changes
Approach: JIT speculates about invariance of code, and copes with failure
Challenge: lots of small methods
Approach: JIT aggregates small methods into larger ones to avoid call overhead
Challenge: lots of virtual method calls
Approach: JIT rewrites virtual calls to direct calls through flow analysis
Challenge: objects are heap-allocated, and must be GC'd
Approach: JIT uses escape analysis to allocate objects on the stack where possible
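To illustrate the stack-allocation point, here is a hedged sketch (generic JIT behaviour, illustrative class names): an object that never escapes its method is a candidate for stack allocation or scalar replacement.

```java
public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // The Point below never escapes this method: it is not stored in a field,
    // not returned, and not passed to an unknown callee. A JIT that performs
    // escape analysis can elide the heap allocation (scalar replacement).
    static int manhattan(int x, int y) {
        Point p = new Point(x, y);
        return Math.abs(p.x) + Math.abs(p.y);
    }

    // By contrast, a Point that is returned escapes the method,
    // so it must remain heap-allocated and be garbage-collected later.
    static Point make(int x, int y) {
        return new Point(x, y);
    }

    public static void main(String[] args) {
        System.out.println(manhattan(-3, 4)); // prints 7
    }
}
```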
Programmers Can Help by Reducing Uncertainty
Locals are faster than globals
– Can prove a closed set of storage readers / modifiers
– Fields and statics are slow; parameters and locals are fast

Constants are faster than variables
– Can copy constants inline or across memory caches
– Java's final and Scala's val are your friends

private is faster than public
– Private methods can't be dynamically redefined
– protected and “package private” are just as slow as public

Small methods (≤ 100 bytecodes) are good
– More opportunities to in-line them

Simple is faster than complex
– Easier for the JIT to reason about the effects

Limit extension points and architectural complexity when practical
– Makes call sites concrete
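A hedged sketch pulling these tips together (illustrative names, not from the talk): final constants, private helpers, small methods, and local accumulators.

```java
public final class JitFriendly {             // final: cannot be subclassed
    private static final double SCALE = 2.5; // constant: foldable inline by the JIT

    // Small (well under 100 bytecodes) and private: easy to in-line,
    // and the call site is provably concrete -- no dynamic redefinition.
    private static double scale(double v) {
        return v * SCALE;
    }

    // Accumulate in a local rather than a field: the JIT can prove that
    // nothing else reads or writes it, which it cannot for shared state.
    public static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += scale(v);
        }
        return total;
    }
}
```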
Running Scala code on the Java VM

Scala is a language with many, many features
– Some features are for convenience, others are for performance
– The JVM is not optimized for these patterns and idioms

Understand how the features you are using perform

Idiomatic Scala is convenient, but slow! The Scala & Java library implementations are standard imperative code.
Example source & performance from https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html
Example: Exploiting JIT optimizations to reduce the overhead of logging checks

Spark's logging calls (trait Logging) are gated on checks of a static boolean value.

Tip: Check for the non-null value of a static field referencing an instance of a logging-class singleton.
– This uses the JIT's speculative optimization to avoid the explicit test for logging being enabled; instead it:
1) Generates an internal JIT runtime assumption (e.g. InfoLogger.class is undefined)
2) NOPs the test for trace enablement
3) Uses a class initialization hook for InfoLogger.class (already necessary for instantiating the class)
4) The JIT will regenerate the test code if the class event is fired
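A hedged Java reconstruction of this pattern (the InfoLogger name comes from the slide; the surrounding scaffolding is illustrative):

```java
// Logging is "enabled" by initializing the singleton's class; until that
// happens, the JIT can assume INSTANCE is null and NOP the whole check away.
public class LoggingDemo {
    static final class InfoLogger {
        static InfoLogger INSTANCE;               // null until logging is enabled
        static void enable() { INSTANCE = new InfoLogger(); }
        void log(String msg) { System.out.println("INFO: " + msg); }
    }

    static int work(int x) {
        // The JIT speculates that InfoLogger has never been initialized,
        // compiles this branch to nothing, and regenerates the test only
        // if/when InfoLogger's class initialization event fires.
        if (InfoLogger.INSTANCE != null) {
            InfoLogger.INSTANCE.log("work(" + x + ")");
        }
        return x * 2;
    }

    public static void main(String[] args) {
        System.out.println(work(21)); // branch compiled away while logging is off
    }
}
```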
Style: Judicious use of polymorphism

Spark / Scala has a number of highly polymorphic interface call sites with high fan-in (several calling contexts invoking the same callee method) in map, reduce, filter, flatMap, ...
– e.g. ExternalSorter.insertAll is very hot (drains an iterator using hasNext/next calls)

Pattern #1:
– InterruptibleIterator → Scala's mapIterator → Scala's filterIterator → …
Pattern #2:
– InterruptibleIterator → Scala's filterIterator → Scala's mapIterator → …

The JIT can only choose one pattern to in-line!
– Makes JIT devirtualization and speculation more risky; using profiling information from a different context could lead to incorrect devirtualization
– More conservative speculation, or good phase-change detection and recovery, are needed in the JIT compiler to avoid getting it wrong

Lambdas and functions as arguments, by definition, introduce different code flow targets
– Passing in widely implemented interfaces produces many different bytecode sequences
– When we in-line we have to put runtime checks ahead of in-lined method bodies to make sure we are going to run the right method!
– Often specialized classes are used in only a very limited number of places, but the majority of the code does not use these classes and pays a heavy penalty
– e.g. Scala's attempt to specialize Tuple2's Int arguments does more harm than good!

Tip: Use polymorphism sparingly, use the same order / patterns for nested & wrapped code, and keep call sites homogeneous.
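A hedged illustration of homogeneous versus megamorphic call sites (generic JVM behaviour, illustrative types):

```java
import java.util.function.IntUnaryOperator;

public class CallSiteDemo {
    // Every call through 'op' goes via one interface call site. If callers
    // pass many different IntUnaryOperator classes through it, the site
    // becomes megamorphic: the JIT cannot devirtualize, and falls back to
    // a real interface dispatch instead of in-lining the target.
    static int apply3(IntUnaryOperator op, int x) {
        return op.applyAsInt(op.applyAsInt(op.applyAsInt(x)));
    }

    public static void main(String[] args) {
        // Homogeneous usage: a single lambda class dominates the call site,
        // so the JIT can speculate on the receiver type and in-line it.
        IntUnaryOperator inc = n -> n + 1;
        System.out.println(apply3(inc, 0)); // prints 3
    }
}
```

Keeping one implementation dominant at a hot call site is exactly the "keep call sites homogeneous" advice from the slide.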
Example: Use of Scala Functions as Values
Scala has first class functions – they can be treated as values
val square = (x: Int) => { x * x }
Implemented by scalac using generic interfaces!

val square = new Function1[Int, Int] { def apply(x: Int): Int = x * x }

A call site through a function value is highly uncertain
– The program contains many different classes that implement Function1
– Hard to tell which Function1 implementer to in-line

def filter(values: List[A], checkOp: A => Boolean): List[A] = {
  var result: List[A] = Nil
  for (v <- values) {
    if (checkOp(v)) {
      result = v :: result
    }
  }
  return result
}
JIT compiler enhancements, and writing JIT-friendly code
Understand the implementation of Scala language features – use them judiciously.
Try to reduce uncertainty for the compiler in your coding style.
Stick to common coding patterns - the JIT is tuned for them.– As new workloads emerge the latest JITs will change too
Focus on performance hotspots in your application
– Too much emphasis on performance can compromise maintainability
– Too much emphasis on maintainability can compromise performance!
Effect of adjusting JIT heuristics for Apache Spark

Performance relative to OpenJDK 8¹:

            IBM JDK8 SR3 (tuned)   IBM JDK8 SR3 (out of the box)
PageRank    160%                   148%
Sleep       187%                   113%
Sort        103%                   147%
WordCount   130%                   146%
Bayes       100%                    91%
Terasort    160%                   131%

¹ 1 / geometric mean of HiBench time on zLinux, 32 cores, 25G heap

[Chart: improvements in successive IBM Java 8 releases — performance compared with OpenJDK 8, reaching 1.35x (HiBench huge, Spark 2.0.1, Linux POWER8, 12 cores × 8-way SMT)]
Faster IO – networking and storage
Performance of the Apache Spark Runtime Core
Executing tasks
– How quickly can a worker sort, compute, transform, … the data in this partition?
– Where is the best place to send work to have it completed quickest?

“Narrow” RDD dependencies, e.g. map(): pipeline-able
“Wide” RDD dependencies, e.g. reduce(): shuffles

[Diagram: RDD1 partitions 1…n feeding RDD2 and RDD3 partitions — narrow dependencies map partition to partition, wide dependencies shuffle data across all partitions]

Moving data blocks
– How quickly can a worker get the data needed for a task?
– How quickly can a worker persist the results if required?
Remote Direct Memory Access (RDMA) Networking

[Diagram: Spark node #1 and Spark node #2; each Spark VM has on-heap and off-heap buffers; data moves via B-Copy through the OS, or via Z-Copy DMA directly between RDMA NICs/HCAs over an Ethernet/InfiniBand switch]

Acronyms: Z-Copy – zero copy; B-Copy – buffer copy; IB – InfiniBand; Ether – Ethernet; NIC – network interface card; HCA – host channel adapter

● Low-latency, high-throughput networking
● Direct 'application to application' memory pointer exchange between remote hosts
● Off-loads network processing to the RDMA NIC/HCA – OS/kernel bypass (zero-copy)
● Introduces new IO characteristics that can influence the Apache Spark transfer plan
Characteristics of RDMA Networking

[Chart: RDMA vs TCP/IP — RDMA exhibits improved throughput and reduced latency]

Our JVM makes RDMA available transparently via the java.net.Socket APIs (JSoR), or explicitly via com.ibm jVerbs calls.
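Because JSoR works at the java.net.Socket level, ordinary socket code needs no changes to benefit. A minimal, self-contained sketch of such code (loopback here; the payload string is illustrative):

```java
import java.io.*;
import java.net.*;

public class SocketEcho {
    // Plain java.net.Socket code: on a JSoR-enabled JVM with RDMA hardware,
    // the slide says the same unmodified code can be carried over RDMA.
    static String roundTrip(String msg) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // ephemeral port
            int port = server.getLocalPort();

            Thread sender = new Thread(() -> {
                try (Socket s = new Socket("localhost", port);
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(msg);                        // one line of payload
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            sender.start();

            try (Socket s = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                String received = in.readLine();
                sender.join();
                return received;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("received: " + roundTrip("shuffle-block-42"));
    }
}
```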
Modifications for an RDMA-enabled Apache Spark

New dynamic transfer plan that adapts to the load and responsiveness of the remote hosts.

New “RDMA” shuffle IO mode with lower latency and higher throughput.

[Diagram: layers labelled JVM-agnostic vs IBM JVM only, spanning the high-level API, block manipulation (i.e., RDD partitions), and a JVM-agnostic working prototype with RDMA]
TPC-H benchmark, 100GB: 30% improvement in database operations
HiBench PageRank, 3GB: 40% faster, lower CPU usage
Shuffle-intensive benchmarks show 30% – 40% better performance with RDMA
32 cores (1 master, 4 nodes × 8 cores/node, 32GB Mem/node)
Time to complete TeraSort benchmark
32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node), IBM Java 8

[Chart: Spark HiBench TeraSort [30GB] execution time — 556s over TCP/IP vs 159s over JSoR]
Faster storage IO with POWER CAPI/Flash
POWER8 architecture offers a 40TB Flash drive attached via the Coherent Accelerator Processor Interface (CAPI)
– Provides simple coherent block IO APIs
– No file system overhead
Power Service Layer (PSL)
– Performs address translations
– Maintains cache
– Simple, but powerful interface to the Accelerator unit

Coherent Accelerator Processor Proxy (CAPP)
– Maintains directory of cache lines held by the Accelerator
– Snoops PowerBus on behalf of the Accelerator
Faster disk IO with CAPI/Flash-enabled Spark
When under memory pressure, Spark spills RDDs to disk
– Happens in ExternalAppendOnlyMap and ExternalSorter

We have modified Spark to spill to the high-bandwidth, coherently-attached Flash device instead
– Replacement for DiskBlockManager
– New FlashBlockManager handles spill to/from flash

Making this pluggable requires some further abstraction in Spark:
– Spill code assumes using disks, and depends on DiskBlockManager
– We are spilling without using a file system layer
Dramatically improves performance of executors under memory pressure.
Allows reaching similar performance with much less memory (denser cloud deployments).

[Photo: IBM FlashSystem 840, POWER8 + CAPI]
e.g. using CAPI Flash for RDD caching allows for 4X memory reduction while maintaining equal performance
Offloading tasks to graphics co-processors
JIT optimized GPU acceleration

As the JIT compiles a stream expression we can identify candidates for GPU off-loading
– Arrays copied to and from the device implicitly
– Java operations mapped to GPU kernel operations
– Preserves standard Java syntax and semantics

Comes with caveats
– Recognizes a limited set of operations within the lambda expressions
  • notably, no object references maintained on the GPU
– Default grid dimensions and operating parameters for the GPU workload
– Redundant/pessimistic data transfer between host and device
  • Not using GPU shared memory
– Limited heuristics about when to invoke the GPU and when to generate CPU instructions

[Diagram: bytecodes → intermediate representation → optimizer → CPU code generator (CPU native code) and GPU code generator (PTX ISA)]
GPU-enabled Standard Java APIs
Some Arrays.sort() methods will offload work to GPUs today
– e.g. sorting large arrays of ints
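A hedged sketch of that workload shape: the call below is plain Java; on a GPU-enabled IBM JDK, a large primitive-int sort like this is what the slide says may be offloaded (the size threshold and offload behaviour are JVM-specific and not documented here):

```java
import java.util.Arrays;
import java.util.Random;

public class SortOffload {
    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) if (a[i - 1] > a[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        // A large primitive int array: per the slide, the kind of work a
        // GPU-enabled JDK can choose to sort on the device. The API call
        // is standard Java; any offloading is transparent to the caller.
        int[] data = new Random(42).ints(1_000_000).toArray();

        Arrays.sort(data);

        System.out.println("sorted: " + isSorted(data));
    }
}
```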
GPU optimization of Loops and Lambda expressions

[Chart: speed-up factor (log scale, 0.01x – 1000x) vs matrix size when run on a GPU-enabled host, comparing auto-SIMD, parallel forEach on CPU, and parallel forEach on GPU]

The JIT can recognize parallel stream code, and automatically compile it down to the GPU.
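A hedged sketch of the kind of code the JIT can recognize (a generic parallel stream over primitive arrays; whether it actually runs on the GPU depends on the JVM and hardware):

```java
import java.util.stream.IntStream;

public class GpuLambda {
    // Element-wise multiply: every iteration is independent and touches only
    // primitive arrays by index, which is what makes this loop a candidate
    // for translation to a GPU kernel by a capable JIT.
    static float[] multiply(float[] a, float[] b) {
        float[] c = new float[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> c[i] = a[i] * b[i]);
        return c;
    }

    public static void main(String[] args) {
        int n = 512;
        float[] a = new float[n * n], b = new float[n * n];
        for (int i = 0; i < n * n; i++) { a[i] = i; b[i] = 2.0f; }

        float[] c = multiply(a, b);
        System.out.println(c[10]); // 10 * 2.0f = 20.0
    }
}
```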
DataFrames and DataSets Codegen Optimizations

Spark uses the data schema to generate code
– Builds logical and physical execution plans
– Loops through the rows in the data set to:
  ● access serialized storage, convert data to VM objects, and call the user function / execute the SQL plan
– Code is generated and compiled at runtime

[Diagram: the user's SQL operators are compiled into the generated loop, which calls out to arbitrary user code]
Storage Layout Optimization
UnsafeRow representation
– Data laid out in a heap byte array
– Accessed via sun.misc.Unsafe method calls

Conversion to Columnar Format
– Allows for compression and better vector processing
Generated Code

/* 049 */ while (inputadapter_input.hasNext()) {
/* 050 */   InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 051 */   boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 052 */   int inputadapter_value = inputadapter_isNull ? -1 : (inputadapter_row.getInt(0));
/* 054 */   if (!(!(inputadapter_isNull))) continue;
/* 056 */   boolean filter_isNull2 = false;
/* 058 */   boolean filter_value2 = false;
/* 059 */   filter_value2 = inputadapter_value > 8;
/* 060 */   if (!filter_value2) continue;
/* 061 */   boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1);
/* 062 */   double inputadapter_value1 = inputadapter_isNull1 ? -1.0 : (inputadapter_row.getDouble(1));
/* 064 */   if (!(!(inputadapter_isNull1))) continue;
/* 066 */   boolean filter_isNull7 = false;
/* 068 */   boolean filter_value7 = false;
/* 069 */   filter_value7 = org.apache.spark.util.Utils.nanSafeCompareDoubles(inputadapter_value1, 24.0D) > 0;
/* 070 */   if (!filter_value7) continue;
/* 072 */   filter_numOutputRows.add(1);
/* 074 */   // do aggregate
/* 075 */   // common sub-expressions
/* 077 */   // evaluate aggregate function
/* 078 */   boolean agg_isNull1 = false;
/* 080 */   long agg_value1 = -1L;
/* 081 */   agg_value1 = agg_bufValue + 1L;
/* 082 */   // update aggregation buffer
/* 083 */   agg_bufIsNull = false;
/* 084 */   agg_bufValue = agg_value1;
/* 085 */   if (shouldStop()) return;
/* 086 */ }

Code generation is shared between columnar and row-based storage
● This introduces huge overhead
● The complex loop is not a counted loop, and contains numerous branches (due to nullability checks), calls, etc.
Optimizing Spark's Generated Code

● ColumnarBatch format instead of CachedBatch for the in-memory cache
● Common compression scheme (e.g. lz4) for all of the data types in a ColumnVector
● Simple and specialized runtime code for building a concrete instance of the in-memory cache
● Read data directly from the ColumnVector in the whole-stage codegen

/* 073 */ while (inmemorytablescan_batch != null) {
/* 074 */   int numRows = inmemorytablescan_batch.numRows();
/* 075 */   while (inmemorytablescan_batchIdx < numRows) {
/* 076 */     int inmemorytablescan_rowIdx = inmemorytablescan_batchIdx++;
/* 077 */     boolean inmemorytablescan_isNull = inmemorytablescan_colInstance0.isNullAt(inmemorytablescan_rowIdx);
/* 078 */     int inmemorytablescan_value = inmemorytablescan_isNull ? -1 : (inmemorytablescan_colInstance0.getInt(inmemorytablescan_rowIdx));
/* 080 */     // filtering out values and counting rows similar to the old code
/* 081 */     ...
/* 112 */   }
/* 113 */   inmemorytablescan_batch = null;
/* 114 */   inmemorytablescan_nextBatch();
/* 115 */ }

3.4x performance improvement
Summary
We are focused on core runtime performance to get a multiplier up the Apache Spark stack.
– More efficient code, more efficient memory usage/spilling, more efficient serialization & networking, etc.
There are hardware and software technologies we can bring to the party.– We can tune the stack from hardware to high level structures for running Apache Spark.
Apache Spark and Scala developers can help themselves by their style of coding.
All the changes are being made in the Java runtime or being pushed out to the Apache Spark community.
There is lots more stuff I don't have time to talk about, like GC optimizations, object layout, monitoring VM/Apache Spark events, hardware compression, security, etc. etc.
– mailto:
http://ibm.biz/sparkkit
Questions?