Spark Tips & Tricks

Page 1: Spark Tips & Tricks


Tips + Tricks: Best Practices + Recommendations for Apache Spark

Jason Hubbard | Systems Engineer @ Cloudera

Page 2: Spark Tips & Tricks


Quick Spark Overview
Promise!

Page 3: Spark Tips & Tricks


Overview: Spark Components

Spark is a fast, general-purpose cluster computing platform. It distributes processing across a cluster of nodes in parallel in order to process data quickly.

Each Spark application gets its own executor processes which stay up for the duration of the application. The executor runs tasks in multiple threads.

The driver program coordinates tasks and handles resource requests to the cluster manager. The driver distributes application code to each executor. Normally, each task takes 1 core.

Example: When Spark is run interactively via pyspark or spark-shell, executors are assigned to the shell (driver) until the shell is exited. Each time the user invokes an action on the data, the driver runs that action on each executor as tasks.

Page 4: Spark Tips & Tricks


Spark on YARN

YARN (Yet Another Resource Negotiator) is a resource manager that Spark can use as its cluster manager (and is recommended for use with CDH).

[Diagram: Spark Driver communicating with the YARN Resource Manager and Node Managers]

• The Spark Driver submits the initial request to the Resource Manager
• The Spark Application Master is launched
• The Spark Application Master coordinates with the Resource Manager and Node Managers to launch containers and Executors
• The Spark Driver coordinates execution of the application with the Executors

Page 5: Spark Tips & Tricks


What About pyspark?

• Spark operates on the JVM
• Objects are serialized for performance
• Python ≠ Java
• Spark uses Py4J to move data between the JVM and Python
  • Extra serialization cost (Pickle)
• Note: DataFrames create a query plan out of pyspark and execute in the JVM
  • But only if there are no UDFs
  • UDFs kick back to the double serialization cost
• Apache Arrow aims to solve some of these issues

Page 6: Spark Tips & Tricks


Resource Management

How do I figure out the number of executors, cores, and memory?

Page 7: Spark Tips & Tricks


What are we doing here?

• num-executors, executor-cores, executor-memory
  • Obviously pretty important
• How do we configure these?
  • Tip: use executor-cores to drive the rest
  • i.e. X = total cores, Y = executor-cores, then X / Y = num-executors
  • i.e. Z = total memory, then Z / (X / Y) = executor-memory
• Good rule of thumb: try ~5 executor-cores
• Notes!
  • Too few cores per executor doesn't take advantage of multiple tasks running in an executor (e.g. sharing broadcast variables)
  • Too many tasks can create bad HDFS I/O (problems with concurrent threads)
  • You can't give all your resources to Spark
    • At the very least, the YARN AM needs a container, the OS needs memory/cores, HDFS needs memory/cores, YARN needs memory/cores, off-heap memory, other services...
(See the spark-submit sketch below.)
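As a rough sketch of the arithmetic above (the cluster size is illustrative and matches the worked example later in the deck): with 48 usable cores, 420 GB of usable RAM, and Y = 4 executor-cores, X / Y = 48 / 4 = 12 executors and Z / (X / Y) = 420 / 12 ≈ 35 GB per executor. Leaving headroom for off-heap overhead, the request might look like this (my-app.jar is a placeholder):

spark-submit --master yarn \
  --num-executors 12 \
  --executor-cores 4 \
  --executor-memory 32g \
  my-app.jar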

Page 8: Spark Tips & Tricks


Quick Aside on Memory Usage

• yarn.nodemanager.resource.memory-mb – what YARN is working with
• spark.executor.memory – memory per executor process (heap memory)
• spark.yarn.executor.memoryOverhead – off-heap memory
  • Default is max(384 MB, 0.1 * spark.executor.memory)
• Memory is pretty key – it controls how much data you can process, group, join, cache, shuffle, etc.
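A quick illustrative calculation (the numbers are examples, not recommendations): with spark.executor.memory=35g, the default overhead is max(384 MB, 0.1 × 35 GB) = 3.5 GB, so each executor container asks YARN for roughly 38.5 GB. yarn.nodemanager.resource.memory-mb has to cover heap plus overhead for every executor on the node, not just the heap.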

Page 9: Spark Tips & Tricks


Unified Memory Management

• Storage/execution memory unified (Spark 1.6)
  • Evicts storage, not execution
  • Specify a minimum unevictable amount (not a reservation)
• Memory across tasks (Spark 1.0)
  • Static vs. dynamic (slots determined dynamically)
  • Fair and starvation-free; static is simpler, dynamic is better for stragglers
• Off by default in CDH (performance regressions)
  • Toggle with spark.memory.useLegacyMode
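A minimal sketch of the knobs involved, assuming you want the unified manager turned on (0.6 and 0.5 are the upstream Spark 2 defaults; spark.memory.storageFraction is the minimum unevictable storage share mentioned above; my-app.jar is a placeholder):

spark-submit \
  --conf spark.memory.useLegacyMode=false \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  my-app.jar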

Page 10: Spark Tips & Tricks


Worked Example

Cluster: 4 worker nodes, each with 16 cores and 128 GB RAM = 64 total cores, 512 GB RAM in the cluster.

Core/RAM allocation (per host):
• 1 core / 4 GB RAM for the OS
• 1 core / 1 GB RAM for the CM agent
• 1 core / 1 GB RAM for the NodeManager
• 1 core / 1 GB RAM for the DataNode
• 12 GB RAM for overhead
• Leaves 12 cores and 105 GB RAM for executors per host
• 12 x 4 = 48 total cores, 105 x 4 = 420 GB RAM for executors

Allocate resources – try different executor/core ratios:
• 4 cores per executor → 12 executors with 4 cores, 35 GB RAM each
• 5 cores per executor → 9 executors with 5 cores, 46 GB RAM each (leaves cores un-utilized)
• 6 cores per executor → 8 executors with 6 cores, 52 GB RAM each

Determine the optimal resource allocation for the Spark job: GO AND TEST! Also, keep executor memory < 64 GB (GC delays).

Page 11: Spark Tips & Tricks


This could be easier: Dynamic Resource Allocation

YARN will handle the sizing and distribution of the requested executors (CDH 5.5+).

This configuration allows the dynamic allocation of between 1 and 20 executors. Spark will initially attempt to run the job with 5 executors; this helps speed up jobs which you know ahead of time will require a certain number of executors.

Configuration settings should be placed in the Spark configuration file; however, they can also be submitted with the job.

spark-submit --class com.cloudera.example.YarnExample \
  --master yarn-cluster \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.minExecutors=1" \
  --conf "spark.dynamicAllocation.maxExecutors=20" \
  --conf "spark.dynamicAllocation.initialExecutors=5" \
  lib/yarn-example.jar \
  10

Dynamic allocation in YARN only handles allocation of executors. The number of cores per executor is set via the Spark configuration; cores are not dynamically sized.

It is still important to understand the sizing limitations of your cluster in order to properly set the min and max executor settings as well as the executor-cores setting. Spark also lets you control the timeout of executors when they are not being used.

Page 12: Spark Tips & Tricks


Warning: Shuffle Block Size

• Spark imposes a 2 GB limit on shuffle block size
• Anything larger will cause application errors
• How do we fix this? Create more partitions
  • Spark Core: rdd.repartition, rdd.coalesce
  • Spark SQL: spark.sql.shuffle.partitions (default is 200!)
• Note: if partitions > 2000, Spark uses HighlyCompressedMapStatus
• Rule of thumb: 128 MB/partition
• Avoid data skew
(See the repartitioning sketch below.)
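A minimal Scala sketch of the 128 MB rule of thumb (the data size and input path are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionSizing").getOrCreate()
val rdd = spark.sparkContext.textFile("hdfs:///data/events")   // illustrative input

// ~100 GB of shuffle data at ~128 MB per partition => ~800 partitions
val targetPartitions = (100L * 1024 / 128).toInt

// Spark Core: increase partitions before a wide operation
val repartitioned = rdd.repartition(targetPartitions)

// Spark SQL: raise the shuffle partition count from the default of 200
spark.conf.set("spark.sql.shuffle.partitions", targetPartitions.toString)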

Page 13: Spark Tips & Tricks


Data Skew

• Avoid this!
• Skew slows down jobs/queries
• May even cause errors/break Spark (partition > 2 GB, etc.)

[Chart: evenly sized partitions ("Good") vs. skewed partitions ("Not Good")]

Page 14: Spark Tips & Tricks


How to Avoid Skew

• If things are running slowly, always inspect the data/partition sizes
• If you notice skew, try adding a salt to keys
  • With salt, do a two-stage operation: one on salted keys, then one on the unsalted results
  • There are more transformations, but the work is spread around better, so the job should be faster
  • If there is a small number of skewed keys, you can try isolated salting
• Note: less of a problem in Spark SQL
  • More efficient columnar cached representation
  • Able to push some operations down to the data store
  • Optimizer is able to look inside our operations (partition pruning, predicate pushdowns, etc.)
(See the salting sketch below.)
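A minimal Scala sketch of the two-stage salted aggregation, assuming a pair RDD of (key, count); the salt range of 10 and the sample data are illustrative:

import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("SaltedAggregation").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("hot", 1), ("hot", 1), ("cold", 1)))

// Stage 1: aggregate on salted keys so a hot key is spread across many tasks
val salted = pairs
  .map { case (k, v) => ((k, Random.nextInt(10)), v) }
  .reduceByKey(_ + _)

// Stage 2: drop the salt and aggregate the much smaller partial results
val totals = salted
  .map { case ((k, _), v) => (k, v) }
  .reduceByKey(_ + _)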

Page 15: Spark Tips & Tricks


Finding Skew

Page 16: Spark Tips & Tricks



• ReduceByKey over GroupByKey
  • GroupByKey is unbounded and dependent on the data
• Tree(Reduce/Aggregate) over Reduce/Aggregate
  • Push more work to the workers and return less data to the driver
• Less Spark
  • mapPartitions – reuse resources (e.g. JDBC connections)
  • Group or reduce by key, then process in memory
• Submit multiple jobs concurrently
  • Java concurrency
(See the sketch below.)
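A minimal Scala sketch contrasting the first two points (the word-count data is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReduceVsGroup").getOrCreate()
val words = spark.sparkContext.parallelize(Seq("a", "b", "a")).map(w => (w, 1))

// groupByKey ships every value for a key to a single task before summing (unbounded per key)
val viaGroup = words.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so much less data is shuffled
val viaReduce = words.reduceByKey(_ + _)

// treeReduce pushes more of the final combine onto executors instead of the driver
val total = words.map(_._2).treeReduce(_ + _)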

Page 17: Spark Tips & Tricks



• In general, Spark stores data in memory as deserialized Java objects and on disk/over the network as serialized binary
• Serialize things before you send them to executors, or don't send things that don't need to be serialized
• This is bad (the whole myObj is captured by the closure and serialized):

  val myObj = <something>
  myRDD.filter(x => x == myObj.value)

• This is good (only the needed value is captured):

  val myObj = <something>
  val myVal = myObj.value
  myRDD.filter(x => x == myVal)

• Kryo is better than Java serialization. Register custom classes.
• Cut fat off of data objects. Only use what you need.
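A minimal sketch of switching to Kryo and registering custom classes (Event is an illustrative class name):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Event(id: Long, payload: String)   // illustrative custom class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))

val spark = SparkSession.builder().config(conf).appName("KryoExample").getOrCreate()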

Page 18: Spark Tips & Tricks



• This seems trivial, but getting into good habits can save a lot of time
• Name your jobs
  • sc.setJobGroup
• Name your RDDs
  • rdd.setName
• Which is easier to understand? (See the sketch below.)
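A minimal Scala sketch of both calls (the names and path are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NamedJobs").getOrCreate()
val sc = spark.sparkContext

// Groups the stages of the following action under a readable name in the Spark UI
sc.setJobGroup("daily-report", "Aggregate yesterday's events")

val events = sc.textFile("hdfs:///data/events")
events.setName("raw events")   // shows up on the Storage tab once cached
events.cache()
println(events.count())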

Page 19: Spark Tips & Tricks


Check Your Dependencies

• Incorrectly built jars and version mismatches are common issues
• Common errors:
  • Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
  • java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig
  • java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
  • ImportError: No module named qt_returns_summary_wrapper
  • NoSuchMethodException
• Double check and verify your artifacts
• Use the versions of components provided by your distro – those versions will work together

Page 20: Spark Tips & Tricks


Spark Streaming

It's all about the micro-batches

Page 21: Spark Tips & Tricks


Spark Streaming

• Incoming data is represented as DStreams (Discretized Streams)
• Data is commonly read from streaming channels like Kafka or Flume
• A Spark Streaming application is a DAG of transformations and actions on DStreams (and RDDs)

Page 22: Spark Tips & Tricks


Discretized Stream

• Incoming data stream is broken down into micro-batches
• Micro-batch size is user defined, usually 0.3 to 1 second
• Micro-batches are disjoint
• Each micro-batch is an RDD
• Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming is known for high throughput

Page 23: Spark Tips & Tricks


Windowed DStreams

• Defined by specifying a window size and a step size
  • Both are multiples of the micro-batch size
• Operations are invoked on each window's data

Page 24: Spark Tips & Tricks


Fault Tolerance

• Handle driver failures
  • Use a process control system, or
  • Submit with deploy-mode=cluster so the driver runs in the Application Master
    • A driver failure then causes an Application Master failure
• Allow more application attempts (default 2), try 4
• Reset failure counts after an interval (default none), try 1h
• spark.yarn.max.executor.failures (default max(2 * num executors, 3)), try 8 * num_executors
• spark.yarn.executor.failuresValidityInterval (default none), try 1h
• spark.task.maxFailures (default 4), try 8
(See the spark-submit sketch below.)
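A sketch combining the settings the slide names (streaming-app.jar is a placeholder; 32 assumes 4 executors x 8; the application-attempt flags are omitted because their names are not on the slide):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.task.maxFailures=8 \
  --conf spark.yarn.max.executor.failures=32 \
  --conf spark.yarn.executor.failuresValidityInterval=1h \
  streaming-app.jar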

Page 25: Spark Tips & Tricks


Graceful Shutdown

• Design for failure, but you may need to finish in-flight batches
• yarn application -kill [applicationId] (may stop in the middle of a batch)
• A JVM shutdown hook runs too late
• spark.streaming.stopGracefullyOnShutdown doesn't work on YARN
• Use a marker file or an HTTP endpoint (see the sketch below)
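A minimal Scala sketch of the marker-file approach, assuming the driver polls HDFS for a stop flag (the path and intervals are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("GracefulStop")
val ssc = new StreamingContext(conf, Seconds(1))
// ... define DStreams and output operations here ...
ssc.start()

val stopFlag = new Path("hdfs:///app/shutdown-marker")   // illustrative path
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)

// Poll for the marker instead of blocking forever in awaitTermination()
var stopped = false
while (!stopped) {
  stopped = ssc.awaitTerminationOrTimeout(10000)          // re-check every 10 s
  if (!stopped && fs.exists(stopFlag)) {
    // Finish in-flight batches, then stop the streaming and Spark contexts
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}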

Page 26: Spark Tips & Tricks


Prevent Data Loss

• Receiver-based
  • Enable checkpointing
  • Enable the WAL (write-ahead log)
  • Upgrades won't work – delete the checkpoint dir!
• Direct
  • Checkpoint offsets (upgrades won't work)
  • Or save offsets manually in ZK, HDFS, HBase, an RDBMS, etc.
(See the checkpoint sketch below.)
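A minimal Scala sketch of checkpointing with recovery (the checkpoint path is illustrative; the WAL setting applies to receiver-based input):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///app/checkpoints"   // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("CheckpointedStream")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define DStreams and output operations here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()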

Page 27: Spark Tips & Tricks


Performance

• Prevent starvation: create a dedicated pool
  • --queue realtime_queue
• Protect against stragglers: enable speculation
  • --conf spark.speculation=true
• With a single receiver, most tasks execute on the same node as the receiver
  • Increase replication or lower spark.locality.wait
• Tune the batch time; use inverse functions (for windowed operations)
• mapWithState instead of updateStateByKey (cost scales with the size of the batch instead of the state)

Page 28: Spark Tips & Tricks


Parallelism/Partitions

• Repartition (may cause a shuffle)
• Batch interval & block interval determine the number of tasks (changing them may cause a shuffle)
  • Lower the block interval to increase tasks (minimum 50 ms)
  • Batch interval / block interval should ≈ number of executors
• Multiple receivers with union (avoids a shuffle)
• Kafka direct: increase Kafka partitions
• For receivers, don't forget the receiver itself consumes a long-running task
(See the example below.)
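A quick illustrative calculation: with a 1 second batch interval and a 200 ms block interval, each receiver produces 1000 / 200 = 5 blocks, i.e. 5 tasks, per batch; lowering the block interval to 100 ms doubles that to 10, down to the 50 ms minimum noted above.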

Page 29: Spark Tips & Tricks


Security

• Submit a principal and keytab for a secured cluster via spark-submit
  • --principal user/hostname@domain --keytab keytabfile
• Disable the HDFS cache with an HA NameNode (HDFS-9276, SPARK-11182)
  • --conf spark.hadoop.fs.hdfs.impl.disable.cache=true

Page 30: Spark Tips & Tricks


Backpressure

• Finds the optimal number of records for the processing time
• May hide lag – monitor it
• Smooth startup:
  • Spark 2: spark.streaming.backpressure.initialRate
  • Spark 1:
    • Receiver: spark.streaming.backpressure.initialRate
    • Direct: spark.streaming.kafka.maxRatePerPartition

Page 31: Spark Tips & Tricks


Logging

• Enable the YARN rolling log aggregator
  • yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
• Configure a Spark rolling-log strategy
– or –
• Custom log4j appender
  • --conf …
  • --conf spark.executor.extraJavaOptions=-…
  • --files /path/to/…
(See the sketch below.)
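The exact flag values are not spelled out above; a typical pattern for shipping a custom log4j file looks like this (the file name, the driver flag, and my-app.jar are illustrative assumptions):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --files /path/to/custom-log4j.properties \
  my-app.jar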

Page 33: Spark Tips & Tricks


S3

• Treats S3 as a filesystem; S3A is preferred over S3N and S3
• Eventually consistent listing
  • Consider writing to HDFS first, then copying to S3
• DirectParquetOutputCommitter was removed in Spark 2
• If writing directly to S3, use the version 2 commit algorithm and turn off speculation
  • spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
  • spark.speculation=false

Page 34: Spark Tips & Tricks


Parquet on S3

• When reading Parquet, enable random I/O
  • fs.s3a.experimental.input.fadvise=random (set at the filesystem level, when it is created)
• spark.hadoop.parquet.enable.summary-metadata=false
• spark.sql.parquet.mergeSchema=false
• spark.sql.parquet.filterPushdown=true
• spark.sql.hive.metastorePartitionPruning=true

Page 35: Spark Tips & Tricks


S3 Performance

• fs.s3a.block.size
• Tune fs.s3a.multipart.threshold and fs.s3a.multipart.size
• spark.hadoop.fs.s3a.readahead.range 67108864
• Experimental: buffer uploads in memory
• fs.s3a.threads.max – threads for active and queued uploads

Page 36: Spark Tips & Tricks


YARN

• yarn.scheduler.fair.locality.threshold.node = -1
• yarn.scheduler.fair.locality.threshold.rack = -1
• spark.locality.wait.rack=0

Page 37: Spark Tips & Tricks


Thank You
[email protected]