spark tips & tricks

Tips + Tricks: Best Practices + Recommendations for Apache Spark
Jason Hubbard | Systems Engineer @ Cloudera
© Cloudera, Inc. All rights reserved.

Author: jason-hubbard

Post on 05-Apr-2017






Quick Spark Overview
Promise!


Overview: Spark Components
Spark is a fast, general-purpose cluster computing platform. Spark takes advantage of parallelism by distributing processing across a cluster of nodes in order to provide fast processing of data.

Each Spark application gets its own executor processes which stay up for the duration of the application. The executor runs tasks in multiple threads.

The driver program coordinates tasks and handles resource requests to the cluster manager. The driver distributes application code to each executor. Normally, each task takes 1 core. Example: when Spark is run interactively via pyspark or spark-shell, executors are assigned to the shell (the driver) until the shell is exited. Each time the user invokes an action on the data, the driver invokes that action on each executor as a task.


Spark on YARN
YARN (Yet Another Resource Negotiator) is a resource manager that Spark can use as its cluster manager (and is recommended for use with CDH).

[Diagram: Resource Manager; two Node Managers, each with a Container running an Executor; Application Master; Spark Driver]

- The Spark Driver submits the initial request to the Resource Manager
- The Spark Application Master is launched
- The Spark Application Master coordinates with the Resource Manager and Node Managers to launch containers and Executors
- The Spark Driver coordinates execution of the application with the executors


What About pyspark?
- Spark operates on the JVM
  - Objects are serialized for performance
- Python =/= Java
  - Spark uses Py4J to move data from the JVM to Python
  - Extra serialization cost (Pickle)
- Note: DataFrames create a query plan out of pyspark and execute in the JVM
  - But only if there are no UDFs
  - UDFs kick back to the double serialization cost
- Apache Arrow aims to solve some of these issues

Resource Management
How do I figure out the number of executors, cores, and memory?

What are we doing here?
- num-executors, executor-cores, executor-memory: obviously pretty important
- How do we configure these?
- Tip: use executor-cores to drive the rest
  - With X = total cores and Y = executor-cores, X/Y = num-executors
  - With Z = total memory, Z/(X/Y) = executor-memory
- Good rule of thumb: try ~5 executor-cores

Notes!
- Too few cores doesn't take advantage of multiple tasks running in an executor (ex: sharing broadcast variables)
- Too many tasks can create bad HDFS I/O (problems with concurrent threads)
- You can't give all your resources to Spark
  - At the very least, the YARN AM needs a container, the OS needs memory/cores, HDFS needs memory/cores, YARN needs memory/cores, plus off-heap memory, other services...

Quick Aside on Memory Usage
- yarn.nodemanager.resource.memory-mb: what YARN is working with
- spark.executor.memory: memory per executor process (heap memory)
- spark.yarn.executor.memoryOverhead: off-heap memory
  - Default is max(384 MB, 0.1 * spark.executor.memory)
- Memory is pretty key: it controls how much data you can process, group, join, cache, shuffle, etc.
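
The default overhead rule above can be sketched as a small calculation. This is only an illustration of the max(384 MB, 0.1 * spark.executor.memory) formula from the slide; the function name and the integer percentage parameter are assumptions made for the example, not part of Spark.

```python
def executor_container_mb(executor_memory_mb, overhead_percent=10, min_overhead_mb=384):
    """Return (overhead_mb, container_mb) for one executor.

    Hypothetical helper: applies the slide's default rule
    overhead = max(384 MB, 10% of executor memory) using integer math.
    """
    overhead_mb = max(min_overhead_mb, executor_memory_mb * overhead_percent // 100)
    # The YARN container must hold the heap plus the off-heap overhead.
    return overhead_mb, executor_memory_mb + overhead_mb


if __name__ == "__main__":
    for heap_mb in (2048, 35 * 1024):
        print(heap_mb, executor_container_mb(heap_mb))
```

For small heaps the 384 MB floor dominates; for larger heaps the 10% term does, so the container YARN has to grant is noticeably bigger than spark.executor.memory alone.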

Unified Memory Management
- Storage/Execution (Spark 1.6)
  - Evicts storage, not execution
  - Specify a minimum unevictable amount (not a reservation)
- Tasks (Spark 1.0)
  - Static vs dynamic (slots determined dynamically)
  - Fair and starvation free; static is simpler, dynamic is better for stragglers
- Off by default in CDH (performance regressions)
- Toggle with spark.memory.useLegacyMode

Worked Example
Worker nodes: 4 hosts, each with 16 cores and 128 GB RAM (64 total cores and 512 GB RAM in the cluster).

Core/RAM allocation (per host):
- OS: 1 core, 4 GB RAM
- CM agent: 1 core, 1 GB RAM
- NM: 1 core, 1 GB RAM
- DN: 1 core, 1 GB RAM
- Overhead: 12 GB RAM
- Left for executors: 12 cores, 105 GB RAM

Across the cluster: 12 x 4 = 48 total cores, 105 x 4 = 420 GB RAM.

Allocate resources: try different executor/core ratios to determine the optimal resource allocation for the Spark job.
- 4 cores per executor: 48 cores give 12 executors with 4 cores and 35 GB RAM each
- 5 cores per executor: 48 cores give 9 executors with 5 cores and 46 GB RAM each (leaves cores un-utilized)
- 6 cores per executor: 48 cores give 8 executors with 6 cores and 52 GB RAM each

GO AND TEST! Also, keep executor memory < 64 GB (GC delays).
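
The three options in the worked example follow from the X/Y and Z/(X/Y) rule. Below is a minimal sketch of that arithmetic, assuming the cluster-wide totals from the slide (48 cores, 420 GB); the function name is made up for illustration. It divides cluster-wide, as the slide does, and does not check how executors pack onto individual hosts.

```python
def plan_executors(total_cores, total_mem_gb, cores_per_executor):
    """Return (num_executors, mem_per_executor_gb, idle_cores).

    Hypothetical helper mirroring the slide: executors = X / Y and
    memory per executor = Z / (X / Y), both rounded down.
    """
    num_executors = total_cores // cores_per_executor
    mem_per_executor_gb = total_mem_gb // num_executors
    idle_cores = total_cores - num_executors * cores_per_executor
    return num_executors, mem_per_executor_gb, idle_cores


if __name__ == "__main__":
    for cores in (4, 5, 6):
        executors, mem_gb, idle = plan_executors(48, 420, cores)
        print(f"{cores} cores/executor: {executors} executors, "
              f"{mem_gb} GB each, {idle} cores idle")
```

The idle-core count makes the trade-off in the 5-core option visible: 3 of the 48 cores go unused.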



This could be easier: Dynamic Resource Allocation
YARN will handle the sizing and distribution of the requested executors (CDH 5.5+).

The configuration below allows the dynamic allocation of between 1 and 20 executors. Spark will initially attempt to run the job with 5 executors, which helps to speed up jobs that you know ahead of time will require a certain number of executors. Configuration settings should be placed in the Spark configuration file; however, they can also be submitted with the job.

    spark-submit --class com.cloudera.example.YarnExample \
      --master yarn-cluster \
      --conf "spark.dynamicAllocation.enabled=true" \
      --conf "spark.dynamicAllocation.minExecutors=1" \
      --conf "spark.dynamicAllocation.maxExecutors=20" \
      --conf "spark.dynamicAllocation.initialExecutors=5" \
      lib/yarn-example.jar \
      10

- Dynamic allocation of resources in YARN only handles the allocation of executors.
- The number of cores per executor is set through the Spark configuration; cores are not dynamically sized.
- It is still important to understand the sizing limitations of your cluster in order to properly set the min and max executor settings as well as the executor-cores setting.
- Spark also lets you control the timeout of executors when they are not being used.


Warning: Shuffle Block Size
- Spark imposes a 2 GB limit on shuffle block size
  - Anything larger will cause application errors
- How do we fix this? Create more partitions
  - Spark core: rdd.repartition, rdd.coalesce
  - Spark SQL: spark.sql.shuffle.partitions (default is 200!)
- Note: if partitions > 2000, Spark uses HighlyCompressedMapStatus
- Rule of thumb: 128 MB/partition
- Avoid data skew
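
The 128 MB/partition rule of thumb turns into a partition count with one division. The sketch below is only that arithmetic; the function name and the example data size are assumptions, and the result would be passed to something like rdd.repartition or spark.sql.shuffle.partitions.

```python
MB = 1024 * 1024


def partitions_for(total_bytes, target_partition_bytes=128 * MB):
    """Return a partition count so each partition holds about 128 MB.

    Hypothetical helper: rounds up with integer math and never returns 0.
    """
    return max(1, -(-total_bytes // target_partition_bytes))


if __name__ == "__main__":
    hundred_gb = 100 * 1024 * MB  # example shuffle size, assumed
    print(partitions_for(hundred_gb))
```

For a 100 GB shuffle this gives 800 partitions, four times the Spark SQL default of 200, which is why the default is worth checking on larger jobs.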

Data Skew
- Avoid this! Skew slows down jobs/queries
- May even cause errors or break Spark (partition > 2 GB, etc.)

[Figure: two panels labeled "Good" and "Not Good"]

How to Avoid Skew
- If things are running slowly, always inspect the data/partition sizes
- If you notice skew, try adding a salt to keys
  - With salt, do a two-stage operation: one on the salted keys, then one on the unsalted results
  - There are more transformations, but the work is spread around better, so the job should be faster
- If there is a small number of skewed keys, you can try isolated salting
- Note: less of a problem in Spark SQL
  - More efficient columnar cached representation
  - Able to push some operations down to the data store
  - Optimizer is able to look inside our operations (partition pruning, predicate pushdowns, etc.)
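
The two-stage salted operation can be illustrated without a cluster. The following is a plain-Python sketch, not Spark code: the function name, the number of salts, and the sample data are assumptions. Stage 1 aggregates on (key, salt), which in Spark would spread a hot key over several tasks; stage 2 drops the salt and combines the partial results.

```python
import random
from collections import defaultdict


def salted_sum(pairs, num_salts=4, seed=0):
    """Sum values per key in two stages using a random salt (illustrative only)."""
    rng = random.Random(seed)

    # Stage 1: aggregate on the salted key, so one hot key becomes
    # up to num_salts smaller groups.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[(key, rng.randrange(num_salts))] += value

    # Stage 2: strip the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


if __name__ == "__main__":
    data = [("hot", 1)] * 1000 + [("cold", 1)] * 10
    print(salted_sum(data))
```

The final totals are the same as an unsalted aggregation; only the intermediate grouping changes, which is the point of the technique.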

Finding Skew

DAGnabit
- reduceByKey over groupByKey
  - groupByKey is unbounded and dependent on the data
- treeReduce/treeAggregate over reduce/aggregate
  - Push more work to the workers and return less data to the driver
- Less Spark
  - mapPartitions: reuse resources (JDBC)
  - Group or reduce by, then process in memory
- Submit multiple jobs concurrently
  - Java concurrency
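
Why reduceByKey beats groupByKey can be shown by counting the records that would cross the network. This is a plain-Python model, not Spark code; the function names and the list-of-lists stand-in for partitions are assumptions made for the example.

```python
import operator


def shuffled_records_group_by_key(partitions):
    """groupByKey ships every (key, value) record to the reducers."""
    return sum(len(partition) for partition in partitions)


def shuffled_records_reduce_by_key(partitions):
    """reduceByKey combines inside each partition first, so each partition
    ships at most one record per distinct key."""
    return sum(len({key for key, _ in partition}) for partition in partitions)


def reduce_by_key(partitions, func):
    """Model of reduceByKey: combine locally per partition, then merge."""
    merged = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = func(local[key], value) if key in local else value
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


if __name__ == "__main__":
    parts = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5), ("b", 6)]]
    print(shuffled_records_group_by_key(parts))
    print(shuffled_records_reduce_by_key(parts))
    print(reduce_by_key(parts, operator.add))
```

With groupByKey the shuffle volume grows with the number of records; with reduceByKey it is bounded by the number of distinct keys per partition.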

Serialization
- In general, Spark stores data in memory as deserialized Java objects, and on disk or over the network as serialized binary
- Serialize things before you send them to executors, or don't send things that don't need to be serialized
- This is bad:

    val myObj = ...  // definition not shown on the slide
    myRDD.filter(x => x == myObj.value)  // the closure captures all of myObj

- This is good:

    val myObj = ...  // definition not shown on the slide
    val myVal = myObj.value
    myRDD.filter(x => x == myVal)  // the closure captures only myVal

- Kryo is better than Java serialization; register custom classes
- Cut the fat off of data objects; only use what you need

Labeling
- This seems trivial, but getting into good habits can save a lot of time
- Name your jobs: sc.setJobGroup
- Name your RDDs: rdd.setName
- Which is easier to understand?

Check Your Dependencies
- Incorrectly built jars and version mismatches are common issues
- Common errors:
  - Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
  - java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig
  - java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
  - ImportError: No module named qt_returns_summary_wrapper
  - NoSuchMethodException
- Double check and verify your artifacts
- Use the component versions provided by your distro; those versions will work together

Spark Streaming
It's all about the microbatches

Spark Streaming
- Incoming data is represented as DStreams (Discretized Streams)
- Data is commonly read from streaming data channels like Kafka or Flume

A spark-streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)

DStream is the abstraction, and each DStream has transformations and actions like RDDs:
- a subset of transformations and actions
- additional state operations
You can use the same RDD operations by iterating over the RDDs within a DStream.

Discretized Stream
- Incoming data stream is broken down into micro-batches
- Micro-batch size is user defined, usually 0.3 to 1 second
- Micro-batches are disjoint
- Each micro-batch is an RDD
- Effectively, a DStream is a sequence of RDDs, one per micro-batch
- Spark Streaming is known for high throughput

Can have larger batch sizes for operations like writing to HDFS.

Windowed DStreams

- Defined by specifying a window size and a step size
- Both are multiples of the micro-batch size
- Operations are invoked on each window's data

Example is a window size of 3 and a step size of 2.
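
A window size of 3 with a step size of 2 can be pictured by slicing a list of micro-batches. This is a simplified plain-Python model, not the DStream API: the function name is an assumption, and it emits only full windows, ignoring the partial windows a real stream produces at startup.

```python
def windows(batches, window_size=3, step=2):
    """Group micro-batches into overlapping windows (full windows only)."""
    return [
        batches[start:start + window_size]
        for start in range(0, len(batches) - window_size + 1, step)
    ]


if __name__ == "__main__":
    micro_batches = [1, 2, 3, 4, 5, 6, 7]  # one entry per micro-batch
    for window in windows(micro_batches):
        print(window)
```

Because the step is smaller than the window, consecutive windows share one micro-batch, so that batch's data is processed twice.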

Fault Tolerance
- Handle driver failures
  - Process control system
  - Submit with deploy-mode=cluster; the driver runs in the Application Master
- Driver failure causes Application Master failure
  - Set higher (default 2), try 4
  - Reset failure (default none), try 1h
- spark.yarn.max.executor.failures (default max(2 * num executors, 3)), try {8 * num_executors}
- spark.yarn.executor.failuresValidityInterval (default none), try 1h
- spark.task.maxFailures (default 4), try 8

Graceful Shutdown
- Design for failure, but you may need to finish batches
- yarn application -kill [applicationId] (may stop in the middle)
- A shutdown hook is too late
- spark.streaming.stopGracefullyOnShutdown doesn't work on YARN
- Use a marker file or HTTP endpoint
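
The marker-file approach amounts to checking for a file between batches and stopping cleanly once it appears. Below is a plain-Python sketch of that control flow only; the function names and the local-filesystem check are assumptions (a real job would typically check a path on HDFS and then ask the streaming context to stop gracefully).

```python
import os


def should_stop(marker_path):
    """True once an operator has created the marker file."""
    return os.path.exists(marker_path)


def run_batches(batches, marker_path, process):
    """Process batches in order and return how many were completed.

    The marker is checked only between batches, so a batch that has
    started is always finished before shutdown.
    """
    completed = 0
    for batch in batches:
        if should_stop(marker_path):
            break
        process(batch)
        completed += 1
    return completed
```

Unlike yarn application -kill, this never interrupts a batch in the middle: the marker only takes effect at the next batch boundary.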

Prevent Data Loss
- Receiver
  - Enable checkpointing
  - Enable WAL
  - Upgrades won't work; delete the checkpoint dir!
- Direct
  - Checkpoint offsets (upgrades won't work)
  - Save checkpoints manually in ZK, HDFS, HBase, RDBMS, etc.

Performance
- Prevent starvation: create a dedicated pool
  - --queue realtime_queue
- Protect against stragglers: enable speculation
  - --conf spark.speculation=true
- With a single receiver, most tasks execute on the same node as the receiver
  - Increase replication or lower spark.locality.wait (default 3s)
- Batch time, inverse function
- mapWithState instead of updateStateByKey (cost follows the size of the batch instead of the size of the state)

Parallelism/Partitions
- Repartition (may cause shuffle)
- Batch interval and block interval determine the number of tasks (may cause shuffle)
  - Lower the block interval to increase tasks (min 50 ms)
  - Batch interval / block interval should = # executors
- Multiple receivers with union (avoids shuffle)
- Kafka direct: increase Kafka partitions
- For receivers, don't forget that each receiver consumes a long-running task
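
The relationship between batch interval, block interval, and task count is a single division. The sketch below assumes one receiver and uses 200 ms as the default block interval (Spark's default for spark.streaming.blockInterval, which is not stated on the slide); the function name is made up for illustration.

```python
def tasks_per_batch(batch_interval_ms, block_interval_ms=200):
    """Return the number of blocks (and so tasks) one receiver yields per batch.

    Hypothetical helper: enforces the slide's 50 ms minimum block interval.
    """
    if block_interval_ms < 50:
        raise ValueError("block interval below the 50 ms minimum")
    return batch_interval_ms // block_interval_ms


if __name__ == "__main__":
    print(tasks_per_batch(1000))        # default block interval
    print(tasks_per_batch(1000, 50))    # lowest allowed block interval
```

Lowering the block interval from 200 ms to 50 ms on a 1-second batch raises the task count from 5 to 20, which is the lever the slide describes for matching tasks to available executors.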

Security
- Submit principal and keytab for a secured cluster at submit time
  - --principal user/[email protected] --keytab keytabfile
- Disable the HDFS cache with an HA NameNode (HDFS-9276, SPARK-11182)
  - --conf spark.hadoop.fs.hdfs.impl.disable.cache=true

Backpressure
- Find optimal records for processing time
- May hide lag; monitor
- Smooth startup
  - Spark 2: spark.streaming.backpressure.initialRate
  - Spark 1:
    - Receiver: spark.streaming.backpressure.initialRate
    - Direct: spark.streaming.kafka.maxRatePerPartition

Logging
- Enable the YARN rolling aggregator
  - yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
- Configure the Spark rolling strategy, or use a custom log4j appender
  - --conf /path/to/

Cloud

S3
- Treats S3 as a filesystem; S3A is preferred over S3N and S3
- Eventually consistent listing
- Consider writing to HDFS first, then copying to S3
- DirectParquetOutputCommitter removed from Spark 2
- If writing directly to S3, use the version 2 commit algorithm and turn off speculation
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
  - spark.speculation=false

Parquet on S3
- When reading Parquet, enable random I/O
  - fs.s3a.experimental.input.fadvise=random (filesystem level, when created)
- spark.hadoop.parquet.enable.summary-metadata=false
- spark.sql.parquet.mergeSchema=false
- spark.sql.parquet.filterPushdown=true
- spark.sql.hive.metastorePartitionPruning=true

S3 Performance
- fs.s3a.block.size
- Tune fs.s3a.multipart.threshold and fs.s3a.multipart.size
- spark.hadoop.fs.s3a.readahead.range 67108864
- Experimental (buffer in memory)
- active uploads queued

YARN
- yarn.scheduler.fair.locality.threshold.node = -1
- yarn.scheduler.fair.locality.threshold.rack = -1
- spark.locality.wait.rack=0

Thank You
