
Page 1: Spark Intro by Adform Research

Spark tutorial, developing locally and deploying on EMR

Page 2: Spark Intro by Adform Research

Use cases (my biased opinion)

• Interactive and Expressive Data Analysis

  • If you feel limited when trying to express yourself in “group by”, “join” and “where” (see the sketch after this list)

  • Only if it is not possible to work with datasets locally

• Entering the Danger Zone:

  • Spark SQL engine, like Impala/Hive

• Speed up ETLs if your data can fit in memory (speculation)

• Machine learning

• Graph analytics

• Streaming (not mature yet)
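A small illustration of the kind of per-key logic that quickly gets awkward in plain “group by” / “join” / “where”. This is only a sketch with made-up data; it assumes a SparkContext named sc is available (e.g. in spark-shell):

import org.apache.spark.SparkContext._   // pair-RDD functions (auto-imported in spark-shell)

// hypothetical (user, score) events
val events = sc.parallelize(Seq(("user1", 3), ("user1", 7), ("user1", 2), ("user2", 5)))

// top-2 scores per user, using arbitrary Scala code per key
val topPerUser = events
  .groupByKey()
  .mapValues(scores => scores.toSeq.sortBy(-_).take(2))

topPerUser.collect().foreach(println)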

Page 3: Spark Intro by Adform Research

Possible working styles

• Develop in IDE

• Develop as you go in Spark shell

IDE vs Spark-shell:

IDE
  • Easier to work with objects, inheritance and package management
  • Requires some hacking to get programs to run on both Windows and the production environment

Spark-shell
  • Easier to debug code with production-scale data
  • Will only run on Windows if you have correct line endings in the spark-shell launcher scripts, or if you use Cygwin

Page 4: Spark Intro by Adform Research

IntelliJ IDEA

• Basic set up https://gitz.adform.com/dspr/audience-extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spark-skeleton

Page 5: Spark Intro by Adform Research

Hacks

• 99% chance that on Windows you won’t be able to use function `saveAsTextFile()`

• Download exe file from http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path

• Place it somewhere on your PC in a bin folder (C:\\somewhere\\bin\\winutils.exe) and set it in your code before using the save function: System.setProperty("hadoop.home.dir", "C:\\somewhere\\")
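A minimal sketch of the workaround in a standalone app (object name and paths are illustrative, assuming winutils.exe sits in C:\somewhere\bin as described above):

import org.apache.spark.{SparkConf, SparkContext}

object WindowsSaveExample {
  def main(args: Array[String]): Unit = {
    // must point to the folder that contains bin\winutils.exe, and must be set before saving
    System.setProperty("hadoop.home.dir", "C:\\somewhere\\")

    val sc = new SparkContext(new SparkConf().setAppName("windows-save-example").setMaster("local[*]"))
    // without winutils.exe this save fails on Windows
    sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("C:\\tmp\\spark-output")
    sc.stop()
  }
}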

Page 6: Spark Intro by Adform Research

When you are done with your code…

• It is time to package everything to fat jar with sbt assembly

• Add “provided” to the library dependencies, since the Spark libs are already on the classpath if you run the job on EMR with Spark already set up

• Find more info in the Audience Extension project’s Spark branch build.sbt file (a rough sketch also follows below).

libraryDependencies += "org.apache.spark" %% "spark-core" %

"1.2.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-mllib" %

"1.2.0" % "provided"

Page 7: Spark Intro by Adform Research

Running on EMR

• build.sbt can be configured (S3 package) to upload the fat jar to S3 once assembly is done; if you don’t have that, just upload it manually

• Run the bootstrap action s3://support.elasticmapreduce/spark/install-spark with the arguments -v 1.2.0.a -x -g (some documentation at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark)

• Also install Ganglia for monitoring cluster load (run this before the Spark bootstrap step)

• If you don’t install Ganglia, SSH tunnels to the Spark UI won’t work.

Page 8: Spark Intro by Adform Research

Start with local mode first

Use only one instance in the cluster and submit your jar with this:

/home/hadoop/spark/bin/spark-submit \

--class com.adform.dspr.SimilarityJob \

--master local[16] \

--driver-memory 4G \

--conf spark.default.parallelism=112 \

SimilarityJob.jar \

--remote \

--input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* \

--output s3://dev-adform-data-engineers/tmp/spark/2days \

--similarity-threshold 300

Page 9: Spark Intro by Adform Research

Run on multiple machines with yarn master

/home/hadoop/spark/bin/spark-submit \

--class com.adform.dspr.SimilarityJob \

--master yarn \

--deploy-mode client \  # or cluster

--num-executors 7 \

--executor-memory 116736M \

--executor-cores 16 \

--conf spark.default.parallelism=112 \

--conf spark.task.maxFailures=4 \

SimilarityJob.jar \

--remote \

… … …

The executor parameters are optional; the bootstrap script will automatically try to maximize the Spark configuration options. Note that the scripts are not aware of the tasks you are running, they only read the EMR cluster specification.

Page 10: Spark Intro by Adform Research

Spark UI

• Need to set up an SSH tunnel to access it from your PC

• An alternative is to use the command-line browser lynx

• When you submit an app with the local master, the UI will be at ip:4040

• When you submit with the YARN master, go to the Hadoop UI on port 9026; it will show the Spark task running. Click on ApplicationMaster in the Tracking UI column, or get the UI URL from the command-line output when you submit the task

Page 11: Spark Intro by Adform Research

Spark UI

For Spark 1.2.0 the Executors tab is wrong and the Storage tab is always empty; the only useful tabs are Jobs, Stages and Environment.

Page 12: Spark Intro by Adform Research

Some useful settings

• spark.hadoop.validateOutputSpecs: useful when developing; set it to false and you can overwrite output files (see the sketch after this list)

• spark.default.parallelism (number of output files / number of cores), automatically configured when you run the bootstrap action with the -x option

• spark.shuffle.consolidateFiles (default false)

• spark.rdd.compress (default false)

• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …

• http://spark.apache.org/docs/1.2.0/configuration.html
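These options can also be set programmatically on the SparkConf before the context is created; a sketch with the same illustrative values as above (the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("similarity-job")                       // hypothetical app name
  .set("spark.hadoop.validateOutputSpecs", "false")   // allow overwriting output while developing
  .set("spark.default.parallelism", "112")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)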

Page 13: Spark Intro by Adform Research

Spark shell

/home/hadoop/spark/bin/spark-shell \

--master <yarn|local[*]> \

--deploy-mode client \

--num-executors 7 \

--executor-memory 4G \

--executor-cores 16 \

--driver-memory 4G \

--conf spark.default.parallelism=112 \

--conf spark.task.maxFailures=4

Page 14: Spark Intro by Adform Research

Spark shell

• In the Spark shell you don’t need to instantiate a Spark context, it is already instantiated (as sc), but you can create another one if you like

• Type Scala expressions and see what is happening

• Note the lazy evaluation; to force expression evaluation for debugging, use action functions like [expression].take(n) or [expression].count to see if your statements are OK
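For example, in the shell (sc already exists; the input path is hypothetical):

val lines = sc.textFile("s3://some-bucket/some/path/*")   // nothing is read yet, transformations are lazy
val fields = lines.map(_.split('\t'))

fields.take(5).foreach(a => println(a.mkString(" | ")))   // action: pulls a few records to the driver
println(fields.count())                                   // action: evaluates the whole dataset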

Page 15: Spark Intro by Adform Research

Summary

• Spark is better suited for developing in Linux

• Don’t blindly trust the Amazon bootstrap scripts; check with Ganglia whether your application is actually utilizing the cluster resources

• Try to write Scala code in a way that makes it possible to run parts of it in spark-shell; otherwise it is hard to debug problems which occur only at production dataset scale (a sketch follows below).
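One way to do that (object and function names are made up): keep the logic in plain functions that map RDDs to RDDs, so the same code can be called from the packaged job’s main() and pasted into spark-shell.

import org.apache.spark.SparkContext._   // pair-RDD functions (auto-imported in spark-shell)
import org.apache.spark.rdd.RDD

object SimilarityLogic {
  // pure RDD-in / RDD-out logic, no SparkContext creation inside
  def frequentValues(lines: RDD[String], threshold: Long): RDD[(String, Long)] =
    lines.map(line => (line, 1L)).reduceByKey(_ + _).filter { case (_, n) => n >= threshold }
}

// packaged job:  SimilarityLogic.frequentValues(sc.textFile(inputPath), 300).saveAsTextFile(outputPath)
// spark-shell:   SimilarityLogic.frequentValues(sc.textFile("..."), 300).take(10)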