D16 Spark, Cluster, AWS

CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017 Doc 16 Spark, Cluster, AWS EMR Nov 7, 2017 Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.







SMACK

Hot topic in Bay area

Scala, Spark
Apache Mesos - Distributed system kernel
Apache Akka - highly concurrent, distributed, resilient message-driven applications on JVM
Apache Cassandra - distributed database
Apache Kafka -


Towards AWS


Need Spark program packaged in jar file

Issues
  Packaging in jar
  Running in local cluster of one machine
  Logging
  File references


Spark Program & Packaging in Jar


Put program in object

Packaging in jar file
  Package your code, not the Spark jars - Spark adds 200 MB
  By hand using the jar command
  Using sbt


Why Jar Size Matters




[Figure: cluster diagram with three slaves]


Jar File & Spark Jars


When running a Spark program
  Spark supplies all the Spark dependencies

If your jar file does not contain the Spark jars then
  It cannot run by itself

If your jar file does contain the Spark jars then
  It can run by itself
  It can run in Spark
  But you are passing an unneeded 200 MB to each slave

Need to include all other needed resources in your jar file
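One way to keep the Spark jars out of a packaged jar, sketched as a build.sbt fragment. It assumes a fat-jar plugin such as sbt-assembly is doing the packaging, which these notes do not use; plain `sbt package` only packages your own classes. The project name and versions mirror the sample project in these notes.

```scala
// build.sbt (sketch): mark Spark as "provided" so a fat-jar build leaves it out.
// Assumption: a plugin such as sbt-assembly builds the jar.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

// "provided" = needed to compile, supplied by spark-submit at run time
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
```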


Sample Program


import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Master is not set here; spark-submit supplies it
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    // Writes a directory named SimpleAppOutput
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}




name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"


File Structure


simpleApp
simpleApp/build.sbt
src/
src/main
src/main/scala
src/main/scala/SimpleApp.scala


Compiling the Example Using sbt


From the simpleApp directory

->sbt package
[info] Updated file /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project/build.properties: set sbt.version to 1.0.2
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project
[info] Updating {file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project/}simpleapp-build...
[info] Done updating.
[warn] Run 'evicted' to see detailed eviction warnings
...
[info] Compiling 1 Scala source to /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/target/scala-2.11/classes ...
[info] Done compiling.
[info] Packaging /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/target/scala-2.11/simple-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 14 s, completed Nov 4, 2017 4:24:36 PM


Note size of Jar file



Running in Temp Spark Runtime


->spark-submit target/scala-2.11/simple-project_2.11-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 16:30:13 INFO SparkContext: Running Spark version 2.2.0
....
17/11/04 16:30:15 INFO SparkContext: Successfully stopped SparkContext
17/11/04 16:30:15 INFO ShutdownHookManager: Shutdown hook called
17/11/04 16:30:15 INFO ShutdownHookManager: Deleting directory /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/spark-8930a3ab-b041-4ed4-8203-fc8369b9c374

I put SPARK_HOME/bin & SPARK_HOME/sbin on my path

Set SPARK_HOME
  setenv SPARK_HOME /Java/spark-2.2.0-bin-hadoop2.7

Run SPARK_HOME/bin/spark-submit from simpleApp


Starting a Spark Cluster of One


Command SPARK_HOME/sbin/start-master.sh

->start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /Java/spark-2.2.0-bin-hadoop2.7/logs/spark-whitney-org.apache.spark.deploy.master.Master-1-air-6.local.out


Master Web Page




Starting slave on local machine


->start-slave.sh spark://air-6.local:7077
starting org.apache.spark.deploy.worker.Worker, logging to /Java/spark-2.2.0-bin-hadoop2.7/logs/spark-whitney-org.apache.spark.deploy.worker.Worker-1-air-6.local.out

Command SPARK_HOME/sbin/start-slave.sh


Master Web Page



Submitting Job to Spark on Cluster


->spark-submit --master spark://air-6.local:7077 target/scala-2.11/simple-project_2.11-1.0.jar

run SPARK_HOME/bin/spark-submit from simpleApp


Master Web Page



Application Page



Starting/Stopping Master/Slave


Commands in SPARK_HOME/sbin


->start-slave.sh spark://air-6.local:7077








./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]


Spark Properties


name
master
logging
memory
etc


name - displayed in Spark Master Web page




Master URL                        Meaning
local                             Run Spark locally with one worker thread.
local[K]                          Run Spark locally with K worker threads.
local[K,F]                        Run Spark locally with K worker threads and F maxFailures.
local[*]                          Run Spark locally with as many worker threads as logical cores on your machine.
local[*,F]                        Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
spark://HOST:PORT                 Connect to the given Spark standalone cluster master.
spark://HOST1:PORT1,HOST2:PORT2   Connect to the given Spark standalone cluster with standby masters with Zookeeper.
mesos://HOST:PORT                 Connect to the given Mesos cluster.
yarn                              Connect to a YARN cluster in client or cluster mode.
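The master URL forms can be told apart with a few patterns. The sketch below is plain Scala with no Spark dependency; `MasterUrl` and `describe` are hypothetical names for illustration, and the comma-separated multi-master form is the one given in the Spark standalone documentation. Spark's own parsing is more thorough than this.

```scala
// Sketch: classify a master URL string by its form.
// MasterUrl and describe are hypothetical names, not part of Spark.
object MasterUrl {
  private val LocalN     = """local\[(\d+)\]""".r
  private val LocalNF    = """local\[(\d+),\s*(\d+)\]""".r
  private val LocalStarF = """local\[\*,\s*(\d+)\]""".r
  private val Standalone = """spark://(.+)""".r
  private val Mesos      = """mesos://(.+)""".r

  def describe(master: String): String = master match {
    case "local"       => "local, 1 worker thread"
    case "local[*]"    => "local, one worker thread per logical core"
    case LocalN(k)     => s"local, $k worker threads"
    case LocalNF(k, f) => s"local, $k worker threads, $f maxFailures"
    case LocalStarF(f) => s"local, one worker thread per logical core, $f maxFailures"
    case Standalone(hosts) =>
      // Several comma-separated host:port pairs mean standby masters
      val masters = hosts.split(",").length
      if (masters > 1) s"standalone cluster, $masters masters (standby via Zookeeper)"
      else "standalone cluster"
    case Mesos(_)      => "Mesos cluster"
    case "yarn"        => "YARN cluster"
    case other         => s"unrecognized: $other"
  }

  def main(args: Array[String]): Unit =
    Seq("local", "local[4]", "local[*]", "spark://air-6.local:7077", "yarn")
      .foreach(m => println(m + " -> " + describe(m)))
}
```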




Start spark master-slave using default value
->spark-submit target/scala-2.11/simple-project_2.11-1.0.jar

Start spark master-slave using all cores
->spark-submit --master "local[*]" target/scala-2.11/simple-project_2.11-1.0.jar

Submit job to existing master
->spark-submit --master spark://air-6.local:7077 \
    target/scala-2.11/simple-project_2.11-1.0.jar


Setting Properties


In precedence order
  In program
  Submit command
  Config file
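The precedence rule can be modeled with three maps. This is only an illustration of the ordering, not how Spark implements it; the maps are hypothetical stand-ins for values set in the program, on the submit command, and in the config file.

```scala
// Sketch of the precedence rule: program beats submit command beats config file.
// The three maps are hypothetical stand-ins for the three sources.
object PropertyPrecedence {
  def resolve(inProgram: Map[String, String],
              submitCommand: Map[String, String],
              configFile: Map[String, String]): Map[String, String] =
    // With ++, entries on the right replace entries on the left,
    // so the highest-precedence source goes last.
    configFile ++ submitCommand ++ inProgram

  def main(args: Array[String]): Unit = {
    val effective = resolve(
      inProgram     = Map("spark.app.name" -> "Simple Application"),
      submitCommand = Map("spark.master" -> "spark://air-6.local:7077"),
      configFile    = Map("spark.master" -> "local[*]", "spark.executor.memory" -> "1g")
    )
    effective.toSeq.sorted.foreach { case (k, v) => println(s"$k = $v") }
  }
}
```

Here the submit command's master replaces the one from the config file, while the memory setting from the config file survives because nothing overrides it.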


Setting master in Code


import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // setMaster hard-codes the master in the program
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

Don't set master in code
  It overrides the value in the command line and config file
  So you will not be able to change master settings without recompiling
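If a default master is still wanted, for example when running inside an IDE, one option is `SparkConf.setIfMissing`, which leaves a value supplied by spark-submit or the config file alone. A minimal sketch, assuming a fallback of local[*] is acceptable; `SimpleAppWithFallback` and `buildConf` are names made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: supply a fallback master without overriding spark-submit.
// Assumption: falling back to local[*] is what you want when nothing is set.
object SimpleAppWithFallback {
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("Simple Application")
      // Only applies when spark.master was not already set, for example
      // by spark-submit --master or by the config file.
      .setIfMissing("spark.master", "local[*]")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    println(sc.parallelize(List(1, 2, 3, 4)).count())
    sc.stop()
  }
}
```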




import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    // Fails if SimpleAppOutput already exists
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

Spark will not overwrite existing files
  If you run this a second time without removing the files you get an exception
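One way around the exception is to remove the old output before the job runs. The sketch below assumes the output is a directory on the local file system, as in the single-machine runs in these notes; it does not handle HDFS or S3 paths. `OutputCleaner` is a hypothetical helper, not part of Spark.

```scala
import java.io.File

// Sketch: clear a previous run's output before saveAsTextFile is called.
// Assumption: the output directory is on the local file system.
object OutputCleaner {
  def deleteRecursively(file: File): Boolean = {
    if (file.isDirectory) {
      // listFiles returns null on an I/O error, so wrap it in Option
      Option(file.listFiles).foreach(_.foreach(deleteRecursively))
    }
    file.delete()
  }

  def main(args: Array[String]): Unit = {
    val output = new File("SimpleAppOutput")
    if (output.exists) deleteRecursively(output)
    // ... then create the SparkContext and call rdd.saveAsTextFile("SimpleAppOutput")
  }
}
```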


Using Intellij



Using Intellij


Edit build.sbt file to add libraryDependencies

name := "Your Project"

version := "0.1"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"






update - dependencies


package - generate jar file


run - Not useful with Spark



Issue - Debugging


Debugger not available for program running on cluster

Print statements
  Don't count on seeing them from slaves

Logging
  Spark uses log4j 1.2


1/2 of Default Output


->spark-submit --master spark://air-6.local:7077 simpleappintell_2.11-0.1.jar
log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
cat in the hat
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 22:16:37 INFO SparkContext: Running Spark version 2.2.0
17/11/04 22:16:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/04 22:16:38 INFO SparkContext: Submitted application: Simple Application
17/11/04 22:16:38 INFO SecurityManager: Changing view acls to: whitney
17/11/04 22:16:38 INFO SecurityManager: Changing modify acls to: whitney
17/11/04 22:16:38 INFO SecurityManager: Changing view acls groups to:
17/11/04 22:16:38 INFO SecurityManager: Changing modify acls groups to:
17/11/04 22:16:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/11/04 22:16:38 INFO Utils: Successfully started service 'sparkDriver' on port 52153.
17/11/04 22:16:38 INFO SparkEnv: Registering MapOutputTracker
17/11/04 22:16:38 INFO SparkEnv: Registering BlockManagerMaster
17/11/04 22:16:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/04 22:16:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/04 22:16:38 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-f07bc14c-79a1-4402-aa1f-8df995460e47
17/11/04 22:16:38 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/04 22:16:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/04 22:16:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/04 22:16:39 INFO SparkUI: Bound SparkUI to, and started at
22:16:39 INFO SparkContext: Added JAR file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/target/scala-2.11/simpleappintell_2.11-0.1.jar at spark:// with timestamp 1509858999020
17/11/04 22:16:39 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://air-6.local:7077...
17/11/04 22:16:39 INFO TransportClientFactory: Successfully created connection to air-6.local/ after 23 ms (0 ms spent in bootstraps)
17/11/04 22:16:39 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171104221639-0004




OFF (most specific, no logging)
FATAL (most specific, little data)
ERROR
WARN
INFO
DEBUG
TRACE (least specific, a lot of data)
ALL (least specific, all data)

Log Levels
  Can specify level
    Per package
    Per class

Can determine log
  Format
  Location of output


Setting Level in Code


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, LogManager, Logger}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Only ERROR and above for loggers under "org", which includes Spark
    Logger.getLogger("org").setLevel(Level.ERROR)
    val log = LogManager.getRootLogger
    log.info("Start")
    println("cat in the hat")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    rdd.saveAsTextFile("SimpleAppOutput2")
    log.info("End")
    sc.stop()
  }
}




->spark-submit --master spark://air-6.local:7077 simpleappintell_2.11-0.1.jar
log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
cat in the hat
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/05 12:04:37 INFO root: End


Again - Do you want to set log level in code?


Can set level in config file
  $SPARK_HOME/conf/log4j.properties.template

By default Spark will look for
  $SPARK_HOME/conf/log4j.properties
  But that file is not part of your program


Quiet Log config


# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org=WARN
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR


Master Logging vs Slave Logging


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, LogManager, PropertyConfigurator, Logger}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val log = LogManager.getRootLogger
    log.info("Start")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    val stringRdd = rdd.map { value =>
      // log is captured by this closure, which Spark must serialize
      log.info(value)
      value.toString
    }
    log.info("End")
    sc.stop()
  }
}

Error on running
  Log is not serializable


Serializable Logger


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{LogManager, Logger}

object DistributedLogger extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}




object SimpleApp {
  def main(args: Array[String]): Unit = {
    val log = LogManager.getRootLogger
    log.info("Start")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    val result = rdd.map { i =>
      // The logger is looked up where the task runs, not shipped with the closure
      DistributedLogger.log.warn("i = " + i)
      i + 10
    }
    result.saveAsTextFile("SimpleAppOutput")
    log.info("End")
    sc.stop()
  }
}




->spark-submit target/scala-2.11/simpleappintell_2.11-0.1.jar
17/11/06 16:59:40 INFO root: Start
17/11/06 16:59:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 0) / 8]
17/11/06 16:59:44 WARN DistributedLogger$: i = 7
17/11/06 16:59:44 WARN DistributedLogger$: i = 8
17/11/06 16:59:44 WARN DistributedLogger$: i = 9
17/11/06 16:59:44 WARN DistributedLogger$: i = 6
17/11/06 16:59:44 WARN DistributedLogger$: i = 3
17/11/06 16:59:44 WARN DistributedLogger$: i = 4
17/11/06 16:59:44 WARN DistributedLogger$: i = 1
17/11/06 16:59:44 WARN DistributedLogger$: i = 5
17/11/06 16:59:44 WARN DistributedLogger$: i = 2
17/11/06 16:59:44 WARN DistributedLogger$: i = 10
17/11/06 16:59:44 INFO root: End
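Why `@transient lazy val` helps can be seen without Spark. The sketch below uses a class rather than an object so that Java serialization is visible, and a made-up `FakeLogger` standing in for a logger that cannot be serialized; all names here are hypothetical. The field is skipped when the holder is serialized and rebuilt on first use in the deserialized copy.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for a resource that cannot be serialized (like a log4j Logger).
class FakeLogger { def warn(msg: String): Unit = println("WARN " + msg) }

// Counts how many times a FakeLogger has been created.
object Counter { val created = new AtomicInteger(0) }

// The field is skipped during serialization (@transient) and rebuilt
// on first use after deserialization (lazy).
class LoggerHolder extends Serializable {
  @transient lazy val log: FakeLogger = {
    Counter.created.incrementAndGet()
    new FakeLogger
  }
}

object TransientLazyDemo {
  // Round-trip through Java serialization, roughly what happens when
  // an object is shipped to a worker.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val holder = new LoggerHolder
    holder.log.warn("on the driver")
    val copy = roundTrip(holder) // would throw if log were a plain val
    copy.log.warn("after deserialization")
  }
}
```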


Logging DataFrames


To log client operations you need to use a udf
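A sketch of what logging from a udf might look like, reading the slide as "wrap the per-row operation in a udf and log inside it". `RowLogger`, `addTen`, and the column names are made up for this example; it follows the same serializable-logger pattern as the RDD example and assumes Spark SQL 2.2 is available at run time.

```scala
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical helper: logger plus the per-row operation to be logged.
object RowLogger extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)

  def addTen(v: Int): Int = {
    log.warn("value = " + v)
    v + 10
  }
}

object LoggingUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Logging UDF").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("value")
    // The udf runs on the workers, so the log lines appear in their logs
    val addTen = udf((v: Int) => RowLogger.addTen(v))
    df.withColumn("plusTen", addTen($"value")).show()

    spark.stop()
  }
}
```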


Amazon Elastic Map-Reduce (EMR)


Hadoop, Hive, Spark, etc on Cluster

Predefined set of languages/tools available

Can create cluster of machines

https://aws.amazon.com
  Create new account
  Get 12 months free access


AWS Free Tier


12 months free

EC2 - compute instances
  750 hours per month
  Billed in hour increments
  Billed per instance

S3 - storage
  5 GB
  20,000 Get requests

RDS - MySQL, PostgreSQL, SQL Server
  20 GB
  750 hours

EC2 Container - Docker images
  500 MB

Students and I were charged last year


AWS Educate



SDSU is an institutional member

Students get $100 credit


EC2 Pricing


Price Per Hour

On Demand Spot

m1.medium $0.0047

m1.large $0.0?

m1.xlarge $0.352

m3.xlarge $0.0551

m4.large $0.1 $0.0299

c1.medium $0.0132

c1.xlarge $0.057


Basic Outline


Develop & test Spark locally

Upload program jar file & data to S3

Configure & launch cluster
  AWS Management Console
  AWS CLI
  SDKs

Monitor cluster

Make sure you terminate cluster when done


Simple Storage Service - S3


Files are stored in buckets

Bucket names are global

Supports
  s3 - files divided into blocks
  s3n

Accessing files
  S3 console
  Third party
  REST
  Java, C#, etc


Amazon S3



S3 Creating a Bucket



S3 Costs


AWS Free Usage Tier

New AWS customers receive each month for one year 5 GB of Amazon S3 storage in the Standard Storage class, 20,000 Get Requests, 2,000 Put Requests, and 15 GB of data transfer out

                      Standard Storage   Standard - Infrequent Access Storage   Glacier Storage
First 50 TB / month   $0.023 per GB      $0.0125 per GB                         $0.004 per GB
Next 450 TB / month   $0.022 per GB      $0.0125 per GB                         $0.004 per GB
Over 500 TB / month   $0.021 per GB      $0.0125 per GB                         $0.004 per GB


S3 Objects


Objects contain
  Object data
  Metadata

Size
  1 byte to 5 gigabytes per object in a single PUT
  Up to 5 terabytes per object with multipart upload

Object data
  Just bytes
  No meaning associated with bytes

Metadata
  Name-value pairs to describe the object
  Some HTTP headers used



S3 Buckets


Namespace for objects

No limit on the number of objects per bucket

Only 100 buckets per account

Each bucket has a name
  Up to 255 bytes long
  Cannot be the same as an existing bucket name of any S3 user


Bucket Names


Bucket names must
  Contain only lowercase letters, numbers, periods (.), underscores (_), and dashes (-)
  Start with a number or letter
  Be between 3 and 255 characters long
  Not be in an IP address style

To conform with DNS requirements, Amazon recommends
  Bucket names should not contain underscores (_)
  Bucket names should be between 3 and 63 characters long
  Bucket names should not end with a dash
  Bucket names cannot contain dashes next to periods (e.g., "my-.bucket.com" and "my.-bucket" are invalid)

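The DNS-friendly naming recommendations can be checked mechanically. A minimal sketch of the rules as listed on this slide; `BucketNames` and `isDnsCompliant` are hypothetical names, the test names in `main` are invented for illustration, and Amazon's current rules may differ from this 2017 list.

```scala
// Sketch: check a bucket name against the DNS-friendly recommendations
// listed on the slide. Not an official validator.
object BucketNames {
  // Starts with a lowercase letter or digit; then letters, digits, periods, dashes
  private val Allowed = """[a-z0-9][a-z0-9.\-]*""".r
  // Four dot-separated groups of digits, i.e. IP address style
  private val IpStyle = """\d{1,3}(\.\d{1,3}){3}""".r

  def isDnsCompliant(name: String): Boolean =
    name.length >= 3 && name.length <= 63 &&
      Allowed.pattern.matcher(name).matches &&
      !name.endsWith("-") &&
      !name.contains("-.") && !name.contains(".-") &&
      !IpStyle.pattern.matcher(name).matches

  def main(args: Array[String]): Unit =
    Seq("cs696-spark-data", "my-.bucket.com", "my.-bucket", "My_Bucket")
      .foreach(n => println(n + " -> " + isDnsCompliant(n)))
}
```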



Unique identifier for an object within a bucket

Object URL
  Bucket = doc
  Key = 2006-03-01/AmazonS3.wsdl
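How a bucket and key combine into a URL, as a small sketch. It assumes the virtual-hosted-style addressing described in the AWS documentation, where the bucket name becomes part of the host name; `S3Url` is a hypothetical helper, and keys with special characters would also need URL encoding, which is left out here.

```scala
// Sketch: build a virtual-hosted-style object URL from bucket and key.
// Assumption: the key needs no URL encoding.
object S3Url {
  def objectUrl(bucket: String, key: String): String =
    s"http://$bucket.s3.amazonaws.com/$key"

  def main(args: Array[String]): Unit =
    println(objectUrl("doc", "2006-03-01/AmazonS3.wsdl"))
}
```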


Access Control Lists (ACL)


Each Bucket has an ACL
  Determines who has read/write access

Each Object can have an ACL
  Determines who has read/write access

ACL consists of a list of grants

Grant contains
  One grantee
  One permission


S3 Data Consistency Model


Updates to a single object at a key in a bucket are atomic

But a read after a write may return the old value
  Changes may take time to propagate

No object locking
  If two writes to the same object occur at the same time
  The one with the later timestamp wins
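The "later timestamp wins" rule can be sketched in a few lines. This is a toy model, not how S3 is implemented; `Write` and `LastWriterWins` are hypothetical names.

```scala
// Sketch of "the one with the later timestamp wins".
// Write and LastWriterWins are hypothetical names for illustration.
final case class Write(key: String, value: String, timestamp: Long)

object LastWriterWins {
  // For each key keep the value of the write with the largest timestamp.
  def resolve(writes: Seq[Write]): Map[String, String] =
    writes.groupBy(_.key).map { case (key, ws) =>
      key -> ws.maxBy(_.timestamp).value
    }

  def main(args: Array[String]): Unit = {
    val writes = Seq(
      Write("report.txt", "draft", timestamp = 100L),
      Write("report.txt", "final", timestamp = 105L)
    )
    println(resolve(writes))
  }
}
```

The earlier write is silently discarded, which is the price of having no object locking.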


CAP Theorem


CAP theorem says in a distributed system you cannot have all three
  Consistency
  Availability
  tolerance to network Partitions




Machine 1    Machine 2
A = 2        A = 2
A = 2        A = 3        Not Consistent




Machine 1    Machine 2
A = 2        A = 2
A = 2        A = 2        Partitioned

Machine 1 cannot talk to machine 2

But how does machine 1 tell the difference between no connection and a very slow connection or busy machine 2?




Latency
  Time between making a request and getting a response

Distributed systems always have latency

In practice detect a partition by latency

When no response in a given time frame assume we are partitioned
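Detecting a partition by latency amounts to a timeout. A minimal sketch using Scala futures; `PartitionDetector` and `probe` are hypothetical names, and the time frame is whatever the system designer picks.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Sketch: treat "no response within the time frame" as a partition.
object PartitionDetector {
  sealed trait Status
  case object Responded extends Status
  case object AssumedPartitioned extends Status

  // Wait for the reply; if it does not arrive in time, assume a partition.
  def probe[T](reply: Future[T], timeFrame: FiniteDuration): Status =
    try {
      Await.ready(reply, timeFrame)
      Responded
    } catch {
      case _: TimeoutException => AssumedPartitioned
    }

  def main(args: Array[String]): Unit = {
    val fast = Future.successful("pong")
    val silent = Promise[String]().future // never completes
    println(probe(fast, 100.millis))
    println(probe(silent, 100.millis))
  }
}
```

A slow or busy machine looks the same as a disconnected one here, which is exactly the ambiguity raised earlier.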




Machine 1    Machine 2
A = 2        A = 2
A = 2        A = 2        Client

Client cannot access value of A

What does not available mean?
  No connection
  Slow connection
  What is the difference?

Some say highly available - meaning low latency

In practice availability and latency are related


Consistency over Latency


Machine 1    Machine 2
A = 2        A = 2        Set A to 3
A = 2        A = 2        Set A to 3, Lock A
A = 2        A = 2        Set A to 3, Set A to 3
A = 3        A = 3        Set A to 3, Unlock A
A = 3        A = 3

Write requests queued until unlocked

Increased latency
System still available


Latency over Consistency


Machine 1    Machine 2
A = 2        A = 2        Set A to 3
A = 3        A = 2
A = 3        A = 2        Set A to 3
A = 3        A = 3

Write requests accepted

Low latency
System inconsistent


Latency over Consistency - Write Conflicts


Machine 1    Machine 2
A = 2        A = 2        Set A to 3 (received by Machine 1)
A = 3        A = 2        Subtract 1 from A (received by Machine 2)
A = 3        A = 1        Set A to 3 and Subtract 1 from A are replicated
A = ?        A = ?        Need policy to make system consistent
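The slides do not say which policy to use. One common choice, shown here only as an example, is last-writer-wins: every accepted write carries a timestamp, and both machines keep the value with the newest one. The names `Write` and `resolve`, the timestamps, and the machine ids below are all assumptions made for the sketch.

```scala
object LastWriterWins {
  // The value a machine ended up with, tagged with a logical timestamp
  // and the id of the machine that accepted the write (all hypothetical).
  final case class Write(value: Int, timestamp: Long, machineId: Int)

  // Newest timestamp wins. Ties are broken by machine id so that both
  // machines pick the same winner no matter the argument order.
  def resolve(a: Write, b: Write): Write =
    if (a.timestamp != b.timestamp) {
      if (a.timestamp > b.timestamp) a else b
    } else if (a.machineId >= b.machineId) a else b

  def main(args: Array[String]): Unit = {
    val onMachine1 = Write(value = 3, timestamp = 1L, machineId = 1) // Set A to 3
    val onMachine2 = Write(value = 1, timestamp = 2L, machineId = 2) // Subtract 1 from A
    println(resolve(onMachine1, onMachine2))
  }
}
```

Last-writer-wins makes the copies agree, but it silently discards one of the two updates. Other policies, such as merging operations or asking the client, trade that loss for more complexity.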






CAP Theorem


Not a theorem

Too simplistic
  What is availability?
  What is a partition of the network?


Intent of CAP was to focus designers' attention on the tradeoffs in distributed systems

How to handle partitions in the network
  Consistency
  Latency
  Availability


CAP & S3


S3 favors latency over consistency


Running Program on AWS EMR


Make sure program runs locally

Create jar file containing code
  Make sure that jar file contains manifest

Create s3 bucket(s) for
  jar file
  logs
  input
  output

Upload jar & data files to s3


Test Program - SimpleApp


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.LogManager

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val log = LogManager.getRootLogger
    log.info("Start")
    // The output location is the only argument
    if (args.length < 1) {
      log.error("Missing argument")
      return
    }
    val outputFile = args(0)
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Save the numbers 1 to 10 as text under outputFile
    val rdd = sc.parallelize(1 to 10)
    rdd.saveAsTextFile(outputFile)
    log.info("End")
    sc.stop()
  }
}


Packaging SimpleApp using SBT


In project directory

->sbt package
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/whitney/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/project
[info] Loading settings from build.sbt ...
[info] Set current project to SimpleAppIntell (in build file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/)
[success] Total time: 2 s, completed Nov 6, 2017 4:05:00 PM


Packaging SimpleApp using SBT


In project directory

->sbt
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/whitney/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/project
[info] Loading settings from build.sbt ...
[info] Set current project to SimpleAppIntell (in build file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/)
[info] sbt server started at
sbt:SimpleAppIntell> package
[success] Total time: 2 s, completed Nov 6, 2017 4:06:33 PM
sbt:SimpleAppIntell>

I use the SBT shell as it is faster when operations need to be repeated


Result of SBT package


Note: I renamed the jar file to simpleapp.jar


Contents of simpleapp.jar


Manifest-Version: 1.0
Implementation-Title: SimpleAppIntell
Implementation-Version: 0.1
Specification-Vendor: default
Specification-Title: SimpleAppIntell
Implementation-Vendor-Id: default
Specification-Version: 0.1
Implementation-Vendor: default
Main-Class: SimpleApp


When running SimpleApp locally
  Don't need to use --class
  Spark finds main class from manifest

When running on AWS
  Need to use --class
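For reference, a build.sbt along the following lines should produce a jar whose manifest names the main class. This is a sketch, not the build file used in the course: the Scala and Spark versions are assumptions, and the slash syntax needs sbt 1.1 or later.

```scala
// build.sbt (sketch; version numbers are assumptions, not taken from the slides)
name := "SimpleAppIntell"
version := "0.1"
scalaVersion := "2.11.12"

// "provided": spark-submit and EMR supply Spark at run time,
// so it is not bundled into the jar built by `sbt package`.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"

// Writes Main-Class: SimpleApp into the jar manifest.
Compile / packageBin / mainClass := Some("SimpleApp")
```

When a project has exactly one main class, sbt normally detects it on its own, so the last setting mostly matters when there are several.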




My S3 Buckets



Spark on AWS - EMR Console




You can either use Spark option on Quick Options or use Advanced Options


Advanced Options



Spark Application Setup


You have to give --class ClassName in the spark-submit options



Using the custom jar option
  Useful when cloning steps


Warning on AWS


It can take 5-10 minutes to start a cluster

Logs do not show your logging statements

When you configure Steps incorrectly they fail
  Error messages are not very helpful


SSH to your Master Node


Create Amazon EC2 Key pair



Open EC2 Dashboard - Select Key Pairs


SSH to your Master Node


In Create Cluster - Quick Options


SSH to your Master Node


Click for Instructions


Command-line Tools


Open-source command-line tool for launching Apache Spark clusters



aws cli

Amazon's command line tool
