CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 16 Spark, Cluster, AWS EMR Nov 7, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.

SMACK

Hot topic in the Bay Area

Scala, Spark
Apache Mesos - distributed systems kernel
Apache Akka - highly concurrent, distributed, resilient message-driven applications on the JVM
Apache Cassandra - distributed database
Apache Kafka - distributed message broker

Towards AWS

Need Spark program packaged in a jar file

Issues:
- Packaging in a jar
- Running in a local cluster of one machine
- Logging
- File references

Spark Program & Packaging in Jar

Put the program in an object

Packaging in a jar file:
- Package your code, not the Spark jars - the Spark jars add 200 MB
- By hand, using the jar command (see the sketch below)
- Using sbt
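A sketch of the by-hand route, assuming the classes are already compiled under target/scala-2.11/classes and a MANIFEST.MF naming the main class exists (both paths are illustrative):

->jar cvfm simpleapp.jar MANIFEST.MF -C target/scala-2.11/classes .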

Why Jar Size Matters

[Diagram: the jar file is shipped from the Master to each of the Slaves]

Jar File & Spark Jars

When running a Spark program, Spark supplies all the Spark dependencies.

If your jar file does not contain the Spark jars, it cannot run by itself.

If your jar file does contain the Spark jars:
- It can run by itself
- It can run in Spark
- But you are shipping an unneeded 200 MB to each slave

You do need to include all other needed resources in your jar file. One way to keep the Spark jars out is shown below.
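A common way to keep the Spark jars out of your package while still compiling against them is sbt's "provided" scope - a one-line sketch, using the same dependency as the build.sbt shown later; spark-submit supplies these classes at run time, so the packaged jar stays small:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"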

Sample Program

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

build.sbt

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

File Structure

simpleApp/
simpleApp/build.sbt
simpleApp/src/
simpleApp/src/main
simpleApp/src/main/scala
simpleApp/src/main/scala/SimpleApp.scala

Compiling the Example Using sbt

From the simpleApp directory:

->sbt package
[info] Updated file /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project/build.properties: set sbt.version to 1.0.2
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project
[info] Updating {file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project/}simpleapp-build...
[info] Done updating.
[warn] Run 'evicted' to see detailed eviction warnings
...
[info] Compiling 1 Scala source to /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/target/scala-2.11/classes ...
[info] Done compiling.
[info] Packaging /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/target/scala-2.11/simple-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 14 s, completed Nov 4, 2017 4:24:36 PM

Note size of Jar file

Running in Temp Spark Runtime

->spark-submit target/scala-2.11/simple-project_2.11-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 16:30:13 INFO SparkContext: Running Spark version 2.2.0
....
17/11/04 16:30:15 INFO SparkContext: Successfully stopped SparkContext
17/11/04 16:30:15 INFO ShutdownHookManager: Shutdown hook called
17/11/04 16:30:15 INFO ShutdownHookManager: Deleting directory /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/spark-8930a3ab-b041-4ed4-8203-fc8369b9c374

I set SPARK_HOME and put SPARK_HOME/bin & SPARK_HOME/sbin on my path:

setenv SPARK_HOME /Java/spark-2.2.0-bin-hadoop2.7

Then run SPARK_HOME/bin/spark-submit from the simpleApp directory.

Starting a Spark Cluster of One

Command: SPARK_HOME/sbin/start-master.sh

->start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /Java/spark-2.2.0-bin-hadoop2.7/logs/spark-whitney-org.apache.spark.deploy.master.Master-1-air-6.local.out

Master Web Page

localhost:8080
127.0.0.1:8080
0.0.0.0:8080

Starting a Slave on the Local Machine

Command: SPARK_HOME/sbin/start-slave.sh

->start-slave.sh spark://air-6.local:7077
starting org.apache.spark.deploy.worker.Worker, logging to /Java/spark-2.2.0-bin-hadoop2.7/logs/spark-whitney-org.apache.spark.deploy.worker.Worker-1-air-6.local.out

Master Web Page

Submitting Job to Spark on Cluster

Run SPARK_HOME/bin/spark-submit from the simpleApp directory:

->spark-submit --master spark://air-6.local:7077 target/scala-2.11/simple-project_2.11-1.0.jar

Master Web Page

Application Page

Starting/Stopping Master/Slave

Commands in SPARK_HOME/sbin:

->start-master.sh
->start-slave.sh spark://air-6.local:7077
->stop-master.sh
->stop-slave.sh
->start-all.sh
->stop-all.sh

spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Spark Properties

https://spark.apache.org/docs/latest/configuration.html

Properties include:
- name - displayed in the Spark Master web page
- master
- logging
- memory
- etc.

master

Master URL            Meaning
local                 Run Spark locally with one worker thread.
local[K]              Run Spark locally with K worker threads.
local[K,F]            Run Spark locally with K worker threads and F maxFailures.
local[*]              Run Spark locally with as many worker threads as logical cores on your machine.
local[*,F]            Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
spark://HOST:PORT     Connect to the given Spark standalone cluster master.
spark://HOST1:PORT1,HOST2:PORT2
                      Connect to the given Spark standalone cluster with standby masters with Zookeeper.
mesos://HOST:PORT     Connect to the given Mesos cluster.
yarn                  Connect to a YARN cluster in client or cluster mode.

Examples

Start Spark master-slave using default values:
->spark-submit target/scala-2.11/simple-project_2.11-1.0.jar

Submit job to existing master:
->spark-submit --master spark://air-6.local:7077 \
    target/scala-2.11/simple-project_2.11-1.0.jar

Start Spark master-slave using all cores:
->spark-submit --master "local[*]" target/scala-2.11/simple-project_2.11-1.0.jar

Setting Properties

In precedence order:
1. In the program
2. On the submit command line
3. In the config file
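A concrete illustration, as a sketch - spark.executor.memory is a real Spark property, and conf/spark-defaults.conf is Spark's standard config file:

In code:       new SparkConf().set("spark.executor.memory", "2g")
Command line:  spark-submit --conf spark.executor.memory=2g ...
Config file:   spark.executor.memory 2g    (in conf/spark-defaults.conf)

If all three are set, the in-code value wins.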

Setting master in Code

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

Don't set the master in code: it overrides the values on the command line and in the config file, so you will not be able to change master settings without recompiling.

Warning

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

Spark will not overwrite existing files. If you run this a second time without removing the output files, you get an exception.
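One common workaround, not from the slides, is to delete the output directory before saving, using Hadoop's FileSystem API (already on the classpath of any Spark program); the object name here is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}

object SimpleAppOverwrite {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // remove a previous run's output before saving again
    val out = new Path("SimpleAppOutput")
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(out)) fs.delete(out, true)  // true = recursive
    sc.parallelize(List(1, 2, 3, 4)).saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}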

Using Intellij

Using Intellij

Edit build.sbt file to add libraryDependencies

name := "Your Project"

version := "0.1"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

SBT

http://www.scala-sbt.org

Commands:
- clean
- update - dependencies
- compile
- package - generate jar file
- test
- run - not useful with Spark

Issue - Debugging

The debugger is not available for a program running on a cluster.

Print statements: don't count on seeing them from the slaves.

Logging: Spark uses log4j 1.2.

1/2 of Default Output

->spark-submit --master spark://air-6.local:7077 simpleappintell_2.11-0.1.jar
log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
cat in the hat
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 22:16:37 INFO SparkContext: Running Spark version 2.2.0
17/11/04 22:16:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/04 22:16:38 INFO SparkContext: Submitted application: Simple Application
17/11/04 22:16:38 INFO SecurityManager: Changing view acls to: whitney
17/11/04 22:16:38 INFO SecurityManager: Changing modify acls to: whitney
17/11/04 22:16:38 INFO SecurityManager: Changing view acls groups to:
17/11/04 22:16:38 INFO SecurityManager: Changing modify acls groups to:
17/11/04 22:16:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/11/04 22:16:38 INFO Utils: Successfully started service 'sparkDriver' on port 52153.
17/11/04 22:16:38 INFO SparkEnv: Registering MapOutputTracker
17/11/04 22:16:38 INFO SparkEnv: Registering BlockManagerMaster
17/11/04 22:16:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/04 22:16:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/04 22:16:38 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-f07bc14c-79a1-4402-aa1f-8df995460e47
17/11/04 22:16:38 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/04 22:16:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/04 22:16:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/04 22:16:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
17/11/04 22:16:39 INFO SparkContext: Added JAR file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/target/scala-2.11/simpleappintell_2.11-0.1.jar at spark://192.168.0.102:52153/jars/simpleappintell_2.11-0.1.jar with timestamp 1509858999020
17/11/04 22:16:39 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://air-6.local:7077...
17/11/04 22:16:39 INFO TransportClientFactory: Successfully created connection to air-6.local/192.168.0.102:7077 after 23 ms (0 ms spent in bootstraps)
17/11/04 22:16:39 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171104221639-0004

Log4j

Log levels:
OFF (most specific, no logging)
FATAL (most specific, little data)
ERROR
WARN
INFO
DEBUG
TRACE (least specific, a lot of data)
ALL (least specific, all data)

Can specify the level per package or per class.

Can determine the log format and the location of output.

Setting Level in Code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, LogManager, Logger}

object SimpleApp {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val log = LogManager.getRootLogger
    log.info("Start")
    println("cat in the hat")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    rdd.saveAsTextFile("SimpleAppOutput2")
    log.info("End")
    sc.stop()
  }
}

Output

->spark-submit --master spark://air-6.local:7077 simpleappintell_2.11-0.1.jar
log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
cat in the hat
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/05 12:04:37 INFO root: End

Again - Do you want to set log level in code?

You can set the level in a config file; Spark ships a template at $SPARK_HOME/conf/log4j.properties.template.

By default Spark will look for $SPARK_HOME/conf/log4j.properties - but that file is not part of your program.

Quiet Log Config

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org=WARN
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

Master Logging vs Slave Logging

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, LogManager, PropertyConfigurator, Logger}

object SimpleApp {
  def main(args: Array[String]) {
    val log = LogManager.getRootLogger
    log.info("Start")                    // runs on the master
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    val stringRdd = rdd.map { value =>
      log.info(value)                    // would run on a slave
      value.toString
    }
    log.info("End")                      // runs on the master
    sc.stop()
  }
}

Error on running: the logger is not serializable, so the log call inside map fails.

Serializable Logger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{LogManager, Logger}

object DistributedLogger extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

Main

object SimpleApp {
  def main(args: Array[String]) {
    val log = LogManager.getRootLogger
    log.info("Start")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    val result = rdd.map { i =>
      DistributedLogger.log.warn("i = " + i)
      i + 10
    }
    result.saveAsTextFile("SimpleAppOutput")
    log.info("End")
    sc.stop()
  }
}

Running

->spark-submit target/scala-2.11/simpleappintell_2.11-0.1.jar
17/11/06 16:59:40 INFO root: Start
17/11/06 16:59:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 0) / 8]
17/11/06 16:59:44 WARN DistributedLogger$: i = 7
17/11/06 16:59:44 WARN DistributedLogger$: i = 8
17/11/06 16:59:44 WARN DistributedLogger$: i = 9
17/11/06 16:59:44 WARN DistributedLogger$: i = 6
17/11/06 16:59:44 WARN DistributedLogger$: i = 3
17/11/06 16:59:44 WARN DistributedLogger$: i = 4
17/11/06 16:59:44 WARN DistributedLogger$: i = 1
17/11/06 16:59:44 WARN DistributedLogger$: i = 5
17/11/06 16:59:44 WARN DistributedLogger$: i = 2
17/11/06 16:59:44 WARN DistributedLogger$: i = 10
17/11/06 16:59:44 INFO root: End

Logging DataFrames

To log what happens inside DataFrame operations you need to use a UDF; a sketch follows.
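A minimal sketch, not from the slides: route the values through a UDF and log inside its body, which runs on the executors. It reuses the DistributedLogger object defined earlier; all other names are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object DataFrameLogging {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("DF Logging").getOrCreate()
    import spark.implicits._

    // the UDF body executes on the slaves, so use the serializable logger
    val addTenLogged = udf { i: Int =>
      DistributedLogger.log.warn("i = " + i)
      i + 10
    }

    (1 to 10).toDF("i").withColumn("result", addTenLogged($"i")).show()
    spark.stop()
  }
}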

Amazon Elastic Map-Reduce (EMR)

Hadoop, Hive, Spark, etc. on a cluster

Predefined set of languages/tools available

Can create a cluster of machines

https://aws.amazon.com
Create a new account; get 12 months of free access

AWS Free Tier

12 months free

EC2 - compute instances: 750 hours per month; billed in hour increments, per instance

S3 - storage: 5 GB, 20,000 Get requests

RDS - MySQL, PostgreSQL, SQL Server: 20 GB, 750 hours

EC2 Container - Docker images: 500 MB

Students and I were charged last year

AWS Educate

https://aws.amazon.com/education/awseducate/

SDSU is an institutional member

Students get $100 credit

EC2 Pricing

Price per hour (On Demand / Spot):

m1.medium  $0.0047
m1.large   $0.0?
m1.xlarge  $0.352
m3.xlarge  $0.0551
m4.large   $0.1 (On Demand), $0.0299 (Spot)
c1.medium  $0.0132
c1.xlarge  $0.057

Basic Outline

- Develop & test Spark program locally
- Upload program jar file & data to S3
- Configure & launch cluster, via the AWS Management Console, the AWS CLI, or the SDKs
- Monitor cluster
- Make sure you terminate the cluster when done

Simple Storage System - S3

Files are stored in buckets

Bucket names are global

Supports two file schemes: s3 (files divided into blocks) and s3n (native files)

Accessing files:
- S3 console
- Third party tools
- REST
- Java, C#, etc.

Amazon S3

S3 Creating a Bucket

S3 Costs

AWS Free Usage Tier

New AWS customers receive each month for one year: 5 GB of Amazon S3 storage in the Standard Storage class, 20,000 Get requests, 2,000 Put requests, and 15 GB of data transfer out

                      Standard Storage   Standard - Infrequent Access Storage   Glacier Storage
First 50 TB / month   $0.023 per GB      $0.0125 per GB                         $0.004 per GB
Next 450 TB / month   $0.022 per GB      $0.0125 per GB                         $0.004 per GB
Over 500 TB / month   $0.021 per GB      $0.0125 per GB                         $0.004 per GB

S3 Objects

Objects contain object data and metadata

Size: 1 byte to 5 gigabytes per object

Object data: just bytes; no meaning associated with the bytes

Metadata: name-value pairs that describe the object; some HTTP headers are used, e.g. Content-Type

A sketch of creating an object with metadata follows.
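A minimal sketch of putting an object with metadata, assuming the AWS SDK for Java v1 (com.amazonaws:aws-java-sdk-s3) is on the classpath; the bucket name and key are hypothetical:

import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata

object S3PutExample {
  def main(args: Array[String]) {
    val s3 = AmazonS3ClientBuilder.defaultClient()  // default credentials/region
    val bytes = "hello s3".getBytes("UTF-8")

    val meta = new ObjectMetadata()                 // metadata = name-value pairs
    meta.setContentType("text/plain")               // stored as the Content-Type header
    meta.setContentLength(bytes.length)

    s3.putObject("my-example-bucket", "greetings/hello.txt",
      new ByteArrayInputStream(bytes), meta)
  }
}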

S3 Buckets

Namespace for objects

No limitation on the number of objects per bucket

Only 100 buckets per account

Each bucket has a name:
- Up to 255 bytes long
- Cannot be the same as an existing bucket name used by any S3 user

Bucket Names

Bucket names must:
- Contain only lowercase letters, numbers, periods (.), underscores (_), and dashes (-)
- Start with a number or letter
- Be between 3 and 255 characters long
- Not be in an IP address style (e.g., "192.168.5.4")

To conform with DNS requirements, Amazon recommends:
- Bucket names should not contain underscores (_)
- Bucket names should be between 3 and 63 characters long
- Bucket names should not end with a dash
- Bucket names cannot contain dashes next to periods (e.g., "my-.bucket.com" and "my.-bucket" are invalid)

Key

Unique identifier for an object within a bucket

Object URL form:

http://bucketName.s3.amazonaws.com/Key

Example:

http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl
Bucket = doc
Key = 2006-03-01/AmazonS3.wsdl

Access Control Lists (ACL)

Each bucket has an ACL that determines who has read/write access

Each object can have an ACL that determines who has read/write access

An ACL consists of a list of grants; a grant contains one grantee and one permission

S3 Data Consistency Model

Updates to a single object at a key in a bucket are atomic

But a read after a write may return the old value; changes may take time to propagate

No object locking: if two writes to the same object occur at the same time, the one with the later timestamp wins

CAP Theorem

The CAP theorem says that in a distributed system you cannot have all three of:
- Consistency
- Availability
- tolerance to network Partitions

Consistency

[Diagram: Machines 1 and 2 both hold A = 2 - consistent; Machine 1 holds A = 2 while Machine 2 holds A = 3 - not consistent]

Partition

[Diagram: Machines 1 and 2 both hold A = 2, but the link between them is down - partitioned]

Machine 1 cannot talk to Machine 2

But how does Machine 1 tell the difference between no connection and a very slow connection or a busy Machine 2?

Latency

Latency: the time between making a request and getting a response

Distributed systems always have latency

In practice a partition is detected by latency: when there is no response in a given time frame, assume we are partitioned

Available

[Diagram: Machines 1 and 2 both hold A = 2, but a client cannot access the value of A]

What does "not available" mean? No connection? A slow connection? What is the difference?

Some say highly available - meaning low latency

In practice availability and latency are related

Consistency over Latency

[Diagram, Machine 1 and Machine 2:
1. Both hold A = 2; request "Set A to 3" arrives
2. Lock A; write requests are queued until unlocked
3. Set A to 3 on both machines
4. Both hold A = 3; Unlock A]

Increased latency; system still available

Latency over Consistency

[Diagram, Machine 1 and Machine 2:
1. Both hold A = 2; request "Set A to 3" arrives at Machine 1
2. The write is accepted immediately: A = 3 on Machine 1, A = 2 on Machine 2
3. "Set A to 3" propagates to Machine 2
4. Both hold A = 3]

Write requests accepted immediately: low latency, but the system is temporarily inconsistent (A = 3 vs A = 2)

Latency over Consistency - Write Conflicts

[Diagram, Machine 1 and Machine 2:
1. Both hold A = 2; "Set A to 3" arrives at Machine 1 while "Subtract 1 from A" arrives at Machine 2
2. A = 3 on Machine 1, A = 1 on Machine 2
3. Each write propagates to the other machine, arriving in a different order
4. A = ?, A = ?]

Need a policy to make the system consistent

Partition

[Diagram, Machine 1 and Machine 2, with the link between them down:
1. Both hold A = 2
2. "Set A to 3" arrives at Machine 1 (A = 3) while "Subtract 1 from A" arrives at Machine 2 (A = 1)
3. When the partition heals: A = ?, A = ?]

Need a policy to make the system consistent

CAP Theorem

Not a theorem

Too simplistic: what is availability? What is a partition of the network?

Misleading

The intent of CAP was to focus designers' attention on the tradeoffs in distributed systems: how to handle partitions in the network, consistency, latency, and availability

CAP & S3

S3 favors latency over consistency

Running Program on AWS EMR

- Make sure the program runs locally
- Create a jar file containing your code; make sure the jar file contains a manifest
- Create s3 bucket(s) for the jar file, logs, input, and output
- Upload jar & data files to s3 (an aws cli sketch follows)
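A sketch of the upload step using the aws cli (the bucket name is hypothetical):

->aws s3 cp target/scala-2.11/simpleapp.jar s3://my-696-bucket/simpleapp.jar
->aws s3 cp data.txt s3://my-696-bucket/input/data.txt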

Test Program - SimpleApp

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.LogManager

object SimpleApp {
  def main(args: Array[String]) {
    val log = LogManager.getRootLogger
    log.info("Start")
    if (args.length < 1) {
      log.error("Missing argument")
      return
    }
    val outputFile = args(0)
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    rdd.saveAsTextFile(outputFile)
    log.info("End")
    sc.stop()
  }
}

Packaging SimpleApp using SBT

In the project directory:

->sbt package
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/whitney/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/project
[info] Loading settings from build.sbt ...
[info] Set current project to SimpleAppIntell (in build file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/)
[success] Total time: 2 s, completed Nov 6, 2017 4:05:00 PM

Packaging SimpleApp using SBT

In the project directory:

->sbt
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/whitney/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/project
[info] Loading settings from build.sbt ...
[info] Set current project to SimpleAppIntell (in build file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/)
[info] sbt server started at 127.0.0.1:4172
sbt:SimpleAppIntell> package
[success] Total time: 2 s, completed Nov 6, 2017 4:06:33 PM
sbt:SimpleAppIntell>

I use the SBT shell as it is faster when repeating operations.

Result of SBT package

Note: I renamed the jar file simpleapp.jar

Contents of simpleapp.jar

MANIFEST.MF:

Manifest-Version: 1.0
Implementation-Title: SimpleAppIntell
Implementation-Version: 0.1
Specification-Vendor: default
Specification-Title: SimpleAppIntell
Implementation-Vendor-Id: default
Specification-Version: 0.1
Implementation-Vendor: default
Main-Class: SimpleApp

Note: when running SimpleApp locally you don't need to use --class; Spark finds the main class from the manifest. When running on AWS you need to use --class. An example follows.
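For example, running locally and letting the manifest supply the main class (the output argument here is illustrative):

->spark-submit target/scala-2.11/simpleapp.jar SimpleAppOutput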

Running Program on AWS EMR

- Make sure the program runs locally
- Create a jar file containing your code; make sure the jar file contains a manifest
- Create s3 bucket(s) for the jar file, logs, input, and output
- Upload jar & data files to s3

My S3 Buckets

Spark on AWS - EMR Console

You can either use the Spark option in Quick Options or use Advanced Options

Advanced Options

Spark Application Setup

You have to give --class ClassName in the spark-submit options.
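A plausible shape for the step fields (bucket and paths are hypothetical):

Spark-submit options: --class SimpleApp
Application location: s3://my-696-bucket/simpleapp.jar
Arguments: s3://my-696-bucket/output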

Using the custom jar option: useful when cloning steps

Output

Warning on AWS

It can take 5-10 minutes to start a cluster

The logs do not show your logging statements

When you configure steps incorrectly they fail, and the error messages are not very helpful

SSH to your Master Node

Create an Amazon EC2 key pair

Instructions:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair

Open the EC2 Dashboard - select Key Pairs

SSH to your Master Node

In Create Cluster - Quick Options

SSH to your Master Node

Click for Instructions

Command-line Tools

Flintrock - an open-source command-line tool for launching Apache Spark clusters
https://github.com/nchammas/flintrock

aws cli - Amazon's command-line tool
https://aws.amazon.com/cli/
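A minimal launch sketch with Flintrock, assuming it is installed and configured per its README (the cluster name is hypothetical):

->flintrock launch my-cluster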