Toying with spark

SPARK - NEW KID ON THE BLOCK

Uploaded by raymond-tay on 13-Jul-2015

TRANSCRIPT

Page 1: Toying with spark

SPARK - NEW KID ON THE BLOCK

Page 2: Toying with spark

ABOUT ME…

• I designed Bamboo (HP’s Big Data Analytics Platform)

• I write software (mostly with Scala but leaning towards Haskell recently …)

• I like translating sequential algorithms into parallel ones, mostly using CUDA / OpenCL; embedded assembly is an EVIL thing.

• I wrote 2 books

• OpenCL Parallel Programming Development Cookbook

• Developing an Akka Edge

Page 3: Toying with spark

WHAT’S COVERED TODAY?

• What’s Apache Spark

• What’s an RDD? How can I understand it?

• What’s Spark SQL

• What’s Spark Streaming

• References

Page 4: Toying with spark

WHAT’S APACHE SPARK

• As a beginner’s guide, you can refer to Tsai Li Ming’s talk.

• The API model abstracts:

• how to extract data from 3rd-party software (via JDBC, Cassandra, HBase)

• how to extract and compute over data (via GraphX, MLlib, Spark SQL)

• how to store data (data connectors to “local”, “hdfs”, “s3”); a minimal end-to-end sketch follows this list
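A minimal end-to-end sketch of those three concerns using the core RDD API; the file names and the word-count computation are illustrative, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative word count: extract, compute, store.
val conf = new SparkConf().setAppName("toy").setMaster("local[*]")
val sc = new SparkContext(conf)
val lines = sc.textFile("input.txt")                                        // extract: local, hdfs:// or s3 paths all work
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)  // compute
counts.saveAsTextFile("word-counts")                                        // store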

Page 5: Toying with spark

RESILIENT DISTRIBUTED DATASETS

• Apache Spark works on data broken into chunks

• These chunks are called RDDs

• RDDs are chained into a lineage graph, i.e. a graph that records how each RDD was derived from the others.

• RDDs can be queried, grouped and transformed, from coarse-grained operations down to fine-grained ones.

Page 6: Toying with spark

RESILIENT DISTRIBUTED DATASETS

• An RDD has a lifecycle:

• reification

• lazy compute / lazy re-compute

• destruction

• An RDD’s lifecycle is managed by the system unless …

• a program calls persist() or unpersist() on the RDD, which changes when the lazy (re-)computation happens (a small sketch follows).
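A minimal sketch (the RDD and the numbers are illustrative) of taking control of the lazy re-compute step with persist() and unpersist():

import org.apache.spark.storage.StorageLevel

// Illustrative: cache an RDD that is reused by two actions.
val squares = sc.parallelize(1 to 1000000).map(n => n.toLong * n)
squares.persist(StorageLevel.MEMORY_ONLY) // keep the computed partitions in memory
val total = squares.reduce(_ + _)         // first action: computes and caches
val count = squares.count()               // second action: served from the cache, no re-compute
squares.unpersist()                       // release the cached partitions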

Page 7: Toying with spark

“AGGREGATE” IN SPARK

> val data = sc.parallelize((1 to 4).toList, 2)
> data.aggregate(0)(
>   math.max(_, _),
>   _ + _
> )
result = 6

def aggregate[U](zerovalue: U)(fbinary: (U, T) => U, fagg: (U, U) => U): U

Page 8: Toying with spark

HOW “AGGREGATE” WORKS IN SPARK

Diagram: within each partition of the RDD, fbinary folds the elements (e1, e2, e3, e4) together with zerovalue to produce a per-partition result (res1, res2); fagg then combines the per-partition results into the final result.

caveat: the functions you supply should produce the same result regardless of how the data is partitioned (a small illustration follows).
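To see why the caveat matters, here is a small illustration that is not from the slides: summing within partitions but taking the maximum across partitions makes the answer depend on the number of partitions.

sc.parallelize(1 to 4, 1).aggregate(0)(_ + _, math.max(_, _)) // one partition:  max(0, 10)   = 10
sc.parallelize(1 to 4, 2).aggregate(0)(_ + _, math.max(_, _)) // two partitions: max(0, 3, 7) = 7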

Page 9: Toying with spark

“COGROUP” IN SPARK

> val x = sc.parallelize(List(1, 2, 1, 3), 1)
> val y = x.map((_, "y"))
> val z = x.map((_, "z"))
> y.cogroup(z).collect
res72: Array[(Int, (Iterable[String], Iterable[String]))] =
  Array((1,(Array(y, y),Array(z, z))),
        (3,(Array(y),Array(z))),
        (2,(Array(y),Array(z))))

def cogroup[W1, W2, W3] (other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

Page 10: Toying with spark

HOW “COGROUP” WORKS IN SPARK

RDDx = [(k1,va), (k2,vb), (k1,vc), (k3,vd), (k1,ve)]

RDDy = [(k1,vf), (k2,vg), (k1,vh)]

RDDx.cogroup(RDDy) = ?

Page 11: Toying with spark

HOW “COGROUP” WORKS IN SPARK

RDDx.cogroup(RDDy) groups the values from both RDDs by key:

Array((k1, [va, vc, ve, vf, vh]),
      (k2, [vb, vg]),
      (k3, [vd]))

Page 12: Toying with spark

“COGROUP” IN SPARK

• cogroup works on both RDDs and Spark Streams

• the ability to combine multiple RDDs allows higher abstractions to be constructed (see the join sketch after this list)

• a Stream in Spark is just a list of (Time, RDD[U]) pairs
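As a small illustration of building a higher abstraction on top of cogroup (reusing the y and z RDDs from the earlier example), an inner join can be expressed like this; Spark’s own join is built the same way:

// Inner join via cogroup: pair every y-value with every z-value that shares a key.
val joined = y.cogroup(z).flatMapValues { case (ys, zs) =>
  for (yv <- ys; zv <- zs) yield (yv, zv)
}
joined.collect() // key 1 yields 2 x 2 = 4 pairs; keys 2 and 3 yield one pair each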

Page 13: Toying with spark

WHAT’S SPARK SQL

• Spark SQL is new and has largely replaced Shark

• It lets large-scale (inline) queries be embedded in a Spark program

• Spark SQL supports Apache Hive, JSON, Parquet and RDDs

• Spark SQL’s optimizer is clever!

• It supports UDFs from Hive, or you can write your own (a small sketch follows)
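A minimal sketch of writing your own UDF; the function name is hypothetical, and it assumes the hiveCtx and the “tweets” table set up in the example slides that follow:

// Hypothetical UDF: register a Scala function and call it from SQL.
hiveCtx.udf.register("strLen", (s: String) => s.length)
val withLengths = hiveCtx.sql("SELECT text, strLen(text) FROM tweets")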

Page 14: Toying with spark

SPARK SQL

Diagram: data sources (JSON, Parquet, Hive, RDD) feeding into Spark SQL.

Page 15: Toying with spark

SPARK SQL (AN EXAMPLE)

// import spark sql
import org.apache.spark.sql.hive.HiveContext

// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)

Page 16: Toying with spark

SPARK SQL (AN EXAMPLE)

// import spark sql
import org.apache.spark.sql.hive.HiveContext

// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)

val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable("tweets")

Page 17: Toying with spark

SPARK SQL (AN EXAMPLE)

// import spark sql
import org.apache.spark.sql.hive.HiveContext

// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)

val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable("tweets")

val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")

Page 18: Toying with spark

SPARK SQL (AN EXAMPLE)

// import spark sql
import org.apache.spark.sql.hive.HiveContext

// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)

val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable("tweets")

val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")

val topTweetContent = topTweets.map(row => row.getString(0))
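A possible follow-up step, not in the slides, to materialise the result on the driver:

topTweetContent.collect().foreach(println) // bring the top tweet texts back and print them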

Page 19: Toying with spark

WHAT’S SPARK STREAMING

• Core component is a DStream

• A DStream is essentially a sequence of (key, value) pairs where the key is a Time and the value is the RDD of data received in that interval.

• Forward and backward (e.g. windowed) queries are supported; a small windowing sketch follows this list.

• Fault-Tolerance by check-pointing RDDs.

• What you can do with RDDs, you can do with DStreams.
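A minimal sketch of a windowed query, assuming a StreamingContext ssc and the same socket source used in the quick example on the next slides:

import org.apache.spark.streaming.Seconds

// Illustrative: every 10 seconds, count the lines received in the last 30 seconds.
val lines = ssc.socketTextStream("localhost", 7777)
val recent = lines.window(Seconds(30), Seconds(10)).count()
recent.print()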

Page 20: Toying with spark

SPARK STREAMING (QUICK EXAMPLE)

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Duration, Seconds}

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

Page 21: Toying with spark

SPARK STREAMING (QUICK EXAMPLE)

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Duration, Seconds}

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to
// port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)

// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors
errorLines.print()

Page 22: Toying with spark

SPARK STREAMING (QUICK EXAMPLE)

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Duration, Seconds}

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to
// port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)

// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors
errorLines.print()

// Start our streaming context and wait for it to "finish"
ssc.start()

// Wait for the job to finish
ssc.awaitTermination()

Page 23: Toying with spark

A DSTREAM LOOKS LIKE…

Diagram: a DStream is a sequence of RDDs laid out along the time axis, one batch per interval (t1 to t2, t2 to t3, t3 to t4, …), starting from the time the stream starts.

Page 24: Toying with spark

A DSTREAM CAN HAVE TRANSFORMATIONS ON THEM!

Diagram: for each batch interval (t1 to t2), a function f transforms the RDD of the input DStream (data-1) into the RDD of a derived DStream (data-2): a transformation on the fly!

Page 25: Toying with spark

SPARK STREAM TRANSFORMATION

Diagram: as batches arrive (t1 to t2, t2 to t3, …), f is applied to each batch of the input DStream (data-1), and the output DStream (data-2) is produced in batches.

Page 26: Toying with spark

SPARK STREAM TRANSFORMATION

Diagram: the transformation f keeps being applied batch after batch (t1 to t2, t2 to t3, t3 to t4, …) as the stream progresses.

Page 27: Toying with spark

STATEFUL SPARK STREAM TRANSFORMATION

Diagram: in a stateful transformation, the result that f produces for one interval is also fed into the computation for the next interval, so state is carried forward across batches.
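A minimal sketch of a stateful transformation with updateStateByKey; the names are illustrative and it assumes the lines DStream and the imports from the quick example. This is exactly the kind of transformation for which check-pointing matters, as the next slides explain:

ssc.checkpoint("checkpoint-dir") // stateful transformations require a checkpoint directory
val pairs = lines.flatMap(_.split("\\s+")).map((_, 1))
val runningCounts = pairs.updateStateByKey[Int] { (batchCounts: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + batchCounts.sum)
}
runningCounts.print()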

Page 28: Toying with spark

HOW DOES SPARK STREAMING HANDLE FAULTS?

• As before, check-pointing is the key to fault tolerance (especially for stateful DStream transformations)

• Programs can recover from check-points => no need to restart all over again.

• You can use “monit” to restart Spark jobs, or pass the “--supervise” flag when submitting the job; this is known as driver fault tolerance (a small recovery sketch follows)
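A minimal sketch of recovering a driver from its checkpoint with StreamingContext.getOrCreate; the directory name and the setup function are illustrative, and conf is the SparkConf from the earlier example:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative: rebuild the StreamingContext from the checkpoint if one exists,
// otherwise create it fresh via createContext().
def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("checkpoint-dir")
  // … set up the DStream pipeline here …
  ssc
}
val ssc = StreamingContext.getOrCreate("checkpoint-dir", createContext _)
ssc.start()
ssc.awaitTermination()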

Page 29: Toying with spark

HOW DOES SPARK STREAMING HANDLE FAULTS?

• All data arriving at the workers is replicated

• RDDs held on a failed worker are recovered by following the lineage graph

• The above is known as worker fault tolerance

• Receiver fault tolerance depends largely on whether the data source can re-send lost data

• Streams guarantee exactly-once semantics, with a caveat: output may be written to HDFS more than once, so application-specific logic needs to handle duplicate writes

Page 30: Toying with spark

REFERENCES

• Books:

• “Learning Spark: Lightning-Fast Big Data Analysis”

• “Advanced Analytics with Spark: Patterns for Learning from Data At Scale”

• “Fast Data Processing with Spark”

• “Machine Learning with Spark”

• Berkeley Data Bootcamp

• Introduction to Big Data with Apache Spark

• Kien Dang’s introduction to Spark and R using Naive Bayes

• Spark Streaming with Scala and Akka

Page 31: Toying with spark

THE END

QUESTIONS?

TWITTER: @RAYMONDTAYBL GITHUB: @RAYGIT