Survey of Spark for Data Pre-Processing and Analytics


Page 1: Survey of Spark for Data Pre-Processing and Analytics

Yannick Pouliot Consulting© 2015 all rights reserved

Yannick Pouliot, PhD ([email protected])

8/12/2015

Spark: New Vistas of Computing Power

Page 2: Survey of Spark for Data Pre-Processing and Analytics

Spark: An Open Source Cluster-Based Distributed Computing

Page 3: Survey of Spark for Data Pre-Processing and Analytics

Distributed Computing à la Spark

Here, "distributed computing" means…
• Multi-node cluster: master, slaves
• Paradigm is:
  o Minimize networking by allocating chunks of data to slaves
  o Slaves receive code to run on their subset of the data
• Lots of redundancy
• No communication between nodes
  o Only with the master
• Operates on commodity hardware

Page 4: Survey of Spark for Data Pre-Processing and Analytics

Distributed Computing Is The Rage Everywhere … Except In Academia

That Should Disturb You

Page 5: Survey of Spark for Data Pre-Processing and Analytics

Some Ideal Applications (using Spark's MLlib library)

The highly distributed nature of Spark means it is ideal for…
• Generating lots of…
  o Trees in a random forest (see the MLlib sketch below)
  o Permutations for computing distributions
  o Samplings (bootstrapping, sub-sampling)
• Cleaning up textual data, e.g.,
  o NLP on EHR records
  o Mapping variant spellings of drug names to UMLS CUIs
• Normalizing datasets
• Computing basic statistics
• Hypothesis testing
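For instance, a minimal random-forest sketch with MLlib might look like the following. This is a sketch only: the LIBSVM data path and tuning values are illustrative assumptions, and sc is the SparkContext that spark-shell provides.

  import org.apache.spark.mllib.tree.RandomForest
  import org.apache.spark.mllib.util.MLUtils

  // Load a LIBSVM-formatted dataset as an RDD[LabeledPoint]; the path is illustrative.
  val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
  val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

  // Grow 500 trees; each tree is built from its own bootstrap sample, in parallel across the cluster.
  val numClasses = 2
  val categoricalFeaturesInfo = Map[Int, Int]()   // all features treated as continuous
  val numTrees = 500
  val model = RandomForest.trainClassifier(train, numClasses, categoricalFeaturesInfo,
    numTrees, "auto", "gini", 5, 32)              // featureSubsetStrategy, impurity, maxDepth, maxBins

  // Fraction of misclassified test points.
  val testErr = test.map(p => if (model.predict(p.features) != p.label) 1.0 else 0.0).mean()
  println(s"Test error = $testErr")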

Page 6: Survey of Spark for Data Pre-Processing and Analytics

Spark = Speed

• The focus of Spark is to make data analytics fast
  • fast to run code
  • fast to write code
• To run programs faster, Spark provides primitives for in-memory cluster computing
  o A job can load data into memory and query it repeatedly
  o Much quicker than with disk-based systems like Hadoop MapReduce
• To make programming faster, Spark integrates with the Scala programming language
  o Enables manipulation of distributed datasets as if they were local collections
  o Spark can be used interactively to query big data from the Scala interpreter (see the sketch below)
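A minimal sketch of the in-memory point, run in spark-shell (the HDFS path is a hypothetical example; sc is already defined by the shell):

  // Keep the filtered RDD in cluster memory for repeated querying.
  val logs = sc.textFile("hdfs:///data/logs/2015-08-12.log")
  val errors = logs.filter(_.contains("ERROR")).cache()

  // The first action reads from disk and materializes the cache...
  val nErrors = errors.count()
  // ...subsequent queries are served from memory instead of re-reading the file.
  val nTimeouts = errors.filter(_.contains("timeout")).count()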

Page 7: Survey of Spark for Data Pre-Processing and Analytics

Marketing-Level Architecture Stack

[Diagram: Spark software stack running on top of HDFS]

Page 8: Survey of Spark for Data Pre-Processing and Analytics

Architecture: Closer To Reality

Page 9: Survey of Spark for Data Pre-Processing and Analytics

Dynamics

Page 10: Survey of Spark for Data Pre-Processing and Analytics

Data Sources Integration

Page 11: Survey of Spark for Data Pre-Processing and Analytics

Ecosystem

Page 12: Survey of Spark for Data Pre-Processing and Analytics

Spark’s Machine Learning Library: MLlib

Page 13: Survey of Spark for Data Pre-Processing and Analytics

MLlib: Current Functionality

Page 14: Survey of Spark for Data Pre-Processing and Analytics

Spark Programming Model

Spark follows the REPL model: read–eval–print loop
  o Similar to the R and Python shells
  o Ideal for exploratory data analysis

Writing a Spark program typically consists of:
1. Reading some input data into local memory
2. Invoking transformations or actions that operate on a subset of the data in local memory
3. Running those transformations/actions in a distributed fashion across the network (memory or disk)
4. Deciding what actions to undertake next

Best of all, it can all be done within the shell, just like R (or Python); see the sketch below.
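A minimal sketch of that cycle in spark-shell (the file name and column layout are hypothetical):

  // 1. Point Spark at some input data (a hypothetical CSV of measurements).
  val lines = sc.textFile("measurements.csv")

  // 2. Define transformations; nothing runs yet, Spark just records the lineage.
  val values = lines.map(_.split(",")).map(fields => fields(1).toDouble)

  // 3. Invoke an action, which executes the transformations across the cluster.
  val avg = values.mean()

  // 4. Decide what to do next, e.g. pull back a sample of extreme values.
  val extremes = values.filter(v => v > 10 * avg).take(20)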

Page 15: Survey of Spark for Data Pre-Processing and Analytics

RDD: The Secret Sauce

Page 16: Survey of Spark for Data Pre-Processing and Analytics

RDD = Distributed Data Frame

• RDD = "Resilient Distributed Dataset"
  o Similar to an R data frame, but laid out across a cluster of machines as a collection of partitions
    • Partition = subset of the data
  o The Spark master node remembers all transformations applied to an RDD
    • If a partition is lost (e.g., a slave machine goes down), it can easily be reconstructed on some other machine in the cluster ("lineage")
  o "Resilient" = an RDD can always be reconstructed because of lineage tracking (see the sketch below)
• RDDs are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster
  o val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
    • 4 = number of "partitions"
    • Partitions are the fundamental unit of parallelism
  o val rdd2 = sc.textFile("linkage")
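A small sketch of lineage in spark-shell (toDebugString simply prints the lineage Spark has recorded):

  // Build an RDD with 4 partitions, then derive a new RDD from it.
  val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
  val doubled = rdd.map(_ * 2)

  // Nothing has been computed yet, but Spark remembers how `doubled` derives from `rdd`.
  println(doubled.toDebugString)

  // If a partition of `doubled` is lost, only the map over the matching partition of `rdd`
  // is re-run; the whole dataset is not recomputed.
  doubled.collect()   // Array(2, 4, 4, 8)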

Page 17: Survey of Spark for Data Pre-Processing and Analytics

Partitions = Data Slices

• Partitions are the "slices" a dataset is cut into
• Spark will run one task for each partition
• Typically 2-4 partitions for each CPU in the cluster
  o Normally, Spark tries to set the number of partitions automatically based on your cluster
  o Can also be set manually as a second parameter to parallelize (e.g., sc.parallelize(data, 4)), as in the sketch below
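For example, a quick check in spark-shell (the exact default depends on the cluster configuration):

  // Let Spark pick the number of partitions...
  val auto = sc.parallelize(1 to 1000)
  println(auto.partitions.length)     // typically the cluster's default parallelism (e.g., total cores)

  // ...or set it explicitly as the second argument to parallelize.
  val manual = sc.parallelize(1 to 1000, 4)
  println(manual.partitions.length)   // 4: Spark will run one task per partition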

Page 18: Survey of Spark for Data Pre-Processing and Analytics

A Brief Word About Scala

• One of the two languages Spark is built on (the other being Java)
• Compiles to Java byte code
• REPL-based, so good for exploratory data analysis
• Pure OO language
• Much more streamlined than Java
  o Way fewer lines of code
• Lots of type inference
• Seamless calls to Java
  o E.g., you might be using a Java method inside a Scala object (see the sketch below)
• Broad's GATK is written in Scala
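A tiny illustration of the type-inference and Java-interop points (plain Scala, no Spark involved):

  // Types are inferred: `counts` ends up as Map[String, Int] with no annotations.
  val words = List("spark", "scala", "spark")
  val counts = words.groupBy(identity).mapValues(_.size)

  // Calling Java from Scala is seamless: java.util.UUID is an ordinary Java class.
  val id = java.util.UUID.randomUUID()
  println(s"$counts, $id")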

Page 19: Survey of Spark for Data Pre-Processing and Analytics

Example: Word Count In Spark Using Scala

scala> val hamlet = sc.textFile("/Users/akuntamukkala/temp/gutenburg.txt")
hamlet: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val topWordCount = hamlet.flatMap(str => str.split(" ")).
         filter(!_.isEmpty).
         map(word => (word, 1)).
         reduceByKey(_ + _).
         map { case (word, count) => (count, word) }.
         sortByKey(false)
topWordCount: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at sortByKey at <console>:14

scala> topWordCount.take(5).foreach(x => println(x))
(1044,the)
(730,and)
(679,of)
(648,to)
(511,I)

The chain of transformations above is the RDD lineage.

Page 20: Survey of Spark for Data Pre-Processing and Analytics

More On Word Count

Page 21: Survey of Spark for Data Pre-Processing and Analytics

A Quick Tour of Common Spark Functions

Page 22: Survey of Spark for Data Pre-Processing and Analytics

Common Transformations

filter(func)
Purpose: new RDD by selecting those data elements on which func returns true
Example:
  scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
  scala> val filtered = rdd.filter(_.contains("C"))
  scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)

map(func)
Purpose: return a new RDD by applying func to each data element
Example:
  scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
  scala> val times2 = rdd.map(_ * 2)
  scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func)
Purpose: similar to map, but func returns a Seq instead of a single value; for example, mapping a sentence into a Seq of words
Example:
  scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
  scala> val fm = rdd.flatMap(str => str.split(" "))
  scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)

reduceByKey(func, [numTasks])
Purpose: aggregate the values of a key using a function; numTasks is an optional parameter specifying the number of reduce tasks
Example:
  scala> val word1 = fm.map(word => (word, 1))
  scala> val wrdCnt = word1.reduceByKey(_ + _)
  scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

groupByKey([numTasks])
Purpose: convert (K,V) to (K, Iterable<V>)
Example:
  scala> val cntWrd = wrdCnt.map { case (word, count) => (count, word) }
  scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))

distinct([numTasks])
Purpose: eliminate duplicates from an RDD
Example:
  scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)

Page 23: Survey of Spark for Data Pre-Processing and Analytics

Common Actions

count()
Purpose: get the number of data elements in the RDD
Example:
  scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
  scala> rdd.count()
Result: Long = 3

collect()
Purpose: get all the data elements in an RDD as an array
Example:
  scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
  scala> rdd.collect()
Result: Array[Char] = Array(A, B, c)

reduce(func)
Purpose: aggregate the data elements in an RDD using a function that takes two arguments and returns one
Example:
  scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
  scala> rdd.reduce(_ + _)
Result: Int = 10

take(n)
Purpose: fetch the first n data elements in an RDD; computed by the driver program
Example:
  scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
  scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)

foreach(func)
Purpose: execute the function for each data element in the RDD; usually used to update an accumulator (discussed later) or to interact with external systems
Example:
  scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
  scala> rdd.foreach(x => println("%s*10=%s".format(x, x * 10)))
Result: 1*10=10 4*10=40 3*10=30 2*10=20

first()
Purpose: retrieve the first data element in the RDD; similar to take(1)
Example:
  scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
  scala> rdd.first()
Result: Int = 1

saveAsTextFile(path)
Purpose: write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS
Example:
  scala> val hamlet = sc.textFile("/users/akuntamukkala/temp/gutenburg.txt")
  scala> hamlet.filter(_.contains("Shakespeare")).saveAsTextFile("/users/akuntamukkala/temp/filtered")
Result:
  akuntamukkala@localhost~/temp/filtered$ ls
  _SUCCESS part-00000 part-00001

Page 24: Survey of Spark for Data Pre-Processing and Analytics

Common Set Operations

union()
Purpose: new RDD containing all elements from the source RDD and the argument
Example:
  scala> val rdd1 = sc.parallelize(List('A', 'B'))
  scala> val rdd2 = sc.parallelize(List('B', 'C'))
  scala> rdd1.union(rdd2).collect()
Result: Array[Char] = Array(A, B, B, C)

intersection()
Purpose: new RDD containing only the elements common to the source RDD and the argument
Example:
  scala> rdd1.intersection(rdd2).collect()
Result: Array[Char] = Array(B)

cartesian()
Purpose: new RDD containing the cross product of all elements from the source RDD and the argument
Example:
  scala> rdd1.cartesian(rdd2).collect()
Result: Array[(Char, Char)] = Array((A,B), (A,C), (B,B), (B,C))

subtract()
Purpose: new RDD created by removing the data elements the source RDD has in common with the argument
Example:
  scala> rdd1.subtract(rdd2).collect()
Result: Array[Char] = Array(A)

join(RDD, [numTasks])
Purpose: when invoked on (K,V) and (K,W), creates a new RDD of (K, (V,W))
Example:
  scala> val personFruit = sc.parallelize(Seq(("Andy", "Apple"), ("Bob", "Banana"), ("Charlie", "Cherry"), ("Andy", "Apricot")))
  scala> val personSE = sc.parallelize(Seq(("Andy", "Google"), ("Bob", "Bing"), ("Charlie", "Yahoo"), ("Bob", "AltaVista")))
  scala> personFruit.join(personSE).collect()
Result: Array[(String, (String, String))] = Array((Andy,(Apple,Google)), (Andy,(Apricot,Google)), (Charlie,(Cherry,Yahoo)), (Bob,(Banana,Bing)), (Bob,(Banana,AltaVista)))

cogroup(RDD, [numTasks])
Purpose: when invoked on (K,V) and (K,W), groups the values for each key into (K, (Iterable<V>, Iterable<W>))
Example:
  scala> personFruit.cogroup(personSE).collect()
Result: Array[(String, (Iterable[String], Iterable[String]))] = Array((Andy,(ArrayBuffer(Apple, Apricot),ArrayBuffer(Google))), (Charlie,(ArrayBuffer(Cherry),ArrayBuffer(Yahoo))), (Bob,(ArrayBuffer(Banana),ArrayBuffer(Bing, AltaVista))))

Page 25: Survey of Spark for Data Pre-Processing and Analytics

Spark comes with an R binding!

Page 26: Survey of Spark for Data Pre-Processing and Analytics

Spark and R: A Marriage Made In Heaven

Page 27: Survey of Spark for Data Pre-Processing and Analytics

SparkR: A Package For Computing with Spark

Page 28: Survey of Spark for Data Pre-Processing and Analytics

Side Bar: Running SparkR on Amazon AWS

1. Launch a Spark cluster at Amazon:
   ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge launch mysparkr

2. Launch SparkR:
   chmod u+w /root/spark/
   ./spark/bin/sparkR

AWS is probably the best way for academics to access Spark, given the complexity of deploying the infrastructure themselves.

Page 29: Survey of Spark for Data Pre-Processing and Analytics

And Not Just R: Python and Spark

Page 30: Survey of Spark for Data Pre-Processing and Analytics

Python Code Example

Page 31: Survey of Spark for Data Pre-Processing and Analytics

Page 32: Survey of Spark for Data Pre-Processing and Analytics

Questions?