an introduction to apache spark big data madison: 29 july 2014
TRANSCRIPT
![Page 1: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/1.jpg)
William Benton @willb Red Hat, Inc.
An Introduction to Apache Spark Big Data Madison: 29 July 2014
![Page 2: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/2.jpg)
About me
• At Red Hat for almost 6 years, working on distributed computing
• Currently contributing to Spark, active Fedora packager and sponsor
• Before Red Hat: concurrency and program analysis research
![Page 3: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/3.jpg)
Forecast
• Background
• Resilient Distributed Datasets
• Spark Libraries
• Community Overview
![Page 4: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/4.jpg)
Background
![Page 5: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/5.jpg)
Processing data in 2005
• MapReduce paper applied very old ideas to distributed data processing
• Hadoop provided open-source MR and distributed FS implementations
• MR allows scale-out on commodity clusters for some real problems
![Page 6: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/6.jpg)
Processing data in 2005
• MapReduce paper applied very old ideas to distributed data processing
• Hadoop provided open-source MR and distributed FS implementations
• MR allows scale-out on commodity clusters for some real problems
Dean & Ghemawat, 2004
![Page 7: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/7.jpg)
Processing data in 2005
• MapReduce paper applied very old ideas to distributed data processing
• Hadoop provided open-source MR and distributed FS implementations
• MR allows scale-out on commodity clusters for some real problems
Dean & Ghemawat, 2004 McCarthy, 1960
![Page 8: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/8.jpg)
Hadoop and MR shortcomings
• MR works well for batch jobs; suffers for iterative or interactive ones
• Special-purpose extensions for machine learning, query, etc.
• MR model is not a natural fit for many programmers or programs
![Page 9: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/9.jpg)
Hadoop and MR shortcomings
• MR works well for batch jobs; suffers for iterative or interactive ones
• Special-purpose extensions for machine learning, query, etc.
• MR model is not a natural fit for many programmers or programs
MAHOUT
![Page 10: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/10.jpg)
Hadoop and MR shortcomings
• MR works well for batch jobs; suffers for iterative or interactive ones
• Special-purpose extensions for machine learning, query, etc.
• MR model is not a natural fit for many programmers or programs
MAHOUT HIVE, PIG
![Page 11: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/11.jpg)
Hadoop and MR shortcomings
• MR works well for batch jobs; suffers for iterative or interactive ones
• Special-purpose extensions for machine learning, query, etc.
• MR model is not a natural fit for many programmers or programs
MAHOUT HIVE, PIG
![Page 12: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/12.jpg)
Hadoop and MR shortcomings
• MR works well for batch jobs; suffers for iterative or interactive ones
• Special-purpose extensions for machine learning, query, etc.
• MR model is not a natural fit for many programmers or programs
MAHOUT HIVE, PIG
DATA SCIENTIST TIME: $460/8 hours EC2 C3 Large INSTANCE: $0.84/8 hours
![Page 13: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/13.jpg)
Apache Spark
• Introduced in 2009; donated to Apache in 2013; 1.0 release in 2014
• Based on a fundamental abstraction rather than an execution model
• Supports in-memory computing and a wide range of problems
![Page 14: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/14.jpg)
Spark is general
Spark core
![Page 15: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/15.jpg)
Spark is general
Spark core
Graph
![Page 16: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/16.jpg)
Spark is general
Spark core
Graph SQL
![Page 17: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/17.jpg)
Spark is general
Spark core
Graph SQL ML
![Page 18: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/18.jpg)
Spark is general
Spark core
Graph SQL ML Streaming
![Page 19: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/19.jpg)
Spark is general
Spark core
Graph SQL ML Streaming
ad hoc
![Page 20: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/20.jpg)
Spark is general
Spark core
Graph SQL ML Streaming
ad hoc Mesos
![Page 21: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/21.jpg)
Spark is general
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARN
![Page 22: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/22.jpg)
Spark is general
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARN
APIs for SCALA, Java, Python, and R (3rd-party bindings for Clojure et al.)
![Page 23: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/23.jpg)
RDDs
![Page 24: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/24.jpg)
Resilient distributed datasets
![Page 25: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/25.jpg)
Resilient distributed datasets
![Page 26: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/26.jpg)
Resilient distributed datasets
![Page 27: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/27.jpg)
Resilient distributed datasets
![Page 28: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/28.jpg)
Resilient distributed datasets
Partitioned across machines by range…
![Page 29: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/29.jpg)
Resilient distributed datasets
Partitioned across machines by range…
![Page 30: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/30.jpg)
Resilient distributed datasets
Partitioned across machines by range…
…or BY HASH
![Page 31: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/31.jpg)
Resilient distributed datasets
![Page 32: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/32.jpg)
Resilient distributed datasets
Failures mean Partitions can disappear…
?
![Page 33: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/33.jpg)
Resilient distributed datasets
Failures mean Partitions can disappear…
…but they can be reconstructed!
![Page 34: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/34.jpg)
RDDs are partitioned, immutable, lazy collections
![Page 35: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/35.jpg)
RDDs are partitioned, immutable, lazy collections
TRANSFORMATIONS create new RDDs that encode a dependency DAG !
ACTIONS result in executing cluster jobs & return values to the driver
![Page 36: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/36.jpg)
RDDs (more formally)
• A set of partitions
• Lineage information
• A function to compute partitions from parent partitions
• A partitioning strategy
• Preferred locations for partitions
![Page 37: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/37.jpg)
RDDs (more formally)
• A set of partitions
• Lineage information
• A function to compute partitions from parent partitions
• A partitioning strategy
• Preferred locations for partitions(THESE ARE OPTIONAL)
![Page 38: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/38.jpg)
Creating RDDs
• From a collection: parallelize()
• …a local or remote file: textFile()
• …or HDFS: hadoopFile(); sequenceFile(); objectFile()
• (These all act lazily)
![Page 39: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/39.jpg)
RDD[T] transformations
•map(f: T=>U): RDD[U]
•flatMap(f: T=>Seq[U]): RDD[U]
•filter(f: T=>Boolean): RDD[T]
•distinct(): RDD[T]
•keyBy(f: T=>K): RDD[(K, T)]
![Page 40: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/40.jpg)
RDD[(K,V)] transformations
•sortByKey(): RDD[(K,V)]
•groupByKey(): RDD[(K,Seq[V])]
•reduceByKey(f: (V,V)=>V): RDD[(K,V)]
•join(other: RDD[(K,W)]): RDD[(K,(V,W))]
![Page 41: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/41.jpg)
RDD[(K,V)] transformations
•sortByKey(): RDD[(K,V)]
•groupByKey(): RDD[(K,Seq[V])]
•reduceByKey(f: (V,V)=>V): RDD[(K,V)]
•join(other: RDD[(K,W)]): RDD[(K,(V,W))]
…and many Others including cartesian product, cogroup, set operations, &C.
![Page 42: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/42.jpg)
Other RDD transformations
• Explicitly repartition and shuffle or coalesce to fewer partitions
• Provide hints to cache an intermediate RDD in memory or persist it to memory and/or disk
(Remember: all transformations are lazy)
![Page 43: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/43.jpg)
RDD[T] actions
•collect(): Array[T]
•count(): Long
•reduce(f: (T,T)=>T): T
•saveAsTextFile(path)
•saveAsSequenceFile(path)
![Page 44: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/44.jpg)
RDD[T] actions
•collect(): Array[T]
•count(): Long
•reduce(f: (T,T)=>T): T
•saveAsTextFile(path)
•saveAsSequenceFile(path)…and many Others including foreach, take, sampling, &C.
![Page 45: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/45.jpg)
RDD[T] actions
•collect(): Array[T]
•count(): Long
•reduce(f: (T,T)=>T): T
•saveAsTextFile(path)
•saveAsSequenceFile(path)…and many Others including foreach, take, sampling, &C.
(Remember: ALL actions are eager)
![Page 46: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/46.jpg)
Example: word count in Sparkval file = spark.textFile("hdfs://...") !
val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) !
counts.saveAsTextFile("hdfs://...")
![Page 47: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/47.jpg)
Libraries
![Page 48: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/48.jpg)
Spark libraries
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARN
![Page 49: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/49.jpg)
Spark libraries
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARNFIXED-point computation
![Page 50: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/50.jpg)
Spark libraries
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARNFIXED-point computation
query execution
![Page 51: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/51.jpg)
Spark libraries
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARNFIXED-point computation
query executionmodel parameter Optimization
![Page 52: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/52.jpg)
Spark libraries
Spark core
Graph SQL ML Streaming
ad hoc Mesos YARNFIXED-point computation
query executionmodel parameter Optimization
![Page 53: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/53.jpg)
Spark MLlib• Implementations of classic learning
algorithms on data in RDDs
• Regression, classification, clustering, recommendation, filtering, etc.
• High performance due to caching, in-memory execution
![Page 54: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/54.jpg)
Spark SQL features
• SchemaRDD type
• Catalyst: a relational algebra analysis/optimization framework
• SQL and HiveQL implementations; Hive warehouse support
• LINQ-style embedded query DSL
![Page 55: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/55.jpg)
Spark SQL examplecase class Trackpoint(lat: Double, lon: Double, ts: Long) {} !
// assume points is an RDD of Trackpoints points.registerAsTable("points") !
val results = sql("""select * from points where ts > max(ts) - 600""")
![Page 56: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/56.jpg)
Spark Streaming
• Goal: use the same abstraction for streaming as for batch or interactive
• Discretized stream abstraction: streams as sequences of RDDs
Streaming engine Spark
input stream
windowed data (RDDs)
processed data (RDDs)
![Page 57: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/57.jpg)
Community & Applications
![Page 58: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/58.jpg)
Developer community
• https://github.com/apache/spark
• First open-source release in 2010
• 121k lines of code
• 300+ contributors all-time; 80+ in the last month
• Active and friendly mailing list
![Page 59: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/59.jpg)
User community
• Spark Summit: established in 2013; growing and expanding
• Lots of cool applications (BI, medical, geospatial, security, fun)
• Many learning resources
![Page 60: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/60.jpg)
User community
• Spark Summit: established in 2013; growing and expanding
• Lots of cool applications (BI, medical, geospatial, security, fun)
• Many learning resources
more than 2x as big in ‘14
![Page 61: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/61.jpg)
User community
• Spark Summit: established in 2013; growing and expanding
• Lots of cool applications (BI, medical, geospatial, security, fun)
• Many learning resources
more than 2x as big in ‘14
Spark SUMMIT EAST: NYC 2015
![Page 62: An Introduction to Apache Spark Big Data Madison: 29 July 2014](https://reader034.vdocuments.mx/reader034/viewer/2022051716/58a2cdd21a28abaa338b6302/html5/thumbnails/62.jpg)
http://spark-summit.org/2014/agenda