unified big data processing with apache spark · unified big data processing with apache spark ......
TRANSCRIPT
![Page 1: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/1.jpg)
![Page 2: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/2.jpg)
Unified Big Data Processing with Apache Spark
Matei Zaharia @matei_zaharia
![Page 3: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/3.jpg)
What is Apache Spark?
Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing Most active open source project in big data
![Page 4: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/4.jpg)
About Databricks
Founded by the creators of Spark in 2013
Continues to drive open source Spark development, and offers a cloud service (Databricks Cloud)
Partners to support Spark with Cloudera, MapR, Hortonworks, Datastax
![Page 5: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/5.jpg)
Spark Community
Map
Red
uce
YAR
N
HD
FS
Stor
m
Spar
k
0
500
1000
1500
2000
Map
Red
uce
YAR
N
HD
FS
Stor
m
Spar
k
0
50000
100000
150000
200000
250000
300000
350000
Commits Lines of Code Changed Activity in past 6 months
![Page 6: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/6.jpg)
Community Growth
0
25
50
75
100
2010 2011 2012 2013 2014
Contributors per Month to Spark
2-3x more activity than Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
![Page 7: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/7.jpg)
Overview
Why a unified engine? Spark execution model Why was Spark so general? What’s next
![Page 8: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/8.jpg)
History: Cluster Programming Models
2004
![Page 9: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/9.jpg)
MapReduce
A general engine for batch processing
![Page 10: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/10.jpg)
Beyond MapReduce
MapReduce was great for batch processing, but users quickly needed to do more:
> More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing
Result: many specialized systems for these workloads
![Page 11: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/11.jpg)
MapReduce
Pregel
Giraph Presto
Storm
Dremel Drill
Impala S4 . . .
Specialized systems for new workloads
General batch processing
Big Data Systems Today
![Page 12: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/12.jpg)
Problems with Specialized Systems
More systems to manage, tune, deploy Can’t combine processing types in one application
> Even though many pipelines need to do this! > E.g. load data with SQL, then run machine learning
In many pipelines, data exchange between engines is the dominant cost!
![Page 13: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/13.jpg)
MapReduce
Pregel
Giraph Presto
Storm
Dremel Drill
Impala S4
Specialized systems for new workloads
General batch processing
Unified engine
Big Data Systems Today
? . . .
![Page 14: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/14.jpg)
Overview
Why a unified engine? Spark execution model Why was Spark so general? What’s next
![Page 15: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/15.jpg)
Background
Recall 3 workloads were issues for MapReduce: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing
While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing
![Page 16: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/16.jpg)
Data Sharing in MapReduce
iter. 1 iter. 2 . . . Input
HDFS read
HDFS write
HDFS read
HDFS write
Input
query 1
query 2
query 3
result 1
result 2
result 3
. . .
HDFS read
Slow due to data replication and disk I/O
![Page 17: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/17.jpg)
iter. 1 iter. 2 . . . Input
What We’d Like
Distributed memory
Input
query 1
query 2
query 3
. . .
one-time processing
10-100× faster than network and disk
![Page 18: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/18.jpg)
Spark Model
Resilient Distributed Datasets (RDDs) > Collections of objects that can be stored in memory or
disk across a cluster > Built via parallel transformations (map, filter, …) > Fault-tolerant without replication
![Page 19: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/19.jpg)
Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(‘\t’)[2])
messages.cache()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
messages.filter(lambda s: “foo” in s).count()
messages.filter(lambda s: “bar” in s).count()
. . .
tasks
results Cache 1
Cache 2
Cache 3
Base RDD Transformed RDD
Action
Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
![Page 20: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/20.jpg)
file.map(lambda rec: (rec.type, 1)) .reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10)
Fault Tolerance
filter reduce map
Inpu
t file
RDDs track lineage info to rebuild lost data
![Page 21: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/21.jpg)
filter reduce map
Inpu
t file
Fault Tolerance
file.map(lambda rec: (rec.type, 1)) .reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10)
RDDs track lineage info to rebuild lost data
![Page 22: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/22.jpg)
0 500
1000 1500 2000 2500 3000 3500 4000
1 5 10 20 30
Run
ning
Tim
e (s
)
Number of Iterations
Hadoop Spark
110 s / iteration
first iteration 80 s later iterations 1 s
Example: Logistic Regression
![Page 23: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/23.jpg)
// Scala:
val lines = sc.textFile(...) lines.filter(s => s.contains(“ERROR”)).count()
// Java:
JavaRDD<String> lines = sc.textFile(...); lines.filter(s -> s.contains(“ERROR”)).count();
Spark in Scala and Java
![Page 24: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/24.jpg)
How General Is It?
![Page 25: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/25.jpg)
Spark Core
Spark Streaming
real-time
Spark SQL relational
MLlib machine learning
GraphX graph
Libraries Built on Spark
![Page 26: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/26.jpg)
Represents tables as RDDs Tables = Schema + Data
Spark SQL
![Page 27: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/27.jpg)
Represents tables as RDDs Tables = Schema + Data = SchemaRDD
Spark SQL
c = HiveContext(sc)
rows = c.sql(“select text, year from hivetable”)
rows.filter(lambda r: r.year > 2013).collect()
From Hive:
{“text”: “hi”, “user”: { “name”: “matei”, “id”: 123 }}
c.jsonFile(“tweets.json”).registerTempTable(“tweets”)
c.sql(“select text, user.name from tweets”)
From JSON: tweets.json
![Page 28: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/28.jpg)
Time Input
Spark Streaming
![Page 29: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/29.jpg)
RDD RDD RDD RDD RDD RDD
Time
Represents streams as a series of RDDs over time
val spammers = sc.sequenceFile(“hdfs://spammers.seq”) sc.twitterStream(...) .filter(t => t.text.contains(“QCon”)) .transform(tweets => tweets.map(t => (t.user, t)).join(spammers)) .print()
Spark Streaming
![Page 30: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/30.jpg)
Vectors, Matrices
MLlib
![Page 31: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/31.jpg)
Vectors, Matrices = RDD[Vector] Iterative computation
MLlib
points = sc.textFile(“data.txt”).map(parsePoint) model = KMeans.train(points, 10) model.predict(newPoint)
![Page 32: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/32.jpg)
Represents graphs as RDDs of edges and vertices
GraphX
![Page 33: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/33.jpg)
Represents graphs as RDDs of edges and vertices
GraphX
![Page 34: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/34.jpg)
Represents graphs as RDDs of edges and vertices
GraphX
![Page 35: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/35.jpg)
// Load data using SQL val points = ctx.sql( “select latitude, longitude from historic_tweets”)
// Train a machine learning model val model = KMeans.train(points, 10)
// Apply it to a stream sc.twitterStream(...) .map(t => (model.closestCenter(t.location), 1)) .reduceByWindow(“5s”, _ + _)
Combining Processing Types
![Page 36: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/36.jpg)
Composing Workloads
Separate systems:
. . . HDFS read
HDFS write ET
L HDFS read
HDFS write tr
ain HDFS
read HDFS write qu
ery
HDFS write
HDFS read ET
L
trai
n
quer
y
Spark:
![Page 37: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/37.jpg)
Hiv
eIm
pala
(dis
k)
Impa
la (m
em)
Spar
k (d
isk)
Sp
ark
(mem
)
0
10
20
30
40
50
Res
pons
e Ti
me
(sec
)
SQL
Mah
out
Gra
phLa
b Sp
ark
0
10
20
30
40
50
60
Res
pons
e Ti
me
(min
)
ML
Performance vs Specialized Systems
Stor
m
Spar
k
0
5
10
15
20
25
30
35
Thro
ughp
ut (M
B/s
/nod
e)
Streaming
![Page 38: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/38.jpg)
On-Disk Performance: Petabyte Sort
Spark beat last year’s Sort Benchmark winner, Hadoop, by 3× using 10× fewer machines
2013 Record (Hadoop) Spark 100 TB Spark 1 PB
Data Size 102.5 TB 100 TB 1000 TB
Time 72 min 23 min 234 min
Nodes 2100 206 190
Cores 50400 6592 6080
Rate/Node 0.67 GB/min 20.7 GB/min 22.5 GB/min
tinyurl.com/spark-sort
![Page 39: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/39.jpg)
Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
![Page 40: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/40.jpg)
Why was Spark so General? In a world of growing data complexity, understanding this can help us design new tools / pipelines Two perspectives:
> Expressiveness perspective > Systems perspective
![Page 41: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/41.jpg)
1. Expressiveness Perspective Spark ≈ MapReduce + fast data sharing
![Page 42: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/42.jpg)
1. Expressiveness Perspective
How to share data!quickly across steps?
Local computation All-to-all communication One MR step
…How low is this latency?
Spark: RDDs
Spark: ~100 ms
MapReduce can emulate any distributed system!
![Page 43: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/43.jpg)
2. Systems Perspective
Main bottlenecks in clusters are network and I/O
Any system that lets apps control these resources can match speed of specialized ones
In Spark: > Users control data partitioning & caching > We implement the data structures and algorithms of
specialized systems within Spark records
![Page 44: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/44.jpg)
Examples Spark SQL
> A SchemaRDD holds records for each chunk of data (multiple rows), with columnar compression
GraphX > GraphX represents graphs as an RDD of HashMaps so
that it can join quickly against each partition
![Page 45: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/45.jpg)
Result Spark can leverage most of the latest innovations in databases, graph processing, machine learning, … Users get a single API that composes very efficiently
More info: tinyurl.com/matei-thesis
![Page 46: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/46.jpg)
Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
![Page 47: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/47.jpg)
What’s Next for Spark While Spark has been around since 2009, many pieces are just beginning 300 contributors, 2 whole libraries new this year Big features in the works
![Page 48: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/48.jpg)
Spark 1.2 (Coming in Dec) New machine learning pipelines API
> Featurization & parameter search, similar to SciKit-Learn
Python API for Spark Streaming
Spark SQL pluggable data sources > Hive, JSON, Parquet, Cassandra, ORC, …
Scala 2.11 support
![Page 49: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/49.jpg)
Beyond Hadoop
Batch Interactive Streaming
Hadoop Cassandra Mesos
…
… Public Clouds
Your application
here
Unified API across workloads, storage systems and environments
![Page 50: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/50.jpg)
Learn More Downloads and tutorials: spark.apache.org
Training: databricks.com/training (free videos)
Databricks Cloud: databricks.com/cloud
![Page 51: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/51.jpg)
www.spark-summit.org
![Page 52: Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest](https://reader031.vdocuments.mx/reader031/viewer/2022021716/5b29df1d7f8b9a0b1e8b52e1/html5/thumbnails/52.jpg)