TRANSCRIPT
Simplifying Big Data Analysis with Apache Spark
Matei Zaharia
April 27, 2015
What is Apache Spark?
Fast and general cluster computing engine interoperable with Apache Hadoop
Improves efficiency through: – In-memory data sharing – General computation graphs
Improves usability through: – Rich APIs in Java, Scala, Python – Interactive shell
Up to 100× faster
2-5× less code
A General Engine
[Diagram: Spark Core with libraries on top: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), …]
A Large Community
[Charts: commits in past year (0-4,500) and lines of code changed in past year (0-800,000) for MapReduce, YARN, HDFS, Storm, and Spark]
About Databricks
Founded by the creators of Spark; remains the largest contributor
Offers a hosted service, Databricks Cloud – Spark on EC2 with notebooks, dashboards, scheduled jobs
This Talk
Introduction to Spark Built-in libraries New APIs in 2015
– DataFrames – Data sources – ML Pipelines
Why a New Programming Model?
MapReduce simplified big data analysis
But users quickly wanted more: – More complex, multi-pass analytics (e.g. ML, graph) – More interactive ad-hoc queries – More real-time stream processing
All 3 need faster data sharing in parallel apps
Data Sharing in MapReduce
[Diagram: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS; interactive queries (query 1, 2, 3, …) each re-read the input from HDFS]
Slow due to data replication and disk I/O
What We’d Like
[Diagram: one-time processing loads the input into distributed memory (10-100× faster than network and disk); queries 1, 2, 3, … then run against the in-memory data]
Spark Model
Write programs in terms of transformations on datasets
Resilient Distributed Datasets (RDDs)
– Collections of objects that can be stored in memory or disk across a cluster
– Built via parallel transformations (map, filter, …) – Automatically rebuilt on failure
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
[Diagram: the driver sends tasks to three workers, each holding one block of the file (Block 1-3)]
messages.filter(lambda s: "foo" in s).count()
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: workers return results to the driver and keep the messages partitions in memory (Cache 1-3); lines is the base RDD, messages a transformed RDD, and count() an action]
Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data
messages = textFile(...).filter(lambda s: "ERROR" in s)
                        .map(lambda s: s.split("\t")[2])

[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)]
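A quick way to see this lineage from the PySpark shell is toDebugString(), which prints an RDD's dependency chain (the path below is a hypothetical stand-in, and sc is the shell's SparkContext):

messages = sc.textFile("hdfs://.../app.log") \
             .filter(lambda s: "ERROR" in s) \
             .map(lambda s: s.split("\t")[2])
# Prints the chain of RDDs (text file -> filtered -> mapped) that Spark
# would replay to rebuild a lost partition
print(messages.toDebugString())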
Example: Logistic Regression
[Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark; Hadoop: ~110 s per iteration; Spark: 80 s first iteration, ~1 s later iterations]
On-Disk Performance: Time to sort 100 TB
Source: Daytona GraySort benchmark, sortbenchmark.org
– 2013 record (Hadoop): 2,100 machines, 72 minutes
– 2014 record (Spark): 207 machines, 23 minutes
Supported Operators
map
filter
groupBy
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
flatMap
take
first
partitionBy
pipe
distinct
save
...
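A small PySpark sketch combining a few of these operators; the data is made up, and sc is assumed to be an existing SparkContext:

# Two pair RDDs keyed by user id
visits = sc.parallelize([(1, "home"), (2, "cart"), (1, "checkout")])
names = sc.parallelize([(1, "alice"), (2, "bob")])

# reduceByKey: count visits per user
counts = visits.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)

# join: attach names, then bring the small result back to the driver
counts.join(names).collect()   # e.g. [(1, (2, 'alice')), (2, (1, 'bob'))]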
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
User Community
Over 500 production users
Clusters up to 8,000 nodes, processing 1 PB/day
Single jobs over 1 PB
This Talk
Introduction to Spark Built-in libraries New APIs in 2015
– DataFrames – Data sources – ML Pipelines
Built-in Libraries
[Diagram: Spark Core with libraries on top: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph)]
Key Idea
Instead of having separate execution engines for each task, all libraries work directly on RDDs
Caching + DAG model is enough to run them efficiently
Combining libraries into one program is much faster
Spark SQL
Represents tables as RDDs
Tables = Schema + Data
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi", "user": {"name": "matei", "id": 123}}

c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")
Spark Streaming
Represents streams as a series of RDDs over time
[Diagram: the input stream is chopped by time into a series of RDDs processed by Spark Streaming]

sc.twitterStream(...)
  .map(lambda t: (t.username, 1))
  .reduceByWindow("30s", lambda a, b: a + b)
  .print()
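The twitterStream call above is illustrative rather than a real PySpark API; a minimal runnable sketch of the same series-of-RDDs idea, assuming text arriving on a local socket (e.g. started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, batchDuration=5)        # one RDD of input every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None,
                                     windowDuration=30, slideDuration=5))
counts.pprint()                                    # print each window's word counts

ssc.start()
ssc.awaitTermination()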
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation

points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
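The parsePoint function above is left to the reader; a runnable sketch with a stand-in parser, assuming data.txt holds whitespace-separated numeric features and sc is an existing SparkContext:

from pyspark.mllib.clustering import KMeans

# Stand-in for parsePoint: each line becomes a list of floats (a feature vector)
points = sc.textFile("data.txt") \
           .map(lambda line: [float(x) for x in line.split()])

model = KMeans.train(points, k=10)      # runs the iterative K-means algorithm on the RDD
model.predict([0.1, 0.2, 0.3])          # cluster index for a new 3-feature point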
Represents graphs as RDDs of vertices and edges
GraphX
Performance vs. Specialized Engines
[Chart: SQL response time (sec, 0-50) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)]
[Chart: ML response time (min, 0-60) for Mahout, GraphLab, Spark]
[Chart: Streaming throughput (MB/s/node, 0-35) for Storm and Spark]
Combining Processing Types
# Load data using SQL
points = ctx.sql("select latitude, longitude from hive_tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
Combining Processing Types
Separate engines: each step (prepare, train, apply) reads its input from HDFS and writes its output back to HDFS
Spark: a single HDFS read and write, with prepare, train, and apply running in memory, plus interactive analysis on the same data
This Talk
Introduction to Spark Built-in libraries New APIs in 2015
– DataFrames – Data sources – ML Pipelines
Main Directions in 2015
Data Science: making it easier for a wider class of users
Platform Interfaces: scaling the ecosystem
From MapReduce to Spark

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Beyond MapReduce Experts
[Diagram: early adopters are users who understand MapReduce & functional APIs; the next wave is data scientists, statisticians, R users, the PyData community, …]
Data Frames
De facto data processing abstraction for data science (R and Python)
[Chart: Google Trends for "dataframe"]
From RDDs to DataFrames
Spark DataFrames
Collections of structured data similar to R, pandas
Automatically optimized via Spark SQL
– Columnar storage – Code-gen. execution
df = jsonFile("tweets.json")
df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")
[Chart: running time (0-10) for Python, Scala, and DataFrame versions of the same computation]
Optimization via Spark SQL
DataFrame expressions are relational queries, letting Spark inspect them
Automatically performs expression optimization, join algorithm selection, columnar storage, and compilation to Java bytecode
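One way to watch this happen is DataFrame.explain(), which prints the plans Spark SQL builds before anything runs; the file and column names below are hypothetical, and sqlContext is assumed to be an existing SQLContext:

df = sqlContext.jsonFile("tweets.json")
agg = df.filter(df["lang"] == "en") \
        .groupBy("date") \
        .sum("retweets")
agg.explain(True)    # logical plan, optimized plan, and physical plan
agg.collect()        # only now does the optimized plan actually execute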
Machine Learning Pipelines
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: a DataFrame flows through tokenizer → TF → LR to produce a fitted model]
High-level API similar to scikit-learn
Operates on DataFrames
Grid search to tune params across a whole pipeline
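The code above is schematic; a fuller sketch with the column wiring spelled out (the toy DataFrame and column names are hypothetical, and sqlContext is an existing SQLContext):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Toy training data: free text plus a 0/1 label
df = sqlContext.createDataFrame(
    [("spark is fast", 1.0), ("legacy mapreduce job", 0.0)],
    ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=10)

pipe = Pipeline(stages=[tokenizer, tf, lr])   # stages run in order on each DataFrame
model = pipe.fit(df)                          # fits every stage, returns a PipelineModel
model.transform(df).select("text", "prediction").show()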
Spark R Interface
Exposes DataFrames and ML pipelines in R
Parallelizes calls to R code
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
Target: Spark 1.4 (June)
Main Directions in 2015
Data Science: making it easier for a wider class of users
Platform Interfaces: scaling the ecosystem
Data Sources API
Allows plugging smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources
[Diagram: Spark connected to pluggable data sources, e.g. {JSON}]
Data Sources API
[Diagram: Spark plans the query SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = 'en' and pushes SELECT * FROM users WHERE lang = 'en' down into the MySQL source]
Current Data Sources
Built-in: Hive, JSON, Parquet, JDBC
Community: CSV, Avro, ElasticSearch, Redshift, Cloudant, Mongo, Cassandra, SequoiaDB
List at spark-packages.org
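A sketch of how these look from Python, using Spark 1.3's generic SQLContext.load interface; the file paths are hypothetical, and the CSV example assumes the community spark-csv package is on the classpath:

# Built-in source: Parquet files become a DataFrame
events = sqlContext.parquetFile("hdfs://.../events.parquet")

# External source, selected by name (options are passed through to the source)
cars = sqlContext.load(source="com.databricks.spark.csv",
                       header="true", path="cars.csv")

# Whatever the source, the result is a DataFrame usable from SQL
cars.registerTempTable("cars")
sqlContext.sql("select count(*) from cars")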
Goal: unified engine across data sources, workloads and environments
To Learn More
Downloads & docs: spark.apache.org Try Spark in Databricks Cloud: databricks.com Spark Summit: spark-summit.org