Transforming Big Data with Spark and Shark - AWS re:Invent 2012 BDT305
Transforming Big Data with Spark and Shark
Michael Franklin and Matei Zaharia – UC Berkeley
Sources Driving Big Data
It's All Happening Online
• Every: click, ad impression, billing event, fast forward/pause, friend request, transaction, network message, fault, …
• User generated (web, social & mobile)
• Internet of Things / M2M
• Scientific computing
Big Data: The Challenges
• Volume: terabytes to petabytes+
• Variety: structured and unstructured
• Velocity: batch and real-time
Our view: more data should mean better answers
• Must deal with vertical and horizontal growth
• Must balance cost, time, and answer quality
AMP Expedition
Resources for Making Sense at Scale
• Algorithms: machine learning and analytics
• Machines: cloud computing
• People: crowdsourcing & human computation
…applied to massive and diverse data.
The AMPLab Big Bets
• New "Big Data" stacks are limited by traditional intellectual borders
• Need machine learning / systems / database co-design
• Requires cohabitation and real collaboration
• Opportunity to rethink fundamental design points:
  • Low latency
  • Variable consistency
  • Cloud-based elastic resources
  • Desire for new solutions in the marketplace
• Consider the role of people throughout the entire analytics lifecycle
AMPLab Facts
An integration of faculty interests (* = directors), organized for collaboration:
• Alex Bayen (mobile sensing), Anthony Joseph (security/privacy)
• Ken Goldberg (crowdsourcing), Randy Katz (systems)
• *Michael Franklin (databases), Dave Patterson (systems)
• Armando Fox (systems), *Ion Stoica (systems)
• *Mike Jordan (machine learning), Scott Shenker (networking)
+ ~50 amazing students, post-docs, staff & visitors
AMP Facts (continued)
• Launched February 2011; 6-year duration
• Strong industry and government support: NSF Expedition and DARPA XData
• BDAS stack components released as BSD/Apache open source (e.g., Spark, Shark, Mesos)
App: Carat - Detection of Smartphone Energy Bugs
• > 450,000 downloads
App: Cancer Tumor Genomics
• Vision: personalized therapy
"…10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information." (Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11)
• UCSF cancer researchers + UCSC cancer genetic database + AMPLab + Intel
• Cluster @ TCGA: 5 PB = 20 cancers x 1000 genomes
• Sequencing costs down ~150x → Big Data
David Patterson, "Computer Scientists May Have What It Takes to Help Cure Cancer," New York Times, 12/5/2011
[Chart: cost ($K) per genome, log scale from $100,000 down to $0.1, 2001-2014]
• See Dave Patterson’s Talk: Thursday 3-4, BDT205
BDAS: The Berkeley Data Analytics System
[Stack diagram, top to bottom:]
• MLBase (declarative machine learning)
• BlinkDB (approximate query processing)
• Shark (SQL) + Streaming
• Spark, alongside 3rd-party engines (Hadoop MR, MPI, GraphLab, etc.)
• Shared RDDs (distributed memory)
• Mesos (cluster resource manager)
• HDFS (3rd party)
Legend: AMPLab released (Spark, Shark, Mesos); AMPLab in progress (Streaming, Shared RDDs, MLBase, BlinkDB); 3rd party (HDFS, Hadoop MR, MPI, GraphLab)
BDAS: Where We're Going
• Time, cost, and quality tradeoffs using sampling and the "Bag of Little Bootstraps"
• Refactoring the distributed memory layer for sharing
• Low-latency (real-time) processing via discretized streams
• Graph processing and asynchronous computation
• Declarative machine learning libraries that utilize these interfaces for scalability
• A "logical plan" level to serve as the narrow waist for these and future components
• Integration of the "people" component (e.g., CrowdDB)
For More Information
amplab.cs.berkeley.edu
• Papers and project pages
• News updates and blogs
• Spark User Group and Meetup
• GitHub and Apache Mesos
Deep Dive: Spark and Shark
What is Spark?
• Fast, MapReduce-like engine: "lightning-fast cluster computing"
• In-memory storage for very fast iterative queries
• General execution graphs
• Up to 100x faster than Hadoop (2-10x even for on-disk data)
• Compatible with Hadoop's storage APIs: can access HDFS, HBase, S3, SequenceFiles, etc.
What is Shark?
• Port of Apache Hive to run on Spark
• Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
• Can be more than 100x faster than Hive
Project History
• Spark started in 2009, open sourced 2010
• Shark released spring 2012
• In use at Yahoo!, Klout, Airbnb, Foursquare, Conviva, Quantifind & others
• 400+ member meetup, 20+ developers
Spark
• Language-integrated API in Scala, Java and soon Python
• Can be used interactively from Scala and Python shells
• Lets users manipulate distributed collections (“resilient distributed datasets”, or RDDs) with parallel operations
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers; each worker reads an HDFS block (Block 1-3), builds an in-memory cache (Cache 1-3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec(vs 170 sec for on-disk data)
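The pipeline above can be mimicked locally with ordinary Python collections. The sample log lines and variable names below are made up for illustration; real Spark would run the filter/map steps across workers and pin cachedMsgs in cluster memory.

```python
# Local Python analogue of the Scala log-mining pipeline above.
log_lines = [
    "INFO\tserver\tstartup complete",
    "ERROR\tdisk\tfoo failure on /dev/sda",
    "ERROR\tnet\tbar timeout on eth0",
    "ERROR\tdisk\tfoo failure on /dev/sdb",
]

# filter(_.startsWith("ERROR"))
errors = [l for l in log_lines if l.startswith("ERROR")]

# map(_.split('\t')(2)) -- keep only the message field
messages = [l.split("\t")[2] for l in errors]

# cache() -- here just a list held in memory; Spark pins partitions on workers
cached_msgs = messages

print(sum("foo" in m for m in cached_msgs))  # count of "foo" messages: 2
print(sum("bar" in m for m in cached_msgs))  # count of "bar" messages: 1
```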
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
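A toy sketch of the lineage idea, with illustrative class and method names (not Spark's actual internals): each dataset records its parent and the transformation applied, so lost results can always be recomputed from the source.

```python
# Minimal lineage sketch: a node remembers its parent and its transformation,
# so collect() can replay the whole chain from the source at any time.
class LineageRDD:
    def __init__(self, compute, parent=None):
        self.compute_fn = compute          # how to derive this node's data
        self.parent = parent               # upstream node in the lineage chain

    def map(self, f):
        return LineageRDD(lambda rows: [f(r) for r in rows], parent=self)

    def filter(self, pred):
        return LineageRDD(lambda rows: [r for r in rows if pred(r)], parent=self)

    def collect(self):
        # Walk lineage back to the source, then replay transformations.
        if self.parent is None:
            return self.compute_fn(None)   # source node reads its input
        return self.compute_fn(self.parent.collect())

source = LineageRDD(lambda _: ["ok\tx\tfine", "error\ty\tdisk failed"])
messages = source.filter(lambda l: "error" in l).map(lambda l: l.split("\t")[2])

# Even if computed results are lost, collect() rebuilds them from lineage.
print(messages.collect())  # ['disk failed']
```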
Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: two sets of points labeled + and -, with a random initial line rotating toward the target line that separates them]
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)                               // initial parameter vector

for (i <- 1 to ITERATIONS) {                           // repeated MapReduce steps
  val gradient = data.map(p =>                         // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
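For readers without a cluster, the same gradient-descent loop can be run in plain Python on a made-up 1-D dataset; the points and iteration count here are illustrative only. Spark's win is that `data` stays cached in memory across iterations instead of being re-read each pass.

```python
# Plain-Python version of the Scala logistic regression loop above.
import math

# (x, y) pairs with labels y in {-1, +1}; linearly separable toy data
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

w = 0.0  # initial parameter (Spark uses a random D-dimensional vector)
for _ in range(100):  # repeated "MapReduce" steps of gradient descent
    gradient = sum(
        (1.0 / (1.0 + math.exp(-y * (w * x))) - 1.0) * y * x
        for x, y in data
    )
    w -= gradient

# w should now classify the training points correctly
assert all((1 if w * x > 0 else -1) == y for x, y in data)
print("Final w:", w)
```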
Logistic Regression Performance
[Chart: running time (min) vs. number of iterations (1-30). Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for each further iteration]
Spark in Java and Python
Java API (out now):

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

PySpark (coming soon):

lines = sc.textFile(...)
lines.filter(lambda s: "error" in s) \
     .count()
User Applications
• In-memory analytics on Hive data (Conviva)
• Interactive queries on data streams (Quantifind)
• Business intelligence (Yahoo!)
• Traffic estimation w/ GPS data (Mobile Millennium)
• DNA sequence analysis (SNAP)
. . .
Conviva GeoReport
• Group aggregations on many keys with same filter
• 40× gain over Hive from avoiding repeated reading, deserialization and filtering
[Chart: total time (hours): Spark 0.5 vs. Hive 20]
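The Conviva pattern can be sketched in plain Python: apply the expensive read/deserialize/filter step once, keep the result in memory, then run many group-by aggregations over it. The records and field names below are hypothetical; in Spark the cached set would be an RDD.

```python
# Sketch: filter once, then reuse the in-memory result for many aggregations.
from collections import defaultdict

records = [
    {"country": "US", "isp": "a", "buffering_ms": 120, "ok": True},
    {"country": "US", "isp": "b", "buffering_ms": 300, "ok": True},
    {"country": "FR", "isp": "a", "buffering_ms": 80,  "ok": False},
    {"country": "FR", "isp": "c", "buffering_ms": 200, "ok": True},
]

# Reading, deserializing, and filtering happens once, not once per query.
cached = [r for r in records if r["ok"]]

def group_sum(rows, key, value):
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

# Many group-by aggregations reuse the same in-memory filtered data.
print(group_sum(cached, "country", "buffering_ms"))  # {'US': 420, 'FR': 200}
print(group_sum(cached, "isp", "buffering_ms"))      # {'a': 120, 'b': 300, 'c': 200}
```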
Shark: SQL on Spark
• Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
• Can we extend Hive to run on Spark?
Hive Architecture
[Diagram: a Client (CLI, JDBC) talks to a Driver containing the SQL Parser, Query Optimizer, Physical Plan, and Execution stages; the Driver consults a Metastore and executes via MapReduce over HDFS]
Shark Architecture
[Diagram: the same components as Hive (Client with CLI/JDBC; Driver with SQL Parser, Query Optimizer, Physical Plan, and Execution; Metastore; HDFS), plus a Cache Manager in the Driver, with execution on Spark instead of MapReduce]
Column-Oriented Storage
• Caching Hive records as Java objects is inefficient; instead, use arrays of primitive types for columns
• Similar size to serialized form, but 5x faster to process
• Columnar compression can further reduce size by 5x

Row Storage:
1 | john  | 4.1
2 | mike  | 3.5
3 | sally | 6.4

Column Storage:
1, 2, 3
john, mike, sally
4.1, 3.5, 6.4
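A rough Python analogue of the row-vs-column layouts above, using the standard `array` module as a stand-in for the JVM's primitive arrays (the 5x speed and compression figures are from the slide, not measured here):

```python
# Row vs. column layout for the small table above.
from array import array

# Row storage: one object per record (like caching Hive rows as Java objects)
rows = [
    {"id": 1, "name": "john",  "score": 4.1},
    {"id": 2, "name": "mike",  "score": 3.5},
    {"id": 3, "name": "sally", "score": 6.4},
]

# Column storage: one packed array of primitives per column
ids    = array("i", [1, 2, 3])
names  = ["john", "mike", "sally"]
scores = array("d", [4.1, 3.5, 6.4])

# A scan over one column touches only that column's packed array,
# with no per-record object overhead.
total = sum(scores)
assert abs(total - sum(r["score"] for r in rows)) < 1e-9
print(round(total, 1))  # 14.0
```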
Other Shark Optimizations
• Dynamic join algorithm selection based on the data
• Runtime selection of # of reducers
• Partition pruning using range statistics
• Controllable table partitioning across nodes
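Partition pruning with range statistics can be sketched as follows; the partition layout and stats format are hypothetical, not Shark's actual implementation. Each partition keeps (min, max) for a column, and a predicate like `pageRank > X` skips any partition whose max cannot satisfy it.

```python
# Sketch: skip partitions whose (min, max) range rules out the predicate.
partitions = [
    {"stats": (0, 40),   "rows": [5, 17, 38]},
    {"stats": (41, 90),  "rows": [44, 70, 88]},
    {"stats": (91, 200), "rows": [95, 150]},
]

def scan_greater_than(parts, x):
    hits, scanned = [], 0
    for p in parts:
        lo, hi = p["stats"]
        if hi <= x:            # no row here can exceed x: prune the partition
            continue
        scanned += 1           # only surviving partitions are actually read
        hits.extend(r for r in p["rows"] if r > x)
    return hits, scanned

hits, scanned = scan_greater_than(partitions, 90)
print(hits, scanned)  # [95, 150] 1 -- only one of three partitions scanned
```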
Using Shark
CREATE TABLE latest_logs TBLPROPERTIES ("memory" = true)
AS SELECT * FROM logs WHERE date > now() - 3600;
Then, just run HiveQL against it!
Shark Results
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
[Chart: selection query runtime on a 0-100 s scale. Shark (in memory): 1.1 s; Shark (disk) and Hive take far longer]
SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X;
Shark Results: Group By
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, 7);

[Chart: group-by runtime on a 0-600 s scale. Shark (in memory): 32 s; Shark (disk) and Hive take far longer]
Shark Results: Join
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue
FROM rankings r, uservisits v
WHERE r.pageURL = v.destURL
  AND v.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY v.sourceIP;

[Chart: join runtime on a 0-1800 s scale. Shark (copartitioned) is fastest, then Shark, then Shark (disk), then Hive]
User Queries
Yahoo!, Conviva report 40-100x speedups over Hive
[Charts: three real user queries on a 1.7 TB Conviva dataset, 100 m2.4xlarge nodes. Shark (in memory) finishes in about 0.8, 0.7, and 1.0 s respectively; Shark (disk) and Hive take up to 70-100x longer]
Getting Started
• Spark and Shark both have scripts for launching on EC2
• Work with data in HDFS, HBase, S3, and existing Hive warehouses and metastores
• Local execution mode for testing
spark-project.org amplab.cs.berkeley.edu
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.