Transforming Big Data with Spark and Shark - AWS re:Invent 2012 BDT305
Transforming Big Data with Spark and Shark
Michael Franklin and Matei Zaharia – UC Berkeley
Sources Driving Big Data
It's All Happening Online
• Every: click, ad impression, billing event, fast forward/pause, friend request, transaction, network message, fault, …
• User generated (web, social & mobile)
• Internet of Things / M2M
• Scientific computing
Big Data: The Challenges
• Volume: terabytes to petabytes+
• Variety: structured and unstructured
• Velocity: batch and real-time
Our view: more data should mean better answers
• Must deal with vertical and horizontal growth
• Must balance cost, time, and answer quality
AMP Expedition
Resources for Making Sense at Scale
• Algorithms: machine learning and analytics
• Machines: cloud computing
• People: crowdsourcing & human computation
…applied to massive and diverse data.
The AMPLab Big Bets
• New "Big Data" stacks are limited by traditional intellectual borders
• Need machine learning / systems / database co-design
• Requires cohabitation and real collaboration
• Opportunity to rethink fundamental design points:
  • Low latency
  • Variable consistency
  • Cloud-based elastic resources
  • Desire for new solutions in the marketplace
• Consider the role of people throughout the entire analytics lifecycle
AMPLab Facts
An integration of faculty interests (* = directors), organized for collaboration:
• Alex Bayen (mobile sensing), Anthony Joseph (security/privacy)
• Ken Goldberg (crowdsourcing), Randy Katz (systems)
• *Michael Franklin (databases), Dave Patterson (systems)
• Armando Fox (systems), *Ion Stoica (systems)
• *Mike Jordan (machine learning), Scott Shenker (networking)
+ ~50 amazing students, post-docs, staff & visitors
AMP Facts (continued)
• Launched February 2011; 6-year duration
• Strong industry and government support: NSF Expedition and DARPA XData
• BDAS stack components released as BSD/Apache open source (e.g., Spark, Shark, Mesos)
App: Carat - Detection of Smartphone Energy Bugs
• > 450,000 downloads
App: Cancer Tumor Genomics
• Vision: personalized therapy
"…10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information." (Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11)
• UCSF cancer researchers + UCSC cancer genetic database + AMPLab + Intel
• Cluster @ TCGA: 5 PB = 20 cancers x 1000 genomes
• Sequencing costs down ~150x → Big Data
David Patterson, "Computer Scientists May Have What It Takes to Help Cure Cancer," New York Times, 12/5/2011
[Chart: cost ($K) per genome, log scale from $100,000 down to $0.1, 2001-2014]
• See Dave Patterson’s Talk: Thursday 3-4, BDT205
BDAS: The Berkeley Data Analytics System
[Stack diagram, top to bottom:]
• MLBase (declarative machine learning)
• BlinkDB (approximate query processing)
• Shark (SQL) + Streaming
• Spark, alongside 3rd-party engines (Hadoop MR, MPI, GraphLab, etc.)
• Shared RDDs (distributed memory)
• Mesos (cluster resource manager)
• HDFS (3rd party)
Legend: AMPLab released (Spark, Shark, Mesos); AMPLab in progress (Streaming, Shared RDDs, MLBase, BlinkDB); 3rd party (HDFS, Hadoop MR, MPI, GraphLab)
BDAS: Where We're Going
• Time, cost, and quality tradeoffs using sampling and the "Bag of Little Bootstraps"
• Refactoring the distributed memory layer for sharing
• Low-latency (real-time) processing via discretized streams
• Graph processing and asynchronous computation
• Declarative machine learning libraries that utilize these interfaces for scalability
• A "logical plan" level to serve as the narrow waist for these and future components
• Integration of the "people" component (e.g., CrowdDB)
For More Information
amplab.cs.berkeley.edu
• Papers and project pages
• News updates and blogs
• Spark User Group and Meetup
• GitHub and Apache Mesos
Deep Dive: Spark and Shark
What is Spark?
• Fast, MapReduce-like engine: "lightning-fast cluster computing"
• In-memory storage for very fast iterative queries
• General execution graphs
• Up to 100x faster than Hadoop (2-10x even for on-disk data)
• Compatible with Hadoop's storage APIs: can access HDFS, HBase, S3, SequenceFiles, etc.
What is Shark?
• Port of Apache Hive to run on Spark
• Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
• Can be more than 100x faster than Hive
Project History
• Spark started in 2009, open sourced 2010
• Shark released spring 2012
• In use at Yahoo!, Klout, Airbnb, Foursquare, Conviva, Quantifind & others
• 400+ member meetup, 20+ developers
Spark
• Language-integrated API in Scala, Java and soon Python
• Can be used interactively from Scala and Python shells
• Lets users manipulate distributed collections (“resilient distributed datasets”, or RDDs) with parallel operations
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers; each worker reads an HDFS block (Block 1-3), builds an in-memory cache (Cache 1-3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec(vs 170 sec for on-disk data)
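The pipeline above can be mimicked locally with ordinary Python collections. The sample log lines and variable names below are made up for illustration; real Spark would run the filter/map steps across workers and pin cachedMsgs in cluster memory.

```python
# Local Python analogue of the Scala log-mining pipeline above.
log_lines = [
    "INFO\tserver\tstartup complete",
    "ERROR\tdisk\tfoo failure on /dev/sda",
    "ERROR\tnet\tbar timeout on eth0",
    "ERROR\tdisk\tfoo failure on /dev/sdb",
]

# filter(_.startsWith("ERROR"))
errors = [l for l in log_lines if l.startswith("ERROR")]

# map(_.split('\t')(2)) -- keep only the message field
messages = [l.split("\t")[2] for l in errors]

# cache() -- here just a list held in memory; Spark pins partitions on workers
cached_msgs = messages

print(sum("foo" in m for m in cached_msgs))  # count of "foo" messages: 2
print(sum("bar" in m for m in cached_msgs))  # count of "bar" messages: 1
```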
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
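A toy sketch of the lineage idea, with illustrative class and method names (not Spark's actual internals): each dataset records its parent and the transformation applied, so lost results can always be recomputed from the source.

```python
# Minimal lineage sketch: a node remembers its parent and its transformation,
# so collect() can replay the whole chain from the source at any time.
class LineageRDD:
    def __init__(self, compute, parent=None):
        self.compute_fn = compute          # how to derive this node's data
        self.parent = parent               # upstream node in the lineage chain

    def map(self, f):
        return LineageRDD(lambda rows: [f(r) for r in rows], parent=self)

    def filter(self, pred):
        return LineageRDD(lambda rows: [r for r in rows if pred(r)], parent=self)

    def collect(self):
        # Walk lineage back to the source, then replay transformations.
        if self.parent is None:
            return self.compute_fn(None)   # source node reads its input
        return self.compute_fn(self.parent.collect())

source = LineageRDD(lambda _: ["ok\tx\tfine", "error\ty\tdisk failed"])
messages = source.filter(lambda l: "error" in l).map(lambda l: l.split("\t")[2])

# Even if computed results are lost, collect() rebuilds them from lineage.
print(messages.collect())  # ['disk failed']
```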
Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: two sets of points labeled + and -, with a random initial line rotating toward the target line that separates them]
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)                               // initial parameter vector

for (i <- 1 to ITERATIONS) {                           // repeated MapReduce steps
  val gradient = data.map(p =>                         // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
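For readers without a cluster, the same gradient-descent loop can be run in plain Python on a made-up 1-D dataset; the points and iteration count here are illustrative only. Spark's win is that `data` stays cached in memory across iterations instead of being re-read each pass.

```python
# Plain-Python version of the Scala logistic regression loop above.
import math

# (x, y) pairs with labels y in {-1, +1}; linearly separable toy data
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

w = 0.0  # initial parameter (Spark uses a random D-dimensional vector)
for _ in range(100):  # repeated "MapReduce" steps of gradient descent
    gradient = sum(
        (1.0 / (1.0 + math.exp(-y * (w * x))) - 1.0) * y * x
        for x, y in data
    )
    w -= gradient

# w should now classify the training points correctly
assert all((1 if w * x > 0 else -1) == y for x, y in data)
print("Final w:", w)
```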
Logistic Regression Performance
[Chart: running time (min) vs. number of iterations (1-30). Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for each further iteration]
Spark in Java and Python
Java API (out now):

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

PySpark (coming soon):

lines = sc.textFile(...)
lines.filter(lambda s: "error" in s) \
     .count()
User Applications
• In-memory analytics on Hive data (Conviva)
• Interactive queries on data streams (Quantifind)
• Business intelligence (Yahoo!)
• Traffic estimation w/ GPS data (Mobile Millennium)
• DNA sequence analysis (SNAP)
. . .
Conviva GeoReport
• Group aggregations on many keys with same filter
• 40× gain over Hive from avoiding repeated reading, deserialization and filtering
[Chart: total time (hours): Spark 0.5 vs. Hive 20]
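The Conviva pattern can be sketched in plain Python: apply the expensive read/deserialize/filter step once, keep the result in memory, then run many group-by aggregations over it. The records and field names below are hypothetical; in Spark the cached set would be an RDD.

```python
# Sketch: filter once, then reuse the in-memory result for many aggregations.
from collections import defaultdict

records = [
    {"country": "US", "isp": "a", "buffering_ms": 120, "ok": True},
    {"country": "US", "isp": "b", "buffering_ms": 300, "ok": True},
    {"country": "FR", "isp": "a", "buffering_ms": 80,  "ok": False},
    {"country": "FR", "isp": "c", "buffering_ms": 200, "ok": True},
]

# Reading, deserializing, and filtering happens once, not once per query.
cached = [r for r in records if r["ok"]]

def group_sum(rows, key, value):
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

# Many group-by aggregations reuse the same in-memory filtered data.
print(group_sum(cached, "country", "buffering_ms"))  # {'US': 420, 'FR': 200}
print(group_sum(cached, "isp", "buffering_ms"))      # {'a': 120, 'b': 300, 'c': 200}
```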
Shark: SQL on Spark
• Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
• Can we extend Hive to run on Spark?
Hive Architecture
[Diagram: a Client (CLI, JDBC) talks to a Driver containing the SQL Parser, Query Optimizer, Physical Plan, and Execution stages; the Driver consults a Metastore and executes via MapReduce over HDFS]
Shark Architecture
[Diagram: the same components as Hive (Client with CLI/JDBC; Driver with SQL Parser, Query Optimizer, Physical Plan, and Execution; Metastore; HDFS), plus a Cache Manager in the Driver, with execution on Spark instead of MapReduce]
Column-Oriented Storage
• Caching Hive records as Java objects is inefficient; instead, use arrays of primitive types for columns
• Similar size to serialized form, but 5x faster to process
• Columnar compression can further reduce size by 5x

Row Storage:
1 | john  | 4.1
2 | mike  | 3.5
3 | sally | 6.4

Column Storage:
1, 2, 3
john, mike, sally
4.1, 3.5, 6.4
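A rough Python analogue of the row-vs-column layouts above, using the standard `array` module as a stand-in for the JVM's primitive arrays (the 5x speed and compression figures are from the slide, not measured here):

```python
# Row vs. column layout for the small table above.
from array import array

# Row storage: one object per record (like caching Hive rows as Java objects)
rows = [
    {"id": 1, "name": "john",  "score": 4.1},
    {"id": 2, "name": "mike",  "score": 3.5},
    {"id": 3, "name": "sally", "score": 6.4},
]

# Column storage: one packed array of primitives per column
ids    = array("i", [1, 2, 3])
names  = ["john", "mike", "sally"]
scores = array("d", [4.1, 3.5, 6.4])

# A scan over one column touches only that column's packed array,
# with no per-record object overhead.
total = sum(scores)
assert abs(total - sum(r["score"] for r in rows)) < 1e-9
print(round(total, 1))  # 14.0
```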
Other Shark Optimizations
• Dynamic join algorithm selection based on the data
• Runtime selection of # of reducers
• Partition pruning using range statistics
• Controllable table partitioning across nodes
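Partition pruning with range statistics can be sketched as follows; the partition layout and stats format are hypothetical, not Shark's actual implementation. Each partition keeps (min, max) for a column, and a predicate like `pageRank > X` skips any partition whose max cannot satisfy it.

```python
# Sketch: skip partitions whose (min, max) range rules out the predicate.
partitions = [
    {"stats": (0, 40),   "rows": [5, 17, 38]},
    {"stats": (41, 90),  "rows": [44, 70, 88]},
    {"stats": (91, 200), "rows": [95, 150]},
]

def scan_greater_than(parts, x):
    hits, scanned = [], 0
    for p in parts:
        lo, hi = p["stats"]
        if hi <= x:            # no row here can exceed x: prune the partition
            continue
        scanned += 1           # only surviving partitions are actually read
        hits.extend(r for r in p["rows"] if r > x)
    return hits, scanned

hits, scanned = scan_greater_than(partitions, 90)
print(hits, scanned)  # [95, 150] 1 -- only one of three partitions scanned
```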
Using Shark
CREATE TABLE latest_logs TBLPROPERTIES ("memory" = true)
AS SELECT * FROM logs WHERE date > now() - 3600;
Then, just run HiveQL against it!
Shark Results
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
[Chart: selection query runtime on a 0-100 s scale. Shark (in memory): 1.1 s; Shark (disk) and Hive take far longer]
SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X;
Shark Results: Group By
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, 7);

[Chart: group-by runtime on a 0-600 s scale. Shark (in memory): 32 s; Shark (disk) and Hive take far longer]
Shark Results: Join
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue
FROM rankings r, uservisits v
WHERE r.pageURL = v.destURL
  AND v.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY v.sourceIP;

[Chart: join runtime on a 0-1800 s scale. Shark (copartitioned) is fastest, then Shark, then Shark (disk), then Hive]
User Queries
Yahoo!, Conviva report 40-100x speedups over Hive
[Charts: three real user queries on a 1.7 TB Conviva dataset, 100 m2.4xlarge nodes. Shark (in memory) finishes in about 0.8, 0.7, and 1.0 s respectively; Shark (disk) and Hive take up to 70-100x longer]
Getting Started
• Spark and Shark both have scripts for launching on EC2
• Work with data in HDFS, HBase, S3, and existing Hive warehouses and metastores
• Local execution mode for testing
spark-project.org amplab.cs.berkeley.edu
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.