cassandra day 2014: interactive analytics with cassandra and spark

46
Interactive Analytics With Spark And Cassandra Evan Chan Ooyala, Inc. April 7th, 2014 Monday, April 7, 14

Upload: evan-chan

Post on 19-Aug-2014

1.703 views

Category:

Engineering


5 download

DESCRIPTION

Take your analytics to the next level by using Apache Spark to accelerate complex interactive analytics using your Apache Cassandra data. Includes an introduction to Spark as well as how to read Cassandra tables in Spark.

TRANSCRIPT

Page 1: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Interactive Analytics With Spark And Cassandra

Evan ChanOoyala, Inc.

April 7th, 2014

Monday, April 7, 14

Page 2: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

• Staff Engineer, Compute and Data Services, Ooyala

• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.

• Scala/Akka guy

• Very excited by open source, big data projects

• @evanfchan

Who is this guy?

2

Monday, April 7, 14

Page 3: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

• Cassandra at Ooyala

• What problem are we trying to solve?

• Spark and Shark

• Integrating Cassandra and Spark

• Our Spark/Cassandra Architecture

Agenda

3

Monday, April 7, 14

Page 4: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

CASSANDRA AT OOYALA

4

Monday, April 7, 14

Page 5: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

OOYALAPowering personalized video

experiences across all screens.

Monday, April 7, 14

Page 6: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

CONFIDENTIAL—DO NOT DISTRIBUTE 6CONFIDENTIAL—DO NOT DISTRIBUTE

Founded in 2007

Commercially launch in 2009

230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara

Global footprint, 200M unique users,110+ countries, and more than 6,000 websites

Over 1 billion videos played per month and 2 billion analytic events per day

25% of U.S. online viewers watch video powered by Ooyala

COMPANY OVERVIEW

Monday, April 7, 14

Page 7: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

CONFIDENTIAL—DO NOT DISTRIBUTE 7

TRUSTED VIDEO PARTNER

STRATEGIC PARTNERS

CUSTOMERS

CONFIDENTIAL—DO NOT DISTRIBUTE

Monday, April 7, 14

Page 8: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

TITLE TEXT GOES HERE

• 12 clusters ranging in size from 3 to 107 nodes

• Total of 28TB of data managed over ~220 nodes

• Powers all of our analytics infrastructure• Traditional analytics aggregations• Recommendations and trends

• DSE/C* 1.0.x, 1.1.x, 1.2.6

We are a large Cassandra user

8

Monday, April 7, 14

Page 9: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

TITLE TEXT GOES HERE

• Started investing in Spark beginning of 2013

• 2 teams of developers doing stuff with Spark

• Actively contributing to Spark developer community

• Deploying Spark to a large (>100 node) production cluster

• Spark community very active, huge amount of interest

Becoming a big Spark user...

9

Monday, April 7, 14

Page 10: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

WHAT PROBLEM ARE WE TRYING TO SOLVE?

10

Monday, April 7, 14

Page 11: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

From mountains of raw data...

Monday, April 7, 14

Page 12: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

• Quickly

• Painlessly

• At scale?

To nuggets of truth...

Monday, April 7, 14

Page 13: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Today: Precomputed Aggregates

• Video metrics computed along several high cardinality dimensions

• Very fast lookups, but inflexible, and hard to change

• Most computed aggregates are never read

• What if we need more dynamic queries?

• Top content for mobile users in France

• Engagement curves for users who watched recommendations

• Data mining, trends, machine learning

Monday, April 7, 14

Page 14: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

The Static - Dynamic Continuum

• Super fast lookups

• Inflexible, wasteful

• Best for 80% most common queries

• Always compute results from raw data

• Flexible but slow

100% Precomputation 100% Dynamic

Monday, April 7, 14

Page 15: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Where We Want To Be

Partly dynamic

• Pre-aggregate most common queries

• Flexible, fast dynamic queries

• Easily generate many materialized

views

Monday, April 7, 14

Page 16: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

WHY SPARK?

16

Monday, April 7, 14

Page 17: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Introduction To Spark• In-memory distributed computing framework• Created by UC Berkeley AMP Lab in 2010• Targeted problems that MR is bad at:– Iterative algorithms (machine learning)– Interactive data mining

• More general purpose than Hadoop MR• Top level Apache project• Active contributions from Intel, Yahoo, lots of

companies -- huge momentum

Monday, April 7, 14

Page 18: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Spark Vs Hadoop

HDFS

Map

Reduce

Map

Reduce

Data

map()

join()

Source 2

cache()

transform

Monday, April 7, 14

Page 19: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Throughput: Memory Is King

0 37500 75000 112500 150000

C*, cold cache

C*, warm cache

Spark RDD

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Spark cached RDD 10-50x faster than raw Cassandra

Monday, April 7, 14

Page 20: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Developers Love It

• “I wrote my first aggregation job in 30 minutes”

• High level “distributed collections” API

• No Hadoop cruft

• Full power of Scala, Java, Python

• Interactive REPL shell

Monday, April 7, 14

Page 21: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Spark Vs Hadoop Word Count

file = spark.textFile("hdfs://...") file.flatMap(line => line.split(" "))    .map(word => (word, 1))    .reduceByKey(_ + _)

1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 }

Monday, April 7, 14

Page 22: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

One Platform To Rule Them All

HIVE on Spark

Spark Streaming - discretized stream

processing

• SQL, Graph, ML, Streaming all in one framework• Much higher code sharing/reuse• Easy integration between components• Fewer platforms == lower TCO• Integration with Mesos, YARN helps share resources

Monday, April 7, 14

Page 23: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Shark - HIVE On Spark

• 100% HiveQL compatible• 10-100x faster than HIVE, answers in seconds• Reuse UDFs, SerDe’s, StorageHandlers• Can use DSE / CassandraFS for Metastore

Monday, April 7, 14

Page 24: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

INTEGRATING CASSANDRA & SPARK

24

Monday, April 7, 14

Page 25: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Our Spark/Shark/Cassandra Stack

Node1

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node2

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node3

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Spark Master Job Server

Monday, April 7, 14

Page 26: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

OPTIONS FOR READING FROM C*• Hadoop InputFormat

– ColumnFamilyInputFormat - reads all rows from 1 CF

– CqlPagingInputFormat, etc. - CQL3, 2-dary indexes– Roll your own (join multiple CFs, etc)

• Spark native RDD– sc.parallelize(rowkeys).flatMap(readColumns(_))

– JdbcRdd + Cassandra JDBC driver

Monday, April 7, 14

Page 27: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

ColumnFamilyInputFormat

video type

Record1 10 1

Record2 11 5

id Video TypeRecord1 10 1

Record2 11 5

• Must read from all rows• One CF only, not very flexible

Monday, April 7, 14

Page 28: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Node 2Node 1

Spark RDD• RDD = Resilient Distributed Dataset• Multiple partitions living on different nodes• Each partition has records

Partition 1 Partition 2 Partition 3 Partition 4

Monday, April 7, 14

Page 29: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Inputformat Vs RDD

InputFormat RDDSupports Hadoop, HIVE, Spark, Shark

Spark / Shark only

Have to implement multiple classes - InputFormat, RecordReader, Writeable, etc. Clunky API.

One class - simple API.

Two APIs, and often need to implement both (HIVE needs older...)

Just one API.

• You can easily use InputFormats in Spark using newAPIHadoopRDD().• Writing a custom RDD could have saved us lots of time.

Monday, April 7, 14

Page 30: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Node 2Node 1

Skipping The InputFormat

Row 1 data Row 2 data Row 3 data Row 4 data

Rowkey1 Rowkey2 Rowkey3 Rowkey4

sc.parallelize(rowkeys).flatMap(readColumns(_))

Monday, April 7, 14

Page 31: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

OUR CASSANDRA / SPARK ARCHITECTURE

31

Monday, April 7, 14

Page 32: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

From Raw Logs To Fast Queries

ProcessC*

columnar storage

Raw Logs

Raw Logs

Raw LogsSpark

Spark

Spark

OLAP Table 1

OLAP Table 2

OLAP Table 3

Spark

Shark

Predefined queries

Ad-hoc HiveQL

Monday, April 7, 14

Page 33: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Why Cassandra Alone Isn’t Enough• Over 30 million multi-dimensional fact table rows per day• Materializing every possible answer isn’t close to possible• Multi dimensional filtering and grouping alone leads to many billions of possible answers

• Querying fact tables in Cassandra is too slow• Reading millions or billion random rows• CQL doesn’t support grouping

Monday, April 7, 14

Page 34: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Our Approach

• Use Cassandra to store the raw fact tables• Optimize the schema for OLAP workloads• Fast full table reads• Easily read fewer columns

• Use Spark for fast random row access and fast distributed computation

Monday, April 7, 14

Page 35: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

uuid-part-0

uuid-part-1

2013-04-05T00:00Z#id1

Section 0 Section 1 Section 2

country rows 0-9999 rows 10000-19999 rows ....

city rows 0-9999 rows 10000-19999 rows ....

Index CF

Columns CF

An OLAP Schema for Cassandra

Metadata

2013-04-05T00:00-part-0

{columns:[“country”, “city”, “plays” ]}

Metadata CF

•Optimized for: selective column loading, maximum throughput and compression

Monday, April 7, 14

Page 36: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

OLAP Workflow

DatasetAggregation Job

Query JobSpark

Executors

Cassandra

REST Job Server

Query Job

Aggregate Query

Result

Query

Result

Monday, April 7, 14

Page 37: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Querying Data In Spark

Node 2Node 1

Partition 1

record1record2record3

Partition 2

record4record5record6

Partition 3

record7record8record9

Partition 4

record10record11record12

Convert to Spark SQL / Shark Table

Shark / Spark SQL

rdd.group / .map / .sort / .join etc

rdd.reduce / .collect / .top

Monday, April 7, 14

Page 38: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Fault Tolerance

• Cached dataset lives in Java Heap only - what if process dies?

• Spark lineage - automatic recomputation from source, but this is expensive!

• Can also replicate cached dataset to survive single node failures

• Persist materialized views back to C*, then load into cache -- now recovery path is much faster

Monday, April 7, 14

Page 39: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

DEMO

39

Monday, April 7, 14

Page 40: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Creating a Shark Table

Monday, April 7, 14

Page 41: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Creating a Cached Table

Monday, April 7, 14

Page 42: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Querying a Cached Table

Monday, April 7, 14

Page 43: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

THANK YOUAnd YES, We’re HIRING!!

ooyala.com/careers

@evanfchan

Monday, April 7, 14

Page 44: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Industry Trends

• Fast execution frameworks

• Impala

• In-memory databases

• VoltDB, Druid

• Streaming and real-time

• Higher-level, productive data frameworks

Monday, April 7, 14

Page 45: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

PERFORMANCE #’S

Spark: C* -> OLAP aggregatescold cache, 1.4 million events

130 seconds

C* -> OLAP aggregateswarmed cache

20-30 seconds

OLAP aggregate query via Spark(56k records)

60 ms

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Monday, April 7, 14

Page 46: Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

EXAMPLE: OLAP PROCESSING

t02013-04-05T00:00Z#id1

{video: 10, type:5}

2013-04-05T00:00Z#i

{video: 20,

C* events

OLAP Aggregates

OLAP Aggregates

OLAP Aggregates

Cached Materialized Views

Spark

Spark

Spark

Union

Query 1: Plays by Provider

Query 2: Top content for mobile

Monday, April 7, 14