new analytics toolbox devnexus 2015

131
The New Analytics Toolbox Going beyond Hadoop Robbie Strickland DevNexus 2015

Upload: robbie-strickland

Post on 31-Jul-2015

298 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: New Analytics Toolbox DevNexus 2015

The New Analytics Toolbox

Going beyond Hadoop

Robbie StricklandDevNexus 2015

Page 2: New Analytics Toolbox DevNexus 2015

whoami

Robbie StricklandDirector, Software Engineeringlinkedin.com/in/robbiestrickland@[email protected]

Page 3: New Analytics Toolbox DevNexus 2015

whoami● Contributor: core Cassandra,

Java driver, Spark driver, Hive driver, other stuff

● DataStax MVP● User since 2010 (0.5 release)● Author, Cassandra High

Availability● Founder, ATL Cassandra Users

Page 4: New Analytics Toolbox DevNexus 2015

Thanks to ...

Helena EdelsonDataStax Engineering@helenaedelson

Page 5: New Analytics Toolbox DevNexus 2015

Weather @ scale

● ~10 billion API requests per day● 4 AWS regions● Over 100 million app users● Lots of data to analyze!

Page 6: New Analytics Toolbox DevNexus 2015

Agenda● Tool landscape● Spark vs. Hadoop● Spark overview● Spark + Cassandra● A typical Spark application● Spark SQL● Getting set up● Demo● Spark Streaming● Task distribution● Language comparison● Questions

Page 7: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

Page 8: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)

Page 9: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient

Page 10: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk

Page 11: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks

Page 12: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks

Page 13: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate

Page 14: New Analytics Toolbox DevNexus 2015

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate● Configuration isn’t for the faint of heart

Page 15: New Analytics Toolbox DevNexus 2015

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Page 16: New Analytics Toolbox DevNexus 2015

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

MPP queries for Hadoop data

Page 17: New Analytics Toolbox DevNexus 2015

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Varies by vendor

Page 18: New Analytics Toolbox DevNexus 2015

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Generic in-memory analysis

Page 19: New Analytics Toolbox DevNexus 2015

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark Hive queries on Spark, replaced by Spark SQL

Page 20: New Analytics Toolbox DevNexus 2015

Spark wins?

Page 21: New Analytics Toolbox DevNexus 2015

Spark wins?

● Can be a Hadoop replacement

Page 22: New Analytics Toolbox DevNexus 2015

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat

Page 23: New Analytics Toolbox DevNexus 2015

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop

Page 24: New Analytics Toolbox DevNexus 2015

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis

Page 25: New Analytics Toolbox DevNexus 2015

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model

Page 26: New Analytics Toolbox DevNexus 2015

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model● Direct Cassandra integration

Page 27: New Analytics Toolbox DevNexus 2015

Spark vs. Hadoop

Page 28: New Analytics Toolbox DevNexus 2015

MapReduce:

Page 29: New Analytics Toolbox DevNexus 2015

MapReduce:

Page 30: New Analytics Toolbox DevNexus 2015

MapReduce:

Page 31: New Analytics Toolbox DevNexus 2015

MapReduce:

Page 32: New Analytics Toolbox DevNexus 2015

MapReduce:

Page 33: New Analytics Toolbox DevNexus 2015

MapReduce:

Page 34: New Analytics Toolbox DevNexus 2015

Spark (angels sing!):

Page 35: New Analytics Toolbox DevNexus 2015

What is Spark?

Page 36: New Analytics Toolbox DevNexus 2015

What is Spark?

● In-memory cluster computing

Page 37: New Analytics Toolbox DevNexus 2015

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce

Page 38: New Analytics Toolbox DevNexus 2015

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets

Page 39: New Analytics Toolbox DevNexus 2015

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java

Page 40: New Analytics Toolbox DevNexus 2015

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing

Page 41: New Analytics Toolbox DevNexus 2015

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing● Supports any existing Hadoop input / output

format

Page 42: New Analytics Toolbox DevNexus 2015

What is Spark?

● Native graph processing via GraphX

Page 43: New Analytics Toolbox DevNexus 2015

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib

Page 44: New Analytics Toolbox DevNexus 2015

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL

Page 45: New Analytics Toolbox DevNexus 2015

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR

Page 46: New Analytics Toolbox DevNexus 2015

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR● Easily join datasets from disparate sources

Page 47: New Analytics Toolbox DevNexus 2015

Spark Components

Spark Core

Spark Streaming

real-time

Spark SQL

structured queries

MLlib

machine learning

GraphX

graph processing

Page 48: New Analytics Toolbox DevNexus 2015

Deployment Options

3 cluster manager choices:● Standalone - included with Spark & easy to

set up● Mesos - generic cluster manager that can

also handle MapReduce● YARN - Hadoop 2 resource manager

Page 49: New Analytics Toolbox DevNexus 2015

Spark Word Count

val file = sc.textFile(“hdfs://…”)val counts = file.flatMap(line => line.split(“ “))

.map(word => (word, 1)) .reduceByKey(_ + _)

counts.saveAsTextFile(“hdfs://…”)

Page 50: New Analytics Toolbox DevNexus 2015

Cassandra

Page 51: New Analytics Toolbox DevNexus 2015

Why Cassandra?It’s fast:

● No locks● Tunable consistency● Sequential R/W

Page 52: New Analytics Toolbox DevNexus 2015

Why Cassandra?It scales (linearly):

● Peer-to-peer (decentralized)● DHT● Read/write to any node● Largest cluster = 75,000 nodes!

Page 53: New Analytics Toolbox DevNexus 2015

Why Cassandra?It’s fault tolerant:

● Automatic replication● Masterless (i.e. no SPOF)● Failed nodes replaced with ease● Multi data center

Page 54: New Analytics Toolbox DevNexus 2015

Why Cassandra?It’s perfect for analysis:

● Unstructured & semi-structured data● Partition aware● Multi-DC replication● Sharding is automatic● Natural time-series support

Page 55: New Analytics Toolbox DevNexus 2015

Why Cassandra?It’s easy to use:

● Familiar CQL syntax● Light administrative burden● Simple configuration

Page 56: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

Page 57: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

Page 58: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft

Page 59: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)

Page 60: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware

Page 61: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware● Uses HDFS, CassandraFS, or other

distributed FS for checkpointing

Page 62: New Analytics Toolbox DevNexus 2015

Spark on Cassandra

Spark Core

Spark Streaming

real-time

Spark SQL

structured queries

MLlib

machine learning

GraphX

graph processing

Cassandra

Page 63: New Analytics Toolbox DevNexus 2015

Cassandra with Hadoop

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

NameNode2ndaryNN

JobTracker

Page 64: New Analytics Toolbox DevNexus 2015

Cassandra with Spark (using HDFS)

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

Spark MasterNameNode2ndaryNN

Page 65: New Analytics Toolbox DevNexus 2015

Online analytics

C*Spark

C*Spark

C*Spark

C*

C*C*

Real-timereplication

Operational data center

Analyticsdata center

Page 66: New Analytics Toolbox DevNexus 2015

A Typical Spark Application

Page 67: New Analytics Toolbox DevNexus 2015

A Typical Spark Application● SparkContext + SparkConf

Page 68: New Analytics Toolbox DevNexus 2015

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

Page 69: New Analytics Toolbox DevNexus 2015

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

Page 70: New Analytics Toolbox DevNexus 2015

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

● Saving/Displaying

Page 71: New Analytics Toolbox DevNexus 2015

Resilient Distributed Dataset (RDD)

Page 72: New Analytics Toolbox DevNexus 2015

Resilient Distributed Dataset (RDD)● A distributed collection of items

Page 73: New Analytics Toolbox DevNexus 2015

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

Page 74: New Analytics Toolbox DevNexus 2015

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in Scala collections

○ Lazily processed

● Can recalculate from any point of failure

Page 75: New Analytics Toolbox DevNexus 2015

RDD Transformations vs Actions

Transformations: Produce new RDDs

Actions: Require the materialization of the records to produce a value

Page 76: New Analytics Toolbox DevNexus 2015

RDD Transformations/Actions

Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...

Actions: collect:Array[T], count, fold, reduce ...

Page 77: New Analytics Toolbox DevNexus 2015

Resilient Distributed Dataset (RDD)

val numOfAdults = persons.filter(_.age > 17).count()

Transformation Action

Page 78: New Analytics Toolbox DevNexus 2015

Example

Page 79: New Analytics Toolbox DevNexus 2015

Spark SQL

● Provides SQL access to structured data○ Existing RDDs○ Hive warehouses (uses existing metastore, SerDes

and UDFs)○ JDBC/ODBC - use existing BI tools to query large

datasets

Page 80: New Analytics Toolbox DevNexus 2015

Spark SQL RDD Example

Page 81: New Analytics Toolbox DevNexus 2015

Getting set up

● Download Spark 1.2.x● Download Cassandra 2.1.x● Add the spark-cassandra-connector to

your project

"com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.2.0-alpha3"

Page 82: New Analytics Toolbox DevNexus 2015

Running applications./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar

Page 83: New Analytics Toolbox DevNexus 2015

Running applications./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://192.168.1.1:7077 \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar

Page 84: New Analytics Toolbox DevNexus 2015

Demo

Page 85: New Analytics Toolbox DevNexus 2015

Spark Streaming

Page 86: New Analytics Toolbox DevNexus 2015

Spark Streaming

Almost any source

Almost any destination

Page 87: New Analytics Toolbox DevNexus 2015

Spark Streaming

● Creates RDDs from stream source on a defined interval

Page 88: New Analytics Toolbox DevNexus 2015

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs

Page 89: New Analytics Toolbox DevNexus 2015

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs● Supports a variety of sources

Page 90: New Analytics Toolbox DevNexus 2015

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs● Supports a variety of sources● Exactly once message guarantee

Page 91: New Analytics Toolbox DevNexus 2015

Spark Streaming - Use Cases

Applications Sensors Web Mobile

Intrusion detection Malfunction detection Site analytics Network analytics

Fraud detection Dynamic process optimization Recommendations Location based

advertising

Log processing Supply chain planning

Sentiment analysis ...

Page 92: New Analytics Toolbox DevNexus 2015

Spark Streaming

Input stream Batches of input data

Batches of processed data

Page 93: New Analytics Toolbox DevNexus 2015

Spark Streaming

Micro batch (RDD) Micro batch (RDD) Micro batch (RDD) Micro batch (RDD)

● DStream = continuous sequence of micro batches● Each batch is an RDD● Interval is configurable

DStream

Page 94: New Analytics Toolbox DevNexus 2015

Spark Streaming

Page 95: New Analytics Toolbox DevNexus 2015

The Output

Page 96: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

Page 97: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

Transformations and Actions:

Similar to Scala but …

Your choice of transformations need be made with task distribution and memory in mind

Page 98: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2x number of cores)

Page 99: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2x number of cores)

● Filter early and often (Queries, Filters, Distinct...)

Page 100: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2x number of cores)

● Filter early and often (Queries, Filters, Distinct...)● Use pipelines

Page 101: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

How do we do that?

● Partial Aggregation

Page 102: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

How do we do that?

● Partial Aggregation● Create an algorithm to be as simple/efficient as it can

be to appropriately answer the question

Page 103: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

Page 104: New Analytics Toolbox DevNexus 2015

Real World Task Distribution

Some Common Costly Transformations:

● sorting● groupByKey● reduceByKey● ...

Page 105: New Analytics Toolbox DevNexus 2015

Partitioning Example

User event log:● Time-series events● Tracks user interactions with system over

time● Location check-ins, page/module views,

profile changes, ad impressions/clicks, etc.

Page 106: New Analytics Toolbox DevNexus 2015

Partitioning ExampleCREATE TABLE Event ( user_id text, timestamp int, event_type text, event_data map<text, text>, PRIMARY KEY (user_id, timestamp, event_type));

Page 107: New Analytics Toolbox DevNexus 2015

Partitioning ExampleCREATE TABLE Event ( user_id text, timestamp int, event_type text, event_data map<text, text>, PRIMARY KEY (user_id, timestamp, event_type));

partition key

Page 108: New Analytics Toolbox DevNexus 2015

Partitioning ExampleCREATE TABLE Event ( user_id text, timestamp int, event_type text, event_data map<text, text>, PRIMARY KEY (user_id, timestamp, event_type));

clustering columns

Page 109: New Analytics Toolbox DevNexus 2015

Partitioning Example

Potential analysis:● Location graph● Individual usage habits● Page/module view counts

Page 110: New Analytics Toolbox DevNexus 2015

Partitioning Example

Potential analysis:● Location graph● Individual usage habits● Page/module view counts

Grouped by:user_id

Page 111: New Analytics Toolbox DevNexus 2015

Partitioning Example

Potential analysis:● Location graph● Individual usage habits● Page/module view counts

Grouped by:user_iduser_id

Page 112: New Analytics Toolbox DevNexus 2015

Partitioning Example

Potential analysis:● Location graph● Individual usage habits● Page/module view counts

Grouped by:user_iduser_idevent_data

Page 113: New Analytics Toolbox DevNexus 2015

Partitioning ExampleNode 1

rstricklandjsmithtjones

Node 2

awilsonptaylor

gwatson

Node 3

lmillermjohnson

scarter

Page 114: New Analytics Toolbox DevNexus 2015

Partitioning ExampleNode 1

rstricklandjsmithtjones

Node 2

awilsonptaylor

gwatson

Node 3

lmillermjohnson

scarter

users.reduceByKey { … }

Page 115: New Analytics Toolbox DevNexus 2015

Partitioning ExampleNode 1

rstricklandjsmithtjones

Node 2

awilsonptaylor

gwatson

Node 3

lmillermjohnson

scarter

users.reduceByKey { … }

No shuffling required!

Page 116: New Analytics Toolbox DevNexus 2015

Partitioning ExampleNode 1

rstrickland: home, local, video

jsmith: home, radartjones: radar, video

Node 2awilson: home, local,

radarptaylor: home, videogwatson: local, radar

Node 3lmiller: home, radar,

videomjohnson: radar

scarter: home, local, radar

users.filter(_.eventType == “PageView”)

Page 117: New Analytics Toolbox DevNexus 2015

Partitioning ExampleNode 1home: 2local: 1radar: 2video: 2

Node 2home: 2local: 2radar: 2video: 1

Node 3home: 2local: 1radar: 2video: 1

users.filter(_.eventType == “PageView”) .map { e => (e.event_data[“page”], 1) }

Combining is automatic :)

Page 118: New Analytics Toolbox DevNexus 2015

Partitioning ExampleNode 1

home: 2, 2, 2

Node 2local: 1, 2, 1radar: 2, 2, 2

Node 3video: 2, 1, 1

users.filter(_.eventType == “PageView”) .map { e => (e.event_data[“pageview”], 1) } .reduceByKey(_ + _) Requires a shuffle!

Page 119: New Analytics Toolbox DevNexus 2015

Java vs. Scala

Should I learn Scala? … or leverage existing Java skills?

Page 120: New Analytics Toolbox DevNexus 2015

Java vs. Scala

● Spark uses a functional paradigm

Page 121: New Analytics Toolbox DevNexus 2015

Java vs. Scala

● Spark uses a functional paradigm● Spark is written in Scala

Page 122: New Analytics Toolbox DevNexus 2015

Java vs. Scala

● Spark uses a functional paradigm● Spark is written in Scala● Scala > Java 8 > Java 7

Page 123: New Analytics Toolbox DevNexus 2015

Java vs. Scala

● Spark uses a functional paradigm● Spark is written in Scala● Scala > Java 8 > Java 7● You will want lambdas!

Page 124: New Analytics Toolbox DevNexus 2015

Language Comparison - Scalatext.flatMap { line => line.split(“ “) } .map(word => (word, 1)) .reduceByKey(_ + _)

Page 125: New Analytics Toolbox DevNexus 2015

Language Comparison - Java 8text.flatMap(line -> Arrays.asList(line.split(“ “))) .mapToPair(word -> new Tuple2<String, Integer>(word, 1)) .reduceByKey((x, y) -> x + y)

Page 126: New Analytics Toolbox DevNexus 2015

Language Comparison - Java 7JavaRDD<String> words = text.flatMap( new FlatMapFunction<String, String>() { public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }})

Page 127: New Analytics Toolbox DevNexus 2015

Language Comparison - Java 7JavaPairRDD<String, Integer> ones = words.mapToPair( new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String w) { return new Tuple2<String, Integer>(w, 1); }});

Page 128: New Analytics Toolbox DevNexus 2015

Language Comparison - Java 7ones.reduceByKey( new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) { return i1 + i2; }});

Page 129: New Analytics Toolbox DevNexus 2015

Shameless book plug

https://www.packtpub.com/big-data-and-business-intelligence/cassandra-high-availability

Page 130: New Analytics Toolbox DevNexus 2015

We’re hiring!

Page 131: New Analytics Toolbox DevNexus 2015

Thank you!

Robbie Stricklandlinkedin.com/in/robbiestrickland@[email protected]