new analytics toolbox

78
The New Analytics Toolbox Going beyond Hadoop

Upload: robbie-strickland

Post on 26-Jun-2015

480 views

Category:

Technology


0 download

DESCRIPTION

The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. Learn the serious advantages to the new tools, and get an analysis of the current state--including pros and cons as well as what's needed to bootstrap and operate the various options.

TRANSCRIPT

Page 1: New Analytics Toolbox

The New Analytics Toolbox

Going beyond Hadoop

Page 2: New Analytics Toolbox

whoami

Robbie StricklandSoftware Dev Managerlinkedin.com/in/robbiestrickland@[email protected]

Page 3: New Analytics Toolbox

Hadoop works fine, right?

Page 4: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)

Page 5: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient

Page 6: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk

Page 7: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks

Page 8: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks

Page 9: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate

Page 10: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate● Configuration isn’t for the faint of heart

Page 11: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Page 12: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

MPP queries for Hadoop data

Page 13: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Varies by vendor

Page 14: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Generic in-memory analysis

Page 15: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark Hive queries on Spark, replaced by Spark SQL

Page 16: New Analytics Toolbox

Spark wins?

Page 17: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement

Page 18: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat

Page 19: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop

Page 20: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis

Page 21: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model

Page 22: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model● Direct Cassandra integration

Page 23: New Analytics Toolbox

Spark vs. Hadoop

Page 24: New Analytics Toolbox

MapReduce:

Page 25: New Analytics Toolbox

MapReduce:

Page 26: New Analytics Toolbox

MapReduce:

Page 27: New Analytics Toolbox

MapReduce:

Page 28: New Analytics Toolbox

MapReduce:

Page 29: New Analytics Toolbox

MapReduce:

Page 30: New Analytics Toolbox

Spark (angels sing!):

Page 31: New Analytics Toolbox

What is Spark?

Page 32: New Analytics Toolbox

What is Spark?

● In-memory cluster computing

Page 33: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce

Page 34: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets

Page 35: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java

Page 36: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing

Page 37: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing● Supports any existing Hadoop input / output

format

Page 38: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX

Page 39: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib

Page 40: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL

Page 41: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR

Page 42: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR● Easily join datasets from disparate sources

Page 43: New Analytics Toolbox

Spark on Cassandra

Page 44: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

Page 45: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft

Page 46: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)

Page 47: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware

Page 48: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware● Uses HDFS, CassandraFS, or other

distributed FS for checkpointing

Page 49: New Analytics Toolbox

Cassandra with Hadoop

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

NameNode2ndaryNN

JobTracker

Page 50: New Analytics Toolbox

Cassandra with Spark (using HDFS)

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

Spark MasterNameNode2ndaryNN

Page 51: New Analytics Toolbox

A Typical Spark Application

Page 52: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

Page 53: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

Page 54: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

Page 55: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

● Saving/Displaying

Page 56: New Analytics Toolbox

Resilient Distributed Dataset (RDD)

Page 57: New Analytics Toolbox

Resilient Distributed Dataset (RDD)● A distributed collection of items

Page 58: New Analytics Toolbox

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

Page 59: New Analytics Toolbox

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

● Can recalculate from any point of failure

Page 60: New Analytics Toolbox

RDD Transformations vs Actions

Transformations: Produce new RDDs

Actions: Require the materialization of the records to produce a value

Page 61: New Analytics Toolbox

RDD Transformations/Actions

Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...

Actions: collect:Array[T], count, fold, reduce ...

Page 62: New Analytics Toolbox

Resilient Distributed Dataset (RDD)

val numOfAdults = persons.filter(_.age > 17).count()

Transformation Action

Page 63: New Analytics Toolbox

Example

Page 64: New Analytics Toolbox

Demo

Page 65: New Analytics Toolbox

Spark SQL Example

Page 66: New Analytics Toolbox

Spark Streaming

Page 67: New Analytics Toolbox

Spark Streaming

● Creates RDDs from stream source on a defined interval

Page 68: New Analytics Toolbox

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs

Page 69: New Analytics Toolbox

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs● Supports a variety of sources

○ Queues○ Twitter○ Flume○ etc.

Page 70: New Analytics Toolbox

Spark Streaming

Page 71: New Analytics Toolbox

The Output

Page 72: New Analytics Toolbox

Real World Task Distribution

Page 73: New Analytics Toolbox

Real World Task Distribution

Transformations and Actions:

Similar to Scala but …

Your choice of transformations need be made with task distribution and memory in mind

Page 74: New Analytics Toolbox

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2 x number of cores)

● Filter early and often (Queries, Filters, Distinct...)● Use pipelines

Page 75: New Analytics Toolbox

Real World Task Distribution

How do we do that?

● Partial Aggregation● Create an algorithm to be as simple/efficient as it can be

to appropriately answer the question

Page 76: New Analytics Toolbox

Real World Task Distribution

Page 77: New Analytics Toolbox

Real World Task Distribution

Some Common Costly Transformations:

● sorting● groupByKey● reduceByKey● ...

Page 78: New Analytics Toolbox

Thank you!

Robbie Strickland@[email protected]://github.com/rstrickland/sparkdemo