east bay java user group oct 2014 spark streaming kinesis machine learning

83
Spark and Friends: Spark Streaming, Machine Learning, Graph Processing, Lambda Architecture, Approximations Chris Fregly East Bay Java User Group Oct 2014 Kinesis Streaming

Upload: chris-fregly

Post on 02-Jul-2015

1.321 views

Category:

Documents


5 download

DESCRIPTION

Spark Streaming Kinesis Machine Learning Graph Processing Lambda Architecture Approximations

TRANSCRIPT

Page 1: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark and Friends:Spark Streaming,

Machine Learning, Graph Processing,

Lambda Architecture, Approximations

Chris Fregly

East Bay Java User GroupOct 2014

Kinesis

Streaming

Page 2: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Who am I?Former Netflix’er

(netflix.github.io)

Spark Contributor

(github.com/apache/spark)

Founder

(fluxcapacitor.com)

Author

(effectivespark.com,

sparkinaction.com)

Page 3: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming-Kinesis Jira

Page 4: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Not-so-Quick Poll• Spark, Spark Streaming?

• Hadoop, Hive, Pig?

• Parquet, Avro, RCFile, ORCFile?

• EMR, DynamoDB, Redshift?

• Flume, Kafka, Kinesis, Storm?

• Lambda Architecture?

• PageRank, Collaborative Filtering, Recommendations?

• Probabilistic Data Structs, Bloom Filters, HyperLogLog?

Page 5: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Quick Poll

Raiders or 49’ers?

Page 6: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

“Streaming”

Kinesis

Streaming

Video

Streaming

Piping

Big Data

Streaming

Page 7: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Agenda• Spark, Spark Streaming

• Use Cases

• API and Libraries

• Machine Learning

• Graph Processing

• Execution Model

• Fault Tolerance

• Cluster Deployment

• Monitoring

• Scaling and Tuning

• Lambda Architecture

• Probabilistic Data Structures

• Approximations

Page 8: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Timeline of Spark Evolution

Page 9: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark and Berkeley AMP Lab

Berkeley Data Analytics Stack (BDAS)

Page 10: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Overview

• Based on 2007 Microsoft Dryad paper

• Written in Scala

• Supports Java, Python, SQL, and R

• Data fits in memory when possible, but not

required

• Improved efficiency over MapReduce

– 100x in-memory, 2-10x on-disk

• Compatible with Hadoop

– File formats, SerDes, and UDFs

Page 11: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Use Cases

• Ad hoc, exploratory, interactive analytics

• Real-time + Batch Analytics

– Lambda Architecture

• Real-time Machine Learning

• Real-time Graph Processing

• Approximate, Time-bound Queries

Page 12: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Explosion of Specialized Systems

Page 13: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Unified High-level Spark Libraries

• Spark SQL (Data Processing)

• Spark Streaming (Streaming)

• MLlib (Machine Learning)

• GraphX (Graph Processing)

• BlinkDB (Approximate Queries)

• Statistics (Correlations, Sampling, etc)

• Others

– Shark (Hive on Spark)

– Spork (Pig on Spark)

Page 14: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Benefits of Unified Libraries

• Advancements in higher-level libraries are

pushed down into core and vice-versa

• Examples

– Spark Streaming

• GC and memory management improvements

– Spark GraphX

• IndexedRDD for random, hashed-based access within

a partition versus scanning entire partition

– Spark Core

• Sort-based Shuffle

Page 15: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Resilient Distributed Dataset

(RDD)

Page 16: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

RDD Overview

• Core Spark abstraction

• Represents partitions

across the cluster nodes

• Enables parallel processing

on data sets

• Partitions can be in-memory

or on-disk

• Immutable, recomputable,

fault tolerant

• Contains transformation history (“lineage”) for

whole data set

Page 17: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Types of RDDs

• Spark Out-of-the-Box

– HadoopRDD

– FilteredRDD

– MappedRDD

– PairRDD

– ShuffledRDD

– UnionRDD

– DoubleRDD

– JdbcRDD

– JsonRDD

– SchemaRDD

– VertexRDD

– EdgeRDD

• External

– CassandraRDD (DataStax)

– GeoRDD (Esri)

– EsSpark (ElasticSearch)

Page 18: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

RDD Partitions

Page 19: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

RDD Lineage Examples

Page 20: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Demo!

• Load user data

• Load gender data

• Join user data

with gender data

• Analyze lineage

Page 21: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Join Optimizations• When joining large dataset with small dataset (reference

data)

• Broadcast small dataset to each node/partition of large

dataset (one broadcast per node)

Page 22: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark API

Page 23: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark API Overview

• Richer, more expressive than MapReduce

• Native support for Java, Scala, Python,

SQL, and R (mostly)

• Unified API across all libraries

• Operations

– Transformations (lazy evaluation)

– Actions (execute transformations)

Page 24: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Transformations

Page 25: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Actions

Page 26: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Execution Model

Page 27: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Job Scheduling

• Job– Contains many stages

• Contains many tasks

• FIFO– Long-running jobs may starve resources for other

jobs

• Fair– Round-robin to prevent resource starvation

• Fair Scheduler Pools– High-priority pool for important jobs

– Separate user pools

– FIFO within the pool

– Modeled after Hadoop Fair Scheduler

Page 28: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Execution Model Overview

• Parallel, distributed

• DAG-based

• Lazy evaluation

• Allows optimizations– Reduce disk I/O

– Reduce shuffle I/O

– Single pass through dataset

– Parallel execution

– Task pipelining

• Data locality and rack awareness

• Worker node fault tolerance using RDD lineage graphs per partition

Page 29: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Execution Optimizations

Page 30: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Task Pipelining

Page 31: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Cluster Deployment

Page 32: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Types of Cluster Deployments

• Spark Standalone (default)

• YARN

– Allows hierarchies of resources

– Kerberos integration

– Multiple workloads from disparate execution frameworks

• Hive, Pig, Spark, MapReduce, Cascading, etc

• Mesos

– Coarse-grained

• Single, long-running Mesos tasks runs Spark mini tasks

– Fine-grained

• New Mesos task for each Spark task

• Higher overhead

• Not good for long-running Spark jobs like Spark Streaming

Page 33: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Standalone Deployment

Page 34: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark YARN Deployment

Page 35: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Master High Availability

• Multiple Master Nodes

• ZooKeeper maintains current Master

• Existing applications and workers will be

notified of new Master election

• New applications and workers need to

explicitly specify current Master

• Alternatives (Not recommended)

– Local filesystem

– NFS Mount

Page 36: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark After Dark

(sparkafterdark.com)

Page 37: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark After Dark ~ Tinder

Page 38: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Mutual Match!

Page 39: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Goal: Increase Matches (1/2)

• Top Influencers – PageRank

– Most desirable people

• People You May Know– Shortest Path

– Facebook-enabled

• Recommendations– Alternating Least

Squares (ALS)

Page 40: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Goal: Increase Matches (2/2)

• Cluster on Interests,Height, Religion, etc

– K-Means

– Nearest Neighbor

• Textual Analysisof Profile Text

– TF/IDF

– N-gram

– LDA Topic Extraction

Page 41: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Demo!

Page 42: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming

Page 43: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming Overview

• Low latency, high throughput, fault-tolerance (mostly)

• Long-running Spark application

• Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc.

• Graceful shutdown, in-flight message draining

• Uses Spark Core, DAG Execution Model, and Fault Tolerance

• Submits micro-batch jobs to the cluster

Page 44: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming Overview

Page 45: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming Use Cases• ETL and enrichment

of streaming data

on injest

• Operational

dashboards

• Lambda

Architecture

Page 46: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Discretized Stream (DStream)• Core Spark Streaming abstraction

• Micro-batches of RDDs

• Operations similar to RDD – operates on underlying RDDs

Page 47: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming API

Page 48: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming API Overview

• Rich, expressive API similar to core

• Operations

– Transformations (lazy)

– Actions (execute transformations)

• Window and State Operations

• Requires checkpointing to snip long-

running DStream lineage

• Register DStream as a Spark SQL table for querying?! Wow.

Page 49: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

DStream Transformations

Page 50: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

DStream Actions

Page 51: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Window and State DStream Operations

Page 52: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

DStream Window and State Example

Page 53: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming Cluster

Deployment

Page 54: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming Cluster Deployment

Page 55: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Adding Receivers

Page 56: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Adding Workers

Page 57: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming

+

Kinesis

Page 58: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming + Kinesis Architecture

Page 59: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Throughput and Pricing

Page 60: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Demo!

Kinesis

Streaming

Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scalaJava: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java

https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…

Page 61: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming

Fault Tolerance

Page 62: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Characteristics of Sources

• Buffered

– Flume, Kafka, Kinesis

– Allows replay and back pressure

• Batched

– Flume, Kafka, Kinesis

– Improves throughput at expense of duplication on failure

• Checkpointed

– Kafka, Kinesis

– Allows replay from specific checkpoint

Page 63: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Message Delivery Guarantees• Exactly once [1]

– No loss

– No redeliver

– Perfect delivery

– Incurs higher latency for transactional semantics

– Spark default per batch using DStream lineage

– Degrades to less guarantees depending on source

• At least once [1..n]– No loss

– Possible redeliver

• At most once [0,1]– Possible loss

– No redeliver

– *Best configuration if some data loss is acceptable

• Ordered– Per partition: Kafka, Kinesis

– Global across all partitions: Hard to scale

Page 64: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Types of Checkpoints

Spark

1. Spark checkpointing of StreamingContextDStreams and metadata

2. Lineage of state and window DStreamoperations

Kinesis

3. Kinesis Client Library (KCL) checkpoints current position within shard

– Checkpoint info is stored in DynamoDB per Kinesis application keyed by shard

Page 65: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Fault Tolerance

• Points of Failure

– Receiver

– Driver

– Worker/Processor

• Possible Solutions

– Use HDFS File Source for durability

– Data Replication

– Secondary/Backup Nodes

– Checkpoints

• Stream, Window, and State info

Page 66: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Streaming Receiver Failure

• Use a backup receiver

• Use multiple receivers pulling from multiple shards– Use checkpoint-enabled, sharded streaming

source (ie. Kafka and Kinesis)

• Data is replicated to 2 nodes immediately upon ingestion– Will spill to disk if doesn’t fit in memory

• Possible loss of most-recent batch

• Possible at-least once delivery of batch

• Use buffered sources for replay– Kafka and Kinesis

Page 67: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Streaming Driver Failure

• Use a backup Driver

– Use DStream metadata checkpoint info to recover

• Single point of failure – interrupts stream processing

• Streaming Driver is a long-running Spark application

– Schedules long-running stream receivers

• State and Window RDD checkpoints to HDFS to help avoid data loss (mostly)

Page 68: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Stream Worker/Processor Failure• No problem!

• DStream RDD partitions will be recalculated from lineage

• Causes blip in processing during node failover

Page 69: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming

Monitoring and Tuning

Page 70: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Monitoring

• Monitor driver, receiver, worker nodes, and

streams

• Alert upon failure or unusually high latency

• Spark Web UI

– Streaming tab

• Ganglia, CloudWatch

• StreamingListener callback

Page 71: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Web UI

Page 72: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Tuning• Batch interval

– High: reduce overhead of submitting new tasks for each batch

– Low: keeps latencies low

– Sweet spot: DStream job time (scheduling + processing) is

steady and less than batch interval

• Checkpoint interval

– High: reduce load on checkpoint overhead

– Low: reduce amount of data loss on failure

– Recommendation: 5-10x sliding window interval

• Use DStream.repartition() to increase parallelism of processing

DStream jobs across cluster

• Use spark.streaming.unpersist=true to let the Streaming Framework

figure out when to unpersist

• Use CMS GC for consistent processing times

Page 73: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Lambda Architecture

Page 74: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Lambda Architecture Overview

• Batch Layer– Immutable,

Batch read,Append-only write

– Source of truth

– ie. HDFS

• Speed Layer– Mutable,

Random read/write

– Most complex

– Recent data only

– ie. Cassandra

• Serving Layer– Immutable,

Random read, Batch write

– ie. ElephantDB

Page 75: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark + AWS + Lambda

Page 76: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark + Lambda + GraphX + MLlib

Page 77: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Demo!

• Load JSON

• Convert to Parquet file

• Save Parquet file to disk

• Query Parquet file directly

Page 78: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Approximations

Page 79: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Approximation Overview

• Required for scaling

• Speed up analysis of large datasets

• Reduce size of working dataset

• Data is messy

• Collection of data is messy

• Exact isn’t always necessary

• “Approximate is the new Exact”

Page 80: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Some Approximation Methods

• Approximate time-bound queries

– BlinkDB

• Bernouilli and Poisson Sampling

– RDD: sample(), takeSample()

• HyperLogLog

PairRDD: countApproxDistinctByKey()

• Count-min Sketch

• Spark Streaming and Twitter Algebird

• Bloom Filters

– Everywhere!

Page 81: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Approximations In Action

Figure: Memory Savings with Approximation Techniques(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

Page 82: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Statistics Library

• Correlations

– Dependence between 2 random variables

– Pearson, Spearman

• Hypothesis Testing

– Measure of statistical significance

– Chi-squared test

• Stratified Sampling

– Sample separately from different sub-populations

– Bernoulli and Poisson sampling

– With and without replacement

• Random data generator

– Uniform, standard normal, and Poisson distribution

Page 83: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Summary

• Spark, Spark Streaming Overview

• Use Cases

• API and Libraries

• Machine Learning

• Graph Processing

• Execution Model

• Fault Tolerance

• Cluster Deployment

• Monitoring

• Scaling and Tuning

• Lambda Architecture

• Probabilistic Data Structures

• Approximations

http://effectivespark.com http://sparkinaction.com

Thanks!!

Chris Fregly

@cfregly