Productionalizing Spark Streaming


DESCRIPTION

Spark Summit 2013 talk: At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features we uncovered a consistent set of challenges across multiple streaming jobs. By addressing these challenges you can speed up development of future streaming jobs. In this talk we will discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them. First, we will look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we will look at the programming abstractions we created using functional programming and existing libraries. Finally, we will look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.

TRANSCRIPT


Productionalizing Spark Streaming

Spark Summit 2013
Ryan Weald

@rweald


What We’re Going to Cover

•What we do and why we chose Spark

•Fault tolerance for long-lived streaming jobs

•Common patterns and functional abstractions

•Testing before we “do it live”


Special focus on common patterns and their solutions


What is Sharethrough?

Advertising for the Modern Internet

Function + Form



Why Spark Streaming

•Liked the theoretical foundation of mini-batch

•Scala codebase + functional API

•Young project with opportunities to contribute

•Batch model for iterative ML algorithms


Great... now productionalize it


Fault Tolerance


Keys to Fault Tolerance

1. Receiver fault tolerance

2. Monitoring job progress


Receiver Fault Tolerance

•Use Actors with supervisors

•Use self-healing connection pools


Use Actors

import akka.actor.{Actor, ActorSystem}
// Receiver and Logging come from the Spark 0.8-era streaming API

class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
  extends Actor with Receiver with Logging {

  implicit val system = ActorSystem()

  override def preStart() = {
    // Your code to set up connections and actors
    // Include inner class to process messages
  }

  def receive: Receive = {
    case _ => logInfo("unknown message")
  }
}
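The slide shows the receiver itself; the supervision half of the bullet is not shown. Below is a minimal sketch of how an Akka supervisor around such a receiver might look. It is our illustration, not code from the talk, and ConnectionActor is a hypothetical stand-in for the child that owns the fragile connection.

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart
import scala.concurrent.duration._

// Hypothetical child actor that owns the fragile broker connection.
class ConnectionActor(uri: String) extends Actor {
  def receive = { case msg => /* consume messages from the connection */ }
}

class ReceiverSupervisor(uri: String) extends Actor {
  // Restart a crashed child (e.g. on a dropped connection) up to
  // 10 times per minute instead of letting the failure kill the job.
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: Exception => Restart
    }

  val connection: ActorRef =
    context.actorOf(Props(new ConnectionActor(uri)), "connection")

  def receive = { case msg => connection forward msg }
}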


Track All Outputs

•Low watermarks (Google MillWheel)

•Database updated_at (see the sketch below)

•Expected output file size alerting
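As a concrete illustration of the updated_at idea (our sketch, not from the talk; the table and column names are assumptions), a cron-style check can alert when the job stops writing output:

import java.sql.DriverManager

def lastWriteAgeSeconds(jdbcUrl: String): Long = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    // beacon_counts and updated_at are hypothetical names for this sketch
    val rs = conn.createStatement().executeQuery(
      "SELECT EXTRACT(EPOCH FROM (now() - MAX(updated_at))) FROM beacon_counts")
    rs.next()
    rs.getLong(1)
  } finally conn.close()
}

def triggerAlert(): Unit = println("ALERT: streaming output is stale")

// Page someone if the stream has written nothing for 5 minutes.
if (lastWriteAgeSeconds("jdbc:postgresql://db/metrics") > 300) triggerAlert()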


Common Patterns & Functional Programming


Common Job Pattern

Map -> Aggregate -> Store


Mapping Data

inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  (params.getOrElse("beaconType", "unknown"), 1L)
}


Aggregation


Basic Aggregation

// beacons is a DStream[(String, Long)]
// example: Seq(("click", 1L), ("click", 1L))
val sum: (Long, Long) => Long = _ + _
beacons.reduceByKey(sum)


What happens when we want to sum multiple things?


Long Basic Aggregation

// inputData is shown as a Seq for illustration; in the job it is a
// DStream[(String, (Long, Long, Long))]
val inputData = Seq(
  ("user_1", (1L, 1L, 1L)),
  ("user_1", (2L, 2L, 2L))
)

def sum(l: (Long, Long, Long), r: (Long, Long, Long)) =
  (l._1 + r._1, l._2 + r._2, l._3 + r._3)

inputData.reduceByKey(sum)


Now sum 4 Longs instead

(ノಥ益ಥ)ノ ┻━┻


Monoids to the Rescue


WTF is a Monoid?

trait Monoid[T] {
  def zero: T
  def plus(r: T, l: T): T
}

* Just need to make sure plus is associative, e.g. (1 + 5) + 2 == 1 + (5 + 2), and that zero is an identity for plus.


Monoid Based Aggregation

object LongMonoid extends Monoid[(Long, Long, Long)] {
  def zero = (0L, 0L, 0L)
  def plus(r: (Long, Long, Long), l: (Long, Long, Long)) =
    (l._1 + r._1, l._2 + r._2, l._3 + r._3)
}

inputData.reduceByKey(LongMonoid.plus(_, _))


Twitter Algebird

http://github.com/twitter/algebird


Algebird Based Aggregation

import com.twitter.algebird._

val aggregator = implicitly[Monoid[(Long, Long, Long)]]
inputData.reduceByKey(aggregator.plus(_, _))
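Algebird derives the tuple monoid from the monoids of its components, so the same two lines work for any tuple width. A quick sanity check of the derived instance (our example values, not from the slides):

aggregator.plus((1L, 1L, 1L), (2L, 2L, 2L)) // == (3L, 3L, 3L)
aggregator.zero                             // == (0L, 0L, 0L)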


How many unique users per publisher?


Too big for a naive in-memory Map


HyperLogLog FTW


HLL Aggregation

import com.twitter.algebird._

val aggregator = new HyperLogLogMonoid(12)
inputData.reduceByKey(aggregator.plus(_, _))
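For that reduceByKey to typecheck, the stream's values must already be HLL sketches, so each user ID is first mapped into one with the monoid's create. A minimal end-to-end sketch (our illustration; publisherToUser is a hypothetical DStream[(String, String)] of (publisher, userId) pairs):

val uniquesPerPublisher = publisherToUser
  .map { case (publisher, userId) =>
    (publisher, aggregator.create(userId.getBytes("UTF-8"))) // id -> sketch
  }
  .reduceByKey(aggregator.plus(_, _))
  .map { case (publisher, sketch) =>
    (publisher, sketch.approximateSize.estimate) // approximate unique users
  }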


Monoids == Reusable Aggregation


Common Job Pattern

Map -> Aggregate -> Store


Store


How do we store the results?


Storage API Requirements

•Incremental updates (preferably associative)

•Pluggable to support “big data” stores

•Allow for testing jobs


Storage API

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V

  /*
   * Should follow the same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
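A minimal in-memory implementation of this trait (our sketch, not Storehaus code; the real Storehaus API is future-based) doubles as a test double for the testing section later. It reuses the Monoid from earlier to supply zero and plus:

import scala.collection.mutable

class InMemoryMergeableStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
  private val data = mutable.Map.empty[K, V]

  def get(key: K): V = data.getOrElse(key, monoid.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  // merge keeps the same associativity contract as the monoid's plus
  def merge(kv: (K, V)): V = {
    val merged = monoid.plus(get(kv._1), kv._2)
    data(kv._1) = merged
    merged
  }
}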


Twitter Storehaus

http://github.com/twitter/storehaus


Storing Spark Results

def saveResults(result: DStream[(String, Long)],
                store: RedisStore[String, Long]) = {
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}
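Merging element by element issues one store call per record; a common refinement (our suggestion, not from the talk; makeStore is a hypothetical factory) is to build the store connection once per partition on the workers:

def saveResultsPerPartition(result: DStream[(String, Long)],
                            makeStore: () => MergeableStore[String, Long]) = {
  result.foreach { rdd =>
    rdd.foreachPartition { elements =>
      val store = makeStore() // constructed on the worker, one per partition
      elements.foreach(store.merge)
    }
  }
}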


Everyone can benefit


Potential API additions?

class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
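If those methods existed, the whole Map -> Aggregate -> Store pattern would collapse to a few lines. A hypothetical usage sketch (parseBeacon, monoid, and redisStore are placeholder names, not real APIs):

inputData
  .map(parseBeacon)        // Map
  .aggregateByKey(monoid)  // Aggregate
  .store(redisStore)       // Store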


Twitter Summingbird

http://github.com/twitter/summingbird

*https://github.com/twitter/summingbird/issues/387


Testing Your Jobs


Testing Best Practices

•Try to avoid full integration tests

•Use in-memory stores for testing

•Keep logic outside of Spark (see the sketch below)

•Use Summingbird's in-memory platform???
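As an illustration of keeping logic outside of Spark (our sketch, assuming ScalaTest; BeaconParser is a hypothetical extraction of the map function from the mapping slide), the map stage becomes a plain function you can unit test without a cluster:

object BeaconParser {
  def parse(rawRequest: String): (String, Long) = {
    val params = rawRequest.split("&").map(_.split("=")).collect {
      case Array(k, v) => k -> v
    }.toMap
    (params.getOrElse("beaconType", "unknown"), 1L)
  }
}

import org.scalatest.funsuite.AnyFunSuite

class BeaconParserTest extends AnyFunSuite {
  test("extracts beaconType with a count of 1") {
    assert(BeaconParser.parse("beaconType=click&pub=acme") == (("click", 1L)))
  }

  test("falls back to unknown when beaconType is missing") {
    assert(BeaconParser.parse("pub=acme") == (("unknown", 1L)))
  }
}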


Ryan Weald
@rweald

Thank You
