Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs


DESCRIPTION

Talk I gave at a Spark Meetup on 01/16/2014. Abstract: One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the jobs associated with stream processing. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.

TRANSCRIPT

Page 1: Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs

Stores, Monoids and Dependency Injection

Spark Meetup 01/16/2014 Ryan Weald

@rweald

Page 2:

What We’re Going to Cover

• What we do and why we chose Spark

• Common patterns in Spark Streaming jobs

• Monoids as an abstraction for aggregation

• Abstractions for saving the results of jobs

• Using dependency injection for improved testability and developer happiness

Page 3:

What is Sharethrough?

Advertising for the Modern Internet

Function / Form

Page 4:

What is Sharethrough?

Page 5:

Why Spark Streaming?

Page 6:

Why Spark Streaming

• Liked the theoretical foundation of mini-batch

• Scala codebase + functional API

• Young project with opportunities to contribute

• Batch model for iterative ML algorithms

Page 7:

Great... Now maintain dozens of streaming jobs

Page 8:

Common Patterns & Functional Programming

Page 9:

Common Job Pattern

Map -> Aggregate -> Store
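Every job in this family can be sketched with the same skeleton. A minimal sketch of the pattern; the names `events`, `extractKeyValue`, `monoid`, and `store` are illustrative placeholders, not code from the talk:

    events                             // DStream of raw input records
      .map(extractKeyValue)            // Map: parse each record into (key, value)
      .reduceByKey(monoid.plus(_, _))  // Aggregate: combine values per key associatively
      .foreach { rdd =>                // Store: persist each batch's results
        rdd.foreach(store.merge)
      }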

Page 10:

Real World Example

Which publisher pages has an ad unit appeared on?

Page 11:

Mapping Data

    inputData.map { rawRequest =>
      val params = QueryParams.parse(rawRequest)
      val pubPage = params.getOrElse("pub_page_location", "http://example.com")
      val creative = params.getOrElse("creative_key", "unknown")
      val uri = new java.net.URI(pubPage)
      // getPath already includes the leading "/", so no separator is needed
      val cleanPubPage = uri.getHost + uri.getPath
      (creative, cleanPubPage)
    }
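For example (illustrative values), a raw request whose query string contains creative_key=abc&pub_page_location=http://news.example.com/story1 would map to the pair ("abc", "news.example.com/story1").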

Page 12:

Aggregation

Page 13:

Basic Aggregation

Add each pub page to a creative’s set

Page 14:

Basic Aggregation

    val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

    creativePubPages.map { case (ckey, pubPage) =>
      (ckey, Set(pubPage))
    }.reduceByKey(sum)

Page 15:

Way too much memory usage in production as data size grows

Page 16:

We need a bloom filter to keep memory usage fixed

Page 17:

Total code re-write :(

Page 18:

Monoids to the Rescue

Page 19:

WTF is a Monoid?

    trait Monoid[T] {
      def zero: T
      def plus(r: T, l: T): T
    }

* Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
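Any type with a zero and an associative plus fits this trait. A minimal sketch of my own (not from the talk), using integer addition:

    // Integer addition is the simplest Monoid: zero is 0, plus is +
    object IntAddMonoid extends Monoid[Int] {
      def zero = 0
      def plus(r: Int, l: Int) = r + l
    }

    // Associativity means grouping doesn't matter, so a distributed reduce
    // can combine partial results from each partition in any order:
    // plus(plus(1, 5), 2) == plus(1, plus(5, 2)) == 8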

Page 20:

Monoid Example

    object SetMonoid extends Monoid[Set[String]] {
      def zero = Set.empty[String]
      def plus(l: Set[String], r: Set[String]) = l ++ r
    }

    SetMonoid.plus(Set("a"), Set("b"))  // returns Set("a", "b")
    SetMonoid.plus(Set("a"), Set("a"))  // returns Set("a")

Page 21:

Twitter Algebird

http://github.com/twitter/algebird

Page 22:

Algebird Based Aggregation

    import com.twitter.algebird._

    val bfMonoid = BloomFilter(500000, 0.01)

    creativePubPages.map { case (ckey, pubPage) =>
      (ckey, bfMonoid.create(pubPage))
    }.reduceByKey(bfMonoid.plus(_, _))
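The resulting filters answer membership queries in fixed space. A small usage sketch (my example values, assuming Algebird's BF API):

    val bf = bfMonoid.create("news.example.com/story1")
    // contains returns an ApproximateBoolean: true may be a false positive
    // at the configured 1% rate, but false is always correct
    bf.contains("news.example.com/story1")  // ApproximateBoolean(true)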

Page 23:

Add the set of users who have seen a creative to the same job

Page 24:

Algebird Based Aggregation

    val aggregator = new Monoid[(BF, BF)] {
      def zero = (bfMonoid.zero, bfMonoid.zero)
      def plus(l: (BF, BF), r: (BF, BF)) =
        (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
    }

    creativePubPages.map { case (ckey, pubPage, userId) =>
      // the value must be a single pair so reduceByKey sees (K, V)
      (ckey, (bfMonoid.create(pubPage), bfMonoid.create(userId)))
    }.reduceByKey(aggregator.plus(_, _))

Page 25:

Monoids == Reusable Aggregation

Page 26:

Common Job Pattern

Map -> Aggregate -> Store

Page 27:

Store

Page 28:

How do we store the results?

Page 29:

Storage API Requirements

• Incremental updates (preferably associative)

• Pluggable to support “big data” stores

• Allow for testing jobs

Page 30:

Storage API

    trait MergeableStore[K, V] {
      def get(key: K): V
      def put(kv: (K, V)): V
      /*
       * Should follow the same associative property
       * as our Monoid from earlier
       */
      def merge(kv: (K, V)): V
    }
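A minimal in-memory sketch of this contract (my illustration, not Storehaus's implementation), assuming a Monoid[V] supplies zero and the associative plus:

    class InMemoryMergeableStore[K, V](monoid: Monoid[V])
        extends MergeableStore[K, V] {
      private val data = scala.collection.mutable.Map.empty[K, V]
      def get(key: K): V = data.getOrElse(key, monoid.zero)
      def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }
      // merge folds the incoming value into whatever is already stored,
      // so it inherits associativity from the monoid's plus
      def merge(kv: (K, V)): V = {
        val merged = monoid.plus(get(kv._1), kv._2)
        data(kv._1) = merged
        merged
      }
    }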

Page 31:

Twitter Storehaus

http://github.com/twitter/storehaus

Page 32:

Storing Spark Results

    def saveResults(result: DStream[(String, BF)], store: HBaseStore[String, BF]) = {
      result.foreach { rdd =>
        rdd.foreach { element =>
          val (key, value) = element
          store.merge((key, value))
        }
      }
    }

Page 33:

What if we don’t have HBase locally?

Page 34:

Dependency Injection to the rescue

Page 35:

Generic storage with environment-specific binding

Page 36:

Generic Storage Method

    def saveResults(result: DStream[(String, BF)], storeFactory: StorageFactory) = {
      val store = storeFactory.create
      result.foreach { rdd =>
        rdd.foreach { element =>
          val (key, value) = element
          store.merge((key, value))
        }
      }
    }

Page 37:

Google Guice (via the sse-guice Scala DSL)

https://github.com/sptz45/sse-guice

Page 38:

DI the Store You Need

    trait StorageFactory {
      def create: Store[String, BF]
    }

    class DevModule extends ScalaModule {
      def configure() {
        bind[StorageFactory].to[InMemoryStorageFactory]
      }
    }

    class ProdModule extends ScalaModule {
      def configure() {
        bind[StorageFactory].to[HBaseStorageFactory]
      }
    }
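Wiring it together might look like this (a sketch; the environment variable, check, and `results` value are illustrative, not from the talk):

    import com.google.inject.Guice

    // pick the module per environment and let Guice supply the factory
    val module =
      if (sys.env.get("SPARK_ENV") == Some("production")) new ProdModule
      else new DevModule
    val injector = Guice.createInjector(module)
    val storeFactory = injector.getInstance(classOf[StorageFactory])

    saveResults(results, storeFactory)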

Page 39:

Moving Forward

Page 40:

Potential API additions?

    class PairDStreamFunctions[K, V] {
      def aggregateByKey(aggregator: Monoid[V])
      def store(store: MergeableStore[K, V])
    }
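With those methods, the whole job from earlier could collapse to a few lines (hypothetical, since this API does not exist yet; `extractCreativeAndPubPage` and `hbaseStore` are placeholder names):

    creativePubPages
      .map(extractCreativeAndPubPage)  // Map: parse into (key, value)
      .aggregateByKey(bfMonoid)        // Aggregate: the Monoid drives the reduce
      .store(hbaseStore)               // Store: a MergeableStore persists results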

Page 41:

Twitter Summingbird

http://github.com/twitter/summingbird

*https://github.com/twitter/summingbird/issues/387

Page 42:

Ryan Weald @rweald

Thank You