Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs


DESCRIPTION

Talk I gave at a Spark Meetup on 01/16/2014. Abstract: One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the jobs associated with stream processing. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.

TRANSCRIPT

Page 1: Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs

Stores, Monoids and Dependency Injection

Spark Meetup 01/16/2014 Ryan Weald

@rweald

Page 2:

What We’re Going to Cover

• What we do and why we chose Spark

• Common patterns in Spark Streaming jobs

• Monoids as an abstraction for aggregation

• Abstractions for saving the results of jobs

• Using dependency injection for improved testability and developer happiness

Page 3:

What is Sharethrough?

Advertising for the Modern Internet

Function / Form

Page 4:

What is Sharethrough?

Page 5:

Why Spark Streaming?

Page 6:

Why Spark Streaming

• Liked the theoretical foundation of mini-batch

• Scala codebase + functional API

• Young project with opportunities to contribute

• Batch model for iterative ML algorithms

Page 7:

Great... Now maintain dozens of streaming jobs

Page 8:

Common Patterns & Functional Programming

Page 9:

Common Job Pattern

Map -> Aggregate -> Store
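Every job in this family can be sketched with the same skeleton. A minimal sketch of the pattern; the names `events`, `extractKeyValue`, `monoid`, and `store` are illustrative placeholders, not code from the talk:

    events                             // DStream of raw input records
      .map(extractKeyValue)            // Map: parse each record into (key, value)
      .reduceByKey(monoid.plus(_, _))  // Aggregate: combine values per key associatively
      .foreach { rdd =>                // Store: persist each batch's results
        rdd.foreach(store.merge)
      }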

Page 10:

Real World Example

Which publisher pages has an ad unit appeared on?

Page 11:

Mapping Data

    inputData.map { rawRequest =>
      val params = QueryParams.parse(rawRequest)
      val pubPage = params.getOrElse("pub_page_location", "http://example.com")
      val creative = params.getOrElse("creative_key", "unknown")
      val uri = new java.net.URI(pubPage)
      // getPath already includes the leading "/", so no separator is needed
      val cleanPubPage = uri.getHost + uri.getPath
      (creative, cleanPubPage)
    }
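For example (illustrative values), a raw request whose query string contains creative_key=abc&pub_page_location=http://news.example.com/story1 would map to the pair ("abc", "news.example.com/story1").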

Page 12:

Aggregation

Page 13:

Basic Aggregation

Add each pub page to a creative’s set

Page 14:

Basic Aggregation

    val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

    creativePubPages.map { case (ckey, pubPage) =>
      (ckey, Set(pubPage))
    }.reduceByKey(sum)

Page 15:

Way too much memory usage in production as data size grows

Page 16:

We need a bloom filter to keep memory usage fixed

Page 17:

Total code re-write :(

Page 18:

Monoids to the Rescue

Page 19:

WTF is a Monoid?

    trait Monoid[T] {
      def zero: T
      def plus(r: T, l: T): T
    }

* Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
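Any type with a zero and an associative plus fits this trait. A minimal sketch of my own (not from the talk), using integer addition:

    // Integer addition is the simplest Monoid: zero is 0, plus is +
    object IntAddMonoid extends Monoid[Int] {
      def zero = 0
      def plus(r: Int, l: Int) = r + l
    }

    // Associativity means grouping doesn't matter, so a distributed reduce
    // can combine partial results from each partition in any order:
    // plus(plus(1, 5), 2) == plus(1, plus(5, 2)) == 8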

Page 20:

Monoid Example

    object SetMonoid extends Monoid[Set[String]] {
      def zero = Set.empty[String]
      def plus(l: Set[String], r: Set[String]) = l ++ r
    }

    SetMonoid.plus(Set("a"), Set("b"))  // returns Set("a", "b")
    SetMonoid.plus(Set("a"), Set("a"))  // returns Set("a")

Page 21:

Twitter Algebird

http://github.com/twitter/algebird

Page 22:

Algebird Based Aggregation

    import com.twitter.algebird._

    val bfMonoid = BloomFilter(500000, 0.01)

    creativePubPages.map { case (ckey, pubPage) =>
      (ckey, bfMonoid.create(pubPage))
    }.reduceByKey(bfMonoid.plus(_, _))
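The resulting filters answer membership queries in fixed space. A small usage sketch (my example values, assuming Algebird's BF API):

    val bf = bfMonoid.create("news.example.com/story1")
    // contains returns an ApproximateBoolean: true may be a false positive
    // at the configured 1% rate, but false is always correct
    bf.contains("news.example.com/story1")  // ApproximateBoolean(true)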

Page 23:

Add the set of users who have seen a creative to the same job

Page 24:

Algebird Based Aggregation

    val aggregator = new Monoid[(BF, BF)] {
      def zero = (bfMonoid.zero, bfMonoid.zero)
      def plus(l: (BF, BF), r: (BF, BF)) =
        (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
    }

    creativePubPages.map { case (ckey, pubPage, userId) =>
      // the value must be a single pair so reduceByKey sees (K, V)
      (ckey, (bfMonoid.create(pubPage), bfMonoid.create(userId)))
    }.reduceByKey(aggregator.plus(_, _))

Page 25:

Monoids == Reusable Aggregation

Page 26:

Common Job Pattern

Map -> Aggregate -> Store

Page 27:

Store

Page 28:

How do we store the results?

Page 29:

Storage API Requirements

• Incremental updates (preferably associative)

• Pluggable to support “big data” stores

• Allow for testing jobs

Page 30:

Storage API

    trait MergeableStore[K, V] {
      def get(key: K): V
      def put(kv: (K, V)): V
      /*
       * Should follow the same associative property
       * as our Monoid from earlier
       */
      def merge(kv: (K, V)): V
    }
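A minimal in-memory sketch of this contract (my illustration, not Storehaus's implementation), assuming a Monoid[V] supplies zero and the associative plus:

    class InMemoryMergeableStore[K, V](monoid: Monoid[V])
        extends MergeableStore[K, V] {
      private val data = scala.collection.mutable.Map.empty[K, V]
      def get(key: K): V = data.getOrElse(key, monoid.zero)
      def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }
      // merge folds the incoming value into whatever is already stored,
      // so it inherits associativity from the monoid's plus
      def merge(kv: (K, V)): V = {
        val merged = monoid.plus(get(kv._1), kv._2)
        data(kv._1) = merged
        merged
      }
    }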

Page 31:

Twitter Storehaus

http://github.com/twitter/storehaus

Page 32:

Storing Spark Results

    def saveResults(result: DStream[(String, BF)], store: HBaseStore[String, BF]) = {
      result.foreach { rdd =>
        rdd.foreach { element =>
          val (key, value) = element
          store.merge((key, value))
        }
      }
    }

Page 33:

What if we don’t have HBase locally?

Page 34:

Dependency Injection to the rescue

Page 35:

Generic storage with environment-specific binding

Page 36:

Generic Storage Method

    def saveResults(result: DStream[(String, BF)], storeFactory: StorageFactory) = {
      val store = storeFactory.create
      result.foreach { rdd =>
        rdd.foreach { element =>
          val (key, value) = element
          store.merge((key, value))
        }
      }
    }

Page 37:

Google Guice (via the sse-guice Scala DSL)

https://github.com/sptz45/sse-guice

Page 38:

DI the Store You Need

    trait StorageFactory {
      def create: Store[String, BF]
    }

    class DevModule extends ScalaModule {
      def configure() {
        bind[StorageFactory].to[InMemoryStorageFactory]
      }
    }

    class ProdModule extends ScalaModule {
      def configure() {
        bind[StorageFactory].to[HBaseStorageFactory]
      }
    }
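Wiring it together might look like this (a sketch; the environment variable, check, and `results` value are illustrative, not from the talk):

    import com.google.inject.Guice

    // pick the module per environment and let Guice supply the factory
    val module =
      if (sys.env.get("SPARK_ENV") == Some("production")) new ProdModule
      else new DevModule
    val injector = Guice.createInjector(module)
    val storeFactory = injector.getInstance(classOf[StorageFactory])

    saveResults(results, storeFactory)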

Page 39:

Moving Forward

Page 40:

Potential API additions?

    class PairDStreamFunctions[K, V] {
      def aggregateByKey(aggregator: Monoid[V])
      def store(store: MergeableStore[K, V])
    }
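With those methods, the whole job from earlier could collapse to a few lines (hypothetical, since this API does not exist yet; `extractCreativeAndPubPage` and `hbaseStore` are placeholder names):

    creativePubPages
      .map(extractCreativeAndPubPage)  // Map: parse into (key, value)
      .aggregateByKey(bfMonoid)        // Aggregate: the Monoid drives the reduce
      .store(hbaseStore)               // Store: a MergeableStore persists results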

Page 41:

Twitter Summingbird

http://github.com/twitter/summingbird

*https://github.com/twitter/summingbird/issues/387

Page 42:

Ryan Weald @rweald

Thank You