Monoids, Stores, and Dependency Injection - Abstractions for Spark Streaming Jobs


DESCRIPTION

Talk I gave at a Spark Meetup on 01/16/2014. Abstract: One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the jobs associated with stream processing. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally, we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.

TRANSCRIPT


Stores, Monoids and Dependency Injection

Spark Meetup 01/16/2014 Ryan Weald


What We’re Going to Cover

• What we do and why we chose Spark

• Common patterns in Spark Streaming jobs

• Monoids as an abstraction for aggregation

• Abstraction for saving the results of jobs

• Using dependency injection for improved testability and developer happiness


What is Sharethrough?

Advertising for the Modern Internet

Function + Form


Why Spark Streaming?


• Liked the theoretical foundation of mini-batch

• Scala codebase + functional API

• Young project with opportunities to contribute

• Batch model for iterative ML algorithms


Great... now maintain dozens of streaming jobs


Common Patterns & Functional Programming


Common Job Pattern

Map -> Aggregate -> Store
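Nearly every job follows this shape. As a minimal sketch, assuming hypothetical parse, combine, and saveResults helpers that the following slides fill in with real code:

import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops like reduceByKey
import org.apache.spark.streaming.dstream.DStream

// Hypothetical skeleton of the Map -> Aggregate -> Store pattern
def runJob(lines: DStream[String]) = {
  val mapped     = lines.map(parse)            // Map: raw line -> (key, value)
  val aggregated = mapped.reduceByKey(combine) // Aggregate: merge values per key
  saveResults(aggregated)                      // Store: write results to a data store
}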


Real World Example

Which publisher pages have an ad unit appeared on?


Mapping Data

inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  val pubPage = params.getOrElse(
    "pub_page_location", "http://example.com")
  val creative = params.getOrElse(
    "creative_key", "unknown")
  val uri = new java.net.URI(pubPage)
  val cleanPubPage = uri.getHost + "/" + uri.getPath
  (creative, cleanPubPage)
}


Aggregation


Basic Aggregation

Add each pub page to a creative’s set


val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

creativePubPages.map { case (ckey, pubPage) =>
  (ckey, Set(pubPage))
}.reduceByKey(sum)


Way too much memory usage in production as data size grows


We need a Bloom filter to keep memory usage fixed
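A Bloom filter gives approximate set membership in a fixed amount of memory, at the cost of a tunable false positive rate. A minimal sketch using Algebird's BloomFilter (introduced properly a few slides later); the sizing numbers here are illustrative:

import com.twitter.algebird._

// Sized for ~500k entries with a 1% false positive rate;
// memory usage stays fixed no matter how many items we add
val bfMonoid = BloomFilter(500000, 0.01)

val bf = bfMonoid.create("example.com/some/page")

// Membership checks are approximate: false positives are possible,
// false negatives are not
bf.contains("example.com/some/page") // ApproximateBoolean(true, ...)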


Total code re-write :(


Monoids to the Rescue


WTF is a Monoid?

trait Monoid[T] {
  def zero: T
  def plus(r: T, l: T): T
}

* Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
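Associativity is what lets Spark combine partial results across partitions in any grouping. A quick sanity check, using the Set-based monoid defined on the next slide:

// Grouping must not change the result:
// plus(plus(a, b), c) == plus(a, plus(b, c))
val (a, b, c) = (Set("x"), Set("y"), Set("z"))

assert(
  SetMonoid.plus(SetMonoid.plus(a, b), c) ==
  SetMonoid.plus(a, SetMonoid.plus(b, c))
)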


Monoid Example

object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

SetMonoid.plus(Set("a"), Set("b")) // returns Set("a", "b")
SetMonoid.plus(Set("a"), Set("a")) // returns Set("a")


Twitter Algebird

http://github.com/twitter/algebird


Algebird Based Aggregation

import com.twitter.algebird._

val bfMonoid = BloomFilter(500000, 0.01)

creativePubPages.map { case (ckey, pubPage) =>
  (ckey, bfMonoid.create(pubPage))
}.reduceByKey(bfMonoid.plus(_, _))


Add the set of users who have seen a creative to the same job


Algebird Based Aggregation

val aggregator = new Monoid[(BF, BF)] {
  def zero = (bfMonoid.zero, bfMonoid.zero)
  def plus(l: (BF, BF), r: (BF, BF)) = {
    (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
  }
}

creativePubPages.map { case (ckey, pubPage, userId) =>
  (
    ckey,
    (bfMonoid.create(pubPage), bfMonoid.create(userId))
  )
}.reduceByKey(aggregator.plus(_, _))


Monoids == Reusable Aggregation


Common Job Pattern

Map -> Aggregate -> Store


Store


How do we store the results?


Storage API Requirements

• Incremental updates (preferably associative)

• Pluggable to support “big data” stores

• Allow for testing jobs


Storage API

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  /*
   * Should follow same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
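To make the contract concrete, here is a hypothetical in-memory implementation (a testing sketch, not Storehaus's actual code), using the Monoid trait from earlier to perform the merge:

// Hypothetical in-memory MergeableStore, useful for tests
class InMemoryStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
  private val data = scala.collection.mutable.Map.empty[K, V]

  def get(key: K): V = data.getOrElse(key, monoid.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  // merge combines the incoming value with whatever is stored,
  // via the monoid's associative plus
  def merge(kv: (K, V)): V = {
    val merged = monoid.plus(get(kv._1), kv._2)
    data(kv._1) = merged
    merged
  }
}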


Twitter Storehaus

http://github.com/twitter/storehaus


Storing Spark Results

def saveResults(result: DStream[(String, BF)],
                store: HBaseStore[String, BF]) = {
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}


What if we don’t have HBase locally?


Dependency Injection to the rescue


Generic storage with environment-specific binding


Generic Storage Method

def saveResults(result: DStream[(String, BF)],
                storeFactory: StorageFactory) = {
  val store = storeFactory.create
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}


Google Guice

https://github.com/sptz45/sse-guice


DI the Store You Need

trait StorageFactory {
  def create: Store[String, BF]
}

class DevModule extends ScalaModule {
  def configure() {
    bind[StorageFactory].to[InMemoryStorageFactory]
  }
}

class ProdModule extends ScalaModule {
  def configure() {
    bind[StorageFactory].to[HBaseStorageFactory]
  }
}
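At startup the job picks a module for its environment and asks Guice for the factory; everything downstream depends only on the StorageFactory trait. A minimal sketch, assuming the modules above (the APP_ENV flag is hypothetical):

import com.google.inject.Guice

// Pick the binding module per environment
val module =
  if (sys.env.getOrElse("APP_ENV", "dev") == "prod") new ProdModule
  else new DevModule

val injector = Guice.createInjector(module)
val storeFactory = injector.getInstance(classOf[StorageFactory])

// The streaming job never knows which concrete store it received
saveResults(results, storeFactory)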


Moving Forward


Potential API additions?

class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
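If those methods existed, the whole Map -> Aggregate -> Store pattern would collapse into one pipeline. A hypothetical sketch of what usage could look like (not a real Spark API):

// Hypothetical usage of the proposed API
inputData
  .map(parseCreativePubPage)   // Map, e.g. the query-param parsing from earlier
  .aggregateByKey(bfMonoid)    // Aggregate with any Monoid
  .store(storeFactory.create)  // Store into any MergeableStore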


Twitter Summingbird

http://github.com/twitter/summingbird

*https://github.com/twitter/summingbird/issues/387


Ryan Weald @rweald

Thank You
