Monoids, Stores, and Dependency Injection - Abstractions for Spark Streaming Jobs


DESCRIPTION

Talk I gave at a Spark Meetup on 01/16/2014. Abstract: One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the jobs associated with stream processing. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally, we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.

TRANSCRIPT


Stores, Monoids and Dependency Injection

Spark Meetup 01/16/2014 Ryan Weald


What We’re Going to Cover

• What we do and why we chose Spark

• Common patterns in Spark Streaming jobs

• Monoids as an abstraction for aggregation

• Abstraction for saving the results of jobs

• Using dependency injection for improved testability and developer happiness


What is Sharethrough?

Advertising for the Modern Internet

Function + Form


Why Spark Streaming?


• Liked the theoretical foundation of mini-batch

• Scala codebase + functional API

• Young project with opportunities to contribute

• Batch model for iterative ML algorithms


Great... now maintain dozens of streaming jobs


Common Patterns & Functional Programming


Common Job Pattern

Map -> Aggregate -> Store
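Nearly every job follows this shape. As a minimal sketch, assuming hypothetical parse, combine, and saveResults helpers that the following slides fill in with real code:

import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops like reduceByKey
import org.apache.spark.streaming.dstream.DStream

// Hypothetical skeleton of the Map -> Aggregate -> Store pattern
def runJob(lines: DStream[String]) = {
  val mapped     = lines.map(parse)            // Map: raw line -> (key, value)
  val aggregated = mapped.reduceByKey(combine) // Aggregate: merge values per key
  saveResults(aggregated)                      // Store: write results to a data store
}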


Real World Example

Which publisher pages have an ad unit appeared on?


Mapping Data

inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  val pubPage = params.getOrElse(
    "pub_page_location", "http://example.com")
  val creative = params.getOrElse(
    "creative_key", "unknown")
  val uri = new java.net.URI(pubPage)
  val cleanPubPage = uri.getHost + "/" + uri.getPath
  (creative, cleanPubPage)
}


Aggregation


Basic Aggregation

Add each pub page to a creative’s set


val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

creativePubPages.map { case (ckey, pubPage) =>
  (ckey, Set(pubPage))
}.reduceByKey(sum)


Way too much memory usage in production as data size grows


We need a Bloom filter to keep memory usage fixed
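A Bloom filter gives approximate set membership in a fixed amount of memory, at the cost of a tunable false positive rate. A minimal sketch using Algebird's BloomFilter (introduced properly a few slides later); the sizing numbers here are illustrative:

import com.twitter.algebird._

// Sized for ~500k entries with a 1% false positive rate;
// memory usage stays fixed no matter how many items we add
val bfMonoid = BloomFilter(500000, 0.01)

val bf = bfMonoid.create("example.com/some/page")

// Membership checks are approximate: false positives are possible,
// false negatives are not
bf.contains("example.com/some/page") // ApproximateBoolean(true, ...)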


Total code re-write :(


Monoids to the Rescue


WTF is a Monoid?

trait Monoid[T] {
  def zero: T
  def plus(r: T, l: T): T
}

* Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
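Associativity is what lets Spark combine partial results across partitions in any grouping. A quick sanity check, using the Set-based monoid defined on the next slide:

// Grouping must not change the result:
// plus(plus(a, b), c) == plus(a, plus(b, c))
val (a, b, c) = (Set("x"), Set("y"), Set("z"))

assert(
  SetMonoid.plus(SetMonoid.plus(a, b), c) ==
  SetMonoid.plus(a, SetMonoid.plus(b, c))
)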


Monoid Example

object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

SetMonoid.plus(Set("a"), Set("b")) // returns Set("a", "b")
SetMonoid.plus(Set("a"), Set("a")) // returns Set("a")


Twitter Algebird

http://github.com/twitter/algebird


Algebird Based Aggregation

import com.twitter.algebird._

val bfMonoid = BloomFilter(500000, 0.01)

creativePubPages.map { case (ckey, pubPage) =>
  (ckey, bfMonoid.create(pubPage))
}.reduceByKey(bfMonoid.plus(_, _))


Add the set of users who have seen a creative to the same job


Algebird Based Aggregation

val aggregator = new Monoid[(BF, BF)] {
  def zero = (bfMonoid.zero, bfMonoid.zero)
  def plus(l: (BF, BF), r: (BF, BF)) = {
    (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
  }
}

creativePubPages.map { case (ckey, pubPage, userId) =>
  (
    ckey,
    (bfMonoid.create(pubPage), bfMonoid.create(userId))
  )
}.reduceByKey(aggregator.plus(_, _))


Monoids == Reusable Aggregation


Common Job Pattern

Map -> Aggregate -> Store


Store


How do we store the results?


Storage API Requirements

• Incremental updates (preferably associative)

• Pluggable to support “big data” stores

• Allow for testing jobs


Storage API

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  /*
   * Should follow same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
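To make the contract concrete, here is a hypothetical in-memory implementation (a testing sketch, not Storehaus's actual code), using the Monoid trait from earlier to perform the merge:

// Hypothetical in-memory MergeableStore, useful for tests
class InMemoryStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
  private val data = scala.collection.mutable.Map.empty[K, V]

  def get(key: K): V = data.getOrElse(key, monoid.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  // merge combines the incoming value with whatever is stored,
  // via the monoid's associative plus
  def merge(kv: (K, V)): V = {
    val merged = monoid.plus(get(kv._1), kv._2)
    data(kv._1) = merged
    merged
  }
}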


Twitter Storehaus

http://github.com/twitter/storehaus


Storing Spark Results

def saveResults(result: DStream[(String, BF)],
                store: HBaseStore[String, BF]) = {
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}


What if we don’t have HBase locally?


Dependency Injection to the rescue


Generic storage with environment-specific binding


Generic Storage Method

def saveResults(result: DStream[(String, BF)],
                storeFactory: StorageFactory) = {
  val store = storeFactory.create
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}


Google Guice

https://github.com/sptz45/sse-guice


DI the Store You Need

trait StorageFactory {
  def create: Store[String, BF]
}

class DevModule extends ScalaModule {
  def configure() {
    bind[StorageFactory].to[InMemoryStorageFactory]
  }
}

class ProdModule extends ScalaModule {
  def configure() {
    bind[StorageFactory].to[HBaseStorageFactory]
  }
}
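At startup the job picks a module for its environment and asks Guice for the factory; everything downstream depends only on the StorageFactory trait. A minimal sketch, assuming the modules above (the APP_ENV flag is hypothetical):

import com.google.inject.Guice

// Pick the binding module per environment
val module =
  if (sys.env.getOrElse("APP_ENV", "dev") == "prod") new ProdModule
  else new DevModule

val injector = Guice.createInjector(module)
val storeFactory = injector.getInstance(classOf[StorageFactory])

// The streaming job never knows which concrete store it received
saveResults(results, storeFactory)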


Moving Forward


Potential API additions?

class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
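If those methods existed, the whole Map -> Aggregate -> Store pattern would collapse into one pipeline. A hypothetical sketch of what usage could look like (not a real Spark API):

// Hypothetical usage of the proposed API
inputData
  .map(parseCreativePubPage)   // Map, e.g. the query-param parsing from earlier
  .aggregateByKey(bfMonoid)    // Aggregate with any Monoid
  .store(storeFactory.create)  // Store into any MergeableStore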


Twitter Summingbird

http://github.com/twitter/summingbird

*https://github.com/twitter/summingbird/issues/387


Ryan Weald @rweald

Thank You
