@rweald
Stores, Monoids and Dependency Injection
Spark Meetup 01/16/2014 Ryan Weald
What We’re Going to Cover
•What we do and why we chose Spark
•Common patterns in Spark Streaming jobs
•Monoids as an abstraction for aggregation
•Abstraction for saving the results of jobs
•Using dependency injection for improved
testability and developer happiness
What is Sharethrough?
Advertising for the Modern Internet
FunctionForm
Why Spark Streaming?
Why Spark Streaming
•Liked theoretical foundation of mini-batch
•Scala codebase + functional API
•Young project with opportunities to contribute
•Batch model for iterative ML algorithms
Great... Now maintain dozens
of streaming jobs
Common Patterns &
Functional Programming
Common Job Pattern
Map -> Aggregate -> Store
Real World Example
Which publisher pages has an ad unit appeared on?
Mapping Data
inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  val pubPage = params.getOrElse("pub_page_location", "http://example.com")
  val creative = params.getOrElse("creative_key", "unknown")
  val uri = new java.net.URI(pubPage)
  // getPath already starts with "/", so no extra separator is needed
  val cleanPubPage = uri.getHost + uri.getPath
  (creative, cleanPubPage)
}
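The mapping step can be tried without Spark. The sketch below is a minimal, self-contained version under stated assumptions: `parseQuery` is a hypothetical stand-in for the talk's `QueryParams.parse` (whose real behavior is not shown on the slide), and the input is assumed to be a URL-encoded query string.

```scala
import java.net.{URI, URLDecoder}

object MappingSketch {
  // Hypothetical stand-in for QueryParams.parse: splits "k=v&k2=v2" into a Map
  // and URL-decodes the values. Pairs without "=" are dropped.
  def parseQuery(raw: String): Map[String, String] =
    raw.split("&").toSeq.flatMap { pair =>
      pair.split("=", 2) match {
        case Array(k, v) => Some(k -> URLDecoder.decode(v, "UTF-8"))
        case _           => None
      }
    }.toMap

  // Same logic as the slide: extract (creative key, host + path) from one request.
  def toCreativeAndPage(raw: String): (String, String) = {
    val params   = parseQuery(raw)
    val pubPage  = params.getOrElse("pub_page_location", "http://example.com")
    val creative = params.getOrElse("creative_key", "unknown")
    val uri      = new URI(pubPage)
    (creative, uri.getHost + uri.getPath)
  }
}
```

In a streaming job the same function would be passed to `inputData.map`; keeping it a plain function makes it testable on ordinary strings.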
Aggregation
Basic Aggregation
Add each pub page to a creative’s set
Basic Aggregation
val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

creativePubPages.map {
  case (ckey, pubPage) => (ckey, Set(pubPage))
}.reduceByKey(sum)
Way too much memory usage in
production as data size grows
We need a Bloom filter to keep memory usage fixed
Total code re-write :(
Monoids to the Rescue
WTF is a Monoid?
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}
* Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2), and that zero is the identity: plus(zero, x) == x
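The two monoid requirements, associativity and an identity element, can be checked on concrete values. The following is a minimal sketch, not Algebird's definition: `IntAddition`, `sumAll`, `isAssociative`, and `hasIdentity` are illustrative names introduced here.

```scala
object MonoidLaws {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Integer addition: zero is 0, plus is +.
  object IntAddition extends Monoid[Int] {
    def zero: Int = 0
    def plus(l: Int, r: Int): Int = l + r
  }

  // Fold any sequence with a monoid; an empty sequence yields zero.
  def sumAll[T](items: Seq[T], m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // Associativity: grouping does not change the result.
  def isAssociative[T](a: T, b: T, c: T, m: Monoid[T]): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  // Identity: combining with zero on either side returns the value unchanged.
  def hasIdentity[T](a: T, m: Monoid[T]): Boolean =
    m.plus(m.zero, a) == a && m.plus(a, m.zero) == a
}
```

Associativity is what lets a distributed reduce combine partial results in any grouping and still arrive at the same answer.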
Monoid Example
object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

SetMonoid.plus(Set("a"), Set("b")) // returns Set("a", "b")

SetMonoid.plus(Set("a"), Set("a")) // returns Set("a")
Algebird Based Aggregation
import com.twitter.algebird._

val bfMonoid = BloomFilter(500000, 0.01)

creativePubPages.map {
  case (ckey, pubPage) => (ckey, bfMonoid.create(pubPage))
}.reduceByKey(bfMonoid.plus(_, _))
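To see why a Bloom filter fits the monoid shape, here is a toy sketch. It is not Algebird's implementation: `ToyBloom`, `BloomMonoid`, `width`, and `numHashes` are names made up for illustration, and the hashing scheme (MurmurHash3 with different seeds) is an assumption chosen for brevity.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object ToyBloom {
  // A filter is just a set of bit positions in the range [0, width).
  final case class Bloom(bits: BitSet)

  final class BloomMonoid(width: Int, numHashes: Int) {
    // The empty filter is the identity element.
    val zero: Bloom = Bloom(BitSet.empty)

    // Bit positions for an item: one per hash seed, mapped into [0, width).
    private def positions(item: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        val h = MurmurHash3.stringHash(item, seed)
        ((h % width) + width) % width
      }

    // A filter containing exactly one item.
    def create(item: String): Bloom = Bloom(BitSet(positions(item): _*))

    // Combining two filters is a bitwise OR, which is associative.
    def plus(l: Bloom, r: Bloom): Bloom = Bloom(l.bits | r.bits)

    // May return false positives, never false negatives.
    def mightContain(bf: Bloom, item: String): Boolean =
      positions(item).forall(bf.bits.contains)
  }
}
```

Because every filter uses at most `width` bits, the size of an aggregated value is bounded no matter how many pages are merged into it, which is the property the exact `Set` version lacked.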
Add the set of users who have seen each creative to the same job
Algebird Based Aggregation
val aggregator = new Monoid[(BF, BF)] {
  def zero = (bfMonoid.zero, bfMonoid.zero)
  def plus(l: (BF, BF), r: (BF, BF)) = {
    (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
  }
}

creativePubPages.map {
  case (ckey, pubPage, userId) =>
    // reduceByKey needs (key, value) pairs, so both filters go into one tuple value
    (ckey, (bfMonoid.create(pubPage), bfMonoid.create(userId)))
}.reduceByKey(aggregator.plus(_, _))
Monoids == Reusable Aggregation
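One way to picture the reuse: the aggregation code is written once against `Monoid[V]`, and changing what is aggregated means passing a different monoid. The sketch below works on local collections rather than Spark; `aggregateByKey` here is a hypothetical local stand-in for `reduceByKey`, and `pair` is an illustrative helper that builds the tuple monoid generically.

```scala
object ReusableAggregation {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Local stand-in for reduceByKey: combine all values per key with the monoid.
  def aggregateByKey[K, V](pairs: Seq[(K, V)], m: Monoid[V]): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
    }

  val setMonoid: Monoid[Set[String]] = new Monoid[Set[String]] {
    def zero: Set[String] = Set.empty[String]
    def plus(l: Set[String], r: Set[String]): Set[String] = l ++ r
  }

  // Two monoids combine component-wise into a monoid on pairs.
  def pair[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      def zero: (A, B) = (ma.zero, mb.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
    }
}
```

Swapping `setMonoid` for a Bloom filter monoid would change memory behavior without touching `aggregateByKey`.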
Common Job Pattern
Map -> Aggregate -> Store
Store
How do we store the results?
Storage API Requirements
•Incremental updates (preferably associative)
•Pluggable to support “big data” stores
•Allow for testing jobs
Storage API
trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  /*
   * Should follow same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
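An in-memory implementation of this trait is enough to test jobs locally. The sketch below is one possible reading of the API, with assumptions labeled: the slide does not say what `put` and `merge` return or what `get` does for a missing key, so here they return the stored value and the monoid's `zero` respectively. `InMemoryStore` and `LongAddition` are illustrative names.

```scala
import scala.collection.mutable

object InMemoryStoreSketch {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  trait MergeableStore[K, V] {
    def get(key: K): V
    def put(kv: (K, V)): V
    def merge(kv: (K, V)): V
  }

  // Not thread-safe; intended for single-threaded tests only.
  class InMemoryStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
    private val data = mutable.Map.empty[K, V]

    // Assumption: a missing key reads as the monoid's zero.
    def get(key: K): V = data.getOrElse(key, monoid.zero)

    // Assumption: put overwrites and returns the value now stored.
    def put(kv: (K, V)): V = {
      data.update(kv._1, kv._2)
      kv._2
    }

    // merge combines the incoming value with the stored one via plus.
    def merge(kv: (K, V)): V =
      put(kv._1 -> monoid.plus(get(kv._1), kv._2))
  }

  object LongAddition extends Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }
}
```

Because `merge` delegates to the monoid, the order in which batches arrive does not affect the final stored value as long as `plus` is associative and, for out-of-order batches, commutative.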
Storing Spark Results
def saveResults(result: DStream[(String, BF)],
                store: HBaseStore[String, BF]) = {
  // foreachRDD is the name in later Spark releases for the older DStream.foreach
  result.foreachRDD { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}
What if we don’t have HBase locally?
Dependency Injection to the Rescue
Generic storage with environment-specific binding
Generic Storage Method
def saveResults(result: DStream[(String, BF)],
                storageFactory: StorageFactory) = {
  // The factory decides which concrete store this environment gets
  val store = storageFactory.create
  result.foreachRDD { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}
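DI the Store You Need

import net.codingwell.scalaguice.ScalaModule

trait StorageFactory {
  def create: Store[String, BF]
}

// Development: bind the factory to an in-memory store
class DevModule extends ScalaModule {
  override def configure(): Unit = {
    bind[StorageFactory].to[InMemoryStorageFactory]
  }
}

// Production: bind the factory to HBase
class ProdModule extends ScalaModule {
  override def configure(): Unit = {
    bind[StorageFactory].to[HBaseStorageFactory]
  }
}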
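The same idea can be sketched without a DI framework, which may help show what the Guice modules are doing. This is a hand-wired stand-in, not the talk's setup: `CountStore`, `factoryFor`, and the environment names are hypothetical, values are plain `Long` counts instead of Bloom filters, and results are a local `Seq` instead of a `DStream`.

```scala
import scala.collection.mutable

object ManualWiring {
  // Simplified store: merges counts per key.
  trait CountStore {
    def merge(kv: (String, Long)): Long
    def get(key: String): Long
  }

  trait StorageFactory {
    def create: CountStore
  }

  class InMemoryStorageFactory extends StorageFactory {
    def create: CountStore = new CountStore {
      private val data = mutable.Map.empty[String, Long].withDefaultValue(0L)
      def merge(kv: (String, Long)): Long = {
        data(kv._1) += kv._2
        data(kv._1)
      }
      def get(key: String): Long = data(key)
    }
  }

  // Job code depends only on the factory trait, never on a concrete store.
  def saveResults(results: Seq[(String, Long)], factory: StorageFactory): CountStore = {
    val store = factory.create
    results.foreach(store.merge)
    store
  }

  // Hand-written equivalent of choosing a module per environment.
  // A production binding (for example an HBase-backed factory) would be another case.
  def factoryFor(env: String): StorageFactory = env match {
    case "dev" | "test" => new InMemoryStorageFactory
    case other =>
      throw new IllegalArgumentException(s"no binding for environment: $other")
  }
}
```

A test can then call `saveResults` with `factoryFor("test")` and inspect the returned store, with no HBase cluster involved.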
Moving Forward
Potential API additions?
class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
Twitter Summingbird
http://github.com/twitter/summingbird
*https://github.com/twitter/summingbird/issues/387
Ryan Weald @rweald
Thank You