# online learning with structured streaming, spark summit brussels 2016

Post on 07-Jan-2017

188 views

Embed Size (px)

TRANSCRIPT

Recipes for running Spark Streaming in Production

Online Learning with Structured StreamingRam Sriharsha, Vlad Feinberg @halfabrane

Spark Summit, Brussels27 October 2016

1

What is online learning?Update model parameters on each data pointIn batch setting get to see the entire dataset before updateCannot visit data points againIn batch setting, can iterate over data points as many times as we want!

2

An example: the perceptron3

xw

Update Rule: if (y != sign(w.x)), w -> w + y(w.x)Goal: Find the best line separating positiveFrom negative examples on a plane

Why learn online?I want to adapt to changing patterns quicklydata distribution can changee.g, distribution of features that affect learning might change over timeI need to learn a good model within resource + time constraints (large-scale learning)Time to a given accuracy might be faster for certain online algorithms4

Online Classification SettingPick a hypothesisFor each labeled example (, y):Predict label using hypothesisObserve the loss (y, ) (and its gradient)Learn from mistake and update hypothesisGoal: to make as few mistakes as possible in comparison to the best hypothesis in hindsight5

An example: Online SGDInitialize weights Loss function is known.For each labeled example (, y):Perform update -> (y , .)For each new example x:Predict = (.) ( is called link function)6

(y , .)

Distributed Online LearningSynchronousOn each worker:Load training data, compute gradients and update model, push model to driverOn some node:Perform model mergeAsynchronousOn each worker:Load training data, compute gradients and push to serverOn each server:Aggregate the gradients, perform update step7

Vlad Feinberg () - We are using the synchronous model, but the driver doesn't do the merge - it's some worker.ChallengesNot all algorithms admit efficient online versionsLack of infrastructure(Single machine) Vowpal Wabbit works great but hard to use from Scala, Java and other languages.(Distributed) No implementation that is fault tolerant, scalable, robustLack of framework in open source to provide extensible algorithmsAdagrad, normalized learning, L1 regularization,Online SGD, FTRL, ...8

Structured Streaming

One single API DataFrame for everythingSame API for machine learning, batch processing, graphXDataset is a typed version of DataFrame for Scala and Java

End-to-end exactly-once guaranteesThe guarantees extend into the sources/sinks, e.g. MySQL, S3

Understands external event-timeHandling late arriving dataSupport sessionization based on event-time

Structured Streaming

How does it work?at any time, the output of the application is equivalent to executing a batch job on a prefix of the data11

The ModelTrigger: every 1 sec123Time

data upto 1Input

data upto 2

data upto 3QueryInput: data from source as an append-only table

Trigger: how frequently to check input for new data

Query: operations on input usual map/filter/reduce new window, session ops

The ModelTrigger: every 1 sec123

output for data up to 1ResultQueryTime

data upto 1Input

data upto 2

output for data up to 2

data upto 3

output for data up to 3Result: final operated table updated every trigger interval

Output: what part of result to write to data sink after every trigger Complete output: Write full result table every timeOutput

complete output

The ModelTrigger: every 1 sec123

output for data up to 1ResultQueryTime

data upto 1Input

data upto 2

output for data up to 2

data upto 3

output for data up to 3

Output

deltaoutputResult: final operated table updated every trigger interval

Output: what part of result to write to data sink after every trigger Complete output: Write full result table every timeDelta output: Write only the rows that changed in result from previous batchAppend output: Write only new rows

*Not all output modes are feasible with all queries

Streaming ML on Structured Streaming

Streaming ML on Structured StreamingTrigger: every 1 sec123Time

data upto 1Input

data upto 2

data upto 3QueryInput: append only table containing labeled examples

Query: Stateful aggregation query: picks up the last trained model, performs a distributed update + merge

As before, the input can be viewed as an ever-expanding input table, which represents the incoming stream of labeled examples.

Just like most machine learning algorithms can be thought of as iterative reductions over ones entire dataset, the streamingML query is also just a reduce operation.

Streaming ML has key differences in semantics from both, first, traditional ML algorithms and, second, regular aggregations.

First, whereas a typical ML algorithm, like LogisticRegression, will perform multiple passess over the dataset as it improves the model, streaming ML will, in effect, do a single pass over this infinite table.

Streaming ML on Structured StreamingTrigger: every 1 sec123

model for data up to tResultQueryTime

labeled examples upto time tInput

Result: table of model parameters updated every trigger interval

Complete mode: table has one row,constantly being updated

Append mode (in the works): table has timestamp-keyed model, onerow per trigger Output

intermediate models would have the same state at this point of computation for the (abstract) queries #1 and #2

Second, where typical aggregations are stateless, streaming ML is stateful.

For instance, when you calculate a sum of numbers, you dont need any state to compute partial sums of separate data on your machines - you can just start adding from 0.

OTOH, if you have an online algorithm, baked into it is some kind of model state, that you learn on as you see more and more examples. When we get a new batch of labelled data, we cant start from 0 - we have to keep using the most recent model as we go along.

What this query would spit out is the resulting aggregation, the model trained on all of the tables data. In the streaming context, this means the model trained on all of the data seen so far.

Why is this hard?Need to update model, i.eUpdate(previousModel, newDataPoint) = newModelTypical aggregation is associative, commutativee.g. sum(P1: sum(sum(0, data[0]), data[1]),P2: sum(sum(0, data[2]), data[3]))General model update violates associativity + commutativity!18

Solution: Make AssumptionsResult may be partition-dependent, but we dont care as long as we get some valid result.

average-models( P1: update(update(previous model, data[0]), data[1]), P2: update(update(previous model, data[2]), data[3]))

Only partition-dependent if update and average dont commute - can still be deterministic otherwise!19

Stateful AggregatorWithin each partitionInitialize with previous state (instead of zero in regular aggregator)For each item, update statePerform reduce stepOutput final state

Very general abstraction: works for sketches, online statistics (quantiles), online clustering 20

How does it work?

DriverMap

Map

State StoreLabeled Stream SourceReduceIs there more data?yes!run queryMap

Read labeled examplesFeature transforms, gradient updatesModel averagingsave modelread last saved model

APIs

Spark Summit Brussels27 October 2016

22

ML Estimator on StreamsInteroperable with ML pipelines23

Streaming DF

m = estimator.fit()

m.writeStreamstreaming sinkInput: stream of labelled dataOutput: stream of models, updated over time.

Interoperability: You can use this just along with or inside your queries - ANY input streaming dataframeBatch integration. You get out a STREAM of models (to S3).

23

Batch InteroperabilitySeamless application on batch datasets24

Static DF for batch MLmodel = estimator.fit(batchDF)1

n

Calling fit() on a batch dataset runs a single pass of the algorithm. Great for speed on large datasets, analyzing online model behavior in a controlled environment.24

Feature CreationHandle new features as they appear (ex., IPs in fraud detection)Provide transformers, such as the HashingEncoder, that apply the hashing trick.Encode arbitrary (possibly categorical data) without knowing cardinality ahead of time by using a high-dimensional sparse mapping.25

25

API GoalsProvide modern, regret-minimization-based online algorithms.Online Logistic RegressionAdagradOnline gradient descentL2 regularizationInput streams of any kind accepted.Streaming aware feature engineering26

26

Whats next?

Spark Summit Brussels27 October 2016

27

Whats next?More bells and whistlesAdaptive normalizationL1 regularizationMore algorithmsOnline quantile estimation?More general Sketches?Online clustering?Scale testing and benchmarking

28

Vlad Feinberg () - Adagrad was complete, right?

Demo

Spark Summit Brussels27 October 2016

29