Model Stacking for “Online” Data
Jeremy Coyle, Sam Lendle, Sara Moore
May 7, 2013
Background: Model Stacking
- Definition: an ensemble method in which many algorithms, possibly from different types of models, are combined to form a final estimator.
- Motivation: a wide variety of algorithms can be used to predict an outcome, but for any new prediction problem it's not clear which one will best capture the true relationship between features and outcome.
- Method: the algorithms are each trained, then combined via a weighted average estimated on hold-out data (or by cross-validation).
- Implementation: Super Learner combines other algorithms by minimizing a suitable loss function of a linear combination of their predictions under cross-validation (sketched below).
- van der Laan et al. (2003, 2006) proved that the combined approach will perform asymptotically as well as or better than the "best" algorithm in the library of algorithms.
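To make the combination step concrete, here is a minimal sketch (an editor's illustration, not code from the talk) of estimating Super Learner weights by non-negative least squares over held-out predictions; the function and variable names are assumptions, and scipy.optimize.nnls is assumed as the solver:

```python
from scipy.optimize import nnls

def superlearner_weights(holdout_preds, y):
    """Estimate stacking weights by non-negative least squares.

    holdout_preds: (n_obs, n_algorithms) array of held-out (or
                   cross-validated) predictions, one column per algorithm.
    y:             (n_obs,) array of true outcomes.
    """
    w, _ = nnls(holdout_preds, y)   # minimize ||Zw - y||^2 subject to w >= 0
    if w.sum() > 0:
        w = w / w.sum()             # normalize the weights to sum to one
    return w

# The stacked prediction for new data is then the weighted average:
#   stacked = new_preds @ w
```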
Background: Online (Machine) Learning
- In many contexts, the amount of data generated is so large that it is not feasible to store it. Online machine learning, a paradigm in which models are fit to data arriving in a "stream," eliminates the need to store all past observations.
- For each block of incoming data, the fit of the model is updated and the data are then discarded. Before the fit is updated, each new block can also be used to assess the performance of the model (a form of cross-validation).
- Any algorithm that can be fit with stochastic gradient descent can easily be used in an online context, as in the sketch below.
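As a hedged illustration of the block-wise update pattern (a sketch, not the talk's code; all names here are assumptions), one SGD step for logistic regression on an incoming block, after which the block can be discarded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_step(w, X_block, y_block, alpha=0.1):
    """One gradient step for logistic regression on a data block.

    w: current weights; alpha: learning rate. The block is never
    stored, so memory use stays constant as the stream grows.
    """
    grad = X_block.T @ (sigmoid(X_block @ w) - y_block) / len(y_block)
    return w - alpha * grad
```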
Research Question
- Can we develop an efficient online/streaming approach to training stacking algorithms?
- Secondary goal: speed up the method by training each algorithm on each block concurrently.
Datasets: Wikipedia
- Most recent Wikipedia dump (updated since Assignment 3's version)
- Predict inclusion in the parent category "Mathematics" (dichotomous)
- Training set size ≈ 4,500,000; validation set size = 100,000
- Used a dictionary of the 200,000 most common words, minus stopwords
- Single words from article text
Datasets: Stack Overflow
- Stack Overflow questions up until July 31, 2012 (from kaggle.com)
- Predict whether a question ends up "closed" (outcome: OpenStatus, dichotomized)
- Training set size ≈ 3,000,000; validation set size ≈ 300,000
- Used feature hashing (sketched below) over:
  - question tags
  - words in the question title
  - words in the question body (no longer than 15 characters)
  - user reputation at question posting time (ReputationAtPostCreation)
  - number of undeleted questions by the user (OwnerUndeletedAnswerCountAtPostTime)
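A minimal sketch of the hashing trick referred to above (scikit-learn's HashingVectorizer is an off-the-shelf equivalent); the dimension 2**20 is an illustrative choice, not a value from the slides:

```python
import hashlib
import numpy as np

def stable_hash(s):
    # Python's built-in hash() is randomized per process, so use a
    # stable digest instead.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def hash_features(tokens, n_features=2**20):
    """Map a variable-length token list to a fixed-length vector
    without storing a dictionary (the "hashing trick")."""
    x = np.zeros(n_features)
    for tok in tokens:
        h = stable_hash(tok)
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0  # sign bit reduces collision bias
        x[h % n_features] += sign
    return x
```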
Methods: Overview
- Train many estimators with stochastic or mini-batch gradient descent
- Update the weighted combination of estimators on each new block before training each estimator on that block
- Use a "moving window" of predictions and true outcomes to update the weights, where the window can span more than one block (see the sketch following Algorithm 1)
- Test on a validation set
Methods: Algorithm
Algorithm 1: Online Model Stacker
1.1 Using the first data block:
1.2   Take a gradient step for each algorithm in the library of algorithms
1.3 for each subsequent data block do
1.4   Predict the outcome of each observation using each algorithm
1.5   Calculate the risk for this data block (mean squared error)
1.6   Update the best weighted average of algorithms for predicting the true outcome (via NNLS)
1.7   Take a gradient step for each individual algorithm fit
1.8 end
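Putting Algorithm 1 into code, a schematic sketch under assumed interfaces (each learner is assumed to expose predict(X) and gradient_step(X, y); the moving window is a fixed-size buffer of past blocks). This is one reading of the pseudocode, not the authors' implementation:

```python
import numpy as np
from collections import deque
from scipy.optimize import nnls

def online_model_stacker(blocks, learners, window_blocks=5):
    """Schematic version of Algorithm 1 (Online Model Stacker).

    blocks:   iterable of (X, y) data blocks arriving as a stream.
    learners: objects assumed to expose predict(X) and gradient_step(X, y).
    """
    blocks = iter(blocks)
    window = deque(maxlen=window_blocks)          # moving window of (preds, y)
    weights = np.ones(len(learners)) / len(learners)
    risks = []                                    # block-wise risk trajectory

    X0, y0 = next(blocks)                         # 1.1-1.2: first block only trains
    for lr in learners:
        lr.gradient_step(X0, y0)

    for X, y in blocks:                           # 1.3: each subsequent block
        preds = np.column_stack([lr.predict(X) for lr in learners])  # 1.4
        risks.append(np.mean((preds @ weights - y) ** 2))            # 1.5: block MSE

        window.append((preds, y))                 # 1.6: refit weights via NNLS
        Z = np.vstack([p for p, _ in window])
        t = np.concatenate([yy for _, yy in window])
        weights, _ = nnls(Z, t)
        if weights.sum() > 0:
            weights = weights / weights.sum()

        for lr in learners:                       # 1.7: gradient step per learner
            lr.gradient_step(X, y)
    return weights, risks
```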
Methods: Algorithm Library
- Logistic regression
  - Ridge regularization
  - LASSO regularization
- SVM
- Mean
(Per-block updates for each are sketched below.)
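To illustrate how the library members differ only in their per-block update, a sketch assuming plain (sub-)gradient steps; the SVM follows a Pegasos-style hinge-loss update, and none of this is the authors' code:

```python
import numpy as np

def logistic_grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def logistic_step(w, X, y, alpha, lam=0.0, penalty=None):
    """One step for logistic regression; penalty is None, "ridge" (L2),
    or "lasso" (L1, via its sub-gradient)."""
    g = logistic_grad(w, X, y)
    if penalty == "ridge":
        g = g + lam * w
    elif penalty == "lasso":
        g = g + lam * np.sign(w)
    return w - alpha * g

def svm_step(w, X, y_pm, alpha, lam):
    """One sub-gradient step for a linear SVM with hinge loss
    (Pegasos-style); y_pm holds labels in {-1, +1}."""
    active = y_pm * (X @ w) < 1               # points violating the margin
    g = lam * w - X[active].T @ y_pm[active] / len(y_pm)
    return w - alpha * g

# The mean learner simply maintains a running mean of the outcomes seen so far.
```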
Methods: Performance Assessment
- Overall model fit assessed by MSE on the predictions from each block (before the models are updated on that block).
- Performance of the individual algorithms and of the overall model assessed via MSE/AUC on the validation set after one pass through the training data, as in the sketch below.
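A minimal sketch of the validation-set assessment (assuming predicted probabilities y_prob and binary labels y_true; scikit-learn supplies the AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def validation_metrics(y_true, y_prob):
    """MSE (on predicted probabilities) and AUC on the validation set."""
    mse = np.mean((y_prob - y_true) ** 2)
    auc = roc_auc_score(y_true, y_prob)
    return mse, auc
```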
Results: Choosing learning rate (Wikipedia data)
[Figure: Risk by learning rate and SuperLearner. MSE (0.004-0.007) vs. step (0-400) for SuperLearner(), LogisticRegression(alpha = 1.0, 0.1, 0.01, 0.001, 0.0001), and MeanLearner().]
Results: Choosing learning rate (Wikipedia data)
[Figure: Weights by learning rate. Weight (0.0-0.4) vs. step (0-400) for LogisticRegression(alpha = 1.0, 0.1, 0.01, 0.001, 0.0001) and MeanLearner().]
Results: Choosing learning rate (Wikipedia data)
[Figure: Validation set performance of each learning rate and SuperLearner combination. Panels: rMSE, Accuracy, AUC, F1 (x-axis: performance measure; y-axis: model). Models: MeanLearner(), LogisticRegression(alpha = 0.0001 to 1.0), SuperLearner().]
Results: Choosing learning rate (Wikipedia data)
[Figure: Validation set ROC plot for each learning rate and Super Learner combination. True positive rate vs. false positive rate for SuperLearner(), LogisticRegression(alpha = 1.0 to 0.0001), and MeanLearner().]
Results: Choosing penalty parameter (Wikipedia data)
[Figure: Risk for different penalties. MSE (0.004-0.007) vs. step (0-400) for SuperLearner(), LassoLogisticRegression(lambda = 1.0 down to 1e-07, alpha = 0.1), and MeanLearner().]
Results: Overall results (Wikipedia data)
[Figure: Risk by algorithm. MSE (0.004-0.007) vs. step (0-400) for SuperLearner(), LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05, alpha = 0.1), RidgeLogisticRegression(lambda = 1e-05, alpha = 0.1), SVM(lambda = 1e-06, alpha = 0.1), and MeanLearner().]
Results: Overall results (Wikipedia data)
[Figure: Weights by algorithm. Weight (0.0-0.6) vs. step (0-400) for LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05), RidgeLogisticRegression(lambda = 1e-05), SVM(lambda = 1e-06), and MeanLearner().]
Results: Overall results (Wikipedia data)
[Figure: Validation set performance (Wikipedia data). Panels: rMSE, Accuracy, AUC, F1 (x-axis: performance measure; y-axis: model). Models: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, SVM.]
Results: Overall results (Wikipedia data)
[Figure: Validation set ROC plots (Wikipedia data). True positive rate vs. false positive rate for LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, and SVM.]
Results: Stack Overflow
[Figure: Validation set performance (Stack Overflow data). Panels: rMSE, Accuracy, AUC, F1 (x-axis: performance measure; y-axis: model). Models: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SVM, SuperLearner.]
Results: Stack Overflow
[Figure: Validation set ROC plot (Stack Overflow data). True positive rate vs. false positive rate for the same six models.]
Conclusions
- The stacked fit does as well as any individual algorithm in the library
- It doesn't overfit, even when many algorithms are included
- It is also a good way to auto-tune many learning rates concurrently
References
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814. ACM, 2007.
Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–21, 2007.