Model Stacking for “Online” Data
Jeremy Coyle, Sam Lendle, Sara Moore
May 7, 2013
Background: Model Stacking
- Definition: an ensemble method in which many algorithms, possibly from different types of models, are combined to form a final estimator.
- Motivation: a wide variety of algorithms can be used to predict an outcome, but for any new prediction problem it's not clear which one will best capture the true relationship between features and outcome.
- Method: the algorithms are each trained, then combined via a weighted average estimated on hold-out data (or by cross-validation).
- Implementation: Super Learner combines other algorithms by minimizing a suitable loss function of a linear combination of their predictions under cross-validation (sketched below).
- van der Laan et al. (2003, 2006) proved that the combined approach will perform asymptotically as well as or better than the "best" algorithm in the library of algorithms.
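To make the combination step concrete, here is a minimal sketch (an editor's illustration, not code from the talk) of estimating Super Learner weights by non-negative least squares over held-out predictions; the function and variable names are assumptions, and scipy.optimize.nnls is assumed as the solver:

```python
from scipy.optimize import nnls

def superlearner_weights(holdout_preds, y):
    """Estimate stacking weights by non-negative least squares.

    holdout_preds: (n_obs, n_algorithms) array of held-out (or
                   cross-validated) predictions, one column per algorithm.
    y:             (n_obs,) array of true outcomes.
    """
    w, _ = nnls(holdout_preds, y)   # minimize ||Zw - y||^2 subject to w >= 0
    if w.sum() > 0:
        w = w / w.sum()             # normalize the weights to sum to one
    return w

# The stacked prediction for new data is then the weighted average:
#   stacked = new_preds @ w
```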
Background: Online (Machine) Learning
- In many contexts, the amount of data generated is so large that it is not feasible to store it. Online machine learning, a paradigm in which models are fit to data arriving in a "stream," eliminates the need to store all past observations.
- For each block of incoming data, the fit of the model is updated and the data are then discarded. Before the fit is updated, each new block can also be used to assess the performance of the model (a form of cross-validation).
- Any algorithm that can be fit with stochastic gradient descent can easily be used in an online context, as in the sketch below.
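As a hedged illustration of the block-wise update pattern (a sketch, not the talk's code; all names here are assumptions), one SGD step for logistic regression on an incoming block, after which the block can be discarded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_step(w, X_block, y_block, alpha=0.1):
    """One gradient step for logistic regression on a data block.

    w: current weights; alpha: learning rate. The block is never
    stored, so memory use stays constant as the stream grows.
    """
    grad = X_block.T @ (sigmoid(X_block @ w) - y_block) / len(y_block)
    return w - alpha * grad
```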
Research Question
- Can we develop an efficient online/streaming approach to training stacking algorithms?
- Secondary goal: speed up the method by training each algorithm on each block concurrently.
Datasets: Wikipedia
- Most recent Wikipedia dump (updated since Assignment 3's version)
- Predict inclusion in the parent category "Mathematics" (dichotomous)
- Training set size ≈ 4,500,000; validation set size = 100,000
- Used a dictionary of the 200,000 most common words, minus stopwords
- Single words from article text
Datasets: Stack Overflow
- Stack Overflow questions up until July 31, 2012 (from kaggle.com)
- Predict whether a question ends up "closed" (outcome: OpenStatus, dichotomized)
- Training set size ≈ 3,000,000; validation set size ≈ 300,000
- Used feature hashing (sketched below) over:
  - question tags
  - words in the question title
  - words in the question body (no longer than 15 characters)
  - user reputation at question posting time (ReputationAtPostCreation)
  - number of undeleted questions by the user (OwnerUndeletedAnswerCountAtPostTime)
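A minimal sketch of the hashing trick referred to above (scikit-learn's HashingVectorizer is an off-the-shelf equivalent); the dimension 2**20 is an illustrative choice, not a value from the slides:

```python
import hashlib
import numpy as np

def stable_hash(s):
    # Python's built-in hash() is randomized per process, so use a
    # stable digest instead.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def hash_features(tokens, n_features=2**20):
    """Map a variable-length token list to a fixed-length vector
    without storing a dictionary (the "hashing trick")."""
    x = np.zeros(n_features)
    for tok in tokens:
        h = stable_hash(tok)
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0  # sign bit reduces collision bias
        x[h % n_features] += sign
    return x
```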
Methods: Overview
- Train many estimators with stochastic or mini-batch gradient descent
- Update the weighted combination of estimators on each new block before training each estimator on that block
- Use a "moving window" of predictions and true outcomes to update the weights, where the window can span more than one block (see the sketch following Algorithm 1)
- Test on a validation set
Methods: Algorithm
Algorithm 1: Online Model Stacker
1.1 Using the first data block:
1.2   Take a gradient step for each algorithm in the library of algorithms
1.3 for each subsequent data block do
1.4   Predict the outcome of each observation using each algorithm
1.5   Calculate the risk for this data block (mean squared error)
1.6   Update the best weighted average of algorithms for predicting the true outcome (via NNLS)
1.7   Take a gradient step for each individual algorithm fit
1.8 end
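Putting Algorithm 1 into code, a schematic sketch under assumed interfaces (each learner is assumed to expose predict(X) and gradient_step(X, y); the moving window is a fixed-size buffer of past blocks). This is one reading of the pseudocode, not the authors' implementation:

```python
import numpy as np
from collections import deque
from scipy.optimize import nnls

def online_model_stacker(blocks, learners, window_blocks=5):
    """Schematic version of Algorithm 1 (Online Model Stacker).

    blocks:   iterable of (X, y) data blocks arriving as a stream.
    learners: objects assumed to expose predict(X) and gradient_step(X, y).
    """
    blocks = iter(blocks)
    window = deque(maxlen=window_blocks)          # moving window of (preds, y)
    weights = np.ones(len(learners)) / len(learners)
    risks = []                                    # block-wise risk trajectory

    X0, y0 = next(blocks)                         # 1.1-1.2: first block only trains
    for lr in learners:
        lr.gradient_step(X0, y0)

    for X, y in blocks:                           # 1.3: each subsequent block
        preds = np.column_stack([lr.predict(X) for lr in learners])  # 1.4
        risks.append(np.mean((preds @ weights - y) ** 2))            # 1.5: block MSE

        window.append((preds, y))                 # 1.6: refit weights via NNLS
        Z = np.vstack([p for p, _ in window])
        t = np.concatenate([yy for _, yy in window])
        weights, _ = nnls(Z, t)
        if weights.sum() > 0:
            weights = weights / weights.sum()

        for lr in learners:                       # 1.7: gradient step per learner
            lr.gradient_step(X, y)
    return weights, risks
```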
Methods: Algorithm Library
- Logistic regression
  - Ridge regularization
  - LASSO regularization
- SVM
- Mean
(Per-block updates for each are sketched below.)
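To illustrate how the library members differ only in their per-block update, a sketch assuming plain (sub-)gradient steps; the SVM follows a Pegasos-style hinge-loss update, and none of this is the authors' code:

```python
import numpy as np

def logistic_grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def logistic_step(w, X, y, alpha, lam=0.0, penalty=None):
    """One step for logistic regression; penalty is None, "ridge" (L2),
    or "lasso" (L1, via its sub-gradient)."""
    g = logistic_grad(w, X, y)
    if penalty == "ridge":
        g = g + lam * w
    elif penalty == "lasso":
        g = g + lam * np.sign(w)
    return w - alpha * g

def svm_step(w, X, y_pm, alpha, lam):
    """One sub-gradient step for a linear SVM with hinge loss
    (Pegasos-style); y_pm holds labels in {-1, +1}."""
    active = y_pm * (X @ w) < 1               # points violating the margin
    g = lam * w - X[active].T @ y_pm[active] / len(y_pm)
    return w - alpha * g

# The mean learner simply maintains a running mean of the outcomes seen so far.
```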
Methods: Performance Assessment
- Overall model fit assessed by MSE on the predictions from each block (before the models are updated on that block).
- Performance of the individual algorithms and of the overall model assessed via MSE/AUC on the validation set after one pass through the training data, as in the sketch below.
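A minimal sketch of the validation-set assessment (assuming predicted probabilities y_prob and binary labels y_true; scikit-learn supplies the AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def validation_metrics(y_true, y_prob):
    """MSE (on predicted probabilities) and AUC on the validation set."""
    mse = np.mean((y_prob - y_true) ** 2)
    auc = roc_auc_score(y_true, y_prob)
    return mse, auc
```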
Results: Choosing learning rate (Wikipedia data)
[Figure: Risk by learning rate and SuperLearner. MSE (0.004-0.007) vs. step (0-400) for SuperLearner(), LogisticRegression(alpha = 1.0, 0.1, 0.01, 0.001, 0.0001), and MeanLearner().]
Results: Choosing learning rate (Wikipedia data)
[Figure: Weights by learning rate. Weight (0.0-0.4) vs. step (0-400) for LogisticRegression(alpha = 1.0, 0.1, 0.01, 0.001, 0.0001) and MeanLearner().]
Results: Choosing learning rate (Wikipedia data)
[Figure: Validation set performance of each learning rate and SuperLearner combination. Panels: rMSE, Accuracy, AUC, F1 (x-axis: performance measure; y-axis: model). Models: MeanLearner(), LogisticRegression(alpha = 0.0001 to 1.0), SuperLearner().]
Results: Choosing learning rate (Wikipedia data)
[Figure: Validation set ROC plot for each learning rate and Super Learner combination. True positive rate vs. false positive rate for SuperLearner(), LogisticRegression(alpha = 1.0 to 0.0001), and MeanLearner().]
Results: Choosing penalty parameter (Wikipedia data)
[Figure: Risk for different penalties. MSE (0.004-0.007) vs. step (0-400) for SuperLearner(), LassoLogisticRegression(lambda = 1.0 down to 1e-07, alpha = 0.1), and MeanLearner().]
Results: Overall results (Wikipedia data)
[Figure: Risk by algorithm. MSE (0.004-0.007) vs. step (0-400) for SuperLearner(), LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05, alpha = 0.1), RidgeLogisticRegression(lambda = 1e-05, alpha = 0.1), SVM(lambda = 1e-06, alpha = 0.1), and MeanLearner().]
Results: Overall results (Wikipedia data)
[Figure: Weights by algorithm. Weight (0.0-0.6) vs. step (0-400) for LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05), RidgeLogisticRegression(lambda = 1e-05), SVM(lambda = 1e-06), and MeanLearner().]
Results: Overall results (Wikipedia data)
[Figure: Validation set performance (Wikipedia data). Panels: rMSE, Accuracy, AUC, F1 (x-axis: performance measure; y-axis: model). Models: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, SVM.]
Results: Overall results (Wikipedia data)
[Figure: Validation set ROC plots (Wikipedia data). True positive rate vs. false positive rate for LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, and SVM.]
Results: Stack Overflow
[Figure: Validation set performance (Stack Overflow data). Panels: rMSE, Accuracy, AUC, F1 (x-axis: performance measure; y-axis: model). Models: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SVM, SuperLearner.]
Results: Stack Overflow
[Figure: Validation set ROC plot (Stack Overflow data). True positive rate vs. false positive rate for the same six models.]
Conclusions
- The stacked fit does as well as any individual algorithm in the library
- It doesn't overfit, even when many algorithms are included
- It is also a good way to auto-tune many learning rates concurrently
References
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814. ACM, 2007.
Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–21, 2007.