CptS 570 – Machine Learning
School of EECS, Washington State University


Page 1:

CptS 570 – Machine Learning
School of EECS

Washington State University


Page 2:

• No one learner is always best (No Free Lunch)
• A combination of learners can overcome individual weaknesses
• How to choose learners that complement one another?
• How to combine their outputs to maximize accuracy?
• Ensemble: weighted majority vote of several learners


Page 3:

• Different algorithms
  ◦ E.g., parametric vs. non-parametric
• Different parameter settings
  ◦ E.g., random initial weights in a neural network
• Different input representations
  ◦ E.g., feature selection
  ◦ E.g., multi-modal training data (e.g., audio & video)
• Different training sets
  ◦ Bagging: different samples of the same training set
  ◦ Boosting/cascading: weight more heavily the examples missed by the previous learned classifier
  ◦ Partitioning: mixture of experts


Page 4:

• All learners generate an output
  ◦ Voting, stacking
• One or a few learners generate the output
  ◦ Chosen by a gating function
  ◦ Mixture of experts
• Learner output weighted by accuracy and complexity
  ◦ Cascading, boosting


Page 5:

• L learners, K outputs
• $d_{ji}(\mathbf{x})$ is the prediction of learner j for output i
• Regression:

$y_i = \sum_{j=1}^{L} w_j d_{ji}$, where $w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$

• Classification:

Choose $C_i$ if $y_i = \max_{k=1}^{K} y_k$

Page 6:

• Majority voting: $w_j = 1/L$
• If a learner produces $P(C_i \mid x)$, use these as weights after normalization
• Weight $w_j$ is the accuracy of learner j on a validation set (see the sketch below)
• Learn the weights (stacked generalization)

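A minimal sketch of this weighted vote with accuracy-based weights; all names and numbers here are illustrative, not from the slides:

```python
import numpy as np

# Toy setting: L = 3 learners, K = 2 classes.
# d[j, i] = learner j's output d_ji for class i on one test instance.
d = np.array([
    [0.8, 0.2],   # learner 1
    [0.4, 0.6],   # learner 2
    [0.7, 0.3],   # learner 3
])

# Accuracy-based weights from a validation set, normalized to sum to 1.
val_acc = np.array([0.90, 0.60, 0.75])
w = val_acc / val_acc.sum()

# y_i = sum_j w_j * d_ji
y = w @ d

# Classification rule: choose C_i with the maximum combined output y_i.
print("combined outputs:", y)            # -> [0.66, 0.34]
print("predicted class:", int(np.argmax(y)))
```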

Page 7:

Example: [figure]

Page 8:

$P(C_i \mid x) = \sum_{\text{all models } \mathcal{M}_j} P(C_i \mid x, \mathcal{M}_j)\, P(\mathcal{M}_j)$

• where $d_{ji} = P(C_i \mid x, \mathcal{M}_j)$ and $w_j = P(\mathcal{M}_j)$
• Majority voting implies a uniform prior over models
• Can't include all models, so choose a few with suspected high probability (a numeric illustration follows)
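A small numeric illustration of this sum, with two hypothetical models and invented priors and posteriors:

```python
# Two hypothetical models M1, M2 with assumed priors P(Mj).
priors = [0.7, 0.3]
# Assumed per-model posteriors: post[j][i] = P(Ci | x, Mj).
post = [[0.9, 0.1],
        [0.6, 0.4]]

# P(Ci | x) = sum over models of P(Ci | x, Mj) * P(Mj)
p = [sum(post[j][i] * priors[j] for j in range(2)) for i in range(2)]
print(p)   # [0.81, 0.19]
```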

Page 9:

• Assume each learner is independent and better than random
• Then adding more learners maintains the bias but reduces the variance (i.e., the error)


$E[y] = E\!\left[\frac{1}{L}\sum_j d_j\right] = \frac{1}{L}\, L\, E[d_j] = E[d_j]$ (bias is maintained)

$\mathrm{Var}(y) = \mathrm{Var}\!\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\,\mathrm{Var}\!\left(\sum_j d_j\right) = \frac{1}{L^2}\, L\, \mathrm{Var}(d_j) = \frac{1}{L}\,\mathrm{Var}(d_j)$ (variance shrinks by $1/L$)

Page 10:

General case:

$\mathrm{Var}(y) = \mathrm{Var}\!\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\left[\sum_j \mathrm{Var}(d_j) + 2\sum_{i<j} \mathrm{Cov}(d_i, d_j)\right]$

• If the learners are positively correlated, variance (and error) increases
• If the learners are negatively correlated, variance (and error) decreases
  ◦ But bias increases
• Voting is a form of smoothing that maintains low bias but decreases variance (a simulation of both effects follows)
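A quick simulation under assumed Gaussian learner outputs: independent learners show the 1/L variance reduction, while a shared (positively correlated) component keeps the variance of the average high:

```python
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

# Independent learners with Var(d_j) = 1: Var(y) should be ~ 1/L.
d = rng.normal(size=(trials, L))
print("independent:", d.mean(axis=1).var())      # ~ 0.1

# Positively correlated learners: a shared component adds covariance,
# so averaging no longer removes most of the variance.
shared = rng.normal(size=(trials, 1))
d_corr = 0.7 * shared + 0.3 * rng.normal(size=(trials, L))
print("correlated:", d_corr.mean(axis=1).var())  # ~ 0.49 + 0.09/L ~ 0.5
```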

Page 11:

• Given a training set X of size N
• Generate L different training sets, each of size N, by sampling with replacement from X
  ◦ Called "bootstrapping"
• Use one learning algorithm to learn L classifiers from the different training sets (a minimal sketch follows this list)
• The learning algorithm must be unstable
  ◦ I.e., small changes in the training set result in different classifiers
  ◦ E.g., decision trees, neural networks

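A minimal bagging sketch, assuming scikit-learn decision trees as the unstable base learner; the dataset and bag size are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)
L, N = 25, len(X)

# Bootstrap: L training sets of size N, sampled with replacement from X.
trees = []
for _ in range(L):
    idx = rng.integers(0, N, size=N)   # sample indices with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the L unstable base classifiers (binary labels).
votes = np.stack([t.predict(X) for t in trees])       # shape (L, N)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of bagged vote:", (y_hat == y).mean())
```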

Page 12:

• Similar to bagging, but the L training sets are chosen to increase negative correlation
• Use one learning algorithm to learn L classifiers
• The training set for classifier j is biased toward examples missed by classifier j-1
• The learning algorithm should be weak (not too accurate)
• Adaptive Boosting (AdaBoost) (see the sketch below)

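A compact AdaBoost-style sketch for binary labels in {-1, +1}, using decision stumps as the weak learner; this follows the standard algorithm, and the dataset and number of rounds are assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, random_state=1)
y = 2 * y01 - 1                        # labels in {-1, +1}
N, T = len(X), 20
w = np.full(N, 1.0 / N)                # example weights, initially uniform

stumps, alphas = [], []
for _ in range(T):
    # Weak learner: a depth-1 tree trained under the current weighting.
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()           # weighted training error
    if err == 0 or err >= 0.5:         # perfect, or no better than random
        break
    alpha = 0.5 * np.log((1 - err) / err)
    # Reweight: examples missed by this classifier gain weight.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(h)
    alphas.append(alpha)

# Final ensemble: sign of the alpha-weighted vote of the stumps.
F = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```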

Page 13:


Page 14:


Each point represents 1 of 27 test domains.

Dietterich, “Machine Learning Research: Four Current Directions,” AI Magazine, Winter 1997.

Page 15:


Page 16:


Page 17:

• Weights depend on the test instance:

$y = \sum_{j=1}^{L} w_j(\mathbf{x})\, d_j(\mathbf{x})$

• Competitive learning
  ◦ Weight $w_j(\mathbf{x})$ is driven toward 1 (and the others toward 0) for the learner j that is best in the region near $\mathbf{x}$ (see the gating sketch below)
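A sketch of instance-dependent weighting via a softmax gate; the gating parameters and toy experts are invented for illustration:

```python
import numpy as np

def gate(x, V):
    # Softmax gating: instance-dependent weights w_j(x), summing to 1.
    s = V @ x
    e = np.exp(s - s.max())
    return e / e.sum()

def moe(x, experts, V):
    # y = sum_j w_j(x) * d_j(x)
    w = gate(x, V)
    return sum(wj * d(x) for wj, d in zip(w, experts))

# Two toy experts, each intended for a different region of input space;
# the gate is set up to prefer expert 1 when x[0] > 0.
experts = [lambda x: x[0] + 1.0, lambda x: -x[0] + 1.0]
V = np.array([[2.0, 0.0],
              [-2.0, 0.0]])
print(moe(np.array([1.5, 0.0]), experts, V))   # ~ expert 1's output, 2.5
```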

Page 18:

• The combining function f( ) is learned
• Train f on data not used to train the base learners (a sketch follows)

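A minimal stacked-generalization sketch: the base learners are fit on one split, and f is fit on their predictions over a held-out split they never saw; the models and split sizes are assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
# Base learners are fit on X1; the combiner f only ever sees X2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=2)

bases = [DecisionTreeClassifier(max_depth=d).fit(X1, y1) for d in (1, 3, 5)]

# Meta-features: base-learner outputs on data not used to train them.
Z2 = np.column_stack([b.predict(X2) for b in bases])
f = LogisticRegression().fit(Z2, y2)   # the learned combining function

# To predict: run the base learners, then feed their outputs to f.
# (This scores f on its own meta-training data; a real evaluation
# would hold out a third split.)
print("stacked accuracy:", f.score(Z2, y2))
```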

Page 19:

• The ensemble need not be fixed
• Can modify the ensemble to improve accuracy or reduce correlation of the base learners
• Subset selection
  ◦ Add/remove base learners while performance improves
• Meta-learners
  ◦ Stack learners to construct new features


Page 20:

• Use classifier $d_j$ only if the previous classifiers lacked confidence (see the sketch below)
• Order classifiers by increasing complexity
• Differs from boosting
  ◦ Both errant and uncertain examples are passed to the next learner

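A sketch of a two-stage cascade under an assumed confidence threshold: the cheap first classifier answers when confident, and only low-confidence examples are passed to the more complex second stage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=3)
# Stage 1 is simple and cheap; stage 2 is more complex and is used
# only when stage 1 lacks confidence.
d1 = DecisionTreeClassifier(max_depth=1).fit(X, y)
d2 = DecisionTreeClassifier(max_depth=6).fit(X, y)

theta = 0.9                            # assumed confidence threshold
p1 = d1.predict_proba(X)
confident = p1.max(axis=1) >= theta

y_hat = np.empty(len(X), dtype=int)
y_hat[confident] = p1[confident].argmax(axis=1)
if (~confident).any():                 # pass uncertain examples onward
    y_hat[~confident] = d2.predict(X[~confident])

print("cascade accuracy:", (y_hat == y).mean())
print("fraction passed to stage 2:", (~confident).mean())
```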

Page 21:

• Typically, the hypothesis space H does not contain the target function f
• Weighted combinations of several approximations may represent classifiers outside of H


[Figure: Decision surfaces defined by learned decision trees, vs. the decision surface defined by a vote over the learned decision trees.]

Page 22:

• $1M prize to the team improving Netflix's movie recommender by 10%
• Won by team "BellKor's Pragmatic Chaos," which combined classifiers from 3 teams
  ◦ BellKor, BigChaos, Pragmatic Theory
• Second place, "The Ensemble," combined classifiers from 23 other teams
• The solutions were effectively ensembles of over 800 classifiers

www.netflixprize.com


Page 23:


Toscher et al., “The BigChaos Solution to the Netflix Grand Prize,” 2009.

Page 24:

• Combining learners can overcome the weaknesses of individual learners
• Base learners must do better than random and have uncorrelated errors
• Ensembles typically take a majority vote of base classifiers
• Boosting, stacking
• Application to recommender systems
  ◦ Netflix Prize
