CptS 570 – Machine Learning
School of EECS, Washington State University


Page 1:

CptS 570 – Machine Learning
School of EECS

Washington State University


Page 2:

• No one learner is always best (No Free Lunch)
• A combination of learners can overcome individual weaknesses
• How to choose learners that complement one another?
• How to combine their outputs to maximize accuracy?
• Ensemble: weighted majority vote of several learners


Page 3:

• Different algorithms
  ◦ E.g., parametric vs. non-parametric
• Different parameter settings
  ◦ E.g., random initial weights in a neural network
• Different input representations
  ◦ E.g., feature selection
  ◦ E.g., multi-modal training data (e.g., audio & video)
• Different training sets
  ◦ Bagging: different samples of the same training set
  ◦ Boosting/cascading: weight more heavily the examples missed by the previous learned classifier
  ◦ Partitioning: mixture of experts


Page 4:

• All learners generate an output
  ◦ Voting, stacking
• One or a few learners generate the output
  ◦ Chosen by a gating function
  ◦ Mixture of experts
• Learner output weighted by accuracy and complexity
  ◦ Cascading, boosting


Page 5:

• L learners, K outputs
• $d_{ji}(\mathbf{x})$ is the prediction of learner j for output i
• Regression:

$y_i = \sum_{j=1}^{L} w_j d_{ji}$, where $w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$

• Classification:

Choose $C_i$ if $y_i = \max_{k=1}^{K} y_k$

Page 6:

• Majority voting: $w_j = 1/L$
• If a learner produces $P(C_i \mid x)$, use these as weights after normalization
• Weight $w_j$ is the accuracy of learner j on a validation set (see the sketch below)
• Learn the weights (stacked generalization)

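A minimal sketch of this weighted vote with accuracy-based weights; all names and numbers here are illustrative, not from the slides:

```python
import numpy as np

# Toy setting: L = 3 learners, K = 2 classes.
# d[j, i] = learner j's output d_ji for class i on one test instance.
d = np.array([
    [0.8, 0.2],   # learner 1
    [0.4, 0.6],   # learner 2
    [0.7, 0.3],   # learner 3
])

# Accuracy-based weights from a validation set, normalized to sum to 1.
val_acc = np.array([0.90, 0.60, 0.75])
w = val_acc / val_acc.sum()

# y_i = sum_j w_j * d_ji
y = w @ d

# Classification rule: choose C_i with the maximum combined output y_i.
print("combined outputs:", y)            # -> [0.66, 0.34]
print("predicted class:", int(np.argmax(y)))
```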

Page 7:

Example: [figure]

Page 8:

$P(C_i \mid x) = \sum_{\text{all models } \mathcal{M}_j} P(C_i \mid x, \mathcal{M}_j)\, P(\mathcal{M}_j)$

• where $d_{ji} = P(C_i \mid x, \mathcal{M}_j)$ and $w_j = P(\mathcal{M}_j)$
• Majority voting implies a uniform prior over models
• Can't include all models, so choose a few with suspected high probability (a numeric illustration follows)
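A small numeric illustration of this sum, with two hypothetical models and invented priors and posteriors:

```python
# Two hypothetical models M1, M2 with assumed priors P(Mj).
priors = [0.7, 0.3]
# Assumed per-model posteriors: post[j][i] = P(Ci | x, Mj).
post = [[0.9, 0.1],
        [0.6, 0.4]]

# P(Ci | x) = sum over models of P(Ci | x, Mj) * P(Mj)
p = [sum(post[j][i] * priors[j] for j in range(2)) for i in range(2)]
print(p)   # [0.81, 0.19]
```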

Page 9:

• Assume each learner is independent and better than random
• Then adding more learners maintains the bias but reduces the variance (i.e., the error)


$E[y] = E\!\left[\frac{1}{L}\sum_j d_j\right] = \frac{1}{L}\, L\, E[d_j] = E[d_j]$ (bias is maintained)

$\mathrm{Var}(y) = \mathrm{Var}\!\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\,\mathrm{Var}\!\left(\sum_j d_j\right) = \frac{1}{L^2}\, L\, \mathrm{Var}(d_j) = \frac{1}{L}\,\mathrm{Var}(d_j)$ (variance shrinks by $1/L$)

Page 10:

General case:

$\mathrm{Var}(y) = \mathrm{Var}\!\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\left[\sum_j \mathrm{Var}(d_j) + 2\sum_{i<j} \mathrm{Cov}(d_i, d_j)\right]$

• If the learners are positively correlated, variance (and error) increases
• If the learners are negatively correlated, variance (and error) decreases
  ◦ But bias increases
• Voting is a form of smoothing that maintains low bias but decreases variance (a simulation of both effects follows)
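A quick simulation under assumed Gaussian learner outputs: independent learners show the 1/L variance reduction, while a shared (positively correlated) component keeps the variance of the average high:

```python
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

# Independent learners with Var(d_j) = 1: Var(y) should be ~ 1/L.
d = rng.normal(size=(trials, L))
print("independent:", d.mean(axis=1).var())      # ~ 0.1

# Positively correlated learners: a shared component adds covariance,
# so averaging no longer removes most of the variance.
shared = rng.normal(size=(trials, 1))
d_corr = 0.7 * shared + 0.3 * rng.normal(size=(trials, L))
print("correlated:", d_corr.mean(axis=1).var())  # ~ 0.49 + 0.09/L ~ 0.5
```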

Page 11:

• Given a training set X of size N
• Generate L different training sets, each of size N, by sampling with replacement from X
  ◦ Called "bootstrapping"
• Use one learning algorithm to learn L classifiers from the different training sets (a minimal sketch follows this list)
• The learning algorithm must be unstable
  ◦ I.e., small changes in the training set result in different classifiers
  ◦ E.g., decision trees, neural networks

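A minimal bagging sketch, assuming scikit-learn decision trees as the unstable base learner; the dataset and bag size are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)
L, N = 25, len(X)

# Bootstrap: L training sets of size N, sampled with replacement from X.
trees = []
for _ in range(L):
    idx = rng.integers(0, N, size=N)   # sample indices with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the L unstable base classifiers (binary labels).
votes = np.stack([t.predict(X) for t in trees])       # shape (L, N)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of bagged vote:", (y_hat == y).mean())
```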

Page 12:

• Similar to bagging, but the L training sets are chosen to increase negative correlation
• Use one learning algorithm to learn L classifiers
• The training set for classifier j is biased toward examples missed by classifier j-1
• The learning algorithm should be weak (not too accurate)
• Adaptive Boosting (AdaBoost) (see the sketch below)

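A compact AdaBoost-style sketch for binary labels in {-1, +1}, using decision stumps as the weak learner; this follows the standard algorithm, and the dataset and number of rounds are assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, random_state=1)
y = 2 * y01 - 1                        # labels in {-1, +1}
N, T = len(X), 20
w = np.full(N, 1.0 / N)                # example weights, initially uniform

stumps, alphas = [], []
for _ in range(T):
    # Weak learner: a depth-1 tree trained under the current weighting.
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()           # weighted training error
    if err == 0 or err >= 0.5:         # perfect, or no better than random
        break
    alpha = 0.5 * np.log((1 - err) / err)
    # Reweight: examples missed by this classifier gain weight.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(h)
    alphas.append(alpha)

# Final ensemble: sign of the alpha-weighted vote of the stumps.
F = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```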

Page 13:


Page 14:


Each point represents 1 of 27 test domains.

Dietterich, “Machine Learning Research: Four Current Directions,” AI Magazine, Winter 1997.

Page 15:


Page 16:


Page 17:

• Weights depend on the test instance:

$y = \sum_{j=1}^{L} w_j(\mathbf{x})\, d_j(\mathbf{x})$

• Competitive learning
  ◦ Weight $w_j(\mathbf{x})$ is driven toward 1 (and the others toward 0) for the learner j that is best in the region near $\mathbf{x}$ (see the gating sketch below)
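A sketch of instance-dependent weighting via a softmax gate; the gating parameters and toy experts are invented for illustration:

```python
import numpy as np

def gate(x, V):
    # Softmax gating: instance-dependent weights w_j(x), summing to 1.
    s = V @ x
    e = np.exp(s - s.max())
    return e / e.sum()

def moe(x, experts, V):
    # y = sum_j w_j(x) * d_j(x)
    w = gate(x, V)
    return sum(wj * d(x) for wj, d in zip(w, experts))

# Two toy experts, each intended for a different region of input space;
# the gate is set up to prefer expert 1 when x[0] > 0.
experts = [lambda x: x[0] + 1.0, lambda x: -x[0] + 1.0]
V = np.array([[2.0, 0.0],
              [-2.0, 0.0]])
print(moe(np.array([1.5, 0.0]), experts, V))   # ~ expert 1's output, 2.5
```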

Page 18:

• The combining function f( ) is learned
• Train f on data not used to train the base learners (a sketch follows)

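A minimal stacked-generalization sketch: the base learners are fit on one split, and f is fit on their predictions over a held-out split they never saw; the models and split sizes are assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
# Base learners are fit on X1; the combiner f only ever sees X2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=2)

bases = [DecisionTreeClassifier(max_depth=d).fit(X1, y1) for d in (1, 3, 5)]

# Meta-features: base-learner outputs on data not used to train them.
Z2 = np.column_stack([b.predict(X2) for b in bases])
f = LogisticRegression().fit(Z2, y2)   # the learned combining function

# To predict: run the base learners, then feed their outputs to f.
# (This scores f on its own meta-training data; a real evaluation
# would hold out a third split.)
print("stacked accuracy:", f.score(Z2, y2))
```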

Page 19:

• The ensemble need not be fixed
• Can modify the ensemble to improve accuracy or reduce correlation of the base learners
• Subset selection
  ◦ Add/remove base learners while performance improves
• Meta-learners
  ◦ Stack learners to construct new features


Page 20:

• Use classifier $d_j$ only if the previous classifiers lacked confidence (see the sketch below)
• Order classifiers by increasing complexity
• Differs from boosting
  ◦ Both errant and uncertain examples are passed to the next learner

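A sketch of a two-stage cascade under an assumed confidence threshold: the cheap first classifier answers when confident, and only low-confidence examples are passed to the more complex second stage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=3)
# Stage 1 is simple and cheap; stage 2 is more complex and is used
# only when stage 1 lacks confidence.
d1 = DecisionTreeClassifier(max_depth=1).fit(X, y)
d2 = DecisionTreeClassifier(max_depth=6).fit(X, y)

theta = 0.9                            # assumed confidence threshold
p1 = d1.predict_proba(X)
confident = p1.max(axis=1) >= theta

y_hat = np.empty(len(X), dtype=int)
y_hat[confident] = p1[confident].argmax(axis=1)
if (~confident).any():                 # pass uncertain examples onward
    y_hat[~confident] = d2.predict(X[~confident])

print("cascade accuracy:", (y_hat == y).mean())
print("fraction passed to stage 2:", (~confident).mean())
```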

Page 21:

• Typically, the hypothesis space H does not contain the target function f
• Weighted combinations of several approximations may represent classifiers outside of H


[Figure: Decision surfaces defined by learned decision trees, vs. the decision surface defined by a vote over the learned decision trees.]

Page 22:

• $1M prize to the team improving Netflix's movie recommender by 10%
• Won by team "BellKor's Pragmatic Chaos," which combined classifiers from 3 teams
  ◦ BellKor, BigChaos, Pragmatic Theory
• Second place, "The Ensemble," combined classifiers from 23 other teams
• The solutions were effectively ensembles of over 800 classifiers

www.netflixprize.com


Page 23:


Toscher et al., “The BigChaos Solution to the Netflix Grand Prize,” 2009.

Page 24:

• Combining learners can overcome the weaknesses of individual learners
• Base learners must do better than random and have uncorrelated errors
• Ensembles typically take a majority vote of base classifiers
• Boosting, stacking
• Application to recommender systems
  ◦ Netflix Prize
