
Page 1: CSSE463: Image Recognition, Day 27

CSSE463: Image Recognition, Day 27

This week

Last night: k-means lab due.

Today: Classification by “boosting”

Yoav Freund and Robert Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Proceedings of the 2nd European Conference on Computational Learning Theory, March 1995.

http://www.cs.princeton.edu/~schapire/publist.html

Sunday night: project plans and prelim work due

Questions?

Page 2: Motivation for Adaboost

Motivation for Adaboost

SVMs are fairly expensive in terms of memory. Need to store a list of support vectors, which, for RBF kernels, can be long.

[Diagram: “Monolith”, a single SVM block mapping input to output]

By the way, how does svmfwd compute y1? y1 is just the weighted sum of contributions of individual support vectors:

d = data dimension, e.g., 294

numSupVecs, svcoeff (alpha) and bias are learned during training. BTW: looking at which of your training examples are support vectors can be revealing! (Keep in mind for term project)

$y_1 = \sum_{i=1}^{\mathrm{numSupVecs}} \mathrm{svcoeff}_i \, \exp\!\left(-\frac{1}{2\sigma^2}\,\lVert x - \mathrm{sv}_i \rVert^2\right) + \mathrm{bias}$

(sv_i is the i-th support vector, a d-dimensional vector; σ is the RBF kernel width.)

Q1
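A minimal Matlab sketch of this computation, assuming a Gaussian (RBF) kernel of width sigma; the function name rbf_svm_output and its argument list are illustrative, not the actual svmfwd interface:

    function y1 = rbf_svm_output(x, sv, svcoeff, bias, sigma)
    % x: 1 x d test vector; sv: numSupVecs x d matrix of support vectors;
    % svcoeff: numSupVecs x 1 learned coefficients (the alphas); bias: scalar;
    % sigma: RBF kernel width (assumed parameterization).
    numSupVecs = size(sv, 1);
    y1 = bias;
    for i = 1:numSupVecs
        d2 = sum((x - sv(i,:)).^2);                    % squared distance to support vector i
        y1 = y1 + svcoeff(i) * exp(-d2 / (2*sigma^2)); % weighted kernel contribution
    end
    end

The loop makes the memory cost visible: every support vector must be kept around and touched at prediction time.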

Page 3: Motivation for Adaboost

Motivation for Adaboost

SVMs are fairly expensive in terms of memory. Need to store a list of support vectors, which, for RBF kernels, can be long.

Can we do better? Yes, consider simpler classifiers like thresholding or decision trees.

The idea of boosting is to combine the results of a number of “weak” learners (classifiers) in a smart way.

[Diagram: the “Monolith” SVM (input to output) contrasted with a “Team of experts”: weak learners L1 to L4 whose outputs are combined with weights w1 to w4 (input to output)]

Q2
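To make “thresholding” concrete, here is a minimal Matlab sketch of a weighted decision stump, a one-feature threshold classifier; the function name train_stump and its interface are illustrative, not from the course toolbox:

    function [bestFeat, bestThresh, bestErr, h] = train_stump(X, c, p)
    % X: N x d data, c: N x 1 labels in {0,1}, p: N x 1 example weights (summing to 1).
    % Returns the single-feature threshold with the lowest weighted error and
    % h, the N x 1 predictions of that stump on the training data.
    d = size(X, 2);
    bestErr = inf;
    for f = 1:d
        for thresh = unique(X(:,f))'            % candidate thresholds for feature f
            pred = double(X(:,f) >= thresh);    % predict 1 above the threshold
            err  = sum(p .* abs(pred - c));     % weighted training error
            if err < bestErr
                bestErr = err; bestFeat = f; bestThresh = thresh; h = pred;
            end
        end
    end
    end

Each stump on its own is a weak classifier; boosting's job is to combine many of them.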

Page 4: Idea of Boosting (continued)

Idea of Boosting (continued)

Consider each weak learner as an expert that produces “advice” (a classification result, possibly with confidence).

We want to combine their predictions intelligently.

Call each weak learner L multiple times with a different distribution of the data (perhaps a subset of the whole training set).

Use a weighting scheme in which each expert’s advice is weighted by its accuracy on the training set.

Not memorizing the training set, since each classifier is weak.

[Diagram: “Team of experts”: weak learners L1 to L4 combined with weights w1 to w4 (input to output)]

Page 5: Notation

Notation

Training set contains labeled samples: {(x_1, C(x_1)), (x_2, C(x_2)), ..., (x_N, C(x_N))}

C(x_i) is the class label for sample x_i, either 1 or 0.

w is a weight vector, one entry per example, of how important each example is. Can initialize to uniform (1/N) or weight “important” examples more heavily.

h is a hypothesis (weak classifier output) in the range [0,1]. When called with a vector of inputs, h is a vector of outputs.
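Purely as an illustrative mapping of this notation into Matlab variables (the sizes and the example hypothesis are made up, not from the course materials):

    X = rand(100, 294);             % rows are the samples x_1 ... x_N (here N = 100, d = 294)
    c = double(rand(100, 1) > 0.5); % C(x_i), the class label of each sample, 1 or 0
    N = size(X, 1);
    w = ones(N, 1) / N;             % weight vector w, initialized uniformly to 1/N
    h = @(X) double(X(:,1) >= 0.5); % a hypothesis: maps a matrix of inputs to outputs in [0,1]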

Page 6: Adaboost Algorithm

Adaboost Algorithm

Initialize weights: $w_i^1 = \frac{1}{N}$

for t = 1:T

1. Set $p^t = w^t \big/ \sum_{i=1}^{N} w_i^t$   (normalizes so the p weights sum to 1)

2. Call weak learner WL_t with each x, and get back the result vector $h_t$

3. Find the error $\epsilon_t$ of $h_t$:  $\epsilon_t = \sum_{i=1}^{N} p_i^t \, \lvert h_t(x_i) - c(x_i) \rvert$   (the average error on the training set, weighted by p)

4. Set $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$   (low overall error gives $\beta_t$ close to 0, a big impact on the next step; error almost 0.5 gives $\beta_t$ almost 1, a small impact)

5. Re-weight the training data: $w_i^{t+1} = w_i^t \beta_t$ if $h_t$ classified $x_i$ correctly, $w_i^{t+1} = w_i^t$ if incorrectly   (weighting the ones it got right less, so next time it will focus more on the incorrect ones)

Output answer:

$h_f(x) = 1$ if $\sum_{t=1}^{T} \left(\log \tfrac{1}{\beta_t}\right) h_t(x) \;\ge\; \tfrac{1}{2} \sum_{t=1}^{T} \log \tfrac{1}{\beta_t}$, and 0 otherwise

Q3-5
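Putting the five steps together, a minimal Matlab sketch of the training loop; it reuses the illustrative train_stump weak learner sketched earlier, and is not the course's optT demo or the GML toolbox:

    % X: N x d training data, c: N x 1 labels in {0,1}, T: number of boosting rounds.
    N = size(X, 1);
    w = ones(N, 1) / N;                         % initialize weights w_i^1 = 1/N
    beta = zeros(T, 1);
    stumps = cell(T, 1);
    for t = 1:T
        p = w / sum(w);                         % 1. normalize so the p weights sum to 1
        [f, th, err, h] = train_stump(X, c, p); % 2.-3. weak learner and its weighted error
        beta(t) = err / (1 - err);              % 4. beta_t = eps_t / (1 - eps_t)
        correct = abs(h - c) < 0.5;             % which examples h_t classified correctly
        w(correct) = w(correct) * beta(t);      % 5. shrink the weights of the correct ones
        stumps{t} = struct('feat', f, 'thresh', th);  % keep the round-t weak learner
    end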

Page 7: Adaboost Algorithm

Adaboost Algorithm

Output answer:

$h_f(x) = 1$ if $\sum_{t=1}^{T} \left(\log \tfrac{1}{\beta_t}\right) h_t(x) \;\ge\; \tfrac{1}{2} \sum_{t=1}^{T} \log \tfrac{1}{\beta_t}$, and 0 otherwise

Combines predictions made at all T iterations.

Label as class 1 if the weighted average of all T predictions is over ½.

Each prediction is weighted by the error of the classifier at that iteration (better classifiers are weighted higher).

Q6
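A matching Matlab sketch of this final weighted vote, continuing the illustrative stump-based training loop above (stumps and beta as produced there):

    function label = adaboost_predict(x, stumps, beta)
    % x: 1 x d test vector; stumps{t} has fields feat and thresh; beta: T x 1.
    T = numel(stumps);
    alpha = log(1 ./ beta);          % per-round vote weights log(1/beta_t)
    votes = 0;
    for t = 1:T
        h_t = double(x(stumps{t}.feat) >= stumps{t}.thresh);  % weak prediction in {0,1}
        votes = votes + alpha(t) * h_t;
    end
    label = double(votes >= 0.5 * sum(alpha));   % class 1 if the weighted average is over 1/2
    end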

Page 8: Practical issues

Practical issues

For binary problems, weak learners have to have at least 50% accuracy (better than chance).

Running time for training is O(nT), for n samples and T iterations.

How do we choose T? Trade-off between training time and accuracy. Best determined experimentally (see the optT demo).
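One way T is often determined experimentally, sketched in Matlab under the same illustrative setup as above (Xval and cval are an assumed held-out validation set; this is not the course's optT demo): train for Tmax rounds, then keep the number of rounds with the best held-out accuracy.

    accs = zeros(Tmax, 1);
    for t = 1:Tmax
        % Evaluate the ensemble truncated to its first t rounds on the validation set.
        preds = arrayfun(@(i) adaboost_predict(Xval(i,:), stumps(1:t), beta(1:t)), ...
                         (1:size(Xval, 1))');
        accs(t) = mean(preds == cval);
    end
    [bestAcc, bestT] = max(accs);   % bestT trades off training time against accuracy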

Page 9: Multiclass problems

Multiclass problems

Can be generalized to multiple classes using the variants Adaboost.M1 and Adaboost.M2.

Challenge: finding a weak learner with 50% accuracy if there are many classes. Random chance gives only 17% with 6 classes. Version M2 handles this.

Page 10: Available software

Available software

GML AdaBoost Matlab Toolbox

Weak learner: classification tree

Uses less memory if the weak learner is simple

Better accuracy than SVM in tests (toy and real problems)

Demo

Sample use in literature

Q7