
Transcript
Page 1: AdaBoost new.pdf

http://engineering.rowan.edu/~polikar/CLASSES/ECE555

Advanced Topics in Pattern Recognition © 2005, Robi Polikar, Rowan University, Glassboro, NJ

Advanced Topics in Pattern Recognition

Dept. of Electrical and Computer Engineering, 0909.555.01

Week 9: AdaBoost & Learn++

Graphical drawings: Pattern Classification, Duda, Hart and Stork, © John Wiley and Sons, 2001.

AdaBoost
AdaBoost.M1
AdaBoost.M2
AdaBoost.R

Learn++

Page 2: AdaBoost new.pdf


This Week in PR

AdaBoost and Variations
AdaBoost.M1
AdaBoost.M2
AdaBoost.R (independent research)
Learn++

Bias Variance Analysis

Page 3: AdaBoost new.pdf


AdaBoost

AdaBoost (Adaptive Boosting) is arguably the most popular and successful of all ensemble generation algorithms. It extends the original boosting algorithm to multi-class problems.

Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.

It solves “the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb.”

AdaBoost generates an ensemble of classifiers, each trained on data drawn from a distribution that starts out uniform and is iteratively updated to give more weight to the instances that are misclassified. Each classifier therefore focuses increasingly on the instances that are more difficult to classify. The classifiers are then combined through weighted majority voting.

Page 4: AdaBoost new.pdf


AdaBoost

Algorithm AdaBoost
1. Create a discrete distribution over the training data by assigning a weight to each instance. Initially the distribution is uniform, so all weights are equal.
2. Draw a subset from this distribution and train a weak classifier with this dataset.
3. Compute the error, ε, of this classifier on its own training dataset, and make sure that this error is less than ½.
4. Test the entire training data on this classifier:
• If an instance x is correctly classified, reduce its weight in proportion to ε.
• If it is misclassified, increase its weight in proportion to ε.
• Normalize the weights so that they constitute a distribution.
5. Repeat until T classifiers are generated.
6. Combine the classifiers using weighted majority voting.

Page 5: AdaBoost new.pdf


AdaBoost.M1

[Block diagram of AdaBoost.M1: training subsets S1, ..., ST are drawn from the training-data distributions D1, ..., DT; each subset trains a classifier C1, ..., CT producing hypotheses h1, ..., hT with normalized errors β1, ..., βT, which drive the distribution update; the hypotheses are combined into the ensemble decision with voting weights log(1/βt). © Robi Polikar]

Page 6: AdaBoost new.pdf


AdaBoost.M1

Algorithm AdaBoost.M1

Input:
• Sequence of m examples S = [(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)] with labels y_i ∈ Y = {1, ..., C}, drawn from a distribution D
• Weak learning algorithm WeakLearn
• Integer T specifying the number of iterations

Initialize $D_1(i) = 1/m, \; \forall i$.

Do for t = 1, 2, ..., T:
1. Call WeakLearn, providing it with the distribution $D_t$.
2. Get back a hypothesis $h_t : X \to Y$.
3. Calculate the error of $h_t$:
$$\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
If $\varepsilon_t > \tfrac{1}{2}$, set T = t − 1 and abort the loop.
4. Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.
5. Update the distribution $D_t$:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t, & \text{if } h_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$
where $Z_t$ is a normalization constant chosen so that $D_{t+1}$ becomes a distribution.

Output the final hypothesis:
$$h_{\text{final}}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}$$
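To make the loop concrete, here is a minimal NumPy / scikit-learn sketch of AdaBoost.M1 as described above; the depth-1 decision-tree ("stump") weak learner, the bootstrap resampling from D_t, and the small floor that avoids log(1/0) are illustrative choices, not part of the original formulation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, T, seed=0):
    """Minimal AdaBoost.M1: returns the hypotheses and their voting weights log(1/beta_t)."""
    rng = np.random.default_rng(seed)
    m = len(y)                                  # X, y are NumPy arrays
    D = np.full(m, 1.0 / m)                     # D_1(i) = 1/m
    hypotheses, log_inv_betas = [], []
    for _ in range(T):
        # Train the weak learner on a sample drawn from the current distribution D_t
        idx = rng.choice(m, size=m, replace=True, p=D)
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = h.predict(X) != y                # evaluate on the full training set
        eps = D[miss].sum()                     # weighted error epsilon_t
        if eps > 0.5:                           # worse than random guessing: abort
            break
        beta = max(eps, 1e-10) / (1.0 - eps)    # beta_t, floored to avoid log(1/0)
        D[~miss] *= beta                        # shrink weights of correctly classified instances
        D /= D.sum()                            # renormalize so D_{t+1} is a distribution
        hypotheses.append(h)
        log_inv_betas.append(np.log(1.0 / beta))
    return hypotheses, np.array(log_inv_betas)

def adaboost_m1_predict(hypotheses, weights, X, classes):
    """Weighted majority vote: argmax_y of sum over {t: h_t(x) = y} of log(1/beta_t)."""
    votes = np.zeros((len(X), len(classes)))
    for h, w in zip(hypotheses, weights):
        pred = h.predict(X)
        for c_idx, c in enumerate(classes):
            votes[pred == c, c_idx] += w
    return np.asarray(classes)[votes.argmax(axis=1)]
```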

Page 7: AdaBoost new.pdf


Weighted Majority Voting Demystified!

Problem: How many hours a day should our students spend working on homework?

[Figure: five experts answer 3, 6, 7, 9, and 8 hours; a weight assigner gives them voting weights 0.3, 0.2, 0.25, 0.15, and 0.1; weighted majority voting combines the answers into 5.96 ≈ 6.]
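A two-line sketch of that combination, using the answers and weights as transcribed above (with these exact weights the average works out to 6.0, close to the 5.96 ≈ 6 reported on the slide):

```python
# Weighted combination of the expert answers from the slide above.
answers = [3, 6, 7, 9, 8]                  # hours suggested by each expert
weights = [0.30, 0.20, 0.25, 0.15, 0.10]   # voting weight assigned to each expert

combined = sum(a * w for a, w in zip(answers, weights)) / sum(weights)
print(combined)   # the weighted "vote" of the expert panel
```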

Page 8: AdaBoost new.pdf


Does AdaBoost Work…?

You betcha! The training error of the AdaBoost ensemble can be shown to satisfy

$$\varepsilon_{\text{ensemble}} < 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t\,(1 - \varepsilon_t)}$$

where $\varepsilon_t$ is the error of the t-th hypothesis. Note that this product gets smaller and smaller with each added classifier. But wait... isn't this against Occam's razor?

For an explanation, see Freund and Schapire's paper as well as Schapire's tutorial on boosting and margin theory. More about this later.
Page 9: AdaBoost new.pdf


Occam’s Razor vs. AdaBoost

What to expect:
• training error to decrease with the number of classifiers
• generalization error to increase after a while (overfitting)

What's observed:
• generalization error does not increase even after many, many iterations
• in fact, it even decreases after the training error reaches zero!
• Is Occam's razor of "simple is better" wrong? Violated...?

[Training and test error curves from R. Schapire, http://www.cs.princeton.edu/~schapire/ (letters database)]

Page 10: AdaBoost new.pdf


The Margin Theory

The margin of an instance x roughly describes the confidence of the ensemble in its decision:

$$m(\mathbf{x}) = \mu_k(\mathbf{x}) - \max_{j \neq k}\, \mu_j(\mathbf{x})$$

Loosely speaking, the margin of an instance is simply the difference between the total vote (or fraction of the votes) it receives from the classifiers that identify it correctly and the maximum total vote (or fraction of the votes) it receives for any incorrect class. Here the k-th class is the true class, and $\mu_j(\mathbf{x})$ is the total support (vote) class j receives from all classifiers, such that

$$\sum_{j} \mu_j(\mathbf{x}) = 1$$

The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
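A small sketch of the margin computation, assuming the per-class supports μ_j(x) are already available as normalized vote fractions:

```python
import numpy as np

def margin(mu, true_class):
    """m(x) = mu_k(x) - max_{j != k} mu_j(x), where k is the true class of x."""
    mu = np.asarray(mu, dtype=float)
    rivals = np.delete(mu, true_class)        # supports received by the incorrect classes
    return mu[true_class] - rivals.max()

# Supports sum to 1: a confident correct decision has a large positive margin,
# while a misclassified instance has a negative margin.
print(margin([0.7, 0.2, 0.1], true_class=0))   # large positive margin (confident, correct)
print(margin([0.3, 0.5, 0.2], true_class=0))   # negative margin (misclassified)
```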

Page 11: AdaBoost new.pdf


Margin Theory

[Margin distribution plots from R. Schapire, http://www.cs.princeton.edu/~schapire/ (letters database)]

Page 12: AdaBoost new.pdf


Margin Theory

Large margins imply a lower generalization error (formally, a smaller upper bound on it). If all margins are large, the final decision boundary can be obtained using a simpler classifier (similar to how polls can predict the outcomes of not-so-close races very early on).

They show that boosting tends to increase the margins on the training examples, and argue that an ensemble classifier with larger margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble. More specifically: let H be a finite space of base classifiers. For any δ > 0 and θ > 0, with probability 1 − δ over the random choice of the training set S, any ensemble E = {h_1, ..., h_T} ⊆ H combined through weighted majority voting satisfies

$$P(\text{error}) \leq P\bigl(\text{training margin} \leq \theta\bigr) + O\!\left(\frac{1}{\sqrt{N}}\left(\frac{\log N \,\log |H|}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right)$$

N: number of instances
|H|: cardinality of the classifier space (the weaker the classifiers, the smaller |H|)
P(error) is independent of the number of classifiers!

Page 13: AdaBoost new.pdf


AdaBoost.M2

AdaBoost.M1 requires that all classifiers have a weighted error no greater than ½.

This is the least that can be asked of a classifier in a two-class problem, since an error of ½ is equivalent to random guessing. The probability of error for random guessing is much higher for multi-class problems (specifically, (k − 1)/k for a k-class problem). Therefore, achieving an error below ½ becomes increasingly difficult as the number of classes grows, particularly if the weak classifiers are really weak.

AdaBoost.M2 addresses this problem by removing the ½ restriction on the weighted error; instead, it defines a pseudo-error, which itself is then required to be no larger than ½.

The pseudo-error recognizes that there is information in the outputs a classifier produces for the non-selected / non-winning classes.

• For example, on an OCR problem, 1 and 7 may look alike, and the classifier may give high plausibility outputs to both (and low outputs to all others) when faced with a 1 or a 7.

Page 14: AdaBoost new.pdf


AdaBoost.M2

Algorithm AdaBoost.M2

Input:
• Sequence of m examples S = [(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)] with labels y_i ∈ Y = {1, ..., C}, drawn from a distribution D
• Weak learning algorithm WeakLearn
• Integer T specifying the number of iterations

Let $B = \{(i, y) : i \in \{1, 2, \ldots, m\},\; y \neq y_i\}$.
Initialize $D_1(i, y) = 1/|B|$ for $(i, y) \in B$.

Do for t = 1, 2, ..., T:
1. Call WeakLearn, providing it with the distribution D_t.
2. Get back a hypothesis $h_t : X \times Y \to [0, 1]$.
3. Calculate the pseudo-error of $h_t$:
$$\varepsilon_t = \frac{1}{2} \sum_{(i, y) \in B} D_t(i, y)\,\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr)$$
4. Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.
5. Update the distribution D_t:
$$D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\; \beta_t^{\frac{1}{2}\bigl(1 + h_t(x_i, y_i) - h_t(x_i, y)\bigr)}$$
where $Z_t$ is a normalization constant chosen so that $D_{t+1}$ becomes a distribution.

Output the final hypothesis:
$$h_{\text{final}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left(\log\frac{1}{\beta_t}\right) h_t(x, y)$$
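A minimal sketch of one AdaBoost.M2 round, computing the pseudo-error over the mislabel set B and updating the mislabel distribution; the array/dictionary representations and the random stand-in hypothesis are illustrative assumptions.

```python
import numpy as np

def pseudo_loss_step(H, y, D):
    """One AdaBoost.M2 round: pseudo-error, beta_t, and the updated mislabel distribution.

    H : (m, C) array with H[i, c] = h_t(x_i, c) in [0, 1]  (plausibility outputs)
    y : (m,) true labels in {0, ..., C-1}
    D : dict mapping mislabel pairs (i, c), c != y[i], to their weights D_t(i, c)
    """
    eps = 0.5 * sum(w * (1.0 - H[i, y[i]] + H[i, c]) for (i, c), w in D.items())
    beta = eps / (1.0 - eps)
    new_D = {(i, c): w * beta ** (0.5 * (1.0 + H[i, y[i]] - H[i, c]))
             for (i, c), w in D.items()}
    Z = sum(new_D.values())                     # normalization constant Z_t
    return eps, beta, {pair: w / Z for pair, w in new_D.items()}

# Usage: initialize D_1 uniformly over the mislabel set B = {(i, c) : c != y_i}.
rng = np.random.default_rng(0)
y = np.array([0, 1, 2, 1])
C = 3
B = [(i, c) for i in range(len(y)) for c in range(C) if c != y[i]]
D1 = {pair: 1.0 / len(B) for pair in B}
H = rng.random((len(y), C))                     # stand-in for a weak hypothesis's outputs
eps1, beta1, D2 = pseudo_loss_step(H, y, D1)
print(round(eps1, 3))
```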

Page 15: AdaBoost new.pdf


Incremental Learning

We now pose the following question: if, after training an algorithm, we receive additional data, how can we update the trained classifier to learn the new data? None of the classic algorithms we have seen so far, including MLP, RBF, PNN, KNN, RCE, etc., is capable of incrementally updating its knowledge base to learn new information. The typical procedure is to scrap the previously trained classifier, combine the old and new data, and start over. This causes all the information learned so far to be lost: catastrophic forgetting. Furthermore, what if the old data are no longer available, or what if the new data introduce new classes?

The ensemble-of-classifiers approach, which is generally used to improve the generalization accuracy of a classifier, can also be used to address the issue of incremental learning.

One such implementation of ensemble classifiers for incremental learning is Learn++

Page 16: AdaBoost new.pdf


Incremental Learning

[Illustration of incremental learning: classifiers C1–C4, trained on two datasets (Data1, Data2) in a two-dimensional feature space (Feature 1 vs. Feature 2), are combined by a voting node (Σ) to form the overall decision.]

Page 17: AdaBoost new.pdf


Learn++

So, how do we achieve incremental learning? What prevents us, if anything, in the AdaBoost formulation from learning new data when previously unseen instances are introduced?

Actually, nothing! AdaBoost should work for incremental learning, but it can be made more efficient.

Learn++ modifies the distribution update rule so that the update is based on the ensemble decision, not just on the previous classifier.

Why should this make any difference?

Page 18: AdaBoost new.pdf


Learn++

[Illustration of Learn++: hypotheses h1–h8 generated from data distributions D1 and D2 are combined by weighted majority voting (voting weights W, Σ) to form the learned decision boundary.]

Page 19: AdaBoost new.pdf


Learn++

[Flow diagram of Learn++: for each incoming database (Database 1, Database 2), training subsets D1, D2, ..., Dn are drawn and classifiers C1, C2, ..., Cn are trained, each producing a hypothesis with error E and performance P; the composite hypotheses H2, H3, ..., Hn−1 identify (mis)classified instances, which drive the next distribution; classifier outputs are combined with voting weights W1, ..., Wn (Σ) to produce the final classification.]

Page 20: AdaBoost new.pdf


Algorithm Learn++ (its major differences from AdaBoost.M1 are the composite hypothesis in step 5 and the ensemble-driven weight update in step 6)

Input: For each database drawn from D_k, k = 1, 2, ..., K:
• Sequence of m_k training examples S_k = [(x_1, y_1), (x_2, y_2), ..., (x_{m_k}, y_{m_k})]
• Weak learning algorithm WeakLearn
• Integer T_k specifying the number of iterations

Do for k = 1, 2, ..., K:

Initialize $w_1(i) = D_1(i) = 1/m_k, \; \forall i$, unless there is prior knowledge to select otherwise.

Do for t = 1, 2, ..., T_k:
1. Set $D_t = w_t \big/ \sum_{i=1}^{m_k} w_t(i)$ so that D_t is a distribution.
2. Randomly choose a training subset TR_t according to D_t.
3. Call WeakLearn, providing it with TR_t.
4. Get back a hypothesis h_t : X → Y and calculate its error on S_k:
$$\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
If $\varepsilon_t > \tfrac{1}{2}$, set t = t − 1, discard h_t, and go to step 2. Otherwise, compute the normalized error $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.
5. Call weighted majority voting to obtain the composite hypothesis
$$H_t = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}$$
and compute the composite error
$$E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i)$$
If $E_t > \tfrac{1}{2}$, set t = t − 1, discard H_t, and go to step 2.
6. Set $B_t = E_t / (1 - E_t)$ and update the instance weights:
$$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$

Call weighted majority voting on the combined hypotheses H_t and output the final hypothesis:
$$H_{\text{final}}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \; \sum_{t:\, H_t(x) = y} \log\frac{1}{B_t}$$
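A minimal sketch of the two steps that distinguish Learn++ from AdaBoost.M1: forming the composite hypothesis H_t by weighted majority voting, and updating the instance weights according to the ensemble's (not the latest classifier's) decision. Classifier training, subset drawing, and the outer loops are omitted, and scikit-learn-style classifiers are assumed.

```python
import numpy as np

def composite_hypothesis(hypotheses, log_inv_betas, X, classes):
    """Weighted majority vote of all hypotheses generated so far (the composite H_t)."""
    votes = np.zeros((len(X), len(classes)))
    for h, w in zip(hypotheses, log_inv_betas):
        pred = h.predict(X)                        # assumes scikit-learn-style classifiers
        for c_idx, c in enumerate(classes):
            votes[pred == c, c_idx] += w
    return np.asarray(classes)[votes.argmax(axis=1)]

def learnpp_weight_update(w, H_pred, y):
    """Step 6: multiply the weight of every instance the ensemble already classifies
    correctly by B_t = E_t / (1 - E_t); misclassified instances keep their weight."""
    D = w / w.sum()                                # D_t, the normalized distribution
    correct = H_pred == y
    E = D[~correct].sum()                          # composite error E_t
    if E > 0.5:                                    # caller should discard H_t and redraw
        return None, E
    B = E / (1.0 - E)
    w_next = np.where(correct, w * B, w)
    return w_next, E
```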

Page 21: AdaBoost new.pdf


Simulation Results: Odorant Identification

1. APZ: Apiezon, 2. PIB: Polyisobutylene, 3. DEGA: Poly(diethylene glycol adipate), 4. SG: Solgel, 5. OV275: Poly(siloxane), 6. PDPP: Poly(diphenoxylphosphorazene)

[Bar plots of the responses of sensors 1–6 (normalized 0 to 1) to Ethanol, Toluene, Xylene, TCE, and Octane.]

Data Distribution

Dataset   ET   OC   TL   TCE   XL
S1        20   20   40     0    0
S2         5    5    5    25    0
S3         5    5    5     5   40
TEST      34   34   62    34   40

Page 22: AdaBoost new.pdf


Odorant Identification Results

PNN
Dataset   TS1 (2)   TS2 (3)   TS3 (3)
S1        93.70%    73.70%    76.30%
S2        ---       92.50%    82.50%
S3        ---       ---       85.00%
TEST      58.80%    67.70%    83.80%

RBF
Dataset   TS1 (5)   TS2 (6)   TS3 (7)
S1        97.50%    81.20%    76.20%
S2        ---       97.00%    95.00%
S3        ---       ---       90.00%
TEST      59.30%    67.60%    86.30%

MLP
Dataset   TS1 (10)  TS2 (5)   TS3 (9)
S1        87.50%    75.00%    71.30%
S2        ---       82.50%    85.00%
S3        ---       ---       86.70%
TEST      58.80%    71.10%    87.20%

Page 23: AdaBoost new.pdf


Ultrasonic Weld Inspection (UWI)

[Schematic of a weld inspected from the North and South sides, showing the defect types of interest: slag, incomplete penetration, crack, porosity, and lack of fusion.]

Data Distribution

Dataset   LOF   SLAG   CRACK   POROSITY
S1        300   300      0        0
S2        200   200    200        0
S3        150   150    137       99
TEST      200   200    150      125

Page 24: AdaBoost new.pdf


Ultrasonic Weld Inspection Database

[Example patterns from the UWI database for the four classes: crack, lack of fusion, slag, and porosity.]

Page 25: AdaBoost new.pdf


UWI Results

PNN
Dataset   TS1 (7)   TS2 (7)   TS3 (14)
S1        99.50%    83.60%    71.50%
S2        ---       93.00%    75.00%
S3        ---       ---       99.40%
TEST      48.90%    62.40%    76.10%

RBF
Dataset   TS1 (9)   TS2 (5)   TS3 (14)
S1        90.80%    78.30%    69.80%
S2        ---       89.50%    75.80%
S3        ---       ---       94.70%
TEST      47.70%    60.90%    76.40%

MLP
Dataset   TS1 (8)   TS2 (3)   TS3 (17)
S1        98.20%    84.70%    80.30%
S2        ---       98.70%    86.50%
S3        ---       ---       97.00%
TEST      48.90%    66.80%    78.40%

Page 26: AdaBoost new.pdf


Optical Character Recognition (OCR)

• Handwritten character recognition problem

• 2997 instances, 62 attributes, 10 classes

• Divided into four sets: S1–S4 (2150 instances) for training, TEST (1797 instances) for testing

• Different sets of classes appear in the different datasets

Page 27: AdaBoost new.pdf


OCR Database

Data Distribution

Class   S1    S2    S3    S4    TEST
0       100    50    50    25    178
1         0   150    50     0    182
2       100    50    50    25    177
3         0   150    50    25    183
4       100    50    50     0    181
5         0   150    50    25    182
6       100    50     0   100    181
7         0     0   150    50    179
8       100     0     0   150    174
9         0    50   100    50    180

Page 28: AdaBoost new.pdf


OCR Results

PNN
Dataset   TS1 (2)   TS2 (5)   TS3 (6)   TS4 (3)
S1        99.80%    80.80%    78.20%    92.40%
S2        ---       96.30%    89.00%    88.70%
S3        ---       ---       98.20%    94.00%
S4        ---       ---       ---       88.00%
TEST      48.70%    69.10%    82.80%    86.70%

RBF
Dataset   TS1 (4)   TS2 (6)   TS3 (8)   TS4 (15)
S1        98.00%    81.00%    77.00%    93.40%
S2        ---       96.50%    88.40%    80.60%
S3        ---       ---       93.60%    92.70%
S4        ---       ---       ---       90.00%
TEST      47.80%    73.20%    79.80%    85.90%

MLP
Dataset   TS1 (18)  TS2 (30)  TS3 (23)  TS4 (3)
S1        96.60%    89.80%    86.00%    94.80%
S2        ---       87.10%    89.40%    87.90%
S3        ---       ---       92.00%    92.20%
S4        ---       ---       ---       87.30%
TEST      46.60%    68.90%    82.00%    87.00%

Page 29: AdaBoost new.pdf


Learn++

Implementation Issues

Distribution initialization when a new dataset becomes available:
Solution: Start with a uniform distribution, and update that distribution based on the performance of the existing ensemble on the new data.

When to stop training on each dataset?
Solution: Use a validation dataset, if one is available.
• Or, keep training until the performance on the test data peaks (mild cheating).

Classifier proliferation when new classes are added: enough additional classifiers need to be generated to out-vote the existing classifiers, which cannot correctly predict the new class.
Solution: Learn++.MT (after Muhlbaier & Topalis)

Page 30: AdaBoost new.pdf


Learn++.MT: The Concept

[Concept illustration: classifiers are queried with an instance ("4?") of a class that only some of them have been trained on; those that have never seen the class should not be able to out-vote those that have. © 2004 All Rights Reserved, Muhlbaier and Topalis]

Page 31: AdaBoost new.pdf


Learn++.MT

Learn++.MT creates a preliminary class confidence on each instance and updates the weights of the classifiers that have not seen a particular class.

Each classifier is assigned a weight based on its performance on the training data. The preliminary class confidence is obtained by summing the weights of all classifiers that picked a given class and dividing by the sum of the weights of all classifiers that have been trained on that class.

Learn++.MT then updates (lowers) the weights of the classifiers that have not been trained with the class in question:

$$P_c(x_i) = \frac{\sum_{t:\, h_t(x_i) = c} W_t}{\sum_{t:\, c \in CTr_t} W_t}, \qquad c = 1, 2, \ldots, C$$

$$W_{t:\, c \notin CTr_t}(x_i) = W_{t:\, c \notin CTr_t}\,\bigl(1 - P_c(x_i)\bigr)$$

Here $\{t : c \in CTr_t\}$ is the set of classifiers that have seen class c, $\{t : h_t(x_i) = c\}$ is the set of classifiers that picked class c, and $P_c(x_i)$ is the preliminary confidence, from the classifiers that have seen class c, that $x_i$ belongs to class c.
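A sketch of this dynamically weighted voting for a single instance, under one reading of the update order (compute all preliminary confidences from the initial weights, then down-weight); the example values are hypothetical.

```python
import numpy as np

def dynamically_weighted_vote(preds, betas, ctr, n_classes):
    """Learn++.MT-style dynamically weighted voting for one instance.

    preds : preds[t] = class predicted by classifier t for this instance
    betas : normalized errors beta_t of each classifier
    ctr   : ctr[t] = set of class labels classifier t was trained on (CTr_t)
    """
    W = np.log(1.0 / np.asarray(betas))           # initial voting weights W_t = log(1/beta_t)
    # Preliminary per-class confidence from the classifiers that have seen class c
    P = np.zeros(n_classes)
    for c in range(n_classes):
        Z_c = sum(W[t] for t in range(len(W)) if c in ctr[t])
        picked = sum(W[t] for t in range(len(W)) if preds[t] == c)
        P[c] = picked / Z_c if Z_c > 0 else 0.0
    # Down-weight every classifier never trained on class c by (1 - P_c)
    for c in range(n_classes):
        for t in range(len(W)):
            if c not in ctr[t]:
                W[t] *= (1.0 - P[c])
    # Final decision: weighted vote with the adjusted weights
    votes = np.zeros(n_classes)
    for t, p in enumerate(preds):
        votes[p] += W[t]
    return int(votes.argmax())

# Hypothetical example: classifier 0 never saw class 2; classifiers 1 and 2 did and both pick it.
print(dynamically_weighted_vote(preds=[0, 2, 2], betas=[0.2, 0.3, 0.25],
                                ctr=[{0, 1}, {0, 1, 2}, {1, 2}], n_classes=3))
# -> 2: the classifier that never saw class 2 is effectively silenced.
```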

Page 32: AdaBoost new.pdf


Learn++.MT

Algorithm Learn++.MT

Input: For each dataset D_k, k = 1, 2, ..., K:
• Sequence of m_k instances x_i with labels $y_i \in Y_k = \{1, \ldots, C\}$
• Weak learning algorithm BaseClassifier
• Integer T_k specifying the number of iterations

Do for k = 1, 2, ..., K:

If k = 1, initialize $w_1 = D_1(i) = 1/m$ and $eT_1 = 0$ for all i.
Else, go to Step 5: evaluate the current ensemble on the new dataset D_k, update the weights, and recall the current number of classifiers, $eT_k = \sum_{j=1}^{k-1} T_j$.

Do for t = eT_k + 1, eT_k + 2, ..., eT_k + T_k:
1. Set $D_t = w_t \big/ \sum_{i=1}^{m} w_t(i)$ so that D_t is a distribution.
2. Call BaseClassifier with a subset of D_k chosen using D_t.
3. Obtain h_t : X → Y and calculate its error $\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$. If $\varepsilon_t > \tfrac{1}{2}$, discard h_t and go to step 2. Otherwise, compute the normalized error $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.
4. Set CTr_t = Y_k, to save the labels of the classes used in training h_t.
5. Call DWV to obtain the composite hypothesis H_t.
6. Compute the error of the composite hypothesis, $E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i)$.
7. Set $B_t = E_t / (1 - E_t)$ and update the instance weights:
$$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$

Call DWV to obtain the final hypothesis, H_final.

Algorithm Dynamically Weighted Voting (DWV)

Input:
• Sequence of i = 1, ..., n training instances, or any test instance x_i
• Classifiers h_t
• Corresponding error values β_t
• Classes CTr_t used in training h_t

For t = 1, 2, ..., T, where T is the total number of classifiers:
1. Initialize the classifier weights $W_t = \log(1/\beta_t)$.
2. Create a normalization factor for each class: $Z_c = \sum_{t:\, c \in CTr_t} W_t$, for c = 1, 2, ..., C.
3. Obtain the preliminary decision $P_c = \sum_{t:\, h_t(x_i) = c} W_t \big/ Z_c$.
4. Update the voting weights: $W_{t:\, c \notin CTr_t} = W_{t:\, c \notin CTr_t}\,(1 - P_c)$.
5. Compute the final / composite hypothesis: $H_{\text{final}}(x_i) = \arg\max_c \sum_{t:\, h_t(x_i) = c} W_t$.

Page 33: AdaBoost new.pdf


Learn++.MT Simulation: VOC Data

Test procedure: Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset. The number of classifiers generated in each training session was chosen to optimize each algorithm's performance. Learn++ appeared to generate the best results when 6 classifiers were generated on the first dataset, 12 on the next, and 18 on the last. Learn++.MT, on the other hand, performed optimally with 6 classifiers in the first training session, 4 in the second, and 6 in the last.

Class       1    2    3    4    5
Dataset 1   20   20   40    0    0
Dataset 2   10   10   10   25    0
Dataset 3   10   10   10   15   40
Test        24   24   52   24   40

Page 34: AdaBoost new.pdf


Learn++.MT Simulation: VOC Data

Training Session   # Classifiers Added        Performance
                   Learn++    Learn++.MT      Learn++    Learn++.MT
TS1                6          6               55%        54%
TS2                12         4               64%        63%
TS3                18         6               69%        81%

© 2004 All Rights Reserved Muhlbaier and Topalis

This test was executed twenty times for both Learn++ and Learn++.MT to obtain a well-represented estimate of the generalization performance on the test data.

Page 35: AdaBoost new.pdf


Learning from Unbalanced Data: Learn++.MT2

Learn++.MT2 was created to account for the unbalanced data problem. We define unbalanced data as any discrepancy in the cardinalities of the datasets used in incremental learning.

If one dataset has substantially more data than the other(s), the ensemble decision might be unfairly biased towards the data with the lower cardinality.

Under the generally valid assumptions that
• no instance is repeated in any dataset, and
• the noise distribution remains relatively unchanged among datasets,
it is reasonable to believe that a dataset with more instances carries more information. Classifiers generated with such data should therefore be weighted more heavily.

It is not unusual to see major discrepancies in the cardinalities of datasets that subsequently become available. The cardinality of each dataset, including the relative cardinalities of the individual classes within a dataset, should be taken into consideration by any ensemble-based learning algorithm that employs a classifier combination scheme.

Page 36: AdaBoost new.pdf


Learn++.MT2 Algorithm

The primary novelty in Learn++.MT2 is the way the voting weights are determined.

Learn++.MT2 attempts to address the unbalanced data problem by keeping track of the number of instances from each class with which each classifier is trained. Each classifier is first given a weight based on its performance on its own training data.

• This weight is later adjusted according to its class-conditional weight factor, w_{t,c}.
• For each classifier, this factor is proportional to the ratio of the number of instances from a particular class used for training that classifier to the number of instances from that class used for training all classifiers generated thus far within the ensemble.

The final decision is made as in Learn++, but using the class-conditional weights:

$$w_{t,c} = p_t\, \frac{n_c}{N_c}$$

$$H_{\text{final}}(x_i) = \arg\max_{c \in Y_k} \sum_{t:\, h_t(x_i) = c} w_{t,c}$$

p_t: training performance of the t-th classifier
n_c: number of class-c instances in the current dataset
N_c: number of all class-c instances seen so far
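A short sketch of the class-conditional weighting, with illustrative variable names:

```python
import numpy as np

def class_conditional_weights(p_t, n_c, N_c):
    """w_{t,c} = p_t * n_c / N_c for a single classifier t and every class c.

    p_t : training performance of classifier t on its own training data
    n_c : per-class instance counts in the dataset classifier t was trained on
    N_c : per-class instance counts seen by the ensemble so far
    """
    return p_t * np.asarray(n_c, dtype=float) / np.asarray(N_c, dtype=float)

def mt2_decision(preds, w):
    """H_final(x) = argmax_c of the sum over {t: h_t(x) = c} of w_{t,c}, for one instance.

    preds : preds[t] = class label predicted by classifier t
    w     : (T, C) array of class-conditional voting weights w_{t,c}
    """
    votes = np.zeros(w.shape[1])
    for t, c in enumerate(preds):
        votes[c] += w[t, c]
    return int(votes.argmax())
```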

Page 37: AdaBoost new.pdf


Experimental Setup

Learn++.MT2 has been tested on three databases: the Wine database from UCI (3 classes, 13 features); the Optical Character Recognition database from UCI (10 classes, 64 features); and a real-world gas identification problem for determining one of five volatile organic compounds (VOCs) from chemical sensor data (5 classes, 6 features).

The base classifiers were all single-layer MLPs, normally incapable of learning incrementally, with 12–40 nodes and an error goal of 0.05–0.025. In each case the data distributions were designed to simulate unbalanced data.

To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively. This number was selected through experimental testing, such that the tests show an accurate and unbiased comparison of the algorithms.

Page 38: AdaBoost new.pdf


Wine Recognition Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        88%    84%    2.9%
Learn++.MT2    88%    89%    1.6%

Observations:

• After the initial training session (TS1), the performances of both algorithms are the same;

• After TS2, Learn++.MT2 outperforms Learn++;

• No performance degradation is seen with Learn++.MT2 after the second dataset is introduced; Learn++.MT2 is more stable.

Page 39: AdaBoost new.pdf


VOC Recognition Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        89%    86%    2.1%
Learn++.MT2    88%    89%    1.9%

Similar observations as for the Wine data:

• Again, performances are virtually identical after TS1.

• After TS2, Learn++.MT2 outperforms Learn++;

• No performance degradation is seen with Learn++.MT2 after the second dataset is introduced. Learn++.MT2 is more stable; a precise termination point is not required.

Page 40: AdaBoost new.pdf


OCR Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        94%    92%    0.9%
Learn++.MT2    94%    95%    0.6%

Observations:

• Drastically unbalanced data: the majority of the information is contained in Dataset 1, which explains the high TS1 performance.

• Due to the imbalance, Learn++ performance declines with Dataset 2, whereas Learn++.MT2 provides a modest gain.

Page 41: AdaBoost new.pdf


OCR Reverse Presentation

Algorithm      TS1    TS2    Std. Dev.
Learn++        85%    91%    0.7%
Learn++.MT2    88%    94%    0.6%

Observations:

• Reversed scenario: little information is initially provided, followed by more substantial data. The final performance remains essentially unchanged: the algorithm is immune to the order of presentation.

• The momentary dip in Learn++.MT2 performance as a new dataset is introduced, ironically, justifies the approach taken. Why...?

Page 42: AdaBoost new.pdf


Some Open Problems

Is the distribution update rule used in Learn++ optimal? Can a weighted combination of the AdaBoost and Learn++ update rules do better?
Is there a better initialization scheme?
Can Learn++ be used in a non-stationary learning environment, where the data distribution changes (in which case it may be necessary to forget some of the previously learned information, i.e., throw away some classifiers)?
How can Learn++ be updated / initialized if the training data is known to be very unbalanced, with new classes being introduced?
Can the performance of Learn++ on incremental learning be theoretically justified?
Does Learn++ create more or less diverse classifiers? An analysis of the algorithm on several diversity measures.
Can Learn++ be used on function approximation problems?
How does Learn++ behave under different combination scenarios?

Page 43: AdaBoost new.pdf


Other Ensemble Techniques

There are several other ensemble techniques, including:
Stacked generalization
Hierarchical mixture of experts
Random forests

Page 44: AdaBoost new.pdf


Stacked Generalization

[Diagram of stacked generalization: first-layer classifiers C_1, ..., C_N with parameters θ_1, ..., θ_N map the input x to outputs h_1(x, θ_1), ..., h_N(x, θ_N), which are fed to a second-layer classifier C_{N+1} with parameters θ_{N+1} that produces the final decision.]
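For concreteness, here is a minimal binary-classification sketch of this two-layer idea in scikit-learn style; the particular base learners, the logistic-regression second layer, and the use of out-of-fold predictions as meta-level features are illustrative choices, not taken from the slide.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def fit_stack(X, y):
    """First-layer classifiers feed their outputs to a second-layer classifier."""
    base = [DecisionTreeClassifier(max_depth=3), KNeighborsClassifier(5)]
    # Out-of-fold predictions of the first layer become the meta-level features,
    # so the second layer is not trained on outputs the base learners memorized.
    meta_X = np.column_stack(
        [cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1] for clf in base]
    )
    meta = LogisticRegression().fit(meta_X, y)
    base = [clf.fit(X, y) for clf in base]      # refit the first layer on all data
    return base, meta

def predict_stack(base, meta, X):
    meta_X = np.column_stack([clf.predict_proba(X)[:, 1] for clf in base])
    return meta.predict(meta_X)
```

Training the second layer on out-of-fold predictions, rather than on resubstitution outputs, follows Wolpert's original stacked generalization recipe and keeps the meta-level classifier from simply learning which base learner overfits.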

Page 45: AdaBoost new.pdf


Mixture of Experts

[Diagram of the mixture-of-experts architecture: expert classifiers C_1, ..., C_T with parameters θ_1, ..., θ_T produce outputs h_1(x, θ_1), ..., h_T(x, θ_T); a gating network (usually trained with Expectation Maximization) produces input-dependent weights w_1, ..., w_T that drive the pooling / combining system, which may be stochastic, winner-takes-all, or a weighted average, yielding the final decision.]
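A toy sketch of the gating idea: a softmax gate produces input-dependent weights that blend the expert outputs. The linear gate, the weighted-average combination, and the two toy experts are illustrative assumptions; in practice the gate and the experts are trained jointly, usually with Expectation Maximization, as the figure notes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()             # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(x, experts, gate_W):
    """Combine expert outputs h_k(x) with input-dependent gating weights w_k(x)."""
    gate = softmax(gate_W @ x)                     # w_1, ..., w_T depend on the input x
    outputs = np.array([h(x) for h in experts])    # each expert's prediction h_k(x, theta_k)
    return gate @ outputs                          # weighted-average combination

# Hypothetical two-expert example on a 2-D input
experts = [lambda x: x[0] + x[1], lambda x: x[0] - x[1]]
gate_W = np.array([[1.0, 0.0], [0.0, 1.0]])        # illustrative gating parameters
print(mixture_of_experts(np.array([0.3, 0.7]), experts, gate_W))
```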

