# AdaBoost & Learn++


http://engineering.rowan.edu/~polikar/CLASSES/ECE555

Advanced Topics in Pattern Recognition 2005, Robi Polikar, Rowan University, Glassboro, NJ

Advanced Topics in Pattern Recognition

Dept. of Electrical and Computer Engineering, 0909.555.01

Week 9: AdaBoost & Learn++

Graphical Drawings: Pattern Classification, Duda, Hart and Stork, Copyright John Wiley and Sons, 2001.

- AdaBoost
  - AdaBoost.M1
  - AdaBoost.M2
  - AdaBoost.R
- Learn++


This Week in PR

- AdaBoost and Variations
  - AdaBoost.M1
  - AdaBoost.M2
  - AdaBoost.R (independent research)
- Learn++
- Bias Variance Analysis


AdaBoost

Arguably the most popular and successful of all ensemble generation algorithms, AdaBoost (Adaptive Boosting) extends the original boosting algorithm to multi-class problems. Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.

AdaBoost solves the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb. It generates an ensemble of classifiers, the training data of each drawn from a distribution that starts out uniform and iteratively changes into one that gives more weight to those instances that are misclassified.

Each classifier in AdaBoost therefore focuses increasingly on the instances that are more difficult to classify.

The classifiers are then combined through weighted majority voting.


AdaBoost

Algorithm AdaBoost (conceptual):
1. Create a discrete distribution over the training data by assigning a weight to each instance. Initially the distribution is uniform, hence all weights are equal.
2. Draw a subset from this distribution and train a weak classifier on it.
3. Compute the error ε of this classifier on its own training data. Make sure that this error is less than ½.
4. Test the entire training data on this classifier: if an instance x is correctly classified, reduce its weight (proportionally to β = ε/(1 - ε)); if it is misclassified, leave its weight unchanged, so its relative weight increases. Normalize the weights so that they constitute a distribution.
5. Repeat until T classifiers are generated.
6. Combine the classifiers using weighted majority voting.
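The steps above can be sketched in plain Python (a minimal illustration, not the authors' implementation: the weak learner is a hypothetical 1-D decision stump, and it is trained directly on the weighted data instead of resampling a subset):

```python
import math

def train_stump(X, y, D):
    """Weak learner: the threshold/polarity stump with lowest weighted error."""
    best = None
    for thr in sorted(set(X)):
        for pol in (1, -1):
            pred = [pol if x >= thr else -pol for x in X]
            err = sum(d for d, p, t in zip(D, pred, y) if p != t)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    err, thr, pol = best
    return (lambda x, thr=thr, pol=pol: pol if x >= thr else -pol), err

def adaboost(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                        # step 1: uniform distribution
    ensemble = []
    for _ in range(T):
        h, eps = train_stump(X, y, D)        # steps 2-3: train, measure error
        if eps >= 0.5:                       # must beat random guessing
            break
        eps = max(eps, 1e-10)                # guard against log(0) below
        beta = eps / (1.0 - eps)
        # step 4: shrink weights of correctly classified instances by beta,
        # leave misclassified ones alone, then renormalize
        D = [d * (beta if h(x) == t else 1.0) for d, x, t in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((h, math.log(1.0 / beta)))   # voting weight log(1/beta)
    return ensemble

def predict(ensemble, x):
    """Step 6: weighted majority vote across the ensemble."""
    votes = {}
    for h, w in ensemble:
        votes[h(x)] = votes.get(h(x), 0.0) + w
    return max(votes, key=votes.get)
```

On separable 1-D data a single stump already suffices; the interesting behavior appears when no single stump reaches zero error and the reweighting pushes later stumps toward the hard instances.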


AdaBoost.M1

[Figure: AdaBoost.M1 block diagram. Subsets S1 ... ST are drawn from the training-data distributions D1 ... DT; each subset trains a hypothesis h1 ... hT (classifiers C1 ... CT). The normalized error of each classifier updates the distribution, and the ensemble decision is formed with voting weights log(1/β). Diagram: Robi Polikar.]


AdaBoost.M1

Algorithm AdaBoost.M1
Input:
- Sequence of m examples S = [(x1,y1), (x2,y2), ..., (xm,ym)] with labels y_i ∈ Y = {1,...,C}, drawn from a distribution D
- Weak learning algorithm WeakLearn
- Integer T specifying the number of iterations

Initialize D_1(i) = 1/m for all i.

Do for t = 1,2,...,T:
1. Call WeakLearn, providing it with the distribution D_t.
2. Get back a hypothesis h_t : X → Y.
3. Calculate the error of h_t:

       ε_t = Σ_{i: h_t(x_i)≠y_i} D_t(i)

   If ε_t > ½, then set T = t - 1 and abort the loop.
4. Set β_t = ε_t / (1 - ε_t).
5. Update the distribution D_t:

       D_{t+1}(i) = (D_t(i) / Z_t) × { β_t  if h_t(x_i) = y_i;  1  otherwise }

   where Z_t = Σ_i D_t(i)·(·) is a normalization constant chosen so that D_{t+1} becomes a distribution function.

Output the final hypothesis:

    h_final(x) = argmax_{y∈Y} Σ_{t: h_t(x)=y} log(1/β_t)


Weighted Majority Voting: Demystified!

Problem: How many hours a day should our students spend working on homework?

Five experts answer: 3, 6, 7, 9, 8. A weight assigner gives them the weights 0.30, 0.20, 0.25, 0.15, 0.10. Weighted majority voting then combines the answers into a single estimate: 5.96 ≈ 6.
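In code, this combination is a one-liner. The pairing of answers to weights below is an assumption (the slide lists them on separate rows), which is presumably why the slide's 5.96 differs slightly from the exact value obtained here:

```python
answers = [3, 6, 7, 9, 8]                   # each expert's answer (hours)
weights = [0.30, 0.20, 0.25, 0.15, 0.10]    # the weight assigner's output

# on a numeric question, weighted majority voting reduces to a weighted mean
estimate = sum(w * a for w, a in zip(weights, answers))
print(round(estimate))   # about 6 hours
```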


Does AdaBoost Work?

You betcha! The training error E of AdaBoost can be shown to be bounded by

    E < 2^T × Π_{t=1..T} √( ε_t (1 - ε_t) )

where ε_t is the error of the t-th hypothesis. Each factor 2√(ε_t(1 - ε_t)) is below 1 whenever ε_t < ½, so this product gets smaller and smaller with each added classifier.

But wait: isn't this against Occam's razor? For an explanation, see Freund and Schapire's paper, as well as Schapire's tutorial on boosting and margin theory. More about this later.
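A quick numeric check of the bound (a sketch; `error_bound` simply evaluates the product above): weak classifiers shrink it slowly, stronger ones collapse it fast.

```python
import math

def error_bound(epsilons):
    """Training-error bound: product over t of 2*sqrt(eps_t * (1 - eps_t))."""
    bound = 1.0
    for eps in epsilons:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound

# five barely-better-than-random classifiers vs. five stronger ones
weak = error_bound([0.4] * 5)    # each factor is ~0.98: slow decay
strong = error_bound([0.1] * 5)  # each factor is 0.6: rapid decay
```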


Occam's Razor vs. AdaBoost

What to expect:
- training error to decrease with the number of classifiers
- generalization error to increase after a while (overfitting)

What's observed:
- generalization error does not increase, even after many, many iterations
- in fact, it even decreases after the training error has reached zero!

Is Occam's razor of "simpler is better" wrong? Violated?

From R. Schapire, http://www.cs.princeton.edu/~schapire/ (letters database)


The Margin Theory

The margin of an instance x roughly describes the confidence of the ensemble in its decision. Loosely speaking, the margin of an instance is the difference between the total (or fraction of) vote(s) it receives from correctly identifying classifiers and the maximum (or fraction of) vote(s) it receives for any incorrect class:

    m(x) = μ_k(x) - max_{j≠k} μ_j(x)

where the k-th class is the true class, and μ_j(x) is the total support (vote) class j receives from all classifiers, normalized so that

    Σ_{j=1..C} μ_j(x) = 1

The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
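A minimal sketch of the margin computation, assuming the per-class supports μ_j(x) have already been normalized to sum to 1:

```python
def margin(support, true_class):
    """Margin of one instance: support for the true class minus the
    largest support received by any incorrect class."""
    wrong = max(s for c, s in support.items() if c != true_class)
    return support[true_class] - wrong

# class 'A' is the true class in both cases
confident = margin({'A': 0.6, 'B': 0.3, 'C': 0.1}, 'A')   # ~0.3: correct, confident
mistaken = margin({'A': 0.2, 'B': 0.7, 'C': 0.1}, 'A')    # ~-0.5: misclassified
```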


Margin Theory

[Figure: margin distributions on the training data. From R. Schapire, http://www.cs.princeton.edu/~schapire/ (letters database)]


Margin Theory

Large margins yield a lower bound on the generalization error. If all margins are large, the final decision boundary can be obtained using a simpler classifier (much as polls can predict the outcomes of not-so-close races very early on).

Schapire et al. show that boosting tends to increase the margins on the training data examples, and argue that an ensemble classifier with larger margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble.

More specifically: let H be a finite space of base classifiers. For any θ > 0 and δ > 0, with probability 1 - δ over the random choice of the training set S, any ensemble E = {h1, ..., hT} ⊆ H combined through weighted majority satisfies

    P(error) ≤ P(training margin ≤ θ) + O( [ (log N · log|H|) / θ² + log(1/δ) ]^{1/2} / √N )

N: number of instances
|H|: cardinality of the classifier space; the weaker the classifiers, the smaller |H|
Note that the bound on P(error) is independent of the number of classifiers!


AdaBoost.M2

AdaBoost.M1 requires that all classifiers have a weighted error no greater than ½. This is the least that can be asked from a classifier in a two-class problem, since an error of ½ is equivalent to random guessing. The probability of error for random guessing is much higher for multi-class problems (specifically, (k-1)/k for a k-class problem). Therefore, achieving an error below ½ becomes increasingly difficult for larger numbers of classes, particularly if the weak classifiers are really weak.

AdaBoost.M2 addresses this problem by removing the weighted-error restriction; instead it defines the pseudo-error, which itself is then required to be no larger than ½. The pseudo-error recognizes that there is information in the outputs of classifiers for the non-selected / non-winning classes. On an OCR problem, for example, 1 and 7 may look alike, and the classifier may give high plausibility outputs to both and low outputs to all others when faced with a 1 or a 7.


AdaBoost.M2

Algorithm AdaBoost.M2
Input:
- Sequence of m examples S = [(x1,y1), (x2,y2), ..., (xm,ym)] with labels y_i ∈ Y = {1,...,C}, drawn from a distribution D
- Weak learning algorithm WeakLearn
- Integer T specifying the number of iterations

Let B = {(i, y) : i ∈ {1,...,m}, y ≠ y_i}.
Initialize D_1(i, y) = 1/|B| for (i, y) ∈ B.

Do for t = 1,2,...,T:
1. Call WeakLearn, providing it with the distribution D_t.
2. Get back a hypothesis h_t : X × Y → [0, 1].
3. Calculate the pseudo-error of h_t:

       ε_t = ½ Σ_{(i,y)∈B} D_t(i, y) (1 - h_t(x_i, y_i) + h_t(x_i, y))

4. Set β_t = ε_t / (1 - ε_t).
5. Update the distribution D_t:

       D_{t+1}(i, y) = (D_t(i, y) / Z_t) × β_t^{(1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))}

   where Z_t is a normalization constant chosen so that D_{t+1} becomes a distribution function.

Output the final hypothesis:

    h_final(x) = argmax_{y∈Y} Σ_{t=1..T} log(1/β_t) · h_t(x, y)


Incremental Learning

We now pose the following question: if, after training an algorithm, we receive additional data, how can we update the trained classifier to learn the new data?
- None of the classic algorithms we have seen so far (MLP, RBF, PNN, KNN, RCE, etc.) is capable of incrementally updating its knowledge base to learn new information.
- The typical procedure is to discard the previously trained classifier, combine the old and new data, and start over. This causes all of the information learned so far to be lost: catastrophic forgetting.
- Furthermore, what if the old data is no longer available, or the new data introduces new classes?

The ensemble-of-classifiers approach, generally used for improving the generalization accuracy of a classifier, can also be used to address incremental learning. One such implementation of ensemble classifiers for incremental learning is Learn++.


Incremental Learning

[Figure: two-dimensional feature space (Feature 1 vs. Feature 2) in which datasets Data1 and Data2 arrive sequentially, introducing instances of classes C1-C4.]


Learn++

So, how do we achieve incremental learning?
- What, if anything, in the AdaBoost formulation prevents us from learning new data when previously unseen instances are introduced?
- Actually, nothing! AdaBoost should work for incremental learning, but it can be made more efficient.

Learn++ modifies the distribution update rule so that the update is based on the ensemble decision, not just the previous classifier. Why should this make any difference?
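The two update rules share the same form; only the source of the predictions differs. A schematic sketch (not the full Learn++ algorithm):

```python
def update_weights(w, preds, y_true, beta):
    """Shared update form: shrink the weight of every instance that `preds`
    gets right by the factor beta, leave errors alone, renormalize."""
    w = [wi * (beta if p == t else 1.0) for wi, p, t in zip(w, preds, y_true)]
    Z = sum(w)
    return [wi / Z for wi in w]

# AdaBoost: `preds` come from the single classifier just trained.
# Learn++:  `preds` come from the weighted-majority vote of the ensemble
#           built so far, so weight concentrates on instances the *ensemble*
#           still gets wrong -- e.g., instances of a brand-new class that no
#           existing classifier can predict.
```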


Learn++

[Figure: Learn++ learned decision boundary. Classifiers h1-h8 trained on distributions D1 and D2 are combined through weighted majority voting (voting weights W) to form the learned decision boundary.]


Learn++

[Figure: Learn++ block diagram. For each database, subsets D1, D2, ... drawn from the current distribution train classifiers C1, C2, ... with errors E1, E2, ... and performances P1, P2, ...; instances (mis)classified by the composite hypotheses H1, H2, ... drive the distribution update, and the hypotheses are combined with voting weights W1 ... Wn for the final classification.]


Algorithm Learn++ (with the major differences from AdaBoost.M1 marked (*))
Input: For each database drawn from D_k, k = 1,2,...,K:
- Sequence of m_k training examples S_k = [(x1,y1), (x2,y2), ..., (x_mk, y_mk)]
- Weak learning algorithm WeakLearn
- Integer T_k, specifying the number of iterations

Do for k = 1,2,...,K:

Initialize w_1(i) = D_1(i) = 1/m_k for all i, unless there is prior knowledge to select otherwise.

Do for t = 1,2,...,T_k:
1. Set D_t = w_t / Σ_{i=1..m_k} w_t(i) so that D_t is a distribution.
2. Randomly choose a training data subset TR_t according to D_t.
3. Call WeakLearn, providing it with TR_t.
4. Get back a hypothesis h_t : X → Y, and calculate its error on S_k:

       ε_t = Σ_{i: h_t(x_i)≠y_i} D_t(i)

   If ε_t > ½, set t = t - 1, discard h_t and go to step 2. Otherwise, compute the normalized error β_t = ε_t / (1 - ε_t).
5. (*) Call weighted majority voting to obtain the composite hypothesis

       H_t = argmax_{y∈Y} Σ_{t: h_t(x)=y} log(1/β_t)

   and compute the composite error

       E_t = Σ_{i: H_t(x_i)≠y_i} D_t(i) = Σ_{i=1..m_k} D_t(i) [|H_t(x_i) ≠ y_i|]

   If E_t > ½, set t = t - 1, discard H_t and go to step 2.
6. (*) Set B_t = E_t / (1 - E_t), and update the weights of the instances:

       w_{t+1}(i) = w_t(i) × { B_t  if H_t(x_i) = y_i;  1  otherwise }  = w_t(i) × B_t^{1 - [|H_t(x_i) ≠ y_i|]}

Call weighted majority voting on the combined hypotheses H_t and output the final hypothesis:

    H_final(x) = argmax_{y∈Y} Σ_{k=1..K} Σ_{t: H_t(x)=y} log(1/β_t)

Learn++


Simulation Results: Odorant Identification

Sensor coatings: 1. APZ: Apiezon; 2. PIB: Polyisobutylene; 3. DEGA: Poly(diethyleneglycol adipate); 4. SG: Solgel; 5. OV275: Poly(siloxane); 6. PDPP: Poly(diphenoxylphosphorazene)

[Figure: normalized response patterns of the six sensors (1-6) to Ethanol, Toluene, Xylene, TCE, and Octane.]

Data Distribution

| Dataset | ET | OC | TL | TCE | XL |
|---|---|---|---|---|---|
| S1 | 20 | 20 | 40 | 0 | 0 |
| S2 | 5 | 5 | 5 | 25 | 0 |
| S3 | 5 | 5 | 5 | 5 | 40 |
| TEST | 34 | 34 | 62 | 34 | 40 |


Odorant Identification Results

PNN:

| Dataset | TS1 (2) | TS2 (3) | TS3 (3) |
|---|---|---|---|
| S1 | 93.70% | 73.70% | 76.30% |
| S2 | --- | 92.50% | 82.50% |
| S3 | --- | --- | 85.00% |
| TEST | 58.80% | 67.70% | 83.80% |

RBF:

| Dataset | TS1 (5) | TS2 (6) | TS3 (7) |
|---|---|---|---|
| S1 | 97.50% | 81.20% | 76.20% |
| S2 | --- | 97.00% | 95.00% |
| S3 | --- | --- | 90.00% |
| TEST | 59.30% | 67.60% | 86.30% |

MLP:

| Dataset | TS1 (10) | TS2 (5) | TS3 (9) |
|---|---|---|---|
| S1 | 87.50% | 75.00% | 71.30% |
| S2 | --- | 82.50% | 85.00% |
| S3 | --- | --- | 86.70% |
| TEST | 58.80% | 71.10% | 87.20% |


Ultrasonic Weld Inspection (UWI)

[Figure: ultrasonic weld inspection geometry (North/South scans) with defect types: slag, incomplete penetration, crack, porosity, and lack of fusion.]

Data Distribution

| Dataset | LOF | SLAG | CRACK | POROSITY |
|---|---|---|---|---|
| S1 | 300 | 300 | 0 | 0 |
| S2 | 200 | 200 | 200 | 0 |
| S3 | 150 | 150 | 137 | 99 |
| TEST | 200 | 200 | 150 | 125 |


Ultrasonic Weld Inspection Database

[Figure: example ultrasonic scans for Crack, Lack of Fusion, Slag, and Porosity defects.]


UWI Results

PNN:

| Dataset | TS1 (7) | TS2 (7) | TS3 (14) |
|---|---|---|---|
| S1 | 99.50% | 83.60% | 71.50% |
| S2 | --- | 93.00% | 75.00% |
| S3 | --- | --- | 99.40% |
| TEST | 48.90% | 62.40% | 76.10% |

RBF:

| Dataset | TS1 (9) | TS2 (5) | TS3 (14) |
|---|---|---|---|
| S1 | 90.80% | 78.30% | 69.80% |
| S2 | --- | 89.50% | 75.80% |
| S3 | --- | --- | 94.70% |
| TEST | 47.70% | 60.90% | 76.40% |

MLP:

| Dataset | TS1 (8) | TS2 (3) | TS3 (17) |
|---|---|---|---|
| S1 | 98.20% | 84.70% | 80.30% |
| S2 | --- | 98.70% | 86.50% |
| S3 | --- | --- | 97.00% |
| TEST | 48.90% | 66.80% | 78.40% |


Optical Character Recognition (OCR)

- Handwritten character recognition problem
- 2997 instances, 62 attributes, 10 classes
- Divided into S1 ~ S4 (2150 instances) for training and TEST (1797 instances) for testing
- Different sets of classes appear in different datasets


OCR Database

Data Distribution

| Class | S1 | S2 | S3 | S4 | TEST |
|---|---|---|---|---|---|
| 0 | 100 | 50 | 50 | 25 | 178 |
| 1 | 0 | 150 | 50 | 0 | 182 |
| 2 | 100 | 50 | 50 | 25 | 177 |
| 3 | 0 | 150 | 50 | 25 | 183 |
| 4 | 100 | 50 | 50 | 0 | 181 |
| 5 | 0 | 150 | 50 | 25 | 182 |
| 6 | 100 | 50 | 0 | 100 | 181 |
| 7 | 0 | 0 | 150 | 50 | 179 |
| 8 | 100 | 0 | 0 | 150 | 174 |
| 9 | 0 | 50 | 100 | 50 | 180 |


OCR Results

PNN:

| Dataset | TS1 (2) | TS2 (5) | TS3 (6) | TS4 (3) |
|---|---|---|---|---|
| S1 | 99.80% | 80.80% | 78.20% | 92.40% |
| S2 | --- | 96.30% | 89.00% | 88.70% |
| S3 | --- | --- | 98.20% | 94.00% |
| S4 | --- | --- | --- | 88.00% |
| TEST | 48.70% | 69.10% | 82.80% | 86.70% |

RBF:

| Dataset | TS1 (4) | TS2 (6) | TS3 (8) | TS4 (15) |
|---|---|---|---|---|
| S1 | 98.00% | 81.00% | 77.00% | 93.40% |
| S2 | --- | 96.50% | 88.40% | 80.60% |
| S3 | --- | --- | 93.60% | 92.70% |
| S4 | --- | --- | --- | 90.00% |
| TEST | 47.80% | 73.20% | 79.80% | 85.90% |

MLP:

| Dataset | TS1 (18) | TS2 (30) | TS3 (23) | TS4 (3) |
|---|---|---|---|---|
| S1 | 96.60% | 89.80% | 86.00% | 94.80% |
| S2 | --- | 87.10% | 89.40% | 87.90% |
| S3 | --- | --- | 92.00% | 92.20% |
| S4 | --- | --- | --- | 87.30% |
| TEST | 46.60% | 68.90% | 82.00% | 87.00% |


Learn++: Implementation Issues

- Distribution initialization when a new dataset becomes available. Solution: start with a uniform distribution, and update it based on the performance of the existing ensemble on the new data.
- When to stop training for each dataset? Solution: use a validation dataset, if one is available. Or, keep training until performance on the test data peaks (mild cheating).
- Classifier proliferation when new classes are added: sufficient additional classifiers need to be generated to out-vote the existing classifiers, which cannot correctly predict the new class. Solution: Learn++.MT (after Muhlbaier & Topalis).


Learn++.MT: The Concept

[Figure: an instance of the newly introduced class "4" is shown to the ensemble; classifiers trained before class 4 existed answer 2, 0, 0, while only the newest classifier answers 4, so without weight adjustment the uninformed classifiers out-vote the correct one. All rights reserved, Muhlbaier and Topalis.]


Learn++.MT

Learn++.MT creates a preliminary class confidence on each instance and updates the weights of classifiers that have not seen a particular class:
- Each classifier is assigned a weight based on its performance on the training data.
- The preliminary class confidence is obtained by summing the weights of all classifiers that picked a given class, and dividing by the sum of the weights of all classifiers that have been trained on that class:

      P_c(x_i) = ( Σ_{t: h_t(x_i)=c} W_t ) / ( Σ_{t: c∈CTr_t} W_t ),  for c = 1,2,...,C

- It then updates (lowers) the weights of classifiers that have not been trained on class c:

      W_{t: c∉CTr_t} = W_{t: c∉CTr_t} × (1 - P_c(x_i))

Here CTr_t is the set of classes classifier t has seen during training, {t: h_t(x_i) = c} is the set of classifiers that picked class c, and P_c is the preliminary confidence of the classifiers that have seen class c that x_i belongs to class c.
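A sketch of these two equations, with each classifier represented as a (predict function, voting weight, set of classes seen in training) triple; the names are hypothetical:

```python
def preliminary_confidence(classifiers, x, cls):
    """P_c(x): summed weights of classifiers voting for class `cls`, divided
    by the summed weights of classifiers trained on `cls`."""
    voted = sum(w for h, w, seen in classifiers if h(x) == cls)
    trained = sum(w for h, w, seen in classifiers if cls in seen)
    return voted / trained if trained else 0.0

def adjust_weights(classifiers, x, cls):
    """Scale the weights of classifiers never trained on `cls` by (1 - P_c):
    the more confident the informed classifiers, the less the uninformed
    ones can out-vote them."""
    p = preliminary_confidence(classifiers, x, cls)
    return [(h, w if cls in seen else w * (1.0 - p), seen)
            for h, w, seen in classifiers]

# one classifier has seen class 1 and votes for it; the other has not seen it
clf = [(lambda x: 1, 1.0, {0, 1}), (lambda x: 0, 1.0, {0})]
adjusted = adjust_weights(clf, x=None, cls=1)
```

With the informed classifier unanimously confident (P_1 = 1), the uninformed classifier's voting weight drops to zero and can no longer out-vote it.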


Learn++.MT

Algorithm Learn++.MT
Input: For each dataset D_k, k = 1,2,...,K:
- Sequence of m_k instances x_i with labels y_i ∈ Y_k = {1,...,C}
- Weak learning algorithm BaseClassifier
- Integer T_k, specifying the number of iterations

Do for k = 1,2,...,K:
If k = 1, initialize w_1(i) = D_1(i) = 1/m_1 for all i, and eT_1 = 0.
Else, go to step 5: evaluate the current ensemble on the new dataset D_k, update the weights, and recall the current number of classifiers eT_k = Σ_{j=1..k-1} T_j.

Do for t = eT_k + 1, eT_k + 2, ..., eT_k + T_k:
1. Set D_t = w_t / Σ_{i=1..m} w_t(i) so that D_t is a distribution.
2. Call BaseClassifier with a subset of D_k chosen using D_t.
3. Obtain h_t : X → Y, and calculate its error:

       ε_t = Σ_{i: h_t(x_i)≠y_i} D_t(i)

   If ε_t > ½, discard h_t and go to step 2. Otherwise, compute the normalized error β_t = ε_t / (1 - ε_t).
4. Set CTr_t = Y_k, to save the labels of the classes used in training h_t.
5. Call DWV to obtain the composite hypothesis H_t.
6. Compute the error of the composite hypothesis:

       E_t = Σ_{i: H_t(x_i)≠y_i} D_t(i)

7. Set B_t = E_t / (1 - E_t), and update the instance weights:

       w_{t+1}(i) = w_t(i) × { B_t  if H_t(x_i) = y_i;  1  otherwise }

Call DWV to obtain the final hypothesis H_final.

Algorithm Dynamically Weighted Voting (DWV)
Input:
- Sequence of i = 1,...,n training instances, or any test instance x_i
- Classifiers h_t, with corresponding error values β_t
- Classes CTr_t used in training each h_t

For t = 1,2,...,T, where T is the total number of classifiers:
1. Initialize the classifier weights W_t = log(1/β_t).
2. Create a normalization factor for each class: Z_c = Σ_{t: c∈CTr_t} W_t, for c = 1,2,...,C.
3. Obtain the preliminary decision P_c = ( Σ_{t: h_t(x_i)=c} W_t ) / Z_c.
4. Update the voting weights: W_{t: c∉CTr_t} = W_{t: c∉CTr_t} × (1 - P_c).
5. Compute the final / composite hypothesis:

       H_final(x_i) = argmax_c Σ_{t: h_t(x_i)=c} W_t


Learn++.MT Simulation: VOC Data

Test procedure: Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset. The number of classifiers generated in each training session was chosen to optimize each algorithm's performance. Learn++ generated the best results with 6 classifiers on the first dataset, 12 on the next, and 18 on the last; Learn++.MT, on the other hand, performed optimally with 6 classifiers in the first training session, 4 in the second, and 6 in the last.

| Class | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Dataset 1 | 20 | 20 | 40 | 0 | 0 |
| Dataset 2 | 10 | 10 | 10 | 25 | 0 |
| Dataset 3 | 10 | 10 | 10 | 15 | 40 |
| Test | 24 | 24 | 52 | 24 | 40 |


Learn++.MT Simulation: VOC Data

| Training Session | Classifiers Added (Learn++) | Classifiers Added (Learn++.MT) | Performance (Learn++) | Performance (Learn++.MT) |
|---|---|---|---|---|
| TS1 | 6 | 6 | 55% | 54% |
| TS2 | 12 | 4 | 64% | 63% |
| TS3 | 18 | 6 | 69% | 81% |

2004 All Rights Reserved Muhlbaier and Topalis

This test was executed twenty times on each of Learn++ and Learn++.MT to obtain a well-represented generalization performance on the test data.


Learning from Unbalanced Data: Learn++.MT2

Learn++.MT2 was created to account for the unbalanced data problem.
- We define unbalanced data as any discrepancy in the cardinality of the datasets used in incremental learning. If one dataset has substantially more data than the other(s), the ensemble decision might be unfairly biased towards the data with the lower cardinality.
- Under the generally valid assumptions that (a) no instance is repeated in any dataset, and (b) the noise distribution remains relatively unchanged among datasets, it is reasonable to believe that the dataset with more instances carries more information. Classifiers generated with such data should therefore be weighted more heavily.
- It is not unusual to see major discrepancies in the cardinalities of datasets that subsequently become available.
- The cardinality of each dataset, including the relative cardinalities of individual classes within a dataset, should be taken into consideration in any ensemble-based learning algorithm that employs a classifier combination scheme.


Learn++.MT2 Algorithm

The primary novelty in Learn++.MT2 is the way the voting weights are determined:
- Learn++.MT2 attempts to address the unbalanced data problem by keeping track of the number of instances from each class with which each classifier is trained.
- Each classifier is first given a weight based on its performance on its own training data.
- This weight is then adjusted by a class-conditional weight factor w_{t,c}:

      w_{t,c} = p_t × n_c / N_c

  For each classifier, this factor is proportional to the ratio of the number of instances from a particular class used to train that classifier to the number of instances from that class used to train all classifiers so far in the ensemble.
- The final decision is made as in Learn++, but using the class-conditional weights:

      H_final(x_i) = argmax_{c∈Y_k} Σ_{t: h_t(x_i)=c} w_{t,c}

where p_t is the training performance of the t-th classifier, n_c is the number of class-c instances in the current dataset, and N_c is the number of all class-c instances seen so far.
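Both equations fit in a few lines (a sketch under the slide's definitions; the classifier representation is an assumption):

```python
def class_conditional_weight(p_t, n_c, N_c):
    """w_{t,c}: training performance p_t, scaled by the share of all class-c
    instances seen so far that this classifier's dataset contributed."""
    return p_t * n_c / N_c

def final_hypothesis(classifiers, x, classes):
    """Learn++.MT2 decision: argmax over classes of the summed w_{t,c} of the
    classifiers voting for that class. Each classifier is a
    (predict_fn, {class: w_tc}) pair."""
    score = {c: sum(w[c] for h, w in classifiers if h(x) == c) for c in classes}
    return max(score, key=score.get)

# a classifier trained on 80 of the 100 class-c instances seen so far,
# with training performance 0.9, gets w_{t,c} = 0.72 for that class
w = class_conditional_weight(0.9, 80, 100)
```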


Experimental Setup

Learn++.MT2 has been tested on three databases:
- the Wine database from UCI (3 classes, 13 features);
- the Optical Character Recognition database from UCI (10 classes, 64 features);
- a real-world gas identification problem for determining one of five volatile organic compounds (VOC) based on chemical sensor data (5 classes, 6 features).

Base classifiers were all single-layer MLPs, normally incapable of learning incrementally, with 12~40 nodes and an error goal of 0.05~0.025. In each case the data distributions were designed to simulate unbalanced data. To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively. This number was selected through experimental testing, such that the tests show an accurate and unbiased comparison of the algorithms.


Wine Recognition Database

| Algorithm | TS1 | TS2 | Std. Dev. |
|---|---|---|---|
| Learn++ | 88% | 84% | 2.9% |
| Learn++.MT2 | 88% | 89% | 1.6% |

Observations:
- After the initial training session (TS1), the performances of both algorithms are the same.
- After TS2, Learn++.MT2 outperforms Learn++.
- No performance degradation is seen with Learn++.MT2 after the second dataset is introduced: Learn++.MT2 is more stable.


VOC Recognition Database

| Algorithm | TS1 | TS2 | Std. Dev. |
|---|---|---|---|
| Learn++ | 89% | 86% | 2.1% |
| Learn++.MT2 | 88% | 89% | 1.9% |

Similar observations as with the Wine data:
- Again, performances are virtually identical after TS1. After TS2, Learn++.MT2 outperforms Learn++.
- No performance degradation is seen with Learn++.MT2 after the second dataset is introduced. Learn++.MT2 is more stable, and a precise termination point is not required.


OCR Database

| Algorithm | TS1 | TS2 | Std. Dev. |
|---|---|---|---|
| Learn++ | 94% | 92% | 0.9% |
| Learn++.MT2 | 94% | 95% | 0.6% |

Observations:
- Drastically unbalanced data: the majority of the information is contained in Dataset 1, which explains the high TS1 performance.
- Due to the imbalance, Learn++ performance declines with Dataset 2, whereas Learn++.MT2 provides a modest gain.


OCR Reverse Presentation

| Algorithm | TS1 | TS2 | Std. Dev. |
|---|---|---|---|
| Learn++ | 85% | 91% | 0.7% |
| Learn++.MT2 | 88% | 94% | 0.6% |

Observations:
- Reversed scenario: little information is provided initially, followed by more substantial data. The final performance remains unchanged: the algorithm is immune to the order of presentation.
- The momentary dip in Learn++.MT2 performance as a new dataset is introduced ironically justifies the approach taken. Why?


Some Open Problems

- Is the distribution update rule used in Learn++ optimal? Could a weighted combination of the AdaBoost and Learn++ update rules be better?
- Is there a better initialization scheme?
- Can Learn++ be used in a non-stationary learning environment, where the data distribution changes (in which case it may be necessary to forget some of the previously learned information, i.e., throw away some classifiers)?
- How can Learn++ be updated / initialized if the training data is known to be very unbalanced, with new classes being introduced?
- Can the performance of Learn++ on incremental learning be theoretically justified?
- Does Learn++ create more or less diverse classifiers? An analysis of the algorithm on several diversity measures.
- Can Learn++ be used on function approximation problems?
- How does Learn++ behave under different combination scenarios?


Other Ensemble Techniques

There are several other ensemble techniques, including:
- Stacked generalization
- Hierarchical mixture of experts
- Random forests


Stacked Generalization

[Figure: stacked generalization. First-layer classifiers C1 ... CN, each classifier k with parameters θk, produce outputs h1(x, θ1) ... hN(x, θN) for input x; a second-layer classifier (classifier N+1 with parameters θN+1) combines these outputs into the final decision.]


Mixture of Experts

[Figure: mixture of experts. Expert classifiers C1 ... CT, each classifier k with parameters θk, produce outputs h1(x, θ1) ... hT(x, θT) for input x; a gating network (usually trained with expectation maximization) assigns the weights w1 ... wT, and a pooling/combining system, e.g., stochastic winner-take-all or a weighted average, forms the final decision.]