
  • PR http://engineering.rowan.edu/~polikar/CLASSES/ECE555

    Advanced Topics in Pattern Recognition 2005, Robi Polikar, Rowan University, Glassboro, NJ

    Advanced Topics in Pattern Recognition

    Dept. of Electrical and Computer Engineering, 0909.555.01

    Week 9: AdaBoost & Learn++

    Graphical Drawings: Pattern Classification, Duda, Hart and Stork, Copyright John Wiley and Sons, 2001.

    AdaBoost
    AdaBoost.M1
    AdaBoost.M2
    AdaBoost.R

    Learn++


    This Week in PR

    AdaBoost and Variations: AdaBoost.M1, AdaBoost.M2, AdaBoost.R (independent research)
    Learn++

    Bias Variance Analysis


    AdaBoost

    Arguably the most popular and successful of all ensemble generation algorithms, AdaBoost (Adaptive Boosting) is an extension of the original boosting algorithm that extends boosting to multi-class problems. Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.

    It solves the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb. AdaBoost generates an ensemble of classifiers, the training data of each drawn from a distribution that starts out uniform and is iteratively reshaped to give more weight to the instances that are misclassified.

    Each classifier in AdaBoost therefore focuses increasingly on the instances that are more difficult to classify.

    The classifiers are then combined through weighted majority voting.


    AdaBoost

    Algorithm AdaBoost
    Create a discrete distribution over the training data by assigning a weight to each instance. Initially the distribution is uniform, hence all weights are the same.

    Draw a subset from this distribution and train a weak classifier with this dataset. Compute the error, ε, of this classifier on its own training dataset, and make sure that this error is less than ½. Test the entire training data on this classifier: if an instance x is correctly classified, reduce its weight in proportion to the normalized error β = ε / (1 − ε); if it is misclassified, its relative weight increases. Normalize the weights so that they constitute a distribution.

    Repeat until T classifiers are generated. Combine the classifiers using weighted majority voting.


    AdaBoost.M1

    [Figure: AdaBoost.M1 block diagram. The training data distribution D1 ... DT yields training subsets S1 ... ST; each subset trains a classifier C1 ... CT producing hypotheses h1 ... hT; after each round the distribution is updated from the normalized error β; the voting weights log(1/β1) ... log(1/βT) feed the weighted majority vote that produces the ensemble decision. Drawing: Robi Polikar.]


    AdaBoost.M1

    Algorithm AdaBoost.M1
    Input:
      Sequence of m examples S = [(x1, y1), (x2, y2), ..., (xm, ym)] with labels yi ∈ Y = {1, ..., C} drawn from a distribution D;
      Weak learning algorithm WeakLearn;
      Integer T specifying the number of iterations.
    Initialize D1(i) = 1/m for all i.
    Do for t = 1, 2, ..., T:
      1. Call WeakLearn, providing it with the distribution Dt.
      2. Get back a hypothesis ht : X → Y.
      3. Calculate the error of ht:  εt = Σ_{i: ht(xi) ≠ yi} Dt(i).
         If εt > ½, then set T = t − 1 and abort the loop.
      4. Set βt = εt / (1 − εt).
      5. Update distribution Dt:
         Dt+1(i) = (Dt(i) / Zt) × { βt if ht(xi) = yi; 1 otherwise }
         where Zt = Σi Dt(i) is a normalization constant chosen so that Dt+1 becomes a distribution.
    Output the final hypothesis:
      hfinal(x) = argmax_{y ∈ Y} Σ_{t: ht(x) = y} log(1/βt)
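The loop above can be sketched in a few lines of Python. This is a hedged, illustrative sketch, not the slides' own code: the names `weak_learn` and `adaboost_m1` are ours, the weak learner is a toy 1-D decision stump, and for determinism it reweights the full training set each round rather than resampling a subset from Dt.

```python
import math

def weak_learn(S, D):
    """Toy weak learner: the single-threshold stump on 1-D inputs that
    minimizes the D-weighted error (reweighting variant, for determinism)."""
    best = None
    for thr in sorted({x for x, _ in S}):
        for lo, hi in ((0, 1), (1, 0)):
            err = sum(d for (x, y), d in zip(S, D) if (lo if x < thr else hi) != y)
            if best is None or err < best[0]:
                best = (err, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda x: lo if x < thr else hi

def adaboost_m1(S, T):
    m = len(S)
    D = [1.0 / m] * m                      # D1(i) = 1/m: uniform start
    ensemble = []                          # (hypothesis, beta_t) pairs
    for _ in range(T):
        h = weak_learn(S, D)
        eps = sum(D[i] for i, (x, y) in enumerate(S) if h(x) != y)
        if eps > 0.5:                      # M1 abort condition
            break
        beta = max(eps / (1.0 - eps), 1e-10)   # guard against eps == 0
        # step 5: shrink weights of correctly classified instances, renormalize
        D = [D[i] * (beta if h(x) == y else 1.0) for i, (x, y) in enumerate(S)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((h, beta))
    def h_final(x):                        # weighted majority voting
        votes = {}
        for h, b in ensemble:
            votes[h(x)] = votes.get(h(x), 0.0) + math.log(1.0 / b)
        return max(votes, key=votes.get)
    return h_final
```

On a separable 1-D toy set such as `[(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]`, the returned ensemble classifies all training instances correctly.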


    Weighted Majority Voting Demystified!

    Problem: How many hours a day should our students spend working on homework?

    [Figure: five experts answer 3, 6, 7, 9, and 8 hours; a weight assigner gives them weights 0.3, 0.2, 0.25, 0.15, and 0.1; the weighted combination yields 5.96 ≈ 6.]
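The combination rule in the example can be sketched in a few lines of Python (function names are ours, not the slides'): a weighted average for numeric estimates, and weighted majority voting for class labels, where each expert's vote counts with its weight.

```python
def weighted_average(estimates, weights):
    """Weighted combination of numeric expert estimates (weights assumed to sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(e * w for e, w in zip(estimates, weights))

def weighted_majority_vote(labels, weights):
    """Weighted majority voting for class labels: the label accumulating
    the largest total expert weight wins."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

Note that the exact pairing of answers to weights in the figure is ambiguous after extraction; pairing them in the listed order gives a weighted average of 6.0, close to the figure's 5.96.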


    Does AdaBoost Work?

    You betcha! The training error E of AdaBoost can be shown to satisfy

    E < 2^T ∏_{t=1}^{T} √( εt (1 − εt) )

    where εt is the error of the t-th hypothesis. Note that this product gets smaller and smaller with each added classifier.

    But wait: isn't this against Occam's razor? For an explanation, see Freund and Schapire's paper, as well as Schapire's tutorial on boosting and margin theory. More about this later.
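The shrinking product is easy to verify numerically: each factor 2√(εt(1 − εt)) is below 1 whenever εt < ½, so the bound decays exponentially as classifiers are added. A small Python check (the function name is ours):

```python
import math

def training_error_bound(errors):
    """Freund-Schapire bound on AdaBoost's training error:
    E <= prod_t 2*sqrt(eps_t * (1 - eps_t)),
    which equals 2^T * prod_t sqrt(eps_t * (1 - eps_t))."""
    bound = 1.0
    for eps in errors:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound
```

With εt = 0.5 (random guessing) each factor is exactly 1 and the bound never shrinks; with εt = 0.3 the bound falls below 0.42 after ten classifiers.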


    Occams Razor vs. AdaBoost

    What to expect:
    training error to decrease with the number of classifiers;
    generalization error to increase after a while (overfitting).

    What's observed:
    generalization error does not increase even after many, many iterations;
    in fact, it even decreases after the training error reaches zero!

    Is Occam's razor of "simpler is better" wrong? Violated?

    From R. Schapire, http://www.cs.princeton.edu/~schapire/ (letters database)


    The Margin Theory

    The margin of an instance x roughly describes the confidence of the ensemble in its decision. Loosely speaking, the margin of an instance is simply the difference between the total (or fraction of) votes it receives from correctly identifying classifiers and the maximum (or fraction of) votes received by any incorrect class:

    m(x) = μk(x) − max_{j ≠ k} μj(x)

    where the k-th class is the true class, and μj(x) is the total support (vote) class j receives from all classifiers, normalized so that Σj μj(x) = 1.

    The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
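The margin definition above can be sketched directly, assuming the class supports have been normalized to sum to 1 (the function name is ours):

```python
def margin(supports, true_class):
    """Ensemble margin of one instance: support for the true class minus the
    largest support among the incorrect classes. `supports` maps each class
    to its (normalized) total vote; negative result = misclassified."""
    others = [s for c, s in supports.items() if c != true_class]
    return supports[true_class] - max(others)
```

For supports {a: 0.6, b: 0.3, c: 0.1} with true class a the margin is 0.3; if the true class were c, the margin would be negative.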


    Margin Theory

    [Figure: cumulative margin distributions on the training data as classifiers are added. From R. Schapire, http://www.cs.princeton.edu/~schapire/ (letters database)]


    Margin Theory

    Large margins indicate a lower bound on generalization error. If all margins are large, the final decision boundary can be obtained using a simpler classifier (similar to how polls can predict the outcomes of not-so-close races very early on).

    They show that boosting tends to increase the margins on the training data examples, and argue that an ensemble with larger margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble.

    More specifically: let H be a finite space of base classifiers. For any θ > 0 and δ > 0, with probability 1 − δ over the random choice of the training set S, any ensemble E = {h1, ..., hT} ⊆ H combined through weighted majority satisfies

    P(error) ≤ P(training margin ≤ θ) + O( [ (log N · log|H|) / (N θ²) + (1/N) log(1/δ) ]^(1/2) )

    N: number of instances; |H|: cardinality of the classifier space (the weaker the classifiers, the smaller the |H|). Note that P(error) is independent of the number of classifiers!


    AdaBoost.M2

    AdaBoost.M1 requires that all classifiers have a weighted error no greater than ½. This is the least that can be asked of a classifier in a two-class problem, since an error of ½ is equivalent to random guessing.

    The probability of error for random guessing is much higher for multi-class problems (specifically, (k − 1)/k for a k-class problem). Therefore, achieving an error below ½ becomes increasingly difficult as the number of classes grows, particularly if the weak classifiers are really weak.

    AdaBoost.M2 addresses this problem by removing the weighted error restriction; it instead defines the pseudo-error (pseudo-loss), which itself is then required to be no larger than ½. The pseudo-loss recognizes that there is information in the outputs of classifiers for the non-selected / non-winning classes.

    On the OCR problem, for example, 1 and 7 may look alike, and the classifier may give high plausibility outputs to both and low outputs to all others when faced with a 1 or a 7.


    AdaBoost.M2

    Algorithm AdaBoost.M2
    Input:
      Sequence of m examples S = [(x1, y1), (x2, y2), ..., (xm, ym)] with labels yi ∈ Y = {1, ..., C} drawn from a distribution D;
      Weak learning algorithm WeakLearn;
      Integer T specifying the number of iterations.
    Let B = { (i, y) : i ∈ {1, 2, ..., m}, y ≠ yi }.
    Initialize D1(i, y) = 1/|B| for (i, y) ∈ B.
    Do for t = 1, 2, ..., T:
      1. Call WeakLearn, providing it with the distribution Dt.
      2. Get back a hypothesis ht : X × Y → [0, 1].
      3. Calculate the pseudo-loss of ht:
         εt = ½ Σ_{(i,y) ∈ B} Dt(i, y) (1 − ht(xi, yi) + ht(xi, y))
      4. Set βt = εt / (1 − εt).
      5. Update distribution Dt:
         Dt+1(i, y) = (Dt(i, y) / Zt) · βt^{(1/2)(1 + ht(xi, yi) − ht(xi, y))}
         where Zt is a normalization constant chosen so that Dt+1 becomes a distribution.
    Output the final hypothesis:
      hfinal(x) = argmax_{y ∈ Y} Σ_{t=1}^{T} log(1/βt) · ht(x, y)
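The pseudo-loss in step 3 can be sketched directly from the formula (the function name is ours). As a sanity check: a perfect hypothesis (plausibility 1 for the true label, 0 for the rest) has pseudo-loss 0, and an uninformative one (plausibility ½ everywhere) has pseudo-loss exactly ½.

```python
def pseudo_loss(D, h, S):
    """AdaBoost.M2 pseudo-loss: 1/2 * sum over mislabel pairs (i, y) of
    D(i, y) * (1 - h(x_i, y_i) + h(x_i, y)), where h(x, y) in [0, 1] is the
    plausibility the hypothesis assigns to label y, and D is a dict keyed
    by the mislabel pairs in B."""
    eps = 0.0
    for (i, y), d in D.items():
        x_i, y_i = S[i]
        eps += d * (1.0 - h(x_i, y_i) + h(x_i, y))
    return 0.5 * eps
```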


    Incremental Learning

    We now pose the following question: if, after training an algorithm, we receive additional data, how can we update the trained classifier to learn the new data?

    None of the classic algorithms we have seen so far (MLP, RBF, PNN, kNN, RCE, etc.) is capable of incrementally updating its knowledge base to learn new information. The typical procedure is to discard the previously trained classifier, combine the old and new data, and start over. This causes all the information learned so far to be lost: catastrophic forgetting. Furthermore, what if the old data are no longer available, or what if the new data introduce new classes?

    The ensemble-of-classifiers approach, generally used for improving the generalization accuracy of a classifier, can also be used to address incremental learning. One such implementation of ensemble classifiers for incremental learning is Learn++.


    Incremental Learning

    [Figure: two-class feature space (Feature 1 vs. Feature 2). Data1 and Data2 arrive sequentially; classifiers C1-C4 carve out the decision regions.]


    Learn++

    So, how do we achieve incremental learning? What, if anything, in the AdaBoost formulation prevents us from learning new data when previously unseen instances are introduced? Actually, nothing! AdaBoost should work for incremental learning, but it can be made more efficient.

    Learn++ modifies the distribution update rule so that the update is based on the ensemble decision, not just the previous classifier. Why should this make any difference?


    Learn++

    [Figure: Learn++ ensemble. Hypotheses h1-h8 (classifiers C1-C8), trained on data drawn from distributions D1 and D2, are combined with voting weights W through weighted majority voting to form the learned decision boundary.]


    Learn++

    [Figure: Learn++ flow diagram. From each database (Database 1, Database 2), subsets D1, D2, D3, D4, ..., Dn are drawn; each trains a classifier C1 ... Cn with error E1 ... En and performance P1 ... Pn. Composite hypotheses H2, H3, ..., Hn-1 identify the (mis)classified instances, which drive the distribution update; voting weights W1 ... Wn combine all hypotheses into the final classification.]


    Algorithm Learn++ (with major differences from AdaBoost.M1 in steps 1, 5, and 6)
    Input: For each database Dk, k = 1, 2, ..., K:
      Sequence of mk training examples Sk = [(x1, y1), (x2, y2), ..., (xmk, ymk)];
      Weak learning algorithm WeakLearn;
      Integer Tk specifying the number of iterations.
    Do for k = 1, 2, ..., K:
      Initialize w1(i) = D1(i) = 1/mk for all i, unless there is prior knowledge to select otherwise.
      Do for t = 1, 2, ..., Tk:
        1. Set Dt = wt / Σ_{i=1}^{m} wt(i) so that Dt is a distribution.
        2. Randomly choose a training data subset TRt according to Dt.
        3. Call WeakLearn, providing it with TRt.
        4. Get back a hypothesis ht : X → Y, and calculate its error εt = Σ_{i: ht(xi) ≠ yi} Dt(i) on Sk.
           If εt > ½, set t = t − 1, discard ht and go to step 2. Otherwise, compute the normalized error βt = εt / (1 − εt).
        5. Call weighted majority voting to obtain the composite hypothesis
           Ht = argmax_{y ∈ Y} Σ_{t: ht(x) = y} log(1/βt)
           and compute the composite error Et = Σ_{i: Ht(xi) ≠ yi} Dt(i).
           If Et > ½, set t = t − 1, discard Ht and go to step 2.
        6. Set Bt = Et / (1 − Et), and update the weights of the instances:
           wt+1(i) = wt(i) × { Bt if Ht(xi) = yi; 1 otherwise }
    Call weighted majority voting on the combined hypotheses Ht and output the final hypothesis:
      Hfinal(x) = argmax_{y ∈ Y} Σ_{k=1}^{K} Σ_{t: Ht(x) = y} log(1/Bt)
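The key departure from AdaBoost.M1 is in step 6: the instance-weight update is driven by the composite hypothesis Ht (the current ensemble's decision), not by the single latest hypothesis ht, so newly arrived data the ensemble already handles is quickly down-weighted. A minimal sketch of just that step (names are ours):

```python
def learnpp_weight_update(w, B_t, H_t, S):
    """Learn++ step 6: multiply by B_t the weights of instances that the
    *composite* hypothesis H_t already classifies correctly, leaving
    misclassified instances untouched. AdaBoost.M1 would use the single
    hypothesis h_t in place of H_t here."""
    return [w_i * (B_t if H_t(x) == y else 1.0)
            for w_i, (x, y) in zip(w, S)]
```

For example, with S = [(0, 0), (1, 1)], a composite hypothesis that is only correct on the first instance, and B_t = 0.2, the weights [0.5, 0.5] become [0.1, 0.5].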


    Simulation Results: Odorant Identification

    1. APZ: Apiezon, 2. PIB: Polyisobutylene, 3. DEGA: Poly(diethylene glycol adipate), 4. SG: Sol-gel, 5. OV275: Poly(siloxane), 6. PDPP: Poly(diphenoxylphosphorazene)

    [Figure: bar charts (y-axis 0 to 1) of the six sensors' responses to Ethanol, Toluene, Xylene, TCE, and Octane.]

    Data Distribution
    Dataset  ET  OC  TL  TCE  XL
    S1       20  20  40    0   0
    S2        5   5   5   25   0
    S3        5   5   5    5  40
    TEST     34  34  62   34  40


    Odorant Identification Results

    PNN
    Dataset  TS1 (2)   TS2 (3)   TS3 (3)
    S1       93.70%    73.70%    76.30%
    S2       ---       92.50%    82.50%
    S3       ---       ---       85.00%
    TEST     58.80%    67.70%    83.80%

    RBF
    Dataset  TS1 (5)   TS2 (6)   TS3 (7)
    S1       97.50%    81.20%    76.20%
    S2       ---       97.00%    95.00%
    S3       ---       ---       90.00%
    TEST     59.30%    67.60%    86.30%

    MLP
    Dataset  TS1 (10)  TS2 (5)   TS3 (9)
    S1       87.50%    75.00%    71.30%
    S2       ---       82.50%    85.00%
    S3       ---       ---       86.70%
    TEST     58.80%    71.10%    87.20%


    Ultrasonic Weld Inspection (UWI)

    [Figure: weld cross-section (North / South) showing the defect types: slag, incomplete penetration, crack, porosity, and lack of fusion.]

    Data Distribution
    Dataset  LOF  SLAG  CRACK  POROSITY
    S1       300   300     0        0
    S2       200   200   200        0
    S3       150   150   137       99
    TEST     200   200   150      125


    Ultrasonic Weld Inspection Database

    [Figure: example ultrasonic signals for crack, lack of fusion, slag, and porosity.]


    UWI Results

    PNN
    Dataset  TS1 (7)   TS2 (7)   TS3 (14)
    S1       99.50%    83.60%    71.50%
    S2       ---       93.00%    75.00%
    S3       ---       ---       99.40%
    TEST     48.90%    62.40%    76.10%

    RBF
    Dataset  TS1 (9)   TS2 (5)   TS3 (14)
    S1       90.80%    78.30%    69.80%
    S2       ---       89.50%    75.80%
    S3       ---       ---       94.70%
    TEST     47.70%    60.90%    76.40%

    MLP
    Dataset  TS1 (8)   TS2 (3)   TS3 (17)
    S1       98.20%    84.70%    80.30%
    S2       ---       98.70%    86.50%
    S3       ---       ---       97.00%
    TEST     48.90%    66.80%    78.40%


    Optical Character Recognition (OCR)

    Handwritten character recognition problem: 2997 instances, 62 attributes, 10 classes.

    Divided into four sets: S1 ~ S4 (2150 instances) for training, TEST (1797) for testing.

    Different sets of classes appear in the different datasets.


    OCR Database

    Data Distribution
    Class  S1   S2   S3   S4   TEST
    0      100   50   50   25   178
    1        0  150   50    0   182
    2      100   50   50   25   177
    3        0  150   50   25   183
    4      100   50   50    0   181
    5        0  150   50   25   182
    6      100   50    0  100   181
    7        0    0  150   50   179
    8      100    0    0  150   174
    9        0   50  100   50   180


    OCR Results

    PNN
    Dataset  TS1 (2)   TS2 (5)   TS3 (6)   TS4 (3)
    S1       99.80%    80.80%    78.20%    92.40%
    S2       ---       96.30%    89.00%    88.70%
    S3       ---       ---       98.20%    94.00%
    S4       ---       ---       ---       88.00%
    TEST     48.70%    69.10%    82.80%    86.70%

    RBF
    Dataset  TS1 (4)   TS2 (6)   TS3 (8)   TS4 (15)
    S1       98.00%    81.00%    77.00%    93.40%
    S2       ---       96.50%    88.40%    80.60%
    S3       ---       ---       93.60%    92.70%
    S4       ---       ---       ---       90.00%
    TEST     47.80%    73.20%    79.80%    85.90%

    MLP
    Dataset  TS1 (18)  TS2 (30)  TS3 (23)  TS4 (3)
    S1       96.60%    89.80%    86.00%    94.80%
    S2       ---       87.10%    89.40%    87.90%
    S3       ---       ---       92.00%    92.20%
    S4       ---       ---       ---       87.30%
    TEST     46.60%    68.90%    82.00%    87.00%


    Learn++

    Implementation Issues

    Distribution initialization when a new dataset becomes available. Solution: start with a uniform distribution, and update it based on the performance of the existing ensemble on the new data.

    When to stop training on each dataset? Solution: use a validation dataset, if one is available; or keep training until performance on the test data peaks (mild cheating).

    Classifier proliferation when new classes are added: sufficient additional classifiers need to be generated to out-vote the existing classifiers, which cannot correctly predict the new class. Solution: Learn++.MT (after Muhlbaier & Topalis).


    Learn++.MT: The Concept

    [Figure: Learn++.MT concept illustration. An instance from a newly introduced class ("4?") is queried; the many earlier classifiers that never saw the new class out-vote the few newer classifiers that can recognize it. 2004 All Rights Reserved, Muhlbaier and Topalis.]


    Learn++.MT

    Learn++.MT creates a preliminary class confidence for each instance and updates the weights of classifiers that have not seen a particular class. Each classifier is assigned a weight Wt based on its performance on the training data. The preliminary class confidence is obtained by summing the weights of all classifiers that picked a given class and dividing by the sum of the weights of all classifiers that have been trained on that class:

    Pc(xi) = Σ_{t: ht(xi) = c} Wt / Σ_{t: c ∈ CTrt} Wt,  for c = 1, 2, ..., C

    where CTrt is the set of classes used in training ht, so the denominator sums over the classifiers that have seen class c, and Pc(xi) is the preliminary confidence of those classifiers that xi belongs to class c.

    It then updates (lowers) the weights of the classifiers that have not been trained with the new class:

    Wt(xi) = Wt · (1 − Pc(xi))  for all t such that c ∉ CTrt
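The two update equations above can be sketched as follows (names are ours; `trained_on[t]` plays the role of CTrt, `votes[t]` is ht(xi)):

```python
def preliminary_confidence(votes, weights, trained_on, c):
    """Learn++.MT preliminary confidence for class c: summed weight of the
    classifiers that picked c, divided by the summed weight of the
    classifiers that were trained on c."""
    picked = sum(weights[t] for t in votes if votes[t] == c)
    seen = sum(weights[t] for t in trained_on if c in trained_on[t])
    return picked / seen if seen else 0.0

def downweight_unseen(weights, trained_on, c, P_c):
    """Lower the weights of classifiers never trained on class c, in
    proportion to the preliminary confidence that the instance is class c."""
    return {t: (w * (1.0 - P_c) if c not in trained_on[t] else w)
            for t, w in weights.items()}
```

If every classifier that has seen class c votes for c, the preliminary confidence is 1 and the classifiers that never saw c are silenced entirely.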


    Learn++.MT

    Algorithm Learn++.MT
    Input: For each dataset Dk, k = 1, 2, ..., K:
      Sequence of mk instances xi with labels yi ∈ Yk ⊆ {1, ..., C};
      Weak learning algorithm BaseClassifier;
      Integer Tk specifying the number of iterations.
    Do for k = 1, 2, ..., K:
      If k = 1, initialize w1(i) = D1(i) = 1/m1 for all i, and eT1 = 0.
      Else, go to step 5: evaluate the current ensemble on the new dataset Dk, update the weights, and recall the current number of classifiers eTk = Σ_{j=1}^{k−1} Tj.
      Do for t = eTk + 1, eTk + 2, ..., eTk + Tk:
        1. Set Dt = wt / Σ_{i=1}^{m} wt(i) so that Dt is a distribution.
        2. Call BaseClassifier with a subset of Dk chosen using Dt.
        3. Obtain ht : X → Y, and calculate its error εt = Σ_{i: ht(xi) ≠ yi} Dt(i).
           If εt > ½, discard ht and go to step 2. Otherwise, compute the normalized error βt = εt / (1 − εt).
        4. Set CTrt = Yk, to save the labels of the classes used in training ht.
        5. Call DWV to obtain the composite hypothesis Ht.
        6. Compute the error of the composite hypothesis Et = Σ_{i: Ht(xi) ≠ yi} Dt(i).
        7. Set Bt = Et / (1 − Et), and update the instance weights:
           wt+1(i) = wt(i) × { Bt if Ht(xi) = yi; 1 otherwise }
    Call DWV to obtain the final hypothesis Hfinal.

    Algorithm Dynamically Weighted Voting (DWV)
    Input: Sequence of i = 1, ..., n training instances, or any test instance xi; classifiers ht; corresponding error values βt; classes CTrt used in training ht.
    For t = 1, 2, ..., T, where T is the total number of classifiers:
      1. Initialize the classifier weights Wt = log(1/βt).
      2. Create a normalization factor for each class: Zc = Σ_{t: c ∈ CTrt} Wt, for c = 1, 2, ..., C.
      3. Obtain the preliminary decision Pc = Σ_{t: ht(xi) = c} Wt / Zc.
      4. Update the voting weights: Wt = Wt · (1 − Pc) for all t such that c ∉ CTrt.
      5. Compute the final / composite hypothesis: Hfinal(xi) = argmax_{c} Σ_{t: ht(xi) = c} Wt.


    Learn++.MT Simulation: VOC Data

    Test procedure: Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset, with the number generated in each training session chosen to optimize each algorithm's performance. Learn++ appeared to generate the best results with 6 classifiers on the first dataset, 12 on the next, and 18 on the last. Learn++.MT, on the other hand, performed optimally with 6 classifiers in the first training session, 4 in the second, and 6 in the last.

    Data Distribution
    Class      1   2   3   4   5
    Dataset 1  20  20  40   0   0
    Dataset 2  10  10  10  25   0
    Dataset 3  10  10  10  15  40
    Test       24  24  52  24  40


    Learn++.MT Simulation: VOC Data

    Training   # Classifiers Added      Performance
    Session    Learn++   Learn++.MT    Learn++   Learn++.MT
    TS1         6         6            55%       54%
    TS2        12         4            64%       63%
    TS3        18         6            69%       81%

    2004 All Rights Reserved, Muhlbaier and Topalis

    This test was executed twenty times each on Learn++ and Learn++.MT to obtain a well-represented generalization performance on the test data.


    Learning from Unbalanced Data: Learn++.MT2

    Learn++.MT2 was created to account for the unbalanced data problem. We define unbalanced data as any discrepancy in the cardinality of the datasets used in incremental learning. If one dataset has substantially more data than the others, the ensemble decision might be unfairly biased towards the data with the lower cardinality.

    Under the generally valid assumptions that (i) no instance is repeated in any dataset, and (ii) the noise distribution remains relatively unchanged among datasets, it is reasonable to believe that the dataset with more instances carries more information; classifiers generated with such data should therefore be weighted more heavily.

    It is not unusual to see major discrepancies in the cardinalities of datasets that subsequently become available. The cardinality of each dataset, including the relative cardinalities of the individual classes within a dataset, should be taken into consideration in any ensemble-based learning algorithm that employs a classifier combination scheme.


    Learn++.MT2 Algorithm

    The primary novelty in Learn++.MT2 is the way the voting weights are determined. Learn++.MT2 attempts to address the unbalanced data problem by keeping track of the number of instances from each class with which each classifier is trained.

    Each classifier is first given a weight based on its performance on its own training data. This weight is later adjusted according to its class conditional weight factor wt,c. For each classifier, this factor is proportional to the ratio of the number of instances from a particular class used for training that classifier to the number of instances from that class used for training all classifiers thus far within the ensemble:

    wt,c = pt · nc / Nc

    pt: training performance of the t-th classifier; nc: number of class-c instances in the current dataset; Nc: number of all class-c instances seen so far.

    The final decision is made similarly to Learn++, but using the class conditional weights:

    Hfinal(xi) = argmax_{c ∈ Y} Σ_{t: ht(xi) = c} wt,c
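The class conditional weight factor is a one-liner; the sketch below just makes the slide's notation concrete (the function name is ours):

```python
def class_conditional_weight(p_t, n_c, N_c):
    """Learn++.MT2 voting weight w_{t,c} = p_t * n_c / N_c: the classifier's
    training performance p_t, scaled by the share of all class-c instances
    seen so far (N_c) that were in its own training data's dataset (n_c)."""
    return p_t * n_c / N_c
```

For example, a classifier with training performance 0.9, trained during a session whose dataset held 50 of the 200 class-c instances seen so far, votes for class c with weight 0.9 * 50 / 200 = 0.225.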


    Experimental Setup

    Learn++.MT2 has been tested on three databases: the Wine database from UCI (3 classes, 13 features); the Optical Character Recognition database from UCI (10 classes, 64 features); and a real-world gas identification problem for determining one of five volatile organic compounds (VOCs) from chemical sensor data (5 classes, 6 features).

    Base classifiers were all single-layer MLPs, normally incapable of learning incrementally, with 12~40 nodes and an error goal of 0.05~0.025. In each case the data distributions were designed to simulate unbalanced data.

    To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively. This number was selected through experimental testing, such that the tests show an accurate and unbiased comparison of the algorithms.


    Wine Recognition Database

    Algorithm     TS1   TS2   Std. Dev.
    Learn++       88%   84%   2.9%
    Learn++.MT2   88%   89%   1.6%

    Observations:

    After the initial training session (TS1), the performances of both algorithms are the same;

    After TS2, Learn++.MT2 outperforms Learn++;

    No performance degradation is seen with Learn++.MT2 after the second dataset is introduced. Learn++.MT2 is more stable.


    VOC Recognition Database

    Algorithm     TS1   TS2   Std. Dev.
    Learn++       89%   86%   2.1%
    Learn++.MT2   88%   89%   1.9%

    Similar observations as the Wine data:

    Again, performances are virtually identical after TS1. After TS2, Learn++.MT2 outperforms Learn++;

    No performance degradation is seen with Learn++.MT2 after the second dataset is introduced. Learn++.MT2 is more stable. A precise termination point is not required.


    OCR Database

    Algorithm     TS1   TS2   Std. Dev.
    Learn++       94%   92%   0.9%
    Learn++.MT2   94%   95%   0.6%

    Observations:

    Drastically unbalanced data, with the majority of the information contained in Dataset 1, explains the high TS1 performance.

    Due to the imbalance, Learn++ performance declines with Dataset 2, whereas Learn++.MT2 provides a modest gain.


    OCR Reverse Presentation

    Algorithm     TS1   TS2   Std. Dev.
    Learn++       85%   91%   0.7%
    Learn++.MT2   88%   94%   0.6%

    Observations:

    Reversed scenario: little information is initially provided, followed by more substantial data. The final performances remain essentially unchanged: the algorithm is immune to the order of presentation.

    The momentary dip in Learn++.MT2 performance as a new dataset is introduced ironically justifies the approach taken. Why?


    Some Open Problems

    Is the distribution update rule used in Learn++ optimal? Could a weighted combination of the AdaBoost and Learn++ update rules be better?

    Is there a better initialization scheme?

    Can Learn++ be used in a non-stationary learning environment, where the data distribution changes (in which case it may be necessary to forget some of the previously learned information, i.e., throw away some classifiers)?

    How can Learn++ be updated / initialized if the training data is known to be very unbalanced, with new classes being introduced?

    Can the performance of Learn++ on incremental learning be theoretically justified?

    Does Learn++ create more or less diverse classifiers? An analysis of the algorithm on several diversity measures.

    Can Learn++ be used on function approximation problems? How does Learn++ behave under different combination scenarios?


    Other Ensemble Techniques

    There are several other ensemble techniques, including:
    Stacked generalization
    Hierarchical mixture of experts
    Random forests


    Stacked Generalization

    [Figure: stacked generalization. First-layer classifiers C1 ... Ck ... CN, each classifier k with parameters θk producing hk(x, θk), all receive the input x; their outputs form the input to a second-layer classifier (classifier N+1 with parameters θN+1), which produces the final decision.]
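A minimal sketch of the stacking idea in the diagram, assuming already-trained first- and second-layer classifiers are given as callables (names are ours): the first layer's outputs become the feature vector fed to the second-layer (meta) classifier.

```python
def stacked_predict(first_layer, second_layer, x):
    """Stacked generalization, prediction side: the first-layer classifiers'
    outputs on x form the meta-feature vector passed to the trained
    second-layer classifier, whose output is the final decision."""
    meta_features = [h(x) for h in first_layer]
    return second_layer(meta_features)
```

In practice the second-layer classifier is trained on the first layer's outputs over held-out data, so it learns which first-layer classifiers to trust.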


    Mixture of Experts

    [Figure: mixture of experts. Expert classifiers C1 ... Ck ... CT, each classifier k with parameters θk producing hk(x, θk), receive the input x; a gating network (usually trained with Expectation Maximization) assigns weights w1 ... wN to a pooling / combining system (stochastic winner-takes-all, or weighted average) that produces the final decision.]
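The weighted-average combination variant in the diagram can be sketched as follows (names are ours; the gating network is assumed to return one normalized weight per expert, as a function of the input):

```python
def mixture_of_experts(experts, gate, x):
    """Mixture of experts, weighted-average combination: the gating network
    assigns input-dependent weights to the experts' outputs; the final
    decision is the weighted average of those outputs."""
    weights = gate(x)          # one weight per expert, assumed to sum to 1
    return sum(w * h(x) for w, h in zip(weights, experts))
```

The stochastic winner-takes-all variant would instead sample a single expert with probability given by the gate's weights.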