Research on Ensemble Learning
Feng Zhou, Baoliang Lu
Department of Computer Science
Shanghai Jiao Tong University
Jan. 22, 2008
Outline
1 Background: Ensemble Learning, Probabilistic Classifier, M3 Framework, Summary
2 First Method: Problem Description, Decomposition, Integration, Comparison, Summary
3 Second Method: Problem Revisited, Homo-pairwise Combination, Comparison
4 Experiments: Toy Data, Large-scale Data
Where Ensemble Learning arises
The limitations of traditional classifier algorithms:
- Statistical Problem
- Computational Problem
- Representation Problem
Existing approaches to ensemble learning:
- AdaBoost: Strong ← Weak
- One-vs-One (One-vs-All): Multiclass ← Pairwise
- M3: Complicated ← Simple
How to report Probabilistic Outputs [1]
Definition
Find $f : x \in \mathbb{R} \to p(y=1|x) \in [0,1]$, e.g. the sigmoid
$$f(x) = \frac{1}{1 + e^{Ax+B}}$$
Optimization Criterion
Minimize the cross entropy
$$-\sum_{i=1}^{n} \big[\, y_i \log(p_i) + (1-y_i)\log(1-p_i) \,\big]$$
which can be solved by a model-trust minimization algorithm.
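The fit above can be sketched numerically. The following uses plain gradient descent on the cross entropy instead of the model-trust algorithm of [1]; `scores` and `labels` are hypothetical training data (raw classifier outputs and 0/1 targets):

```python
import math

def fit_sigmoid(scores, labels, lr=0.05, epochs=2000):
    """Fit p(y=1|x) = 1 / (1 + exp(A*x + B)) by minimizing the cross entropy.

    A simple gradient-descent stand-in for the model-trust algorithm;
    scores are raw classifier outputs, labels are 0/1.
    """
    A, B = 0.0, 0.0
    for _ in range(epochs):
        gA = gB = 0.0
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * x + B))
            # For this parameterization, dL/dA = (y - p) * x, dL/dB = (y - p)
            gA += (y - p) * x
            gB += (y - p)
        A -= lr * gA
        B -= lr * gB
    return A, B

# Positive scores should map to high probability, so A comes out negative
# (compare the fitted A = -13.23 reported on the 20 Newsgroups slide).
A, B = fit_sigmoid([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0], [0, 0, 0, 1, 1, 1])
```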
Estimated Priors
[Figure: estimated class densities $p(x|\omega_+)$ and $p(x|\omega_-)$ over the classifier outputs]
Fitting on Posteriors
[Figure: the empirical posterior $p(\omega_+|x)$ together with the fitted sigmoid]
How Min-Max-Modular (M3) works [2]
Learning on the decomposed training sets
[Figure: each class of a two-class training set is split into subsets; every pair of subsets from opposite classes forms a small subproblem learned by its own classifier]
Integrating the classifiers' reports
[Figure: the modules' outputs are first combined by Min units, and the Min results are then combined by a Max unit]
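The min-max integration above fits in a few lines. Here `modules` is a hypothetical grid of trained base classifiers, one per pair of decomposed subsets, each returning a score in [0, 1] for the positive class:

```python
def m3_predict(modules, x):
    """Min-Max-Modular integration as in [2] (a minimal sketch).

    modules[i][j] scores x on the subproblem (positive subset i vs
    negative subset j).  Each row is combined by a MIN unit, and the
    row results by a MAX unit, giving the ensemble's positive score.
    """
    row_mins = [min(module(x) for module in row) for row in modules]
    return max(row_mins)

# Toy modules with constant outputs, standing in for trained classifiers.
modules = [
    [lambda x: 0.9, lambda x: 0.2],  # row min: 0.2
    [lambda x: 0.6, lambda x: 0.7],  # row min: 0.6
]
score = m3_predict(modules, None)    # max of the row mins
```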
What we gain from past research
- Why can the Min-Max principles successfully perform the integration job?
- How does the decomposition stage influence the later integration stage?
- Could the system be accelerated?
[Figure: nine candidate decompositions of the same training set; which one is best?]
Let's consider it again [3]
[Figure: the classical pipeline on a two-class data set. Learning estimates the class densities $p(x|\omega_+)$ and $p(x|\omega_-)$ (the "Priors" panel); the Bayes Rule turns them into the posterior $p(\omega_+|x)$ (the "Posteriors" panel), whose 0.5 crossing gives the Decision Boundary used at Testing time.]
Complicated
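The Bayes Rule step of this pipeline is a one-liner. The sketch below assumes equal class priors, a simplification of the slide's setting:

```python
def posterior_pos(px_pos, px_neg, prior_pos=0.5):
    """p(ω+|x) from the class densities p(x|ω+), p(x|ω−) via the Bayes rule."""
    num = px_pos * prior_pos
    den = num + px_neg * (1.0 - prior_pos)
    return num / den

# x is assigned to ω+ whenever the posterior exceeds the 0.5 decision boundary.
p = posterior_pos(0.3, 0.1)  # 0.15 / 0.20 = 0.75, so classify as ω+
```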
How to simplify the problem
[Figure: each class of the original problem is decomposed into subsets; every pair of subsets from opposite classes yields a small two-class problem with its own estimated densities]
Original Problem → Current Problems
How to integrate the patches
[Figure: the subproblem posteriors produced by Probabilistic Outputting are combined within each group by Shrinking or Minimizing, and the group results are merged by Expanding or Maximizing into the final posterior]
What's the difference
Stage One
$$\mathrm{Min}(x) = \min_{i=1}^{d} x_i \qquad \mathrm{Shrink}(x) = \frac{1}{\sum_{i=1}^{d} \frac{1}{x_i} - (d-1)}$$
Stage Two
$$\mathrm{Max}(x) = \max_{i=1}^{d} x_i \qquad \mathrm{Expand}(x) = 1 - \frac{1}{\sum_{i=1}^{d} \frac{1}{1-x_i} - (d-1)}$$
[Figure: compared with Min, the Shrink curve lies much lower; compared with Max, the Expand curve lies much higher]
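The two rules differ only in how conservatively they combine the d subproblem posteriors. A direct transcription of the formulas, assuming each x_i lies strictly inside (0, 1):

```python
def shrink(xs):
    """Shrink(x) = 1 / (sum_i 1/x_i - (d - 1)); never exceeds min(xs)."""
    d = len(xs)
    return 1.0 / (sum(1.0 / x for x in xs) - (d - 1))

def expand(xs):
    """Expand(x) = 1 - 1 / (sum_i 1/(1 - x_i) - (d - 1)); never below max(xs)."""
    d = len(xs)
    return 1.0 - 1.0 / (sum(1.0 / (1.0 - x) for x in xs) - (d - 1))

xs = [0.2, 0.9]
# Shrink pushes below Min and Expand pushes above Max, as the curves show.
low, high = shrink(xs), expand(xs)
```

Since 1/x_i ≥ 1 for x_i ≤ 1, the denominator of Shrink is at least 1/min(xs), which is why Shrink sits at or below the Min curve; Expand is the mirror image around the two class labels.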
What we could conclude
Question One (Answered)
Why can the Min-Max principles successfully perform the integration job? Because they partially obey the Bayes Decision Rule.
Question Two (Answered)
How does the decomposition stage influence the later integration stage? Decomposing along the inner structure of each class contributes to a large distance among the patches.
Question Three (Unsolved)
Could the system be accelerated? Current complexity: O(n+ × n−).
A General Consideration
[Figure: three views of the task: a Single Classifier, the Former framework, and Another View that links the subsets through a Bridge Class]
What happens if we face the same classes
[Figure: the same set of subproblem densities combined by the Former Approach versus the Homo-pairwise combination]
What's the source of the efficiency
Complexity of Algorithms
Min-Max, Shrink-Expansion: O(n+ × n−); Homo-pairwise: O(n+ + n−)
Special Probabilistic Relationship
Suppose
$$\mathrm{Link}_x(\omega_k, \omega_i) = \frac{p(\omega_k, x)}{p(\omega_i, x)} \quad \text{and} \quad \mathrm{Link}_x(\omega_k, \omega_j) = \frac{p(\omega_k, x)}{p(\omega_j, x)}$$
Then
$$\mathrm{Link}_x(\omega_i, \omega_j) = \frac{p(\omega_i, x)/p(\omega_k, x)}{p(\omega_j, x)/p(\omega_k, x)} = \frac{\mathrm{Link}_x(\omega_k, \omega_j)}{\mathrm{Link}_x(\omega_k, \omega_i)}$$
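This identity is what lets the homo-pairwise scheme recover any cross-class ratio through a bridge class instead of training a module for every pair. A numeric check with hypothetical joint probabilities:

```python
def link(p_a, p_b):
    """Link ratio p(ω_a, x) / p(ω_b, x) between two joint probabilities."""
    return p_a / p_b

# Hypothetical joints p(ω_k, x), p(ω_i, x), p(ω_j, x) at a single point x.
pk, pi, pj = 0.5, 0.2, 0.3

direct = link(pi, pj)                     # Link_x(ω_i, ω_j) computed directly
via_bridge = link(pk, pj) / link(pk, pi)  # recovered through the bridge ω_k
```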
Does it happen as we thought (Toy Data)
Gaussian Parameters

Label   µ             Σ
ω+1     (1, 1.2)      (0.8, 2.1)
ω+2     (−1, −1)      (1.1, 1.5)
ω−1     (0.8, −0.5)   (2.2, 0.8)
ω−2     (−0.7, 1)     (1.7, 0.6)

Performance

Case ID   Min-Max   SE      SEr
1         73.00     75.00   75.00
2         78.00     83.00   83.00
3         72.00     75.00   75.00
20 Newsgroups (Large-scale Data)
Probability Estimate
[Figure: estimated densities p(x|ω+) and p(x|ω−) for alt.atheism (+) vs comp.graphics (−), together with the fitted sigmoid posterior p(ω+|x), A = −13.23, B = 6.97]
References
[1] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 1999.
[2] B.L. Lu and M. Ito. Task decomposition and module combination based on class relations: a modular neural network for pattern classification. IEEE Transactions on Neural Networks, 1999.
[3] F. Zhou and B.L. Lu. Learning concepts from large-scale data sets by pairwise coupling with probabilistic outputs. IEEE International Joint Conference on Neural Networks, 2007.