Part II: Practical Implementations

Page 1: Part II: Practical Implementations

Part II: Practical Implementations.

Page 2: Part II: Practical Implementations

Modeling the Classes

Stochastic Discrimination

Page 3: Part II: Practical Implementations

Algorithm for Training an SD Classifier

1. Generate a projectable weak model

2. Evaluate the model w.r.t. the training set; check enrichment

3. Check uniformity w.r.t. the existing collection

4. Add to the discriminant
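In code, the loop looks roughly like this. It is a minimal sketch, not the authors' implementation: it assumes 2D points given as (x, y) tuples, uses random axis-aligned rectangles as the projectable weak models, a fixed coverage-gap threshold for the enrichment check, and a crude mean-coverage test for uniformity; all of these choices are illustrative.

```python
import random

def make_rect(lo=0.0, hi=20.0):
    """A random axis-aligned rectangle: one projectable weak model."""
    x1, x2 = sorted(random.uniform(lo, hi) for _ in range(2))
    y1, y2 = sorted(random.uniform(lo, hi) for _ in range(2))
    return lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def train_sd(class1, class2, n_models=1000, min_gap=0.1):
    """Collect weak models that are enriched for class 1 and that
    preferentially cover currently under-covered class-1 points."""
    models = []
    counts = {p: 0 for p in class1}          # per-point coverage so far
    while len(models) < n_models:
        m = make_rect()                      # 1. generate projectable weak model
        c1 = [p for p in class1 if m(p)]
        c2 = [p for p in class2 if m(p)]
        gap = len(c1) / len(class1) - len(c2) / len(class2)
        if gap < min_gap:                    # 2. check enrichment
            continue
        mean_cov = sum(counts.values()) / len(counts)
        if all(counts[p] > mean_cov for p in c1):
            continue                         # 3. check uniformity (crudely)
        models.append(m)                     # 4. add to discriminant
        for p in c1:
            counts[p] += 1
    return models

def discriminant(models, p):
    """Y(p): fraction of the collected weak models covering p."""
    return sum(m(p) for m in models) / len(models)
```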

Page 4: Part II: Practical Implementations

Dealing with Data Geometry:

SD in Practice

Page 5: Part II: Practical Implementations

2D Example

• Adapted from [Kleinberg, PAMI, May 2000]

Page 6: Part II: Practical Implementations

• An “r=1/2” random subset in the feature space that covers ½ of all the points

Page 7: Part II: Practical Implementations

• Watch how many such subsets cover a particular point, say, (2,17)

Page 8: Part II: Practical Implementations

It’s in 0/1 models: Y = 0/1 = 0.0

It’s in 1/2 models: Y = 1/2 = 0.5

It’s in 2/3 models: Y = 2/3 = 0.67

It’s in 3/4 models: Y = 3/4 = 0.75

It’s in 4/5 models: Y = 4/5 = 0.8

It’s in 5/6 models: Y = 5/6 = 0.83

Page 9: Part II: Practical Implementations

It’s in 5/7 models: Y = 5/7 = 0.71

It’s in 6/8 models: Y = 6/8 = 0.75

It’s in 7/9 models: Y = 7/9 = 0.77

It’s in 8/10 models: Y = 8/10 = 0.8

It’s in 8/11 models: Y = 8/11 = 0.73

It’s in 8/12 models: Y = 8/12 = 0.67

Page 10: Part II: Practical Implementations

• Fraction of “r=1/2” random subsets covering point (2,17) as more such subsets are generated
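This convergence is the law of large numbers at work: each “r=1/2” subset covers any fixed point with probability 1/2, so the running fraction Y tends to 0.5. A small simulation of this (assuming, for illustration, a 20×20 grid of integer points):

```python
import random

grid = [(x, y) for x in range(20) for y in range(20)]
point = (2, 17)

random.seed(1)
in_count = 0
for t in range(1, 5001):
    subset = set(random.sample(grid, len(grid) // 2))  # an "r=1/2" random subset
    in_count += point in subset                        # is (2,17) covered?
    if t in (10, 100, 1000, 5000):
        print(t, in_count / t)                         # running Y; approaches 0.5
```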

Page 11: Part II: Practical Implementations

• Fractions of “r=1/2” random subsets covering several selected points as more such subsets are generated

Page 12: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 100 models

Page 13: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 200 models

Page 14: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 300 models

Page 15: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 400 models

Page 16: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 500 models

Page 17: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 1000 models

Page 18: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 2000 models

Page 19: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 5000 models

Page 20: Part II: Practical Implementations

• Introducing enrichment:

For any discrimination to happen, the models must have some difference in coverage for different classes.

Page 21: Part II: Practical Implementations

• Enforcing enrichment (adding in a bias): require each subset to cover more points of one class than another

(Figure panels: class distribution; a biased (enriched) weak model.)
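As a sketch, the bias can be enforced with a simple acceptance test on each candidate subset; the function name and margin parameter below are illustrative, not from the source:

```python
def coverage(model, points):
    """Fraction of the given points covered by the model (a predicate)."""
    return sum(model(p) for p in points) / len(points)

def is_enriched(model, class1, class2, margin=0.1):
    """Keep the weak model only if it covers class 1 noticeably more
    than class 2 -- the source of its discriminating power."""
    return coverage(model, class1) - coverage(model, class2) >= margin
```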

Page 22: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched weak models

Page 23: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 200 enriched weak models

Page 24: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 300 enriched weak models

Page 25: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 400 enriched weak models

Page 26: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 500 enriched weak models

Page 27: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched weak models

Page 28: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 2000 enriched weak models

Page 29: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched weak models

Page 30: Part II: Practical Implementations

• Error rate decreases as the number of models increases

Decision rule: if Y < 0.5 then class 2, else class 1
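In code, the rule is a single threshold on Y, reusing the discriminant from the training-loop sketch above:

```python
def classify(models, p, threshold=0.5):
    """Average membership of p over all weak models, then threshold."""
    y = sum(m(p) for m in models) / len(models)
    return 2 if y < threshold else 1
```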

Page 31: Part II: Practical Implementations

• Sparse Training Data: incomplete knowledge about class distributions (training set vs. test set panels)

Page 32: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched weak models (training set vs. test set panels)

Page 33: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 200 enriched weak models (training set vs. test set panels)

Page 34: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 300 enriched weak models (training set vs. test set panels)

Page 35: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 400 enriched weak models (training set vs. test set panels)

Page 36: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 500 enriched weak models (training set vs. test set panels)

Page 37: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched weak models (training set vs. test set panels)

Page 38: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 2000 enriched weak models (training set vs. test set panels)

Page 39: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched weak models (training set vs. test set panels)

No discrimination!

Page 40: Part II: Practical Implementations

• Models of this type, when enriched for the training set, are not necessarily enriched for the test set

(Figure: a random model with 50% coverage of space, shown on the training set and the test set.)

Page 41: Part II: Practical Implementations

• Introducing projectability:

Maintain local continuity of class interpretations.

Neighboring points of the same class should share similar model coverage.

Page 42: Part II: Practical Implementations

• Allow some local continuity in model membership, so that the interpretation of a training point can generalize to its immediate neighborhood

(Figure panels: class distribution; a projectable model.)
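The contrast can be made concrete in code; both forms below are hypothetical illustrations. The first defines membership by an explicit point list, so a test point one step away from a covered training point gets no credit; the second defines membership by a region, so coverage extends to a training point's neighborhood:

```python
# Not projectable: membership is an arbitrary set of training points.
def point_set_model(covered_points):
    covered = set(covered_points)
    return lambda p: p in covered

# Projectable: membership is a spatial region, so the interpretation of
# a training point carries over to nearby test points.
def rect_model(x1, y1, x2, y2):
    return lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2
```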

Page 43: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched, projectable weak models (training set vs. test set panels)

Page 44: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 300 enriched, projectable weak models (training set vs. test set panels)

Page 45: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 400 enriched, projectable weak models (training set vs. test set panels)

Page 46: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 500 enriched, projectable weak models (training set vs. test set panels)

Page 47: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched, projectable weak models (training set vs. test set panels)

Page 48: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 2000 enriched, projectable weak models (training set vs. test set panels)

Page 49: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched, projectable weak models (training set vs. test set panels)

Page 50: Part II: Practical Implementations

• Promoting uniformity:

All points in the same class should be equally likely to be covered by a model of each particular rating.

Retain models that cover points that are under-covered by the current collection (a greedy sketch follows below).
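One greedy way to implement the retention test (a sketch only; a deterministic algorithm for uniformity enforcement is listed as an open problem later in this deck): keep per-point coverage counts and accept a model only if the points it covers are, on average, covered no more than the class as a whole:

```python
def promotes_uniformity(model, points, counts):
    """counts[p] = number of accepted models already covering p.
    Accept the candidate only if the points it covers are currently
    covered no more than the class average."""
    covered = [p for p in points if model(p)]
    if not covered:
        return False
    mean_all = sum(counts[p] for p in points) / len(points)
    mean_covered = sum(counts[p] for p in covered) / len(covered)
    return mean_covered <= mean_all
```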

Page 51: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 52: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 53: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 54: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 10000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 55: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 50000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 56: Part II: Practical Implementations

The 3 necessary conditions

Enrichment: discriminating power

Uniformity: complementary information

Projectability: generalization power

Page 57: Part II: Practical Implementations

Extensions and Comparisons

Page 58: Part II: Practical Implementations

Alternative Discriminants

• [Berlind 1994]

• Different discriminants for N-class problems

• Additional condition on symmetry

• Approximate uniformity

• Hierarchy of indiscernibility

Page 59: Part II: Practical Implementations

Estimates of Classification Accuracies

• [Chen 1997]

• Statistical estimate of classification accuracy under weaker conditions:

Approximate uniformity

Approximate indiscernibility

Page 60: Part II: Practical Implementations

Multi-class Problems

• For n classes, define n discriminants Yi, one for each class i vs. the others

• Classify an unknown point to the class i for which the computed Yi is the largest (a sketch follows below)
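A sketch of this one-vs-rest scheme, assuming some train_one_vs_rest(pos, neg) that returns a discriminant as a callable Y (it could be built from the training-loop sketch earlier):

```python
def train_multiclass(classes, train_one_vs_rest):
    """classes: one list of training points per class.
    Returns one discriminant Y_i per class (class i vs. the rest)."""
    discriminants = []
    for i, cls in enumerate(classes):
        rest = [p for j, c in enumerate(classes) if j != i for p in c]
        discriminants.append(train_one_vs_rest(cls, rest))
    return discriminants

def classify_multiclass(discriminants, p):
    """Assign p to the class whose discriminant value is largest."""
    scores = [y(p) for y in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)
```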

Page 61: Part II: Practical Implementations

[Ho & Kleinberg ICPR 1996]

Pages 62-64: (figure-only slides.)

Page 65: Part II: Practical Implementations

Open Problems

• Algorithm for uniformity enforcement: deterministic methods?

• Desirable form of weak models: fewer, more sophisticated classifiers?

• Other ways to address the 3-way trade-off: enrichment / uniformity / projectability

Page 66: Part II: Practical Implementations

Random Decision Forest

• [Ho 1995, 1998]

• A structured way to create models: fully split a tree, use leaves as models

• Perfect enrichment and uniformity for the training set

• Promote projectability by subspace projection (a sketch follows below)
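A minimal illustration of the "leaves as models" idea on synthetic data, using scikit-learn's DecisionTreeClassifier as a convenience (this is not Ho's original implementation): a fully grown tree puts every training point in a pure leaf, so each leaf is a perfectly enriched region for the training set, and averaging trees grown on random feature subsets supplies the projectability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic 2-class problem
X_new = rng.normal(size=(100, 10))           # unseen points

votes = np.zeros(len(X_new))
n_trees, n_dims = 50, 5
for _ in range(n_trees):
    dims = rng.choice(X.shape[1], size=n_dims, replace=False)  # random subspace
    tree = DecisionTreeClassifier().fit(X[:, dims], y)  # split until leaves are pure
    votes += tree.predict(X_new[:, dims])    # each pure leaf acts as a weak model

y_new = (votes / n_trees > 0.5).astype(int)  # mean of leaf votes plays the role of Y
```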

Page 67: Part II: Practical Implementations

Compact Distribution Maps

• [Ho & Baird 1993, 1997]

• Another structured way to create models

• Start with projectable models by coarse quantization of feature value range

• Seek enrichment and uniformity

(Figure: signatures of 2 types of events and measurements from a new observation; axes: signal index vs. signal level.)

Page 68: Part II: Practical Implementations

SD & Other Ensemble Methods

• Ensemble learning via boosting:

A sequential way to promote uniformity of ensemble element coverage

• XCS (a genetic algorithm)

A way to create, filter, and use stochastic models that are regions in feature space

Page 69: Part II: Practical Implementations

XCS Classifier System

• [Wilson, 1995]; a recent focus of the GA community

• Good performance

• Reinforcement Learning + Genetic Algorithms

• Model: a set of rules, e.g.:

if (shape=square and number>10) then class=red
if (shape=circle and number<5) then class=yellow

(Diagram: the rule set maps an input from the environment to a class; reinforcement learning updates rule strengths from the reward, while genetic algorithms search for new rules.)

Page 70: Part II: Practical Implementations

Multiple Classifier Systems: Examples in Word Image Recognition

Page 71: Part II: Practical Implementations

Complementary Strengths of Classifiers

The case for classifier combination, also known as decision fusion, mixture of experts, or committee decision making

(Figure: rank of the true class out of a lexicon of 1091 words, by 10 classifiers for 20 images.)

Page 72: Part II: Practical Implementations

Classifier Combination Methods

• Decision Optimization:

find consensus among a given set of classifiers

• Coverage Optimization:

create a set of classifiers that work best with a given decision combination function

Page 73: Part II: Practical Implementations

Decision Optimization

• Develop classifiers with expert knowledge

• Try to make the best use of their decisions via majority/plurality vote, sum/product rule, probabilistic methods, Bayesian methods, rank/confidence score combination, … (a plurality-vote sketch follows below)

• The joint capability of the classifiers sets an intrinsic limit on the combined accuracy

• There is no way to handle the blind spots
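For concreteness, the simplest of these rules, plurality voting, applied to hypothetical classifier outputs:

```python
from collections import Counter

def plurality_vote(decisions):
    """decisions: one predicted label per classifier; ties broken arbitrarily."""
    return Counter(decisions).most_common(1)[0][0]

print(plurality_vote(["cat", "dog", "cat", "bird", "cat"]))  # -> cat
```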

Page 74: Part II: Practical Implementations

Difficulties in Decision Optimization

• Reliability versus overall accuracy

• Fixed or trainable combination function

• Simple models or combinatorial estimates

• How to model complementary behavior

Page 75: Part II: Practical Implementations

Coverage Optimization

• Fix a decision combination function

• Generate classifiers automatically and systematically via training set sub-sampling (stacking, bagging, boosting), subspace projection (RSM), superclass/subclass decomposition (ECOC), random perturbation of training processes, noise injection, … (a bagging/RSM sketch follows below)

• Need enough classifiers to cover all blind spots (how many are enough?)

• What else is critical?
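As one concrete instance, scikit-learn's BaggingClassifier combines two of these generation mechanisms under a fixed voting rule: max_samples performs the training-set sub-sampling and max_features the subspace projection (the parameter values here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.7,    # training-set sub-sampling (bagging)
    max_features=0.5,   # subspace projection (random subspace method)
    random_state=0,
).fit(X, y)

print(ensemble.score(X, y))  # training accuracy of the combined ensemble
```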

Page 76: Part II: Practical Implementations

Difficulties in Coverage Optimization

• What kind of differences to introduce:
– Subsamples? Subspaces? Super/subclasses?
– Training parameters?
– Model geometry?

• 3-way trade-off: discrimination + diversity + generalization

• Effects of the form of component classifiers

Page 77: Part II: Practical Implementations

Dilemmas and Paradoxes in Classifier Combination

• Weaken individuals for a stronger whole?

• Sacrifice known samples for unseen cases?

• Seek agreements or differences?

Page 78: Part II: Practical Implementations

Stochastic Discrimination

• A mathematical theory that relates several key concepts in pattern recognition:

– Discriminative power … enrichment
– Complementary information … uniformity
– Generalization power … projectability

• It offers a way to describe complementary behavior of classifiers

• It offers guidelines to design multiple classifier systems (classifier ensembles)