Part II: Practical Implementations

Page 1: Part II: Practical Implementations

Part II: Practical Implementations.

Page 2: Part II: Practical Implementations

Modeling the Classes

Stochastic Discrimination

Page 3: Part II: Practical Implementations

Algorithm for Training an SD Classifier

1. Generate a projectable weak model

2. Evaluate the model w.r.t. the training set; check enrichment

3. Check uniformity w.r.t. the existing collection

4. Add to the discriminant
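In code, the loop looks roughly like this. It is a minimal sketch, not the authors' implementation: it assumes 2D points given as (x, y) tuples, uses random axis-aligned rectangles as the projectable weak models, a fixed coverage-gap threshold for the enrichment check, and a crude mean-coverage test for uniformity; all of these choices are illustrative.

```python
import random

def make_rect(lo=0.0, hi=20.0):
    """A random axis-aligned rectangle: one projectable weak model."""
    x1, x2 = sorted(random.uniform(lo, hi) for _ in range(2))
    y1, y2 = sorted(random.uniform(lo, hi) for _ in range(2))
    return lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def train_sd(class1, class2, n_models=1000, min_gap=0.1):
    """Collect weak models that are enriched for class 1 and that
    preferentially cover currently under-covered class-1 points."""
    models = []
    counts = {p: 0 for p in class1}          # per-point coverage so far
    while len(models) < n_models:
        m = make_rect()                      # 1. generate projectable weak model
        c1 = [p for p in class1 if m(p)]
        c2 = [p for p in class2 if m(p)]
        gap = len(c1) / len(class1) - len(c2) / len(class2)
        if gap < min_gap:                    # 2. check enrichment
            continue
        mean_cov = sum(counts.values()) / len(counts)
        if all(counts[p] > mean_cov for p in c1):
            continue                         # 3. check uniformity (crudely)
        models.append(m)                     # 4. add to discriminant
        for p in c1:
            counts[p] += 1
    return models

def discriminant(models, p):
    """Y(p): fraction of the collected weak models covering p."""
    return sum(m(p) for m in models) / len(models)
```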

Page 4: Part II: Practical Implementations

Dealing with Data Geometry:

SD in Practice

Page 5: Part II: Practical Implementations

2D Example

• Adapted from [Kleinberg, PAMI, May 2000]

Page 6: Part II: Practical Implementations

• An “r=1/2” random subset in the feature space that covers ½ of all the points

Page 7: Part II: Practical Implementations

• Watch how many such subsets cover a particular point, say, (2,17)

Page 8: Part II: Practical Implementations

It’s in 0/1 models: Y = 0/1 = 0.0

It’s in 1/2 models: Y = 1/2 = 0.5

It’s in 2/3 models: Y = 2/3 = 0.67

It’s in 3/4 models: Y = 3/4 = 0.75

It’s in 4/5 models: Y = 4/5 = 0.8

It’s in 5/6 models: Y = 5/6 = 0.83

Page 9: Part II: Practical Implementations

It’s in 5/7 models: Y = 5/7 = 0.71

It’s in 6/8 models: Y = 6/8 = 0.75

It’s in 7/9 models: Y = 7/9 = 0.77

It’s in 8/10 models: Y = 8/10 = 0.8

It’s in 8/11 models: Y = 8/11 = 0.73

It’s in 8/12 models: Y = 8/12 = 0.67

Page 10: Part II: Practical Implementations

• Fraction of “r=1/2” random subsets covering point (2,17) as more such subsets are generated
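This convergence is the law of large numbers at work: each “r=1/2” subset covers any fixed point with probability 1/2, so the running fraction Y tends to 0.5. A small simulation of this (assuming, for illustration, a 20×20 grid of integer points):

```python
import random

grid = [(x, y) for x in range(20) for y in range(20)]
point = (2, 17)

random.seed(1)
in_count = 0
for t in range(1, 5001):
    subset = set(random.sample(grid, len(grid) // 2))  # an "r=1/2" random subset
    in_count += point in subset                        # is (2,17) covered?
    if t in (10, 100, 1000, 5000):
        print(t, in_count / t)                         # running Y; approaches 0.5
```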

Page 11: Part II: Practical Implementations

• Fractions of “r=1/2” random subsets covering several selected points as more such subsets are generated

Page 12: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 100 models

Page 13: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 200 models

Page 14: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 300 models

Page 15: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 400 models

Page 16: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 500 models

Page 17: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 1000 models

Page 18: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 2000 models

Page 19: Part II: Practical Implementations

• Distribution of model coverage for all points in space, with 5000 models

Page 20: Part II: Practical Implementations

• Introducing enrichment:

For any discrimination to happen, the models must have some difference in coverage for different classes.

Page 21: Part II: Practical Implementations

• Enforcing enrichment (adding in a bias): require each subset to cover more points of one class than another

(Figure panels: class distribution; a biased (enriched) weak model.)
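As a sketch, the bias can be enforced with a simple acceptance test on each candidate subset; the function name and margin parameter below are illustrative, not from the source:

```python
def coverage(model, points):
    """Fraction of the given points covered by the model (a predicate)."""
    return sum(model(p) for p in points) / len(points)

def is_enriched(model, class1, class2, margin=0.1):
    """Keep the weak model only if it covers class 1 noticeably more
    than class 2 -- the source of its discriminating power."""
    return coverage(model, class1) - coverage(model, class2) >= margin
```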

Page 22: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched weak models

Page 23: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 200 enriched weak models

Page 24: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 300 enriched weak models

Page 25: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 400 enriched weak models

Page 26: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 500 enriched weak models

Page 27: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched weak models

Page 28: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 2000 enriched weak models

Page 29: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched weak models

Page 30: Part II: Practical Implementations

• Error rate decreases as the number of models increases

Decision rule: if Y < 0.5 then class 2, else class 1
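In code, the rule is a single threshold on Y, reusing the discriminant from the training-loop sketch above:

```python
def classify(models, p, threshold=0.5):
    """Average membership of p over all weak models, then threshold."""
    y = sum(m(p) for m in models) / len(models)
    return 2 if y < threshold else 1
```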

Page 31: Part II: Practical Implementations

• Sparse Training Data: incomplete knowledge about class distributions (training set vs. test set panels)

Page 32: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched weak models (training set vs. test set panels)

Page 33: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 200 enriched weak models (training set vs. test set panels)

Page 34: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 300 enriched weak models (training set vs. test set panels)

Page 35: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 400 enriched weak models (training set vs. test set panels)

Page 36: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 500 enriched weak models (training set vs. test set panels)

Page 37: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched weak models (training set vs. test set panels)

Page 38: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 2000 enriched weak models (training set vs. test set panels)

Page 39: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched weak models (training set vs. test set panels)

No discrimination!

Page 40: Part II: Practical Implementations

• Models of this type, when enriched for the training set, are not necessarily enriched for the test set

(Figure: a random model with 50% coverage of space, shown on the training set and the test set.)

Page 41: Part II: Practical Implementations

• Introducing projectability:

Maintain local continuity of class interpretations.

Neighboring points of the same class should share similar model coverage.

Page 42: Part II: Practical Implementations

• Allow some local continuity in model membership, so that the interpretation of a training point can generalize to its immediate neighborhood

(Figure panels: class distribution; a projectable model.)
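The contrast can be made concrete in code; both forms below are hypothetical illustrations. The first defines membership by an explicit point list, so a test point one step away from a covered training point gets no credit; the second defines membership by a region, so coverage extends to a training point's neighborhood:

```python
# Not projectable: membership is an arbitrary set of training points.
def point_set_model(covered_points):
    covered = set(covered_points)
    return lambda p: p in covered

# Projectable: membership is a spatial region, so the interpretation of
# a training point carries over to nearby test points.
def rect_model(x1, y1, x2, y2):
    return lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2
```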

Page 43: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched, projectable weak models (training set vs. test set panels)

Page 44: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 300 enriched, projectable weak models (training set vs. test set panels)

Page 45: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 400 enriched, projectable weak models (training set vs. test set panels)

Page 46: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 500 enriched, projectable weak models (training set vs. test set panels)

Page 47: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched, projectable weak models (training set vs. test set panels)

Page 48: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 2000 enriched, projectable weak models (training set vs. test set panels)

Page 49: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched, projectable weak models (training set vs. test set panels)

Page 50: Part II: Practical Implementations

• Promoting uniformity:

All points in the same class should be equally likely to be covered by a model of each particular rating.

Retain models that cover points that are under-covered by the current collection (a greedy sketch follows below).
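One greedy way to implement the retention test (a sketch only; a deterministic algorithm for uniformity enforcement is listed as an open problem later in this deck): keep per-point coverage counts and accept a model only if the points it covers are, on average, covered no more than the class as a whole:

```python
def promotes_uniformity(model, points, counts):
    """counts[p] = number of accepted models already covering p.
    Accept the candidate only if the points it covers are currently
    covered no more than the class average."""
    covered = [p for p in points if model(p)]
    if not covered:
        return False
    mean_all = sum(counts[p] for p in points) / len(points)
    mean_covered = sum(counts[p] for p in covered) / len(covered)
    return mean_covered <= mean_all
```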

Page 51: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 100 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 52: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 1000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 53: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 5000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 54: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 10000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 55: Part II: Practical Implementations

• Distribution of model coverage for points in each class, with 50000 enriched, projectable, uniform weak models (training set vs. test set panels)

Page 56: Part II: Practical Implementations

The 3 necessary conditions

Enrichment: discriminating power

Uniformity: complementary information

Projectability: generalization power

Page 57: Part II: Practical Implementations

Extensions and Comparisons

Page 58: Part II: Practical Implementations

Alternative Discriminants

• [Berlind 1994]

• Different discriminants for N-class problems

• Additional condition on symmetry

• Approximate uniformity

• Hierarchy of indiscernibility

Page 59: Part II: Practical Implementations

Estimates of Classification Accuracies

• [Chen 1997]

• Statistical estimate of classification accuracy under weaker conditions:

Approximate uniformity

Approximate indiscernibility

Page 60: Part II: Practical Implementations

Multi-class Problems

• For n classes, define n discriminants Yi, one for each class i vs. the others

• Classify an unknown point to the class i for which the computed Yi is the largest (a sketch follows below)
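A sketch of this one-vs-rest scheme, assuming some train_one_vs_rest(pos, neg) that returns a discriminant as a callable Y (it could be built from the training-loop sketch earlier):

```python
def train_multiclass(classes, train_one_vs_rest):
    """classes: one list of training points per class.
    Returns one discriminant Y_i per class (class i vs. the rest)."""
    discriminants = []
    for i, cls in enumerate(classes):
        rest = [p for j, c in enumerate(classes) if j != i for p in c]
        discriminants.append(train_one_vs_rest(cls, rest))
    return discriminants

def classify_multiclass(discriminants, p):
    """Assign p to the class whose discriminant value is largest."""
    scores = [y(p) for y in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)
```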

Page 61: Part II: Practical Implementations

[Ho & Kleinberg ICPR 1996]

Pages 62-64: (figure-only slides.)

Page 65: Part II: Practical Implementations

Open Problems

• Algorithm for uniformity enforcement: deterministic methods?

• Desirable form of weak models: fewer, more sophisticated classifiers?

• Other ways to address the 3-way trade-off: enrichment / uniformity / projectability

Page 66: Part II: Practical Implementations

Random Decision Forest

• [Ho 1995, 1998]

• A structured way to create models: fully split a tree, use leaves as models

• Perfect enrichment and uniformity for the training set

• Promote projectability by subspace projection (a sketch follows below)
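A minimal illustration of the "leaves as models" idea on synthetic data, using scikit-learn's DecisionTreeClassifier as a convenience (this is not Ho's original implementation): a fully grown tree puts every training point in a pure leaf, so each leaf is a perfectly enriched region for the training set, and averaging trees grown on random feature subsets supplies the projectability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic 2-class problem
X_new = rng.normal(size=(100, 10))           # unseen points

votes = np.zeros(len(X_new))
n_trees, n_dims = 50, 5
for _ in range(n_trees):
    dims = rng.choice(X.shape[1], size=n_dims, replace=False)  # random subspace
    tree = DecisionTreeClassifier().fit(X[:, dims], y)  # split until leaves are pure
    votes += tree.predict(X_new[:, dims])    # each pure leaf acts as a weak model

y_new = (votes / n_trees > 0.5).astype(int)  # mean of leaf votes plays the role of Y
```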

Page 67: Part II: Practical Implementations

Compact Distribution Maps

• [Ho & Baird 1993, 1997]

• Another structured way to create models

• Start with projectable models by coarse quantization of feature value range

• Seek enrichment and uniformity

(Figure: signatures of 2 types of events and measurements from a new observation; axes: signal index vs. signal level.)

Page 68: Part II: Practical Implementations

SD & Other Ensemble Methods

• Ensemble learning via boosting:

A sequential way to promote uniformity of ensemble element coverage

• XCS (a genetic algorithm)

A way to create, filter, and use stochastic models that are regions in feature space

Page 69: Part II: Practical Implementations

XCS Classifier System

• [Wilson, 1995]; a recent focus of the GA community

• Good performance

• Reinforcement Learning + Genetic Algorithms

• Model: a set of rules, e.g.:

if (shape=square and number>10) then class=red
if (shape=circle and number<5) then class=yellow

(Diagram: the rule set maps an input from the environment to a class; reinforcement learning updates rule strengths from the reward, while genetic algorithms search for new rules.)

Page 70: Part II: Practical Implementations

Multiple Classifier Systems: Examples in Word Image Recognition

Page 71: Part II: Practical Implementations

Complementary Strengths of Classifiers

The case for classifier combination, also known as decision fusion, mixture of experts, or committee decision making

(Figure: rank of the true class out of a lexicon of 1091 words, by 10 classifiers for 20 images.)

Page 72: Part II: Practical Implementations

Classifier Combination Methods

• Decision Optimization:

find consensus among a given set of classifiers

• Coverage Optimization:

create a set of classifiers that work best with a given decision combination function

Page 73: Part II: Practical Implementations

Decision Optimization

• Develop classifiers with expert knowledge

• Try to make the best use of their decisions via majority/plurality vote, sum/product rule, probabilistic methods, Bayesian methods, rank/confidence score combination, … (a plurality-vote sketch follows below)

• The joint capability of the classifiers sets an intrinsic limit on the combined accuracy

• There is no way to handle the blind spots
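For concreteness, the simplest of these rules, plurality voting, applied to hypothetical classifier outputs:

```python
from collections import Counter

def plurality_vote(decisions):
    """decisions: one predicted label per classifier; ties broken arbitrarily."""
    return Counter(decisions).most_common(1)[0][0]

print(plurality_vote(["cat", "dog", "cat", "bird", "cat"]))  # -> cat
```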

Page 74: Part II: Practical Implementations

Difficulties in Decision Optimization

• Reliability versus overall accuracy

• Fixed or trainable combination function

• Simple models or combinatorial estimates

• How to model complementary behavior

Page 75: Part II: Practical Implementations

Coverage Optimization

• Fix a decision combination function

• Generate classifiers automatically and systematically via training set sub-sampling (stacking, bagging, boosting), subspace projection (RSM), superclass/subclass decomposition (ECOC), random perturbation of training processes, noise injection, … (a bagging/RSM sketch follows below)

• Need enough classifiers to cover all blind spots (how many are enough?)

• What else is critical?
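As one concrete instance, scikit-learn's BaggingClassifier combines two of these generation mechanisms under a fixed voting rule: max_samples performs the training-set sub-sampling and max_features the subspace projection (the parameter values here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.7,    # training-set sub-sampling (bagging)
    max_features=0.5,   # subspace projection (random subspace method)
    random_state=0,
).fit(X, y)

print(ensemble.score(X, y))  # training accuracy of the combined ensemble
```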

Page 76: Part II: Practical Implementations

Difficulties in Coverage Optimization

• What kind of differences to introduce:
– Subsamples? Subspaces? Super/subclasses?
– Training parameters?
– Model geometry?

• 3-way trade-off: discrimination + diversity + generalization

• Effects of the form of component classifiers

Page 77: Part II: Practical Implementations

Dilemmas and Paradoxes in Classifier Combination

• Weaken individuals for a stronger whole?

• Sacrifice known samples for unseen cases?

• Seek agreements or differences?

Page 78: Part II: Practical Implementations

Stochastic Discrimination

• A mathematical theory that relates several key concepts in pattern recognition:

– Discriminative power … enrichment
– Complementary information … uniformity
– Generalization power … projectability

• It offers a way to describe complementary behavior of classifiers

• It offers guidelines to design multiple classifier systems (classifier ensembles)