
Page 1: CS 559: Machine Learning Fundamentals and Applications

CS 559: Machine Learning Fundamentals and Applications

10th Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]

Office: Lieb 215

1

Page 2: CS 559: Machine Learning Fundamentals and Applications

Ensemble Methods

• Bagging (Breiman 1994, …)
• Random forests (Breiman 2001, …)
• Boosting (Freund and Schapire 1995, Friedman et al. 1998, …)

Predict class label for unseen data by aggregating a set of predictions (classifiers learned from the training data)

2

Page 3: CS 559: Machine Learning Fundamentals and Applications

General Idea

[Figure: the general idea – the training data S is resampled into multiple data sets S1, S2, …, Sn, a classifier C1, C2, …, Cn is learned from each, and they are combined into a single classifier H]

3

Page 4: CS 559: Machine Learning Fundamentals and Applications

Ensemble Classifiers

• Basic idea: build different “experts” and let them vote

• Advantages:
  – Improves predictive performance
  – Different types of classifiers can be directly included
  – Easy to implement
  – Not too much parameter tuning

• Disadvantages:
  – The combined classifier is not transparent (a black box)
  – Not a compact representation

4

Page 5: CS 559: Machine Learning Fundamentals and Applications

Why do they work?

• Suppose there are 25 base classifiers
• Each classifier has error rate ε = 0.35
• Assume independence among the classifiers
• Probability that the ensemble classifier (majority vote) makes a wrong prediction:

$\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} = 0.06, \qquad \varepsilon = 0.35$

5
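As a quick check of the number above, a minimal Python sketch of this binomial calculation (assuming 25 independent classifiers, each with error rate 0.35, combined by majority vote):

# Probability that a majority vote of 25 independent classifiers errs,
# i.e. that 13 or more of them are wrong, when each has error rate 0.35.
from math import comb

eps, n = 0.35, 25
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                     for i in range(13, n + 1))
print(round(ensemble_error, 3))  # prints 0.06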

Page 6: CS 559: Machine Learning Fundamentals and Applications

Bagging

• Training
  – Given a dataset S, at each iteration i a training set Si is sampled with replacement from S (i.e. bootstrapping)
  – A classifier Ci is learned for each Si

• Classification: given an unseen sample X
  – Each classifier Ci returns its class prediction
  – The bagged classifier H counts the votes and assigns the class with the most votes to X

6
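A minimal sketch of this procedure in Python/NumPy (the decision-tree base learner is just one possible choice, and the integer class labels are an assumption of this sketch):

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # any base learner would do

def bagging_train(X, y, n_classifiers=25, seed=0):
    # Learn one classifier C_i per bootstrap sample S_i drawn with replacement from S
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)              # sampling with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    # Each C_i votes; H assigns the class with the most votes to each sample
    votes = np.stack([clf.predict(X) for clf in classifiers])   # (n_classifiers, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)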

Page 7: CS 559: Machine Learning Fundamentals and Applications

Bagging

• Bagging works because it reduces variance by voting/averaging
  – In some pathological hypothetical situations the overall error might increase
  – Usually, the more classifiers the better

• Problem: we only have one dataset

• Solution: generate new ones of size n by bootstrapping, i.e. sampling with replacement

• Can help a lot if data is noisy

7

Page 8: CS 559: Machine Learning Fundamentals and Applications

Aggregating Classifiers

• Bagging and Boosting

Breiman (1996) found gains in accuracy by aggregating predictors built from reweighted versions of the learning set

[Figure: reweighted learning sets → individual predictors → FINAL RULE]

8

Page 9: CS 559: Machine Learning Fundamentals and Applications

Bagging

• Bagging = Bootstrap Aggregating

• Reweighting of the learning sets is done by drawing at random with replacement from the learning sets

• Predictors are aggregated by plurality voting

9

Page 10: CS 559: Machine Learning Fundamentals and Applications

The Bagging Algorithm

• B bootstrap samples

• From which we derive:

B estimated probabilities: $p_1, p_2, p_3, \ldots, p_B \in [0, 1]$
B classifiers: $c_1, c_2, c_3, \ldots, c_B \in \{-1, +1\}$

The aggregate classifier becomes:

$c_{\mathrm{bag}}(x) = \mathrm{sign}\!\left(\frac{1}{B}\sum_{b=1}^{B} c_b(x)\right)$   or   $p_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} p_b(x)$

10

Page 11: CS 559: Machine Learning Fundamentals and Applications

[Figure: bagging – from the initial set, drawing with replacement 1 and drawing with replacement 2 produce the training sets for classifier 1 and classifier 2; their predictions are combined by weighting]

11

Page 12: CS 559: Machine Learning Fundamentals and Applications

[Figure: aggregation – classifier 1, classifier 2, classifier 3, …, classifier T are learned from the initial set and combined through a sign function into the final rule]

12

Page 13: CS 559: Machine Learning Fundamentals and Applications

Random Forests

Breiman L. Random Forests. Machine Learning 2001;45(1):5-32
http://stat-www.berkeley.edu/users/breiman/RandomForests/

13

Page 14: CS 559: Machine Learning Fundamentals and Applications

14

Decision Forests for computer vision and medical image analysis

A. Criminisi, J. Shotton and E. Konukoglu

http://research.microsoft.com/projects/decisionforests

Page 15: CS 559: Machine Learning Fundamentals and Applications

Decision Trees and Decision Forests

A forest is an ensemble of trees. The trees are all slightly different from one another.

15

[Figure: a general tree structure with a root node (0), internal (split) nodes and terminal (leaf) nodes numbered 0–14; and a decision tree whose split nodes ask “Is top part blue?”, “Is bottom part green?”, “Is bottom part blue?”]

Page 16: CS 559: Machine Learning Fundamentals and Applications

Decision Tree Testing (Runtime)

16

[Figure: decision tree testing – an input test point in feature space is passed down the tree, the data are split at each internal node, and a prediction is made at the leaf]

Page 17: CS 559: Machine Learning Fundamentals and Applications

17

Decision Tree Training (Offline)

How to split the data? Binary tree or ternary? How big a tree? What tree structure?

[Figure: input training data in feature space]

Page 18: CS 559: Machine Learning Fundamentals and Applications

18

Decision Tree Training (Offline)

How many trees? How different? How to fuse their outputs?

Page 19: CS 559: Machine Learning Fundamentals and Applications

Decision Forest Model

Basic notation

• Input data point: a collection of feature responses; dimensionality d = ?
• Feature response selector: features can be e.g. wavelets, pixel intensities, context
• Output/label space: categorical or continuous?
• Forest model:
  – Forest size: the total number of trees in the forest
  – Node weak learner: the test function for splitting data at a node j
  – Node test parameters: parameters related to each split node, i.e. (i) which features, (ii) what geometric primitive, (iii) thresholds
  – Node objective function (training): the “energy” to be minimized when training the j-th split node
  – Randomness model (training): how is randomness injected during training, and how much? e.g. (1) bagging, (2) randomized node optimization
  – Stopping criteria (training): when to stop growing a tree during training, e.g. a maximum tree depth
  – Leaf predictor model: point estimate or full distribution?
  – The ensemble model: how to compute the forest output from that of the individual trees

Page 20: CS 559: Machine Learning Fundamentals and Applications

Randomness Model

20

1) Bagging (randomizing the training set)

• The full training set
• The randomly sampled subset of training data made available for tree t

[Figure: forest training – each tree is trained efficiently on its own subset]

Page 21: CS 559: Machine Learning Fundamentals and Applications

Randomness Model

21

2) Randomized node optimization (RNO)

• The full set of all possible node test parameters
• For each node, a set of randomly sampled features
• Randomness control parameter: when every node may use the full parameter set there is no randomness and maximum tree correlation; when each node sees only a single randomly sampled value there is maximum randomness and minimum tree correlation

The effect of this parameter: a small value gives little tree correlation, a large value gives large tree correlation.

[Figure: node weak learner, node test parameters, node training]
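A minimal Python sketch of randomized node optimization for axis-aligned weak learners (the parameter name rho and the entropy-based objective are illustrative assumptions, not taken from the slides):

import numpy as np

def entropy(labels):
    # Shannon entropy of a discrete label set
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_random_split(X, y, rho=10, seed=0):
    # Evaluate only `rho` randomly sampled axis-aligned tests (feature, threshold)
    # instead of all possible ones; return the best test found.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(rho):
        f = rng.integers(X.shape[1])            # random feature
        t = rng.choice(X[:, f])                 # random threshold from the data
        left, right = y[X[:, f] < t], y[X[:, f] >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        # information gain = parent entropy - weighted child entropy
        gain = entropy(y) - (len(left) * entropy(left) +
                             len(right) * entropy(right)) / len(y)
        if best is None or gain > best[0]:
            best = (gain, f, t)
    return best  # (gain, feature index, threshold)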

Page 22: CS 559: Machine Learning Fundamentals and Applications

The Ensemble Model

An example forest to predict continuous variables

Page 23: CS 559: Machine Learning Fundamentals and Applications

Training and Information Gain

[Figure: node training – the training points before the split and after two candidate splits (Split 1, Split 2), compared via the information gain and Shannon's entropy]
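In standard form, the two quantities named on this slide are (writing S for the data at the node and S^i for the subsets produced by the split):

$$H(S) = -\sum_{c} p(c)\,\log p(c), \qquad I = H(S) - \sum_{i} \frac{|S^{i}|}{|S|}\, H(S^{i})$$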

Page 24: CS 559: Machine Learning Fundamentals and Applications

Overfitting and Underfitting

24

Page 25: CS 559: Machine Learning Fundamentals and Applications

Classification Forest

25

Model specialization for classification:

• Input data point: a vector of feature responses
• Output is categorical (a discrete set of class labels)
• Node weak learner: a test on a feature response
• Objective function for node j: information gain, based on the entropy of a discrete distribution
• Predictor model: the class posterior stored at each leaf

[Figure: training data in feature space with unlabeled query points (?); classification tree training by maximizing the information gain at node j]

Page 26: CS 559: Machine Learning Fundamentals and Applications

Weak Learners

26

Examples of weak learners for splitting the data at node j (the node test parameters select the feature response):

• Axis-aligned: the feature response is a single coordinate compared to a threshold (2D example)
• Oriented line: the feature response is a generic line in homogeneous coordinates (2D example)
• Conic section: the feature response uses a matrix representing a conic (2D example)

In general, a weak learner may select only a very small subset of the features.

Page 27: CS 559: Machine Learning Fundamentals and Applications

27

Prediction Model

What do we do at the leaf?

Prediction model: probabilistic (each leaf stores a distribution over the output)

Page 28: CS 559: Machine Learning Fundamentals and Applications

28

Classification Forest: Ensemble Model

[Figure: the class posteriors of individual trees t = 1, 2, 3 and the resulting forest output probability]

The ensemble model
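In the usual formulation (e.g. Criminisi et al.), the forest output shown here is simply the average of the per-tree posteriors:

$$p(c \mid \mathbf{v}) = \frac{1}{T}\sum_{t=1}^{T} p_t(c \mid \mathbf{v})$$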

Page 29: CS 559: Machine Learning Fundamentals and Applications

Effect of Tree Depth

Page 30: CS 559: Machine Learning Fundamentals and Applications

Effect of Weak Learner Model and Randomness

Parameters: T = 400 trees, probabilistic predictor model; tree depths D = 5 and D = 13

[Figure: testing posteriors for the axis-aligned, oriented-line and conic-section weak learners]

Page 31: CS 559: Machine Learning Fundamentals and Applications

Effect of Weak Learner Model and Randomness

Parameters: T = 400 trees, probabilistic predictor model; tree depths D = 5 and D = 13

[Figure: testing posteriors for the axis-aligned weak learner]

Page 32: CS 559: Machine Learning Fundamentals and Applications

Body tracking in Microsoft Kinect for Xbox 360

[Figure: input depth image; training labelled data with body parts such as left hand, right shoulder, neck, right foot; visual features]

• Labels are categorical (body parts)
• Input data point: defined on the input depth image
• Classification forest: the visual features provide the node parameters, weak learner and feature response; node training uses the objective function; the predictor model gives the class posterior

Page 33: CS 559: Machine Learning Fundamentals and Applications

33

Body tracking in Microsoft Kinect for Xbox 360

[Figure: input depth image (background removed) and the inferred body-part posterior]

Page 34: CS 559: Machine Learning Fundamentals and Applications

Advantages of Random Forests

• Very high accuracy – not easily surpassed by other algorithms
• Efficient on large datasets
• Can handle thousands of input variables without variable deletion
• Effective method for estimating missing data; also maintains accuracy when a large proportion of the data are missing
• Can handle categorical variables
• Robust to label noise
• Can be used for clustering, locating outliers and semi-supervised learning

(L. Breiman's web page)

34

Page 35: CS 559: Machine Learning Fundamentals and Applications

Boosting

35

Page 36: CS 559: Machine Learning Fundamentals and Applications

Boosting Resources

• Slides based on:
  – Tutorial by Rob Schapire
  – Tutorial by Yoav Freund
  – Slides by Carlos Guestrin
  – Tutorial by Paul Viola
  – Tutorial by Ron Meir
  – Slides by Aurélie Lemmens
  – Slides by Zhuowen Tu

36

Page 37: CS 559: Machine Learning Fundamentals and Applications

Code

• Antonio Torralba (object detection)
  – http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html

• GML AdaBoost
  – http://graphics.cs.msu.ru/ru/science/research/machinelearning/adaboosttoolbox

37

Page 38: CS 559: Machine Learning Fundamentals and Applications

Boosting

• Invented independently by Schapire (1989) and Freund (1990)
  – Later joined forces

• Main idea: train a strong classifier by combining weak classifiers
  – Practically useful
  – Theoretically interesting

Page 39: CS 559: Machine Learning Fundamentals and Applications

Boosting

• Given a set of weak learners, run them multiple times on (reweighted) training data, then let the learned classifiers vote

• At each iteration t:
  – Weight each training example by how incorrectly it was classified
  – Learn a hypothesis ht (the one with the smallest error)
  – Choose a strength αt for this hypothesis

• Final classifier: a weighted combination of the weak learners

39

Page 40: CS 559: Machine Learning Fundamentals and Applications

Learning from Weighted Data

• Sometimes not all data points are equal
  – Some data points are more equal than others

• Consider a weighted dataset
  – D(i): weight of the i-th training example (xi, yi)
  – Interpretations:
    • The i-th training example counts as D(i) examples
    • If I were to “resample” the data, I would get more samples of “heavier” data points

• Now, in all calculations, the i-th training example counts as D(i) “examples”

40

Page 41: CS 559: Machine Learning Fundamentals and Applications

Definition of Boosting

• Given a training set (x1, y1), …, (xm, ym)
• yi ∈ {−1, +1} is the correct label of instance xi ∈ X
• For t = 1, …, T:
  – construct a distribution Dt on {1, …, m}
  – find a weak hypothesis ht: X → {−1, +1} with small error εt on Dt

• Output the final hypothesis Hfinal

41

Page 42: CS 559: Machine Learning Fundamentals and Applications

AdaBoost

• Constructing Dt:
  – D1 = 1/m
  – Given Dt and ht, form Dt+1 by reweighting, where Zt is a normalization constant

• Final hypothesis: Hfinal

42
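The update referred to here, written out in its standard form:

$$D_{t+1}(i) = \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}, \qquad H_{\text{final}}(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)$$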

Page 43: CS 559: Machine Learning Fundamentals and Applications

The AdaBoost Algorithm

43

Page 44: CS 559: Machine Learning Fundamentals and Applications

The AdaBoost Algorithm

44

with minimum

Page 45: CS 559: Machine Learning Fundamentals and Applications

The AdaBoost Algorithm

45

Page 46: CS 559: Machine Learning Fundamentals and Applications

The AdaBoost Algorithm

46
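A minimal Python sketch of the algorithm on these slides, using decision stumps as the weak learners (the stump search and all function names are illustrative, not taken from the slides):

import numpy as np

def train_stump(X, y, D):
    # Weak learner: pick the (feature, threshold, polarity) with the smallest
    # weighted error under the current distribution D.
    best_err, best_params = np.inf, None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1, -1):
                pred = np.where(s * (X[:, f] - t) >= 0, 1, -1)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best_params = err, (f, t, s)
    return best_err, best_params

def adaboost(X, y, T=20):
    # y is assumed to be in {-1, +1}
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        eps, (f, t, s) = train_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength of this hypothesis
        pred = np.where(s * (X[:, f] - t) >= 0, 1, -1)
        D = D * np.exp(-alpha * y * pred)        # upweight misclassified examples
        D = D / D.sum()                          # normalize (the Z_t step)
        stumps.append((f, t, s))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # H_final(x) = sign( sum_t alpha_t * h_t(x) )
    score = sum(a * np.where(s * (X[:, f] - t) >= 0, 1, -1)
                for a, (f, t, s) in zip(alphas, stumps))
    return np.sign(score)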

Page 47: CS 559: Machine Learning Fundamentals and Applications

Toy Example

47

D1

Page 48: CS 559: Machine Learning Fundamentals and Applications

Toy Example: Round 1

48

D2

ε1 = 0.30, α1 = 0.42

Page 49: CS 559: Machine Learning Fundamentals and Applications

Toy Example: Round 2

49

D3

ε2 = 0.21, α2 = 0.65

Page 50: CS 559: Machine Learning Fundamentals and Applications

Toy Example: Round 3

50

ε3 = 0.14, α3 = 0.92

Page 51: CS 559: Machine Learning Fundamentals and Applications

Toy Example: Final Hypothesis

51

Hfinal

Page 52: CS 559: Machine Learning Fundamentals and Applications

How to choose Weights

Training error of final classifier is bounded by:

where:

52

Notice that:

Page 53: CS 559: Machine Learning Fundamentals and Applications

How to choose Weights

Training error of the final classifier is bounded by:

where

In the final round:

53

Page 54: CS 559: Machine Learning Fundamentals and Applications

How to choose Weights

• If we minimize Πt Zt, we minimize our training error
  – We can tighten this bound greedily by choosing αt and ht in each iteration to minimize Zt

• For a boolean target function, this is accomplished by [Freund & Schapire ’97]:

54
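The standard Freund–Schapire choice that the slide refers to (for binary labels in {−1, +1}):

$$\alpha_t = \frac{1}{2}\,\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$$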

Page 55: CS 559: Machine Learning Fundamentals and Applications

Weak and Strong Classifiers

• If each classifier is (at least slightly) better than random
  – εt < 0.5

• AdaBoost will achieve zero training error (exponentially fast)

• Is it hard to achieve better-than-random training error?

55

Page 56: CS 559: Machine Learning Fundamentals and Applications

Important Aspects of Boosting

• Exponential loss function
• Choice of weak learners
• Generalization and overfitting
• Multi-class boosting

56

Page 57: CS 559: Machine Learning Fundamentals and Applications

Exponential Loss Function

• The exponential loss function is an upper bound on the 0–1 loss function (classification error)

• AdaBoost provably minimizes the exponential loss

• Therefore, it also minimizes an upper bound on the classification error

57

Page 58: CS 559: Machine Learning Fundamentals and Applications

Exponential Loss Function

• AdaBoost attempts to minimize: (*)

• It is really a coordinate descent procedure
  – At each round, add αt ht to the sum so as to minimize (*)

• Why this loss function?
  – it is an upper bound on the training (classification) error
  – it is easy to work with
  – it connects to logistic regression

58
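Written out, the objective (*) is the exponential loss of the weighted vote f(x), which upper-bounds the 0–1 loss since exp(−z) ≥ 1 whenever z ≤ 0:

$$\sum_{i=1}^{m} \exp\!\big(-y_i\, f(x_i)\big), \qquad f(x) = \sum_{t} \alpha_t\, h_t(x)$$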

Page 59: CS 559: Machine Learning Fundamentals and Applications

Coordinate Descent Explanation

59

Page 60: CS 559: Machine Learning Fundamentals and Applications

Weak Learners

• Stumps:
  – Single axis-parallel partition of the space

• Decision trees:
  – Hierarchical partition of the space

• Multi-layer perceptrons:
  – General nonlinear function approximators

60

[Figure: a stump on feature X – if X < 5 predict Y = 1, if X ≥ 5 predict Y = −1]

Page 61: CS 559: Machine Learning Fundamentals and Applications

Decision Trees

• Hierarchical and recursive partitioning of the feature space
• A simple model (e.g. a constant) is fit in each region
• Often, splits are parallel to the axes

61

Page 62: CS 559: Machine Learning Fundamentals and Applications

Decision Trees – Nominal Features

62

Page 63: CS 559: Machine Learning Fundamentals and Applications

Decision Trees - Instability

63

Page 64: CS 559: Machine Learning Fundamentals and Applications

Boosting: Analysis of Training Error

• Training error of final classifier is bounded by:

• For binary classifiers with choice of αt as before, the error is bounded by

64

Page 65: CS 559: Machine Learning Fundamentals and Applications

Analysis of Training Error

• If each base classifier is slightly better than random, so that there exists γ > 0 with γt > γ for all t

• Then the training error drops exponentially fast in T (see the bound below)

• AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a true weak learning algorithm into a strong learning algorithm
  – Weak learning algorithm: can always generate a classifier with a weak edge for any distribution
  – Strong learning algorithm: can generate a classifier with an arbitrarily low error rate, given sufficient data

65

training error $\le \exp\!\left(-2\sum_{t} \gamma_t^2\right) \le \exp\!\left(-2\gamma^2 T\right)$

Page 66: CS 559: Machine Learning Fundamentals and Applications

Generalization Error

• T – number of boosting rounds
• d – VC dimension of the weak learner, which measures the complexity of the classifier
  – The Vapnik–Chervonenkis (VC) dimension is a standard measure of the “complexity” of a space of binary functions
• m – number of training examples

66

Page 67: CS 559: Machine Learning Fundamentals and Applications

Overfitting

• This bound suggests that boosting will overfit if run for too many rounds

• Several authors observed empirically that boosting often does not overfit, even when run for thousands of rounds
  – Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the bound above

67

Page 68: CS 559: Machine Learning Fundamentals and Applications

Analysis of Margins

• An alternative analysis can be made in terms of the margins of the training examples. The margin of an example (x, y) is:

• It is a number in [−1, 1], and it is positive when the example is correctly classified

• Larger margins on the training set translate into a superior upper bound on the generalization error

68
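In standard form, the (normalized) margin referred to here is:

$$\mathrm{margin}(x, y) = \frac{y \sum_{t} \alpha_t\, h_t(x)}{\sum_{t} \alpha_t} \in [-1, 1]$$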

Page 69: CS 559: Machine Learning Fundamentals and Applications

Analysis of Margins

• It can be shown that the generalization error is at most a bound that is independent of T

• Boosting is particularly aggressive at increasing the margin, since it concentrates on the examples with the smallest margins (positive or negative)

69

Page 70: CS 559: Machine Learning Fundamentals and Applications

Error Rates and Margins

70

Page 71: CS 559: Machine Learning Fundamentals and Applications

Margin Analysis

• Margin theory gives a qualitative explanation of the effectiveness of boosting

• Quantitatively, the bounds are rather weak

• One classifier can have a margin distribution that is better than that of another classifier, and yet be inferior in test accuracy

• Margin theory points to a strong connection between boosting and support vector machines

71

Page 72: CS 559: Machine Learning Fundamentals and Applications

Advantages of Boosting

• Simple and easy to implement
• Flexible: can be combined with any learning algorithm
• No requirement that the data lie in a metric space – data features don’t need to be normalized, as in kNN and SVMs (this has been a central problem in machine learning)
• Feature selection and fusion are naturally combined, with the same goal of minimizing an objective error function

72

Page 73: CS 559: Machine Learning Fundamentals and Applications

Advantages of Boosting (cont.)

• It can be shown that if a gap exists between positive and negative points, the generalization error converges to zero
• No parameters to tune (except perhaps T)
• No prior knowledge needed about the weak learner
• Provably effective
• Versatile: can be applied to a wide variety of problems
• Non-parametric

73

Page 74: CS 559: Machine Learning Fundamentals and Applications

Disadvantages of Boosting

• Performance of AdaBoost depends on the data and the weak learner

• Consistent with theory, AdaBoost can fail if
  – the weak classifier is too complex (overfitting)
  – the weak classifier is too weak (underfitting)

• Empirically, AdaBoost seems especially susceptible to uniform noise

• Decision boundaries are often rugged

74

Page 75: CS 559: Machine Learning Fundamentals and Applications

Multi-class AdaBoost

• Assume y ∈ {1, …, k}
• Direct approach (AdaBoost.M1):
  – one can prove the same bound on the error if εt < 1/2
  – else: abort

75

Page 76: CS 559: Machine Learning Fundamentals and Applications

Limitation of AdaBoost.M1

• Achieving εt < 1/2 may be hard if k (the number of classes) is large

• [Mukherjee and Schapire, 2010]: weak learners that perform only slightly better than random chance can be used in a multi-class boosting framework
  – Out of scope for now

76

Page 77: CS 559: Machine Learning Fundamentals and Applications

Reducing to Binary Problems

• Say the possible labels are {a, b, c, d, e}
• Each training example is replaced by five {−1, +1}-labeled examples

77

Page 78: CS 559: Machine Learning Fundamentals and Applications

AdaBoost.MH

• Formally

where

Can prove that

78

Used to be ht: X → {−1, +1}

Page 79: CS 559: Machine Learning Fundamentals and Applications

Random Forests vs. Boosting

• RF pros:
  – More robust
  – Faster to train (no reweighting; each split considers only a small subset of the data and features)
  – Can handle missing/partial data
  – Easier to extend to an online version

• RF cons:
  – The feature selection process is not explicit
  – Weaker performance on small training sets
  – Weaker theoretical foundations

79

Page 80: CS 559: Machine Learning Fundamentals and Applications

Applications of Boosting

Real-time face detection using a classifier cascade [Viola and Jones, 2001 and 2004]

80

Page 81: CS 559: Machine Learning Fundamentals and Applications

The Classical Face Detection Process

[Figure: a detection window scanned over the image at the smallest scale and at larger scales – roughly 50,000 locations/scales per image]

81

Page 82: CS 559: Machine Learning Fundamentals and Applications

Classifier is Trained on Labeled Data

• Training data
  – 5,000 faces (all frontal)
  – 10^8 non-faces
  – Faces are normalized for scale and translation

• Many variations
  – Across individuals
  – Illumination
  – Pose (rotation both in plane and out of plane)

82

Page 83: CS 559: Machine Learning Fundamentals and Applications

Key Properties of Face Detection

• Each image contains 10,000 – 50,000 locations/scales

• Faces are rare: 0 – 50 per image
  – roughly 1,000 times as many non-faces as faces

• Goal: an extremely small rate of false positives: 10^-6

83

Page 84: CS 559: Machine Learning Fundamentals and Applications

“Support Vectors”

84

Challenging negative examples are extremely important

Page 85: CS 559: Machine Learning Fundamentals and Applications

Classifier Cascade (Viola-Jones)

• For real problems, results are only as good as the features used...
  – This is the main piece of ad-hoc (or domain) knowledge

• Rather than the pixels, use a very large set of simple functions
  – Sensitive to edges and other critical features of the image
  – Computed at multiple scales

• Introduce a threshold to yield binary features
  – Binary features seem to work better in practice
  – In general, convert continuous features to binary by quantizing

85

Page 86: CS 559: Machine Learning Fundamentals and Applications

Boosted Face Detection: Image Features

“Rectangle filters”

Similar to Haar wavelets

60,000 × 100 = 6,000,000 unique binary features

$h_t(x) = \begin{cases} \alpha_t & \text{if } f_t(x) > \theta_t \\ \beta_t & \text{otherwise} \end{cases}$

$C(x) = \sum_t h_t(x) + b$

86
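A minimal sketch of how one such thresholded rectangle feature can be computed, using an integral image so that every rectangle sum costs four lookups (the two-rectangle layout and all names here are illustrative, not the library used by Viola and Jones):

import numpy as np

def integral_image(img):
    # Cumulative sums so that any rectangle sum costs four lookups
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    # Sum of pixels in img[r0:r1, c0:c1] computed from the integral image ii
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0: total -= ii[r0 - 1, c1 - 1]
    if c0 > 0: total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, r, c, h, w):
    # Horizontal two-rectangle filter: left half minus right half
    return rect_sum(ii, r, c, r + h, c + w // 2) - \
           rect_sum(ii, r, c + w // 2, r + h, c + w)

def weak_classify(feature_value, theta, alpha=1.0, beta=0.0):
    # h_t(x) = alpha if f_t(x) > theta else beta, as in the formula above
    return alpha if feature_value > theta else beta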

Page 87: CS 559: Machine Learning Fundamentals and Applications

Feature Selection

• For each round of boosting:
  – Evaluate each rectangle filter on each example
  – Sort the examples by filter value
  – Select the best threshold for each filter
  – Select the best filter/threshold combination (= feature)
  – Reweight the examples

87
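A sketch of the “sort examples by filter value, then select the best threshold” step for a single filter under the current boosting weights (labels assumed to be 1 for faces and 0 for non-faces; names illustrative):

import numpy as np

def best_threshold(values, labels, weights):
    # Sort examples by filter value and pick the threshold with minimum weighted error.
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    w_pos, w_neg = w[y == 1].sum(), w[y == 0].sum()
    s_pos = np.cumsum(w * y)          # weight of positives at or below each value
    s_neg = np.cumsum(w * (1 - y))    # weight of negatives at or below each value
    # error when predicting "face" above the threshold, or the reverse polarity
    err = np.minimum(s_pos + (w_neg - s_neg), s_neg + (w_pos - s_pos))
    i = int(err.argmin())
    return v[i], err[i]               # (threshold, weighted error)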

Page 88: CS 559: Machine Learning Fundamentals and Applications

Example Classifier for Face Detection

ROC curve for a 200-feature classifier

A classifier with 200 rectangle features was learned using AdaBoost

95% correct detection on the test set, with 1 false positive in 14,084

88

Page 89: CS 559: Machine Learning Fundamentals and Applications

Building Fast Classifiers

• In general, simple classifiers are more efficient, but they are also weaker

• We could define a computational risk hierarchy
  – A nested set of classifier classes

• The training process is reminiscent of boosting…
  – Previous classifiers reweight the examples used to train subsequent classifiers

• The goal of the training process is different
  – Minimize errors, but also minimize false positives

89

Page 90: CS 559: Machine Learning Fundamentals and Applications

Cascaded Classifier

[Diagram: image sub-window → 1-feature stage → 5-feature stage → 20-feature stage → FACE; a sub-window rejected (F) at any stage is labeled NON-FACE; roughly 50%, 20% and 2% of sub-windows survive the successive stages]

• A 1-feature classifier achieves a 100% detection rate and about a 50% false positive rate
• A 5-feature classifier achieves a 100% detection rate and a 40% false positive rate
  – using data from the previous stage
• A 20-feature classifier achieves a 100% detection rate with a 10% false positive rate

90
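A sketch of how such a cascade is applied to one image sub-window: each stage is a (progressively larger) boosted classifier, and a window is reported as a face only if it passes every stage (the stage representation here is an assumption for illustration):

def cascade_detect(stages, window):
    # stages: list of (score_fn, threshold) pairs, cheapest first.
    # Most non-faces are rejected by the early, cheap stages, which is
    # what makes the detector fast on average.
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early: NON-FACE
    return True                   # passed all stages: FACE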

Page 91: CS 559: Machine Learning Fundamentals and Applications

Output of Face Detector on Test Images

91

Page 92: CS 559: Machine Learning Fundamentals and Applications

Solving other “Face” Tasks

Facial Feature Localization

Demographic Analysis

Profile Detection

92

Page 93: CS 559: Machine Learning Fundamentals and Applications

Feature Localization

• Surprising properties of the Viola-Jones framework
  – The cost of detection is not a function of image size
    • Just of the number of features
  – Learning automatically focuses attention on key regions

• Conclusion: the “feature” detector can include a large contextual region around the feature

[Figure: sub-windows rejected at the final stages]

93

Page 94: CS 559: Machine Learning Fundamentals and Applications

Feature Localization

• Learned features reflect the task

94

Page 95: CS 559: Machine Learning Fundamentals and Applications

Profile Detection

95

Page 96: CS 559: Machine Learning Fundamentals and Applications

Profile Features

96