Lecture 2: Learning without Over-learning
Isabelle Guyon, isabelle@clopinet.com


Page 1: Lecture 2: Learning without Over-learning Isabelle Guyon isabelle@clopinet.com

Lecture 2: Learning without Over-learning

Isabelle Guyon, isabelle@clopinet.com

Page 2:

Machine Learning

• Learning machines include:
  – Linear discriminant (including Naïve Bayes)
  – Kernel methods
  – Neural networks
  – Decision trees

• Learning is tuning:
  – Parameters (weights w or α, threshold b)
  – Hyperparameters (basis functions, kernels, number of units)

Page 3:

Conventions

Data matrix X = {x_ij}: m examples (rows x_i), n features (columns).
Target vector y = {y_j}.
Weight vector w.

Page 4:

What is a Risk Functional?

• A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.

[Figure: the risk R[f(x,w)] plotted over the parameter space (w), with its minimum at w*.]

Page 5:

Examples of risk functionals

• Classification:

  – Error rate: (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i)

  – 1 - AUC

• Regression:

  – Mean square error: (1/m) Σ_{i=1..m} (f(x_i) - y_i)²
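A minimal NumPy sketch of these two empirical risks (the function names are illustrative, not from the lecture):

```python
import numpy as np

def error_rate(F_x, y):
    # classification risk: fraction of predictions that differ from the targets
    return np.mean(F_x != y)

def mean_square_error(f_x, y):
    # regression risk: average squared residual
    return np.mean((f_x - y) ** 2)

y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])
print(error_rate(y_pred, y_true))  # 2 of 4 predictions wrong -> 0.5

f_true = np.array([1.0, 2.0, 3.0])
f_pred = np.array([0.9, 2.1, 3.0])
print(mean_square_error(f_pred, f_true))
```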

Page 6:

How to Train?

• Define a risk functional R[f(x,w)].

• Find a method to optimize it, typically "gradient descent":

  w_j ← w_j - η ∂R/∂w_j

or any optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.).
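A toy run of the gradient-descent rule above, minimizing the mean square error of a linear model (the data, the learning rate η = 0.1, and the iteration count are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                          # noiseless targets for simplicity

w = np.zeros(3)
eta = 0.1                               # learning rate
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)   # dR/dw for the mean square error
    w -= eta * grad

print(w)  # converges close to w_true
```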

Page 7:

Fit / Robustness Tradeoff

[Figure: two decision boundaries in the (x1, x2) plane, contrasting a boundary that fits the training points tightly with a smoother, more robust one.]

Page 8:

Overfitting

Example: Polynomial regression

Learning machine: y = w0 + w1 x + w2 x² + ... + w10 x^10

Target: a 10th degree polynomial + noise

[Figure: the fitted curve y(x) over x ∈ [-10, 10], y ∈ [-0.5, 1.5], with the error shown between the learned curve and the target.]
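This overfitting effect reproduces in a few lines: a degree-10 polynomial fitted to a handful of noisy samples drives the training error well below the test error (the target coefficients, noise level, and sample sizes are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
coef = rng.normal(size=11)       # a made-up 10th degree target polynomial
noise = 0.1

x_train = rng.uniform(-1, 1, size=13)
y_train = np.polyval(coef, x_train) + noise * rng.normal(size=13)

w = np.polyfit(x_train, y_train, deg=10)   # the degree-10 learning machine

x_test = rng.uniform(-1, 1, size=500)
y_test = np.polyval(coef, x_test) + noise * rng.normal(size=500)

def mse(a, b):
    return np.mean((a - b) ** 2)

train_err = mse(np.polyval(w, x_train), y_train)
test_err = mse(np.polyval(w, x_test), y_test)
print(train_err, test_err)   # training error is smaller than test error
```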

Page 9:

Underfitting

Example: Polynomial regression

Linear model: y = w0 + w1 x

Target: a 10th degree polynomial + noise

[Figure: the fitted line over x ∈ [-10, 10], y ∈ [-0.5, 1.5], with the error shown between the line and the target.]

Page 10:

Variance

[Figure: several fits y(x) over x ∈ [-10, 10], y ∈ [-1.5, 2], obtained from different training sets of the same size, scattering around their mean and illustrating variance.]

Page 11:

Bias

[Figure: fits y(x) over x ∈ [-10, 10], y ∈ [-1.5, 2], whose average deviates systematically from the target, illustrating bias.]

Page 12:

Ockham’s Razor

• Principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate" (plurality should not be posited without necessity).

• Of two theories providing similarly good predictions, prefer the simplest one.

• Shave off unnecessary parameters of your models.

Page 13:

The Power of Amnesia

• The human brain is made out of billions of cells or Neurons, which are highly interconnected by synapses.

• Exposure to enriched environments with extra sensory and social stimulation enhances the connectivity of the synapses, but children and adolescents can lose them up to 20 million per day.

Page 14:

Artificial Neurons

f(x) = w · x + b

[Figure: inputs x1, x2, ..., xn and a constant 1 are weighted by w1, w2, ..., wn and the bias b, summed into the cell potential, then passed through an activation function to produce f(x). The biological analogy labels the dendrites, synapses, axon, and the activation of other neurons.]

McCulloch and Pitts, 1943

Page 15:

Hebb's Rule

w_j ← w_j + η y_i x_ij

[Figure: a synapse of strength w_j connects the activation x_j of another neuron, via a dendrite, to the output y carried by the axon.]

Page 16:

Weight Decay

w_j ← w_j + η y_i x_ij            (Hebb's rule)

w_j ← (1 - γ) w_j + η y_i x_ij    (weight decay)

γ ∈ [0, 1], decay parameter
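A quick contrast of the two updates: under repeated presentation of the same pattern, pure Hebbian weights grow without bound, while weight decay drives them to the fixed point η y x / γ (all constants below are illustrative):

```python
import numpy as np

def hebb_update(w, x_i, y_i, eta=0.1):
    return w + eta * y_i * x_i                  # Hebb's rule: unbounded growth

def decay_update(w, x_i, y_i, eta=0.1, gamma=0.01):
    return (1 - gamma) * w + eta * y_i * x_i    # weight decay: shrink, then reinforce

x, y = np.array([1.0, 2.0]), 1.0

w_hebb = np.zeros(2)
w_decay = np.zeros(2)
for _ in range(2000):
    w_hebb = hebb_update(w_hebb, x, y)
    w_decay = decay_update(w_decay, x, y)

print(w_hebb)   # grows linearly: roughly eta * 2000 * y * x = [200, 400]
print(w_decay)  # approaches the fixed point eta * y * x / gamma = [10, 20]
```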

Page 17:

Overfitting Avoidance

Example: Polynomial regression

Target: a 10th degree polynomial + noise

Learning machine: y = w0 + w1 x + w2 x² + ... + w10 x^10

[Figure: a grid of fits with degree d = 10 and ridge (weight decay) parameter r = 0.01, 0.1, 1, 10, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8; as r increases, the fitted curve becomes smoother, passing from overfitting to underfitting.]

Page 18:

Weight Decay for MLP

Replace: w_j ← w_j + η back_prop(j)

by:      w_j ← (1 - γ) w_j + η back_prop(j)

Page 19:

Theoretical Foundations

• Structural Risk Minimization

• Bayesian priors

• Minimum Description Length

• Bias/variance tradeoff

Page 20:

Risk Minimization

• Examples are given:

  (x1, y1), (x2, y2), ..., (xm, ym)

• Learning problem: find the best function f(x; w) minimizing a risk functional

  R[f] = ∫ L(f(x; w), y) dP(x, y)

  where L is the loss function and P(x, y) the unknown data distribution.

Page 21:

Approximations of R[f]

• Empirical risk: Rtrain[f] = (1/m) Σ_{i=1..m} L(f(x_i; w), y_i)

  – 0/1 loss 1(F(x_i) ≠ y_i): Rtrain[f] = error rate

  – square loss (f(x_i) - y_i)²: Rtrain[f] = mean square error

• Guaranteed risk:

  With high probability (1 - δ), R[f] ≤ Rgua[f]

  Rgua[f] = Rtrain[f] + ε(C)

Page 22:

Structural Risk Minimization

Vapnik, 1974

Nested subsets of models of increasing complexity/capacity:

  S1 ⊂ S2 ⊂ ... ⊂ SN

[Figure: nested sets S1 ⊂ S2 ⊂ S3 of increasing complexity; a plot against the complexity/capacity C shows the training error Tr decreasing, the term ε (a function of model complexity C) increasing, and the guaranteed risk Ga = Tr + ε(C) passing through a minimum.]

Page 23:

SRM Example (linear model)

• Rank with ||w||² = Σ_i w_i²:

  S_k = { w | ||w||² < ω_k² },  ω1 < ω2 < ... < ωn

• Minimization under constraint:

  min Rtrain[f]  s.t.  ||w||² < ω_k²

• Lagrangian:

  Rreg[f, λ] = Rtrain[f] + λ ||w||²

[Figure: the nested structure S1 ⊂ S2 ⊂ ... ⊂ SN, with the risk R decreasing as the capacity grows.]

Page 24:

Gradient Descent

Rreg[f] = Remp[f] + λ ||w||²        (SRM/regularization)

w_j ← w_j - η ∂Rreg/∂w_j

w_j ← w_j - η ∂Remp/∂w_j - 2ηλ w_j

w_j ← (1 - γ) w_j - η ∂Remp/∂w_j    (weight decay, γ = 2ηλ)
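The equivalence claimed in the last line can be checked numerically: one gradient step on the regularized risk coincides with one weight-decay step using γ = 2ηλ (the data here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam, eta = 0.5, 0.01

def grad_emp(w):
    return 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the mean square error

w0 = np.ones(4)
# one step on the regularized risk Rreg = Remp + lam * ||w||^2
w_reg = w0 - eta * (grad_emp(w0) + 2 * lam * w0)
# one weight-decay step with gamma = 2 * eta * lam
gamma = 2 * eta * lam
w_dec = (1 - gamma) * w0 - eta * grad_emp(w0)
print(np.allclose(w_reg, w_dec))  # True: the two updates coincide
```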

Page 25:

Multiple Structures

• Shrinkage (weight decay, ridge regression, SVM):

  S_k = { w | ||w||² < ω_k },  ω1 < ω2 < ... < ωk

  λ1 > λ2 > λ3 > ... > λk  (λ is the ridge)

• Feature selection:

  S_k = { w | ||w||₀ < σ_k },  σ1 < σ2 < ... < σk  (σ is the number of features)

• Data compression:

  κ1 < κ2 < ... < κk  (κ may be the number of clusters)

Page 26:

Hyper-parameter Selection

• Learning = adjusting: parameters (the w vector) and hyper-parameters (γ).

• Cross-validation with K folds. For various values of γ:
  - Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10th.
  - Test on the 1/K remaining examples, e.g. 1/10th.
  - Rotate examples and average the test results (CV error).
  - Select γ to minimize the CV error.
  - Re-compute w on all training examples using the optimal γ.

[Figure: the data (X, y) are split into training data, made into K folds, and test data held out for a prospective study / "real" validation.]
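The procedure above can be sketched for ridge regression, where the hyper-parameter is the ridge; the candidate values and the closed-form ridge solver are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def ridge_fit(X, y, ridge):
    # closed-form minimizer of ||Xw - y||^2 + ridge * ||w||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def cv_error(X, y, ridge, K=10):
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], ridge)            # adjust w on (K-1)/K of the data
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))  # test on the remaining 1/K
    return np.mean(errs)                                    # rotate and average

candidates = [1e-3, 1e-1, 10.0, 1000.0]
best = min(candidates, key=lambda r: cv_error(X, y, r))  # minimize the CV error
w_final = ridge_fit(X, y, best)                          # re-fit on all training data
print(best)
```

The held-out test data would then provide the prospective / "real" validation of `w_final`.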

Page 27:

Summary

• High complexity models may "overfit": fit the training examples perfectly, yet generalize poorly to new cases.

• SRM solution: organize the models in nested subsets such that, in every element of the structure, complexity < threshold.

• Regularization: formalize learning as a constrained optimization problem; minimize the regularized risk = training error + penalty.

Page 28:

Bayesian MAP ≡ SRM

• Maximum A Posteriori (MAP):

  f = argmax P(f|D) = argmax P(D|f) P(f) = argmin -log P(D|f) - log P(f)

• Structural Risk Minimization (SRM):

  f = argmin Remp[f] + λ Ω[f]

  Negative log likelihood = empirical risk Remp[f]
  Negative log prior = regularizer λ Ω[f]

Page 29:

Example: Gaussian Prior

• Linear model: f(x) = w · x

• Gaussian prior: P(f) ∝ exp(-||w||²/σ²)

• Regularizer: Ω[f] = -log P(f) = λ ||w||²

[Figure: concentric level curves of the prior in the (w1, w2) plane.]
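Spelling out the correspondence with the MAP view (writing the prior as exp(-λ||w||²), i.e. absorbing the prior width into λ, an assumed parameterization):

```latex
P(f) \propto e^{-\lambda \|w\|^2}
\quad\Longrightarrow\quad
\Omega[f] = -\log P(f) = \lambda \|w\|^2 + \mathrm{const}
```

so that MAP estimation, argmax P(D|f) P(f) = argmin [-log P(D|f) + λ||w||²], is exactly the SRM objective with the negative log likelihood playing the role of Remp[f].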

Page 30:

Minimum Description Length

• MDL: minimize the length of the "message".

• Two part code: transmit the model and the residual.

• f = argmin -log₂ P(D|f) - log₂ P(f)

  -log₂ P(f): length of the shortest code to encode the model (model complexity)
  -log₂ P(D|f): residual, length of the shortest code to encode the data given the model

Page 31:

Bias-variance tradeoff

• f trained on a training set D of size m (m fixed)

• For the square loss:

  E_D[f(x) - y]² = [E_D f(x) - y]² + E_D[f(x) - E_D f(x)]²

  Expected value of the loss over datasets D of the same size = Bias² + Variance

  Bias²: squared deviation of the average prediction E_D f(x) from the target y
  Variance: spread of the predictions f(x) around their average E_D f(x)
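The decomposition can be verified by simulation: train a (biased) linear model on many training sets D of the same size, then compare the expected loss at a fixed point with bias² + variance (the quadratic target and all sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x0, y0 = 0.5, 0.25        # fixed query point; target y = x^2, so y0 = x0^2

preds = []
for _ in range(2000):                          # many training sets D of size m = 20
    xs = rng.uniform(-1, 1, size=20)
    ys = xs ** 2 + 0.1 * rng.normal(size=20)
    w = np.polyfit(xs, ys, deg=1)              # linear learner: biased for this target
    preds.append(np.polyval(w, x0))
preds = np.array(preds)

expected_loss = np.mean((preds - y0) ** 2)     # E_D [f(x0) - y0]^2
bias2 = (np.mean(preds) - y0) ** 2             # [E_D f(x0) - y0]^2
variance = np.var(preds)                       # E_D [f(x0) - E_D f(x0)]^2
print(np.isclose(expected_loss, bias2 + variance))  # True
```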

Page 32:

Bias

[Figure: fits y(x) over x ∈ [-10, 10], y ∈ [-1.5, 2], whose average deviates systematically from the target (same bias illustration as page 11).]

Page 33:

Variance

[Figure: several fits y(x) over x ∈ [-10, 10], y ∈ [-1.5, 2], obtained from different training sets (same variance illustration as page 10).]

Page 34:

The Effect of SRM

Reduces the variance… at the expense of introducing some bias.

Page 35:

Ensemble Methods

E_D[f(x) - y]² = [E_D f(x) - y]² + E_D[f(x) - E_D f(x)]²

• Variance can also be reduced with committee machines.

• The committee members “vote” to make the final decision.

• Committee members are built e.g. with data subsamples.

• Each committee member should have a low bias (no use of ridge/weight decay).
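A sketch of that variance reduction, using independent training draws rather than bootstrap subsamples for simplicity (all settings are illustrative): averaging 25 low-bias members shrinks the prediction variance roughly 25-fold.

```python
import numpy as np

def fit_predict(x0, seed):
    # train one committee member on its own data draw, then predict at x0
    r = np.random.default_rng(seed)
    xs = r.uniform(-1, 1, size=30)
    ys = xs ** 3 + 0.3 * r.normal(size=30)
    w = np.polyfit(xs, ys, deg=5)      # low-bias, high-variance member
    return np.polyval(w, x0)

x0 = 0.5
singles = [fit_predict(x0, s) for s in range(500)]
committees = [np.mean([fit_predict(x0, 10_000 + 25 * s + j) for j in range(25)])
              for s in range(100)]
print(np.var(singles), np.var(committees))  # the committee variance is much lower
```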

Page 36:

Overall summary

• Weight decay is a powerful means of overfitting avoidance (||w||2 regularizer).

• It has several theoretical justifications: SRM, Bayesian prior, MDL.

• It controls variance in the learning machine family, but introduces bias.

• Variance can also be controlled with ensemble methods.

Page 37:

Want to Learn More?

• Statistical Learning Theory, V. Vapnik. Theoretical book; the reference on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN: 0471030031.

• Structural risk minimization for character recognition, I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla. In J. E. Moody et al., editors, Advances in Neural Information Processing Systems 4 (NIPS 91), pages 471-479, San Mateo, CA, Morgan Kaufmann, 1992. http://clopinet.com/isabelle/Papers/srm.ps.Z

• Kernel Ridge Regression Tutorial, I. Guyon. http://clopinet.com/isabelle/Projects/ETH/KernelRidge.pdf

• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, the best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book