Subhransu Maji November 3, 2016 CMPSCI 670: Computer Vision Linear models


Page 1: lec16 linear models - University of Massachusetts Amherst, smaji/cmpsci670/slides/lec16_linear_models_1UP.pdf

Subhransu Maji

November 3, 2016

CMPSCI 670: Computer Vision

Linear models

Page 2: A neuron (or how our brains work)

Subhransu Maji (UMASS), CMPSCI 670

Neuroscience 101

Page 3: Perceptron

Inputs are feature values
Each feature has a weight
The weighted sum is the activation:

$\text{activation}(\mathbf{w}, \mathbf{x}) = \sum_i w_i x_i = \mathbf{w}^T \mathbf{x}$

If the activation is:
‣ > b, output class 1
‣ otherwise, output class 2

[Figure: inputs x1, x2, x3 with weights w1, w2, w3 feed a summation node Σ followed by the threshold test "> b".]

A bias term can be absorbed into the weight vector by appending a constant feature (the bias b below plays the role of the negated threshold):

$\mathbf{x} \to (\mathbf{x}, 1)$

$\mathbf{w}^T \mathbf{x} + b \to (\mathbf{w}, b)^T (\mathbf{x}, 1)$
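A minimal sketch of the activation and the bias-absorption trick, in plain Python. The weights, bias, and input below are made-up illustration values, not taken from the slides; the point is only that both forms give the same decision.

```python
def activation(w, x):
    """Weighted sum of the features: w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, b, x):
    """Class 1 if w^T x + b > 0, otherwise class 2."""
    return 1 if activation(w, x) + b > 0 else 2

def predict_augmented(w_aug, x):
    """Same decision with the bias folded in: (w, b)^T (x, 1) > 0."""
    return 1 if activation(w_aug, list(x) + [1.0]) > 0 else 2

# Hypothetical weights, bias, and input (illustration only).
w, b = [0.5, -1.0, 2.0], -0.25
x = [1.0, 0.5, 0.1]

label_plain = predict(w, b, x)
label_augmented = predict_augmented(w + [b], x)
```

Both calls agree because appending 1 to x and b to w leaves the score unchanged.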

Page 4: Example: Spam

Imagine 3 features (spam is “positive” class):
‣ free (number of occurrences of “free”)
‣ money (number of occurrences of “money”)
‣ BIAS (intercept, always has value 1)

[Figure: an email is turned into a feature vector x and scored against the weight vector w.]

$\mathbf{w}^T \mathbf{x} > 0 \to$ SPAM!!
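The spam example can be sketched as follows. The weight values are hypothetical (the transcript does not preserve the numbers shown on the slide), and the whitespace tokenizer is a simplifying assumption.

```python
def featurize(email_text):
    """Map an email to [count of 'free', count of 'money', BIAS]."""
    words = email_text.lower().split()
    return [words.count("free"), words.count("money"), 1]

# Hypothetical weights for (free, money, BIAS); not from the slides.
WEIGHTS = [4.0, 2.0, -3.0]

def is_spam(email_text):
    """Spam is the positive class: w^T x > 0."""
    x = featurize(email_text)
    return sum(wi * xi for wi, xi in zip(WEIGHTS, x)) > 0

spam_result = is_spam("free money")        # score 4 + 2 - 3 = 3
ham_result = is_spam("meeting at noon")    # score -3 (bias only)
```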

Page 5: Geometry of the perceptron

In the space of feature vectors
‣ examples are points (in D dimensions)
‣ a weight vector defines a hyperplane (a D-1 dimensional object)
‣ One side corresponds to y=+1
‣ Other side corresponds to y=-1

Perceptrons are also called linear classifiers

[Figure: the weight vector w and the decision boundary $\mathbf{w}^T \mathbf{x} = 0$, to which w is normal.]

Page 6: Learning a perceptron

Perceptron training algorithm [Rosenblatt 57]

Input: training data $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$

Initialize $\mathbf{w} \leftarrow [0, \ldots, 0]$
for iter = 1,…,T
‣ for i = 1,…,n
• predict according to the current model

$\hat{y}_i = \begin{cases} +1 & \text{if } \mathbf{w}^T \mathbf{x}_i > 0 \\ -1 & \text{if } \mathbf{w}^T \mathbf{x}_i \le 0 \end{cases}$

• if $y_i = \hat{y}_i$, no change
• else, $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$

[Figure: a misclassified example $\mathbf{x}_i$ with $y_i = -1$; the update adds $y_i \mathbf{x}_i$ to w.]

Notes: error driven, online, activations increase for + examples after an update, randomize the order of the examples
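The training loop above can be sketched in plain Python. The toy data set, the number of passes T, and the fixed random seed are assumptions made for illustration; the last feature of each example is the constant 1 that stands in for the bias.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_perceptron(data, T=10, seed=0):
    """Perceptron training: update w only when an example is misclassified."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * len(data[0][0])          # initialize w to the zero vector
    for _ in range(T):
        rng.shuffle(data)                # randomize the order on every pass
        for x, y in data:
            y_hat = 1 if dot(w, x) > 0 else -1
            if y_hat != y:               # error driven: change w only on mistakes
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data (illustration only).
data = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
        ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
w = train_perceptron(data)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in data]
```

On this separable toy set the learned w classifies every training point correctly after a few passes.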

Page 7: Properties of perceptrons

Separability: some parameters will classify the training data perfectly

Convergence: if the training data is separable then the perceptron training will eventually converge [Block 62, Novikoff 62]

Mistake bound: the maximum number of mistakes is related to the margin

$\#\text{mistakes} \le \frac{1}{\gamma^2}$, assuming $\|\mathbf{x}_i\| \le 1$

$\gamma = \max_{\mathbf{w}} \min_{(\mathbf{x}_i, y_i)} \left[ y_i \mathbf{w}^T \mathbf{x}_i \right]$ such that $\|\mathbf{w}\| = 1$
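As a rough numerical check of the mistake bound, the sketch below counts perceptron mistakes on a toy data set with $\|\mathbf{x}_i\| \le 1$ and compares the count with $1/\gamma^2$. The data and the reference separator w_ref are assumptions; since γ is a maximum over unit vectors, the margin of any particular unit vector only gives a lower bound on γ, and therefore a looser (larger) bound on the mistakes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(w, data):
    """Smallest value of y * w^T x over the data, for a unit-norm w."""
    return min(y * dot(w, x) for x, y in data)

def count_mistakes(data, max_epochs=100):
    """Run the perceptron until one full pass makes no mistakes."""
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in data:
            y_hat = 1 if dot(w, x) > 0 else -1
            if y_hat != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return mistakes

# Toy data with every ||x|| <= 1 (illustration only).
data = [([0.6, 0.3], 1), ([0.3, 0.6], 1), ([-0.6, -0.3], -1), ([-0.3, -0.6], -1)]
w_ref = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # a unit-norm separator
gamma_ref = margin(w_ref, data)                # lower bound on the true margin
bound = 1.0 / gamma_ref ** 2
mistakes = count_mistakes(data)
```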

Page 8: Limitations of perceptrons

Convergence: if the data isn’t separable, the training algorithm may not terminate
‣ noise can cause this
‣ some simple functions are not separable (xor)

Mediocre generalization: the algorithm finds a solution that “barely” separates the data

Overtraining: test/validation accuracy rises and then falls
‣ Overtraining is a kind of overfitting

Page 9: Overview

Linear models
‣ Perceptron: model and learning algorithm combined as one
‣ Is there a better way to learn linear models?

We will separate models and learning algorithms
‣ Learning as optimization
‣ Surrogate loss function
‣ Regularization
‣ Gradient descent
‣ Batch and online gradients
‣ Subgradient descent
‣ Support vector machines

Braces on the slide group these topics under two labels: “model design” and “optimization”.

Page 10: Learning as optimization

$\min_{\mathbf{w}} \sum_n \mathbf{1}[y_n \mathbf{w}^T \mathbf{x}_n < 0] + \lambda R(\mathbf{w})$

(first term: fewest mistakes)

The perceptron algorithm will find an optimal w if the data is separable
‣ efficiency depends on the margin and norm of the data

However, if the data is not separable, optimizing this is NP-hard
‣ i.e., there is no efficient way to minimize this unless P=NP

Page 11: Learning as optimization

$\min_{\mathbf{w}} \sum_n \mathbf{1}[y_n \mathbf{w}^T \mathbf{x}_n < 0] + \lambda R(\mathbf{w})$

(first term: fewest mistakes; second term: simpler model; λ: hyperparameter)

In addition to minimizing training error, we want a simpler model
‣ Remember our goal is to minimize generalization error
‣ Recall the bias and variance tradeoff for learners

We can add a regularization term R(w) that prefers simpler models
‣ For example we may prefer decision trees of shallow depth

Here λ is a hyperparameter of the optimization problem

Page 12: Learning as optimization

$\min_{\mathbf{w}} \sum_n \mathbf{1}[y_n \mathbf{w}^T \mathbf{x}_n < 0] + \lambda R(\mathbf{w})$

(first term: fewest mistakes; second term: simpler model; λ: hyperparameter)

The questions that remain are:
‣ What are good ways to adjust the optimization problem so that there are efficient algorithms for solving it?
‣ What are good regularizations R(w) for hyperplanes?
‣ Assuming that the optimization problem can be adjusted appropriately, what algorithms exist for solving the regularized optimization problem?

Page 13: Convex surrogate loss functions

Zero/one loss is hard to optimize
‣ Small changes in w can cause large changes in the loss

Surrogate loss: replace zero/one loss by a smooth (or at least continuous) function
‣ Easier to optimize if the surrogate loss is convex

Examples: zero/one, hinge, logistic, exponential, and squared loss

[Figure: loss (vertical axis, 0 to 9) versus prediction $\mathbf{w}^T \mathbf{x}$ (horizontal axis, −2 to 3) for y = +1, with curves for the zero/one, hinge, logistic, exponential, and squared losses; annotations mark a concave and a convex shape.]
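The five losses in the plot can be written as functions of the signed score $z = y\,\mathbf{w}^T\mathbf{x}$. This sketch mirrors the definitions in the appendix MATLAB code (including the base-2 scaling of the logistic loss and the `<= 0` convention for zero/one loss); writing the squared loss as $(1 - z)^2$ assumes $y \in \{-1, +1\}$, in which case it equals $(y - \mathbf{w}^T\mathbf{x})^2$.

```python
import math

def zero_one(z):
    """1 for a mistake (z <= 0), 0 otherwise."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    return max(0.0, 1.0 - z)

def logistic(z):
    # Divided by log(2) so that the loss equals 1 at z = 0.
    return math.log(1.0 + math.exp(-z)) / math.log(2.0)

def exponential(z):
    return math.exp(-z)

def squared(z):
    # (y - w^T x)^2 equals (1 - z)^2 when y is +1 or -1.
    return (1.0 - z) ** 2

# Evaluate each loss on a few scores between -2 and 3.
scores = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
table = {f.__name__: [f(z) for z in scores]
         for f in (zero_one, hinge, logistic, exponential, squared)}
```

At every score in this range the four surrogates are at least as large as the zero/one loss, which is what makes them usable as stand-ins for it.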

Page 14: Weight regularization

What are good regularization functions R(w) for hyperplanes?

We would like the weights:
‣ To be small
➡ A change in the features causes only a small change in the score
➡ Robustness to noise
‣ To be sparse
➡ Use as few features as possible
➡ Similar to controlling the depth of a decision tree

This is a form of inductive bias

Page 15: Weight regularization

Just like the surrogate loss function, we would like R(w) to be convex

Small weights regularization

$R^{(\text{norm})}(\mathbf{w}) = \sqrt{\sum_d w_d^2}$

$R^{(\text{sqrd})}(\mathbf{w}) = \sum_d w_d^2$

Sparsity regularization

$R^{(\text{count})}(\mathbf{w}) = \sum_d \mathbf{1}[|w_d| > 0]$ (not convex)

Family of “p-norm” regularization

$R^{(p\text{-norm})}(\mathbf{w}) = \left( \sum_d |w_d|^p \right)^{1/p}$
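The regularizers above translate directly into code. The function names and the example weight vector are illustrative choices, not from the slides.

```python
import math

def r_sqrd(w):
    """Sum of squared weights."""
    return sum(wd * wd for wd in w)

def r_norm(w):
    """Euclidean norm of the weights."""
    return math.sqrt(r_sqrd(w))

def r_count(w):
    """Number of non-zero weights (not convex)."""
    return sum(1 for wd in w if abs(wd) > 0)

def r_pnorm(w, p):
    """General p-norm, (sum_d |w_d|^p)^(1/p), for p > 0."""
    return sum(abs(wd) ** p for wd in w) ** (1.0 / p)

w = [3.0, 0.0, -4.0]   # example weight vector (illustration only)
values = (r_norm(w), r_sqrd(w), r_count(w), r_pnorm(w, 1), r_pnorm(w, 2))
```

For this w, the 2-norm from `r_pnorm(w, 2)` matches `r_norm(w)`, and the 1-norm is the sum of absolute values.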

Page 16: Contours of p-norms

convex for $p \ge 1$

[Figure: contours of p-norms, from http://en.wikipedia.org/wiki/Lp_space]

Page 17: Contours of p-norms

not convex for $0 \le p < 1$

[Figure: contours for $p = 2/3$ and $p = 0$, from http://en.wikipedia.org/wiki/Lp_space]

Counting non-zeros:

$R^{(\text{count})}(\mathbf{w}) = \sum_d \mathbf{1}[|w_d| > 0]$
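A small numerical illustration of the convexity claim: a convex function evaluated at the midpoint of two points can never exceed the average of its values at those points. The sketch checks this for two hand-picked vectors; the vectors and the choice p = 0.5 are assumptions for illustration. A positive gap is a counterexample that rules out convexity, while a non-positive gap at one pair of points is consistent with convexity but does not prove it.

```python
def p_norm(w, p):
    """(sum_d |w_d|^p)^(1/p) for p > 0."""
    return sum(abs(wd) ** p for wd in w) ** (1.0 / p)

def midpoint_gap(u, v, p):
    """R(midpoint) minus the average of R(u) and R(v); convexity needs this <= 0."""
    mid = [(a + b) / 2.0 for a, b in zip(u, v)]
    return p_norm(mid, p) - 0.5 * (p_norm(u, p) + p_norm(v, p))

u, v = [1.0, 0.0], [0.0, 1.0]
gap_p2 = midpoint_gap(u, v, 2.0)    # negative: consistent with convexity
gap_p05 = midpoint_gap(u, v, 0.5)   # positive: convexity is violated
```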

Page 18: General optimization framework

Select a suitable:
‣ convex surrogate loss
‣ convex regularization

Select the hyperparameter λ

Minimize the regularized objective with respect to w

$\min_{\mathbf{w}} \sum_n \ell\left(y_n, \mathbf{w}^T \mathbf{x}_n\right) + \lambda R(\mathbf{w})$

(first term: surrogate loss; second term: regularization; λ: hyperparameter)

This framework for optimization is called Tikhonov regularization or, more generally, Structural Risk Minimization (SRM)

http://en.wikipedia.org/wiki/Tikhonov_regularization

Page 19: Optimization by gradient descent

compute gradient at the current location

$\mathbf{g}^{(k)} \leftarrow \nabla_{\mathbf{p}} F(\mathbf{p}) \big|_{\mathbf{p}_k}$

take a step down the gradient

$\mathbf{p}_{k+1} \leftarrow \mathbf{p}_k - \eta_k \mathbf{g}^{(k)}$

where $\eta_k$ is the step size

[Figure: iterates p1, ..., p6 with step sizes η1, η2, η3 on a convex function (local optima = global optima) and on a non-convex function (local optima and global optima marked separately).]
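The two-step loop (compute the gradient, step against it) can be sketched for a one-dimensional convex function. The function $F(p) = (p - 3)^2$, the starting point, and the constant step size are assumptions chosen for illustration.

```python
def gradient_descent(grad, p0, eta, steps):
    """Repeat p <- p - eta * grad(p) for a fixed number of steps."""
    p = p0
    for _ in range(steps):
        g = grad(p)          # compute gradient at the current location
        p = p - eta * g      # take a step down the gradient
    return p

# F(p) = (p - 3)^2 is convex with its minimum at p = 3; its gradient is 2(p - 3).
p_star = gradient_descent(lambda p: 2.0 * (p - 3.0), p0=0.0, eta=0.1, steps=100)
```

Because the function is convex, the iterates approach the single global minimum regardless of the starting point.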

Page 20: Choice of step size

[Figure: iterates p1, ..., p6 under a good step size and under a bad step size η1.]

The step size is important:
‣ too small: slow convergence
‣ too large: no convergence

A strategy is to use large step sizes initially and small step sizes later:

$\eta_t \leftarrow \eta_0 / (t_0 + t)$

There are methods that converge faster by adapting step size to the curvature of the function
‣ Field of convex optimization

http://stanford.edu/~boyd/cvxbook/
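The effect of the step size can be seen on $F(p) = p^2$, whose gradient is $2p$. The specific constants below (0.001, 1.1, and the schedule with η0 = 0.9, t0 = 2) are arbitrary illustration values, not recommendations.

```python
def run(step_size, steps=50, p0=1.0):
    """Gradient descent on F(p) = p^2 with step size step_size(t) at step t."""
    p = p0
    for t in range(steps):
        p -= step_size(t) * 2.0 * p
    return p

too_small = run(lambda t: 0.001)             # barely moves from p0 = 1
too_large = run(lambda t: 1.1)               # overshoots and grows in magnitude
decaying = run(lambda t: 0.9 / (2.0 + t))    # eta_t = eta_0 / (t_0 + t)
```

With the tiny step the iterate is still close to its starting value after 50 steps, with the large step it diverges, and the decaying schedule ends near the minimum at 0.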

Page 21: Example: Exponential loss

objective

$L(\mathbf{w}) = \sum_n \exp(-y_n \mathbf{w}^T \mathbf{x}_n) + \frac{\lambda}{2} \|\mathbf{w}\|^2$

gradient

$\frac{dL}{d\mathbf{w}} = \sum_n -y_n \mathbf{x}_n \exp(-y_n \mathbf{w}^T \mathbf{x}_n) + \lambda \mathbf{w}$

update

$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \sum_n -y_n \mathbf{x}_n \exp(-y_n \mathbf{w}^T \mathbf{x}_n) + \lambda \mathbf{w} \right)$

loss term: $\mathbf{w} \leftarrow \mathbf{w} + c\, y_n \mathbf{x}_n$, with $c = \eta \exp(-y_n \mathbf{w}^T \mathbf{x}_n)$
‣ c is high for misclassified points
‣ similar to the perceptron update rule!

regularization term: $\mathbf{w} \leftarrow (1 - \eta\lambda)\, \mathbf{w}$
‣ shrinks weights towards zero
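A sketch of batch gradient descent on the regularized exponential-loss objective above. The toy data, λ, η, and the number of steps are assumptions chosen so that the example runs quickly.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, data, lam):
    """sum_n exp(-y_n w^T x_n) + (lam / 2) ||w||^2"""
    return sum(math.exp(-y * dot(w, x)) for x, y in data) + 0.5 * lam * dot(w, w)

def gradient(w, data, lam):
    """sum_n -y_n x_n exp(-y_n w^T x_n) + lam * w"""
    g = [lam * wi for wi in w]
    for x, y in data:
        c = math.exp(-y * dot(w, x))      # large for misclassified points
        for d in range(len(w)):
            g[d] += -y * x[d] * c
    return g

def train(data, lam=0.1, eta=0.1, steps=200):
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        g = gradient(w, data, lam)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Toy data (illustration only).
data = [([1.0, 0.5], 1), ([0.8, -0.2], 1), ([-1.0, 0.3], -1), ([-0.7, -0.6], -1)]
w = train(data)
initial_value = objective([0.0, 0.0], data, 0.1)
final_value = objective(w, data, 0.1)
```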

Page 22: Batch and online gradients

objective

$L(\mathbf{w}) = \sum_n L_n(\mathbf{w})$

gradient descent

$\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{dL}{d\mathbf{w}}$

batch gradient

$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \sum_n \frac{dL_n}{d\mathbf{w}} \right)$

‣ sum of n gradients
‣ update weights after you see all points

online gradient

$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \frac{dL_n}{d\mathbf{w}} \right)$

‣ gradient at nth point
‣ update weights after you see each point

Online gradients are the default method for multi-layer perceptrons
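The difference between the two update schedules can be sketched on a scalar problem where $L_n(w) = (w - a_n)^2$, so the full objective is minimized at the mean of the $a_n$. The targets, step size, and epoch count are illustrative assumptions.

```python
def batch_epoch(w, grads, eta):
    """One update per pass: step along the sum of all per-point gradients."""
    total = sum(g(w) for g in grads)
    return w - eta * total

def online_epoch(w, grads, eta):
    """One update per point: step along each per-point gradient in turn."""
    for g in grads:
        w = w - eta * g(w)
    return w

targets = [1.0, 2.0, 3.0]                        # L_n(w) = (w - a_n)^2
grads = [lambda w, a=a: 2.0 * (w - a) for a in targets]

w_batch = w_online = 0.0
for _ in range(50):
    w_batch = batch_epoch(w_batch, grads, eta=0.1)
    w_online = online_epoch(w_online, grads, eta=0.1)
```

With a constant step size the batch iterate settles at the minimizer 2, while the online iterate ends each pass close to it, pulled slightly toward the most recently seen point.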

Page 23: Subgradient

$\ell^{(\text{hinge})}(y, \mathbf{w}^T \mathbf{x}) = \max(0, 1 - y \mathbf{w}^T \mathbf{x})$

The hinge loss is not differentiable at z=1, where $z = y \mathbf{w}^T \mathbf{x}$

A subgradient at a point is the slope of any line through that point that stays on or below the function

[Figure: hinge loss plotted against z, with a kink at z = 1 and a subgradient line lying below the function.]

For the hinge loss a possible subgradient is:

$\frac{d\ell^{\text{hinge}}}{d\mathbf{w}} = \begin{cases} 0 & \text{if } y \mathbf{w}^T \mathbf{x} > 1 \\ -y \mathbf{x} & \text{otherwise} \end{cases}$

Page 24: Example: Hinge loss

objective

$L(\mathbf{w}) = \sum_n \max(0, 1 - y_n \mathbf{w}^T \mathbf{x}_n) + \frac{\lambda}{2} \|\mathbf{w}\|^2$

subgradient

$\frac{dL}{d\mathbf{w}} = \sum_n -\mathbf{1}[y_n \mathbf{w}^T \mathbf{x}_n \le 1]\, y_n \mathbf{x}_n + \lambda \mathbf{w}$

update

$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \sum_n -\mathbf{1}[y_n \mathbf{w}^T \mathbf{x}_n \le 1]\, y_n \mathbf{x}_n + \lambda \mathbf{w} \right)$

loss term: $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_n \mathbf{x}_n$
‣ only for points with $y_n \mathbf{w}^T \mathbf{x}_n \le 1$
‣ the perceptron update applies only when $y_n \mathbf{w}^T \mathbf{x}_n \le 0$

regularization term: $\mathbf{w} \leftarrow (1 - \eta\lambda)\, \mathbf{w}$
‣ shrinks weights towards zero
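A sketch of subgradient descent on the regularized hinge-loss objective above. The toy data, λ, η, and the number of steps are assumptions for illustration; with a constant step size the iterate keeps moving slightly around the optimum instead of settling exactly on it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge_subgradient(w, data, lam):
    """sum_n -1[y_n w^T x_n <= 1] y_n x_n + lam * w"""
    g = [lam * wi for wi in w]
    for x, y in data:
        if y * dot(w, x) <= 1:            # only points inside the margin contribute
            for d in range(len(w)):
                g[d] -= y * x[d]
    return g

def train(data, lam=0.1, eta=0.05, steps=200):
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        g = hinge_subgradient(w, data, lam)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Toy separable data (illustration only).
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
g0 = hinge_subgradient([0.0, 0.0], data, 0.1)
w = train(data)
```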

Page 25: Example: Squared loss

objective

$L(\mathbf{w}) = \sum_n \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$

[The slide also restates the objective in matrix notation as an equivalent loss; those equations are missing from the transcript.]

Page 26: Example: Squared loss

objective → gradient → exact closed-form solution

At optima the gradient=0

[The slide's equations for the objective, the gradient, and the closed-form solution are missing from the transcript.]
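The missing steps can be reconstructed from the scalar objective on the previous page and the gradient quoted on the next one. The matrix symbols are assumptions introduced here: $\mathbf{X}$ is the $N \times D$ matrix whose rows are the $\mathbf{x}_n^T$, and $\mathbf{y}$ is the vector of the $y_n$. The notation on the original slide may differ, for example in how the constant on λ is written.

```latex
\begin{align*}
L(\mathbf{w}) &= \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2
  && \text{objective in matrix notation} \\
\frac{dL}{d\mathbf{w}} &= -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda\mathbf{w}
  && \text{gradient} \\
0 &= -2\,\mathbf{X}^T\mathbf{y} + \left(2\,\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\right)\mathbf{w}
  && \text{at the optimum the gradient is zero} \\
\mathbf{w} &= \left(\mathbf{X}^T\mathbf{X} + \frac{\lambda}{2}\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}
  && \text{closed-form solution}
\end{align*}
```

The gradient line is the matrix form of $\sum_n -2(y_n - \mathbf{w}^T\mathbf{x}_n)\,\mathbf{x}_n + \lambda\mathbf{w}$.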

Page 27: Matrix inversion vs. gradient descent

Assume we have D features and N points

Overall time via matrix inversion
‣ The closed form solution involves computing the D×D product $\mathbf{X}^T \mathbf{X}$, a D×D matrix inverse, and the product $\mathbf{X}^T \mathbf{y}$, where X is the N×D matrix of examples and y the vector of labels
‣ Total time is O(D²N + D³ + DN), assuming O(D³) matrix inversion
‣ If N > D, then total time is O(D²N)

Overall time via gradient descent
‣ Gradient:

$\frac{dL}{d\mathbf{w}} = \sum_n -2\left(y_n - \mathbf{w}^T \mathbf{x}_n\right) \mathbf{x}_n + \lambda \mathbf{w}$

‣ Each iteration: O(ND); T iterations: O(TND)

Which one is faster?
‣ Small problems D < 100: probably faster to run matrix inversion
‣ Large problems D > 10,000: probably faster to run gradient descent
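For a single feature (D = 1) the closed-form solution reduces to a ratio of sums, which makes it easy to compare with gradient descent in a few lines. The data, λ, step size, and iteration count below are illustrative assumptions; the objective is $\sum_n (y_n - w x_n)^2 + \frac{\lambda}{2} w^2$.

```python
def closed_form(xs, ys, lam):
    """Setting the gradient to zero gives w = sum(x*y) / (sum(x^2) + lam/2)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam / 2.0)

def gradient(w, xs, ys, lam):
    """sum_n -2 (y_n - w x_n) x_n + lam * w"""
    return sum(-2.0 * (y - w * x) * x for x, y in zip(xs, ys)) + lam * w

def gradient_descent(xs, ys, lam, eta=0.01, steps=200):
    w = 0.0
    for _ in range(steps):        # each iteration touches all N points once
        w -= eta * gradient(w, xs, ys, lam)
    return w

xs, ys, lam = [1.0, 2.0, 3.0], [2.0, 4.0, 6.5], 1.0
w_exact = closed_form(xs, ys, lam)
w_gd = gradient_descent(xs, ys, lam)
```

Both routes reach the same weight here; which one is cheaper in practice depends on D, N, and the number of iterations, as the slide discusses.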

Page 28: Optimization for linear models

Under suitable conditions*, provided you pick the step sizes appropriately, the convergence rate of gradient descent is O(1/N), where N here is the number of iterations
‣ i.e., if you want a solution within 0.0001 of the optimal you have to run gradient descent for on the order of N=10,000 iterations.

For linear models (hinge/logistic/exponential loss) and squared-norm regularization there are off-the-shelf solvers that are fast in practice: SVMperf, LIBLINEAR, PEGASOS
‣ SVMperf and LIBLINEAR use a different optimization method

* the function is strongly convex

Page 29: Feature normalization

Even if a feature is useful, some normalization may be good

Per-feature normalization
‣ Centering: $x_{n,d} \leftarrow x_{n,d} - \mu_d$, where $\mu_d = \frac{1}{N} \sum_n x_{n,d}$
‣ Variance scaling: $x_{n,d} \leftarrow x_{n,d} / \sigma_d$, where $\sigma_d = \sqrt{\frac{1}{N} \sum_n (x_{n,d} - \mu_d)^2}$
‣ Absolute scaling: $x_{n,d} \leftarrow x_{n,d} / r_d$, where $r_d = \max_n |x_{n,d}|$
‣ Non-linear transformation
➡ square-root: $x_{n,d} \leftarrow \sqrt{x_{n,d}}$ (corrects for burstiness)
➡ Caltech-101 image classification: 41.6% linear, 63.8% square-root

Per-example normalization
‣ fixed norm for each example: $\|\mathbf{x}\| = 1$
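The normalizations above can be sketched on a small data matrix stored as a list of rows (examples) by columns (features). The function names and the example matrix are illustrative; leaving a feature unchanged when its scale is zero is a design choice made here to avoid division by zero, and the square-root transform assumes non-negative features.

```python
import math

def column_stats(X):
    """Per-feature mean, standard deviation, and maximum absolute value."""
    n, dims = len(X), len(X[0])
    mu = [sum(row[d] for row in X) / n for d in range(dims)]
    sigma = [math.sqrt(sum((row[d] - mu[d]) ** 2 for row in X) / n) for d in range(dims)]
    r = [max(abs(row[d]) for row in X) for d in range(dims)]
    return mu, sigma, r

def center(X):
    mu, _, _ = column_stats(X)
    return [[v - m for v, m in zip(row, mu)] for row in X]

def variance_scale(X):
    _, sigma, _ = column_stats(X)
    return [[v / s if s > 0 else v for v, s in zip(row, sigma)] for row in X]

def absolute_scale(X):
    _, _, r = column_stats(X)
    return [[v / s if s > 0 else v for v, s in zip(row, r)] for row in X]

def sqrt_transform(X):
    return [[math.sqrt(v) for v in row] for row in X]   # needs v >= 0

def unit_norm(X):
    """Per-example normalization: rescale every row to ||x|| = 1."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else v for v in row])
    return out

X = [[1.0, 4.0], [3.0, 0.0]]   # two examples, two features (illustration only)
```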

Page 30: Slides credit

Figures of various “p-norms” are from Wikipedia
‣ http://en.wikipedia.org/wiki/Lp_space

Some of the slides are based on the CIML book by Hal Daume III

Page 31: Appendix: code for surrogateLoss

Matlab code

2/20/15 11:12 AM /Users/smaji/Dropbo.../surrogateLoss.m 1 of 1

% Code to plot various loss functions
y1 = 1;                        % true label
y2 = linspace(-2, 3, 500);     % range of predictions
zeroOneLoss = y1*y2 <= 0;
hingeLoss = max(0, 1 - y1*y2);
logisticLoss = log(1 + exp(-y1*y2)) / log(2);
expLoss = exp(-y1*y2);
squaredLoss = (y1 - y2).^2;

% Plot them
figure(1); clf; hold on;
plot(y2, zeroOneLoss, 'k-', 'LineWidth', 1);
plot(y2, hingeLoss, 'b-', 'LineWidth', 1);
plot(y2, logisticLoss, 'r-', 'LineWidth', 1);
plot(y2, expLoss, 'g-', 'LineWidth', 1);
plot(y2, squaredLoss, 'm-', 'LineWidth', 1);
% The prediction y2 is on the x-axis and the loss on the y-axis
% (the two labels were swapped in the original listing).
xlabel('Prediction', 'FontSize', 16);
ylabel('Loss', 'FontSize', 16);
legend({'Zero/one', 'Hinge', 'Logistic', 'Exponential', 'Squared'}, 'Location', 'NorthEast', 'FontSize', 16);
box on;

Output

[Figure: the five loss curves (Zero/one, Hinge, Logistic, Exponential, Squared) plotted for predictions from −2 to 3, with loss values from 0 to 9.]