
Page 1

Machine Learning: Intro & Linear Classifier

Olga Veksler

Page 2

Outline

• Introduction to Machine Learning

• motivation of machine learning

• types of machine learning

• basic concepts

• overfitting, underfitting, generalization

• Optimization with gradient descent

• Linear Classifier

• Two classes

• Loss functions

• Gradient descent schemes

• Multiple classes

Page 3

INTRO TO ML

Page 4

Why use Machine Learning?

• Difficult to come up with an explicit program for some tasks

• Digit Recognition, a classic example

• Easy to collect images of digits with their correct labels

• Machine Learning algorithm takes the collected data and produces a program for recognizing digits

• done right, the program will correctly recognize new images it has never seen

Page 5

What is Machine Learning?

• General definition (Tom Mitchell):

• Based on experience E, improve performance on task T as measured by performance measure P

• Digit Recognition Example

• T = recognize character in the image

• P = percentage of correctly classified images

• E = dataset of human-labeled images of characters

Page 6

Types of Machine Learning

• Supervised Learning

• given training examples with corresponding outputs (also called label, target, class, answer, etc.)

• learn to produce the correct output for a new example

• Unsupervised Learning

• given training examples only

• discover good data representation

• e.g. “natural” clusters

• k-means is the most widely known example

• Reinforcement Learning

• learn to select action that maximizes payoff

Page 7

Supervised Machine Learning: subtypes

• Classification

• output belongs to a finite set

• example: age ∊ {baby, child, adult, elder}

• output is also called class or label

• Regression

• output is continuous

• example: age ∊ [0, 130]

• Difference is mostly in the design of the loss function

Page 8

Supervised Machine Learning

• Have examples with corresponding outputs

• Example: fish classification, salmon or sea bass?

x1 = (7.5, 3.3),  x2 = (7.8, 3.6),  x3 = (7.1, 3.2),  x4 = (0.7, 4.6)

y1 = 0,  y2 = 1,  y3 = 0,  y4 = 1

• Each example is represented by a vector

• data may be given in vector form from the start

• if not, for each example i, extract useful features and put them in a vector xi

• fish classification example: extract two features, fish length and average fish brightness

• can extract many other features

• for images, can also use raw pixel intensity or color as features

• an example is often called feature vector

• yi is the output for example xi

Page 9

Supervised Machine Learning

• Training phase

• estimate function y = f(x) from labeled data

• f(x) is called classifier, learning machine, prediction function, etc.

• Testing phase (deployment)

• predict output f(x) for a new (unseen) sample x

• We are given

1. Training examples x1, x2,…, xn

2. Target output for each sample: y1, y2, …, yn

(1 and 2 together are the labeled data)

Page 10

Training/Testing Phases Illustrated

• Training: training examples → feature vectors, together with training labels → Training → learned model f

• Testing: test image → feature vector → learned model f → label prediction

Page 11

More on Training Phase

• Estimate prediction function y = f(x) from labeled data

• Choose hypothesis space f(x) belongs to

• hypothesis space f(x,w) is parameterized by vector of weights w

• each setting of w corresponds to a different hypothesis

• find f(x,w) in the hypothesis space s.t. f(xi,w) = yi “as much as possible” for training examples

• define “as much as possible” via some loss function L(f(x,w),y)

[figure: hypothesis space containing f(x,w1), f(x,w2), f(x,w3), f(x,w4), f(x,w5)]

Page 12

Training Phase Example in 1D

• 2-class classification problem, yi ∊ {-1, 1}

• Training set: class -1: {-2, -1, 1};  class 1: {2, 3, 5}

• Hypothesis space: f(x,w) = sign(w0 + w1x), with weight vector w = [w0, w1]ᵗ

• With w0 = -1, w1 = 2, f(x) = sign(-1 + 2x); the decision boundary is at x = 0.5
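As a concrete illustration, here is a minimal sketch (not from the slides; it assumes NumPy) that evaluates the weight setting above and the one on the next slide on this training set:

import numpy as np

train_x = np.array([-2, -1, 1, 2, 3, 5])
train_y = np.array([-1, -1, -1, 1, 1, 1])      # labels for class -1 and class 1

# setting from this slide: w0 = -1, w1 = 2 (boundary at x = 0.5)
pred = np.sign(-1 + 2 * train_x)
print(np.mean(pred != train_y))                # 1/6: the example x = 1 is misclassified

# setting from the next slide: w0 = -1.5, w1 = 1 (boundary at x = 1.5)
pred = np.sign(-1.5 + 1 * train_x)
print(np.mean(pred != train_y))                # 0.0: all training examples correct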

Page 13

Training Phase Example in 1D

• 2-class classification problem, yi ∊ {-1, 1}

• Training set: class -1: {-2, -1, 1};  class 1: {2, 3, 5}

• Hypothesis space f(x,w) = sign(w0 + w1x); with w0 = -1.5, w1 = 1, f(x) = sign(-1.5 + x); the decision boundary is at x = 1.5

• The process of finding a good w is called weight tuning, or training

Page 14

Training Phase Example in 2D

• For a 2-class problem with 2-dimensional samples

f(x,w) = sign(w0+w1x1+w2x2)

decision boundary

decision regions

x1

x2

• Can be generalized to examples of arbitrary dimension

• Classifier that makes a decision based on linear combination of features is called a linear classifier

Page 15

Training Phase: Linear Classifier

• bad setting of w: classification error 38%

• better setting of w: classification error 4%

Page 16

Training Stage: More Complex Classifier

• for example if f(x,w) is a polynomial of high degree

• 0% classification error

x1

x2

Page 17

Test Classifier on New Data

• The goal is to classify well on new data

• Test “wiggly” classifier on new data: 25% error

x1

x2

Page 18

Overfitting

• Have only limited amount of data for training

• Complex model often has too many parameters to fit reliably with limited data

• Complex model may adapt too closely to random noise of training data, rather than look at a ‘big picture’

x1

x2

Page 19

Overfitting: Extreme Example

• 2-class problem: face and non-face images

• Memorize (i.e. store) all the “face” images

• For a new image, see if it is one of the stored faces

• if yes, output “face” as the classification result

• If no, output “non-face”

• also called “rote learning”

• problem: new “face” images are different from the stored “face” examples

• zero error on stored data, 50% error on test (new) data

• decision boundary is very irregular

• Rote learning is memorization without generalization

slide is modified from Y. LeCun

Page 20

Generalization

• The ability to produce correct outputs on previously unseen examples is called generalization

• Big question of learning theory: how to get good generalization with a limited number of examples

• Intuitive idea: favor simpler classifiers

• William of Occam (1284-1347): “entities are not to be multiplied without necessity”

• A simpler decision boundary may not fit the training data ideally but tends to generalize better to new data

Page 21

Training and Testing

• How to diagnose overfitting?

• Divide all labeled samples x1,x2,…xn into training set and test set

• Use training set (training samples) to tune classifier weights w

• Use test set (test samples) to see how well the classifier with tuned weights w works on unseen examples

• Thus there are 2 main phases in classifier design

1. training

2. testing

Page 22

Training Phase

• Find weights w s.t. f(xi,w) = yi “as much as possible” for training samples xi

• define “as much as possible”

• penalty whenever f(xi,w) ≠ yi

• via loss function L(f(xi,w), yi)

• how to search for a good setting of w?

• usually through optimization

• can be very time consuming

• classification error on training data is called training error

Page 23

Testing Phase

• The goal is good performance on unseen examples

• Evaluate performance of the trained classifier f(x,w) on the test samples (unseen labeled samples)

• Testing on unseen labeled examples is an approximation of how well classifier will perform in practice

• If testing results are poor, may have to go back to the training phase and redesign f(x,w)

• Classification error on test data is called test error

• Side note

• when we “deploy” the final classifier f(x,w) in practice, this is also called testing

Page 24

Underfitting

• Can also underfit the data, i.e. use too simple a decision boundary

• chosen hypothesis space is not expressive enough

• No linear decision boundary can separate the samples well

• Training error is too high

• test error is, of course, also high

Page 25

Underfitting → Overfitting

• underfitting: high training error, high test error

• “just right”: low training error, low test error

• overfitting: low training error, high test error

Page 26

Sketch of Supervised Machine Learning

• Choose a hypothesis space f(x,w)

• w are tunable weights

• x is the input sample

• tune w so that f(x,w) gives the correct label for training samples x

• Which hypothesis space f(x,w) to choose?

• has to be expressive enough to model our problem well, i.e. to avoid underfitting

• yet not too complicated, to avoid overfitting

Page 27

Classification System Design Overview

• Collect and label data by hand (e.g. salmon vs. sea bass images)

• Preprocess data (e.g. segment the fish from the background)

• Extract possibly discriminating features

• length, lightness, width, number of fins, etc.

• Classifier design

• choose a model for the classifier

• train classifier on training data

• Test classifier on test data

• Split data into training and test sets

mostly look at these steps in the course

Page 28

OPTIMIZATION

Page 29

Optimization

• How to minimize a function of a single variable J(x) = (x - 5)²?

• From calculus: take the derivative and set it to 0:  (d/dx) J(x) = 0

• Solve the resulting equation

• may be easy or hard to solve

• The example above is easy:  (d/dx) J(x) = 2(x - 5) = 0  ⟹  x = 5

Page 30

Optimization

• How to minimize a function of many variables J(x) = J(x1, …, xd)?

• From calculus: take the partial derivatives and set them to 0:

∇J(x) = [∂J(x)/∂x1, …, ∂J(x)/∂xd]ᵗ = 0    (∇J(x) is the gradient)

• Solve the resulting system of d equations

• It may not be possible to solve the system of equations above analytically

Page 31

Optimization: Gradient Direction

[surface plot of J(x1, x2); picture from Andrew Ng]

• Gradient ∇J(x) points in the direction of steepest increase of the function J(x)

• -∇J(x) points in the direction of steepest decrease

Page 32

Gradient Direction in 2D

• J(x1, x2) = (x1 - 5)² + (x2 - 10)²

• Partial derivatives:

∂J/∂x1 = 2(x1 - 5),  ∂J/∂x2 = 2(x2 - 10)

• Let a = (10, 5)

• ∂J/∂x1(a) = 2(10 - 5) = 10,  ∂J/∂x2(a) = 2(5 - 10) = -10

• ∇J(a) = (10, -10), so -∇J(a) = (-10, 10)

• The global minimum is at (5, 10)

Page 33

Gradient Descent: Step Size

• J(x1, x2) = (x1 - 5)² + (x2 - 10)²

• From the previous slide: a = (10, 5) and -∇J(a) = (-10, 10)

• Which step size to take?

• controlled by parameter α, called the learning rate

• Let α = 0.2:

a - α∇J(a) = (10, 5) + 0.2 · (-10, 10) = (8, 7)

• J(10, 5) = 50;  J(8, 7) = 18
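A small numeric check of this step (a sketch assuming NumPy; the function names are mine, not from the slides):

import numpy as np

def J(x):
    return (x[0] - 5) ** 2 + (x[1] - 10) ** 2

def grad_J(x):
    return np.array([2 * (x[0] - 5), 2 * (x[1] - 10)])

a = np.array([10.0, 5.0])
print(J(a), grad_J(a))            # 50.0 and [ 10. -10.]

alpha = 0.2
a_new = a - alpha * grad_J(a)     # one gradient descent step
print(a_new, J(a_new))            # [8. 7.] and 18.0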

Page 34

Gradient Descent Algorithm

[figure: J(x) with iterates x(1), x(2), …, x(k) and descent directions -∇J(x(1)), -∇J(x(2)), …, -∇J(x(k))]

k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α ∇J(x(k))
    k = k + 1
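A sketch of this loop in Python (assuming NumPy; alpha, eps and the iteration cap are illustrative choices, not values from the slide):

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, eps=1e-6, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:          # stop once the gradient is small
            break
        x = x - alpha * g                     # x(k+1) = x(k) - alpha * grad J(x(k))
    return x

# minimize J(x1, x2) = (x1 - 5)^2 + (x2 - 10)^2
grad = lambda x: np.array([2 * (x[0] - 5), 2 * (x[1] - 10)])
print(gradient_descent(grad, [0.0, 0.0]))     # converges to approximately [5, 10]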

Page 35

Gradient Descent: Local Minimum

• Not guaranteed to find the global minimum

• gets stuck in a local minimum, where -∇J(x(k)) = 0

• Still, gradient descent is very popular because it is simple and applicable to any differentiable function

Page 36

How to Set Learning Rate α?

• If α is too large, may overshoot the local minimum and possibly never even converge

• If α is too small, too many iterations to converge

• It helps to plot J(x) as a function of the iteration number, to make sure we are properly minimizing it

Page 37

Variable Learning Rate

• If desired, can change the learning rate at each iteration

fixed learning rate:

k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α ∇J(x(k))
    k = k + 1

variable learning rate:

k = 1
x(1) = any initial guess
choose ε
while ||∇J(x(k))|| > ε
    choose α(k)
    x(k+1) = x(k) - α(k) ∇J(x(k))
    k = k + 1

Page 38

Variable Learning Rate

• Usually do not keep track of all intermediate solutions

with intermediate solutions:

k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α ∇J(x(k))
    k = k + 1

in-place update:

k = 1
x = any initial guess
choose α, ε
while ||∇J(x)|| > ε
    x = x - α ∇J(x)
    k = k + 1

Page 39

Learning Rate

• Monitor the learning rate by looking at how fast the objective function decreases

[plot: J(x) vs. number of iterations, with curves for very high, high, low, and good learning rates]

Page 40

Learning Rate: Loss Surface Illustration

[figure: gradient descent paths on a loss surface for learning rates α = 0.001, 0.01, and 0.1, showing how many updates each needs to reach the minimum]

Page 41

Linear Classifiers

Page 42

Supervised Machine Learning: Recap

• Choose the type of f(x,w)

• w are tunable weights, x is the input example

• f(x,w) should output the correct class of sample x

• use labeled samples to tune the weights w so that f(x,w) gives the correct class y for x in the training data

• loss function L(f(x,w) ,y)

• How to choose type of f(x,w)?

• many choices

• simple choice: linear classifier

• other choices

• Neural Network

• Convolutional Neural Network

Page 43

Linear Classifier

• Use

• y = 1 for the first class

• y = -1 for the second class

• One choice for linear classifier

f(x,w) = sign(g(x,w))

• 1 if g(x,w) is positive

• -1 if g(x,w) is negative

[plot: f(x) = sign(g(x)) takes the value 1 where g(x) > 0 and -1 where g(x) < 0]

• Classifier makes a decision based on a linear combination of features

g(x,w) = w0+x1w1 + … + xdwd

• g(x,w) called discriminant function

Page 44

Linear Classifier: Decision Boundary

• f(x,w) = sign(g(x,w)) = sign(w0 + x1w1 + … + xdwd)

• Decision boundary is linear

• Find w0, w1, …, wd that give the best separation of the two classes with a linear boundary

[figure: a bad boundary vs. a better boundary]

Page 45

More on Linear Discriminant Function (LDF)

• LDF: g(x,w,w0) = w0+x1w1 + … + xdwd

x1

x2

decision boundary

g(x) = 0

g(x) > 0

decision region for class 1

g(x) < 0

decision region for class 2

w = [w1, w2, …, wd]ᵗ is the weight vector;  w0 is called the bias or threshold

Page 46

More on Linear Discriminant Function (LDF)

• Decision boundary: g(x,w) = w0+x1w1 + … + xdwd = 0

• This is a hyperplane, by definition

• a point in 1D

• a line in 2D

• a plane in 3D

• a hyperplane in higher dimensions

Page 47

Vector Notation

• Linear discriminant function: g(x, w, w0) = wᵗx + w0

• Example in 2D: g(x, w, w0) = 3x1 + 2x2 + 4, i.e. w = [3, 2]ᵗ, w0 = 4

• Shorter notation if we add an extra feature of value 1 to x:

x = [x1, x2]ᵗ  →  z = [1, x1, x2]ᵗ,  a = [4, 3, 2]ᵗ

g(z, a) = aᵗz = [4 3 2][1, x1, x2]ᵗ = 4 + 3x1 + 2x2 = wᵗx + w0 = g(x, w, w0)

• Use aᵗz instead of wᵗx + w0

Page 48

Fitting Parameters w

• Rewrite g(x, w, w0) = [w0 wᵗ][1, x1, …, xd]ᵗ = aᵗz = g(z, a)

• new weight vector a = [w0, w1, …, wd]ᵗ, new feature vector z = [1, x1, …, xd]ᵗ

• z is called the augmented feature vector

• the new problem is equivalent to the old one: g(z, a) = aᵗz

• f(z, a) = sign(g(z, a))
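A short sketch of the augmented notation (assuming NumPy; the helper names are mine), using the 2D example g(x) = 3x1 + 2x2 + 4 from the previous slide:

import numpy as np

def augment(X):
    # prepend the constant feature 1 to every example (one example per row)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(a, X):
    return np.sign(augment(X) @ a)

a = np.array([4.0, 3.0, 2.0])                 # a = [w0, w1, w2] for g(x) = 3*x1 + 2*x2 + 4
X = np.array([[1.0, 1.0], [-2.0, -1.0]])
print(predict(a, X))                          # [ 1. -1.]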

Page 49

Solution Region

• If there is an a that classifies all examples correctly, it is called a separating or solution vector

• then there are infinitely many solution vectors a

• the original samples x1, …, xn are also linearly separable

Page 50

Loss Function

• How to find a solution vector a?

• or, if no separating a exists, a good approximation a?

• Design a non-negative loss function L(a)

• L(a) is small if a is good

• L(a) is large if a is bad

• Minimize L(a) with gradient descent

• Two steps in the design of L(a):

1. per-example loss L(f(zi,a), yi): penalizes deviations of f(zi,a) from yi

2. total loss adds up the per-example loss over all training examples:

L(a) = Σi L(f(zi,a), yi)

Page 51

Loss Function, First Attempt

• Per-example loss function ‘counts’ if an error happens:

L(f(zi,a), yi) = 0 if f(zi,a) = yi,  1 otherwise

• Example:  z1 = [1, 2]ᵗ, y1 = 1;  z2 = [1, 4]ᵗ, y2 = -1;  a = [2, -3]ᵗ

f(z1,a) = sign(aᵗz1) = sign(2·1 - 3·2) = -1 ≠ y1,  so L(f(z1,a), y1) = 1

f(z2,a) = sign(aᵗz2) = sign(2·1 - 3·4) = -1 = y2,  so L(f(z2,a), y2) = 0

Page 52

Loss Function, First Attempt

• Per-example loss function ‘counts’ if an error happens:

L(f(zi,a), yi) = 0 if f(zi,a) = yi,  1 otherwise

• Total loss function counts the total number of errors:

L(a) = Σi L(f(zi,a), yi)

• For the previous example, with a = [2, -3]ᵗ:

L(f(z1,a), y1) = 1,  L(f(z2,a), y2) = 0

• Total loss: L(a) = 1 + 0 = 1

Page 53

Loss Function: First Attempt

• ‘error count’ loss function is not suitable for gradient descent

• piecewise constant; gradient is zero or does not exist

[plot: piecewise-constant L(a) as a function of a]

Page 54

Perceptron Loss Function

• Different loss function: Perceptron loss

Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• Lp(a) is non-negative:

• positive misclassified example zi:  aᵗzi < 0, yi = 1, so yi(aᵗzi) < 0

• negative misclassified example zi:  aᵗzi > 0, yi = -1, so yi(aᵗzi) < 0

• if zi is misclassified then yi(aᵗzi) < 0

• if zi is misclassified then -yi(aᵗzi) > 0

• Lp(a) is proportional to the distance of the misclassified examples to the boundary

Page 55

Perceptron Loss Function

Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• Example:  z1 = [1, 2]ᵗ, y1 = 1;  z2 = [1, 4]ᵗ, y2 = -1;  a = [2, -3]ᵗ

f(z1,a) = sign(aᵗz1) = sign(2·1 - 3·2) = sign(-4) = -1 ≠ y1,  so Lp(f(z1,a), y1) = -y1(aᵗz1) = 4

f(z2,a) = sign(aᵗz2) = sign(2·1 - 3·4) = -1 = y2,  so Lp(f(z2,a), y2) = 0

• Total loss: Lp(a) = 4 + 0 = 4

Page 56

Perceptron Loss Function

• Per-example loss:

Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• Total loss:

Lp(a) = Σi Lp(f(zi,a), yi)

• Lp(a) is piecewise linear and suitable for gradient descent

Page 57

Optimizing with Gradient Descent

• Gradient descent to minimize Lp(a):

a = a - α ∇Lp(a)

• Need the gradient vector ∇Lp(a)

• it has the same dimension as a; e.g. for a = [2, 3, 1]ᵗ:

∇Lp(a) = [∂Lp(a)/∂a1, ∂Lp(a)/∂a2, ∂Lp(a)/∂a3]ᵗ

• Per-example loss:

Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• Total loss:

Lp(a) = Σi Lp(f(zi,a), yi)

Page 58

Optimizing with Gradient Descent

• Gradient descent to minimize Lp(a):

a = a - α ∇Lp(a)

• Total loss:

Lp(a) = Σi Lp(f(zi,a), yi)

• Compute

∇Lp(a) = Σi ∇Lp(f(zi,a), yi)

• per-example gradient:

∇Lp(f(zi,a), yi) = [∂Lp(f(zi,a), yi)/∂a1, ∂Lp(f(zi,a), yi)/∂a2, ∂Lp(f(zi,a), yi)/∂a3]ᵗ

• Compute and add up the per-example gradients

Page 59

Per Example Loss Gradient

• Per-example loss has two cases:

Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• First case, f(zi,a) = yi:

∇Lp(f(zi,a), yi) = [0, 0, 0]ᵗ if f(zi,a) = yi,  ? otherwise

• To save space, rewrite:

∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  ? otherwise

Page 60

Per Example Loss Gradient

• Per-example loss has two cases:

Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• Second case, f(zi,a) ≠ yi:

∇Lp(f(zi,a), yi) = ∇(-yi aᵗzi)
  = [∂(-yi(a1zi1 + a2zi2 + a3zi3))/∂a1, ∂(-yi(a1zi1 + a2zi2 + a3zi3))/∂a2, ∂(-yi(a1zi1 + a2zi2 + a3zi3))/∂a3]ᵗ
  = [-yi zi1, -yi zi2, -yi zi3]ᵗ
  = -yi zi

• Gradient of the per-example loss:

∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi zi otherwise

Page 61

Optimizing with Gradient Descent

• Simpler formula:

∇Lp(a) = - Σ over misclassified examples i of (yi zi)

• Gradient descent update rule for Lp(a):

a = a + α Σ over misclassified examples i of (yi zi)

• called batch because it is based on all examples
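A sketch of this batch update in Python (assuming NumPy; the function name and iteration cap are mine, not from the slides):

import numpy as np

def perceptron_batch(Z, Y, a, alpha=0.1, n_iters=100):
    # Z: one augmented example per row, Y: labels in {-1, +1}, a: initial weights
    for _ in range(n_iters):
        margins = Y * (Z @ a)                         # y_i * (a^T z_i) for every example
        mis = margins < 0                             # misclassified examples
        if not mis.any():
            break                                     # all examples classified correctly
        a = a + alpha * (Y[mis, None] * Z[mis]).sum(axis=0)
    return a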

Page 62

Perceptron Loss Batch Example

• Examples:  class 1: x1 = [2, 3]ᵗ, x2 = [4, 3]ᵗ, x3 = [3, 5]ᵗ;  class 2: x4 = [1, 3]ᵗ, x5 = [5, 6]ᵗ

• Labels: y1 = 1, y2 = 1, y3 = 1, y4 = -1, y5 = -1

• Add the extra feature 1:  z1 = [1, 2, 3]ᵗ, z2 = [1, 4, 3]ᵗ, z3 = [1, 3, 5]ᵗ, z4 = [1, 1, 3]ᵗ, z5 = [1, 5, 6]ᵗ

• Pile all examples as rows in a matrix Z and all labels into a column vector Y:

Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1, 1, 1, -1, -1]ᵗ

Page 63

Perceptron Loss Batch Example

• Examples in Z, labels in Y:

Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1, 1, 1, -1, -1]ᵗ

• Initial weights a = [1, 1, 1]ᵗ

• This is the line 1 + x1 + x2 = 0

Page 64

Perceptron Loss Batch Example

• Examples in Z, labels in Y; initial weights a = [1, 1, 1]ᵗ

• Perceptron batch update:

a = a + α Σ over misclassified examples i of (yi zi)

• A sample is misclassified if y(aᵗz) < 0

• Let us use learning rate α = 0.2:

a = a + 0.2 Σ over misclassified examples i of (yi zi)

Page 65

Perceptron Loss Batch Example

• Examples in Z, labels in Y; weights a = [1, 1, 1]ᵗ

• Can compute y(aᵗz) for all samples at once:

Y .* (Z a) = [6, 8, 9, -5, -12]ᵗ

• A sample is misclassified if y(aᵗz) < 0

• Per-example loss:  Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵗzi) otherwise

• Total loss is L(a) = 5 + 12 = 17

Page 66

Perceptron Loss Batch Example

• Samples 4 and 5 are misclassified

• Perceptron batch rule update:

a = a + 0.2 Σ over misclassified examples i of (yi zi)
  = [1, 1, 1]ᵗ + 0.2·(-1)·[1, 1, 3]ᵗ + 0.2·(-1)·[1, 5, 6]ᵗ
  = [1 - 0.2 - 0.2, 1 - 0.2 - 1.0, 1 - 0.6 - 1.2]ᵗ
  = [0.6, -0.2, -0.8]ᵗ

• This is the line 0.6 - 0.2x1 - 0.8x2 = 0

Page 67

Perceptron Loss Batch Example

• Examples in Z, labels in Y; weights a = [0.6, -0.2, -0.8]ᵗ

• Find all misclassified samples:

Y .* (Z a) = [-2.2, -2.6, -4.0, 2.0, 5.2]ᵗ

• A sample is misclassified if y(aᵗz) < 0, so samples 1, 2, and 3 are misclassified

• Total loss is L(a) = 2.2 + 2.6 + 4 = 8.8

• previous loss was 17 with 2 misclassified examples

Page 68

Perceptron Loss Batch Example

• Samples 1, 2, and 3 are misclassified; a = [0.6, -0.2, -0.8]ᵗ

• Perceptron batch rule update:

a = a + 0.2 Σ over misclassified examples i of (yi zi)
  = [0.6, -0.2, -0.8]ᵗ + 0.2·([1, 2, 3]ᵗ + [1, 4, 3]ᵗ + [1, 3, 5]ᵗ)
  = [0.6 + 0.6, -0.2 + 1.8, -0.8 + 2.2]ᵗ
  = [1.2, 1.6, 1.4]ᵗ

• This is the line 1.2 + 1.6x1 + 1.4x2 = 0

Page 69

Perceptron Loss Batch Example

• Examples in Z, labels in Y; weights a = [1.2, 1.6, 1.4]ᵗ

• Find all misclassified samples:

Y .* (Z a) = [8.6, 11.8, 13.0, -7.0, -17.6]ᵗ

• A sample is misclassified if y(aᵗz) < 0, so samples 4 and 5 are misclassified

• Total loss is L(a) = 7 + 17.6 = 24.6

• previous loss was 8.8 with 3 misclassified examples

• the loss went up, which means the learning rate of 0.2 is too high
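A sketch (assuming NumPy) that reproduces the three iterations of this example, with Z, Y, a and α = 0.2 as reconstructed above:

import numpy as np

Z = np.array([[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]], dtype=float)
Y = np.array([1, 1, 1, -1, -1], dtype=float)
a = np.array([1.0, 1.0, 1.0])
alpha = 0.2

for it in range(3):
    margins = Y * (Z @ a)                             # y_i * (a^T z_i)
    mis = margins < 0
    print(it, a, -margins[mis].sum())                 # perceptron losses: 17.0, 8.8, 24.6
    a = a + alpha * (Y[mis, None] * Z[mis]).sum(axis=0)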

Page 70

Single-Sample Gradient Descent

• Batch gradient descent can be slow to converge if there are lots of examples

• Single sample optimization

• update the weights a as soon as possible, i.e. after seeing 1 example

• One iteration (epoch)

• go over all examples, in random order

• update after seeing one example zi, using the per-example gradient:

a = a - α ∇Lp(f(zi, a), yi)

Page 71

Batch Size: Loss Surface Illustration

[figure: paths on the loss surface from start to finish near the global minimum; one iteration of single-sample gradient descent sees only one example at a time, while one iteration of batch gradient descent uses all examples]

Page 72

Mini-Batch Gradient Descent

• Mini-batch optimization

• update the weights a after seeing m examples

• One iteration (epoch)

• go over all examples, in random order

• update after seeing m examples, summing their per-example gradients:

a = a - α Σ over the m examples i of ∇Lp(f(zi, a), yi)

• Most often used in practice

• Practical issue: both batch and mini-batch algorithms converge faster if features are roughly on the same scale
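A sketch of one mini-batch epoch for the perceptron loss (assuming NumPy; the function name, batch size m and learning rate are illustrative choices, not from the slides):

import numpy as np

def perceptron_minibatch_epoch(Z, Y, a, alpha=0.01, m=2, seed=0):
    # one epoch: visit all examples in random order, update after every m examples
    order = np.random.default_rng(seed).permutation(len(Y))
    for start in range(0, len(Y), m):
        idx = order[start:start + m]                  # the next mini-batch
        Zb, Yb = Z[idx], Y[idx]
        mis = Yb * (Zb @ a) < 0                       # misclassified examples in the batch
        a = a + alpha * (Yb[mis, None] * Zb[mis]).sum(axis=0)
    return a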

Page 73

Logistic Regression, Motivation

• Recall line fitting

• solution with squared error loss: one sample is misclassified

• solution with perceptron loss: all samples are classified correctly

• Instead of trying to get aᵗz close to y

• use a differentiable function Ϭ(aᵗz) that “squishes” the range

• try to get Ϭ(aᵗz) close to y

Page 74

Linear Classifier: Logistic Regression

• Use the logistic sigmoid function for “squishing” aᵗz:

Ϭ(t) = 1 / (1 + exp(-t))

• Ϭ(t) ranges between 0 and 1, with Ϭ(0) = 0.5

• Denote the classes with 1 and 0 now

• yi = 1 for the positive class, yi = 0 for the negative class

• Despite “regression” in the name, logistic regression is used for classification, not regression

Page 75

Logistic Regression vs. Regression

[figure: Ϭ(aᵗz) as a function of aᵗz, comparing the fit obtained with the quadratic loss to the fit obtained with the logistic regression loss]

Page 76

Logistic Regression: Loss Function

• Could use (yi - Ϭ(aᵗz))² as the per-example loss function

• Instead use a different loss

• if z has label 1, want Ϭ(aᵗz) close to 1, define the loss as  -log[Ϭ(aᵗz)]

• if z has label 0, want Ϭ(aᵗz) close to 0, define the loss as  -log[1 - Ϭ(aᵗz)]

• Gradient descent batch update rule:

a = a + α Σi (yi - Ϭ(aᵗzi)) zi

• Probabilistic interpretation

• P(class 1) = Ϭ(aᵗz)

• P(class 0) = 1 - P(class 1)

• the loss function is the negative log-likelihood, -log P(y)

• standard objective in statistics
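A sketch of this batch update in Python (assuming NumPy and labels in {0, 1}; the function names and the number of iterations are mine):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(Z, Y, alpha=0.1, n_iters=1000):
    # Z: one augmented example per row, Y: labels in {0, 1}
    a = np.zeros(Z.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Z @ a)                            # predicted P(class 1) for every example
        a = a + alpha * Z.T @ (Y - p)                 # batch gradient step on -log P(y)
    return a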

Page 77

Logistic Regression vs. Perceptron

• Green example is classified correctly, but is close to the decision boundary

• no loss under Perceptron

• non-negligible loss under logistic regression

• Logistic regression encourages the decision boundary to move away from the training samples

• better generalization

• In the figure, both boundaries have zero Perceptron loss; one has a larger LR loss and the other a smaller LR loss

• the red classifier works better for new data

Page 78

Logistic Regression vs. Regression vs. Perceptron

• Assuming labels are +1 and -1

[figure: per-example loss as a function of yaᵗz for the quadratic loss, the perceptron loss, and the logistic regression loss; yaᵗz < 0 means misclassified, small positive yaᵗz means classified correctly but close to the decision boundary, large yaᵗz means classified correctly and not too close to the decision boundary]

Page 79

Multiple Classes: General Case

• Define m linear discriminant functions

gi(x) = wiᵗx + wi0 for i = 1, 2, …, m

• Assign x to class i if

gi(x) > gj(x) for all j ≠ i

• Let Ri be the decision region for class i

• all points in Ri are assigned to class i

[figure: regions R1, R2, R3; for example, R1 is where g1(x) > g2(x) and g1(x) > g3(x)]

Page 80

Multiple Classes

• Can be shown that decision regions are convex

• In particular, they must be spatially contiguous

Page 81

Multiclass Linear Classifier: Matrix Notation

• Assume examples x are augmented with the extra feature 1, so there is no need to write the bias explicitly

• but from now on we will not change notation to z’s

• Define m discriminant functions

gi(x) = wiᵗx for i = 1, 2, …, m

• Assign x to the i that gives maximum gi(x)

• Picture illustration: x is fed to g1(x), g2(x), g3(x), g4(x); pile all outputs into one vector and decide class 4, the class with the largest output

Page 82

Multiclass Linear Classifier: Matrix Notation

• Could use a one-dimensional output yi ∊ {1, 2, 3, …, m}

• Convenient to use multi-dimensional outputs instead (one-hot encoding)

• if a sample is of class i, want the output vector to be 0 everywhere except position i, where it should be 1

class 1: yj = [1, 0, 0, 0]ᵗ;  class 2: yj = [0, 1, 0, 0]ᵗ;  class 3: yj = [0, 0, 1, 0]ᵗ;  class 4: yj = [0, 0, 0, 1]ᵗ

• Example in the figure: x is of class 2, so the desired output vector is [0, 1, 0, 0]ᵗ, while the classifier outputs some other vector of scores (g1(x), g2(x), g3(x), g4(x))
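A sketch of one-hot encoding in Python (assuming NumPy; here the classes are indexed from 0 rather than from 1):

import numpy as np

def one_hot(labels, num_classes):
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

print(one_hot([0, 1, 3], 4))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]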

Page 83

Multiclass Linear Classifier: Matrix Notation

• Assign x to the i that gives maximum gi(x) = wiᵗx

• In matrix notation: stack w1ᵗ, w2ᵗ, w3ᵗ, w4ᵗ as the rows of a matrix W; the vector of scores is then Wx

• Example:

W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  x = [1, 7, 4]ᵗ,  Wx = [2, -4, 47, -43]ᵗ

• Assign x to the class that corresponds to the largest entry of Wx

Page 84

Quadratic Loss Function

• Assign sample xi to the class that corresponds to the largest entry of Wxi

• Loss function?

• Can use the quadratic loss per sample xi:  ½ ||Wxi - yi||²

• for the example above, with Wxi = [2, -4, 47, -43]ᵗ and yi = [0, 0, 0, 1]ᵗ, the loss is (2² + 4² + 47² + 44²)/2

• total loss on all training samples:  L(W) = ½ Σi ||Wxi - yi||²

• gradient of the loss:

∇L(W) = Σi (Wxi - yi) xiᵗ

• ∇L(W) has the same shape as W

• batch gradient descent update:

W = W - α Σi (Wxi - yi) xiᵗ

Page 85

Quadratic Loss Function

• Consider a gradient descent update on a single sample x with α = 1:

W = W - (Wx - y) xᵗ

• Suppose x = [1, 3, 2]ᵗ is of class 2, so y = [0, 1, 0, 0]ᵗ, and

W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  Wx = [0, 4, 23, -17]ᵗ

• compared to the desired y, some entries of Wx are ok, some too small, and some too large

• Update:

W = W - (Wx - y) xᵗ = W - [0, 3, 23, -17]ᵗ [1, 3, 2]

• each row j of W is changed by -(Wx - y)j · xᵗ

Page 86

Quadratic Loss Function

• With the new W,

Wx = [0, -38, -299, 221]ᵗ

• this is even further from the desired y = [0, 1, 0, 0]ᵗ

• Already saw that quadratic loss does not work that well for classification

Page 87

Softmax Function

• Define the softmax(a) function for a vector a = [a1, a2, a3, a4]ᵗ:

softmax(a)j = exp(aj) / Σk exp(ak)

• Example:

softmax([3, 2, -2]ᵗ) = [exp(3), exp(2), exp(-2)]ᵗ / (exp(3) + exp(2) + exp(-2)) ≈ [0.7275, 0.2676, 0.0050]ᵗ

• Softmax renormalizes a vector so that it can be interpreted as a vector of probabilities
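A sketch of the softmax function (assuming NumPy; subtracting the maximum is a standard numerical-stability trick and does not change the result):

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))     # shift by the max to avoid overflow
    return e / e.sum()

print(softmax(np.array([3.0, 2.0, -2.0])))    # approximately [0.7275 0.2676 0.0049]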

Page 88

Softmax Loss Function

• Generalization of logistic regression to the multiclass case

• Instead of the raw scores

[w1ᵗx, w2ᵗx, w3ᵗx, w4ᵗx]ᵗ = [2, -1, 5, -3]ᵗ

• use the softmax scores

softmax([w1ᵗx, w2ᵗx, w3ᵗx, w4ᵗx]ᵗ) = [Pr(class 1), Pr(class 2), Pr(class 3), Pr(class 4)]ᵗ

softmax([2, -1, 5, -3]ᵗ) ≈ [0.0473, 0.0024, 0.9500, 0.0003]ᵗ

• Classifier output is interpreted as a probability for each class

Page 89

Gradient Descent: Softmax Loss Function

• Optimize under the -log Pr(yi) loss function

• Example:

xi = [1, 3, 2]ᵗ,  yi = [0, 0, 0, 1]ᵗ,  W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  Wxi = [0, 4, 23, -17]ᵗ

softmax(Wxi) = [Pr(class 1), Pr(class 2), Pr(class 3), Pr(class 4)]ᵗ ≈ [0.0000000001, 0.0000000056, 0.9999999943, 0.000000000000001]ᵗ

• Loss on this example is -log(0.000000000000001) ≈ 40

Page 90

Gradient Descent: Softmax Loss Function

• Update rule for the weight matrix W:

W = W + α Σi (yi - softmax(Wxi)) xiᵗ

• Example, single sample gradient descent with α = 0.1:

xi = [1, 3, 2]ᵗ,  yi = [0, 0, 0, 1]ᵗ,  W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  Wxi = [0, 4, 23, -17]ᵗ

• Update for W:

W = W + 0.1 (yi - softmax(Wxi)) xiᵗ ≈ [2 4 -7; 9 -3 2; 3.9 4.7 1.8; 2.1 -6.7 1.2]
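A sketch of this update applied one sample at a time, as in the example above (assuming NumPy and one-hot labels; the function names are mine):

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_single_sample_updates(W, X, Y_onehot, alpha=0.1):
    # X: one (augmented) example per row, Y_onehot: one one-hot label per row
    for x, y in zip(X, Y_onehot):
        p = softmax(W @ x)                            # predicted class probabilities
        W = W + alpha * np.outer(y - p, x)            # gradient step on -log Pr(true class)
    return W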

Page 91

More General Discriminant Functions

• Linear discriminant functions

• simple decision boundary

• should try simpler models first to avoid overfitting

• optimal for certain type of data

• Gaussian distributions with equal covariance

• May not be optimal for other data distributions

• Discriminant functions can be more general than linear

• For example, polynomial discriminant functions

• Decision boundaries more complex than linear

• Later will look more at non-linear discriminant functions

Page 92

• Linear classifier works well when examples are linearly separable, or almost separable

• The Perceptron loss was historically the first loss function

• Logistic regression/softmax work better in practice

• Optimization with gradient descent

• stochastic mini-batch works best in practice

Summary

Page 93

Linear Classifier: Quadratic Loss

• Quadratic per-example loss:

Lp(f(zi,a), yi) = ½ (yi - aᵗzi)²

• This is standard line fitting (fit aᵗz to the labels y)

• note that even correctly classified examples can have a large loss

• Can find the optimal weights a analytically with least squares

• expensive for large problems

• Gradient descent is more efficient for larger problems:

∇Lp(a) = - Σi (yi - aᵗzi) zi

• Batch update rule:

a = a + α Σi (yi - aᵗzi) zi