
Page 1:

CS4442/9542b Artificial Intelligence II

Prof. Olga Veksler

Lecture 4: Machine Learning

Linear Classifier, 2 Classes

Page 2:

Outline

• Optimization with gradient descent
• Linear classifier
  • two class case
• Loss functions
• Perceptron
  • batch
  • single sample
• Logistic Regression

Page 3:

Optimization

• How to minimize a function of a single variable?
  J(x) = (x-5)^2
• From calculus: take the derivative, set it to 0
  d/dx J(x) = 0
• Solve the resulting equation
  • may be easy or hard to solve
• Example above is easy:
  d/dx J(x) = 2(x-5) = 0  ⟹  x = 5

Page 4:

Optimization

• How to minimize a function of many variables?
  J(x) = J(x1, …, xd)
• From calculus: take partial derivatives, set them to 0
  ∇J(x) = [∂J/∂x1, …, ∂J/∂xd]^t = 0   (∇J(x) is the gradient)
• Solve the resulting system of d equations
• It may not be possible to solve the system of equations above analytically

Page 5:

Optimization: Gradient Direction

[Figure: surface plot of J(x1, x2); picture from Andrew Ng]

• The gradient ∇J(x) points in the direction of steepest increase of the function J(x)
• -∇J(x) points in the direction of steepest decrease

Page 6:

Gradient Direction in 1D

• The gradient is just the derivative in 1D
• Example: J(x) = (x-5)^2, with derivative d/dx J(x) = 2(x-5)
• Let x = 3
  • d/dx J(3) = -4: negative slope, negative derivative
  • the derivative says increase x
• Let x = 8
  • d/dx J(8) = 6: positive slope, positive derivative
  • the derivative says decrease x

Page 7:

Gradient Direction in 2D

• J(x1, x2) = (x1-5)^2 + (x2-10)^2
  ∂J/∂x1 = 2(x1-5)
  ∂J/∂x2 = 2(x2-10)
• Let a = [10, 5]^t
  ∂J/∂x1(a) = 10
  ∂J/∂x2(a) = -10
  ∇J(a) = [10, -10]^t
[Figure: contour plot with the global minimum at (x1, x2) = (5, 10); the gradient at a points away from the minimum]

Page 8:

Gradient Descent: Step Size

• J(x1, x2) = (x1-5)^2 + (x2-10)^2
• From the previous slide: a = [10, 5]^t, ∇J(a) = [10, -10]^t
• Which step size to take?
• Controlled by the parameter α, called the learning rate
• Let α = 0.2
  a - α∇J(a) = [10, 5]^t - 0.2·[10, -10]^t = [8, 7]^t
• J(10, 5) = 50; J(8, 7) = 18
[Figure: contour plot with the global minimum at (5, 10) and the step from (10, 5) to (8, 7)]

Page 9:

Gradient Descent Algorithm

[Figure: J(x) in 1D with descent iterates x(1), x(2), …, x(k); the steps -∇J(x(1)), -∇J(x(2)) point toward the minimum, and -∇J(x(k)) ≈ 0]

k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α ∇J(x(k))
    k = k + 1
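A minimal Matlab sketch of this loop, assuming the 1D example J(x) = (x-5)^2 from the earlier slides (the gradient handle, tolerance and initial guess are illustrative choices, not values from the slides):

% gradient descent on J(x) = (x-5)^2
gradJ = @(x) 2*(x - 5);      % derivative of J
alpha = 0.1;                 % learning rate
eps   = 1e-6;                % stopping tolerance
x = 0;                       % initial guess
while abs(gradJ(x)) > eps
    x = x - alpha * gradJ(x);
end
% x is now close to the minimizer 5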

Page 10:

Gradient Descent: Local Minimum

• Not guaranteed to find the global minimum
  • gets stuck in a local minimum
[Figure: J(x) with descent iterates x(1), x(2), …, x(k) converging to a local minimum where -∇J(x(k)) = 0, while the global minimum lies elsewhere]
• Still, gradient descent is very popular because it is simple and applicable to any differentiable function

Page 11:

How to Set Learning Rate α?

• If α is too large, we may overshoot the local minimum and possibly never converge
[Figure: iterates x(1), x(2), x(3), x(4) jumping back and forth across the minimum of J(x)]
• If α is too small, it takes too many iterations to converge
• It helps to compute J(x) as a function of the iteration number, to make sure we are properly minimizing it

Page 12:

Variable Learning Rate

• If desired, we can change the learning rate at each iteration

Fixed learning rate:
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α ∇J(x(k))
    k = k + 1

Variable learning rate:
k = 1
x(1) = any initial guess
choose ε
while ||∇J(x(k))|| > ε
    choose α(k)
    x(k+1) = x(k) - α(k) ∇J(x(k))
    k = k + 1

Page 13:

Variable Learning Rate

• Usually we do not keep track of all intermediate solutions

With iterates:
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α ∇J(x(k))
    k = k + 1

In-place version:
k = 1
x = any initial guess
choose α, ε
while ||∇J(x)|| > ε
    x = x - α ∇J(x)
    k = k + 1

Page 14:

Learning Rate

• Monitor the learning rate by looking at how fast the objective function decreases
[Figure: J(x) vs. number of iterations, with curves labelled: very high learning rate, high learning rate, low learning rate, good learning rate]

Page 15:

Learning Rate: Loss Surface Illustration

[Figure: gradient descent paths on a loss surface for two different learning rates; the smaller learning rate needs many more updates (on the order of ~3k vs. a small fraction of that) to reach the minimum]

Page 16:

Advanced Optimization Methods

• There are more advanced gradient-based optimization methods
  • such as conjugate gradient
• They automatically pick a good learning rate
• They usually converge faster
• However, they are more complex to understand and implement
• In Matlab, use fminunc for various advanced optimization methods
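As a hedged illustration (not from the slides), minimizing the earlier example J(x) = (x-5)^2 with fminunc might look like this; the starting point 0 is arbitrary:

J  = @(x) (x - 5).^2;           % objective to minimize
x0 = 0;                         % initial guess
[xmin, Jmin] = fminunc(J, x0);  % xmin should be close to 5, Jmin close to 0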

Page 17:

Supervised Machine Learning (Recap)

• Choose the type of f(x,w)
  • w are tunable weights, x is the input example
  • f(x,w) should output the correct class of sample x
• Use labeled samples to tune the weights w so that f(x,w) gives the correct class y for x
  • with the help of a loss function L(f(x,w), y)
• How to choose the type of f(x,w)?
  • many choices
  • previous lecture: kNN classifier
  • this lecture: linear classifier

Page 18:

Linear Classifier

• Encode the 2 classes as
  • y = 1 for the first class
  • y = -1 for the second class
• One choice for a linear classifier:
  f(x,w) = sign(g(x,w))
  • 1 if g(x,w) is positive
  • -1 if g(x,w) is negative
[Figure: step function f(x) jumping from -1 to 1 where g(x) crosses 0]
• A classifier is linear if it makes its decision based on a linear combination of the features:
  g(x,w) = w0 + x1w1 + … + xdwd
• g(x,w) is sometimes called a discriminant function

Page 19:

Linear Classifier: Decision Boundary

• f(x,w) = sign(g(x,w)) = sign(w0 + x1w1 + … + xdwd)
• The decision boundary is linear
• Find w0, w1, …, wd that give the best separation of the two classes with a linear boundary
[Figure: two point classes in 2D with a bad boundary and a better boundary]

Page 20:

More on Linear Discriminant Function (LDF)

• LDF: g(x,w,w0) = w0 + x1w1 + … + xdwd
  • w = [w1, …, wd]^t is the weight vector
  • w0 is called the bias or threshold
• Decision regions (illustrated in 2D with features x1, x2):
  • decision boundary: g(x) = 0
  • g(x) > 0: decision region for class 1
  • g(x) < 0: decision region for class 2

Page 21:

More on Linear Discriminant Function (LDF)

• Decision boundary: g(x,w) = w0+x1w1 + … + xdwd = 0

• This is a hyperplane, by definition

• a point in 1D

• a line in 2D

• a plane in 3D

• a hyperplane in higher dimensions

Page 22:

Vector Notation

• Linear discriminant function: g(x, w, w0) = w^t x + w0
• Example in 2D:
  g(x, w, w0) = 3x1 + 2x2 + 4,  i.e. w = [3, 2]^t, w0 = 4
• Shorter notation if we add an extra feature of value 1 to x:
  x = [x1, x2]^t  →  z = [1, x1, x2]^t,  a = [4, 3, 2]^t
  g(z, a) = a^t z = 4 + 3x1 + 2x2 = w^t x + w0 = g(x, w, w0)
• Use a^t z instead of w^t x + w0

Page 23:

Fitting Parameters w

• Rewrite g(x, w, w0) = [w0  w^t] [1; x] = a^t z = g(z, a)
  • new feature vector z = [1, x1, …, xd]^t
  • new weight vector a = [w0, w1, …, wd]^t
• z is called the augmented feature vector
• The new problem is equivalent to the old one: g(z, a) = a^t z

Page 24:

Augmented Feature Vector

• Feature augmenting simplifies notation
• Assume augmented feature vectors for the rest of the lecture
  • given examples x1, …, xn, convert them to augmented examples z1, …, zn by adding a new dimension of value 1
• g(z, a) = a^t z
• f(z, a) = sign(g(z, a))
[Figure: decision boundary g(z) = 0 with regions g(z) > 0 and g(z) < 0]
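A small Matlab sketch of this augmentation step, assuming the n examples are stored as the rows of an n×d matrix X (the variable names are illustrative):

% X is n x d, one example per row
n = size(X, 1);
Z = [ones(n, 1) X];      % augmented examples, n x (d+1)
% a is a (d+1) x 1 weight vector; predicted labels (+1 / -1):
f = sign(Z * a);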

Page 25:

Solution Region

• If there is a weight vector a that classifies all examples correctly, it is called a separating or solution vector
  • then there are infinitely many solution vectors a
  • then the original samples x1, …, xn are also linearly separable
[Figure: several separating vectors a for the same training data]

Page 26:

Solution Region

• Solution region: the set of all solution vectors a
[Figure: solution region in weight space]

Page 27:

Loss Function

• How to find a solution vector a?
  • or, if no separating a exists, a good approximate solution vector a?
• Design a non-negative loss function L(a)
  • L(a) is small if a is good
  • L(a) is large if a is bad
• Minimize L(a) with gradient descent
• Usually the design of L(a) has two steps:
  1. design a per-example loss L(f(zi,a), yi)
     • penalizes deviations of f(zi,a) from yi
  2. the total loss adds up the per-example loss over all training examples:
     L(a) = Σi L(f(zi,a), yi)

Page 28:

Loss Function, First Attempt

• Per-example loss function measures whether an error happens:
  L(f(zi,a), yi) = 1 if f(zi,a) ≠ yi, and 0 otherwise
• Example
  z1 = [1, 2]^t, y1 = 1;  z2 = [1, 4]^t, y2 = -1;  a = [2, -3]^t
  f(z1,a) = sign(a^t z1) = sign(2·1 - 3·2) = -1 ≠ y1  ⟹  L(f(z1,a), y1) = 1
  f(z2,a) = sign(a^t z2) = sign(2·1 - 3·4) = -1 = y2  ⟹  L(f(z2,a), y2) = 0

Page 29:

Loss Function, First Attempt

• Per-example loss function measures whether an error happens:
  L(f(zi,a), yi) = 1 if f(zi,a) ≠ yi, and 0 otherwise
• Total loss function:
  L(a) = Σi L(f(zi,a), yi)
• For the previous example
  L(f(z1,a), y1) = 1,  L(f(z2,a), y2) = 0,  so L(a) = 1 + 0 = 1
• Thus this loss function just counts the number of errors

Page 30:

Loss Function: First Attempt

• Per-example loss: L(f(zi,a), yi) = 1 if f(zi,a) ≠ yi, and 0 otherwise
• Total loss: L(a) = Σi L(f(zi,a), yi)
• Unfortunately, we cannot minimize this loss function with gradient descent
  • it is piecewise constant: the gradient is zero or does not exist
[Figure: piecewise constant L(a) as a function of a]

Page 31:

Perceptron Loss Function

• A different loss function: the perceptron loss
  Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Lp(a) is non-negative
  • positive misclassified example zi: a^t zi < 0 and yi = 1, so yi(a^t zi) < 0
  • negative misclassified example zi: a^t zi > 0 and yi = -1, so yi(a^t zi) < 0
  • if zi is misclassified then yi(a^t zi) < 0
  • if zi is misclassified then -yi(a^t zi) > 0
• Lp(a) is proportional to the distance of a misclassified example to the boundary

Page 32:

Perceptron Loss Function

  Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Example
  z1 = [1, 2]^t, y1 = 1;  z2 = [1, 4]^t, y2 = -1;  a = [2, -3]^t
  f(z1,a) = sign(a^t z1) = sign(-4) = -1 ≠ y1  ⟹  Lp(f(z1,a), y1) = -y1(a^t z1) = -(1)(-4) = 4
  f(z2,a) = sign(a^t z2) = sign(-10) = -1 = y2  ⟹  Lp(f(z2,a), y2) = 0
• Total loss Lp(a) = 4 + 0 = 4

Page 33:

Perceptron Loss Function

• Per-example loss: Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Total loss: Lp(a) = Σi Lp(f(zi,a), yi)
• Lp(a) is piecewise linear and suitable for gradient descent
[Figure: piecewise linear Lp(a) as a function of a]

Page 34:

Optimizing with Gradient Descent

• Per-example loss: Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Total loss: Lp(a) = Σi Lp(f(zi,a), yi)
• Recall minimization with gradient descent, main step:
  x = x - α ∇J(x)
• Gradient descent to minimize Lp(a), main step:
  a = a - α ∇Lp(a)
• Need the gradient vector ∇Lp(a)
  • it has as many dimensions as a
  • if a has 3 dimensions, ∇Lp(a) has 3 dimensions:
    ∇Lp(a) = [∂Lp/∂a1, ∂Lp/∂a2, ∂Lp/∂a3]^t

Page 35:

Optimizing with Gradient Descent

• Per-example loss: Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Total loss: Lp(a) = Σi Lp(f(zi,a), yi)
• Gradient descent to minimize Lp(a), main step: a = a - α ∇Lp(a)
• Need the gradient vector ∇Lp(a):
  ∇Lp(a) = ∇ Σi Lp(f(zi,a), yi) = Σi ∇Lp(f(zi,a), yi)
  • ∇Lp(f(zi,a), yi) = [∂Lp(f(zi,a),yi)/∂a1, ∂Lp(f(zi,a),yi)/∂a2, ∂Lp(f(zi,a),yi)/∂a3]^t is the per-example gradient
• Compute and add up the per-example gradient vectors

Page 36:

Per Example Loss Gradient

• The per-example loss has two cases:
  Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• To save space, rewrite its gradient as
  ∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and ? otherwise
• First case, f(zi,a) = yi: the per-example loss is the constant 0, so its gradient is [0, 0, 0]^t

Page 37:

Per Example Loss Gradient

• The per-example loss has two cases:
  Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Second case, f(zi,a) ≠ yi:
  ∇Lp(f(zi,a), yi) = [ ∂/∂a1(-yi a^t zi), ∂/∂a2(-yi a^t zi), ∂/∂a3(-yi a^t zi) ]^t
                   = [ ∂/∂a1(-yi(a1zi1 + a2zi2 + a3zi3)), ∂/∂a2(…), ∂/∂a3(…) ]^t
                   = [ -yi zi1, -yi zi2, -yi zi3 ]^t
                   = -yi zi
• Combining both cases, the gradient of the per-example loss is
  ∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi zi otherwise

Page 38:

Optimizing with Gradient Descent

• Gradient of the per-example loss:
  ∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi zi otherwise
• Total gradient: ∇Lp(a) = Σi ∇Lp(f(zi,a), yi)
• Simpler formula: ∇Lp(a) = - Σ_{misclassified examples zi} yi zi
• Gradient descent update rule for Lp(a):
  a = a + α Σ_{misclassified examples zi} yi zi
• Called batch because it is based on all examples
  • can be slow if the number of examples is very large
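A minimal Matlab sketch of one batch perceptron update, assuming the augmented examples are the rows of Z (n×(d+1)), Y is an n×1 vector of ±1 labels, and alpha is the learning rate (variable names are illustrative):

% one batch perceptron update
margins = Y .* (Z * a);                    % y_i * (a' * z_i) for every example
mis     = margins < 0;                     % logical index of misclassified examples
Lp      = -sum(margins(mis));              % perceptron loss, useful for monitoring
a       = a + alpha * Z(mis, :)' * Y(mis); % add alpha * sum of y_i * z_i over misclassified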

Page 39:

Perceptron Loss Batch Example

• Examples
  class 1: x1 = [2, 3]^t, x2 = [4, 3]^t, x3 = [3, 5]^t
  class 2: x4 = [1, 3]^t, x5 = [5, 6]^t
• Labels
  y1 = 1, y2 = 1, y3 = 1, y4 = -1, y5 = -1
• Add the extra feature of value 1:
  z1 = [1, 2, 3]^t, z2 = [1, 4, 3]^t, z3 = [1, 3, 5]^t, z4 = [1, 1, 3]^t, z5 = [1, 5, 6]^t
• Pile all examples as rows in matrix Z and all labels into column vector Y:
  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1]

Page 40:

Perceptron Loss Batch Example

• Examples in Z, labels in Y:
  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1]
• Initial weights
  a = [1, 1, 1]^t
• This is the line x1 + x2 + 1 = 0

Page 41:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [1, 1, 1]^t
• Perceptron batch rule:
  a = a + α Σ_{misclassified examples zi} yi zi
• A sample is misclassified if y(a^t z) < 0
• Let us use learning rate α = 0.2:
  a = a + 0.2 Σ_{misclassified examples zi} yi zi

Page 42:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [1, 1, 1]^t
• A sample is misclassified if y(a^t z) < 0
• Could use a for loop to compute a^t zi for each example
• For i = 1:
  y1(a^t z1) = 1 · (1·1 + 1·2 + 1·3) = 6 > 0, so z1 is classified correctly
• Repeat for i = 2, 3, 4, 5
• Or find all misclassified samples with one line in Matlab

Page 43:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [1, 1, 1]^t
• A sample is misclassified if y(a^t z) < 0
• Can compute a^t z for all samples at once:
  Z*a = [a^t z1; a^t z2; a^t z3; a^t z4; a^t z5] = [6; 8; 9; 5; 12]

Page 44:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [1, 1, 1]^t
• A sample is misclassified if y(a^t z) < 0
• Can compute y(a^t z) for all samples in one line:
  Y.*(Z*a) = [1; 1; 1; -1; -1] .* [6; 8; 9; 5; 12] = [6; 8; 9; -5; -12]
• Per-example loss: Lp(f(zi,a), yi) = 0 if f(zi,a) = yi, and -yi(a^t zi) otherwise
• Total loss is Lp(a) = 5 + 12 = 17
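Putting these one-liners together, a Matlab sketch of the loss computation for this example (Z, Y, a as above):

Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6];
Y = [1; 1; 1; -1; -1];
a = [1; 1; 1];
margins = Y .* (Z * a);     % [6; 8; 9; -5; -12]
mis = margins < 0;          % samples 4 and 5 are misclassified
Lp  = -sum(margins(mis));   % perceptron loss = 17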

Page 45:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [1, 1, 1]^t
• Samples 4 and 5 are misclassified
• Perceptron batch rule update:
  a = a + 0.2 Σ_{misclassified examples zi} yi zi
    = [1; 1; 1] + 0.2 · ( (-1)·[1; 1; 3] + (-1)·[1; 5; 6] )
    = [1; 1; 1] + 0.2 · [-2; -6; -9]
    = [0.6; -0.2; -0.8]
• This is the line -0.2x1 - 0.8x2 + 0.6 = 0

Page 46:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [0.6; -0.2; -0.8]
• Find all misclassified samples:
  Y.*(Z*a) = [-2.2; -2.6; -4.0; 2.0; 5.2]
• A sample is misclassified if y(a^t z) < 0, so samples 1, 2, 3 are misclassified
• Total loss is Lp(a) = 2.2 + 2.6 + 4 = 8.8
  • previous loss was 17 with 2 misclassified examples

Page 47:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [0.6; -0.2; -0.8]
  Y.*(Z*a) = [-2.2; -2.6; -4.0; 2.0; 5.2], so samples 1, 2, 3 are misclassified
• Perceptron batch rule update:
  a = a + 0.2 Σ_{misclassified examples zi} yi zi
    = [0.6; -0.2; -0.8] + 0.2 · ( 1·[1; 2; 3] + 1·[1; 4; 3] + 1·[1; 3; 5] )
    = [0.6; -0.2; -0.8] + 0.2 · [3; 9; 11]
    = [1.2; 1.6; 1.4]
• This is the line 1.6x1 + 1.4x2 + 1.2 = 0

Page 48:

Perceptron Loss Batch Example

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; -1; -1],  a = [1.2; 1.6; 1.4]
• Find all misclassified samples:
  Y.*(Z*a) = [8.6; 11.8; 13.0; -7.0; -17.6]
• A sample is misclassified if y(a^t z) < 0, so samples 4 and 5 are misclassified
• Total loss is Lp(a) = 7 + 17.6 = 24.6
  • previous loss was 8.8 with 3 misclassified examples
  • the loss went up, which means the learning rate of 0.2 is too high

Page 49:

Perceptron Single Sample Gradient Descent

• Batch perceptron can be slow to converge if there are lots of examples
• Single sample optimization: update the weights a as soon as possible, after seeing 1 example
• One iteration (epoch): go over all examples; as soon as a misclassified example is found, update
  a = a + yz
  • z is the misclassified example, y is its label
• Best to go over the examples in random order
• Geometric intuition (illustration for a positive example z):
  • z misclassified by a means y(a^t z) < 0
  • z is on the wrong side of the decision boundary
  • adding yz moves the decision boundary in the right direction
[Figure: weight vector a, misclassified positive example z, and the updated vector a_new = a + yz]
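A Matlab sketch of one epoch of the single sample rule, again assuming the rows of Z are augmented examples and Y holds ±1 labels; the random ordering via randperm is an illustrative choice:

% one epoch of the single sample perceptron rule (learning rate 1)
n = size(Z, 1);
for i = randperm(n)                 % visit examples in random order
    if Y(i) * (Z(i, :) * a) < 0     % example i is misclassified
        a = a + Y(i) * Z(i, :)';    % update immediately
    end
end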

Page 50:

Perceptron Single Sample Rule

• If the step is too large, a previously correctly classified sample zi may now be misclassified
• If the step is too small, z is still misclassified after the update
[Figure: weight vectors a(k) and a(k+1) with samples z and zi illustrating the two cases]

Page 51:

Batch Size: Loss Surface Illustration

• Batch gradient descent, one iteration: uses all examples
[Figure: smooth path on the loss surface from start to finish toward the global minimum]
• Single sample gradient descent, one iteration: sees only one example
[Figure: noisier path on the loss surface from start to finish toward the global minimum]

Page 52:

Perceptron Single Sample Rule Example

• class 1: students who get grade A
• class 2: students who get grade F

name    good attendance?  tall?  sleeps in class?  chews gum?  grade
Jane    yes               yes    no                no          A
Steve   yes               yes    yes               yes         F
Mary    no                no     no                yes         F
Peter   yes               no     no                yes         A

Page 53:

Perceptron Single Sample Rule Example

• Convert attributes to numerical values

name    good attendance?  tall?  sleeps in class?  chews gum?  y
Jane     1                 1     -1                -1           1
Steve    1                 1      1                 1          -1
Mary    -1                -1     -1                 1          -1
Peter    1                -1     -1                 1           1

Page 54:

Augment Feature Vector

• Convert samples x1, …, xn to augmented samples z1, …, zn by adding a new dimension of value 1

name    extra  good attendance?  tall?  sleeps in class?  chews gum?  y
Jane     1      1                 1     -1                -1           1
Steve    1      1                 1      1                 1          -1
Mary     1     -1                -1     -1                 1          -1
Peter    1      1                -1     -1                 1           1

Page 55:

Apply Single Sample Rule

• Set a fixed learning rate α = 1
• Gradient descent with the single sample rule
  • visit examples in random order
  • an example is misclassified if y(a^t z) < 0
  • when a misclassified example z is found, update a(k+1) = a(k) + yz

name    extra  good attendance?  tall?  sleeps in class?  chews gum?  y
Jane     1      1                 1     -1                -1           1
Steve    1      1                 1      1                 1          -1
Mary     1     -1                -1     -1                 1          -1
Peter    1      1                -1     -1                 1           1

Page 56:

Apply Single Sample Rule

• Initial weights a = [0.25, 0.25, 0.25, 0.25, 0.25]^t
• For simplicity, we will visit all samples sequentially
• An example is misclassified if y(a^t z) < 0

name    y    y(a^t z)                                                      misclassified?
Jane    1    0.25·1 + 0.25·1 + 0.25·1 + 0.25·(-1) + 0.25·(-1) > 0          no
Steve  -1    -1·(0.25·1 + 0.25·1 + 0.25·1 + 0.25·1 + 0.25·1) < 0           yes

• New weights:
  a = a + yz = [0.25; 0.25; 0.25; 0.25; 0.25] + (-1)·[1; 1; 1; 1; 1] = [-0.75; -0.75; -0.75; -0.75; -0.75]

Page 57:

Apply Single Sample Rule

  a = [-0.75, -0.75, -0.75, -0.75, -0.75]^t

name    y    y(a^t z)                                                           misclassified?
Mary   -1    -1·(-0.75·1 - 0.75·(-1) - 0.75·(-1) - 0.75·(-1) - 0.75·1) < 0      yes

• New weights:
  a = a + yz = [-0.75; -0.75; -0.75; -0.75; -0.75] + (-1)·[1; -1; -1; -1; 1] = [-1.75; 0.25; 0.25; 0.25; -1.75]

Page 58:

Apply Single Sample Rule

  a = [-1.75, 0.25, 0.25, 0.25, -1.75]^t

name    y    y(a^t z)                                                     misclassified?
Peter   1    -1.75·1 + 0.25·1 + 0.25·(-1) + 0.25·(-1) - 1.75·1 < 0        yes

• New weights:
  a = a + yz = [-1.75; 0.25; 0.25; 0.25; -1.75] + 1·[1; 1; -1; -1; 1] = [-0.75; 1.25; -0.75; -0.75; -0.75]

Page 59:

Single Sample Rule: Convergence

  a = [-0.75, 1.25, -0.75, -0.75, -0.75]^t

name    y    y(a^t z)                                                               misclassified?
Jane    1    -0.75·1 + 1.25·1 - 0.75·1 - 0.75·(-1) - 0.75·(-1) > 0                  no
Steve  -1    -1·(-0.75·1 + 1.25·1 - 0.75·1 - 0.75·1 - 0.75·1) > 0                   no
Mary   -1    -1·(-0.75·1 + 1.25·(-1) - 0.75·(-1) - 0.75·(-1) - 0.75·1) > 0          no
Peter   1    -0.75·1 + 1.25·1 - 0.75·(-1) - 0.75·(-1) - 0.75·1 > 0                  no
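The whole run on the student data can be reproduced with a short Matlab sketch; the matrix Z and labels Y below restate the table from the earlier slides, and the sequential visiting order matches the example:

% single sample perceptron on the student data, one sequential pass
Z = [1  1  1 -1 -1;    % Jane
     1  1  1  1  1;    % Steve
     1 -1 -1 -1  1;    % Mary
     1  1 -1 -1  1];   % Peter
Y = [1; -1; -1; 1];
a = 0.25 * ones(5, 1);           % initial weights
for i = 1:4                      % visit samples sequentially
    if Y(i) * (Z(i, :) * a) < 0  % misclassified
        a = a + Y(i) * Z(i, :)';
    end
end
% a is now [-0.75; 1.25; -0.75; -0.75; -0.75]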

Page 60:

Single Sample Rule: Convergence

  a = [-0.75, 1.25, -0.75, -0.75, -0.75]^t
• The discriminant function is
  g(z) = -0.75z0 + 1.25z1 - 0.75z2 - 0.75z3 - 0.75z4
• Converting back to the original features x:
  g(x) = 1.25x1 - 0.75x2 - 0.75x3 - 0.75x4 - 0.75

Page 61:

Final Classifier

• Trained LDF: g(x) = 1.25x1 - 0.75x2 - 0.75x3 - 0.75x4 - 0.75
  (features: good attendance, tall, sleeps in class, chews gum)
• Leads to the classifier:
  1.25x1 - 0.75x2 - 0.75x3 - 0.75x4 > 0.75  ⟹  grade A
• This is just one possible solution vector
• With a(1) = [0, 0.5, 0.5, 0, 0], the solution is [-1, 1.5, -0.5, -1, -1]:
  1.5x1 - 0.5x2 - x3 - x4 > 1  ⟹  grade A
  • in this solution, being tall is the least important feature

Page 62:

Convergence under Perceptron Loss

1. Classes are linearly separable
  • with a fixed learning rate, both single sample and batch versions converge to a correct solution a
  • it can be any a in the solution space
2. Classes are not linearly separable
  • with a fixed learning rate, both single sample and batch versions do not converge
  • can ensure convergence with an appropriate variable learning rate α(k)
    • α(k) → 0 as k → ∞
    • example, inverse linear: α(k) = c/k, where c is any constant
    • this also converges in the linearly separable case
• Practical issue: both single sample and batch algorithms converge faster if the features are roughly on the same scale
  • see the kNN lecture on feature normalization

Page 63:

Batch vs. Single Sample Rules

Batch
• True gradient descent, the full gradient is computed
• Smoother gradient because all samples are used
• Takes longer to converge

Single Sample
• Only a partial gradient is computed
• Noisier gradient; may concentrate more than necessary on isolated training examples (those could be noise)
• Converges faster

Mini-Batch
• Update the weights after seeing batchSize examples
• Faster convergence than the batch rule
• Less susceptible to noisy examples than the single sample rule
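A sketch of one epoch of a mini-batch version of the perceptron update in Matlab; the batchSize value and the per-epoch shuffling are illustrative choices, not specified on the slides:

batchSize = 2;
order = randperm(size(Z, 1));              % shuffle the examples once per epoch
for s = 1:batchSize:numel(order)
    idx = order(s : min(s+batchSize-1, numel(order)));
    Zb = Z(idx, :);  Yb = Y(idx);
    mis = (Yb .* (Zb * a)) < 0;            % misclassified within the mini-batch
    a = a + alpha * Zb(mis, :)' * Yb(mis); % perceptron update on the mini-batch
end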

Page 64:

Linear Classifier: Quadratic Loss

• Other loss functions are possible for our classifier f(zi,a) = sign(a^t zi)
• Quadratic per-example loss:
  Lp(f(zi,a), yi) = ½ (yi - a^t zi)^2
• Trying to fit the labels +1 and -1 to the function a^t z
• This is just standard line fitting (linear regression)
  • note that even correctly classified examples can have a large loss
• Can find the optimal weights a analytically with least squares
  • expensive for large problems
  • gradient descent is more efficient for larger problems
• Gradient: ∇Lp(a) = -Σi (yi - a^t zi) zi
• Batch update rule:
  a = a + α Σi (yi - a^t zi) zi
[Figure: labels +1 and -1 plotted against z, with the fitted line a^t z]
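As a hedged sketch (not from the slides), the closed-form least squares fit and one batch gradient step under this quadratic loss could be written in Matlab as:

% closed-form least squares fit of the labels Y to a'*z
a_ls = Z \ Y;                       % solves min_a ||Z*a - Y||^2
% one batch gradient descent step under the quadratic loss
a = a + alpha * Z' * (Y - Z * a);   % alpha * sum_i (y_i - a'*z_i) * z_i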

Page 65:

Linear Classifier: Quadratic Loss

• Quadratic loss is an inferior choice for classification
• Optimal classifier under quadratic loss:
  • smallest squared errors
  • but one sample misclassified
• Classifier found with the perceptron loss:
  • huge squared errors
  • but all samples classified correctly
[Figure: labels +1 and -1 plotted against z, with the two fitted lines]
• Idea: instead of trying to get a^t z close to y, use some differentiable function σ(a^t z) with a "squished" range, and try to get σ(a^t z) close to y

Page 66:

Linear Classifier: Logistic Regression

• Use the logistic sigmoid function σ(t) for "squishing" a^t z:
  σ(t) = 1 / (1 + exp(-t))
[Figure: σ(t) vs. t and σ(a^t z) vs. a^t z; both range from 0 to 1 and pass through 0.5 at 0]
• Denote the classes with 1 and 0 now
  • yi = 1 for the positive class, yi = 0 for the negative class
• Despite "regression" in the name, logistic regression is used for classification, not regression
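A one-line Matlab definition of this sigmoid, written so that it applies element-wise to a vector such as Z*a:

sigma = @(t) 1 ./ (1 + exp(-t));   % logistic sigmoid, works element-wise
% example: sigma(Z*a) gives sigma(a'*z_i) for every example at once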

Page 67:

Logistic Regression vs. Regression

[Figure: labels plotted against a^t z; the quadratic loss fits a straight line, while logistic regression fits the squashed curve σ(a^t z)]

Page 68:

Logistic Regression: Loss Function

• Could use (yi - σ(a^t zi))^2 as the per-example loss function
• Instead use a different loss:
  • if example z has label 1, we want σ(a^t z) close to 1; define the loss as -log[σ(a^t z)]
  • if example z has label 0, we want σ(a^t z) close to 0; define the loss as -log[1 - σ(a^t z)]
[Figure: log t and -log t on (0, 1]; -log t is large near 0 and equals 0 at t = 1]

Page 69:

Logistic Regression: Loss Function

• Per-example loss function:
  • if example z has label 1, the loss is -log[σ(a^t z)]
  • if example z has label 0, the loss is -log[1 - σ(a^t z)]
• Total loss is the sum over the per-example losses
• Convex, can be optimized exactly with gradient descent
• Gradient descent batch update rule:
  a = a + α Σi (yi - σ(a^t zi)) zi
• Logistic regression has an interesting probabilistic interpretation
  • P(class 1) = σ(a^t z)
  • P(class 0) = 1 - P(class 1)
  • therefore the loss function is -log P(y), the negative log-likelihood
    • a standard objective in statistics
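A minimal Matlab sketch of this batch update, assuming Z holds the augmented examples as rows, Y holds 0/1 labels, alpha is the learning rate, and sigma is the element-wise sigmoid defined earlier:

% one batch gradient step for logistic regression
p = sigma(Z * a);                 % predicted P(class 1) for every example
a = a + alpha * Z' * (Y - p);     % alpha * sum_i (y_i - sigma(a'*z_i)) * z_i
% negative log-likelihood, useful for monitoring:
L = -sum(Y .* log(p) + (1 - Y) .* log(1 - p));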

Page 70:

Logistic Regression vs. Perceptron

• Green example classified correctly, but close to the decision boundary
• Suppose w^t x = 0.8 for the green example
  • classified correctly, so no loss under the perceptron loss
  • loss of -log(σ(0.8)) = 0.37 under logistic regression
• Logistic regression (LR) encourages the decision boundary to move away from any training sample
  • may work better for new samples (better generalization)
[Figure: two separating boundaries; both have zero perceptron loss, but one has a smaller LR loss and the other a larger LR loss; the red classifier works better for new data]

Page 71:

Linear Classifier: Logistic Regression

• Batch logistic regression with learning rate α = 1
• Examples in Z, labels in Y (labels are now 1/0):
  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0]
• Initial weights a = [1, 1, 1]^t
• This is the line x1 + x2 + 1 = 0

Page 72:

Linear Classifier: Logistic Regression

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0],  a = [1, 1, 1]^t
• Logistic regression batch rule update with α = 1:
  a = a + Σi (yi - σ(a^t zi)) zi
• Can compute each (yi - σ(a^t zi)) zi with a for loop, and add them up
• For i = 1:
  (y1 - σ(a^t z1)) z1 = (1 - σ(6)) · [1; 2; 3] ≈ 0.0025 · [1; 2; 3] = [0.0025; 0.005; 0.0075]

Page 73:

Linear Classifier: Logistic Regression

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0],  a = [1, 1, 1]^t
• Logistic regression batch rule update with α = 1:
  a = a + Σi (yi - σ(a^t zi)) zi
• But we can also compute the update with a few lines in Matlab, no need for a loop
• First compute a^t zi for all examples:
  Z*a = [a^t z1; a^t z2; a^t z3; a^t z4; a^t z5] = [6; 8; 9; 5; 12]

Page 74:

Linear Classifier: Logistic Regression

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0],  a = [1, 1, 1]^t
• Batch update rule: a = a + Σi (yi - σ(a^t zi)) zi
• Apply the sigmoid to each row of Z*a:
  σ(Z*a) = σ([6; 8; 9; 5; 12]) = [0.9975; 0.9997; 0.9999; 0.9933; 1.0000]

Page 75:

Linear Classifier: Logistic Regression

• Assume you have the sigmoid function σ(t) implemented
  • it takes a scalar t as input and outputs σ(t)
• To apply the sigmoid to each element of a column vector in one line, use arrayfun(functionPtr, A) in Matlab:
  σ([6; 8; 9; 5; 12]) = [0.9975; 0.9997; 0.9999; 0.9933; 1.0000]

Page 76:

Linear Classifier: Logistic Regression

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0],  a = [1, 1, 1]^t
• Batch rule update: a = a + Σi (yi - σ(a^t zi)) zi
• Subtract from the labels Y:
  Y - σ(Z*a) = [1; 1; 1; 0; 0] - [0.9975; 0.9997; 0.9999; 0.9933; 1.0000]
             = [0.0025; 0.0003; 0.0001; -0.9933; -1.0000]

Page 77:

Linear Classifier: Logistic Regression

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0],  a = [1, 1, 1]^t
• Batch rule update: a = a + Σi (yi - σ(a^t zi)) zi
• Multiply each (yi - σ(a^t zi)) by the corresponding example zi
• Let v = Y - σ(Z*a) = [0.0025; 0.0003; 0.0001; -0.9933; -1.0000]
• Replicate v across the 3 columns with repmat(v, 1, 3):
  repmat(v, 1, 3) = [ 0.0025  0.0025  0.0025;
                      0.0003  0.0003  0.0003;
                      0.0001  0.0001  0.0001;
                     -0.99   -0.99   -0.99;
                     -1.00   -1.00   -1.00 ]
• Then multiply element-wise by Z: repmat(v, 1, 3) .* Z

Page 78:

Linear Classifier: Logistic Regression

  Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1; 1; 1; 0; 0],  a = [1, 1, 1]^t
• Batch rule update: a = a + Σi (yi - σ(a^t zi)) zi
• Multiply by the corresponding example, continued:
  A = repmat(v, 1, 3) .* Z
    = [ 0.0025  0.0049  0.0074;
        0.0003  0.0013  0.0010;
        0.0001  0.0004  0.0006;
       -0.99   -0.99   -2.98;
       -1.00   -5.00   -6.00 ]
  • row i of A is (yi - σ(a^t zi)) zi^t

Page 79:

Linear Classifier: Logistic Regression

• Batch rule update: a = a + Σi (yi - σ(a^t zi)) zi
• Add up all the rows of A = repmat(v, 1, 3) .* Z:
  sum(A, 1) = [-1.99, -5.99, -8.97]
• Transpose to get the needed update:
  Σi (yi - σ(a^t zi)) zi = [-1.99, -5.99, -8.97]^t

Page 80:

Linear Classifier: Logistic Regression

• Batch rule update: a = a + Σi (yi - σ(a^t zi)) zi
• Finally update:
  a = [1; 1; 1] + [-1.99; -5.99; -8.97] = [-0.99; -4.99; -7.97]
• This is the line -4.99x1 - 7.97x2 - 0.99 = 0
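The whole update from this example can be reproduced with a few Matlab lines; this sketch follows the slides' computation, with sigma as the element-wise sigmoid from earlier:

Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6];
Y = [1; 1; 1; 0; 0];
a = [1; 1; 1];
sigma = @(t) 1 ./ (1 + exp(-t));
v = Y - sigma(Z * a);    % [0.0025; 0.0003; 0.0001; -0.9933; -1.0000]
a = a + Z' * v;          % same result as a + sum(repmat(v,1,3).*Z, 1)'
% a is now approximately [-0.99; -4.99; -7.97]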

Page 81:

Logistic Regression vs. Regression vs. Perceptron

• Assuming the labels are +1 and -1
[Figure: per-example loss plotted as a function of y·a^t z for the quadratic loss, the perceptron loss, and the logistic regression loss; the horizontal axis is split into three regions: misclassified, classified correctly but close to the decision boundary, and classified correctly and not too close to the decision boundary]

Page 82:

More General Discriminant Functions

• Linear discriminant functions
  • simple decision boundary
  • should try simpler models first to avoid overfitting
  • optimal for certain types of data
    • Gaussian distributions with equal covariance
  • may not be optimal for other data distributions
• Discriminant functions can be more general than linear
  • for example, polynomial discriminant functions
  • decision boundaries more complex than linear
• Later we will look more at non-linear discriminant functions

Page 83:

Summary

• A linear classifier works well when the examples are linearly separable, or almost separable
• Two linear classifiers:
• Perceptron
  • finds a separating hyperplane in the linearly separable case
  • uses gradient descent for optimization
  • does not converge in the non-separable case
  • can force convergence by using a decreasing learning rate
• Logistic Regression
  • has a probabilistic interpretation
  • can be optimized exactly with gradient descent