04 logistic regression (Colorado State University, CS545 Fall 2018)
Several slides © AML, Creator: Malik Magdon-Ismail.

TRANSCRIPT

Page 1:

Linear models: Logistic regression

Chapter 3.3


Page 2:

Predicting probabilities

Objective: learn to predict a probability P(y | x) for a binary classification problem using a linear classifier.

The target function: for positive examples P(y = +1 | x) = 1, whereas for negative examples P(y = +1 | x) = 0.

The Target Function is Inherently Noisy

f(x) = P[y = +1 | x].

The data is generated from a noisy target function:

P(y | x) = f(x) for y = +1; 1 − f(x) for y = −1.

Page 3:

Predicting probabilities

Objective: learn to predict a probability P(y | x) for a binary classification problem using a linear classifier. Can we assume that P(y = +1 | x) is linear?

Page 4:

Logistic regression

The signal s = Σ_{i=0}^{d} w_i x_i = w^T x is the basis for several linear models:

h(x) = sign(s)   (linear classification)
h(x) = s         (linear regression)
h(x) = θ(s)      (logistic regression)

[Figure: three diagrams of the same linear unit, inputs x_0, x_1, x_2, …, x_d combined into the signal s and passed through sign, the identity, and θ to produce h(x).]

The logistic function (aka squashing function):

θ(s) = e^s / (1 + e^s)

[Figure: the S-shaped curve of θ(s), increasing from 0 to 1.]
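To make this concrete, here is a minimal Python sketch of the logistic function (my addition, not part of the slides); the two-branch form avoids overflow for large |s|:

```python
import numpy as np

def logistic(s):
    """Logistic (squashing) function theta(s) = e^s / (1 + e^s).

    Computed stably: for s >= 0 use 1 / (1 + e^{-s});
    for s < 0 use e^s / (1 + e^s), so exp never overflows.
    """
    s = np.atleast_1d(np.asarray(s, dtype=float))
    out = np.empty_like(s)
    pos = s >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))
    es = np.exp(s[~pos])
    out[~pos] = es / (1.0 + es)
    return out
```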

Page 5:

Properties of the logistic function

Predicting a Probability

Will someone have a heart attack over the next year?

age: 62 years; gender: male; blood sugar: 120 mg/dL; 40,000; HDL: 50; LDL: 120; mass: 190 lbs; height: 5′10″; …

Classification: Yes/No
Logistic Regression: likelihood of heart attack, y ∈ [0, 1]

h(x) = θ( Σ_{i=0}^{d} w_i x_i ) = θ(w^T x)

θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}).

θ(−s) = e^{−s} / (1 + e^{−s}) = 1 / (1 + e^s) = 1 − θ(s).
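A quick numerical check of these two identities (my addition), reusing the logistic helper sketched above:

```python
s = np.linspace(-10.0, 10.0, 5)
theta = logistic(s)

assert np.allclose(theta, 1.0 / (1.0 + np.exp(-s)))   # theta(s) = 1/(1+e^-s)
assert np.allclose(logistic(-s), 1.0 - theta)         # theta(-s) = 1 - theta(s)
```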

Page 6:

Predicting probabilities

What Makes an h Good?

'Fitting' the data means finding a good hypothesis h. h is good if:

h(x_n) ≈ 1 whenever y_n = +1;
h(x_n) ≈ 0 whenever y_n = −1.

A simple error measure that captures this:

E_in(h) = (1/N) Σ_{n=1}^{N} ( h(x_n) − ½(1 + y_n) )²

Not very convenient (hard to minimize).

The Probabilistic Interpretation

Suppose that h(x) = θ(w^T x) closely captures P[+1 | x]:

P(y | x) = θ(w^T x) for y = +1; 1 − θ(w^T x) for y = −1.

Page 7:

The Probabilistic Interpretation

So, if h(x) = θ(w^T x) closely captures P[+1 | x], then, since 1 − θ(s) = θ(−s):

P(y | x) = θ(w^T x) for y = +1; θ(−w^T x) for y = −1.

… or, more compactly:

P(y | x) = θ(y · w^T x)

Page 8:

Is logistic regression really linear?

P(y = +1 | x) = e^{w^T x} / ( e^{w^T x} + 1 )

P(y = −1 | x) = 1 − P(y = +1 | x) = 1 / ( e^{w^T x} + 1 )

To see what the decision boundary looks like, set P(y = +1 | x) = P(y = −1 | x). Solving for x we get e^{w^T x} = 1, i.e.

w^T x = 0,

so the decision boundary is a hyperplane: logistic regression is indeed a linear model.
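A tiny check (my addition, with an arbitrarily chosen w and x): a point on the hyperplane w^T x = 0 gets probability exactly 0.5.

```python
w = np.array([1.0, -2.0, 0.5])   # arbitrary weight vector
x = np.array([2.0, 1.5, 2.0])    # chosen so that w @ x == 0
print(w @ x)                     # 0.0: x lies on the decision boundary
print(logistic(w @ x))           # [0.5]: P(y = +1 | x) on the boundary
```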

Page 9:

Maximum likelihood

We will find w using the principle of maximum likelihood.

The Likelihood

P(y | x) = θ(y · w^T x)

Recall: (x_1, y_1), …, (x_N, y_N) are independently generated.

Likelihood: the probability of getting the y_1, …, y_N in D from the corresponding x_1, …, x_N:

P(y_1, …, y_N | x_1, …, x_N) = Π_{n=1}^{N} P(y_n | x_n),

valid since the examples are independently generated.

The likelihood measures the probability that the data were generated if f were h.
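As a sketch (my addition), the likelihood and its logarithm under this model; the log form is what one actually computes, since the raw product underflows for large N:

```python
def likelihood(w, X, y):
    # X: (N, d+1) inputs with x_0 = 1; y: (N,) labels in {-1, +1}.
    # Product of P(y_n | x_n) = theta(y_n * w^T x_n).
    return np.prod(logistic(y * (X @ w)))

def log_likelihood(w, X, y):
    # Sum of ln P(y_n | x_n); numerically safer than the product.
    return np.sum(np.log(logistic(y * (X @ w))))
```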

Page 10:

Maximizing the likelihood

Maximizing The Likelihood (why?)

max Π_{n=1}^{N} P(y_n | x_n)
⇔ max ln( Π_{n=1}^{N} P(y_n | x_n) )
≡ max Σ_{n=1}^{N} ln P(y_n | x_n)
⇔ min −(1/N) Σ_{n=1}^{N} ln P(y_n | x_n)
≡ min (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) )
≡ min (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n · w^T x_n) )   ← we specialize to our "model" here
≡ min (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} )

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} )
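This error translates directly into code. A minimal sketch (my addition), using np.logaddexp(0, −m) = ln(1 + e^{−m}) to avoid overflow:

```python
def Ein(w, X, y):
    # (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))
```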

Page 11:

Maximizing the likelihood

Summary: maximizing the likelihood is equivalent to minimizing

−(1/N) ln( Π_{n=1}^{N} θ(y_n · w^T x_n) ) = (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n · w^T x_n) )

and, since θ(s) = 1 / (1 + e^{−s}), this is

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} ),

where each term ln(1 + e^{−y_n · w^T x_n}) is the cross entropy error e(h(x_n), y_n).

Page 12:

Maximizing the likelihood

Summary: maximizing the likelihood is equivalent to minimizing the cross entropy error

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} ).

Exercise: check that this is equivalent to:

E_in(w) = (1/N) Σ_{n=1}^{N} [ I(y_n = +1) ln( 1 / h(x_n) ) + I(y_n = −1) ln( 1 / (1 − h(x_n)) ) ]
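A numerical version of the exercise (my addition): both forms of E_in agree on random data, since ln(1/h) = −ln h and h(x_n) plays the role of P(y = +1 | x_n).

```python
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = rng.choice([-1.0, 1.0], size=100)
w = rng.normal(size=3)

h = logistic(X @ w)   # h(x_n) = theta(w^T x_n)
Ein_indicator = np.mean(np.where(y == 1, -np.log(h), -np.log(1.0 - h)))
assert np.isclose(Ein(w, X, y), Ein_indicator)
```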

Page 13:

Digression: information theory

I am thinking of an integer between 0 and 1,023. You want to guess it using the fewest possible questions. Most of us would ask "is it between 0 and 511?" This is a good strategy because it provides the most information about the unknown number: it gives the first binary digit of the number. Initially you need to obtain log₂(1024) = 10 bits of information. After the first question you only need log₂(512) = 9 bits.

Page 14:

Information and Entropy

By halving the search space we obtained one bit. In general, the information associated with a probabilistic outcome of probability p is

I(p) = −log p.

Why the logarithm? Assume we have two independent events x and y. We would like the information they carry to be additive. Let's check:

I(x, y) = −log P(x, y) = −log P(x)P(y) = −log P(x) − log P(y) = I(x) + I(y)

Entropy:

H(P) = −Σ_x P(x) log P(x)
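A short sketch of these definitions (my addition), with logarithms in base 2 so information is measured in bits:

```python
def information(p):
    # Information (surprisal) of an outcome with probability p, in bits.
    return -np.log2(p)

def entropy(P):
    # Entropy of a discrete distribution P, with the convention 0 * log 0 = 0.
    P = np.asarray(P, dtype=float)
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

# Additivity for independent events: I(x, y) = I(x) + I(y)
px, py = 0.25, 0.5
assert np.isclose(information(px * py), information(px) + information(py))
```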

Page 15:

Entropy

For a Bernoulli random variable:

H(p) = −p log p − (1 − p) log(1 − p)

Maximal when p = ½.

Page 16:

KL divergence

The KL divergence between distributions P and Q:

D_KL(P ‖ Q) = −Σ_x P(x) log( Q(x) / P(x) )

Properties:
• Non-negative, equal to 0 iff P = Q
• It is not symmetric

Page 17:

KL divergence

The KL divergence between distributions P and Q:

D_KL(P ‖ Q) = −Σ_x P(x) log( Q(x) / P(x) )
            = −Σ_x P(x) log Q(x) + Σ_x P(x) log P(x)
            = cross entropy − entropy
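A sketch of this decomposition in code (my addition), reusing the entropy helper above:

```python
def cross_entropy(P, Q):
    # H(P, Q) = -sum_x P(x) log Q(x), in bits.
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    mask = P > 0
    return -np.sum(P[mask] * np.log2(Q[mask]))

def kl_divergence(P, Q):
    # D_KL(P || Q) = cross entropy minus entropy.
    return cross_entropy(P, Q) - entropy(P)

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.3, 0.3])
assert kl_divergence(P, Q) >= 0               # non-negative
assert np.isclose(kl_divergence(P, P), 0.0)   # zero iff P = Q
```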

Page 18:

Cross entropy and logistic regression

Cross entropy:

H(P, Q) = −Σ_x P(x) log Q(x)

And for binary variables:

H(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ)

The logistic regression cost function is the average cross entropy between the learned P(y | x) and the observed probabilities:

E_in(w) = (1/N) Σ_{n=1}^{N} [ I(y_n = +1) ln( 1 / h(x_n) ) + I(y_n = −1) ln( 1 / (1 − h(x_n)) ) ]

Page 19:

In-sample error

The in-sample error for logistic regression:

E_in(w) = −(1/N) ln( Π_{n=1}^{N} θ(y_n · w^T x_n) ) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} )

To minimize this cross entropy error with gradient descent, initialize w(0) at t = 0 and, for t = 0, 1, 2, …, repeat:

∇E = −(1/N) Σ_{n=1}^{N} y_n x_n / ( 1 + e^{y_n · w(t)^T x_n} )

w(t + 1) = w(t) − η ∇E
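The gradient formula maps directly onto NumPy. A sketch (my addition), using θ(−m) = 1 / (1 + e^m) for numerical stability:

```python
def Ein_gradient(w, X, y):
    # -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))
    margins = y * (X @ w)
    coeffs = y * logistic(-margins)   # 1/(1+e^m) = theta(-m)
    return -(X.T @ coeffs) / len(y)
```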

Page 20:

Digression: gradient ascent/descent

Topographical maps can give us intuition on how to optimize a cost function.

[Images: http://www.csus.edu/indiv/s/slaymaker/archives/geol10l/shield1.jpg and http://www.sir-ray.com/touro/IMG_0001_NEW.jpg]

Page 21:

Digression: gradient descent

Given a function E(w), the gradient is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient.

Notice that the gradient is perpendicular to contours of equal E(w).

[Images from http://en.wikipedia.org/wiki/Gradient_descent]

Page 22:

Gradient descent

Gradient descent is an iterative process. How do we pick the step size η?

Our E_in Has Only One Valley

[Figure: in-sample error E_in versus the weights w, a single valley.]

… because E_in(w) is a convex function of w. (So, who cares if it looks ugly!)

How to "Roll Down"?

Assume you are at weights w(t) and you take a step of size η in the direction v:

w(t + 1) = w(t) + ηv

We get to pick v. What's the best direction to take the step? Pick v to make E_in(w(t + 1)) as small as possible.

Page 23:

Gradient descent

The gradient gives the best direction to take to optimize E_in(w):

The Gradient is the Fastest Way to Roll Down

Approximating the change in E_in (Taylor's approximation):

ΔE_in = E_in(w(t + 1)) − E_in(w(t))
      = E_in(w(t) + ηv) − E_in(w(t))
      = η ∇E_in(w(t))^T v + O(η²)
      ≥ −η ‖∇E_in(w(t))‖   (to first order),

with equality attained at v = −∇E_in(w(t)) / ‖∇E_in(w(t))‖.

The best (steepest) direction to move is the negative gradient:

v = −∇E_in(w(t)) / ‖∇E_in(w(t))‖

Page 24:

Choosing the step size

The choice of the step size affects the rate of convergence:

The 'Goldilocks' Step Size

[Figure: three plots of E_in versus the weights w: η too small (η = 0.1; 75 steps), η too large (η = 2; 10 steps), and a variable η_t, just right (10 steps), using a large η far from the minimum and a small η near it.]

Let's use a variable learning rate:

w(t + 1) = w(t) + η_t v
η_t = η · ‖∇E_in(w(t))‖

When approaching the minimum: ‖∇E_in(w(t))‖ → 0.

Page 25:

Choosing the step size

With the variable learning rate η_t = η · ‖∇E_in(w(t))‖ and v = −∇E_in(w(t)) / ‖∇E_in(w(t))‖, the step becomes

η_t v = −η · ‖∇E_in(w(t))‖ · ∇E_in(w(t)) / ‖∇E_in(w(t))‖ = −η ∇E_in(w(t))

Page 26:

The final form of gradient descent

The variable learning rate above reduces to gradient descent with a fixed learning rate η:

w(t + 1) = w(t) − η ∇E_in(w(t))

Page 27:

Logistic regression using gradient descent

We will use gradient descent to minimize our error function. Fortunately, the logistic regression error function has a single global minimum:

Our E_in Has Only One Valley

[Figure: E_in versus the weights w, a single valley.]

… because E_in(w) is a convex function of w.

Finding The Best Weights - Hill Descent

A ball on a complicated hilly terrain rolls down to a local valley; this is called a local minimum. Questions: how do we get to the bottom of the deepest valley? How do we do this when we don't have gravity?

Since E_in(w) is convex, we don't need to worry about getting stuck in local minima.

Page 28:

Logistic regression using gradient descent

Putting it all together. This is called batch gradient descent:

Fixed Learning Rate Gradient Descent

1: Initialize at step t = 0 to w(0).
2: for t = 0, 1, 2, . . . do
3:   Compute the gradient g_t = ∇E_in(w(t)). ←− (Ex. 3.7 in LFD)
4:   Move in the direction v_t = −g_t.
5:   Update the weights: w(t + 1) = w(t) + η v_t.
6:   Iterate 'until it is time to stop'.
7: end for
8: Return the final weights.

Gradient descent can minimize any smooth function, for example

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} )   ← logistic regression
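Here is that loop as a compact Python sketch (my addition), assuming the Ein and Ein_gradient helpers above; the stopping rule (an iteration budget plus a small-gradient test) is my choice, not the slides':

```python
def fit_logistic_gd(X, y, eta=0.1, max_iters=10_000, tol=1e-6):
    # Batch gradient descent on the cross-entropy error E_in(w).
    w = np.zeros(X.shape[1])            # 1: initialize w(0)
    for t in range(max_iters):          # 2: for t = 0, 1, 2, ...
        g = Ein_gradient(w, X, y)       # 3: gradient g_t
        if np.linalg.norm(g) < tol:     # 6: 'time to stop'
            break
        w = w - eta * g                 # 4-5: w(t+1) = w(t) - eta * g_t
    return w                            # 8: final weights

w_hat = fit_logistic_gd(X, y)           # X, y from the earlier sketch
print(Ein(w_hat, X, y))
```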

Page 29:

Logistic regression

Comments:
• In practice logistic regression is solved by faster methods than gradient descent
• There is an extension to multi-class classification

Page 30:

Stochastic gradient descent

A variation on gradient descent that considers the error for a single training example. Tends to converge faster than the batch version.

Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point:

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n · w^T x_n} ) = (1/N) Σ_{n=1}^{N} e(w, x_n, y_n)

• Pick a random data point (x∗, y∗)
• Run an iteration of GD on e(w, x∗, y∗):

w(t + 1) ← w(t) − η ∇_w e(w, x∗, y∗)

1. The 'average' move is the same as GD;
2. Computation: a fraction 1/N cheaper per step;
3. Stochastic: helps escape local minima;
4. Simple;
5. Similar to PLA.

Logistic Regression:

w(t + 1) ← w(t) + η · y∗ x∗ / ( 1 + e^{y∗ · w^T x∗} )

(Recall PLA: w(t + 1) ← w(t) + y∗ x∗)
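An SGD sketch for logistic regression (my addition, same assumed helpers; the epoch count and seed are arbitrary choices):

```python
def fit_logistic_sgd(X, y, eta=0.1, epochs=100, seed=0):
    # SGD on the cross-entropy error: one random example per update.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs * N):
        n = rng.integers(N)              # pick a random data point
        xn, yn = X[n], y[n]
        # -grad e(w, xn, yn) = yn * xn / (1 + e^{yn * w^T xn})
        #                    = yn * xn * theta(-yn * w^T xn)
        w = w + eta * yn * xn * logistic(-yn * (w @ xn))
    return w
```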

Page 31:

Summary of linear models

Linear methods for classification and regression:

The Linear Signal

s = w^T x feeds all three models:

→ sign(w^T x) ∈ {−1, +1}   (classification)
→ w^T x ∈ ℝ                (regression)
→ θ(w^T x) ∈ [0, 1]        (logistic regression)

More to come!