
ECE 595: Machine Learning I
Lecture 05: Gradient Descent
Spring 2020

Stanley Chan
School of Electrical and Computer Engineering
Purdue University


Outline

Mathematical Background

- Lecture 4: Intro to Optimization
- Lecture 5: Gradient Descent

Gradient Descent

- Descent Direction
- Step Size
- Convergence

Stochastic Gradient Descent

- Difference between GD and SGD
- Why does SGD work?


Gradient Descent

The algorithm:

$$x^{(t+1)} = x^{(t)} - \alpha^{(t)} \nabla f(x^{(t)}), \qquad t = 0, 1, 2, \ldots,$$

where $\alpha^{(t)}$ is called the step size.
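As a minimal sketch (not from the slides), the update rule can be implemented in a few lines of Python; the objective, step size, and iteration count below are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_iters=100):
    """Iterate x_{t+1} = x_t - alpha * grad_f(x_t) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Toy example: f(x) = ||x||^2 has gradient 2x and minimizer 0.
x_star = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```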


Why is the direction $-\nabla f(x)$?

Recall (Lecture 4): If $x^*$ is optimal, then

$$\lim_{\epsilon \to 0} \frac{1}{\epsilon} \underbrace{\left[ f(x^* + \epsilon d) - f(x^*) \right]}_{\ge 0, \ \forall d} = \nabla f(x^*)^T d \implies \nabla f(x^*)^T d \ge 0, \quad \forall d.$$

But if $x^{(t)}$ is not optimal, then we want

$$f(x^{(t)} + \epsilon d) \le f(x^{(t)}).$$

So,

$$\lim_{\epsilon \to 0} \frac{1}{\epsilon} \underbrace{\left[ f(x^{(t)} + \epsilon d) - f(x^{(t)}) \right]}_{\le 0, \text{ for some } d} = \nabla f(x^{(t)})^T d \implies \nabla f(x^{(t)})^T d \le 0.$$


Descent Direction

Pictorial illustration:

- $\nabla f(x)$ is perpendicular to the contour.
- A search direction $d$ can be either on the positive side, $\nabla f(x)^T d \ge 0$, or on the negative side, $\nabla f(x)^T d < 0$.
- Only those on the negative side can reduce the cost.
- All such $d$'s are called descent directions.


The Steepest d

Previous slide: If $x^{(t)}$ is not optimal yet, then some $d$ will give

$$\nabla f(x^{(t)})^T d \le 0.$$

So, let us make $\nabla f(x^{(t)})^T d$ as negative as possible:

$$d^{(t)} = \underset{\|d\|_2 = \delta}{\operatorname{argmin}} \ \nabla f(x^{(t)})^T d.$$

We need $\delta$ to control the magnitude; otherwise $d$ is unbounded.

The solution is

$$d^{(t)} = -\nabla f(x^{(t)}).$$

Why? By Cauchy-Schwarz,

$$\nabla f(x^{(t)})^T d \ge -\|\nabla f(x^{(t)})\|_2 \|d\|_2.$$

The minimum is attained when $d = -\nabla f(x^{(t)})$, i.e., set $\delta = \|\nabla f(x^{(t)})\|_2$.


Steepest Descent Direction

Pictorial illustration:

- Put a ball around the current point.
- All $d$'s inside the ball are feasible.
- Pick the one that minimizes $\nabla f(x)^T d$.
- This direction must be parallel (but with opposite sign) to $\nabla f(x)$.


Step Size

The algorithm:

$$x^{(t+1)} = x^{(t)} - \alpha^{(t)} \nabla f(x^{(t)}), \qquad t = 0, 1, 2, \ldots,$$

where $\alpha^{(t)}$ is called the step size.

1. Fixed step size: $\alpha^{(t)} = \alpha$.

2. Exact line search (see the sketch after this list):

$$\alpha^{(t)} = \underset{\alpha}{\operatorname{argmin}} \ f\big(x^{(t)} + \alpha d^{(t)}\big).$$

E.g., if $f(x) = \frac{1}{2} x^T H x + c^T x$, then

$$\alpha^{(t)} = -\frac{\nabla f(x^{(t)})^T d^{(t)}}{d^{(t)T} H d^{(t)}}.$$

3. Inexact line search: Armijo/Wolfe conditions. See Nocedal-Wright Chapter 3.1.
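To make the exact line search concrete, here is a minimal sketch for the quadratic case above; the specific H, c, and starting point are assumptions.

```python
import numpy as np

def exact_line_search_step(H, c, x):
    """One step on f(x) = 0.5 x^T H x + c^T x using
    alpha = -grad^T d / (d^T H d) with d = -grad."""
    grad = H @ x + c
    d = -grad
    alpha = -(grad @ d) / (d @ H @ d)
    return x + alpha * d

# Assumed 2x2 positive definite example.
H = np.array([[3.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, -2.0])
x1 = exact_line_search_step(H, c, np.zeros(2))
```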


Convergence

Let $x^*$ be the global minimizer. Assume the following:

- $f$ is twice differentiable, so that $\nabla^2 f$ exists.
- $0 \prec \lambda_{\min} I \preceq \nabla^2 f(x) \preceq \lambda_{\max} I$ for all $x \in \mathbb{R}^n$.
- Gradient descent is run with exact line search.

Then (Nocedal-Wright Chapter 3, Theorem 3.3),

$$f(x^{(t+1)}) - f(x^*) \le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{2} \big(f(x^{(t)}) - f(x^*)\big) \le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{4} \big(f(x^{(t-1)}) - f(x^*)\big) \le \cdots \le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{2t} \big(f(x^{(1)}) - f(x^*)\big).$$

Thus, $f(x^{(t)}) \to f(x^*)$ as $t \to \infty$.
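As a numerical illustration (not from the slides) of this contraction, one can run exact line search on a simple quadratic with $f(x^*) = 0$ and compare the per-step error ratio against the bound; the matrix H and the starting point are assumptions.

```python
import numpy as np

H = np.diag([1.0, 10.0])          # lambda_min = 1, lambda_max = 10
bound = (1 - 1.0 / 10.0) ** 2     # (1 - lambda_min / lambda_max)^2

def f(x):
    return 0.5 * x @ H @ x        # minimizer x* = 0, f(x*) = 0

x = np.array([5.0, 1.0])
for _ in range(5):
    g = H @ x
    d = -g
    alpha = -(g @ d) / (d @ H @ d)          # exact line search
    x_new = x + alpha * d
    print(f(x_new) / f(x), "<=", bound)     # observed ratio vs. theoretical bound
    x = x_new
```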


Understanding Convergence

Gradient descent can be viewed as successive approximation. Approximate the function as

$$f(x^t + d) \approx f(x^t) + \nabla f(x^t)^T d + \frac{1}{2\alpha} \|d\|^2.$$

We can show that the $d$ that minimizes $f(x^t + d)$ is $d = -\alpha \nabla f(x^t)$ (see the derivation below). This suggests: use a quadratic function to locally approximate $f$. The iteration converges when the curvature parameter $\alpha$ of the approximation is not too big.
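The claim follows by minimizing the quadratic surrogate directly; a one-line derivation (standard calculus, consistent with the approximation above):

$$\nabla_d \left[ f(x^t) + \nabla f(x^t)^T d + \frac{1}{2\alpha} \|d\|^2 \right] = \nabla f(x^t) + \frac{1}{\alpha} d = 0 \quad \Longrightarrow \quad d = -\alpha \nabla f(x^t).$$

Substituting this $d$ into $x^{t+1} = x^t + d$ recovers exactly the gradient descent step with step size $\alpha$.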


Advice on Gradient Descent

Gradient descent is useful because:

- Simple to implement (compared to ADMM, FISTA, etc.)
- Low computational cost per iteration (no matrix inversion)
- Requires only the first-order derivative (no Hessian)
- The gradient is available in deep networks (via backpropagation)

Most machine learning packages have built-in (stochastic) gradient descent.

You are welcome to implement your own, but you need to be careful with:

- Convex non-differentiable problems, e.g., the $\ell_1$-norm
- Non-convex problems, e.g., ReLU in deep networks
- Traps by local minima
- Inappropriate step size, a.k.a. learning rate

Consider more "transparent" algorithms such as CVX when:

- Formulating problems (no need to worry about the algorithm)
- Trying to obtain insights


Stochastic Gradient Descent

Most loss functions in machine learning problems are separable:

$$J(\theta) = \frac{1}{N} \sum_{n=1}^{N} L(g_\theta(x_n), y_n) = \frac{1}{N} \sum_{n=1}^{N} J_n(\theta). \tag{1}$$

For example,

Square loss:

$$J(\theta) = \sum_{n=1}^{N} (g_\theta(x_n) - y_n)^2$$

Cross-entropy loss:

$$J(\theta) = -\sum_{n=1}^{N} \big\{ y_n \log g_\theta(x_n) + (1 - y_n) \log(1 - g_\theta(x_n)) \big\}$$

Logistic loss:

$$J(\theta) = \sum_{n=1}^{N} \log\big(1 + e^{-y_n \theta^T x_n}\big)$$
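As a minimal sketch of this separable structure (the toy data and dimensions are assumptions), here is the logistic loss written as per-sample terms $J_n$, matching Eq. (1):

```python
import numpy as np

def logistic_loss_per_sample(theta, X, y):
    """J_n(theta) = log(1 + exp(-y_n * theta^T x_n)), labels y_n in {-1, +1}."""
    return np.log1p(np.exp(-y * (X @ theta)))

def logistic_grad_per_sample(theta, X, y):
    """grad J_n(theta) = -y_n * x_n / (1 + exp(y_n * theta^T x_n)), one row per sample."""
    weights = -y / (1.0 + np.exp(y * (X @ theta)))
    return weights[:, None] * X

# Assumed toy data: N = 5 samples in d = 3 dimensions.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([1, -1, 1, 1, -1])
theta = np.zeros(3)
J = logistic_loss_per_sample(theta, X, y).mean()            # J(theta) as in Eq. (1)
grad_J = logistic_grad_per_sample(theta, X, y).mean(axis=0) # full gradient
```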


Full Gradient vs. Partial Gradient

Vanilla gradient descent:

$$\theta^{t+1} = \theta^t - \eta^t \underbrace{\nabla J(\theta^t)}_{\text{main computation}}. \tag{2}$$

The full gradient of the loss is

$$\nabla J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \nabla J_n(\theta). \tag{3}$$

Stochastic gradient descent:

$$\nabla J(\theta) \approx \frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta), \tag{4}$$

where $B \subseteq \{1, \ldots, N\}$ is a random subset, and $|B|$ is the batch size.


SGD Algorithm

Algorithm (Stochastic Gradient Descent)

1. Given $\{(x_n, y_n) \mid n = 1, \ldots, N\}$.
2. Initialize $\theta$ (zero or random).
3. For $t = 1, 2, 3, \ldots$: draw a random subset $B \subseteq \{1, \ldots, N\}$ and update

$$\theta^{t+1} = \theta^t - \eta^t \frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta^t). \tag{5}$$

If $|B| = 1$, then use only one sample at a time (a code sketch follows).
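A minimal Python sketch of this loop; grad_per_sample(theta, B) is an assumed helper returning one gradient row per sample in the batch (e.g., the hypothetical logistic-loss gradients above), and the batch size, step size, and iteration count are arbitrary.

```python
import numpy as np

def sgd(grad_per_sample, theta0, N, batch_size=32, eta=0.1, num_iters=1000, seed=0):
    """theta <- theta - eta * (1/|B|) sum_{n in B} grad J_n(theta), as in (5)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        B = rng.choice(N, size=batch_size, replace=False)  # random subset B
        g = grad_per_sample(theta, B).mean(axis=0)         # mini-batch gradient (4)
        theta = theta - eta * g
    return theta

# Wiring to the earlier logistic-loss sketch (X, y assumed defined there):
# theta = sgd(lambda th, B: logistic_grad_per_sample(th, X[B], y[B]),
#             theta0=np.zeros(3), N=len(y), batch_size=2, num_iters=200)
```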

The approximate gradient is unbiased (see the Appendix for a proof):

$$\mathbb{E}\left[ \frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta) \right] = \nabla J(\theta).$$
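The unbiasedness can also be sanity-checked numerically; a sketch (the per-sample gradients are stand-in random data, an assumption) that averages many mini-batch estimates and compares against the full gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
G = rng.normal(size=(N, d))     # row n stands in for grad J_n(theta)
full_grad = G.mean(axis=0)      # the full gradient, Eq. (3)

batch_size, trials = 10, 20000
total = np.zeros(d)
for _ in range(trials):
    B = rng.choice(N, size=batch_size, replace=False)
    total += G[B].mean(axis=0)  # one mini-batch estimate, Eq. (4)
print(total / trials, "approx equals", full_grad)
```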


Interpreting SGD

We just showed that the SGD step is unbiased:

$$\mathbb{E}\left[ \frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta) \right] = \nabla J(\theta).$$

An unbiased gradient implies that each update is

gradient + zero-mean noise.

Step size: SGD with a constant step size does not converge. If $\theta^*$ is a minimizer, then $\nabla J(\theta^*) = \frac{1}{N} \sum_{n=1}^{N} \nabla J_n(\theta^*) = 0$. But

$$\frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta^*) \ne 0, \quad \text{since } B \text{ is a subset.}$$

Typical strategy: start with a large step size and gradually decrease it, $\eta^t \to 0$; e.g., $\eta^t = t^{-a}$ for some constant $a$ (a sketch follows).
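A hedged variant of the earlier sgd sketch with this schedule; the exponent a = 0.6 is an arbitrary assumption.

```python
import numpy as np

def sgd_decay(grad_per_sample, theta0, N, batch_size=32, a=0.6, num_iters=1000, seed=0):
    """SGD with decaying step size eta_t = t^{-a}, so eta_t -> 0 as t grows."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, num_iters + 1):
        eta_t = t ** (-a)                                  # large early, small late
        B = rng.choice(N, size=batch_size, replace=False)
        theta = theta - eta_t * grad_per_sample(theta, B).mean(axis=0)
    return theta
```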


Perspectives of SGD

The classical optimization literature makes the following observations. Compared to GD on convex problems:

- SGD offers a trade-off between accuracy and efficiency
- More iterations are needed
- Fewer gradient evaluations per iteration
- Noise is a by-product

Recent studies of SGD for non-convex problems have found that:

- SGD works for training deep neural networks
- SGD finds a solution faster
- SGD finds better local minima
- Noise matters


[Figure: GD compared to SGD]


Smoothing the Landscape

Analyzing SGD is an active research topic. Here is one analysis by Kleinberg et al. (ICML 2018, https://arxiv.org/pdf/1802.06175.pdf).

The SGD step can be written as GD + noise:

$$x^{t+1} = x^t - \eta(\nabla f(x^t) + w^t) = \underbrace{x^t - \eta \nabla f(x^t)}_{\overset{\text{def}}{=}\, y^t} - \eta w^t.$$

$y^t$ is the "ideal" location returned by GD.

Let us analyze $y^{t+1}$:

$$y^{t+1} \overset{\text{def}}{=} x^{t+1} - \eta \nabla f(x^{t+1}) = (y^t - \eta w^t) - \eta \nabla f(y^t - \eta w^t).$$

Assume $\mathbb{E}[w] = 0$. Then

$$\mathbb{E}[y^{t+1}] = y^t - \eta \nabla \mathbb{E}[f(y^t - \eta w^t)].$$


Smoothing the Landscape

Let us look at $\mathbb{E}[f(y^t - \eta w^t)]$:

$$\mathbb{E}[f(y - \eta w)] = \int f(y - \eta w)\, p(w) \, dw,$$

where $p(w)$ is the distribution of $w$.

- $\int f(y - \eta w)\, p(w) \, dw$ is the convolution between $f$ and $p$.
- $p(w) \ge 0$ for all $w$, so the convolution always smoothes the function.

The learning rate controls the smoothness (a numerical sketch follows the list):

- Too small: under-smoothing. You have not yet escaped from a bad local minimum.
- Too large: over-smoothing. You may miss a local minimum.
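A numerical sketch of this smoothing effect (the wiggly test landscape, noise scale, and grid are assumptions): convolving a non-convex 1-D function with a Gaussian noise density flattens the small local minima, and a larger learning rate smoothes more.

```python
import numpy as np

xs = np.linspace(-4, 4, 801)
f = 0.5 * xs**2 + np.sin(8 * xs)       # broad bowl plus high-frequency ripples

def smoothed(f_vals, xs, eta, sigma=1.0):
    """E[f(y - eta*w)] for w ~ N(0, sigma^2): convolve f with the noise density."""
    dx = xs[1] - xs[0]
    w = np.arange(-3 * eta * sigma, 3 * eta * sigma + dx, dx)
    kernel = np.exp(-0.5 * (w / (eta * sigma)) ** 2)
    kernel /= kernel.sum()
    return np.convolve(f_vals, kernel, mode="same")

f_small = smoothed(f, xs, eta=0.05)    # under-smoothed: ripples survive
f_large = smoothed(f, xs, eta=0.50)    # over-smoothed: mostly the bowl remains
```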

[Figure: Smoothing the Landscape]

Reading List

Gradient Descent

- S. Boyd and L. Vandenberghe, "Convex Optimization", Chapters 9.2-9.4.
- J. Nocedal and S. Wright, "Numerical Optimization", Chapters 3.1-3.3.
- Y. Nesterov, "Introductory Lectures on Convex Optimization", Chapter 2.
- CMU 10.725 Lecture: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/grad-descent.pdf

Stochastic Gradient Descent

- CMU 10.725 Lecture: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
- Kleinberg et al. (2018), "When Does SGD Escape Local Minima", https://arxiv.org/pdf/1802.06175.pdf


Appendix


Proof of Unbiasedness of the SGD Gradient

Lemma

If $n$ is a random variable with uniform distribution over $\{1, \ldots, N\}$, then

$$\mathbb{E}\left[ \frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta) \right] = \nabla J(\theta).$$

Proof: Denote the probability mass function of $n$ by $p(n) = 1/N$. Then

$$\mathbb{E}\left[ \frac{1}{|B|} \sum_{n \in B} \nabla J_n(\theta) \right] = \frac{1}{|B|} \sum_{n \in B} \mathbb{E}\left[ \nabla J_n(\theta) \right] = \frac{1}{|B|} \sum_{n \in B} \left\{ \sum_{n=1}^{N} \nabla J_n(\theta)\, p(n) \right\} = \frac{1}{|B|} \sum_{n \in B} \left\{ \frac{1}{N} \sum_{n=1}^{N} \nabla J_n(\theta) \right\} = \frac{1}{|B|} \sum_{n \in B} \nabla J(\theta) = \nabla J(\theta).$$


Q&A 1: What is the momentum method?

The momentum method was originally proposed by Polyak (1964).

The momentum method says:

$$x^{t+1} = x^t - \alpha \left[ \beta g^{t-1} + (1 - \beta) g^t \right],$$

where $g^t = \nabla f(x^t)$ and $0 < \beta < 1$ is the damping constant.

- The momentum method can be applied to both gradient descent and stochastic gradient descent (see the sketch below).
- A variant is the Nesterov accelerated gradient (NAG) method (1983).
- The importance of NAG is elaborated by Sutskever et al. (2013).
- The key idea of NAG is to write $x^{t+1}$ as a linear combination of $x^t$ and the span of the past gradients.
- Yurii Nesterov proved that such a combination is the best one can do with first-order methods.
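A minimal sketch of the damped update above (the test function, alpha, beta, and iteration count are assumptions):

```python
import numpy as np

def momentum_descent(grad_f, x0, alpha=0.1, beta=0.9, num_iters=100):
    """x_{t+1} = x_t - alpha * [beta * g_{t-1} + (1 - beta) * g_t], g_t = grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    g_prev = grad_f(x)                       # initialize the past gradient
    for _ in range(num_iters):
        g = grad_f(x)
        x = x - alpha * (beta * g_prev + (1 - beta) * g)
        g_prev = g
    return x

# Toy example on f(x) = ||x||^2 with gradient 2x.
x_star = momentum_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```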


Here are some references on the momentum method:

- Sutskever et al. (2013), "On the importance of initialization and momentum in deep learning", http://proceedings.mlr.press/v28/sutskever13.pdf
- UIC Lecture Note: https://www2.cs.uic.edu/~zhangx/teaching/agm.pdf
- Cornell Lecture Note: http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture3.pdf
- Yurii Nesterov, "Introductory Lectures on Convex Optimization", 2003. (See Assumption 2.1.4 and the discussion thereafter.)
- G. Goh, "Why Momentum Really Works", https://distill.pub/2017/momentum/


Q&A 2: With exact line search, will we get to a minimum in one step?

No. Exact line search only allows you to converge faster; it does not guarantee convergence in one step. Here is an example: the Rosenbrock function.

[Figure: Rosenbrock function example]


Q&A 3: Any example of gradient descent?

Consider the loss function

$$J(\theta) = \|A\theta - y\|^2.$$

Then the gradient is

$$\nabla J(\theta) = 2A^T (A\theta - y).$$

So the gradient descent step is

$$\theta^{t+1} = \theta^t - \eta \underbrace{2A^T (A\theta^t - y)}_{\nabla J(\theta^t)}.$$

Since the loss is quadratic, you can find the exact line search step size in closed form. With $d = -\nabla J(\theta^t)$ and Hessian $\nabla^2 J = 2A^T A$, the formula from the Step Size slide gives

$$\eta = \frac{\|d\|^2}{2\, d^T A^T A d}.$$
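A minimal sketch of this example (A, y, and the iteration count are assumed toy values), combining the gradient above with the closed-form exact step size:

```python
import numpy as np

def gd_least_squares(A, y, num_iters=50):
    """Gradient descent on J(theta) = ||A theta - y||^2 with exact line search."""
    theta = np.zeros(A.shape[1])
    for _ in range(num_iters):
        d = -2 * A.T @ (A @ theta - y)            # d = -grad J(theta)
        eta = (d @ d) / (2 * d @ A.T @ (A @ d))   # exact step for this quadratic
        theta = theta + eta * d
    return theta

# Assumed toy problem; theta_hat approaches the least squares solution.
rng = np.random.default_rng(0)
A, y = rng.normal(size=(20, 3)), rng.normal(size=20)
theta_hat = gd_least_squares(A, y)
```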


Q&A 4: In finding the steepest direction, why is δ unimportant?

The constraint $\|d\| = \delta$ is necessary for minimizing $\nabla f(x)^T d$. Without the constraint, the problem is unbounded below: the "solution" is $-\infty$ times any direction $d$ that lies in the negative half-space of $\nabla f(x)$.

For any $\delta$, the solution (by the Cauchy-Schwarz inequality) is

$$d = -\delta \frac{\nabla f(x)}{\|\nabla f(x)\|}.$$

You can verify that this $d$ minimizes $\nabla f(x)^T d$ and satisfies $\|d\| = \delta$.

Now, if we use this $d$ in the gradient descent step, the step size $\alpha$ will compensate for $\delta$. So we can just choose $\delta = 1$ and the derivation above still works.


Q&A 5: What is a good batch size for SGD?

There is no definitive answer. Generally you need to look at the validation curve to decide whether to increase or decrease the mini-batch size. Here are some suggestions from the literature.

Bengio (2012), https://arxiv.org/pdf/1206.5533.pdf:

"[The batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products."

Masters and Luschi (2018), https://arxiv.org/abs/1804.07612:

"The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4."