Machine Learning (CSE 446): Practical issues: optimization and learning
Sham M Kakade
© 2018, University of Washington
[email protected]





Announcements

- Midterm summary:
  - stats: 71.5, std: 18
  - Office hours today: 1:15-2:30 (no office hours on Monday)
- Monday: John Thickstun guest lecture
- Grading:
  - HW3 posted
    - will be periodically updated for typos/clarifications
  - extra credit posted soon
- Today:
  - Midterm review
  - GD/SGD: practical issues


Midterm



What is a good model of this distribution?

“A mixture of Gaussians”




Midterm Q4: scratch space





Midterm Q5: scratch space





Today



The “general” Loss Minimization Problem

w∗ = argmin_w (1/N) ∑_{n=1}^N ℓ(x_n, y_n, w) + R(w),   where ℓ_n(w) := ℓ(x_n, y_n, w)

How do we run GD? SGD? Which one to use?

How do we run them?
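Running GD on this objective can be sketched in a few lines; the function names below and the toy gradient in the usage example are illustrative assumptions, not from the slides.

```python
import numpy as np

def gradient_descent(grad_fn, w0, step_size, num_steps):
    """Plain gradient descent: w^(k) = w^(k-1) - step_size * grad F(w^(k-1)).

    grad_fn should return the gradient of the full objective
    (1/N) sum_n l_n(w) + R(w); computing it touches all N examples
    on every step, which is exactly the cost SGD avoids.
    """
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(num_steps):
        w = w - step_size * grad_fn(w)
    return w
```

For example, on F(w) = ½(w − 3)², the gradient is w − 3 and GD drives w to 3.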


Our running example

argmin_w (1/N) ∑_{n=1}^N ½(y_n − w · x_n)² + (λ/2)‖w‖²

- GD? SGD?
- Note we are computing an average. What is a crude way to estimate an average?

Will it converge?
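For this running example the full gradient has a closed form, so GD is a sketch like the following (the data setup in the test is synthetic, and the step size / iteration count are illustrative choices):

```python
import numpy as np

def ridge_grad(X, y, w, lam):
    """Gradient of (1/N) sum_n 0.5*(y_n - w.x_n)^2 + (lam/2)*||w||^2."""
    N = len(y)
    residual = y - X @ w              # y_n - w . x_n, for all n at once
    return -(X.T @ residual) / N + lam * w

def ridge_gd(X, y, lam, step_size, num_steps):
    """Full-batch GD; every step averages over all N examples."""
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w -= step_size * ridge_grad(X, y, w, lam)
    return w
```

Since this objective is strongly convex, GD with a small enough constant step converges to the unique minimizer, which here can also be computed in closed form as (XᵀX/N + λI)⁻¹ Xᵀy/N and used as a sanity check.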


How does GD behave? A 1-dim example

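A minimal sketch of the 1-dim behavior, using f(w) = ½aw² as an assumed stand-in for the example drawn on the slide: the update is w ← (1 − ηa)w, so GD contracts toward 0 when η < 2/a and oscillates with growing magnitude when η > 2/a.

```python
def gd_1d(a, eta, w0, steps):
    """GD on f(w) = 0.5 * a * w**2, whose gradient is a*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * a * w       # equivalently w *= (1 - eta * a)
    return w
```

With a = 1: η = 0.5 halves w every step, while η = 2.5 multiplies it by −1.5 every step and diverges.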


GD: How do we set the step sizes?

- Theory:
  - square loss:
  - more generally:
- Practice:
  - square loss:
  - more generally:
- Do we decay the step size?
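One common theory-side recipe (an assumption here; the slide's blanks were filled in during lecture) bounds the step size by the smoothness constant of the objective. For the regularized square loss that constant is the largest eigenvalue of the Hessian:

```python
import numpy as np

def smoothness_step_size(X, lam=0.0):
    """For the regularized square loss, the Hessian is (1/N) X^T X + lam*I.
    Its largest eigenvalue L is the smoothness constant; constant steps
    below 2/L keep GD on this objective stable, and 1/L is a standard
    conservative choice."""
    N = X.shape[0]
    H = X.T @ X / N + lam * np.eye(X.shape[1])
    L = np.linalg.eigvalsh(H).max()
    return 1.0 / L
```

In practice one often starts near this value and shrinks it if the objective oscillates; for GD on a smooth convex objective no decay is needed for convergence.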


SGD for the square loss

Data: step sizes ⟨η^(1), ..., η^(K)⟩
Result: parameter w
initialize: w^(0) = 0
for k ∈ {1, ..., K} do
    n ∼ Uniform({1, ..., N})
    w^(k) = w^(k−1) + η^(k) (y_n − w^(k−1) · x_n) x_n
end
return w^(K)

Algorithm 1: SGD

- where did the N go?
- regularization?
- minibatching?
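Algorithm 1 transcribes directly into Python (the data in the test below is synthetic). The 1/N disappears because n is sampled uniformly, which makes the single-example gradient an unbiased estimate of the full average.

```python
import numpy as np

def sgd_square_loss(X, y, step_sizes, seed=0):
    """Algorithm 1: at step k, sample n uniformly and take a gradient
    step on the single-example loss 0.5 * (y_n - w . x_n)**2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for eta in step_sizes:
        n = rng.integers(N)                      # n ~ Uniform over examples
        w = w + eta * (y[n] - w @ X[n]) * X[n]   # single-example update
    return w
```

Regularization would subtract η^(k) λ w^(k−1) from each update; minibatching would average the update over a small random subset of examples instead of a single n.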



SGD: How do we set the step sizes?

- Theory:
- Practice:
  - How do we start it?
  - When do we decay it?


Stochastic Gradient Descent: Convergence

w∗ = argmin_w (1/N) ∑_{n=1}^N ℓ_n(w)

- w^(k): our parameter after k updates.
- Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There is a decreasing sequence of step sizes η^(k) so that our function value, F(w^(k)), converges to the minimal function value, F(w∗).
- GD vs SGD: we need to turn down our step sizes over time!
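A small sketch of the theorem's point, on assumed toy data with label noise: the minimizer no longer fits every example, so SGD's gradient noise never vanishes, and a decreasing schedule such as η^(k) = η₀/√k (one standard choice, used here as an illustration) is needed for the iterates to settle down near w∗.

```python
import numpy as np

def sgd(X, y, schedule, num_steps, seed=0):
    """SGD on the average square loss with step sizes eta^(k) = schedule(k)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for k in range(1, num_steps + 1):
        n = rng.integers(N)
        w = w + schedule(k) * (y[n] - w @ X[n]) * X[n]
    return w

# Noisy labels: residuals at the optimum are nonzero, so a constant step
# leaves the iterate bouncing in a noise ball; a decaying step shrinks it.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=300)
w_decay = sgd(X, y, lambda k: 0.1 / np.sqrt(k), 10000)
```

Comparing w_decay against the least-squares solution shows the decayed run landing close to w∗ despite the noise.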


Making features: scratch space
