
Page 1: Machine Learning: The Optimization Perspective (kt.era.ee/lectures/aacimp2012/slides2.pdf)

Machine Learning:

The Optimization Perspective

Konstantin Tretyakov http://kt.era.ee

AACIMP 2012

August 2012

Page 2:

AACIMP Summer School.

August, 2012

Page 3:

So far…

Machine learning is important and interesting

The general concept: fitting models to data

Page 4:

So far…

Machine learning is important and interesting

The general concept: searching for the best model fitting the data

Page 5:

So far…

Machine learning is important and interesting

The general concept: searching for the best model fitting the data

Optimization

Probability Theory

Page 6:

Today

1. Optimization is important
2. Optimization is possible

Page 7:

Today

1. Optimization is important
2. Optimization is possible*

* Basic techniques: constrained / unconstrained, analytic / iterative, continuous / discrete

Page 8:

Special cases of optimization

Machine learning

Page 9:

Special cases of optimization

Machine learning
Algorithms and data structures
General problem-solving
Management and decision-making

Page 10:

Special cases of optimization

Machine learning
Algorithms and data structures
General problem-solving
Management and decision-making
Evolution
The Meaning of Life?

Page 11:

Optimization task

Given a function f, find the argument x resulting in the optimal value.

Page 12:

Constrained optimization task

Given a function f, find the argument x resulting in the optimal value, subject to constraints.

Page 13:

Optimization methods

In principle, x can be anything:

Discrete
  Value (e.g. a name)
  Structure (e.g. a graph, plaintext)
  Finite / infinite
Continuous*
  Real number, vector, matrix, …
  Complex number, function, …

Page 14:

Optimization methods

In principle, f can be anything:

Random oracle
Structured
Continuous
Differentiable
Convex

Page 15:

Optimization methods

The methods split along two axes:

  Knowledge about f: not much / a lot
  Type of x: discrete / continuous

Page 16:

Optimization methods

              Knowledge about f
Type of x     Not much                      A lot
Discrete      Combinatorial search:         Algorithmic
              brute-force, stepwise, MCMC,
              population-based, …
Continuous    Numeric methods:              Analytic
              gradient-based, Newton-like,
              MCMC, population-based, …

Page 17:

Optimization methods

Finding a weight-vector w that minimizes the model error

Page 18:

Optimization methods

Finding a weight-vector w that minimizes the model error, in a fairly general case

Page 19:

Optimization methods

Finding a weight-vector w that minimizes the model error, in a very general case

Page 20:

Optimization methods

Finding a weight-vector w that minimizes the model error, in many practical cases

Page 21:

Optimization methods

This lecture

Page 22:

Minima and maxima

Page 23:

Differentiability

f(x) ≈ b + c·Δx

Page 24:

Differentiability

f(x₁, x₂) ≈ b + c₁·Δx₁ + c₂·Δx₂

Page 25:

Differentiability

Page 26:

The Most Important Observation

Page 28:

The Most Important Observation

This small observation gives us everything we need for now:

  A nice interpretation of the gradient
  An extremality criterion
  An iterative algorithm for function minimization

Page 29:

Interpretation of the gradient

Page 31:

Extremality criterion

Page 32:

Iterative algorithm

1. Pick a random point x₀
2. If ∇f(x₀) = 0, then we’ve found an extremum.
3. Otherwise, …

Page 33:

Iterative algorithm

1. Pick a random point x₀
2. If ∇f(x₀) = 0, then we’ve found an extremum.
3. Otherwise, make a small step downhill:
   x₁ ← x₀ − μ₀∇f(x₀)

Page 34:

Iterative algorithm

1. Pick a random point x₀
2. If ∇f(x₀) = 0, then we’ve found an extremum.
3. Otherwise, make a small step downhill:
   x₁ ← x₀ − μ₀∇f(x₀)
4. … and then another step:
   x₂ ← x₁ − μ₁∇f(x₁)
5. … and so on, until …

Page 35:

Gradient descent

1. Pick a random point x₀
2. If ∇f(x₀) = 0, then we’ve found an extremum.
3. Otherwise, make a small step downhill:
   x₁ ← x₀ − μ₀∇f(x₀)
4. … and then another step:
   x₂ ← x₁ − μ₁∇f(x₁)
5. … and so on, until ∇f(xₙ) ≈ 0 or we’re tired.

With a smart choice of μᵢ we’ll converge to a minimum.
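The five steps above can be sketched in a few lines of Python. The quadratic objective f(x) = (x − 3)², its gradient, and the fixed step size are assumptions for illustration only:

```python
# Gradient descent sketch for the iterative algorithm above.
# Assumed toy objective: f(x) = (x - 3)**2, so grad f(x) = 2*(x - 3).
def gradient_descent(grad, x0, mu=0.1, tol=1e-8, max_steps=10000):
    x = x0
    for _ in range(max_steps):
        g = grad(x)
        if abs(g) < tol:      # grad f(x) ~ 0: an (approximate) extremum
            break
        x = x - mu * g        # a small step downhill
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # converges near 3
```

With this μ the iteration contracts toward the minimizer; for a badly chosen step size it can oscillate or diverge, which is why the slides stress a smart choice of μᵢ.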

Page 36:

Gradient descent

x₁ ← x₀ − μ₀∇f(x₀)
x₂ ← x₁ − μ₁∇f(x₁)

Page 37:

Gradient descent

xᵢ₊₁ ← xᵢ − μᵢ∇f(xᵢ)

Page 38:

Gradient descent

Δxᵢ = −μᵢ∇f(xᵢ)

Page 39:

Gradient descent

Δxᵢ = −μc

Page 40:

Gradient descent (fixed step)

Δxᵢ = −μ∇f(xᵢ)

Page 42:

Example

Page 43:

Example

x₁, x₂, …, x₅₀

Page 44:

Example

f(w) = Σᵢ (xᵢ − w)²

Page 45:

Example

∇f(w) = −Σᵢ 2(xᵢ − w)

Page 46:

Example

∇f(w) = 2n(w − x̄)

Page 47:

Example

∇f(w) = 0

Page 48:

Example

∇f(w)

Page 49:

Example

Δw = −μ∇f(w)

Page 50:

Stochastic gradient descent

Whenever

f(w) = Σₖ g(w, xₖ)

Page 51:

Stochastic gradient descent

Whenever

f(w) = Σₖ g(w, xₖ)

e.g.

Error(w) = Σₖ (model_w(xₖ) − yₖ)²

Page 52:

Stochastic gradient descent

Whenever

f(w) = Σₖ g(w, xₖ)

the gradient is also a sum:

∇f(w) = Σₖ ∇g(w, xₖ)

Page 53:

Stochastic gradient descent

Whenever

f(w) = Σₖ g(w, xₖ)

the gradient is also a sum:

∇f(w) = Σₖ ∇g(w, xₖ)

The GD step is then also a sum:

Δwᵢ = −μ Σₖ ∇g(wᵢ, xₖ)

Page 55:

Stochastic gradient descent

Batch update:

Δwᵢ = −μ Σₖ ∇g(wᵢ, xₖ)

Page 56:

Stochastic gradient descent

Batch update:

Δwᵢ = −μ Σₖ ∇g(wᵢ, xₖ)

On-line update:

Δwᵢ = −μ ∇g(wᵢ, x_random)
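The batch and on-line updates can be sketched side by side. The concrete choice g(w, x) = (x − w)², with ∇g(w, x) = −2(x − w), and the toy dataset are assumptions for illustration:

```python
import random

# Assumed example term: g(w, x) = (x - w)**2, so grad g(w, x) = -2*(x - w).
def grad_g(w, x):
    return -2.0 * (x - w)

def batch_step(w, xs, mu):
    # Batch update: sum the gradient over the whole dataset.
    return w - mu * sum(grad_g(w, x) for x in xs)

def online_step(w, xs, mu):
    # On-line update: gradient at a single randomly chosen point.
    return w - mu * grad_g(w, random.choice(xs))

xs = [1.0, 2.0, 3.0]
w = 0.0
for _ in range(500):
    w = online_step(w, xs, mu=0.05)   # w ends up hovering near the mean of xs
```

With a fixed μ the on-line iterate never settles exactly: it keeps fluctuating around the minimizer, which is why decreasing step sizes are used when exact convergence matters.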

Page 57:

Example

Δw = μ(xᵢ − w)
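The rule Δw = μ(xᵢ − w) pulls w a little toward each observed point, so repeated on-line updates drive w toward the mean of the data. A sketch with assumed synthetic data standing in for the 50 points:

```python
import random

random.seed(0)
# Assumed stand-in for the dataset x_1, ..., x_50: samples around 5.0.
points = [random.gauss(5.0, 1.0) for _ in range(50)]

w = 0.0
mu = 0.1
for _ in range(20):          # several passes over the data
    for x in points:
        w += mu * (x - w)    # the on-line rule: Delta w = mu * (x_i - w)
```

After the passes, w sits close to the sample mean (about 5 here), up to the fluctuation caused by the fixed step size.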

Page 58:

Summary

An interpretation of the gradient
An extremality criterion
An iterative algorithm for function minimization
A stochastic iterative algorithm for function minimization

Page 59:

Quiz

The symbol Δ is called ______ and denotes _____

The symbol ∇ is called ______ and denotes _____

Page 60:

Quiz

Gradient descent:

Δwᵢ = _________

Page 61:

Quiz

Gradient descent:

Δwᵢ = −μ∇_w f(w)

Page 62:

Quiz

Gradient descent:

Δwᵢ = −μc

Page 63:

Quiz

Gradient descent:

Δwᵢ = −μc

Suppose the batch gradient descent step is

Δw = Σᵢ (wᵀxᵢ + xᵢ²) eᵢ

the corresponding stochastic gradient descent step is:

Δw = _______________

Page 64:

Quiz

Can bacteria learn?

Page 65:

Questions?

Page 66:

Linear Regression

Konstantin Tretyakov http://kt.era.ee

AACIMP 2012

August 2012

Page 67:

Supervised Learning

Let X and Y be some sets.

Let there be a training dataset

D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)},  xᵢ ∈ X, yᵢ ∈ Y

Page 68:

Supervised Learning

Let X and Y be some sets.

Let there be a training dataset

D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)},  xᵢ ∈ X, yᵢ ∈ Y

Supervised learning: find a function f: X → Y generalizing the dependency present in the data.

Page 69:

Classification

X = ℝ², Y = {blue, red}

D = {((1.3, 0.8), red), ((2.5, 2.3), blue), …}

f(x₁, x₂) = if x₁ + x₂ > 3 then blue else red

Page 70:

Regression

X = ℝ, Y = ℝ

D = {(0.5, 0.26), (0.43, 0.08), …}

f(x) = x²

Page 71:

Linear Regression

X = ℝ, Y = ℝ

D = {(0.5, 0.26), (0.43, 0.08), …}

f(x) = −0.14 + 1.01x

Page 72:

Linear Regression

X = ℝᵐ, Y = ℝ

f(x₁, …, xₘ) = w₀ + w₁x₁ + ⋯ + wₘxₘ

Page 73:

Linear Regression

X = ℝᵐ, Y = ℝ

f(x₁, …, xₘ) = w₀ + w₁x₁ + ⋯ + wₘxₘ

f(x₁, …, xₘ) = w₀ + ⟨w, x⟩   (inner product)

Page 74:

Linear Regression

X = ℝᵐ, Y = ℝ

f(x₁, …, xₘ) = w₀ + w₁x₁ + ⋯ + wₘxₘ

f(x₁, …, xₘ) = w₀ + ⟨w, x⟩

f(x₁, …, xₘ) = w₀ + (w₁, …, wₘ)(x₁, x₂, …, xₘ)ᵀ

Page 75:

Linear Regression

X = ℝᵐ, Y = ℝ

f(x₁, …, xₘ) = w₀ + w₁x₁ + ⋯ + wₘxₘ

f(x₁, …, xₘ) = w₀ + ⟨w, x⟩

f(x₁, …, xₘ) = w₀ + (w₁, …, wₘ)(x₁, x₂, …, xₘ)ᵀ

f(x₁, …, xₘ) = w₀ + wᵀx

Page 76:

Linear Regression

f(x) = w₀ + wᵀx

Page 77:

Linear Regression

f(x) = w₀ + wᵀx   (w₀ is the bias term)

Page 78:

Linear Regression

f(x) = w₀ + wᵀx

f(x₁, …, xₘ) = (w₀, w₁, …, wₘ)(1, x₁, …, xₘ)ᵀ

Page 79:

Linear Regression

f(x) = w₀ + wᵀx

f(x₁, …, xₘ) = (w₀, w₁, …, wₘ)(1, x₁, …, xₘ)ᵀ

f(x) = wᵀx   (bias absorbed into the weight vector)
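Absorbing the bias works by prepending a constant 1 to the input and w₀ to the weights, turning f(x) = w₀ + wᵀx into a single inner product. The toy numbers below, taken to match the earlier fit f(x) = −0.14 + 1.01x, are for illustration only:

```python
import numpy as np

w0 = -0.14
w = np.array([1.01])                 # weights (assumed toy values)
x = np.array([0.5])                  # input

w_ext = np.concatenate(([w0], w))    # (w0, w1, ..., wm)
x_ext = np.concatenate(([1.0], x))   # (1,  x1, ..., xm)

f_plain = w0 + w @ x                 # w0 + w^T x
f_ext = w_ext @ x_ext                # the same value as one inner product
```

The trick matters because every formula from here on (and the closed-form solution later) can then treat the bias as just another weight.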

Page 80:

Linear Regression

f(x) = wᵀx

Page 81:

Linear Regression

f(x) = wx

Page 82:

Objective Function

eᵢ = (f(xᵢ) − yᵢ)²

Page 83:

Objective Function

eᵢ = (ŷᵢ − yᵢ)²

Page 84:

Objective Function

eᵢ = (wxᵢ − yᵢ)²

Page 85:

Objective Function

E_D(w) = Σᵢ (wxᵢ − yᵢ)²   (“OLS”, ordinary least squares)

Page 87:

Objective Function

E_D(w) = Σᵢ (wᵀxᵢ − yᵢ)²

Page 88:

Objective Function

E_D(w) = Σᵢ (xᵢᵀw − yᵢ)²

Stack the rows xᵢᵀ into a matrix X and the targets yᵢ into a vector y; the terms xᵢᵀw then form the vector Xw.

Page 90:

Objective Function

E_D(w) = Σᵢ (xᵢᵀw − yᵢ)²

Xw − y

Page 91:

Objective Function

E_D(w) = Σᵢ (xᵢᵀw − yᵢ)² = ‖Xw − y‖²

Page 92:

Objective Function

E_D(w) = Σᵢ (xᵢᵀw − yᵢ)² = ‖Xw − y‖²

Xw ≈ y

Page 93:

Objective Function

E_D(w) = Σᵢ (xᵢᵀw − yᵢ)² = ‖Xw − y‖²

Xw ≈ y

w ≈ X⁻¹y ?

Page 94:

Objective Function

E_D(w) = Σᵢ (xᵢᵀw − yᵢ)² = ‖Xw − y‖²

Xw ≈ y

w = X⁺y !   (X⁺ is the pseudoinverse)
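The pseudoinverse solution w = X⁺y can be tried directly in NumPy; the tall matrix X has no ordinary inverse, yet X⁺ still gives the least-squares w. The toy data below is assumed for illustration:

```python
import numpy as np

# Assumed toy data; the first column of ones plays the role of the bias feature.
X = np.array([[1.0, 0.5],
              [1.0, 0.43],
              [1.0, 0.9]])
y = np.array([0.26, 0.08, 0.7])

w = np.linalg.pinv(X) @ y    # least-squares w even though X is not square
residual = X @ w - y         # generally nonzero: Xw is only approximately y
```

For a full-column-rank X this matches the normal-equations solution derived on the following slides.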

Page 95:

Optimization

argmin_w ‖Xw − y‖²

Page 96:

Optimization

argmin_w ½‖Xw − y‖²

Page 97:

Optimization

argmin_w ½‖Xw − y‖²

½(Xw − y)ᵀ(Xw − y)          [‖a‖² = aᵀa]

Page 98:

Optimization

argmin_w ½‖Xw − y‖²

½(Xw − y)ᵀ(Xw − y)          [‖a‖² = aᵀa]

½(wᵀXᵀ − yᵀ)(Xw − y)        [(a + b)ᵀ = aᵀ + bᵀ]

Page 99:

Optimization

½(wᵀXᵀ − yᵀ)(Xw − y)

½(wᵀXᵀXw − yᵀXw − wᵀXᵀy + yᵀy)   [a(b + c) = ab + ac]

Page 100: Machine Learning: The Optimization Perspective - kt.era.eekt.era.ee/lectures/aacimp2012/slides2.pdf · Optimization methods In principle, x can be anything: Discrete Value (e.g. a

Optimization

$\tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T - \mathbf{y}^T)(\mathbf{X}\mathbf{w} - \mathbf{y})$

$\tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{y}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{y}^T\mathbf{y})$

$\tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y})$

$\mathbf{y}^T\mathbf{X}\mathbf{w} = \mathbf{w}^T\mathbf{X}^T\mathbf{y}$ (a scalar)

Page 101

Optimization

$E(\mathbf{w}) = \tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y})$

Page 102

Optimization

$E(\mathbf{w}) = \tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y})$

$\nabla E(\mathbf{w}) = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}$
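The gradient formula $\nabla E(\mathbf{w}) = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}$ can be sanity-checked against finite differences; the data and step size below are illustrative choices, not from the slides:

```python
import numpy as np

# Toy data (illustrative only)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.3, -0.2])

def E(w):
    # E(w) = 1/2 ||Xw - y||^2
    r = X @ w - y
    return 0.5 * r @ r

# Analytic gradient from the derivation above
grad = X.T @ X @ w - X.T @ y

# Central finite differences, one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (E(w + eps * e) - E(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])
```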

$\nabla(f + g) = \nabla f + \nabla g$

$\nabla(\mathbf{w}^T\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}$ (for symmetric $\mathbf{A}$)

$\nabla(\mathbf{a}^T\mathbf{w}) = \mathbf{a}$

Page 103

Optimization

$E(\mathbf{w}) = \tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y})$

$\nabla E(\mathbf{w}) = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}$

$\mathbf{0} = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}$

$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$

Page 104

Optimization

$E(\mathbf{w}) = \tfrac{1}{2}(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y})$

$\nabla E(\mathbf{w}) = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}$

$\mathbf{0} = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}$

$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$

$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Page 105

Linear Regression Solution

$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Page 106

Linear Regression Solution

$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

from numpy import matrix

X = matrix(X)
y = matrix(y)
w = (X.T * X).I * X.T * y

Page 107

Linear Regression Solution

$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

from numpy import matrix

X = matrix(X)
y = matrix(y)
w = (X.T * X).I * X.T * y


Moore–Penrose pseudoinverse

Page 108

Linear Regression Solution

$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \quad\Leftrightarrow\quad \mathbf{w} = \mathbf{X}^{+}\mathbf{y}$

from numpy import matrix
from numpy.linalg import pinv

X = matrix(X)
y = matrix(y)
w = (X.T * X).I * X.T * y
w = pinv(X) * y
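A quick numerical check, on made-up data, that the pseudoinverse solution coincides with the normal-equations solution when $\mathbf{X}$ has full column rank:

```python
import numpy as np

# Toy overdetermined system (made-up data): more rows than columns,
# so X^T X is invertible and both formulas should coincide
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

w_normal = np.linalg.inv(X.T @ X) @ X.T @ y   # (X^T X)^{-1} X^T y
w_pinv = np.linalg.pinv(X) @ y                # X^+ y
```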

Moore–Penrose pseudoinverse

Page 109

Linear Regression & SKLearn

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
w = (model.intercept_, model.coef_)
model.predict(X_new)

Page 110

Stochastic Gradient Regression

$\Delta\mathbf{w} = -\mu(\mathbf{w}^T\mathbf{x}_i - y_i)\mathbf{x}_i$

Page 111

Stochastic Gradient Regression

$\Delta\mathbf{w} = -\mu e_i \mathbf{x}_i$
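The update rule can also be implemented from scratch. The sketch below uses made-up data generated from known weights; the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

# Made-up noiseless data from known weights w_true
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true

w = np.zeros(2)
mu = 0.01  # illustrative learning rate
for epoch in range(50):
    for i in rng.permutation(len(X)):
        e_i = X[i] @ w - y[i]    # prediction error on sample i
        w -= mu * e_i * X[i]     # dw = -mu * e_i * x_i
```

On this noiseless problem the iterates converge to `w_true`.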

Page 112

Stochastic Gradient Regression

$\Delta\mathbf{w} = -\mu e_i \mathbf{x}_i$

from sklearn.linear_model import SGDRegressor

model = SGDRegressor(alpha=0, n_iter=30)

model.fit(X, y)

Page 113

Polynomial Regression

Say we’d like to fit a model:

$f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2^2 + w_3 x_1 x_2$

Page 114

Polynomial Regression

Say we’d like to fit a model:

$f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2^2 + w_3 x_1 x_2$

Simply transform the features and proceed as normal:

$(x_1, x_2) \to (x_1, x_2^2, x_1 x_2)$

Page 115

Single-variable Polynomial OLS

from numpy import matrix, hstack
from numpy.linalg import pinv

# n x 1 matrix
x = matrix(…)

# Add bias & square features
X = hstack([x**0, x**1, x**2])

# Solve for w
w = pinv(X) * y
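A self-contained runnable variant of the same sketch, using NumPy arrays and made-up sample points drawn from a known quadratic (the data is illustrative only):

```python
import numpy as np

# Made-up sample points from the known quadratic y = 1 + 2x + 3x^2
x = np.linspace(-1, 1, 20).reshape(-1, 1)   # n x 1 column
y = 1 + 2 * x + 3 * x ** 2

# Add bias & square features: columns [1, x, x^2]
X = np.hstack([x ** 0, x ** 1, x ** 2])

# Solve for w via the pseudoinverse
w = np.linalg.pinv(X) @ y
```

Since the data lies exactly on the quadratic, `w` recovers the generating coefficients.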

Page 116

Overfitting

Page 117

Regularization

$E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2$

Page 118

Regularization

$E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2$

ℓ2-penalty, ℓ2-loss

Page 119

Regularization

$E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_1$

ℓ1-penalty, ℓ2-loss

Page 120

Regularization

$E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_0$

ℓ0-penalty, ℓ2-loss

Page 121

Regularization

$E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_0$

ℓ0-penalty, ℓ2-loss

>>> SGDRegressor?

Parameters

----------

loss : str, 'squared_loss' or 'huber' ...

...

penalty : str, 'l2' or 'l1' or 'elasticnet'

...

Page 122

Regularization

$E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2$

ℓ2-penalty, ℓ2-loss

Page 123

Ridge regression

$\operatorname{argmin}_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2$

$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$

$\mathbf{I} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$
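The closed form can be sketched in a few lines of NumPy; the data and the value of $\lambda$ below are made up for illustration:

```python
import numpy as np

# Made-up data; lam is an arbitrary illustrative value
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
lam = 0.5

d = X.shape[1]
# Solve (X^T X + lambda*I) w = X^T y instead of inverting explicitly
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# With lam = 0 the formula reduces to ordinary least squares
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

As expected, the penalty shrinks the solution: `w_ridge` has a smaller norm than `w_ols`.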

Page 124

Ridge regression

$\operatorname{argmin}_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_*^2$

$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_*)^{-1}\mathbf{X}^T\mathbf{y}$

$\mathbf{I}_* = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$

The bias term $w_0$ is usually not penalized.

Page 125

Exercise

Derive an SGD algorithm for Ridge Regression.
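One possible answer sketch. Conventions for scaling the penalty vary; here the penalty $\lambda\|\mathbf{w}\|^2$ is spread evenly over the $n$ samples, giving the per-sample update $\Delta\mathbf{w} = -\mu\,(e_i \mathbf{x}_i + \tfrac{2\lambda}{n}\mathbf{w})$. This scaling, the data, and the hyperparameters are illustrative assumptions, not the slides' prescribed form:

```python
import numpy as np

# Made-up noiseless data from known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])

n = len(X)
w = np.zeros(2)
mu, lam = 0.01, 0.1  # illustrative step size and penalty
for epoch in range(100):
    for i in rng.permutation(n):
        e_i = X[i] @ w - y[i]
        # per-sample gradient of 1/2*e_i^2 + (lam/n)*||w||^2
        w -= mu * (e_i * X[i] + (2 * lam / n) * w)
```

The iterates settle near the regularized stationary point $(\mathbf{X}^T\mathbf{X} + 2\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ of this objective.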

Page 126

Effects of Regularization

Page 127

Quiz

OLS linear regression searches for a _______ model that has the best _________.

Analytic solution for OLS regression: $\mathbf{w} =$ _________

Stochastic gradient solution for OLS regression: $\Delta\mathbf{w} =$ _________

Page 128

Quiz

Large number of model parameters and/or small data may lead to ___________.

We address overfitting by __________.

“Ridge regression” means __-loss and ___-penalty.

Analytic solution for Ridge regression: $\mathbf{w} =$ _________

Page 129

Quiz

As we increase regularization strength (i.e. increase $\lambda$), the training error _________.

… and the test error ________.

Page 130

Questions?
