-
Linear Regression
& Gradient Descent
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many
others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.
-
Regression
Given:
– Data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$
2
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, shown with linear regression and quadratic regression fits]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)
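As a concrete sketch of this notation (the values below are hypothetical, chosen only for illustration), the data can be stored as an n × d matrix with one row per example and an n-vector of labels:

    import numpy as np

    # n = 4 examples, d = 2 features each (hypothetical toy values)
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])          # shape (n, d): row i is x^(i)
    y = np.array([2.0, 1.5, 3.0, 4.5])  # shape (n,): entry i is y^(i)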
-
Linear Regression
• Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$ (assume $x_0 = 1$)
• Fit model by minimizing sum of squared errors
3
Figures are courtesy of Greg Shakhnarovich
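A minimal sketch of this hypothesis in code, assuming a 1 is prepended to each input so that $\theta_0$ acts as the intercept (the function name h_theta is ours):

    import numpy as np

    def h_theta(theta, x):
        """Linear hypothesis: sum_{j=0}^{d} theta_j * x_j, with x_0 = 1."""
        x = np.concatenate(([1.0], x))  # prepend x_0 = 1
        return theta @ x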
-
Least Squares Linear Regression
4
• Cost Function
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Fit by solving $\min_{\theta} J(\theta)$
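A direct translation of the cost function into code (a sketch with our names; X is assumed to already carry the x_0 = 1 column):

    import numpy as np

    def cost(theta, X, y):
        """J(theta) = (1/2n) * sum of squared errors."""
        n = len(y)
        residuals = X @ theta - y   # h_theta(x^(i)) - y^(i) for every i
        return (residuals @ residuals) / (2 * n)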
-
Intuition Behind Cost Function
5
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Based on example by Andrew Ng
-
Intuition Behind Cost Function
6
[Figure: training data plotted as y vs. x (left: for fixed θ, this is a function of x); cost plotted against the parameter (right: function of the parameter θ)]
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Based on example by Andrew Ng
-
Intuition Behind Cost Function
7
[Figure: training data and the line h(x) = 0.5x (left: for fixed θ, this is a function of x); J plotted against the parameter, with the value at θ₁ = 0.5 marked (right: function of the parameter θ)]
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
$J([0, 0.5]) = \frac{1}{2 \times 3}\left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$
Based on example by Andrew Ng
-
Intuition Behind Cost Function
8
[Figure: training data and the horizontal line h(x) = 0 (left: for fixed θ, this is a function of x); J plotted against the parameter, with the value at θ₁ = 0 marked (right: function of the parameter θ)]
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
$J([0, 0]) = \frac{1}{2 \times 3}\left[ (0 - 1)^2 + (0 - 2)^2 + (0 - 3)^2 \right] \approx 2.333$
$J(\theta)$ is convex
Based on example by Andrew Ng
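Both worked values can be checked numerically. A quick sketch, assuming the three training points (1, 1), (2, 2), (3, 3) implied by the arithmetic above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])   # toy data assumed from the worked example

    def J(theta0, theta1):
        predictions = theta0 + theta1 * x
        return np.sum((predictions - y) ** 2) / (2 * len(x))

    print(J(0.0, 0.5))  # ~0.5833
    print(J(0.0, 0.0))  # ~2.3333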
-
Intuition Behind Cost Function
9
A twice-differentiable function $J$ of a single variable is convex if its
second derivative is always nonnegative.
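For the least squares cost this is easy to verify in the one-parameter case. A quick derivation of our own, fixing $\theta_0 = 0$ so that $\theta = \theta_1$ is a scalar:

$$\frac{d}{d\theta_1} J(\theta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \frac{d^2}{d\theta_1^2} J(\theta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} \right)^2 \ge 0$$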
-
Intuition Behind Cost Function
10
Jensen’s Inequality
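The slide's figure is not recoverable here, but the inequality it refers to is standard: a function $f$ is convex exactly when, for all $x, y$ and $t \in [0, 1]$,

$$f\left( t x + (1 - t) y \right) \le t f(x) + (1 - t) f(y)$$

and Jensen's inequality generalizes this to $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for any random variable $X$.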
-
Intuition Behind Cost Function
11
Slide by Andrew Ng
-
Intuition Behind Cost Function
12–15
(for fixed θ, this is a function of x)   (function of the parameters θ)
[Figure sequence: the hypothesis line for successive choices of θ (left) and the corresponding points on the plot of J over the parameters (right)]
Slide by Andrew Ng
-
Basic Search Procedure
• Choose initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$
16–18
[Figure sequence: successive points descending on the surface plot of $J(\theta_0, \theta_1)$ toward a minimum. Figure by Andrew Ng]
Since the least squares objective function is convex, we
don't need to worry about local minima in linear regression
-
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for j = 0 ... d)
19
$\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$
[Figure: J(θ) plotted against θ, with gradient steps moving downhill toward the minimum]
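A sketch of one such update as a generic template (grad_J is our placeholder for any function returning the gradient of J). Because the whole vector is computed from the old θ, every component is updated simultaneously:

    import numpy as np

    def gradient_descent_step(theta, grad_J, alpha=0.05):
        """One simultaneous update: theta_j <- theta_j - alpha * dJ/dtheta_j."""
        return theta - alpha * grad_J(theta)  # all j updated from the same old theta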
-
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for j = 0 ... d)
20–23
For Linear Regression:
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}$$
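The final line vectorizes cleanly over all j at once. A sketch (our names; X assumed to carry the x_0 = 1 column):

    import numpy as np

    def grad_J_linreg(theta, X, y):
        """Component j is (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
        n = len(y)
        residuals = X @ theta - y    # h_theta(x^(i)) - y^(i)
        return (X.T @ residuals) / n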
-
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$   (simultaneous update for j = 0 ... d)
24
• Assume convergence when $\lVert \theta_{\text{new}} - \theta_{\text{old}} \rVert_2 < \epsilon$
L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$
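Putting the pieces together, a minimal sketch of the full loop under the same assumptions (X carries the x_0 = 1 column; the values of alpha, eps, and max_iters are hypothetical):

    import numpy as np

    def gradient_descent_linreg(X, y, alpha=0.05, eps=1e-6, max_iters=100000):
        """Batch gradient descent for least squares linear regression."""
        n, d = X.shape
        theta = np.zeros(d)                      # initialize theta
        for _ in range(max_iters):
            residuals = X @ theta - y            # h_theta(x^(i)) - y^(i)
            theta_new = theta - alpha * (X.T @ residuals) / n  # simultaneous update
            if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2
                return theta_new
            theta = theta_new
        return theta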
-
Gradient Descent
25
(for fixed θ, this is a function of x)   (function of the parameters θ)
h(x) = −900 − 0.1x
Slide by Andrew Ng
-
Gradient Descent
26–33
(for fixed θ, this is a function of x)   (function of the parameters θ)
[Figure sequence: after each gradient descent step, the fitted line h(x) for the current θ (left) and the corresponding point moving across the plot of J over the parameters (right)]
Slide by Andrew Ng
-
Choosing α
34
α too small: slow convergence
α too large: increasing value for $J(\theta)$
• May overshoot the minimum
• May fail to converge
• May even diverge
To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
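A sketch of that diagnostic, reusing the cost and grad_J_linreg functions sketched earlier (our names, not from the slides; the toy data are hypothetical):

    import numpy as np

    # Prepend the x_0 = 1 column assumed by cost and grad_J_linreg above.
    X = np.column_stack([np.ones(3), np.array([1.0, 2.0, 3.0])])
    y = np.array([1.0, 2.0, 3.0])

    theta = np.zeros(X.shape[1])
    for it in range(50):
        theta = theta - 0.05 * grad_J_linreg(theta, X, y)
        print(it, cost(theta, X, y))  # J should decrease each step; if not, lower alpha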