-
Linear Regression
& Gradient Descent
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many
others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.
-
Regression
Given:
– Data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$
2
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, shown with linear regression and quadratic regression fits]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)
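As a concrete sketch of this notation (the values below are hypothetical, chosen only for illustration), the data can be stored as an n × d matrix with one row per example and an n-vector of labels:

    import numpy as np

    # n = 4 examples, d = 2 features each (hypothetical toy values)
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])          # shape (n, d): row i is x^(i)
    y = np.array([2.0, 1.5, 3.0, 4.5])  # shape (n,): entry i is y^(i)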
-
Linear Regression
• Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$ (assume $x_0 = 1$)
• Fit model by minimizing sum of squared errors
3
Figures are courtesy of Greg Shakhnarovich
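A minimal sketch of this hypothesis in code, assuming a 1 is prepended to each input so that $\theta_0$ acts as the intercept (the function name h_theta is ours):

    import numpy as np

    def h_theta(theta, x):
        """Linear hypothesis: sum_{j=0}^{d} theta_j * x_j, with x_0 = 1."""
        x = np.concatenate(([1.0], x))  # prepend x_0 = 1
        return theta @ x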
-
Least Squares Linear Regression
4
• Cost Function
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Fit by solving $\min_{\theta} J(\theta)$
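A direct translation of the cost function into code (a sketch with our names; X is assumed to already carry the x_0 = 1 column):

    import numpy as np

    def cost(theta, X, y):
        """J(theta) = (1/2n) * sum of squared errors."""
        n = len(y)
        residuals = X @ theta - y   # h_theta(x^(i)) - y^(i) for every i
        return (residuals @ residuals) / (2 * n)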
-
Intuition Behind Cost Function
5
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Based on example by Andrew Ng
-
Intuition Behind Cost Function
6
[Figure: training data plotted as y vs. x (left: for fixed θ, this is a function of x); cost plotted against the parameter (right: function of the parameter θ)]
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Based on example by Andrew Ng
-
Intuition Behind Cost Function
7
[Figure: training data and the line h(x) = 0.5x (left: for fixed θ, this is a function of x); J plotted against the parameter, with the value at θ₁ = 0.5 marked (right: function of the parameter θ)]
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
$J([0, 0.5]) = \frac{1}{2 \times 3}\left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$
Based on example by Andrew Ng
-
Intuition Behind Cost Function
8
[Figure: training data and the horizontal line h(x) = 0 (left: for fixed θ, this is a function of x); J plotted against the parameter, with the value at θ₁ = 0 marked (right: function of the parameter θ)]
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
$J([0, 0]) = \frac{1}{2 \times 3}\left[ (0 - 1)^2 + (0 - 2)^2 + (0 - 3)^2 \right] \approx 2.333$
$J(\theta)$ is convex
Based on example by Andrew Ng
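Both worked values can be checked numerically. A quick sketch, assuming the three training points (1, 1), (2, 2), (3, 3) implied by the arithmetic above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])   # toy data assumed from the worked example

    def J(theta0, theta1):
        predictions = theta0 + theta1 * x
        return np.sum((predictions - y) ** 2) / (2 * len(x))

    print(J(0.0, 0.5))  # ~0.5833
    print(J(0.0, 0.0))  # ~2.3333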
-
Intuition Behind Cost Function
9
A twice-differentiable function $J$ of a single variable is convex if its
second derivative is always nonnegative.
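For the least squares cost this is easy to verify in the one-parameter case. A quick derivation of our own, fixing $\theta_0 = 0$ so that $\theta = \theta_1$ is a scalar:

$$\frac{d}{d\theta_1} J(\theta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \frac{d^2}{d\theta_1^2} J(\theta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} \right)^2 \ge 0$$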
-
Intuition Behind Cost Function
10
Jensen’s Inequality
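The slide's figure is not recoverable here, but the inequality it refers to is standard: a function $f$ is convex exactly when, for all $x, y$ and $t \in [0, 1]$,

$$f\left( t x + (1 - t) y \right) \le t f(x) + (1 - t) f(y)$$

and Jensen's inequality generalizes this to $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for any random variable $X$.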
-
Intuition Behind Cost Function
11
Slide by Andrew Ng
-
Intuition Behind Cost Function
12–15
(for fixed θ, this is a function of x)   (function of the parameters θ)
[Figure sequence: the hypothesis line for successive choices of θ (left) and the corresponding points on the plot of J over the parameters (right)]
Slide by Andrew Ng
-
Basic Search Procedure
• Choose initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$
16–18
[Figure sequence: successive points descending on the surface plot of $J(\theta_0, \theta_1)$ toward a minimum. Figure by Andrew Ng]
Since the least squares objective function is convex, we
don't need to worry about local minima in linear regression
-
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for j = 0 ... d)
19
$\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$
[Figure: J(θ) plotted against θ, with gradient steps moving downhill toward the minimum]
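A sketch of one such update as a generic template (grad_J is our placeholder for any function returning the gradient of J). Because the whole vector is computed from the old θ, every component is updated simultaneously:

    import numpy as np

    def gradient_descent_step(theta, grad_J, alpha=0.05):
        """One simultaneous update: theta_j <- theta_j - alpha * dJ/dtheta_j."""
        return theta - alpha * grad_J(theta)  # all j updated from the same old theta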
-
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for j = 0 ... d)
20–23
For Linear Regression:
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}$$
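The final line vectorizes cleanly over all j at once. A sketch (our names; X assumed to carry the x_0 = 1 column):

    import numpy as np

    def grad_J_linreg(theta, X, y):
        """Component j is (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
        n = len(y)
        residuals = X @ theta - y    # h_theta(x^(i)) - y^(i)
        return (X.T @ residuals) / n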
-
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$   (simultaneous update for j = 0 ... d)
24
• Assume convergence when $\lVert \theta_{\text{new}} - \theta_{\text{old}} \rVert_2 < \epsilon$
L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$
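Putting the pieces together, a minimal sketch of the full loop under the same assumptions (X carries the x_0 = 1 column; the values of alpha, eps, and max_iters are hypothetical):

    import numpy as np

    def gradient_descent_linreg(X, y, alpha=0.05, eps=1e-6, max_iters=100000):
        """Batch gradient descent for least squares linear regression."""
        n, d = X.shape
        theta = np.zeros(d)                      # initialize theta
        for _ in range(max_iters):
            residuals = X @ theta - y            # h_theta(x^(i)) - y^(i)
            theta_new = theta - alpha * (X.T @ residuals) / n  # simultaneous update
            if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2
                return theta_new
            theta = theta_new
        return theta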
-
Gradient Descent
25
(for fixed θ, this is a function of x)   (function of the parameters θ)
h(x) = −900 − 0.1x
Slide by Andrew Ng
-
Gradient Descent
26–33
(for fixed θ, this is a function of x)   (function of the parameters θ)
[Figure sequence: after each gradient descent step, the fitted line h(x) for the current θ (left) and the corresponding point moving across the plot of J over the parameters (right)]
Slide by Andrew Ng
-
Choosing α
34
α too small: slow convergence
α too large: increasing value for $J(\theta)$
• May overshoot the minimum
• May fail to converge
• May even diverge
To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
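A sketch of that diagnostic, reusing the cost and grad_J_linreg functions sketched earlier (our names, not from the slides; the toy data are hypothetical):

    import numpy as np

    # Prepend the x_0 = 1 column assumed by cost and grad_J_linreg above.
    X = np.column_stack([np.ones(3), np.array([1.0, 2.0, 3.0])])
    y = np.array([1.0, 2.0, 3.0])

    theta = np.zeros(X.shape[1])
    for it in range(50):
        theta = theta - 0.05 * grad_J_linreg(theta, X, y)
        print(it, cost(theta, X, y))  # J should decrease each step; if not, lower alpha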