Linear Regression & Gradient Descent
TRANSCRIPT
Linear Regression
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Regression
Given:
– Data X = {x^(1), ..., x^(n)} where x^(i) ∈ R^d
– Corresponding labels y = {y^(1), ..., y^(n)} where y^(i) ∈ R

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]
Linear Regression
• Hypothesis:
  y = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_d x_d = Σ_{j=0}^d θ_j x_j        (assume x_0 = 1)
• Fit model by minimizing sum of squared errors
Figures are courtesy of Greg Shakhnarovich
Least Squares Linear Regression
• Cost function:
  J(θ) = 1/(2n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )^2
• Fit by solving min_θ J(θ)
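To make the cost function concrete, here is a minimal NumPy sketch (the function name and toy data are illustrative, not from the slides); the three points match the worked example on the intuition slides below.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2.

    X is n-by-(d+1) with a leading column of ones (so x_0 = 1),
    theta has d+1 entries, y has n entries.
    """
    n = len(y)
    residuals = X @ theta - y            # h_theta(x^(i)) - y^(i) for every i
    return (residuals @ residuals) / (2 * n)

# Toy dataset: the three points (1, 1), (2, 2), (3, 3)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is x_0 = 1
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 0.5]), X, y))  # ~0.583
print(cost(np.array([0.0, 0.0]), X, y))  # ~2.333
```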
Intuition Behind Cost Function
For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1]:
  J(θ) = 1/(2n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )^2
Based on example by Andrew Ng
Intuition Behind Cost Function
For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1]:
  J(θ) = 1/(2n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )^2
[Figure: left, training data y vs. x with the line h_θ(x) (for fixed θ, this is a function of x); right, J(θ_1) plotted as a function of the parameter θ_1]
Based on example by Andrew Ng
Intuition Behind Cost Function
[Figure: left, the three training points (1, 1), (2, 2), (3, 3) with the line h_θ(x) for θ = [0, 0.5] (for fixed θ, a function of x); right, the corresponding point on the J(θ_1) curve (a function of the parameter)]
  J([0, 0.5]) = 1/(2 × 3) [ (0.5 − 1)^2 + (1 − 2)^2 + (1.5 − 3)^2 ] ≈ 0.58
Based on example by Andrew Ng
Intuition Behind Cost Function
[Figure: left, the training data with the flat line h_θ(x) for θ = [0, 0]; right, the J(θ_1) curve traced out over several values of θ_1]
  J([0, 0]) = 1/(2 × 3) [ 1^2 + 2^2 + 3^2 ] ≈ 2.333
Based on example by Andrew Ng
J(θ) is convex
Intuition Behind Cost Function
[Figure sequence: left, the hypothesis h_θ(x) for fixed θ = [θ_0, θ_1] (a function of x); right, a contour plot of J(θ_0, θ_1) (a function of the parameters). Each frame marks a different θ and shows the corresponding line fit.]
Slides by Andrew Ng
Basic Search Procedure
• Choose an initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce J(θ)
[Figure: surface plot of J(θ_0, θ_1); successive choices of θ step downhill toward a minimum. Figure by Andrew Ng]
Since the least squares objective function is convex, we don't need to worry about local minima.
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂J(θ)/∂θ_j        (simultaneous update for j = 0 ... d)
α is the learning rate (small), e.g., α = 0.05
[Figure: the J(θ) curve with successive gradient descent steps moving toward the minimum]
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂J(θ)/∂θ_j        (simultaneous update for j = 0 ... d)

For linear regression:
  ∂J(θ)/∂θ_j = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )^2 ]
             = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n ( Σ_{k=0}^d θ_k x_k^(i) − y^(i) )^2 ]
             = 1/n Σ_{i=1}^n ( Σ_{k=0}^d θ_k x_k^(i) − y^(i) ) × ∂/∂θ_j ( Σ_{k=0}^d θ_k x_k^(i) − y^(i) )
             = 1/n Σ_{i=1}^n ( Σ_{k=0}^d θ_k x_k^(i) − y^(i) ) x_j^(i)
             = 1/n Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) ) x_j^(i)
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α (1/n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) ) x_j^(i)        (simultaneous update for j = 0 ... d)
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute h_θ(x^(i))
  – Use this stored value in the update step loop
• Assume convergence when ‖θ_new − θ_old‖_2 < ε
  L2 norm: ‖v‖_2 = sqrt( Σ_i v_i^2 ) = sqrt( v_1^2 + v_2^2 + ... + v_|v|^2 )
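Below is a minimal NumPy sketch of this loop, assuming X already carries the leading column of ones; the function name and the default α, ε, and iteration cap are illustrative choices, not values from the slides.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for linear regression.

    X: n-by-(d+1) design matrix whose first column is all ones.
    Stops when the L2 norm ||theta_new - theta_old||_2 drops below eps.
    """
    n, num_params = X.shape
    theta = np.zeros(num_params)          # initialize theta
    for _ in range(max_iters):
        preds = X @ theta                 # h_theta(x^(i)), computed once per iteration
        grad = (X.T @ (preds - y)) / n    # 1/n sum_i (h_theta(x^(i)) - y^(i)) x_j^(i), all j
        theta_new = theta - alpha * grad  # simultaneous update for j = 0 ... d
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new              # converged
        theta = theta_new
    return theta

# On the toy data from the cost-function sketch, this approaches [0, 1] (the line y = x):
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(X, y))
```

Computing preds once at the top of each iteration and deriving every θ_j update from it is exactly the stored-value trick the slide describes for the simultaneous update.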
Gradient Descent
[Figure sequence: left, the hypothesis h(x) for fixed θ (a function of x) fit to housing-price data, starting from h(x) = −900 − 0.1x; right, a contour plot of J(θ_0, θ_1) (a function of the parameters). Each gradient descent step moves θ toward the minimum of the contours while the fitted line improves.]
Slides by Andrew Ng
Choosing α
• α too small: slow convergence
• α too large: increasing value for J(θ)
  – May overshoot the minimum
  – May fail to converge
  – May even diverge
To see if gradient descent is working, print out J(θ) each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α (see the sketch below)
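One way to run this diagnostic, sketched with illustrative α values on the toy data from earlier: record J(θ) after every iteration and check that it decreases.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # leading ones column
y = np.array([1.0, 2.0, 3.0])

for alpha in (0.01, 0.1, 1.5):                        # illustrative learning rates
    theta = np.zeros(2)
    J_history = []
    for _ in range(50):
        grad = (X.T @ (X @ theta - y)) / len(y)
        theta = theta - alpha * grad
        r = X @ theta - y
        J_history.append((r @ r) / (2 * len(y)))      # J(theta) after this update
    decreasing = all(a > b for a, b in zip(J_history, J_history[1:]))
    print(f"alpha={alpha}: J decreased every iteration: {decreasing}")
    # Expect True for the two small rates; alpha=1.5 diverges on this data, so False.
```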
Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: y = b_0 + b_1 x + b_2 x^2 + b_3 x^3
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: x_3 = x_1 × x_2
This allows the use of linear regression techniques to fit non-linear datasets (see the sketch below).
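A small sketch of the polynomial case: expand a 1-D input into columns [1, x, x^2, x^3], then fit with ordinary least squares. The synthetic data and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.1, 50)  # hypothetical cubic data

# Polynomial transformation: the model stays linear in the parameters.
X_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)        # ordinary least squares
print(theta)  # roughly [1, -2, 0, 0.5]

# An interaction feature between two inputs: x3 = x1 * x2
x1, x2 = rng.random(50), rng.random(50)
X_inter = np.column_stack([np.ones(50), x1, x2, x1 * x2])
```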
Linear Basis Function Models
• Generally, h_θ(x) = Σ_{j=0}^d θ_j φ_j(x), where φ_j(x) is a basis function
• Typically, φ_0(x) = 1 so that θ_0 acts as a bias
• In the simplest case, we use linear basis functions: φ_j(x) = x_j
Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Polynomial basis functions: φ_j(x) = x^j
  – These are global; a small change in x affects all basis functions
• Gaussian basis functions: φ_j(x) = exp( −(x − μ_j)^2 / (2s^2) )
  – These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).
Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Sigmoidal basis functions: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + exp(−a))
  – These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).
Based on slide by Christopher Bishop (PRML)
Example of Fitting a Polynomial Curve with a Linear Model
  y = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_p x^p = Σ_{j=0}^p θ_j x^j
Linear Basis Function Models
• Basic linear model:       h_θ(x) = Σ_{j=0}^d θ_j x_j
• Generalized linear model: h_θ(x) = Σ_{j=0}^d θ_j φ_j(x)
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions
Based on slide by Geoff Hinton
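As a sketch of how a basis expansion plugs into the same fitting machinery, here is a Gaussian-basis design matrix; the centers μ_j, width s, and sine-wave data are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_design_matrix(x, centers, s):
    """Columns phi_0(x) = 1 and phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s**2))
    return np.column_stack([np.ones_like(x), phi])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 100)   # hypothetical non-linear data

Phi = gaussian_design_matrix(x, centers=np.linspace(0, 1, 9), s=0.1)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # the same least-squares problem
```

Once Φ replaces X, nothing about the fitting procedure changes, which is the slide's point.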
Linear Algebra Concepts
• A vector in R^d is an ordered set of d real numbers
  – e.g., v = [1, 6, 3, 4] is in R^4
  – "[1, 6, 3, 4]" is a column vector, as opposed to a row vector: (1 6 3 4)
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Transpose: reflect a vector/matrix across its diagonal:
  (a; b)^T = (a b)        [a b; c d]^T = [a c; b d]
  – Note: (Ax)^T = x^T A^T  (we'll define multiplication soon...)
• Vector norms:
  – The Lp norm of v = (v_1, ..., v_k) is ( Σ_i |v_i|^p )^(1/p)
  – Common norms: L1, L2
  – L∞ = max_i |v_i|
• The length of a vector v is L2(v)
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector dot product: u • v = (u_1 u_2) • (v_1 v_2) = u_1 v_1 + u_2 v_2
  – Note: the dot product of u with itself is length(u)^2 = ‖u‖_2^2
• Matrix product:
  A = [a_11 a_12; a_21 a_22],  B = [b_11 b_12; b_21 b_22]
  AB = [a_11 b_11 + a_12 b_21   a_11 b_12 + a_12 b_22;
        a_21 b_11 + a_22 b_21   a_21 b_12 + a_22 b_22]
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector products:
  – Dot product:   u • v = u^T v = (u_1 u_2) (v_1; v_2) = u_1 v_1 + u_2 v_2
  – Outer product: u v^T = (u_1; u_2) (v_1 v_2) = [u_1 v_1  u_1 v_2; u_2 v_1  u_2 v_2]
Based on slides by Joseph Bradley
Vectorization
• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model: h(x) = Σ_{j=0}^d θ_j x_j
• Let θ = [θ_0; θ_1; ...; θ_d] and x^T = [1  x_1  ...  x_d]
• Can write the model in vectorized form as h(x) = θ^T x
Vectorization
• Consider our model for n instances: h(x^(i)) = Σ_{j=0}^d θ_j x_j^(i)
• Let θ ∈ R^((d+1)×1) and X ∈ R^(n×(d+1)):
  θ = [θ_0; θ_1; ...; θ_d]
  X = [1  x_1^(1)  ...  x_d^(1);
       ...
       1  x_1^(i)  ...  x_d^(i);
       ...
       1  x_1^(n)  ...  x_d^(n)]
• Can write the model in vectorized form as h_θ(x) = Xθ
Vectorization
• For the linear regression cost function:
  J(θ) = 1/(2n) Σ_{i=1}^n ( θ^T x^(i) − y^(i) )^2
• Let y = [y^(1); y^(2); ...; y^(n)]
• Then, in vectorized form, with X ∈ R^(n×(d+1)), θ ∈ R^((d+1)×1), and y ∈ R^(n×1):
  J(θ) = 1/(2n) (Xθ − y)^T (Xθ − y)
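The vectorized expression maps directly onto matrix code; a sketch comparing it against the per-instance sum to confirm they agree (the data is random and illustrative):

```python
import numpy as np

def cost_vectorized(theta, X, y):
    """J(theta) = 1/(2n) (X theta - y)^T (X theta - y)."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def cost_loop(theta, X, y):
    """The same J(theta), summed one instance at a time."""
    return sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * len(y))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])
theta, y = rng.normal(size=4), rng.normal(size=1000)
assert np.isclose(cost_vectorized(theta, X, y), cost_loop(theta, X, y))
```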
Closed Form Solution
• Instead of using GD, solve for the optimal θ analytically
  – Notice that the solution is where ∂J(θ)/∂θ = 0
• Derivation:
  J(θ) = 1/(2n) (Xθ − y)^T (Xθ − y)
       ∝ θ^T X^T X θ − y^T X θ − θ^T X^T y + y^T y
       ∝ θ^T X^T X θ − 2 θ^T X^T y + y^T y
  (since y^T Xθ is a 1 × 1 scalar, it equals its transpose θ^T X^T y)
  Take the derivative and set it equal to 0, then solve for θ:
  ∂/∂θ ( θ^T X^T X θ − 2 θ^T X^T y + y^T y ) = 0
  (X^T X) θ − X^T y = 0
  (X^T X) θ = X^T y
  θ = (X^T X)^(−1) X^T y
Closed Form Solution
• Can obtain θ by simply plugging X and y into
  θ = (X^T X)^(−1) X^T y
  with X = [1  x_1^(1)  ...  x_d^(1); ...; 1  x_1^(n)  ...  x_d^(n)] and y = [y^(1); y^(2); ...; y^(n)]
• If X^T X is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python: numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
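A sketch of the closed-form fit, using the pseudo-inverse as suggested above so that a singular X^T X is handled gracefully (the function name is illustrative):

```python
import numpy as np

def fit_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse so the
    solve still works when X^T X is singular (redundant features, or d > n)."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# On well-conditioned data this matches gradient descent up to its tolerance.
```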
Gradient Descent vs. Closed Form

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if d is large
  – Computing (X^T X)^(−1) is roughly O(d^3)
Improving Learning: Feature Scaling
• Idea: ensure that features have similar scales
• Makes gradient descent converge much faster
[Figure: contours of J over (θ_1, θ_2) before feature scaling (elongated ellipses) and after feature scaling (nearly circular)]
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let μ_j be the mean of feature j: μ_j = (1/n) Σ_{i=1}^n x_j^(i)
  – Replace each value x_j^(i) with: ( x_j^(i) − μ_j ) / s_j        for j = 1 ... d (not x_0!)
    • s_j is the standard deviation of feature j
    • Could also use the range of feature j (max_j − min_j) for s_j
• Must apply the same transformation to instances for both training and prediction (see the sketch below)
• Outliers can cause problems
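A sketch of standardization that learns μ_j and s_j on the training set and reuses them at prediction time; the data and function names are illustrative.

```python
import numpy as np

def fit_scaler(X_raw):
    """Per-feature mean and standard deviation, computed on training data only."""
    return X_raw.mean(axis=0), X_raw.std(axis=0)

def standardize(X_raw, mu, s):
    """Replace each x_j with (x_j - mu_j) / s_j; the ones column is added after."""
    return (X_raw - mu) / s

rng = np.random.default_rng(0)
X_train_raw = rng.normal(5.0, 2.0, (80, 3))   # hypothetical raw features (no x_0)
X_test_raw = rng.normal(5.0, 2.0, (20, 3))

mu, s = fit_scaler(X_train_raw)
X_train = np.column_stack([np.ones(80), standardize(X_train_raw, mu, s)])
X_test = np.column_stack([np.ones(20), standardize(X_test_raw, mu, s)])  # same mu, s!
```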
Quality of Fit
[Figure: three fits of Price vs. Size: underfitting (high bias), correct fit, and overfitting (high variance)]
Overfitting:
• The learned hypothesis may fit the training set very well ( J(θ) ≈ 0 )
• ...but fails to generalize to new examples
Based on example by Andrew Ng
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of θ_j
  – Can incorporate the penalty into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
• Linear regression objective function:
  J(θ) = 1/(2n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )^2  +  (λ/2) Σ_{j=1}^d θ_j^2
         [model fit to data]                           [regularization]
• λ is the regularization parameter (λ ≥ 0)
• No regularization on θ_0!
Understanding Regularization
• Note that Σ_{j=1}^d θ_j^2 = ‖θ_{1:d}‖_2^2
  – This is the magnitude of the feature coefficient vector!
• We can also think of this as Σ_{j=1}^d (θ_j − 0)^2 = ‖θ_{1:d} − 0‖_2^2, where 0 is the zero vector
• L2 regularization pulls coefficients toward 0
Understanding Regularization
• What happens if we set λ to be huge (e.g., 10^10)?
  – The penalty dominates, so θ_1 ≈ θ_2 ≈ ... ≈ θ_d ≈ 0 and h_θ(x) ≈ θ_0: a flat line that underfits the data
[Figure: Price vs. Size with the fitted curve flattened to a horizontal line as the coefficients are driven to 0]
Based on example by Andrew Ng
Regularized Linear Regression
• Cost function:
  J(θ) = 1/(2n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )^2 + (λ/2) Σ_{j=1}^d θ_j^2
• Fit by solving min_θ J(θ)
• Gradient update:
  θ_0 ← θ_0 − α (1/n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) )                        [θ_0 ← θ_0 − α ∂J(θ)/∂θ_0]
  θ_j ← θ_j − α (1/n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) ) x_j^(i) − α λ θ_j      [θ_j ← θ_j − α ∂J(θ)/∂θ_j]
  where −αλθ_j is the regularization term
Regularized Linear Regression
• We can rewrite the gradient step as:
  θ_j ← θ_j (1 − αλ) − α (1/n) Σ_{i=1}^n ( h_θ(x^(i)) − y^(i) ) x_j^(i)
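The rewritten form shows the regularizer acting as "weight decay": each step first shrinks θ_j by the factor (1 − αλ), then applies the usual gradient step. A sketch of one such update, with the bias θ_0 left undecayed as the slides require (the function name is illustrative):

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One gradient step for L2-regularized linear regression."""
    n = len(y)
    grad = (X.T @ (X @ theta - y)) / n        # unregularized gradient, every j at once
    decay = np.full_like(theta, 1.0 - alpha * lam)
    decay[0] = 1.0                            # no shrinkage on theta_0
    return decay * theta - alpha * grad
```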
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  θ = ( X^T X + λ [0 0 0 ... 0; 0 1 0 ... 0; 0 0 1 ... 0; ...; 0 0 0 ... 1] )^(−1) X^T y
  (the matrix is the (d+1) × (d+1) identity with its top-left entry set to 0, so θ_0 is not regularized)
• Can derive this the same way, by solving ∂J(θ)/∂θ = 0
• Can prove that for λ > 0, the inverse in the equation above always exists
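A sketch of the regularized closed form; E below is the identity with its top-left entry zeroed so θ_0 escapes the penalty, and since the inverse exists for λ > 0 (per the note above), a direct solve is safe (the function name is illustrative):

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam * E)^{-1} X^T y, with E = diag(0, 1, ..., 1)."""
    E = np.eye(X.shape[1])
    E[0, 0] = 0.0                             # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)
```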