Linear Regression & Gradient Descent


Page 1: Linear Regression & Gradient Descent

Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Page 2: Linear Regression & Gradient Descent

Regression
Given:
• Data $X = \{x^{(1)}, \ldots, x^{(n)}\}$, where $x^{(i)} \in \mathbb{R}^d$
• Corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$, where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970-2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]

Page 3: Linear Regression & Gradient Descent

Linear Regression
• Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$   (assume $x_0 = 1$)
• Fit the model by minimizing the sum of squared errors

Figures are courtesy of Greg Shakhnarovich

Page 4: Linear Regression & Gradient Descent

Least Squares Linear Regression
• Cost function:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$
• Fit by solving $\min_\theta J(\theta)$
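As a concrete reference, here is a minimal NumPy sketch of the hypothesis and cost function above. It assumes the data matrix X already contains the $x_0 = 1$ column; the function names are ours, not from the slides.

```python
import numpy as np

def hypothesis(X, theta):
    # h_theta(x) = sum_j theta_j * x_j for every row of X
    # (X is assumed to already include the x_0 = 1 bias column)
    return X @ theta

def cost(X, y, theta):
    # J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2
    n = len(y)
    residuals = hypothesis(X, theta) - y
    return (residuals @ residuals) / (2 * n)
```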

Page 5: Linear Regression & Gradient Descent

Intuition Behind Cost Function

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

Based on example by Andrew Ng

Page 6: Linear Regression & Gradient Descent

Intuition Behind Cost Function

(for fixed $\theta_1$, this is a function of $x$)   (function of the parameter $\theta_1$)

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

[Figure: left, the data and the line $h_\theta(x)$ plotted against $x$; right, $J$ plotted against $\theta_1$]

Based on example by Andrew Ng

Page 7: Linear Regression & Gradient Descent

Intuition Behind Cost Function

(for fixed $\theta_1$, this is a function of $x$)   (function of the parameter $\theta_1$)

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

$J([0, 0.5]) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$

[Figure: left, the data and the line $h_\theta(x)$ plotted against $x$; right, $J$ plotted against $\theta_1$]

Based on example by Andrew Ng

Page 8: Linear Regression & Gradient Descent

Intuition Behind Cost Function

(for fixed $\theta_1$, this is a function of $x$)   (function of the parameter $\theta_1$)

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

$J([0, 0]) \approx 2.333$

[Figure: left, the data and the line $h_\theta(x)$ plotted against $x$; right, $J$ plotted against $\theta_1$]

$J(\theta)$ is convex

Based on example by Andrew Ng
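A quick arithmetic check of the two values quoted above; the squared errors come straight from the worked example, and the snippet is only a sanity check.

```python
# theta = [0, 0.5]: predictions 0.5, 1.0, 1.5 vs. targets 1, 2, 3
print(((0.5 - 1)**2 + (1.0 - 2)**2 + (1.5 - 3)**2) / (2 * 3))  # 0.5833... ~ 0.58

# theta = [0, 0]: all predictions are 0
print(((0 - 1)**2 + (0 - 2)**2 + (0 - 3)**2) / (2 * 3))        # 2.3333...
```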

Page 9: Linear Regression & Gradient Descent

Intuition Behind Cost Function

Slide by Andrew Ng

Page 10: Linear Regression & Gradient Descent

Intuition Behind Cost Function

(for fixed $\theta_0, \theta_1$, this is a function of $x$)   (function of the parameters $\theta_0, \theta_1$)

Slide by Andrew Ng

Page 14: Linear Regression & Gradient Descent

Basic Search Procedure
• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure: surface plot of $J(\theta_0, \theta_1)$]

Figure by Andrew Ng

Page 16: Linear Regression & Gradient Descent

Basic Search Procedure
• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure: surface plot of $J(\theta_0, \theta_1)$]

Figure by Andrew Ng

Since the least squares objective function is convex, we don't need to worry about local minima.

Page 17: Linear Regression & Gradient Descent

Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

  $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: plot of $J(\theta)$]

Page 18: Linear Regression & Gradient Descent

Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

For linear regression:

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

$= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$

$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$

$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$

$= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

Page 22: Linear Regression & Gradient Descent

Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$   (simultaneous update for $j = 0 \ldots d$)
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\lVert \theta_{\text{new}} - \theta_{\text{old}} \rVert_2 < \epsilon$

  L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$
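A minimal NumPy sketch of this procedure, assuming X already contains the bias column and using the convergence test above. The default learning rate follows the earlier example (α = 0.05); the ε value and iteration cap are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        # Compute all predictions h_theta(x^(i)) once, so every theta_j
        # is updated from the same stored values (simultaneous update).
        predictions = X @ theta
        gradient = (X.T @ (predictions - y)) / n
        theta_new = theta - alpha * gradient
        # Convergence test: ||theta_new - theta_old||_2 < epsilon
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta
```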

Page 23: Linear Regression & Gradient Descent

Gradient Descent

(for fixed $\theta_0, \theta_1$, this is a function of $x$)   (function of the parameters $\theta_0, \theta_1$)

$h(x) = -900 - 0.1\,x$

Slide by Andrew Ng

Page 24: Linear Regression & Gradient Descent

Gradient Descent

(for fixed $\theta_0, \theta_1$, this is a function of $x$)   (function of the parameters $\theta_0, \theta_1$)

Slide by Andrew Ng

Page 32: Linear Regression & Gradient Descent

Choosing α

α too small: slow convergence

α too large: increasing value for $J(\theta)$
• May overshoot the minimum
• May fail to converge
• May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
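One way to run the diagnostic just described, sketched in Python. The function name and fixed iteration count are ours; the loop body mirrors the gradient descent update above.

```python
import numpy as np

def gradient_descent_verbose(X, y, alpha, iters=50):
    # Prints J(theta) every iteration; the value should decrease.
    # If it increases instead, alpha is too large.
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for it in range(iters):
        residuals = X @ theta - y
        print(f"iter {it:3d}  J(theta) = {residuals @ residuals / (2 * n):.6f}")
        theta = theta - alpha * (X.T @ residuals) / n
    return theta
```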

Page 33: Linear Regression & Gradient Descent

Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs (e.g., log, exp, square root, square, etc.)
  – Polynomial transformations (example: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$)
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables (example: $x_3 = x_1 \times x_2$)

This allows the use of linear regression techniques to fit non-linear datasets (see the sketch below).
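A small illustration of the idea in NumPy: the model stays linear in the parameters, only the columns of the design matrix change. The particular transformations chosen here are just examples.

```python
import numpy as np

def expand_features(x1, x2):
    # Build a design matrix from two raw inputs using the kinds of
    # transformations listed above: bias, originals, a square,
    # a log, and an interaction term.
    return np.column_stack([
        np.ones_like(x1),  # x_0 = 1 (bias)
        x1,                # original input
        x2,                # original input
        x1 ** 2,           # polynomial term
        np.log(x2),        # log transform (assumes x2 > 0)
        x1 * x2,           # interaction term
    ])
```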

Page 34: Linear Regression & Gradient Descent

Linear Basis Function Models
• Generally, $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$, where $\phi_j(x)$ is a basis function
• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Page 35: Linear Regression & Gradient Descent

Linear Basis Function Models
• Polynomial basis functions
  – These are global; a small change in $x$ affects all basis functions
• Gaussian basis functions
  – These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Based on slide by Christopher Bishop (PRML)
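A sketch of these two families in NumPy, assuming the standard definitions from Bishop's PRML: $\phi_j(x) = x^j$ for the polynomial basis and $\phi_j(x) = \exp\left(-(x - \mu_j)^2 / (2s^2)\right)$ for the Gaussian basis.

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x**j for j = 0..degree
    # (global: changing x changes every feature column)
    return np.column_stack([x ** j for j in range(degree + 1)])

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    # (local bumps at the centers mu_j; s controls the width)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
```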

Page 36: Linear Regression & Gradient Descent

Linear Basis Function Models
• Sigmoidal basis functions
  – These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
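Continuing the sketch, the sigmoidal basis under the usual PRML definition $\phi_j(x) = \sigma\left((x - \mu_j)/s\right)$ with $\sigma(a) = 1/(1 + e^{-a})$ (assumed definitions, following Bishop):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s): local, with mu_j setting the
    # location and s setting the slope of each sigmoid
    return sigmoid((x[:, None] - centers[None, :]) / s)
```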

Page 37: Linear Regression & Gradient Descent

Example of Fitting a Polynomial Curve with a Linear Model

$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
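A compact version of this in NumPy; the data, noise level, and degree below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)  # noisy non-linear data

# Polynomial design matrix: columns x^0, x^1, ..., x^p
p = 5
X = np.column_stack([x ** j for j in range(p + 1)])

# Ordinary least squares fit; the model is still linear in theta.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta
```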

Page 38: Linear Regression & Gradient Descent

Linear Basis Function Models
• Basic linear model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$
• Generalized linear model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick (more on that when we cover support vector machines)
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton

Page 39: Linear Regression & Gradient Descent

Linear Algebra Concepts
• A vector in $\mathbb{R}^d$ is an ordered set of d real numbers
  – e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$
  – "[1, 6, 3, 4]" is a column vector, as opposed to the row vector $(1 \; 6 \; 3 \; 4)$
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number

Based on slides by Joseph Bradley

Page 40: Linear Regression & Gradient Descent

Linear Algebra Concepts
• Transpose: reflect a vector/matrix across its diagonal:
  $\begin{pmatrix} a \\ b \end{pmatrix}^\top = (a \; b)$, $\qquad \begin{pmatrix} a & b \\ c & d \end{pmatrix}^\top = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$
  – Note: $(Ax)^\top = x^\top A^\top$ (we'll define multiplication soon...)
• Vector norms:
  – The $L_p$ norm of $v = (v_1, \ldots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$
• The length of a vector $v$ is $L_2(v)$

Based on slides by Joseph Bradley

Page 41: Linear Regression & Gradient Descent

Linear Algebra Concepts
• Vector dot product: $u \cdot v = (u_1 \; u_2) \cdot (v_1 \; v_2) = u_1 v_1 + u_2 v_2$
  – Note: the dot product of $u$ with itself $= \text{length}(u)^2 = \lVert u \rVert_2^2$
• Matrix product:
  $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$, $\quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$
  $AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$

Based on slides by Joseph Bradley

Page 42: Linear Regression & Gradient Descent

Linear Algebra Concepts
• Vector products:
  – Dot product: $u \cdot v = u^\top v = (u_1 \; u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$
  – Outer product: $u v^\top = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} (v_1 \; v_2) = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$

Based on slides by Joseph Bradley
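These operations map directly onto NumPy; a short sketch (the vectors and matrices are arbitrary examples, except u, which reuses the earlier [1, 6, 3, 4]):

```python
import numpy as np

u = np.array([1.0, 6.0, 3.0, 4.0])   # the example vector from above
v = np.array([2.0, 0.0, 1.0, 5.0])   # a second vector, chosen arbitrarily

print(u @ v)                          # dot product u . v
print(np.outer(u, v).shape)           # outer product u v^T is a 4 x 4 matrix
print(u @ u, np.linalg.norm(u) ** 2)  # ||u||_2^2 equals the dot product of u with itself

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(A @ B)                          # matrix product AB
print((A @ B).T - B.T @ A.T)          # (AB)^T = B^T A^T, so this is the zero matrix
```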

Page 43: Linear Regression & Gradient Descent

Vectorization
• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model: $h(x) = \sum_{j=0}^{d} \theta_j x_j$
• Let $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$, $\quad x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$
• Can write the model in vectorized form as $h(x) = \theta^\top x$

Page 44: Linear Regression & Gradient Descent

Vectorization
• Consider our model for n instances: $h\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$
• Let
  $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}$, $\quad X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$
• Can write the model in vectorized form as $h_\theta(x) = X\theta$

Page 45: Linear Regression & Gradient Descent

Vectorization
• For the linear regression cost function:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$

  where $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, and $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$
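A quick numerical check that the vectorized expression matches the element-wise sum, on randomly generated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, d))])
y = rng.standard_normal(n)
theta = rng.standard_normal(d + 1)

loop_cost = sum((X[i] @ theta - y[i]) ** 2 for i in range(n)) / (2 * n)
vectorized_cost = (X @ theta - y) @ (X @ theta - y) / (2 * n)
assert np.isclose(loop_cost, vectorized_cost)
```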

Page 46: Linear Regression & Gradient Descent

Closed Form Solution
• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
  $J(\theta) = \frac{1}{2n} (X\theta - y)^\top (X\theta - y) \propto \theta^\top X^\top X \theta - y^\top X \theta - \theta^\top X^\top y + y^\top y \propto \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y$
  ($y^\top X \theta$ is a $1 \times 1$ quantity, so it equals its transpose $\theta^\top X^\top y$ and the two middle terms combine)

  Take the derivative, set it equal to 0, then solve for $\theta$:
  $\frac{\partial}{\partial \theta} \left( \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y \right) = 0$
  $(X^\top X)\theta - X^\top y = 0$
  $(X^\top X)\theta = X^\top y$
  $\theta = (X^\top X)^{-1} X^\top y$

Page 47: Linear Regression & Gradient Descent

Closed Form Solution
• Can obtain $\theta$ by simply plugging $X$ and $y$ into $\theta = (X^\top X)^{-1} X^\top y$, where
  $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$, $\quad X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix}$
• If $X^\top X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a))
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that $d \le n$
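A sketch of this closed form in NumPy. Solving the normal equations with np.linalg.solve (or falling back to the pseudo-inverse, as suggested above) avoids forming the inverse explicitly:

```python
import numpy as np

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_with_pseudo_inverse(X, y):
    # Fallback when X^T X is singular: use the pseudo-inverse instead
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)
```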

Page 48: Linear Regression & Gradient Descent

Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n is large
  – Computing $(X^\top X)^{-1}$ is roughly $O(n^3)$

Page 49: Linear Regression & Gradient Descent

Improving Learning: Feature Scaling
• Idea: ensure that features have similar scales
• Makes gradient descent converge much faster

[Figure: contours of $J(\theta)$ over $\theta_1, \theta_2$ before and after feature scaling]

Page 50: Linear Regression & Gradient Descent

Feature Standardization
• Rescales features to have zero mean and unit variance (a sketch follows below)
  – Let $\mu_j$ be the mean of feature j: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
  – Replace each value $x_j^{(i)}$ with $\frac{x_j^{(i)} - \mu_j}{s_j}$ for $j = 1 \ldots d$ (not $x_0$!)
    • $s_j$ is the standard deviation of feature j
    • Could also use the range of feature j ($\max_j - \min_j$) for $s_j$
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
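A minimal NumPy sketch of this standardization, assuming X includes the bias as column 0 (which is deliberately left untouched); the stored μ and s must be reused at prediction time, as noted above.

```python
import numpy as np

def standardize_fit(X):
    # Compute mu_j and s_j for features j = 1..d (skip the x_0 = 1 column).
    mu = X[:, 1:].mean(axis=0)
    s = X[:, 1:].std(axis=0)
    return mu, s

def standardize_apply(X, mu, s):
    # Replace x_j with (x_j - mu_j) / s_j; apply the SAME mu and s
    # at training and prediction time.
    Z = X.copy()
    Z[:, 1:] = (Z[:, 1:] - mu) / s
    return Z
```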

Page 51: Linear Regression & Gradient Descent

Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples

[Figure: three fits of Price vs. Size: underfitting (high bias), correct fit, and overfitting (high variance)]

Based on example by Andrew Ng

Page 52: Linear Regression & Gradient Descent

Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
  – Can incorporate this penalty into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)

Page 53: Linear Regression & Gradient Descent

Regularization
• Linear regression objective function:
  $J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$
  – $\lambda$ is the regularization parameter ($\lambda \ge 0$)
  – No regularization on $\theta_0$!

Page 54: Linear Regression & Gradient Descent

Understanding Regularization
• Note that $\sum_{j=1}^{d} \theta_j^2 = \lVert \theta_{1:d} \rVert_2^2$
  – This is the squared magnitude of the feature coefficient vector!
• We can also think of this as $\sum_{j=1}^{d} (\theta_j - 0)^2 = \lVert \theta_{1:d} - \vec{0} \rVert_2^2$
• L2 regularization pulls coefficients toward 0

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Page 55: Linear Regression & Gradient Descent

Understanding Regularization
• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?

[Figure: Price vs. Size with a fitted curve]

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Based on example by Andrew Ng

Page 56: Linear Regression & Gradient Descent

Understanding Regularization
• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
  – The coefficients $\theta_1, \ldots, \theta_d$ are pushed to $\approx 0$, leaving only the constant fit $h_\theta(x) = \theta_0$

[Figure: Price vs. Size with the coefficients driven to 0, leaving a flat line]

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Based on example by Andrew Ng

Page 57: Linear Regression & Gradient Descent

Regularized Linear Regression
• Cost function:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
• Fit by solving $\min_\theta J(\theta)$
• Gradient update:
  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$   (from $\frac{\partial}{\partial \theta_0} J(\theta)$; no regularization on $\theta_0$)
  $\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$   (from $\frac{\partial}{\partial \theta_j} J(\theta)$; the $\lambda \theta_j$ term comes from the regularization)
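A sketch of one regularized update in NumPy, following the gradient above; the helper name is ours, and θ0 is excluded from the penalty.

```python
import numpy as np

def regularized_gradient_step(X, y, theta, alpha, lam):
    # One regularized gradient descent step; theta_0 (the bias) is not penalized.
    n = len(y)
    residuals = X @ theta - y
    grad = (X.T @ residuals) / n
    penalty = lam * theta
    penalty[0] = 0.0            # no regularization on theta_0
    return theta - alpha * (grad + penalty)
```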

Page 58: Linear Regression & Gradient Descent

Regularized Linear Regression

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• We can rewrite the gradient step for $\theta_j$ as:
  $\theta_j \leftarrow \theta_j (1 - \alpha\lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$
  so each update first shrinks $\theta_j$ by the factor $(1 - \alpha\lambda)$ and then applies the usual unregularized step
• The $\theta_0$ update is unchanged:
  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$

Page 59: Linear Regression & Gradient Descent

Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^\top y$
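A NumPy sketch of this regularized closed form; the zero in the top-left corner of the modified identity keeps $\theta_0$ unpenalized.

```python
import numpy as np

def fit_regularized_closed_form(X, y, lam):
    # theta = (X^T X + lam * I')^{-1} X^T y, where I' is the identity
    # with its (0, 0) entry zeroed so the bias theta_0 is not regularized.
    d_plus_1 = X.shape[1]
    I_mod = np.eye(d_plus_1)
    I_mod[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * I_mod, X.T @ y)
```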

Page 60: Linear Regression & Gradient Descent

Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^\top y$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for $\lambda > 0$, the inverse in the equation above always exists