
  • Linear Regression & Gradient Descent (slide 1)

    Robot Image Credit: Viktoriya Sukhanova © 123RF.com

    These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.

  • Regression (slide 2)

    Given:

    – Data $X = \{x^{(1)}, \dots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$

    – Corresponding labels $y = \{y^{(1)}, \dots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

    [Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]

  • Linear Regression (slide 3)

    • Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$ (assume $x_0 = 1$)

    • Fit model by minimizing sum of squared errors

    [Figures courtesy of Greg Shakhnarovich]
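    To make the hypothesis concrete, here is a minimal NumPy sketch (not from the slides; the helper names `add_bias` and `hypothesis` are illustrative) of $y = \sum_{j=0}^{d}\theta_j x_j$ with the convention $x_0 = 1$:

    ```python
    import numpy as np

    def add_bias(X):
        """Prepend a column of ones so that x_0 = 1 for every example."""
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def hypothesis(theta, X):
        """y = theta_0*x_0 + theta_1*x_1 + ... + theta_d*x_d, for every row of X."""
        return X @ theta

    # Illustrative: n = 3 one-dimensional examples, theta = [theta_0, theta_1]
    X = add_bias(np.array([[1.0], [2.0], [3.0]]))   # shape (3, 2) after adding x_0 = 1
    theta = np.array([0.0, 0.5])
    print(hypothesis(theta, X))                     # [0.5 1.  1.5]
    ```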

  • Least Squares Linear Regression (slide 4)

    • Cost Function: $J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2$

    • Fit by solving $\min_\theta J(\theta)$
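    As an illustrative sketch (not the course's own code), the cost above can be computed in a few vectorized lines; `X` is assumed to already contain the bias column $x_0 = 1$:

    ```python
    import numpy as np

    def cost(theta, X, y):
        """J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
        n = X.shape[0]
        residuals = X @ theta - y      # h_theta(x^(i)) - y^(i) for every example
        return (residuals @ residuals) / (2 * n)
    ```

    The worked examples a few slides below can be reproduced with this function.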

  • Intuition Behind Cost Function (slide 5)

    For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

    $J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2$

    Based on example by Andrew Ng

  • Intuition Behind Cost Function (slide 6)

    For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

    $J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2$

    [Left: $h_\theta(x)$ vs. $x$ — for fixed $\theta$, this is a function of $x$. Right: $J$ as a function of the parameter $\theta_1$.]

    Based on example by Andrew Ng

  • Intuition Behind Cost Function (slide 7)

    For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

    $J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2$

    $J([0, 0.5]) = \frac{1}{2 \times 3}\left[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2\right] \approx 0.58$

    [Left: $h_\theta(x)$ vs. $x$ for $\theta = [0, 0.5]$ — for fixed $\theta$, this is a function of $x$. Right: $J$ as a function of the parameter $\theta_1$.]

    Based on example by Andrew Ng

  • Intuition Behind Cost Function (slide 8)

    For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$

    $J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2$

    $J([0, 0]) \approx 2.333$

    $J(\theta)$ is convex

    [Left: $h_\theta(x)$ vs. $x$ for $\theta = [0, 0]$ — for fixed $\theta$, this is a function of $x$. Right: $J$ as a function of the parameter $\theta_1$.]

    Based on example by Andrew Ng
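    The two cost values above can be checked numerically. This sketch is illustrative and assumes the three training points $(1,1), (2,2), (3,3)$ implied by the arithmetic on the previous slide:

    ```python
    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column, then x = 1, 2, 3
    y = np.array([1.0, 2.0, 3.0])

    def cost(theta, X, y):
        n = X.shape[0]
        r = X @ theta - y
        return (r @ r) / (2 * n)

    print(cost(np.array([0.0, 0.5]), X, y))   # ~0.583
    print(cost(np.array([0.0, 0.0]), X, y))   # ~2.333
    ```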

  • Intuition Behind Cost Function (slide 9)

    A function of a single variable $J(\theta)$ is convex if it is twice differentiable and its second derivative is always nonnegative.
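    To tie this definition to least squares (a short check, not on the slide): holding $\theta_0$ fixed and differentiating $J$ twice with respect to $\theta_1$ gives

    $$\frac{\partial J}{\partial \theta_1} = \frac{1}{n}\sum_{i=1}^{n}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right) x^{(i)},
    \qquad
    \frac{\partial^2 J}{\partial \theta_1^2} = \frac{1}{n}\sum_{i=1}^{n}\left(x^{(i)}\right)^2 \ge 0,$$

    so the second derivative is always nonnegative and $J$ is convex in $\theta_1$.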

  • Intuition Behind Cost Function (slide 10)

    Jensen's Inequality

  • Intuition Behind Cost Function (slides 11–15)

    [Sequence of figures: left panel shows $h_\theta(x)$ for fixed $\theta$ (a function of $x$); right panel shows $J$ as a function of the parameters $\theta$.]

    Slides by Andrew Ng

  • Basic Search Procedure (slides 16–18)

    • Choose initial value for $\theta$

    • Until we reach a minimum:
      – Choose a new value for $\theta$ to reduce $J(\theta)$

    [Figure on each slide: surface plot of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$. Figures by Andrew Ng]

    Since the least squares objective function is convex, we don't need to worry about local minima in linear regression.

  • Gradient Descent (slide 19)

    • Initialize $\theta$

    • Repeat until convergence (simultaneous update, for $j = 0 \dots d$):

      $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

      where $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

    [Figure: $J(\theta)$ plotted against $\theta_1$.]
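    A minimal sketch of the generic update (illustrative only; `grad_J` stands for whatever computes $\partial J / \partial \theta$), emphasizing that the simultaneous update moves every component using the gradient at the old $\theta$:

    ```python
    import numpy as np

    def gradient_descent_step(theta, grad_J, alpha=0.05):
        """One simultaneous update: all theta_j move using the gradient at the old theta."""
        return theta - alpha * grad_J(theta)

    # Illustrative usage on a toy quadratic J(theta) = 0.5 * ||theta||^2 (gradient = theta)
    theta = np.array([3.0, -2.0])
    for _ in range(100):
        theta = gradient_descent_step(theta, grad_J=lambda t: t, alpha=0.05)
    print(theta)   # close to the minimizer [0, 0]
    ```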

  • Gradient Descent (slides 20–23)

    • Initialize $\theta$

    • Repeat until convergence (simultaneous update, for $j = 0 \dots d$):

      $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

    For Linear Regression:

    $$\begin{aligned}
    \frac{\partial}{\partial \theta_j} J(\theta)
      &= \frac{\partial}{\partial \theta_j}\,\frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2 \\
      &= \frac{\partial}{\partial \theta_j}\,\frac{1}{2n}\sum_{i=1}^{n}\left(\sum_{k=0}^{d}\theta_k x_k^{(i)} - y^{(i)}\right)^{\!2} \\
      &= \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=0}^{d}\theta_k x_k^{(i)} - y^{(i)}\right) \times \frac{\partial}{\partial \theta_j}\left(\sum_{k=0}^{d}\theta_k x_k^{(i)} - y^{(i)}\right) \\
      &= \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=0}^{d}\theta_k x_k^{(i)} - y^{(i)}\right) x_j^{(i)} \\
      &= \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right) x_j^{(i)}
    \end{aligned}$$
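    The final line of the derivation is the gradient used in practice. Here is an illustrative sketch (not from the slides) that evaluates it in vectorized form and sanity-checks it against a finite-difference estimate; the data and names are illustrative:

    ```python
    import numpy as np

    def grad_J(theta, X, y):
        """(1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), for all j at once."""
        n = X.shape[0]
        return X.T @ (X @ theta - y) / n

    def cost(theta, X, y):
        n = X.shape[0]
        r = X @ theta - y
        return (r @ r) / (2 * n)

    # Illustrative data: bias column plus one feature.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    theta = np.array([0.0, 0.5])

    # Finite-difference check of the analytic gradient.
    eps = 1e-6
    numeric = np.array([
        (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
        for e in np.eye(len(theta))
    ])
    print(grad_J(theta, X, y), numeric)   # the two should agree closely
    ```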

  • Gradient Descent for Linear Regression (slide 24)

    • Initialize $\theta$

    • Repeat until convergence (simultaneous update, for $j = 0 \dots d$):

      $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right) x_j^{(i)}$

    • Assume convergence when $\lVert \theta_{\mathrm{new}} - \theta_{\mathrm{old}} \rVert_2 < \epsilon$

    L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$
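    Putting the update rule and the convergence test together, a minimal batch gradient descent loop might look like this (an illustrative sketch; `X` must already include the bias column, and the data reuse the earlier three-point example):

    ```python
    import numpy as np

    def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=100_000):
        """Batch gradient descent for least squares; X must include the bias column."""
        n, d = X.shape
        theta = np.zeros(d)                        # initialize theta
        for _ in range(max_iters):
            grad = X.T @ (X @ theta - y) / n       # (1/n) sum_i (h(x^(i)) - y^(i)) x_j^(i)
            theta_new = theta - alpha * grad       # simultaneous update of every theta_j
            if np.linalg.norm(theta_new - theta) < eps:   # ||theta_new - theta_old||_2 < eps
                return theta_new
            theta = theta_new
        return theta

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    print(gradient_descent(X, y))   # approaches [0, 1], i.e. the line y = x
    ```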

  • Gradient Descent (slides 25–33)

    [Sequence of figures: left panel shows the current fit $h_\theta(x)$ for fixed $\theta$ (a function of $x$), starting from $h(x) = -900 - 0.1x$; right panel shows $J$ as a function of the parameters $\theta$, with the current $\theta$ marked at each step of gradient descent.]

    Slides by Andrew Ng

  • Choosing α (slide 34)

    α too small: slow convergence

    α too large: increasing value for $J(\theta)$
    • May overshoot the minimum
    • May fail to converge
    • May even diverge

    To see if gradient descent is working, print out $J(\theta)$ each iteration
    • The value should decrease at each iteration
    • If it doesn't, adjust α
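    One illustrative way to do this monitoring (a sketch, not the course's code):

    ```python
    import numpy as np

    def gradient_descent_verbose(X, y, alpha, iters=50):
        """Run a few iterations and print J(theta) so a bad learning rate is easy to spot."""
        n, d = X.shape
        theta = np.zeros(d)
        for t in range(iters):
            residual = X @ theta - y
            J = (residual @ residual) / (2 * n)
            print(f"iter {t:3d}  J(theta) = {J:.6f}")   # should decrease every iteration
            theta -= alpha * X.T @ residual / n
        return theta

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    gradient_descent_verbose(X, y, alpha=0.1)     # J decreases steadily
    # gradient_descent_verbose(X, y, alpha=1.0)   # too large for this data: J grows
    ```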