TRANSCRIPT
Gradient Descent
Lyle Ungar, University of Pennsylvania
In part from slides written jointly with Zack Ives
Learning objectives
- Know standard, coordinate, stochastic gradient, and mini-batch gradient descent
- Adagrad: core idea
Gradient Descent
- We almost always want to minimize some loss function
- Example: Sum of Squared Errors (SSE):

$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad r_i(\theta) = y^{(i)} - \hat{y}(\boldsymbol{x}^{(i)};\, \theta)$
Mean Squared Error
http://www.math.uah.edu/stat/expect/Variance.html
In one dimension, SSE looks like a parabola centered around the optimal value μ
(Generalizes to d dimensions)
$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2$
Getting Closer
http://www.math.uah.edu/stat/expect/Variance.html
Current value θ
What if we use the slope of the tangent to decide where to "go next"?

The gradient:
$\nabla \mathrm{SSE}(\theta) = \lim_{d \to 0} \frac{\mathrm{SSE}(\theta + d) - \mathrm{SSE}(\theta)}{d}$
theory.stanford.edu/~tim/s15/l/l15.pdf
$\theta := \theta - \eta\, \nabla \mathrm{SSE}(\theta)$
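As a concrete illustration (a sketch of my own, not code from the lecture), the update θ := θ − η ∇SSE(θ) can be run with a finite-difference estimate of the gradient, mirroring the limit definition above. The toy data, the step size eta, and the helper names sse and numerical_gradient are assumptions for the sketch.

```python
import numpy as np

def sse(theta, X, y):
    """Mean of squared residuals, matching SSE(theta) = (1/n) * sum_i r_i(theta)^2."""
    r = y - X @ theta
    return np.mean(r ** 2)

def numerical_gradient(f, theta, d=1e-6):
    """Central-difference estimate of the gradient, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = d
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * d)
    return grad

# Toy data (illustrative): y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

theta = np.zeros(2)
eta = 0.1                       # step size (assumed value)
for _ in range(200):
    grad = numerical_gradient(lambda th: sse(th, X, y), theta)
    theta = theta - eta * grad  # theta := theta - eta * grad SSE(theta)
print(theta)                    # should approach [3, -2]
```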
Getting Closer
We can compute the gradient numerically… but it is often better to compute it analytically (calculus)!

Starting from $\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2$, the gradient is

$\nabla \mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}\, r_i(\theta)^2 = \frac{1}{n}\sum_{i=1}^{n} 2\, r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta}$

For a linear model $\hat{y} = \theta^{\top}\boldsymbol{x}$, $\frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$ (note the sign, since $r_i = y^{(i)} - \hat{y}$), so

$\nabla \mathrm{SSE}(\theta) = -\frac{2}{n}\sum_{i=1}^{n} r_i(\theta)\, \boldsymbol{x}^{(i)}$

With step size η:
$\theta := \theta - \eta\, \nabla \mathrm{MSE}(\theta)$
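For comparison (again an illustrative sketch, not the lecture's code), full-batch gradient descent with the analytic gradient derived above; the synthetic data, learning rate, and iteration count are assumed values.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=500):
    """Full-batch gradient descent for linear regression with MSE loss."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        r = y - X @ theta                 # residuals r_i(theta) = y_i - x_i . theta
        grad = -(2.0 / n) * X.T @ r       # grad of (1/n) sum r_i^2; minus sign from d r_i / d theta = -x_i
        theta -= eta * grad               # theta := theta - eta * grad MSE(theta)
    return theta

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -0.5, 2.0])
y = X @ true_theta + 0.05 * rng.normal(size=200)
print(batch_gradient_descent(X, y))       # approaches true_theta
```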
Key questions
- How big a step η to take?
  - Too small and it takes a long time
  - Too big and it will be unstable
- "Optimal": scale η ~ 1/sqrt(iteration)
- Adaptive (a simple version; see the sketch below)
  - E.g. each time, increase the step size by 10%
    - If the error ever increases, cut the step size in half

$\theta := \theta - \eta\, \nabla \mathrm{SSE}(\theta)$
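A minimal sketch of the adaptive rule just described (my own illustration, under the added assumption that a step which raises the error is also undone): grow η by 10% after each step that lowers the error, and halve η whenever the error increases.

```python
import numpy as np

def adaptive_gd(grad_fn, loss_fn, theta, eta=0.01, n_iters=200):
    """Gradient descent with a simple adaptive step size:
    grow eta by 10% after a successful step; if the loss increases, halve eta and retry."""
    loss = loss_fn(theta)
    for _ in range(n_iters):
        candidate = theta - eta * grad_fn(theta)
        new_loss = loss_fn(candidate)
        if new_loss <= loss:           # step helped: accept it and grow the step size
            theta, loss = candidate, new_loss
            eta *= 1.1
        else:                          # step hurt: reject it and shrink the step size
            eta *= 0.5
    return theta, eta

# Illustrative quadratic: loss(theta) = ||theta - 3||^2
theta, eta = adaptive_gd(grad_fn=lambda t: 2 * (t - 3.0),
                         loss_fn=lambda t: np.sum((t - 3.0) ** 2),
                         theta=np.zeros(2))
print(theta)   # near [3, 3]
```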
For ‖w‖₁ penalties or ‖y − ŷ‖₁ losses, use coordinate descent
https://en.wikipedia.org/wiki/Coordinate_descent

Repeat:
  For j = 1:p
    θ_j := θ_j − η · ∂Err/∂θ_j
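A hedged sketch of this coordinate-descent loop: cycle over j = 1..p and take a small gradient step in one coordinate at a time. The numerical partial derivative, step size, and toy least-squares problem are illustrative; for L1 problems such as the lasso, each coordinate update typically has a closed form (soft-thresholding) instead.

```python
import numpy as np

def coordinate_descent(loss_fn, theta, eta=0.1, n_sweeps=100, d=1e-6):
    """Cycle through coordinates j = 1..p, taking a gradient step in one coordinate at a time."""
    theta = theta.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(len(theta)):
            # Numerical partial derivative dErr/dtheta_j (central difference)
            e = np.zeros_like(theta)
            e[j] = d
            partial = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * d)
            theta[j] -= eta * partial   # theta_j := theta_j - eta * dErr/dtheta_j
    return theta

# Illustrative usage: minimize mean squared error ||y - X theta||^2 / n
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=150)
loss = lambda th: np.mean((y - X @ th) ** 2)
print(coordinate_descent(loss, np.zeros(4)))
```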
[Figure: elastic net parameter search; size of coefficients vs. regularization penalty (inverse). Zou and Hastie]
Stochastic Gradient Descent
- If we have a very large data set, update the model after observing each single observation
  - "online" or "streaming" learning

$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla \mathrm{SSE}_i(\theta) = \frac{d}{d\theta}\, r_i(\theta)^2$

$\theta := \theta - \eta\, \nabla \mathrm{SSE}_i(\theta)$
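A sketch of stochastic gradient descent for the squared-error setting above, with one update per observation; the random shuffling, the number of epochs, and the decaying step size η_t ∝ 1/sqrt(t) (echoing the earlier rule of thumb) are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def sgd(X, y, eta0=0.05, n_epochs=10, seed=0):
    """Stochastic gradient descent: one update per observation ("online" learning)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):          # visit observations in random order
            t += 1
            eta = eta0 / np.sqrt(t)           # step size ~ 1/sqrt(iteration)
            r_i = y[i] - X[i] @ theta         # residual for this single observation
            grad_i = -2.0 * r_i * X[i]        # gradient of r_i(theta)^2
            theta -= eta * grad_i             # theta := theta - eta * grad SSE_i(theta)
    return theta

# Illustrative usage
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(sgd(X, y))
```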
Mini-batch
- Update the model every k observations
  - Batch size k (e.g. 50)
- More efficient than pure stochastic gradient or full gradient descent

$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla \mathrm{SSE}_k(\theta) = \frac{1}{k}\sum_{i=1}^{k} \frac{d}{d\theta}\, r_i(\theta)^2$

$\theta := \theta - \eta\, \nabla \mathrm{SSE}_k(\theta)$
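The mini-batch variant differs only in averaging the gradient over k observations before each update; a sketch under the same illustrative assumptions (batch size k = 50, fixed η, synthetic data).

```python
import numpy as np

def minibatch_gd(X, y, k=50, eta=0.05, n_epochs=20, seed=0):
    """Mini-batch gradient descent: update theta every k observations."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, k):
            batch = order[start:start + k]
            Xb, yb = X[batch], y[batch]
            r = yb - Xb @ theta                    # residuals on the mini-batch
            grad = -(2.0 / len(batch)) * Xb.T @ r  # average gradient over the batch
            theta -= eta * grad                    # theta := theta - eta * grad SSE_k(theta)
    return theta

# Illustrative usage
rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
y = X @ np.array([0.7, -1.2, 2.5]) + 0.1 * rng.normal(size=2000)
print(minibatch_gd(X, y))
```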
Adagrad
- Define a per-feature learning rate for feature j as:

$\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2, \qquad g_{k,j} = \frac{\partial}{\partial \theta_j}\, \mathrm{cost}_\theta(x_k, y_k)$

- G_{t,j} is the sum of squares of the gradients of feature j over time t
- Frequently occurring features in the gradients get small learning rates; rare features get higher ones
- Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features
Adagrad
In practice, add a small constant ζ > 0 to prevent dividing by zero:

$\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{G_{t,j} + \zeta}}\, g_{t,j}$
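A hedged sketch of the Adagrad update above: accumulate per-coordinate squared gradients G_{t,j} and scale the base learning rate by 1/sqrt(G_{t,j} + ζ). The per-observation squared-error gradients and the hyperparameter values are assumptions for the illustration.

```python
import numpy as np

def adagrad(X, y, eta=0.5, zeta=1e-8, n_epochs=5, seed=0):
    """Adagrad: per-feature learning rates eta / sqrt(G_tj + zeta),
    where G_tj accumulates the squared gradients of coordinate j over time."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    G = np.zeros(p)                                 # running sum of squared gradients per feature
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            r_i = y[i] - X[i] @ theta
            g = -2.0 * r_i * X[i]                   # per-observation gradient g_{t,j}
            G += g ** 2                             # G_{t,j} = sum_k g_{k,j}^2
            theta -= (eta / np.sqrt(G + zeta)) * g  # theta_j <- theta_j - eta / sqrt(G_tj + zeta) * g_tj
    return theta

# Illustrative usage
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.5]) + 0.1 * rng.normal(size=1000)
print(adagrad(X, y))
```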
Recap: Gradient Descent
- "Follow the slope" towards a minimum
  - Analytical or numerical derivative
  - Need to pick the step size
    - Larger = faster convergence, but instability
- Lots of variations
  - Coordinate descent
  - Stochastic gradient descent or mini-batch
- Can get caught in local minima
  - An alternative, simulated annealing, uses randomness