TRANSCRIPT
Gradient Descent
Lyle Ungar, University of Pennsylvania
In part from slides written jointly with Zack Ives
Learning objectives
- Know standard, coordinate, stochastic gradient, and mini-batch gradient descent
- Adagrad: core idea
Gradient Descent
- We almost always want to minimize some loss function
- Example: Sum of Squared Errors (SSE):

$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad r_i(\theta) = y^{(i)} - \hat{y}(\boldsymbol{x}^{(i)};\, \theta)$
Mean Squared Error
http://www.math.uah.edu/stat/expect/Variance.html
In one dimension, SSE looks like a parabola centered around the optimal value μ
(Generalizes to d dimensions)
$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2$
Getting Closer
http://www.math.uah.edu/stat/expect/Variance.html
Current value θ
What if we use the slope of the tangent to decide where to "go next"?

The gradient:
$\nabla \mathrm{SSE}(\theta) = \lim_{d \to 0} \frac{\mathrm{SSE}(\theta + d) - \mathrm{SSE}(\theta)}{d}$
theory.stanford.edu/~tim/s15/l/l15.pdf
$\theta := \theta - \eta\, \nabla \mathrm{SSE}(\theta)$
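As a concrete illustration (a sketch of my own, not code from the lecture), the update θ := θ − η ∇SSE(θ) can be run with a finite-difference estimate of the gradient, mirroring the limit definition above. The toy data, the step size eta, and the helper names sse and numerical_gradient are assumptions for the sketch.

```python
import numpy as np

def sse(theta, X, y):
    """Mean of squared residuals, matching SSE(theta) = (1/n) * sum_i r_i(theta)^2."""
    r = y - X @ theta
    return np.mean(r ** 2)

def numerical_gradient(f, theta, d=1e-6):
    """Central-difference estimate of the gradient, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = d
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * d)
    return grad

# Toy data (illustrative): y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

theta = np.zeros(2)
eta = 0.1                       # step size (assumed value)
for _ in range(200):
    grad = numerical_gradient(lambda th: sse(th, X, y), theta)
    theta = theta - eta * grad  # theta := theta - eta * grad SSE(theta)
print(theta)                    # should approach [3, -2]
```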
Getting Closer
We can compute the gradient numerically… but it is often better to compute it analytically (calculus)!

Starting from $\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2$, the gradient is

$\nabla \mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}\, r_i(\theta)^2 = \frac{1}{n}\sum_{i=1}^{n} 2\, r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta}$

For a linear model $\hat{y} = \theta^{\top}\boldsymbol{x}$, $\frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$ (note the sign, since $r_i = y^{(i)} - \hat{y}$), so

$\nabla \mathrm{SSE}(\theta) = -\frac{2}{n}\sum_{i=1}^{n} r_i(\theta)\, \boldsymbol{x}^{(i)}$

With step size η:
$\theta := \theta - \eta\, \nabla \mathrm{MSE}(\theta)$
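For comparison (again an illustrative sketch, not the lecture's code), full-batch gradient descent with the analytic gradient derived above; the synthetic data, learning rate, and iteration count are assumed values.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=500):
    """Full-batch gradient descent for linear regression with MSE loss."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        r = y - X @ theta                 # residuals r_i(theta) = y_i - x_i . theta
        grad = -(2.0 / n) * X.T @ r       # grad of (1/n) sum r_i^2; minus sign from d r_i / d theta = -x_i
        theta -= eta * grad               # theta := theta - eta * grad MSE(theta)
    return theta

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -0.5, 2.0])
y = X @ true_theta + 0.05 * rng.normal(size=200)
print(batch_gradient_descent(X, y))       # approaches true_theta
```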
Key questions
- How big a step η to take?
  - Too small and it takes a long time
  - Too big and it will be unstable
- "Optimal": scale η ~ 1/sqrt(iteration)
- Adaptive (a simple version; see the sketch below)
  - E.g. each time, increase the step size by 10%
    - If the error ever increases, cut the step size in half

$\theta := \theta - \eta\, \nabla \mathrm{SSE}(\theta)$
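A minimal sketch of the adaptive rule just described (my own illustration, under the added assumption that a step which raises the error is also undone): grow η by 10% after each step that lowers the error, and halve η whenever the error increases.

```python
import numpy as np

def adaptive_gd(grad_fn, loss_fn, theta, eta=0.01, n_iters=200):
    """Gradient descent with a simple adaptive step size:
    grow eta by 10% after a successful step; if the loss increases, halve eta and retry."""
    loss = loss_fn(theta)
    for _ in range(n_iters):
        candidate = theta - eta * grad_fn(theta)
        new_loss = loss_fn(candidate)
        if new_loss <= loss:           # step helped: accept it and grow the step size
            theta, loss = candidate, new_loss
            eta *= 1.1
        else:                          # step hurt: reject it and shrink the step size
            eta *= 0.5
    return theta, eta

# Illustrative quadratic: loss(theta) = ||theta - 3||^2
theta, eta = adaptive_gd(grad_fn=lambda t: 2 * (t - 3.0),
                         loss_fn=lambda t: np.sum((t - 3.0) ** 2),
                         theta=np.zeros(2))
print(theta)   # near [3, 3]
```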
For ‖w‖₁ penalties or ‖y − ŷ‖₁ losses, use coordinate descent
https://en.wikipedia.org/wiki/Coordinate_descent

Repeat:
  For j = 1:p
    θ_j := θ_j − η · ∂Err/∂θ_j
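A hedged sketch of this coordinate-descent loop: cycle over j = 1..p and take a small gradient step in one coordinate at a time. The numerical partial derivative, step size, and toy least-squares problem are illustrative; for L1 problems such as the lasso, each coordinate update typically has a closed form (soft-thresholding) instead.

```python
import numpy as np

def coordinate_descent(loss_fn, theta, eta=0.1, n_sweeps=100, d=1e-6):
    """Cycle through coordinates j = 1..p, taking a gradient step in one coordinate at a time."""
    theta = theta.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(len(theta)):
            # Numerical partial derivative dErr/dtheta_j (central difference)
            e = np.zeros_like(theta)
            e[j] = d
            partial = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * d)
            theta[j] -= eta * partial   # theta_j := theta_j - eta * dErr/dtheta_j
    return theta

# Illustrative usage: minimize mean squared error ||y - X theta||^2 / n
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=150)
loss = lambda th: np.mean((y - X @ th) ** 2)
print(coordinate_descent(loss, np.zeros(4)))
```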
[Figure: elastic net parameter search; size of coefficients vs. regularization penalty (inverse). Zou and Hastie]
Stochastic Gradient Descent
- If we have a very large data set, update the model after observing each single observation
  - "online" or "streaming" learning

$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla \mathrm{SSE}_i(\theta) = \frac{d}{d\theta}\, r_i(\theta)^2$

$\theta := \theta - \eta\, \nabla \mathrm{SSE}_i(\theta)$
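A sketch of stochastic gradient descent for the squared-error setting above, with one update per observation; the random shuffling, the number of epochs, and the decaying step size η_t ∝ 1/sqrt(t) (echoing the earlier rule of thumb) are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def sgd(X, y, eta0=0.05, n_epochs=10, seed=0):
    """Stochastic gradient descent: one update per observation ("online" learning)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):          # visit observations in random order
            t += 1
            eta = eta0 / np.sqrt(t)           # step size ~ 1/sqrt(iteration)
            r_i = y[i] - X[i] @ theta         # residual for this single observation
            grad_i = -2.0 * r_i * X[i]        # gradient of r_i(theta)^2
            theta -= eta * grad_i             # theta := theta - eta * grad SSE_i(theta)
    return theta

# Illustrative usage
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(sgd(X, y))
```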
Mini-batch
- Update the model every k observations
  - Batch size k (e.g. 50)
- More efficient than pure stochastic gradient or full gradient descent

$\mathrm{SSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla \mathrm{SSE}_k(\theta) = \frac{1}{k}\sum_{i=1}^{k} \frac{d}{d\theta}\, r_i(\theta)^2$

$\theta := \theta - \eta\, \nabla \mathrm{SSE}_k(\theta)$
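The mini-batch variant differs only in averaging the gradient over k observations before each update; a sketch under the same illustrative assumptions (batch size k = 50, fixed η, synthetic data).

```python
import numpy as np

def minibatch_gd(X, y, k=50, eta=0.05, n_epochs=20, seed=0):
    """Mini-batch gradient descent: update theta every k observations."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, k):
            batch = order[start:start + k]
            Xb, yb = X[batch], y[batch]
            r = yb - Xb @ theta                    # residuals on the mini-batch
            grad = -(2.0 / len(batch)) * Xb.T @ r  # average gradient over the batch
            theta -= eta * grad                    # theta := theta - eta * grad SSE_k(theta)
    return theta

# Illustrative usage
rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
y = X @ np.array([0.7, -1.2, 2.5]) + 0.1 * rng.normal(size=2000)
print(minibatch_gd(X, y))
```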
Adagrad
- Define a per-feature learning rate for feature j as:

$\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2, \qquad g_{k,j} = \frac{\partial}{\partial \theta_j}\, \mathrm{cost}_\theta(x_k, y_k)$

- G_{t,j} is the sum of squares of the gradients of feature j over time t
- Frequently occurring features in the gradients get small learning rates; rare features get higher ones
- Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features
Adagrad
In practice, add a small constant ζ > 0 to prevent dividing by zero:

$\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{G_{t,j} + \zeta}}\, g_{t,j}$
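A hedged sketch of the Adagrad update above: accumulate per-coordinate squared gradients G_{t,j} and scale the base learning rate by 1/sqrt(G_{t,j} + ζ). The per-observation squared-error gradients and the hyperparameter values are assumptions for the illustration.

```python
import numpy as np

def adagrad(X, y, eta=0.5, zeta=1e-8, n_epochs=5, seed=0):
    """Adagrad: per-feature learning rates eta / sqrt(G_tj + zeta),
    where G_tj accumulates the squared gradients of coordinate j over time."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    G = np.zeros(p)                                 # running sum of squared gradients per feature
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            r_i = y[i] - X[i] @ theta
            g = -2.0 * r_i * X[i]                   # per-observation gradient g_{t,j}
            G += g ** 2                             # G_{t,j} = sum_k g_{k,j}^2
            theta -= (eta / np.sqrt(G + zeta)) * g  # theta_j <- theta_j - eta / sqrt(G_tj + zeta) * g_tj
    return theta

# Illustrative usage
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.5]) + 0.1 * rng.normal(size=1000)
print(adagrad(X, y))
```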
Recap: Gradient Descent
- "Follow the slope" towards a minimum
  - Analytical or numerical derivative
  - Need to pick the step size
    - Larger = faster convergence, but instability
- Lots of variations
  - Coordinate descent
  - Stochastic gradient descent or mini-batch
- Can get caught in local minima
  - An alternative, simulated annealing, uses randomness