Robbins, Monro: A Stochastic Approximation Method

Robert Bassett

University of California, Davis

Student Run Optimization Seminar, Oct 10, 2017
Motivation

You are a carrot farmer. The amount of carrots that you plant plays a part in how much carrots cost in the store, and hence how much profit you make on your harvest.

- Too many carrots planted → a glut in the market and low profit.
- Too few carrots planted → unexploited demand for carrots and low profit.

The connection between carrots planted and profit is complicated! We will model this as a stochastic optimization problem, in an attempt to deal with the random components of the model (weather, economic climate, transportation costs, etc.).
Goal: Find a carrot planting decision θ* which minimizes your expected loss.

- Model loss as a random function of your planting decision θ:

  Loss ∼ q(x, θ)   (q known)

- x: random vector of external factors, independent of θ:

  x ∼ F(x)   (F unknown)

- Carrot Planter's Problem (CPP):

  min_{θ∈Θ} Q(θ) = E[q(x, θ)]

- q : ℝᵐ × ℝⁿ → ℝ
- Θ ⊆ ℝⁿ, the set of possible decisions.
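To make (CPP) concrete, here is a minimal Monte-Carlo sketch of Q(θ) in Python. The profit model and the distributions of demand and cost are hypothetical, invented purely for illustration; the point of the talk is that we will never need such brute-force estimates of the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(demand, cost, theta):
    """Toy loss (negative profit), hypothetical: a glut (theta > demand)
    pushes the market price down, and every planted carrot costs money."""
    price = np.maximum(demand - theta, 0.1)      # market price per carrot
    profit = price * np.minimum(theta, demand) - cost * theta
    return -profit

def Q_hat(theta, n=100_000):
    """Monte-Carlo estimate of Q(theta) = E[q(x, theta)], x = (demand, cost)."""
    demand = rng.normal(10.0, 2.0, n)            # assumed demand distribution
    cost = rng.uniform(0.5, 1.5, n)              # assumed cost distribution
    return q(demand, cost, theta).mean()

for theta in (2.0, 5.0, 8.0, 11.0):
    print(theta, Q_hat(theta))                   # expected loss per decision
```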
min_{θ∈Θ} Q(θ) = E[q(x, θ)]   (CPP)

Assumptions

1. With probability 1, q(x, θ) is strongly convex in θ, with uniform modulus K. That is,

   ⟨θ1 − θ2, ∇q(x, θ1) − ∇q(x, θ2)⟩ ≥ K‖θ1 − θ2‖²   x-a.s.

   (checked numerically below).

2. We have access to a random variable y ∼ P(y, θ) with the property that

   E[y | θ] = ∇Q(θ).

3. y has second moment uniformly bounded in θ:

   ∫_Y ‖y‖² dP(y, θ) ≤ C²   ∀θ ∈ Θ.

4. Q is differentiable, so (CPP) is equivalent to solving ∇Q(θ) = 0.
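As a quick sanity check on Assumption 1, this sketch verifies the monotonicity inequality numerically for the squared loss q(x, θ) = ½‖Aθ − x‖² from the next slide, whose modulus is K = λ_min(AᵀA). The matrix A and the test points are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))                 # full column rank almost surely
K = np.linalg.eigvalsh(A.T @ A).min()       # strong-convexity modulus

def grad_q(x, theta):
    """Gradient in theta of q(x, theta) = 0.5 * ||A theta - x||^2."""
    return A.T @ (A @ theta - x)

x = rng.normal(size=8)
for _ in range(5):
    t1, t2 = rng.normal(size=3), rng.normal(size=3)
    lhs = (t1 - t2) @ (grad_q(x, t1) - grad_q(x, t2))
    assert lhs >= K * np.sum((t1 - t2) ** 2) - 1e-9   # Assumption 1 holds
```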
Assumption (2) says we must have access to an unbiased estimator of the gradient.

For many common loss functions this is true, e.g. squared loss:

  q(x, θ) = ½‖Aθ − x‖²,   so   Q(θ) = ½E[‖Aθ − x‖²]

⇒ y = ∇q(x, θ) = Aᵀ(Aθ − x) satisfies

  E[y | θ] = E[∇q(x, θ) | θ] = ∇Q(θ).

The last equality follows from the mild regularity assumptions that allow us to interchange derivative and integral. Know them!
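A quick numerical confirmation of the unbiasedness, under the illustrative assumption x ∼ N(μ, I): the sample mean of the stochastic gradients y = Aᵀ(Aθ − x) should approach the exact gradient ∇Q(θ) = Aᵀ(Aθ − μ).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))
mu = rng.normal(size=5)                   # E[x]; x ~ N(mu, I) is assumed
theta = rng.normal(size=3)

xs = rng.normal(size=(200_000, 5)) + mu   # samples of the external factors
ys = (A @ theta - xs) @ A                 # row i is y_i = A^T (A theta - x_i)

grad_Q = A.T @ (A @ theta - mu)           # exact gradient of Q at theta
print(np.abs(ys.mean(axis=0) - grad_Q).max())  # -> 0 as the sample size grows
```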
- We would like to find a minimizer of Q(θ) without ever attempting to calculate the expectation that defines it.
- Why? It involves random factors that are nasty!
- Instead, we will build a sequence θ_n which depends on random samples from y ∼ P(y, θ) for different θ.
- We hope to generate iterates from samples which allow us to minimize the expectation.
- We want θ_n to be a consistent estimator of θ*. That is,

  ∀ε > 0 ∃N s.t. n ≥ N ⇒ P(‖θ_n − θ*‖² > ε) < ε,

  i.e. θ_n → θ* in probability.
Facts of Life (give references!)

1. L²-convergence implies convergence in probability. That is,

   E[‖θ_n − θ‖²] → 0 ⇒ ‖θ_n − θ‖² → 0 in probability.

2. The gradient of a convex function is a monotone operator. That is, if f : ℝⁿ → ℝ is convex and differentiable, then

   ⟨∇f(x) − ∇f(y), x − y⟩ ≥ 0.

3. If f(ξ, x) is convex in x for almost every ξ, then

   F(x) = E[f(ξ, x)]

   is convex.
Theorem. Let (a_n)_{n≥1} be a positive sequence which satisfies:

- ∑_{n=1}^∞ a_n² < ∞

- ∑_{n=2}^∞ a_n / (a_1 + ⋯ + a_{n−1}) = ∞

For some initial θ_1, define the sequence

  θ_{n+1} = θ_n − a_n y_n,

where y_n ∼ P(y | θ_n). Then θ_n → θ* in probability.
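In code, the iteration of the theorem is a few lines. The sketch below runs it on the least-squares toy problem from earlier, again under the illustrative assumption x ∼ N(μ, I), with step sizes a_n = 1/n, and compares the result to the known minimizer θ* = (AᵀA)⁻¹Aᵀμ.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))
mu = rng.normal(size=5)
theta_star = np.linalg.solve(A.T @ A, A.T @ mu)  # exact minimizer of Q

theta = np.zeros(3)                              # initial theta_1
for n in range(1, 200_001):
    x = rng.normal(size=5) + mu                  # fresh sample of x
    y = A.T @ (A @ theta - x)                    # y_n: unbiased gradient sample
    theta = theta - (1.0 / n) * y                # theta_{n+1} = theta_n - a_n y_n

print(np.linalg.norm(theta - theta_star))        # small, shrinking with n
```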
Proof: Because of F.O.L. 1, it suffices to show L²-convergence, i.e. E[‖θ_n − θ*‖²] → 0.
Define b_n = E[‖θ_n − θ*‖²]. We want to show that b_n → 0. We have:

  b_{n+1} = E[‖θ_{n+1} − θ*‖²]
          = E[ E[‖θ_{n+1} − θ*‖² | θ_n] ]
          = E[ ∫_Y ‖(θ_n − θ*) − a_n y‖² dP(y | θ_n) ]
          = b_n + a_n² E[ ∫_Y ‖y‖² dP(y | θ_n) ] − 2 a_n E[⟨θ_n − θ*, ∇Q(θ_n)⟩]
The second term is bounded by Assumption 3:

  E[ ∫_Y ‖y‖² dP(y | θ_n) ] ≤ C².
The last term is nonnegative: because θ* ∈ argmin_θ Q(θ), ∇Q(θ*) = 0, so

  E[⟨θ_n − θ*, ∇Q(θ_n)⟩] = E[⟨θ_n − θ*, ∇Q(θ_n) − ∇Q(θ*)⟩] ≥ 0

by F.O.L. 2 and 3. Hence

  b_{n+1} ≤ b_n + a_n² C² − 2 a_n E[⟨θ_n − θ*, ∇Q(θ_n)⟩].
Since also b_{n+1} ≥ 0, telescoping from i = 1 to n gives

  0 ≤ b_{n+1} ≤ b_1 + C² ∑_{i=1}^n a_i² − 2 ∑_{i=1}^n a_i E[⟨θ_i − θ*, ∇Q(θ_i)⟩]

⇒

  0 ≤ ∑_{i=1}^n a_i E[⟨θ_i − θ*, ∇Q(θ_i)⟩] ≤ ½ ( b_1 + C² ∑_{i=1}^n a_i² )

The partial sums on the left are nondecreasing and bounded above (since ∑ a_i² < ∞), so the series ∑_i a_i E[⟨θ_i − θ*, ∇Q(θ_i)⟩] converges. Letting n → ∞ in the telescoped identity for b_{n+1} then shows that the limit exists:

  0 ≤ lim_{n→∞} b_n < ∞.
![Page 37: Robbins, Monro: A Stochastic Approximation Method 2017. 10. 13. · Robbins,Monro: A Stochastic Approximation Method Robert Bassett University of California Davis Student Run Optimization](https://reader036.vdocuments.mx/reader036/viewer/2022081619/60fe86ec50c9432ae66292a5/html5/thumbnails/37.jpg)
0 ≤ bn+1 ≤ bn + a2nC
2 − 2anE[〈θn − θ∗,∇Q(θn)〉]
⇒
0 ≤ bn+1 ≤ b0 + C 2n∑
i=0
a2n − 2
n∑i=0
aiE[〈θi − θ∗,∇Q(θi )〉]
⇒
0 ≤n∑
i=0
aiE[〈θi − θ∗,∇Q(θi )〉] ≤1
2
(b0 + C 2
n∑i=1
a2i
)
Taking the limit as n→∞, gives
0 ≤ limn→∞
bn <∞
![Page 38: Robbins, Monro: A Stochastic Approximation Method 2017. 10. 13. · Robbins,Monro: A Stochastic Approximation Method Robert Bassett University of California Davis Student Run Optimization](https://reader036.vdocuments.mx/reader036/viewer/2022081619/60fe86ec50c9432ae66292a5/html5/thumbnails/38.jpg)
0 ≤ bn+1 ≤ bn + a2nC
2 − 2anE[〈θn − θ∗,∇Q(θn)〉]
⇒
0 ≤ bn+1 ≤ b0 + C 2n∑
i=0
a2n − 2
n∑i=0
aiE[〈θi − θ∗,∇Q(θi )〉]
⇒
0 ≤n∑
i=0
aiE[〈θi − θ∗,∇Q(θi )〉] ≤1
2
(b0 + C 2
n∑i=1
a2i
)
Taking the limit as n→∞, gives
0 ≤ limn→∞
bn <∞
![Page 39: Robbins, Monro: A Stochastic Approximation Method 2017. 10. 13. · Robbins,Monro: A Stochastic Approximation Method Robert Bassett University of California Davis Student Run Optimization](https://reader036.vdocuments.mx/reader036/viewer/2022081619/60fe86ec50c9432ae66292a5/html5/thumbnails/39.jpg)
0 ≤ bn+1 ≤ bn + a2nC
2 − 2anE[〈θn − θ∗,∇Q(θn)〉]
⇒
0 ≤ bn+1 ≤ b0 + C 2n∑
i=0
a2n − 2
n∑i=0
aiE[〈θi − θ∗,∇Q(θi )〉]
⇒
0 ≤n∑
i=0
aiE[〈θi − θ∗,∇Q(θi )〉] ≤1
2
(b0 + C 2
n∑i=1
a2i
)
Taking the limit as n→∞, gives
0 ≤ limn→∞
bn <∞
![Page 40: Robbins, Monro: A Stochastic Approximation Method 2017. 10. 13. · Robbins,Monro: A Stochastic Approximation Method Robert Bassett University of California Davis Student Run Optimization](https://reader036.vdocuments.mx/reader036/viewer/2022081619/60fe86ec50c9432ae66292a5/html5/thumbnails/40.jpg)
Great! This shows that lim_{n→∞} b_n exists. But is it equal to 0?

Consider the sequence

  k_n = K / (a_1 + ⋯ + a_{n−1}).

By strong convexity we have

  E[⟨θ_n − θ*, ∇Q(θ_n)⟩] ≥ K E[‖θ_n − θ*‖²] = K b_n ≥ k_n b_n

for large enough n (once a_1 + ⋯ + a_{n−1} ≥ 1).

We proved previously that ∑_n a_n E[⟨θ_n − θ*, ∇Q(θ_n)⟩] converges! Since k_n and b_n are both positive, this gives

  ∑_n a_n k_n b_n < ∞.
By the assumption on a_n, we also have

  ∑_n a_n k_n = K ∑_n a_n / (a_1 + ⋯ + a_{n−1}) = ∞.
We have established:

1. ∑_n a_n k_n b_n < ∞

2. ∑_n a_n k_n = ∞

and that lim_{n→∞} b_n exists. If that limit were some L > 0, then a_n k_n b_n ≥ (L/2) a_n k_n for all large n, and 2. would force the sum in 1. to diverge. So...

  b_n → 0!
So are there any sequences that satisfy the conditions in the theorem?

- ∑_{n=1}^∞ a_n² < ∞

- ∑_{n=2}^∞ a_n / (a_1 + ⋯ + a_{n−1}) = ∞

Take a_n = 1/n. It is square-summable (∑ 1/n² = π²/6 < ∞), and since 1 + 1/2 + ⋯ + 1/(n−1) ≈ ln(n−1),

  ∑_{n=2}^∞ (1/n) / (1 + 1/2 + ⋯ + 1/(n−1)) ≈ ∑ 1/(n ln(n−1)) ≥ ∑ 1/(n ln n) = ∞,

where the last series diverges by the integral test.
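Both conditions are easy to eyeball numerically for a_n = 1/n: the partial sums of a_n² level off near π²/6, while the partial sums of a_n/(a_1 + ⋯ + a_{n−1}) keep growing, roughly like ln ln N, so very slowly.

```python
import numpy as np

N = 1_000_000
a = 1.0 / np.arange(1, N + 1)      # a_n = 1/n
print(np.sum(a ** 2))              # ~ 1.6449 = pi^2 / 6: square-summable

partial = np.cumsum(a)             # a_1 + ... + a_n
ratio = a[1:] / partial[:-1]       # a_n / (a_1 + ... + a_{n-1}), n >= 2
print(np.sum(ratio))               # ~ ln(ln N) + const: grows without bound
```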
Extensions and loose ends

- Actually converges with probability 1 (Blum).
- Convergence with rates:
  - E[Q(θ_n) − Q(θ*)] ∈ O(n^{-1}) (with strong convexity)
  - E[Q(θ_n) − Q(θ*)] ∈ O(n^{-1/2}) (without strong convexity)
- √n (θ_n − θ*) is asymptotically normal (Sacks).
- The n^{-1/2} rate cannot be beaten in the general convex case (Nemirovski et al.).