
Page 1:

Robust inversion and randomized sampling

Michael P. Friedlander, University of British Columbia

International Symposium on Mathematical Programming, August 19–24, 2012

Collaborators: Sasha Aravkin, Felix Herrmann, Tristan van Leeuwen, and Mark Schmidt

Page 2:

Model problem

minimize_x   f(x) := (1/m) ∑_{i=1}^m f_i(x)

Examples:
• least squares: f_i(x) = (a_i^T x − b_i)², so f(x) = ‖Ax − b‖²
• log likelihood: f_i(x) = −log p(b_i; x)
• sample average: f_i(x) = f(x, ω_i), so f(x) ≈ E_ω[f(x, ω)]

Context:
• m large
• each f_i(x) and ∇f_i(x) expensive to evaluate

Computing costs:
• count f_i/∇f_i evaluations, not f/∇f evaluations
• minimize passes through the full data set (f_1, . . . , f_m)
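To make this accounting concrete, here is a minimal Python sketch (not from the talk) that evaluates f and ∇f as averages of the components f_i, so that one call costs m component evaluations, i.e., one full pass through the data. The least-squares instance below is hypothetical.

```python
import numpy as np

def full_objective_and_gradient(x, fs, grads):
    """Evaluate f(x) = (1/m) sum_i f_i(x) and its gradient.

    fs and grads hold callables for the components f_i and their gradients;
    one call here costs m component evaluations (one full pass).
    """
    m = len(fs)
    f_val = sum(fi(x) for fi in fs) / m
    g_val = sum(gi(x) for gi in grads) / m
    return f_val, g_val

# Hypothetical least-squares instance: f_i(x) = (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
fs = [lambda x, a=a, bi=bi: (a @ x - bi) ** 2 for a, bi in zip(A, b)]
grads = [lambda x, a=a, bi=bi: 2 * (a @ x - bi) * a for a, bi in zip(A, b)]
print(full_objective_and_gradient(np.zeros(5), fs, grads))
```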


Page 3:

SEISMIC INVERSION

Page 4:

Reflection seismology


Page 5:

Full waveform inversion: velocity models

[Figure: two velocity models; axes lateral distance (km) by depth (km), with color scales of roughly 2000–5500 m/s and 2–4 km/s.]

Page 6:

Full waveform inversion

Each of m experiments yields a vector of measurements:
sources: q_1, . . . , q_m
measurements: d_1, . . . , d_m

Forward problem: given m, the velocity on the grid, predict the (frequency-domain) data via the Helmholtz PDE:

(ω² m + ∇²) u = q,   d = P u

• q: known source
• m: known velocity
• ω: frequency
• u: predicted wavefield (on the entire grid)
• P: restriction to the surface (where data is observed)
• d: predicted data

1 source, 1 frequency:

minimize_{x,u}   ‖d − Pu‖²   subject to   H_ω(x) u = q

All sources, all frequencies (e.g., 1k sources, ∼10 frequencies):

minimize_x   ∑_{i=1}^m ∑_{ω∈Ω} ‖d_i − P H_ω(x)^{-1} q_i‖²

Main cost is the solution of the Helmholtz equation for each (i, ω) pair:

H_ω(x) u = q_i
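As a concrete (and heavily simplified) illustration of this cost structure, the following sketch evaluates the misfit with one linear solve per (i, ω) pair. The function helmholtz_like is a hypothetical stand-in for the discretized Helmholtz operator H_ω(x), not the actual solver used in these experiments.

```python
import numpy as np

def helmholtz_like(x, omega):
    """Toy stand-in for the discretized Helmholtz operator H_omega(x).

    A real implementation would assemble (omega^2 * diag(x) + Laplacian) on a
    grid; a small well-conditioned matrix keeps this sketch runnable.
    """
    n = x.size
    return omega**2 * np.diag(x) + 2.0 * np.eye(n)

def fwi_misfit(x, sources, data, P, freqs):
    """Sum of ||d_i - P H_omega(x)^{-1} q_i||^2 over all (i, omega) pairs."""
    misfit, solves = 0.0, 0
    for omega in freqs:
        H = helmholtz_like(x, omega)
        for q_i, d_i in zip(sources, data):
            u = np.linalg.solve(H, q_i)          # one "PDE solve" per (i, omega)
            misfit += np.sum((d_i - P @ u) ** 2)
            solves += 1
    return misfit, solves

rng = np.random.default_rng(1)
n, n_src, n_rec = 20, 5, 4
x = np.ones(n)
P = np.eye(n)[:n_rec]                            # restriction to "surface" entries
sources = [rng.standard_normal(n) for _ in range(n_src)]
data = [P @ rng.standard_normal(n) for _ in range(n_src)]
val, solves = fwi_misfit(x, sources, data, P, freqs=[1.0, 2.0])
print(val, solves)                               # solves == n_src * len(freqs)
```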


Page 7:

Dimensionality reduction and stochastic optimization

Use all your data:

f(x) = ∑_{i=1}^m ‖r_i(x)‖²   with   r_i(x) = d_i − F(x) q_i

Dimensionality reduction: form s weighted averages (s ≪ m)

d̄_j := ∑_{i=1}^m w_{ij} d_i   and   q̄_j := ∑_{i=1}^m w_{ij} q_i,   j = 1, . . . , s

Stochastic approximation of the misfit:

f̄(x) = ∑_{j=1}^s ‖r̄_j(x)‖²   with   r̄_j(x) = d̄_j − F(x) q̄_j

Stochastic optimization interpretation holds if E[WW^T] = I:

E[f̄(x)] = f(x)   and   E[∇f̄(x)] = ∇f(x)
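A minimal sketch of the mixing step under these assumptions (my own illustration; here W is Gaussian with entries scaled so that E[WW^T] = I, one of several valid choices):

```python
import numpy as np

def mix_experiments(D, Q, s, rng):
    """Form s weighted averages of the m columns of data D and sources Q.

    W has iid N(0, 1/s) entries, so E[W W^T] = I, which is the moment
    condition used on this slide.
    """
    m = D.shape[1]
    W = rng.standard_normal((m, s)) / np.sqrt(s)
    return D @ W, Q @ W                  # reduced data and reduced sources

rng = np.random.default_rng(0)
m, s, n_rec, n = 200, 10, 4, 20
D = rng.standard_normal((n_rec, m))      # columns d_1, ..., d_m
Q = rng.standard_normal((n, m))          # columns q_1, ..., q_m
D_bar, Q_bar = mix_experiments(D, Q, s, rng)
print(D_bar.shape, Q_bar.shape)          # (4, 10) (20, 10)
```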


Page 8:

Stochastic trace estimation

Least-squares misfit is a matrix trace:

f(x) = ∑_{i=1}^m ‖r_i‖² = trace(R^T R)   with   R = [ r_1 | · · · | r_m ]

Data mixing (dimensionality reduction) is equivalent to

R̄ = R W,

which gives the sample-average function

f̄(x) = ∑_{j=1}^s ‖r̄_j‖² = trace(R̄^T R̄)   with   R̄ = [ r̄_1 | · · · | r̄_s ],   s ≪ m

Connected to stochastic trace estimation:
• Hutchinson ('90): minimize var[trace(R̄^T R̄)] with W ∼ Rademacher
• Avron & Toledo ('11): other optimal choices for W
• Haber, Chung, Herrmann ('12): connection to inverse problems
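For reference, a minimal sketch of a Hutchinson-style estimate of trace(R^T R) from s Rademacher-mixed columns (my own illustration, using a random residual matrix rather than seismic data):

```python
import numpy as np

def hutchinson_trace(R, s, rng):
    """Estimate trace(R^T R) = ||R||_F^2 from s Rademacher-mixed columns.

    Each probe w has +/-1 entries and E[||R w||^2] = trace(R^T R), so the
    average over s probes is an unbiased estimate of the full misfit.
    """
    m = R.shape[1]
    W = rng.choice([-1.0, 1.0], size=(m, s))
    R_bar = R @ W                        # mixed residuals, s << m columns
    return np.sum(R_bar**2) / s

rng = np.random.default_rng(0)
R = rng.standard_normal((50, 500))       # residual matrix [ r_1 | ... | r_m ]
print(np.sum(R**2), hutchinson_trace(R, s=20, rng=rng))
```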


Page 9:

Nonlinear least-squares with corrupted data

[Figure: two seismic data panels, good data and 4% bad data.]

collaboration with Total


Page 10:

Robust misfit measures

d_i = F(x) q_i + ε

[Figure: density functions (left) and penalty functions (right) for the Normal (½x²), Laplace (|x|), and Student's t (log(k + x²)) models.]

• Least-squares penalty assumes ε ∼ N (0, 1)

• Theorem: log-concave densities, i.e., those induced by convex penalties, are all exponentially bounded:

  prob(|x| > t + ∆t given |x| > t) = O(e^{−∆t})

• Heavy-tailed densities: robust to wrong data and approximate models
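A small sketch of the three penalties compared here, applied elementwise to residuals; k plays the role of the Student's t parameter on this slide, and the example residual vector is hypothetical:

```python
import numpy as np

def penalty(r, kind="normal", k=1.0):
    """Elementwise misfit penalties: 0.5*r^2 (normal), |r| (Laplace),
    log(k + r^2) (Student's t-like, heavy-tailed)."""
    r = np.asarray(r, dtype=float)
    if kind == "normal":
        return 0.5 * r**2
    if kind == "laplace":
        return np.abs(r)
    if kind == "student":
        return np.log(k + r**2)
    raise ValueError(kind)

r = np.array([0.1, 1.0, 10.0, 100.0])        # one large outlier residual
for kind in ("normal", "laplace", "student"):
    # the outlier contributes far more under the quadratic penalty
    print(kind, penalty(r, kind).round(2))
```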

Page 11:

Robust full-waveform inversion results

• Recover velocity on a 2D grid: 201 × 301
• 151 sources, 6 frequencies: 906 PDE solves per f/∇f evaluation
• 50% corrupted data

[Figure: inversion result.]

Page 12:

minimize_x   ρ(R(x))   with   R(x) = D − F(x) Q

Normal: ‖R(x)‖²_F     Laplace: ∑_{ij} |R_{ij}(x)|     Student's t: ∑_{ij} log(k + R_{ij}(x)²)

[Figure: inversion results for the Normal, Laplace, and Student's t penalties.]

Page 13:

Sampling strategies for dimensionality reduction

Generic inverse problem (assuming iid observations):

min_x   f(x) := (1/m) ∑_{i=1}^m ρ(r_i)   with   R(x) = [ r_1 | · · · | r_m ]

Moment condition E[WW^T] = I is not generally sufficient to guarantee

E[f̄(x)] = f(x)   and   E[∇f̄(x)] = ∇f(x)   with   R̄(x) = R(x) W

Random subset selection without replacement, i.e.,

R̄(x) = [ r_{i(1)} | r_{i(2)} | · · · | r_{i(s)} ],

does yield the desired "expected objective" property.
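A minimal Monte Carlo sketch of subset selection without replacement and the resulting unbiasedness of the subsampled objective (my own illustration, with a generic robust penalty ρ standing in for the ones above):

```python
import numpy as np

def subsampled_objective(residuals, s, rho, rng):
    """Average rho(r_i) over a size-s subset drawn without replacement."""
    idx = rng.choice(len(residuals), size=s, replace=False)
    return np.mean([rho(residuals[i]) for i in idx])

rng = np.random.default_rng(0)
residuals = rng.standard_normal((100, 5))          # r_1, ..., r_m
rho = lambda r: np.log(1.0 + r @ r)                # a generic (robust) penalty
full = np.mean([rho(r) for r in residuals])
subs = np.mean([subsampled_objective(residuals, 10, rho, rng)
                for _ in range(5000)])
print(full, subs)   # the subsampled average matches the full objective in expectation
```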


Page 14:

MODEL PROBLEM

minimize_x   f(x) := (1/m) ∑_{i=1}^m f_i(x)


Page 15:

Complexity of steepest descent

Baseline Algorithm:

x_{k+1} ← x_k − α_k ∇f(x_k),   α_k ≡ 1/L

Assume: convex f; Lipschitz ∇f with parameter L

Sublinear rate:
• f(x_k) − f(x_*) = O(1/k) (constant stepsize)
• f(x_k) − f(x_*) = O(1/k²) (optimal rate with extrapolation)

[Nesterov '83; Tseng '10]

Linear rate: additionally assume that f is strongly convex with parameter µ
• f(x_k) − f(x_*) = O([1 − µ/L]^k)

Note: if f is twice differentiable, µI ⪯ ∇²f(x) ⪯ LI


Page 16:

Incremental gradient methods

Algorithm: x_{k+1} ← x_k − α ∇f_i(x_k),   i ∈ {1, . . . , m} chosen cyclically or at random

Assume: f strongly convex (µ) and Lipschitz ∇f (L)

Constant stepsize: α_k ≡ α
• ‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(m²α)   (k full cycles)
• E‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(mα)   (k iterations)

Decreasing stepsize: ∑_k α_k = ∞, ∑_k α_k² < ∞
• ‖x_k − x_*‖² = O(1/k)   (k full cycles)
• E‖x_k − x_*‖² = O(1/k)   (k iterations)

Many variations: Luo/Tseng '94; Nedic/Bertsekas '00/'10; Blatt et al. '08; Bottou '10
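A minimal sketch of a randomized incremental-gradient iteration for the strongly convex least-squares case, with constant and decreasing stepsize variants (my own toy example, not the talk's code):

```python
import numpy as np

def incremental_gradient(A, b, alpha=None, n_iters=20000, seed=0):
    """x_{k+1} = x_k - alpha_k * grad f_i(x_k), with i drawn uniformly at random.

    Components are f_i(x) = 0.5*(a_i^T x - b_i)^2.  If alpha is None, use the
    decreasing schedule alpha_k = 1/(k+10) (divergent sum, summable squares);
    otherwise keep the stepsize constant.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for k in range(n_iters):
        i = rng.integers(m)
        g = (A[i] @ x - b[i]) * A[i]
        step = alpha if alpha is not None else 1.0 / (k + 10)
        x = x - step * g
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
b = A @ np.ones(5) + 0.01 * rng.standard_normal(200)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
for x in (incremental_gradient(A, b, alpha=1e-3), incremental_gradient(A, b)):
    # constant step: error plateaus; decreasing step: error keeps shrinking
    print(np.linalg.norm(x - x_star))
```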


Page 17:

EXAMPLES

Page 18:

Seismic inversion

Recover image of geological structures via nonlinear least squares:

minimize_x   ∑_{i=1}^m ∑_{ω∈Ω} ‖d_i − P H_ω(x)^{-1} q_i‖²

Observations: Each of m "shots" is an experiment:

sources: q_1, . . . , q_m,   measurements: d_1, . . . , d_m

[Figure: velocity model, x (0–3 km) by z (0–2 km).]

Page 19:

[Figure: recovered models for full gradient, incremental gradient, and gradient sampling after increasing numbers of passes through the data (up to 39 of 39 passes); each panel shows x (0–3 km) by z (0–2 km).]

Page 20:

Image denoising

• Statistical denoising via conditional random fields
• Dataset of 50 synthetic 64 × 64 images [Kumar/Hebert '04]
• Generalization of logistic model to capture dependencies among labels

maximize_x   ∑_{i=1}^m log p(b_i; x)

• p is intractable and approximated


Page 21:

[Figure: true image and noisy sample; denoised results for full gradient, incremental gradient, and gradient sampling after increasing numbers of passes (up to 5 of 5 passes).]

Page 22:

Sampling approach

Increasing batch: [Bertsekas/Tsitsiklis '96, Shapiro/Homem-de-Mello '00]

S_k ⊆ {1, . . . , m},   s_k → m (slowly)

Sample-average gradient:

g_k(x) := (1/s_k) ∑_{i∈S_k} ∇f_i(x)

Algorithm:

x_{k+1} ← x_k − α_k d   with   H_k d = −g_k(x_k)

Goal: non-asymptotic analysis based on controlling the gradient error
• g_k(x_k) = ∇f(x_k) + e_k, where ‖e_k‖² ≤ ε_k or E[‖e_k‖²] ≤ ε_k
• How to control the sample size s_k?


Page 23:

Gradient with generic errors

Prototype algorithm:

x_{k+1} ← x_k − α g_k,   g_k = ∇f(x_k) + e_k,   α fixed

Assumptions:
• Lipschitz gradient (L); strong convexity (µ)
• ‖e_k‖² ≤ ε_k

Convergence rate: for all k = 1, 2, . . . [F. & Schmidt '12]

‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(ε_k)

Examples:
• linear: ε_k = O(γ^k)  ⟹  f(x_k) − f_* = O(max{γ, 1 − µ/L}^k)
• sublinear: ε_k = O(1/k²)  ⟹  f(x_k) − f_* = O(1/k²)
• persistent error: inf_k ε_k > 0  ⟹  f(x_k) − f_* = O(ε_k)
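A minimal sketch of this prototype iteration with an artificially injected gradient error of norm exactly ε_k, on a hypothetical least-squares instance; it illustrates that a linearly decaying error preserves fast convergence:

```python
import numpy as np

def inexact_gradient_descent(A, b, eps_schedule, n_iters=200, seed=0):
    """x_{k+1} = x_k - (1/L) * (grad f(x_k) + e_k) with ||e_k|| = eps_k."""
    rng = np.random.default_rng(seed)
    L = np.linalg.norm(A.T @ A, 2)           # Lipschitz constant of grad f
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]
    x, errs = np.zeros(A.shape[1]), []
    for k in range(n_iters):
        g = A.T @ (A @ x - b)                # exact gradient of 0.5*||Ax-b||^2
        e = rng.standard_normal(x.size)
        e *= eps_schedule(k) / np.linalg.norm(e)
        x = x - (g + e) / L
        errs.append(np.linalg.norm(x - x_star))
    return errs

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5)); b = rng.standard_normal(100)
lin = inexact_gradient_descent(A, b, eps_schedule=lambda k: 0.5**k)
sub = inexact_gradient_descent(A, b, eps_schedule=lambda k: 1.0 / (k + 1)**2)
print(lin[-1], sub[-1])     # linearly decaying error retains the fast rate
```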


Page 24:

Growing the sample size

Prototype algorithm:

x_{k+1} ← x_k − α g_k,   g_k = (1/s_k) ∑_{i∈S_k} ∇f_i(x_k),   S_k ⊆ {1, . . . , m}

Sampling strategies:
• Deterministic: pre-determined sample sequence
• Randomized: uniform sampling

Convergence rates for all k = 1, 2, . . . [F. & Schmidt '12]

• deterministic:   ‖x_k − x_*‖² = O([1 − µ/L]^k) + O([(m − s_k)/m]²)
• sampling w/o replacement:   E‖x_k − x_*‖² = O([1 − µ/L]^k) + O((m − s_k)/m · 1/s_k)
• sampling w/ replacement:   E‖x_k − x_*‖² = O([1 − µ/L]^k) + O(1/s_k)
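A minimal sketch of one such schedule: the sample size grows geometrically toward m, with uniform sampling with replacement (the growth factor and stepsize are my own choices for illustration):

```python
import numpy as np

def growing_batch_descent(A, b, alpha, n_iters=100, growth=1.2, seed=0):
    """Gradient descent on (0.5/s_k)*||A_S x - b_S||^2 with a growing batch.

    s_k grows geometrically toward m; indices are drawn uniformly with
    replacement, so the gradient-error term behaves like O(1/s_k).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x, s, schedule = np.zeros(n), 1.0, []
    for k in range(n_iters):
        sk = min(m, int(np.ceil(s)))
        idx = rng.integers(m, size=sk)               # sampling with replacement
        g = A[idx].T @ (A[idx] @ x - b[idx]) / sk    # sample-average gradient
        x = x - alpha * g
        schedule.append(sk)
        s *= growth                                  # s_k -> m, slowly
    return x, schedule

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 5))
b = A @ np.ones(5)
x, schedule = growing_batch_descent(A, b, alpha=0.1)
print(schedule[:5], schedule[-1])    # batch grows from 1 toward m = 500
```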

Page 25:

Illustration

[Figure: sample-size schedule (s/m vs. iteration k) and cumulative samples (log scale, vs. iteration k) for the deterministic, no-replacement, and with-replacement strategies.]

Page 26:

In practice

Algorithm:

x_{k+1} ← x_k − α_k d,   B_k d = −g_k(x_k),   g_k(x) = (1/s_k) ∑_{i∈S_k} ∇f_i(x)

Hessian approximations B_k:
• (limited-memory) quasi-Newton: [Schraudolph et al. '07]

  s_k := x_{k+1} − x_k,   y_k := g_k(x_{k+1}) − g_k(x_k)

• sample-average Hessian: [Byrd et al. '11]

  B_k := (1/h_k) ∑_{i∈H_k} ∇²f_i(x_k),   h_k ≪ s_k

• Fisher information [for f(x) = −∑_i log p_i(ω_i; x)]: [Osborne '92]

  B_k ≈ E_ω[∇²f(x)] = E_ω[∇f(x) ∇f(x)^T]
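One way to realize such a B_k is sketched below: a plain (full-memory) BFGS update built from the pairs (s_k, y_k) defined above, with both gradients in y_k evaluated on the same sample S_k. This is my own simplification of the limited-memory variants cited, run on a hypothetical least-squares instance:

```python
import numpy as np

def sampled_bfgs(A, b, batch, alpha=0.5, n_iters=50, seed=0):
    """BFGS-style iteration on sample-average gradients of (0.5/m)*||Ax-b||^2.

    H approximates B_k^{-1} and is updated from s_k = x_{k+1} - x_k and
    y_k = g_k(x_{k+1}) - g_k(x_k), both computed on the same sample S_k.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x, H = np.zeros(n), np.eye(n)
    for k in range(n_iters):
        idx = rng.choice(m, size=batch, replace=False)

        def grad(z, idx=idx):
            return A[idx].T @ (A[idx] @ z - b[idx]) / batch

        g = grad(x)
        x_new = x - alpha * (H @ g)             # d solves B_k d = -g_k
        s, y = x_new - x, grad(x_new) - g       # same sample for both gradients
        if s @ y > 1e-10:                       # curvature condition
            rho = 1.0 / (s @ y)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x = x_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((300, 8)); b = A @ np.ones(8)
print(np.round(sampled_bfgs(A, b, batch=30), 2))  # close to x = ones(8): data are consistent
```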


Page 27:

APPLICATIONS

Page 28:

Binary logistic regression

max_x   ∑_{i=1}^m log p(b_i | a_i, x),   p(b_i | a_i, x) = 1 / (1 + exp(−b_i a_i^T x)),   b_i ∈ {−1, 1}

• Email spam classifier (Cormack and Lynam, 2005)
• TREC 2005 dataset: 92,189 email messages from the Enron investigation
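A minimal sketch of this likelihood and its gradient in the same parameterization, fit by plain gradient ascent on synthetic data (a hypothetical stand-in for the spam-classification experiment):

```python
import numpy as np

def log_likelihood_and_grad(x, A, b):
    """Log-likelihood sum_i log p(b_i | a_i, x), with p = 1/(1+exp(-b_i a_i^T x))
    and b_i in {-1, +1}, plus its gradient."""
    z = b * (A @ x)
    ll = -np.sum(np.logaddexp(0.0, -z))          # sum_i log sigmoid(z_i)
    grad = A.T @ (b * (1.0 - 1.0 / (1.0 + np.exp(-z))))
    return ll, grad

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
b = np.sign(A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000))
x = np.zeros(10)
for _ in range(200):                             # plain gradient ascent sketch
    ll, g = log_likelihood_and_grad(x, A, b)
    x = x + 1e-3 * g
print(ll)
```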

[Figure: objective minus optimal (log scale) vs. passes through data for Stochastic (1e+01, 1e+00, 1e−01), Hybrid, and Deterministic.]

Page 29:

Multinomial logistic regression

max_x   ∑_{i=1}^m log p(b_i = j | a_i, {x_j}_{j∈C}),   b_i ∈ C

• Digit classification
• MNIST dataset: 70,000 handwritten 28 × 28 images of digits

[Figure: objective minus optimal (log scale) vs. passes through data for Stochastic (1e−02, 1e−03, 1e−04), Hybrid, and Deterministic.]

Page 30:

Chain-structured conditional random fields

max_x   ∑_{i=1}^m log p({b_i^k = j_k}_{k∈Ω} | {a_i^k}_{k∈Ω}, {x_j}_{j∈C}),   b_i^k ∈ C, k ∈ Ω

• Noun-phrase chunking task from natural-language processing
• CoNLL-2000 Shared Task dataset: 211,727 words in 8,936 sentences

[Figure: objective minus optimal (log scale) vs. passes through data for Stochastic (1e−01, 1e−02, 1e−03), Hybrid, and Deterministic.]

Page 31:

General conditional random fields

min_x   ∑_{i=1}^m −log p(b_i; x)

• Statistical denoising
• Kumar/Hebert dataset of 50 synthetic 64 × 64 images

[Figure: objective minus optimal (log scale) vs. passes through data for Stochastic (1e−03, 1e−04, 1e−05), Hybrid, and Deterministic.]

Page 32:

Seismic inversion

min_x   ∑_{i=1}^m ∑_{ω∈Ω} ‖d_i − P H_ω(x)^{-1} q_i‖²

• Recover seismic image via nonlinear least squares
• Marmousi 2D acoustic model; 101 sources/receivers; 8 frequencies

[Figure: objective minus optimal (log scale) vs. passes through data for Stochastic (1e−08, 1e−09, 1e−10), Hybrid, and Deterministic.]

Page 33:

No strong convexity, no linear convergence

Steepest descent

x_{k+1} ← x_k − α ∇f(x_k)

is sublinear:

f(x_k) − f(x_*) = O(1/k)

Approximate steepest descent

x_{k+1} ← x_k − α d_k   with   d_k = ∇f(x_k) + e_k

Summable errors, i.e., ∑_{k=1}^∞ ‖e_k‖ < ∞, give a sublinear rate for the iterate average:

f(x̄_k) − f(x_*) = O(1/k),   x̄_k := (1/k) ∑_{i=1}^k x_i


Page 34:

References I
• H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM, 58:8:1–8:34, April 2011.
• D. Bertsekas. Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for machine learning, chapter 4, pages 85–115. MIT Press, 2012.
• D. Bertsekas and J. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
• D. Blatt, A. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM J. Optim., 18(1):29–51, 2008.
• L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187, Paris, France, August 2010. Springer.
• R. H. Byrd, G. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Math. Program., 134:127–155, 2012.
• R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3):977–995, 2011.
• L. Grippo. A class of unconstrained minimization methods for neural network training. Optim. Methods Softw., 4(2):135–150, 1994.
• E. Haber, M. Chung, and F. J. Herrmann. An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM J. Optim., 22(3):739–757, 2012.


Page 35:

References II
• M. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Comm. Statist. Simulation Comput., 19(2):433–450, 1990.
• A. Kleywegt, A. Shapiro, and T. Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM J. Optim., 12(2):479–502, 2002.
• Z. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res., 46(1):157–178, 1993.
• A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
• Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Soviet Math. Dokl., 269, 1983.
• M. Osborne. Fisher's method of scoring. Intern. Stat. Rev., pages 99–117, 1992.
• N. N. Schraudolph, J. Yu, and S. Gunter. A Stochastic Quasi-Newton Method for Online Convex Optimization. In M. Meila and X. Shen, editors, Proc. 11th Intl. Conf. Artificial Intelligence and Statistics (AIStats), volume 2 of Workshop Conf. Proc., pages 436–443, San Juan, Puerto Rico, 2007. J. Machine Learning Res.
• P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program., 125:263–295, 2010.
• K. van den Doel and U. M. Ascher. Adaptive and stochastic algorithms for electrical impedance tomography and dc resistivity problems with piecewise constant solutions and many measurements. SIAM J. Comput., 34(1), 2012.


Page 36:

Thanks!

Read:
• M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3), April 2012.
• A. Aravkin, M. P. Friedlander, F. Herrmann, and T. van Leeuwen. Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming, 134(1):101–125, 2012.

Email: [email protected]

Surf: http://www.cs.ubc.ca/~mpf