Beyond stochastic gradient descent for large-scale machine learning. Francis Bach, INRIA - École Normale Supérieure, Paris, France. Joint work with Aymeric Dieuleveut, Nicolas Flammarion, Eric Moulines. ERNSI Workshop, 2017.


Page 1

Beyond stochastic gradient descent for large-scale machine learning

Francis Bach

INRIA - École Normale Supérieure, Paris, France

Joint work with Aymeric Dieuleveut, Nicolas Flammarion, Eric Moulines - ERNSI Workshop, 2017

Page 2

Context and motivations

• Supervised machine learning

– Goal: estimating a function f : X → Y

– From random observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n

Page 3

Context and motivations

• Supervised machine learning

– Goal: estimating a function f : X → Y

– From random observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n

• Specificities

– Data may come from anywhere

– From strong to weak prior knowledge

– Computational constraints

– Between theory, algorithms and applications

Page 4

Context and motivations

• Large-scale machine learning: large p, large n

– p: dimension of each observation (input)

– n: number of observations

• Examples: computer vision, bioinformatics, advertising

– Ideal running-time complexity: O(pn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 5

Search engines - Advertising

Page 6

Visual object recognition

Page 7

Bioinformatics

• Proteins: crucial elements of cell life

• Massive data: 2 million for humans

• Complex data

Page 8

Context and motivations

• Large-scale machine learning: large p, large n

– p: dimension of each observation (input)

– n: number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(pn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 9

Context and motivations

• Large-scale machine learning: large p, large n

– p: dimension of each observation (input)

– n: number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(pn)

• Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 10

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ, Φ(x)〉 of features Φ(x) ∈ R^p

– Explicit features adapted to inputs (can be learned as well)

– Using Hilbert spaces for non-linear / non-parametric estimation

Page 11

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ, Φ(x)〉 of features Φ(x) ∈ R^p

• (Regularized) empirical risk minimization: find θ solution of

min_{θ∈R^p} (1/n) ∑_{i=1}^n ℓ(y_i, 〈θ, Φ(x_i)〉) + µΩ(θ)

(convex data fitting term + regularizer)

Page 12

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ, Φ(x)〉 of features Φ(x) ∈ R^p

• (Regularized) empirical risk minimization: find θ solution of

min_{θ∈R^p} (1/2n) ∑_{i=1}^n (y_i − 〈θ, Φ(x_i)〉)² + µΩ(θ)

(least-squares regression)

Page 13

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ, Φ(x)〉 of features Φ(x) ∈ R^p

• (Regularized) empirical risk minimization: find θ solution of

min_{θ∈R^p} (1/n) ∑_{i=1}^n log(1 + exp(−y_i 〈θ, Φ(x_i)〉)) + µΩ(θ)

(logistic regression)

Page 14

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ, Φ(x)〉 of features Φ(x) ∈ R^p

• (Regularized) empirical risk minimization: find θ solution of

min_{θ∈R^p} (1/n) ∑_{i=1}^n ℓ(y_i, 〈θ, Φ(x_i)〉) + µΩ(θ)

(convex data fitting term + regularizer)

• Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, 〈θ, Φ(x_i)〉) (training cost)

• Expected risk: f(θ) = E_{(x,y)} ℓ(y, 〈θ, Φ(x)〉) (testing cost)

• Two fundamental questions: (1) computing θ and (2) analyzing θ
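In code, the two specific objectives above are only a few lines each; a minimal NumPy sketch, assuming for illustration that the feature map Φ is the identity and taking Ω(θ) = ‖θ‖²/2 (both choices are assumptions of this example, not from the slides):

```python
import numpy as np

def least_squares_risk(theta, X, y, mu=0.0):
    """(1/2n) sum_i (y_i - <theta, x_i>)^2 + (mu/2) ||theta||^2."""
    r = y - X @ theta
    return 0.5 * np.mean(r ** 2) + 0.5 * mu * theta @ theta

def logistic_risk(theta, X, y, mu=0.0):
    """(1/n) sum_i log(1 + exp(-y_i <theta, x_i>)), labels y_i in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * (X @ theta)))) + 0.5 * mu * theta @ theta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))          # rows play the role of Phi(x_i)
theta_star = rng.standard_normal(5)
y_ls = X @ theta_star + 0.1 * rng.standard_normal(200)   # regression targets
y_lr = np.sign(X @ theta_star)                           # classification labels

print(least_squares_risk(theta_star, X, y_ls))
print(logistic_risk(theta_star, X, y_lr))
```

Both risks are averages over the n observations, which is exactly the structure exploited by stochastic gradient methods later in the talk.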

Page 15

Smoothness and strong convexity

• A function g : R^p → R is L-smooth if and only if it is twice differentiable and

∀θ ∈ R^p, eigenvalues[g″(θ)] ≤ L

(figure: a smooth vs. a non-smooth function)

Page 16

Smoothness and strong convexity

• A function g : R^p → R is L-smooth if and only if it is twice differentiable and

∀θ ∈ R^p, eigenvalues[g″(θ)] ≤ L

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, 〈θ, Φ(x_i)〉)

– Hessian ≈ covariance matrix (1/n) ∑_{i=1}^n Φ(x_i) ⊗ Φ(x_i)

– Bounded data: ‖Φ(x)‖ ≤ R ⇒ L = O(R²)
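For least-squares this is easy to check numerically: the Hessian is exactly the empirical covariance matrix, and its largest eigenvalue is bounded by R². A small sketch on synthetic Gaussian features (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 10
Phi = rng.standard_normal((n, p))            # feature vectors Phi(x_i) as rows

# Hessian of g(theta) = (1/2n) sum_i (y_i - <theta, Phi(x_i)>)^2
H = Phi.T @ Phi / n                          # empirical covariance (1/n) sum Phi(x_i) (x) Phi(x_i)
eigs = np.linalg.eigvalsh(H)

R2 = (np.linalg.norm(Phi, axis=1) ** 2).max()   # R^2 with R = max_i ||Phi(x_i)||
print(eigs.max(), R2)                           # smoothness constant L <= R^2
```

The bound follows since λ_max(H) ≤ tr(H) = (1/n) ∑_i ‖Φ(x_i)‖² ≤ R² for a positive semi-definite H.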

Page 17

Smoothness and strong convexity

• A twice differentiable function g : R^p → R is µ-strongly convex if and only if

∀θ ∈ R^p, eigenvalues[g″(θ)] ≥ µ

(figure: a convex vs. a strongly convex function)

Page 18

Smoothness and strong convexity

• A twice differentiable function g : R^p → R is µ-strongly convex if and only if

∀θ ∈ R^p, eigenvalues[g″(θ)] ≥ µ

(figures: well-conditioned, large µ/L, vs. ill-conditioned, small µ/L)

Page 19

Smoothness and strong convexity

• A twice differentiable function g : R^p → R is µ-strongly convex if and only if

∀θ ∈ R^p, eigenvalues[g″(θ)] ≥ µ

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, 〈θ, Φ(x_i)〉)

– Hessian ≈ covariance matrix (1/n) ∑_{i=1}^n Φ(x_i) ⊗ Φ(x_i)

– Data with invertible covariance matrix (low correlation/dimension)

Page 20

Smoothness and strong convexity

• A twice differentiable function g : R^p → R is µ-strongly convex if and only if

∀θ ∈ R^p, eigenvalues[g″(θ)] ≥ µ

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, 〈θ, Φ(x_i)〉)

– Hessian ≈ covariance matrix (1/n) ∑_{i=1}^n Φ(x_i) ⊗ Φ(x_i)

– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by (µ/2)‖θ‖²

– creates additional bias unless µ is small

Page 21

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^p

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

– O(1/t) convergence rate for convex functions

– O(e^{−(µ/L)t}) convergence rate for strongly convex functions

Page 22

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^p

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

– O(1/t) convergence rate for convex functions

– O(e^{−(µ/L)t}) convergence rate for strongly convex functions

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})

– O(e^{−ρ 2^t}) convergence rate
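The two recursions can be compared on a toy one-dimensional example; a sketch assuming the illustrative function g(t) = log(1 + e^t) + t²/2, which is twice differentiable, 1-strongly convex and 1.25-smooth:

```python
import numpy as np

# g(t) = log(1 + e^t) + t^2/2, so mu = 1 and L = 1.25 (sigmoid' is at most 1/4)
g1 = lambda t: 1.0 / (1.0 + np.exp(-t)) + t                 # g'(t)
g2 = lambda t: np.exp(-t) / (1.0 + np.exp(-t)) ** 2 + 1.0   # g''(t)

t_gd = t_newton = 4.0
for _ in range(30):
    t_gd -= g1(t_gd) / 1.25                  # gradient descent with step 1/L
    t_newton -= g1(t_newton) / g2(t_newton)  # Newton step

print(t_gd, t_newton)    # both close to the unique minimizer
```

Gradient descent contracts the error by a factor (1 − µ/L) per step, while Newton squares the error once close to the minimizer, matching the two rates above.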

Page 23

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^p

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

– O(1/t) convergence rate for convex functions

– O(e^{−(µ/L)t}) convergence rate for strongly convex functions

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})

– O(e^{−ρ 2^t}) convergence rate

• Key insights from Bottou and Bousquet (2008)

1. In machine learning, no need to optimize below statistical error

2. In machine learning, cost functions are averages

⇒ Stochastic approximation

Page 24

Stochastic approximation

• Goal: minimizing a function f defined on R^p

– given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ R^p

Page 25

Stochastic approximation

• Goal: minimizing a function f defined on R^p

– given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ R^p

• Machine learning - statistics

– f(θ) = E f_n(θ) = E ℓ(y_n, 〈θ, Φ(x_n)〉) = generalization error

– Loss for a single pair of observations: f_n(θ) = ℓ(y_n, 〈θ, Φ(x_n)〉)

– Expected gradient: f′(θ) = E f′_n(θ) = E[ℓ′(y_n, 〈θ, Φ(x_n)〉) Φ(x_n)]

• Beyond convex optimization: see, e.g., Benveniste et al. (2012)

Page 26

Convex stochastic approximation

• Key assumption: smoothness and/or strong convexity

• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)

θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})

– Polyak-Ruppert averaging: θ̄_n = (1/(n+1)) ∑_{k=0}^n θ_k

– Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}

Page 27

Convex stochastic approximation

• Key assumption: smoothness and/or strong convexity

• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)

θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})

– Polyak-Ruppert averaging: θ̄_n = (1/(n+1)) ∑_{k=0}^n θ_k

– Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}

• Running-time = O(np)

– Single pass through the data

– One line of code among many
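The "one line of code" is literal; a sketch of averaged SGD on a synthetic least-squares stream, with the classical step-size γ_n = C n^{−1/2} (the constant C, the noise level and the data sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 5, 20000
theta_star = rng.standard_normal(p)
X = rng.standard_normal((n, p))                      # one observation per iteration
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta = np.zeros(p)
theta_bar = np.zeros(p)
for k in range(n):
    gamma = 0.1 / np.sqrt(k + 1)                     # gamma_n = C n^{-1/2}
    theta -= gamma * (X[k] @ theta - y[k]) * X[k]    # the SGD step: one line of code
    theta_bar += (theta - theta_bar) / (k + 2)       # running average of theta_0, ..., theta_{k+1}

print(np.linalg.norm(theta_bar - theta_star))
```

The running-average update is the online form of θ̄_n = (1/(n+1)) ∑_{k=0}^n θ_k, so the whole pass stays O(np) with O(p) memory.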

Page 28

Convex stochastic approximation

Existing analysis

• Known global minimax rates of convergence for non-smooth problems (Nemirovski and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}

– Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}

Page 29

Convex stochastic approximation

Existing analysis

• Known global minimax rates of convergence for non-smooth problems (Nemirovski and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}

– Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)

– All step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems

A single algorithm with a global adaptive convergence rate for smooth problems?

Page 30

Convex stochastic approximation

Existing analysis

• Known global minimax rates of convergence for non-smooth problems (Nemirovski and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}

– Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)

– All step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems

• A single algorithm for smooth problems with global convergence rate O(1/n) in all situations?

Page 31

Least-mean-square algorithm

• Least-squares: f(θ) = (1/2) E[(y_n − 〈Φ(x_n), θ〉)²] with θ ∈ R^p

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[Φ(x_n) ⊗ Φ(x_n)] = H ≽ µ · Id

Page 32

Least-mean-square algorithm

• Least-squares: f(θ) = (1/2) E[(y_n − 〈Φ(x_n), θ〉)²] with θ ∈ R^p

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[Φ(x_n) ⊗ Φ(x_n)] = H ≽ µ · Id

• New analysis for averaging and constant step-size γ = 1/(4R²)

– Assume ‖Φ(x_n)‖ ≤ R and |y_n − 〈Φ(x_n), θ∗〉| ≤ σ almost surely

– No assumption regarding lowest eigenvalues of H

– Main result: E f(θ̄_n) − f(θ∗) ≤ 4σ²p/n + 4R²‖θ_0 − θ∗‖²/n

• Matches statistical lower bound (Tsybakov, 2003)

– Fewer assumptions than existing bounds for empirical risk minimization
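The result can be reproduced numerically; a sketch of the averaged, constant step-size γ = 1/(4R²) recursion on synthetic Gaussian data (sizes and noise level are illustrative), with the empirical excess risk compared against the bound above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50000, 10
theta_star = rng.standard_normal(p)
Phi = rng.standard_normal((n, p))
y = Phi @ theta_star + 0.1 * rng.standard_normal(n)   # sigma = 0.1

R2 = (np.linalg.norm(Phi, axis=1) ** 2).max()    # R^2 = max_i ||Phi(x_i)||^2
gamma = 1.0 / (4.0 * R2)                         # constant step-size from the result above

theta = np.zeros(p)
theta_bar = np.zeros(p)
for k in range(n):
    theta -= gamma * (Phi[k] @ theta - y[k]) * Phi[k]   # LMS recursion
    theta_bar += (theta - theta_bar) / (k + 2)          # Polyak-Ruppert average

# empirical proxy for f(theta_bar) - f(theta*) = (1/2) (theta_bar - theta*)' H (theta_bar - theta*)
excess = 0.5 * np.mean((Phi @ (theta_bar - theta_star)) ** 2)
print(excess)
```

On a typical run the excess risk sits well below the bound 4σ²p/n + 4R²‖θ_0 − θ∗‖²/n, with no strong convexity used anywhere.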

Page 33

Least-squares - Proof technique

• LMS recursion:

θ_n − θ∗ = [I − γ Φ(x_n) ⊗ Φ(x_n)](θ_{n−1} − θ∗) + γ ε_n Φ(x_n)

• Simplified LMS recursion: with H = E[Φ(x_n) ⊗ Φ(x_n)],

θ_n − θ∗ = [I − γH](θ_{n−1} − θ∗) + γ ε_n Φ(x_n)

– Direct proof technique of Polyak and Juditsky (1992), e.g.,

θ_n − θ∗ = [I − γH]^n (θ_0 − θ∗) + γ ∑_{k=1}^n [I − γH]^{n−k} ε_k Φ(x_k)

• Infinite expansion of Aguech, Moulines, and Priouret (2000) in powers of γ

Page 34

Markov chain interpretation of constant step sizes

• LMS recursion for f_n(θ) = (1/2)(y_n − 〈Φ(x_n), θ〉)²:

θ_n = θ_{n−1} − γ (〈Φ(x_n), θ_{n−1}〉 − y_n) Φ(x_n)

• The sequence (θ_n)_n is a homogeneous Markov chain

– convergence to a stationary distribution π_γ

– with expectation θ̄_γ := ∫ θ π_γ(dθ)

– For least-squares, θ̄_γ = θ∗

(figure: iterates θ_n starting from θ_0 and wandering around θ̄_γ)

Page 35

Markov chain interpretation of constant step sizes

• LMS recursion for f_n(θ) = (1/2)(y_n − 〈Φ(x_n), θ〉)²:

θ_n = θ_{n−1} − γ (〈Φ(x_n), θ_{n−1}〉 − y_n) Φ(x_n)

• The sequence (θ_n)_n is a homogeneous Markov chain

– convergence to a stationary distribution π_γ

– with expectation θ̄_γ := ∫ θ π_γ(dθ)

• For least-squares, θ̄_γ = θ∗

(figure: iterates θ_n starting from θ_0 and wandering around θ∗)

Page 36

Markov chain interpretation of constant step sizes

• LMS recursion for f_n(θ) = (1/2)(y_n − 〈Φ(x_n), θ〉)²:

θ_n = θ_{n−1} − γ (〈Φ(x_n), θ_{n−1}〉 − y_n) Φ(x_n)

• The sequence (θ_n)_n is a homogeneous Markov chain

– convergence to a stationary distribution π_γ

– with expectation θ̄_γ := ∫ θ π_γ(dθ)

• For least-squares, θ̄_γ = θ∗

(figure: iterates θ_n around θ∗, together with the averaged iterate θ̄_n)

Page 37

Markov chain interpretation of constant step sizes

• LMS recursion for f_n(θ) = (1/2)(y_n − 〈Φ(x_n), θ〉)²:

θ_n = θ_{n−1} − γ (〈Φ(x_n), θ_{n−1}〉 − y_n) Φ(x_n)

• The sequence (θ_n)_n is a homogeneous Markov chain

– convergence to a stationary distribution π_γ

– with expectation θ̄_γ := ∫ θ π_γ(dθ)

• For least-squares, θ̄_γ = θ∗

– θ_n does not converge to θ∗ but oscillates around it

– oscillations of order √γ

• Ergodic theorem:

– Averaged iterates converge to θ̄_γ = θ∗ at rate O(1/n)

Page 38

Simulations - synthetic examples

• Gaussian distributions - p = 20

(figure: synthetic square; log10[f(θ) − f(θ∗)] vs. log10(n) for constant step-sizes 1/2R², 1/8R², 1/32R² and decaying 1/(2R²n^{1/2}))

Page 39

Simulations - benchmarks

• alpha (p = 500, n = 500 000), news (p = 1 300 000, n = 20 000)

(figures: test performance for least-squares on alpha and news, with C = 1 and C = opt; curves C/R², C/R²n^{1/2} and SAG; log10[f(θ) − f(θ∗)] vs. log10(n))

Page 40

Isn’t least-squares regression a “regression”?

Page 41

Isn’t least-squares regression a “regression”?

• Least-squares regression

– Simpler to analyze and understand

– Explicit relationship to bias/variance trade-offs

• Many important loss functions are not quadratic

– Beyond least-squares with online Newton steps

– Complexity of O(p) per iteration with rate O(p/n)

– See Bach and Moulines (2013) for details

Page 42

Optimal bounds for least-squares?

• Least-squares: cannot beat σ²p/n (Tsybakov, 2003). Really?

– Adaptivity to simpler problems

Page 43

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λ_m less than a constant

– Actual decay as λ_m = o(m^{−α}) with tr H^{1/α} = ∑_m λ_m^{1/α} small

– New result: replace σ²p/n by σ²(γn)^{1/α} tr H^{1/α}/n

(figure: log(λ_m) vs. log(m))
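The improvement is easy to quantify on a toy spectrum; a sketch assuming λ_m = m^{−α} with α = 2 (all constants below are illustrative choices, not values from the paper):

```python
import numpy as np

alpha = 2.0
p = 1000
m = np.arange(1, p + 1)
lam = m ** (-alpha)                               # eigenvalue decay lambda_m = m^{-alpha}
tr_H_1_over_alpha = np.sum(lam ** (1.0 / alpha))  # tr H^{1/alpha} = sum_m lambda_m^{1/alpha}

n, sigma2, gamma = 10000, 1.0, 0.1
pessimistic = sigma2 * p / n                                     # sigma^2 p / n
refined = sigma2 * (gamma * n) ** (1.0 / alpha) * tr_H_1_over_alpha / n

print(pessimistic, refined)
```

With this decay, λ_m^{1/α} = 1/m, so tr H^{1/α} grows only logarithmically in p and the refined variance term is much smaller than σ²p/n.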

Page 44

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λ_m less than a constant

– Actual decay as λ_m = o(m^{−α}) with tr H^{1/α} = ∑_m λ_m^{1/α} small

– New result: replace σ²p/n by σ²(γn)^{1/α} tr H^{1/α}/n

(figures: log(λ_m) vs. log(m); the term (γn)^{1/α} tr H^{1/α}/n as a function of α)

Page 45

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λ_m less than a constant

– Actual decay as λ_m = o(m^{−α}) with tr H^{1/α} = ∑_m λ_m^{1/α} small

– New result: replace σ²p/n by σ²(γn)^{1/α} tr H^{1/α}/n

• Optimal predictor

– Pessimistic assumption: ‖θ_0 − θ∗‖² finite

– Finer assumption: ‖H^{1/2−r}(θ_0 − θ∗)‖² small

– Replace ‖θ_0 − θ∗‖²/(γn) by 4‖H^{1/2−r}(θ_0 − θ∗)‖²/(γ^{2r} n^{2 min{r,1}})

– Leads to optimal rates for non-parametric regression

Page 46

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λ_m less than a constant

– Actual decay as λ_m = o(m^{−α}) with tr H^{1/α} = ∑_m λ_m^{1/α} small

– New result: replace σ²p/n by σ²(γn)^{1/α} tr H^{1/α}/n

• Optimal predictor

– Pessimistic assumption: ‖θ_0 − θ∗‖² finite

– Finer assumption: ‖H^{1/2−r}(θ_0 − θ∗)‖² small

– Replace ‖θ_0 − θ∗‖²/(γn) by 4‖H^{1/2−r}(θ_0 − θ∗)‖²/(γ^{2r} n^{2 min{r,1}})

• Leads to optimal rates for non-parametric regression

Page 47

Achieving optimal bias and variance terms

• Averaged gradient descent (Bach and Moulines, 2013): bias R²‖θ_0 − θ∗‖²/n, variance σ²p/n

Page 48

Achieving optimal bias and variance terms

• Averaged gradient descent (Bach and Moulines, 2013): bias R²‖θ_0 − θ∗‖²/n, variance σ²p/n

• Accelerated gradient descent (Nesterov, 1983): bias R²‖θ_0 − θ∗‖²/n², variance σ²p

• Acceleration is notoriously non-robust to noise (d'Aspremont, 2008; Schmidt et al., 2011)

– For non-structured noise, see Lan (2012)

Page 49

Achieving optimal bias and variance terms

• Averaged gradient descent (Bach and Moulines, 2013): bias R²‖θ_0 − θ∗‖²/n, variance σ²p/n

• Accelerated gradient descent (Nesterov, 1983): bias R²‖θ_0 − θ∗‖²/n², variance σ²p

• "Between" averaging and acceleration (Flammarion and Bach, 2015): bias R²‖θ_0 − θ∗‖²/n^{1+α}, variance σ²p/n^{1−α}

Page 50

Achieving optimal bias and variance terms

• Averaged gradient descent (Bach and Moulines, 2013): bias R²‖θ_0 − θ∗‖²/n, variance σ²p/n

• Accelerated gradient descent (Nesterov, 1983): bias R²‖θ_0 − θ∗‖²/n², variance σ²p

• "Between" averaging and acceleration (Flammarion and Bach, 2015): bias R²‖θ_0 − θ∗‖²/n^{1+α}, variance σ²p/n^{1−α}

• Averaging and acceleration (Dieuleveut, Flammarion, and Bach, 2016): bias R²‖θ_0 − θ∗‖²/n², variance σ²p/n

Page 51

Beyond least-squares - Markov chain interpretation

• Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain

– Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0

– When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0

Page 52

Beyond least-squares - Markov chain interpretation

• Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain

– Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0

– When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0

• θ_n oscillates around the wrong value θ̄_γ ≠ θ∗

(figure: iterates θ_n and averaged iterate θ̄_n around θ̄_γ, away from θ∗)

Page 53

Beyond least-squares - Markov chain interpretation

• Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain

– Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0

– When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0

• θ_n oscillates around the wrong value θ̄_γ ≠ θ∗

– moreover, ‖θ∗ − θ_n‖ = O_p(√γ)

• Ergodic theorem

– averaged iterates converge to θ̄_γ ≠ θ∗ at rate O(1/n)

– moreover, ‖θ∗ − θ̄_γ‖ = O(γ) (Bach, 2014)

– See precise analysis by Dieuleveut, Durmus, and Bach (2017)

Page 54

Simulations - synthetic examples

• Gaussian distributions - p = 20

(figure: synthetic logistic − 1; log10[f(θ) − f(θ∗)] vs. log10(n) for constant step-sizes 1/2R², 1/8R², 1/32R² and decaying 1/(2R²n^{1/2}))

Page 55

Restoring convergence through online Newton steps

• Known facts

1. Averaged SGD with γ_n ∝ n^{−1/2} leads to robust rate O(n^{−1/2}) for all convex functions

2. Averaged SGD with constant γ_n leads to robust rate O(n^{−1}) for all convex quadratic functions

3. Newton's method squares the error at each iteration for smooth functions

4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion

– Online Newton step

– Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1})

– Complexity: O(p) per iteration

Page 56

Restoring convergence through online Newton steps

• Known facts

1. Averaged SGD with γ_n ∝ n^{−1/2} leads to robust rate O(n^{−1/2}) for all convex functions

2. Averaged SGD with constant γ_n leads to robust rate O(n^{−1}) for all convex quadratic functions ⇒ O(n^{−1})

3. Newton's method squares the error at each iteration for smooth functions ⇒ O((n^{−1/2})²)

4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion

• Online Newton step

– Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1})

– Complexity: O(p) per iteration

Page 57

Restoring convergence through online Newton steps

• The Newton step for f(θ) := E f_n(θ) = E[ℓ(y_n, 〈θ, Φ(x_n)〉)] at a point θ̄ is equivalent to minimizing the quadratic approximation

g(θ) = f(θ̄) + 〈f′(θ̄), θ − θ̄〉 + (1/2)〈θ − θ̄, f″(θ̄)(θ − θ̄)〉

     = f(θ̄) + 〈E f′_n(θ̄), θ − θ̄〉 + (1/2)〈θ − θ̄, E f″_n(θ̄)(θ − θ̄)〉

     = E[f(θ̄) + 〈f′_n(θ̄), θ − θ̄〉 + (1/2)〈θ − θ̄, f″_n(θ̄)(θ − θ̄)〉]

Page 58

Restoring convergence through online Newton steps

• The Newton step for f(θ) := E f_n(θ) = E[ℓ(y_n, 〈θ, Φ(x_n)〉)] at a point θ̄ is equivalent to minimizing the quadratic approximation

g(θ) = f(θ̄) + 〈f′(θ̄), θ − θ̄〉 + (1/2)〈θ − θ̄, f″(θ̄)(θ − θ̄)〉

     = f(θ̄) + 〈E f′_n(θ̄), θ − θ̄〉 + (1/2)〈θ − θ̄, E f″_n(θ̄)(θ − θ̄)〉

     = E[f(θ̄) + 〈f′_n(θ̄), θ − θ̄〉 + (1/2)〈θ − θ̄, f″_n(θ̄)(θ − θ̄)〉]

• Complexity of the least-mean-square recursion for g is O(p):

θ_n = θ_{n−1} − γ[f′_n(θ̄) + f″_n(θ̄)(θ_{n−1} − θ̄)]

– f″_n(θ̄) = ℓ″(y_n, 〈θ̄, Φ(x_n)〉) Φ(x_n) ⊗ Φ(x_n) has rank one

– New online Newton step without computing/inverting Hessians
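A sketch of this recursion on synthetic logistic data, in the two-stage form described on the next slide: averaged SGD on the first half of the data gives a support point θ̄, then the averaged O(p) recursion above runs on the quadratic approximation at θ̄ (data sizes, step-size choice and the data-generating model are all illustrative assumptions):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
n, p = 40000, 5
theta_star = rng.standard_normal(p)
Phi = rng.standard_normal((n, p))
y = np.where(rng.random(n) < sigmoid(Phi @ theta_star), 1.0, -1.0)  # labels in {-1, +1}

gamma = 1.0 / (4.0 * (np.linalg.norm(Phi, axis=1) ** 2).max())      # gamma = 1/(4 R^2)

# Stage 1: averaged SGD on the logistic loss (first half of the data) -> support point
theta = np.zeros(p)
support = np.zeros(p)
for k in range(n // 2):
    grad = -y[k] * sigmoid(-y[k] * (Phi[k] @ theta)) * Phi[k]       # f'_k(theta)
    theta -= gamma * grad
    support += (theta - support) / (k + 2)

# Stage 2: averaged constant-step LMS on the quadratic approximation at the support
# point; f''_k has rank one, so each iteration stays O(p)
theta = support.copy()
theta_bar = support.copy()
for k in range(n // 2, n):
    m = Phi[k] @ support
    grad = -y[k] * sigmoid(-y[k] * m) * Phi[k]                      # f'_k(support)
    h = sigmoid(m) * (1.0 - sigmoid(m))                             # scalar second derivative
    theta -= gamma * (grad + h * (Phi[k] @ (theta - support)) * Phi[k])
    theta_bar += (theta - theta_bar) / (k - n // 2 + 2)

print(np.linalg.norm(theta_bar - theta_star))
```

No Hessian is ever formed or inverted: the rank-one structure reduces the Newton-step recursion to two inner products and two scaled vector additions per observation.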

Page 59

Choice of support point for online Newton step

• Two-stage procedure

(1) Run n/2 iterations of averaged SGD to obtain the support point θ̄

(2) Run n/2 iterations of averaged constant step-size LMS

– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)

– Provable convergence rate of O(p/n) for logistic regression

– Additional assumptions but no strong convexity

Page 60: Beyond stochastic gradient descent for large-scale machine ... · • Large-scale machine learning: large p, large n – p : dimension of each observation (input) – n : number of

Choice of support point for online Newton step

• Two-stage procedure

(1) Run n/2 iterations of averaged SGD to obtain θ̄

(2) Run n/2 iterations of averaged constant-step-size LMS

– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)

– Provable convergence rate of O(p/n) for logistic regression

– Additional assumptions but no strong convexity

• Update at each iteration using the current averaged iterate

– Recursion: θn = θn−1 − γ[ f′n(θ̄n−1) + f″n(θ̄n−1)(θn−1 − θ̄n−1) ]

– No provable convergence rate (yet) but best practical behavior

– Note (dis)similarity with regular SGD: θn = θn−1 − γ f′n(θn−1)
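The two-stage procedure above can be sketched end to end (a hedged NumPy illustration assuming the logistic loss with labels in {−1, +1}; the step size and all names are my own, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_stage_newton(X, y, gamma):
    """Two-stage sketch: (1) n/2 steps of constant-step-size averaged SGD
    give a support point; (2) n/2 steps of averaged LMS on the quadratic
    approximation at that support point. Logistic loss, y in {-1, +1}."""
    n, p = X.shape
    half = n // 2

    # Stage 1: constant-step-size averaged SGD.
    theta = np.zeros(p)
    avg = np.zeros(p)
    for k in range(half):
        x, yk = X[k], y[k]
        theta -= gamma * (-yk * sigmoid(-yk * (x @ theta)) * x)
        avg += (theta - avg) / (k + 1)      # running average of iterates
    support = avg

    # Stage 2: averaged LMS around `support`; the rank-one Hessian
    # f''_k(support) = curv * x x^T makes each step O(p).
    theta = support.copy()
    avg2 = support.copy()
    for k in range(half, n):
        x, yk = X[k], y[k]
        z = x @ support
        grad = -yk * sigmoid(-yk * z) * x
        curv = sigmoid(z) * sigmoid(-z)
        theta -= gamma * (grad + curv * (x @ (theta - support)) * x)
        avg2 += (theta - avg2) / (k - half + 1)
    return avg2
```

The second variant on the slide simply replaces the frozen `support` by the running average of the iterates inside the loop, so the linearization point improves as the recursion proceeds.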


Simulations - synthetic examples

• Gaussian distributions, p = 20

[Figure: two panels plotting log₁₀[f(θ) − f(θ∗)] against log₁₀(n) for synthetic logistic regression. Left panel ("synthetic logistic − 1"): constant step sizes 1/(2R²), 1/(8R²), 1/(32R²) versus the decaying step size 1/(2R²√n). Right panel ("synthetic logistic − 2"): support-point update schedules — every 2p iterations, every iteration, 2-step, 2-step-dbl.]


Simulations - benchmarks

• alpha (p = 500, n = 500 000), news (p = 1 300 000, n = 20 000)

[Figure: four panels plotting test error log₁₀[f(θ) − f(θ∗)] against log₁₀(n) for logistic regression on the alpha and news datasets, with C = 1 and C = opt. Each panel compares the constant step size 1/R² (resp. C/R²), the decaying step size 1/(R²√n) (resp. C/(R²√n)), SAG, Adagrad, and Newton.]



Conclusions

• Constant-step-size averaged stochastic gradient descent

– Reaches convergence rate O(1/n) in all regimes

– Improves on the O(1/√n) lower bound for non-smooth problems

– Efficient online Newton step for non-quadratic problems

– Robustness to step-size selection and adaptivity

• Extensions and future work

– Going beyond a single pass (Le Roux, Schmidt, and Bach, 2012;

Defazio, Bach, and Lacoste-Julien, 2014)

– Non-differentiable regularization (Flammarion and Bach, 2017)

– Kernels and nonparametric estimation (Dieuleveut and Bach, 2016)

– Parallelization

– Non-convex problems


References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM Journal on Control and Optimization, 39(3):872–899, 2000.

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Adv. NIPS, 2013.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. NIPS, 2014.

A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Annals of Statistics, 44(4):1363–1399, 2016.

A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Technical Report 1602.05419, arXiv, 2016.

A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Technical report, HAL, to appear, 2017.

N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. arXiv preprint arXiv:1504.01577, 2015.

N. Flammarion and F. Bach. Stochastic composite least-squares regression with convergence rate O(1/n). In Proc. COLT, 2017.

G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1–2, Ser. A):365–397, 2012.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.

O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, 1995.

A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951.

D. Ruppert. Efficient estimations from a slowly convergent Robbins–Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.

M. Schmidt, N. Le Roux, and F. Bach. Convergence rates for inexact proximal-gradient method. In Adv. NIPS, 2011.

A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.

A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge Univ. Press, 2000.