
  • Large-scale machine learning and convex optimization

    Francis Bach

    INRIA - École Normale Supérieure, Paris, France

    Allerton Conference - September 2015

    Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf

  • Context

    Machine learning for big data

    Large-scale machine learning: large d, large n, large k

    d : dimension of each observation (input)

    n : number of observations

    k : number of tasks (dimension of outputs)

    Examples: computer vision, bioinformatics, text processing

    Ideal running-time complexity: O(dn + kn)

    Going back to simple methods

    Stochastic gradient methods (Robbins and Monro, 1951)

    Mixing statistics and optimization

    Using smoothness to go beyond stochastic gradient descent

  • Visual object recognition

  • Personal photos

  • Learning for bioinformatics - Proteins

    Crucial components of cell life

    Predicting multiple functions and interactions

    Massive data: up to 1 million for humans!

    Complex data:

    Amino-acid sequence

    Link with DNA

    Tri-dimensional molecule

  • Search engines - advertising


  • Outline

    1. Large-scale machine learning and optimization

    Traditional statistical analysis

    Classical methods for convex optimization

    2. Non-smooth stochastic approximation

    Stochastic (sub)gradient and averaging

    Non-asymptotic results and lower bounds

    Strongly convex vs. non-strongly convex

    3. Smooth stochastic approximation algorithms

    Asymptotic and non-asymptotic results

    Beyond decaying step-sizes

    4. Finite data sets

  • Supervised machine learning

    Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, i.i.d.

    Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^d$

    (regularized) empirical risk minimization: find $\hat{\theta}$ solution of

    $\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \, \Omega(\theta)$

    convex data fitting term + regularizer (a minimal code sketch of this objective follows below)
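    A minimal sketch of the objective above, assuming the feature map has already been applied so that Phi is an n x d NumPy array; the logistic loss and the squared l2-norm stand in for the generic loss and regularizer, and all names are illustrative, not from the slides.

      import numpy as np

      def regularized_empirical_risk(theta, Phi, y, mu):
          # (1/n) sum_i loss(y_i, <theta, Phi(x_i)>) + mu * Omega(theta),
          # with the logistic loss and Omega(theta) = (1/2) ||theta||_2^2 as examples.
          margins = y * (Phi @ theta)                      # y_i * <theta, Phi(x_i)>
          data_fit = np.mean(np.logaddexp(0.0, -margins))  # logistic loss, computed stably
          regularizer = 0.5 * mu * np.dot(theta, theta)    # (mu/2) ||theta||_2^2
          return data_fit + regularizer

      # Example: regularized_empirical_risk(np.zeros(Phi.shape[1]), Phi, y, mu=1e-3)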


  • Usual losses

    Regression: $y \in \mathbb{R}$, prediction $\hat{y} = \langle \theta, \Phi(x) \rangle$; quadratic loss $\frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}\big(y - \langle \theta, \Phi(x) \rangle\big)^2$

    Classification: $y \in \{-1, 1\}$, prediction $\hat{y} = \mathrm{sign}(\langle \theta, \Phi(x) \rangle)$; loss of the form $\ell(y \, \langle \theta, \Phi(x) \rangle)$. True 0-1 loss: $\ell(y \, \langle \theta, \Phi(x) \rangle) = 1_{y \, \langle \theta, \Phi(x) \rangle < 0}$

  • Main motivating examples

    Support vector machine (hinge loss):

    $\ell(Y, \langle \theta, \Phi(X) \rangle) = \max\{ 1 - Y \langle \theta, \Phi(X) \rangle, 0 \}$

    Logistic regression:

    $\ell(Y, \langle \theta, \Phi(X) \rangle) = \log\big(1 + \exp(-Y \langle \theta, \Phi(X) \rangle)\big)$

    Least-squares regression:

    $\ell(Y, \langle \theta, \Phi(X) \rangle) = \frac{1}{2} (Y - \langle \theta, \Phi(X) \rangle)^2$

    (a short code sketch of these three losses follows below)
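    A minimal sketch of the three losses above as functions of the label and of the score $\langle \theta, \Phi(X) \rangle$; the function names and the NumPy dependency are illustrative, not from the slides.

      import numpy as np

      def hinge_loss(y, score):
          # SVM hinge loss: max{1 - y * score, 0}
          return np.maximum(1.0 - y * score, 0.0)

      def logistic_loss(y, score):
          # logistic loss: log(1 + exp(-y * score)), computed stably
          return np.logaddexp(0.0, -y * score)

      def squared_loss(y, score):
          # least-squares loss: (1/2) (y - score)^2
          return 0.5 * (y - score) ** 2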

  • Usual regularizers

    Main goal: avoid overfitting

    (squared) Euclidean norm: $\|\theta\|_2^2 = \sum_{j=1}^{d} |\theta_j|^2$

    Numerically well-behaved

    Representer theorem and kernel methods: $\theta = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$

    See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and Cristianini (2004)

    Sparsity-inducing norms

    Main example: $\ell_1$-norm $\|\theta\|_1 = \sum_{j=1}^{d} |\theta_j|$

    Perform model selection as well as regularization

    Non-smooth optimization and structured sparsity

    See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012b,a)
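    As an illustration of the two families of regularizers above, here is a minimal sketch of the penalties together with the soft-thresholding operator, the proximal operator of the $\ell_1$-norm used by non-smooth optimization methods for sparsity; the function names are illustrative, not from the slides.

      import numpy as np

      def squared_l2_penalty(theta):
          # squared Euclidean norm: sum_j theta_j^2
          return np.dot(theta, theta)

      def l1_penalty(theta):
          # l1-norm: sum_j |theta_j|, which induces sparsity
          return np.sum(np.abs(theta))

      def soft_threshold(theta, lam):
          # proximal operator of lam * ||.||_1: shrinks coefficients toward zero
          # and sets small ones exactly to zero (model selection effect)
          return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)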


  • Supervised machine learning

    Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, i.i.d.

    Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^d$

    (regularized) empirical risk minimization: find $\hat{\theta}$ solution of

    $\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \, \Omega(\theta)$

    convex data fitting term + regularizer

    Empirical risk: $\hat{f}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$ (training cost)

    Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \, \ell(y, \langle \theta, \Phi(x) \rangle)$ (testing cost)

    Two fundamental questions: (1) computing $\hat{\theta}$ and (2) analyzing $\hat{\theta}$; may be tackled simultaneously


  • Supervised machine learning

    Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, i.i.d.

    Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^d$

    (regularized) empirical risk minimization: find $\hat{\theta}$ solution of

    $\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) \quad \text{such that } \Omega(\theta) \leqslant D$

    convex data fitting term + constraint

    Empirical risk: $\hat{f}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$ (training cost)

    Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \, \ell(y, \langle \theta, \Phi(x) \rangle)$ (testing cost)

    Two fundamental questions: (1) computing $\hat{\theta}$ and (2) analyzing $\hat{\theta}$; may be tackled simultaneously (a small sketch contrasting training and testing costs follows below)
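    A minimal sketch of the training/testing distinction: the empirical risk is an average over the n observed pairs, while the expected risk is approximated on held-out data drawn from the same distribution; the data, the loss argument and all names are illustrative, not from the slides.

      import numpy as np

      def empirical_risk(theta, Phi, y, loss):
          # training cost: (1/n) sum_i loss(y_i, <theta, Phi(x_i)>)
          return np.mean(loss(y, Phi @ theta))

      # The expected risk E_{(x,y)} loss(y, <theta, Phi(x)>) cannot be computed exactly;
      # in practice it is approximated by the same average on a held-out test set, e.g.
      #   empirical_risk(theta, Phi_test, y_test, loss)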

  • General assumptions

    Data: n observations (xi, yi) X Y, i = 1, . . . , n, i.i.d.

    Bounded features $\Phi(x) \in \mathbb{R}^d$: $\|\Phi(x)\|_2 \leqslant R$

    Empirical risk: $\hat{f}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$ (training cost)

    Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \, \ell(y, \langle \theta, \Phi(x) \rangle)$ (testing cost)

    Loss for a single observation: $f_i(\theta) = \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$, so that for all $i$, $f(\theta) = \mathbb{E} f_i(\theta)$

    Properties of $f_i$, $\hat{f}$, $f$: convex on $\mathbb{R}^d$

    Additional regularity assumptions: Lipschitz-continuity,

    smoothness and strong convexity

  • Lipschitz continuity

    Bounded gradients of $f$ (Lipschitz-continuity): the function $f$ is convex, differentiable and has (sub)gradients uniformly bounded by $B$ on the ball of center 0 and radius $D$:

    $\forall \theta \in \mathbb{R}^d, \ \|\theta\|_2 \leqslant D \ \Rightarrow \ \|f'(\theta)\|_2 \leqslant B$

    Machine learning: with $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$

    $G$-Lipschitz loss and $R$-bounded data: $B = GR$ (a one-line derivation follows below)
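    A short justification, not on the slides, of why a $G$-Lipschitz loss and $R$-bounded data give $B = GR$: writing $\ell'$ for a (sub)derivative of the loss with respect to its second argument,

    $$f'(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell'\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) \, \Phi(x_i)
    \quad \Rightarrow \quad
    \|f'(\theta)\|_2 \leqslant \frac{1}{n} \sum_{i=1}^{n} |\ell'(\cdot)| \, \|\Phi(x_i)\|_2 \leqslant G R .$$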

  • Smoothness and strong convexity

    A function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is differentiable and its gradient is $L$-Lipschitz-continuous:

    $\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ \|f'(\theta_1) - f'(\theta_2)\|_2 \leqslant L \|\theta_1 - \theta_2\|_2$

    If $f$ is twice differentiable: $\forall \theta \in \mathbb{R}^d, \ f''(\theta) \preccurlyeq L \cdot \mathrm{Id}$

    [Figure: smooth vs. non-smooth function]

    Machine learning: with $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$

    Hessian $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^\top$

    Smooth loss and $R$-bounded data: $L = L_{\mathrm{loss}} \cdot R^2$, where $L_{\mathrm{loss}}$ denotes the smoothness constant of the loss (see the bound sketched below)
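    A short justification, not on the slides, of the smoothness constant, assuming the loss is $L_{\mathrm{loss}}$-smooth in its second argument and $\|\Phi(x_i)\|_2 \leqslant R$:

    $$f''(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell''\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) \, \Phi(x_i) \Phi(x_i)^\top
    \ \preccurlyeq \ \frac{L_{\mathrm{loss}}}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^\top
    \ \preccurlyeq \ L_{\mathrm{loss}} R^2 \cdot \mathrm{Id},$$

    since each rank-one matrix $\Phi(x_i) \Phi(x_i)^\top$ has largest eigenvalue $\|\Phi(x_i)\|_2^2 \leqslant R^2$.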

  • Smoothness and strong convexity

    A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if

    $\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ f(\theta_1) \geqslant f(\theta_2) + \langle f'(\theta_2), \theta_1 - \theta_2 \rangle + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$

    If $f$ is twice differentiable: $\forall \theta \in \mathbb{R}^d, \ f''(\theta) \succcurlyeq \mu \cdot \mathrm{Id}$

    [Figure: convex vs. strongly-convex function]

    Machine learning: with $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$

    Hessian $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^\top$

    Data with invertible covariance matrix (low correlation/dimension)

    Adding regularization by $\frac{\mu}{2} \|\theta\|_2^2$ creates additional bias unless $\mu$ is small (see the note below)
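    A note on the last point: for any convex $f$, the regularized function

    $$\theta \ \mapsto \ f(\theta) + \frac{\mu}{2} \|\theta\|_2^2$$

    is $\mu$-strongly convex (its Hessian gains $\mu \cdot \mathrm{Id}$), which is precisely how strong convexity is obtained when the covariance matrix is not invertible, at the price of the bias mentioned above.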

  • Summary of smoothness/convexity assumptions

    Bounded gradients of $f$ (Lipschitz-continuity): the function $f$ is convex, differentiable and has (sub)gradients uniformly bounded by $B$ on the ball of center 0 and radius $D$:

    $\forall \theta \in \mathbb{R}^d, \ \|\theta\|_2 \leqslant D \ \Rightarrow \ \|f'(\theta)\|_2 \leqslant B$

    Smoothness of $f$: the function $f$ is convex, differentiable with $L$-Lipschitz-continuous gradient $f'$:

    $\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ \|f'(\theta_1) - f'(\theta_2)\|_2 \leqslant L \|\theta_1 - \theta_2\|_2$

    Strong convexity of $f$: the function $f$ is strongly convex with respect to the norm $\|\cdot\|_2$, with convexity constant $\mu > 0$:

    $\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ f(\theta_1) \geqslant f(\theta_2) + \langle f'(\theta_2), \theta_1 - \theta_2 \rangle + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$

    (a concrete example of these constants for logistic regression follows below)
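    As a concrete illustration, not from the slides: for logistic regression with $R$-bounded data, the loss $u \mapsto \log(1 + e^{-u})$ is $1$-Lipschitz and $\frac{1}{4}$-smooth, so one may take

    $$B = GR = R, \qquad L = \tfrac{1}{4} R^2,$$

    while strong convexity of the empirical risk generally only holds after adding the $\frac{\mu}{2} \|\theta\|_2^2$ regularizer.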

  • Analysis of empirical risk minimization

    Approximation and estimation errors: $\Theta = \{\theta \in \mathbb{R}^d, \ \Omega(\theta) \leqslant D\}$

    $f(\hat{\theta}) - \min_{\theta \in \mathbb{R}^d} f(\theta) = \big[ f(\hat{\theta}) - \min_{\theta \in \Theta} f(\theta) \big] + \big[ \min_{\theta \in \Theta} f(\theta) - \min_{\theta \in \mathbb{R}^d} f(\theta) \big]$

    estimation error + approximation error
