focs13 workshop

ITERATIVE METHODS AND REGULARIZATION IN THE DESIGN OF FAST ALGORITHMS
Lorenzo Orecchia, MIT Math
A unified framework for optimization and online learning beyond Multiplicative Weight Updates

Upload: lorenzo-orecchia

Post on 08-Nov-2015


TRANSCRIPT

  • ITERATIVE METHODS AND REGULARIZATION

    IN THE DESIGN OF FAST ALGORITHMS

    Lorenzo Orecchia, MIT Math

    A unified framework for optimization and online learning

    beyond Multiplicative Weight Updates

  • Talk Outline: A Tale of Two Halves

    PART 1: REGULARIZATION AND ITERATIVE TECHNIQUES FOR ONLINE LEARNING

    Online Linear Optimization; Online Linear Optimization over the Simplex and Multiplicative Weight Updates (MWUs); a regularization framework that generalizes MWUs: Follow the Regularized Leader.

    MESSAGE: REGULARIZATION IS A POWERFUL ALGORITHMIC TECHNIQUE
    (Optimization: Regularized Updates — Online Learning: Multiplicative Weight Updates)

    PART 2: NON-SMOOTH OPTIMIZATION AND FAST ALGORITHMS FOR MAXFLOW

    Non-smooth vs. smooth Convex Optimization; Non-smooth Convex Optimization reduces to Online Linear Optimization; application: understanding undirected Maxflow algorithms based on MWUs.

    MESSAGE: THE FASTEST ALGORITHMS REQUIRE A PRIMAL-DUAL APPROACH

  • TOC Applications of MWUs

    Fast algorithms for solving specific LPs and SDPs: Maximum Flow problems [PST], [GK], [F], [CKMST]; covering-packing problems [PST]; oblivious routing [R], [M]

    Fast approximation algorithms based on LP and SDP relaxations: Maxcut [AK]; graph partitioning problems [AK], [S], [OSV]

    Proof technique: Hardcore Lemma [BHK]; QIP = PSPACE [W]; derandomization [Y]

    and more

  • Machine Learning meets Optimization meets TCS

    These techniques have been rediscovered multiple times in different fields:

    Machine Learning, Convex Optimization, TCS

    Three surveys emphasizing the different viewpoints and literatures:

    1) ML: Prediction, Learning, and Games by Cesa-Bianchi and Lugosi

    2) Optimization: Lectures on Modern Convex Optimization by Ben-Tal and Nemirovski

    3) TCS: The Multiplicative Weights Update Method: a Meta-Algorithm and Applications by Arora, Hazan and Kale

  • REGULARIZATION 101

  • What is Regularization?

    Regularization is a fundamental technique in optimization: adding a regularizer F with parameter η > 0 turns an OPTIMIZATION PROBLEM into a WELL-BEHAVED OPTIMIZATION PROBLEM.

    Properties of the well-behaved problem: stable optimum; unique optimal solution; smoothness conditions.

    Benefits of regularization in learning and statistics: prevents overfitting; increases stability; decreases sensitivity to random noise.

  • Example: Regularization Helps Stability

    Consider a convex set S ⊆ R^n and a linear optimization problem:

    f(c) = argmin_{x ∈ S} c^T x

    The optimal solution f(c) may be very unstable under perturbation of c: we can have ‖c′ − c‖ ≤ ε and yet ‖f(c′) − f(c)‖ ≫ ε.

  • Example: Regularization Helps Stability

    Now consider the same convex set S ⊆ R^n and a regularized linear optimization problem

    f(c) = argmin_{x ∈ S} c^T x + η F(x)

    where F is σ-strongly convex. Then:

    ‖c′ − c‖ ≤ ε implies ‖f(c′) − f(c)‖ ≤ ε / (ησ).
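The stability phenomenon above can be checked numerically on the simplex, where the entropy-regularized minimizer has a closed form (a softmax). This is an illustrative sketch of my own, not from the slides; the near-tie cost vectors are made up for the demonstration.

```python
import math

def best_vertex(c):
    # Unregularized: argmin over the simplex of c^T x is a vertex (one-hot),
    # so a tiny perturbation of c can move the solution by 2 in l1 distance.
    i = min(range(len(c)), key=lambda j: c[j])
    return [1.0 if j == i else 0.0 for j in range(len(c))]

def regularized_argmin(c, eta):
    # With the negative-entropy regularizer eta * sum_i x_i log x_i, the
    # minimizer over the simplex is the softmax of -c/eta, which varies
    # smoothly with c.
    w = [math.exp(-ci / eta) for ci in c]
    z = sum(w)
    return [wi / z for wi in w]

c, c_pert = [1.0, 1.001], [1.001, 1.0]   # a near-tie and its tiny perturbation
```

Here `best_vertex` flips completely under the perturbation, while `regularized_argmin` moves only by roughly ‖c′ − c‖ / η.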

  • ONLINE LINEAR OPTIMIZATION

    AND

    MULTIPLICATIVE WEIGHT UPDATES

  • Online Linear Minimization

    SETUP: convex set X ⊆ R^n, generic norm ‖·‖, repeated game over T rounds. At round t:

    ALGORITHM: plays its current solution x^(t) ∈ X.
    ADVERSARY: reveals the current linear objective, a loss vector ℓ^(t) ∈ R^n with ‖ℓ^(t)‖ bounded.

    The algorithm suffers loss ℓ^(t)T x^(t), then plays an updated solution x^(t+1) ∈ X; the adversary responds with a new loss vector ℓ^(t+1).

    GOAL: update x^(t) to minimize regret, the gap between the average algorithm loss and the a-posteriori optimum:

    Regret = (1/T) Σ_{t=1}^T ℓ^(t)T x^(t) − min_{x ∈ X} (1/T) Σ_{t=1}^T ℓ^(t)T x

  • Simplex Case: Learning with Experts

    SETUP: the simplex X = Δ_n ⊆ R^n under the ℓ1 norm. At round t:

    ALGORITHM: plays p^(t), a distribution over the n dimensions, i.e. over experts.
    ADVERSARY: reveals the experts' losses ℓ^(t), with ‖ℓ^(t)‖_∞ ≤ 1.

    The algorithm's loss is its expected loss E_{i∼p^(t)}[ℓ_i^(t)] = p^(t)T ℓ^(t); it then updates its distribution to p^(t+1).

  • Simplex Case: Multiplicative Weight Updates

    MULTIPLICATIVE WEIGHT UPDATE:

    Weights:       w_i^(t+1) = (1 − ε)^{ℓ_i^(t)} · w_i^(t),   with w^(1) = (1, …, 1)
    Distribution:  p_i^(t+1) = w_i^(t+1) / Σ_{j=1}^n w_j^(t+1)

    The parameter ε ∈ (0, 1) trades off CONSERVATIVE updates (ε → 0) against AGGRESSIVE updates (ε → 1).
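As a sketch (my own minimal implementation, assuming losses in [0, 1]), the update above is only a few lines of Python:

```python
def mwu_play(losses, eps):
    """Run Multiplicative Weight Updates on a sequence of loss vectors
    (losses[t][i] in [0, 1]) and return the algorithm's total expected loss."""
    n = len(losses[0])
    w = [1.0] * n                                   # w^(1) = (1, ..., 1)
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                    # p^(t) proportional to w^(t)
        total += sum(pi * li for pi, li in zip(p, l))
        w = [(1 - eps) ** li * wi for li, wi in zip(l, w)]   # multiplicative update
    return total
```

With one perfect expert (loss 0 every round) and one bad expert (loss 1 every round), the algorithm's total loss stays bounded by a constant depending on ε, independent of T.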

  • MWUs: Unraveling the Update

    Update: p_i^(t+1) ∝ w_i^(t+1) = (1 − ε)^{ℓ_i^(t)} · w_i^(t)

    Unrolling over the rounds, each WEIGHT is exponential in the expert's CUMULATIVE LOSS:

    w_i^(t+1) = (1 − ε)^{Σ_{s=1}^t ℓ_i^(s)}

  • MWUs: Regret Bound

    For ‖ℓ^(t)‖_∞ ≤ 1 and ε < 1/2, with the update p_i^(t+1) ∝ w_i^(t+1) = (1 − ε)^{ℓ_i^(t)} · w_i^(t):

    L̂ ≤ L* + (log n)/(εT) + ε

    where L̂ is the algorithm's average loss and L* is the best expert's average loss. The term (log n)/(εT) is a start-up penalty; the term ε is the penalty for being greedy.

  • ONLINE LINEAR OPTIMIZATION BEYOND MWUs

    A REGULARIZATION FRAMEWORK

  • MWUs: Proof Sketch of Regret Bound

    Update: p_i^(t+1) ∝ w_i^(t+1) = (1 − ε)^{Σ_{s=1}^t ℓ_i^(s)}

    The proof is a potential function argument, with potential

    Φ^(t+1) = log_{1−ε} Σ_{i=1}^n w_i^(t+1)

    1) The potential function is a lower bound on the best expert's loss:

    Φ^(t+1) ≤ log_{1−ε} max_i w_i^(t+1) = min_i Σ_{s=1}^t ℓ_i^(s)

    2) The potential function tracks the algorithm's performance: up to a (1 + ε) factor,

    Φ^(t+1) − Φ^(t) ≥ ℓ^(t)T p^(t)

    DOES THIS PROOF TECHNIQUE GENERALIZE BEYOND THE SIMPLEX CASE?
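The two potential-function facts can be checked numerically. This is a sketch with made-up loss vectors; the (1 + ε) slack in the second fact is made explicit in the assertions.

```python
import math

def potential(w, eps):
    # Phi = log_{1-eps}(sum_i w_i) = ln(sum_i w_i) / ln(1 - eps)
    return math.log(sum(w)) / math.log(1 - eps)

def mwu_trace(losses, eps):
    """Run MWU and record, per round, the potential increase and the
    algorithm's expected loss; also return the final potential."""
    n = len(losses[0])
    w = [1.0] * n
    gains, exp_losses = [], []
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]
        w_new = [(1 - eps) ** li * wi for li, wi in zip(l, w)]
        gains.append(potential(w_new, eps) - potential(w, eps))
        exp_losses.append(sum(pi * li for pi, li in zip(p, l)))
        w = w_new
    return gains, exp_losses, potential(w, eps)
```

On any loss sequence in [0, 1], each potential increase covers the round's expected loss up to a (1 + ε) factor, and the final potential sits below the best expert's cumulative loss.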

  • MWUs AND APPLICATIONS

  • Designing a Regularized Update

    GOAL: design an update and its potential function analysis.

    QUESTION: choice of potential function?

    DESIDERATA: 1) lower-bounds the best expert's loss; 2) tracks the algorithm's performance.

    Attempt 1 — FOLLOW THE LEADER. Let L^(t) = Σ_{s=1}^t ℓ^(s) be the cumulative loss.

    Pick the best current solution:      x^(t+1) = argmin_{x ∈ X} x^T L^(t)
    Potential is the current best loss:  Φ^(t+1) = min_{x ∈ X} x^T L^(t)

    This fails if the best expert changes drastically between rounds. How can we make the update more stable?
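The instability of Follow the Leader is easy to exhibit: on the classic alternating-loss sequence below (my own illustrative instance, not from the talk), the leader flips every round, so FTL pays nearly the maximum possible loss while the best fixed expert pays about half of it.

```python
def ftl(losses):
    """Follow the Leader over the simplex: play the vertex (expert) with the
    smallest cumulative loss so far (ties broken toward the lowest index)."""
    n = len(losses[0])
    L = [0.0] * n                       # cumulative losses
    total = 0.0
    for l in losses:
        leader = min(range(n), key=lambda j: L[j])
        total += l[leader]
        L = [Lj + lj for Lj, lj in zip(L, l)]
    return total

# Alternating losses: after the first round, the leader is always wrong.
T = 100
losses = [[0.5, 0.0]] + [
    [0.0, 1.0] if t % 2 == 0 else [1.0, 0.0] for t in range(2, T + 1)
]
```

Here `ftl(losses)` comes out to 99.5 while the best fixed expert's total loss is 49.5: linear regret.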

  • Regularized Update: Definition

    Attempt 2 — FOLLOW THE REGULARIZED LEADER:

    x^(t+1) = argmin_{x ∈ X} x^T L^(t) + η F(x)
    Φ^(t+1) = min_{x ∈ X} x^T L^(t) + η F(x)

    Properties of the regularizer F(x):

    1. Convex, differentiable
    2. σ-strongly convex w.r.t. the norm ‖·‖

    The parameter η ≥ 0 is to be determined. These properties are actually sufficient to get a regret bound.

  • Regularized Update: Analysis

    The potential still approximately lower-bounds the best cumulative loss:

    Φ^(t+1) ≤ min_{x ∈ X} L^(t)T x + η max_{x ∈ X} F(x)

    where η max_{x ∈ X} F(x) is the regularization error.

  • Tracking the Algorithm: Proof by Picture

    Define f^(t+1)(x) = x^T L^(t) + η F(x), so that x^(t+1) is its minimizer and Φ^(t+1) = f^(t+1)(x^(t+1)).

    Notice: f^(t+1)(x) − f^(t)(x) = ℓ^(t)T x, the latest loss vector.

    We want to compare Φ^(t+1) − Φ^(t) with the algorithm's loss ℓ^(t)T x^(t). Indeed:

    Φ^(t+1) − Φ^(t) = f^(t+1)(x^(t+1)) − f^(t+1)(x^(t)) + ℓ^(t)T x^(t)

  • Regularization in Action

    With f^(t+1)(x) = x^T L^(t) + η F(x):

    REGULARIZATION: f^(t) is (ησ)-strongly convex, which gives a quadratic lower bound on f^(t+1) around its minimizer.

    STABILITY: since ‖∇f^(t+1) − ∇f^(t)‖ = ‖ℓ^(t)‖, the consecutive minimizers cannot move much:

    ‖x^(t+1) − x^(t)‖ ≤ ‖ℓ^(t)‖ / (ησ)
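The stability bound can be checked concretely for the entropy regularizer on the simplex, where σ = 1 w.r.t. the ℓ1 norm and the minimizers have a closed form. The specific numbers below are my own toy choices.

```python
import math

def ftrl_minimizer(L, eta):
    # Closed-form FTRL minimizer over the simplex with the entropy regularizer:
    # argmin_p p^T L + eta * sum_i p_i log p_i  is the softmax of -L/eta.
    w = [math.exp(-Li / eta) for Li in L]
    z = sum(w)
    return [wi / z for wi in w]

eta = 1.0                       # entropy is 1-strongly convex w.r.t. l1
L_old = [0.0, 0.0]
loss = [1.0, 0.0]               # new loss vector, ||loss||_inf = 1
L_new = [a + b for a, b in zip(L_old, loss)]

x_old = ftrl_minimizer(L_old, eta)
x_new = ftrl_minimizer(L_new, eta)
move = sum(abs(a - b) for a, b in zip(x_new, x_old))   # l1 movement
```

The movement of the minimizer stays within the promised radius ‖ℓ‖ / (ησ) = 1/η.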

  • Analysis: Progress in One Iteration

    Since x^(t) minimizes f^(t), we have ∇f^(t+1)(x^(t)) = ℓ^(t) (taking x^(t) in the interior of X). By (ησ)-strong convexity of f^(t+1):

    f^(t+1)(x^(t+1)) − f^(t+1)(x^(t)) ≥ ℓ^(t)T (x^(t+1) − x^(t)) + (ησ/2) ‖x^(t+1) − x^(t)‖²
                                      ≥ −‖ℓ^(t)‖ ‖x^(t+1) − x^(t)‖ + (ησ/2) ‖x^(t+1) − x^(t)‖²
                                      ≥ −‖ℓ^(t)‖² / (2ησ)

    Combined with the identity Φ^(t+1) − Φ^(t) = f^(t+1)(x^(t+1)) − f^(t+1)(x^(t)) + ℓ^(t)T x^(t), this yields:

    Φ^(t+1) − Φ^(t) ≥ ℓ^(t)T x^(t) − ‖ℓ^(t)‖² / (2ησ)

  • Completing the Analysis

    Progress in one iteration (the regret at iteration t):

    Φ^(t+1) − Φ^(t) ≥ ℓ^(t)T x^(t) − ‖ℓ^(t)‖² / (2ησ)

    Telescoping over t = 1, …, T:

    Φ^(T+1) − Φ^(1) ≥ Σ_{t=1}^T ℓ^(t)T x^(t) − T · max_t ‖ℓ^(t)‖² / (2ησ)

    Final regret bound, with regularizer F and ‖ℓ^(t)‖ ≤ ρ:

    (1/T) ( Σ_{t=1}^T ℓ^(t)T x^(t) − min_{x ∈ X} Σ_{t=1}^T ℓ^(t)T x )
        ≤ (η/T) (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) + ρ² / (2ησ)

    The first term is a start-up penalty, the second a penalty for being greedy — THE SAME TYPE OF BOUND AS FOR MWUs.

  • Reinterpreting MWUs

    Regularizer: F(p) = Σ_{i=1}^n p_i log p_i is negative entropy; it is 1-strongly convex w.r.t. ‖·‖₁.

    Potential function: Φ^(t+1) = min_{p ≥ 0, Σ p_i = 1} p^T L^(t) + η Σ_{i=1}^n p_i log p_i

    Update: p^(t+1) = argmin_{p ≥ 0, Σ p_i = 1} p^T L^(t) + η Σ_{i=1}^n p_i log p_i, which has the closed form

    p_i^(t+1) = e^{−(1/η) L_i^(t)} / Σ_{j=1}^n e^{−(1/η) L_j^(t)} = (1 − ε)^{L_i^(t)} / Σ_{j=1}^n (1 − ε)^{L_j^(t)}

    for (1 − ε) = e^{−1/η}: the SOFT-MAX of the cumulative losses — exactly the MWU update.
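The algebraic identity between the entropy-FTRL softmax and the normalized MWU weights can be sanity-checked directly. The value of ε and the cumulative losses below are arbitrary choices of mine, with (1 − ε) = e^{−1/η}.

```python
import math

eps = 0.2
eta = -1.0 / math.log(1 - eps)        # chosen so that e^{-1/eta} = 1 - eps
L = [3.0, 1.0, 0.0]                   # arbitrary cumulative losses

# Entropy-FTRL closed form: softmax of -L/eta.
soft = [math.exp(-Li / eta) for Li in L]
soft = [s / sum(soft) for s in soft]

# MWU weights: (1 - eps)^{L_i}, normalized.
mwu = [(1 - eps) ** Li for Li in L]
mwu = [m / sum(mwu) for m in mwu]
```

Both normalizations produce the same distribution, coordinate by coordinate.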

  • Beyond MWUs: which regularizer?

    Regret bound, optimizing over η:

    (1/T) ( Σ_{t=1}^T ℓ^(t)T x^(t) − min_{x ∈ X} Σ_{t=1}^T ℓ^(t)T x )
        ≤ ρ √( 2 (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) / σ ) / √T

    The best choice of regularizer and norm minimizes max_t ‖ℓ^(t)‖² (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) / σ.

    Negative entropy with the ℓ₁-norm is approximately optimal for the simplex.

    QUESTION: are other regularizers ever useful?

  • Different Regularizers in Algorithm Design

    QUESTION 1: Are other regularizers, besides entropy, ever useful?

    YES! Application: Graph Partitioning and Random Walks.

    Spectral algorithms for balanced separator running in time Õ(m). These use a random-walk framework and SDP MWUs; different walks (Heat Kernel Random Walk, Lazy Random Walk, Personalized PageRank) correspond to different regularizers for the eigenvector problem:

    F(X) = Tr(X log X)   (SDP MWU)
    F(X) = Tr(X^p)       (p-norm, 1 ≤ p ≤ ∞ — NEW REGULARIZER)
    F(X) = Tr(X^{1/2})

    [Mahoney, Orecchia, Vishnoi 2011], [Orecchia, Sachdeva, Vishnoi 2012]

  • Different Regularizers in Algorithm Design

    Application: Sparsification.

    (1 + ε)-spectral-sparsifiers with O(n log n / ε²) edges use a matrix concentration bound equivalent to SDP MWUs [Spielman, Srivastava 2008].

    (1 + ε)-spectral-sparsifiers with O(n / ε²) edges can be interpreted as a different regularizer, F(X) = Tr(X^{1/2}) [Batson, Spielman, Srivastava 2009].

  • Different Regularizers in Algorithm Design

    Many more applications in online learning, e.g. bandit online learning [AHR].

  • NON-SMOOTH CONVEX OPTIMIZATION

    REDUCES TO

    ONLINE LINEAR OPTIMIZATION

  • Convex Optimization Setup

    min_{x ∈ X} f(x),   with f convex and differentiable, and X ⊆ R^n a closed, convex set.

    NON-SMOOTH (ρ-Lipschitz continuous):       ∀x ∈ X, ‖∇f(x)‖ ≤ ρ
    SMOOTH (L-Lipschitz continuous gradient):  ∀x, y ∈ X, ‖∇f(y) − ∇f(x)‖ ≤ L ‖y − x‖

    In the smooth case, a gradient step is guaranteed to decrease the function value:

    f(x^(t+1)) ≤ f(x^(t)) − ‖∇f(x^(t))‖² / (2L)

    In the non-smooth case there is NO GRADIENT STEP GUARANTEE — ONLY A DUAL GUARANTEE.
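The smooth-case guarantee can be seen exactly on a one-dimensional quadratic (my own toy example, where the inequality happens to be tight):

```python
def gd_step(fprime, x, L):
    # One gradient step with the standard step size 1/L for an L-smooth function.
    return x - fprime(x) / L

f = lambda x: x * x           # L-smooth with L = 2
fprime = lambda x: 2 * x

x = 3.0
x_new = gd_step(fprime, x, 2.0)
guaranteed = fprime(x) ** 2 / (2 * 2.0)   # promised decrease ||grad f||^2 / (2L)
```

For this quadratic, the step lands exactly at the minimizer and the achieved decrease f(x) − f(x_new) matches the guarantee.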

  • Non-Smooth Setup: Dual Approach

    f convex, differentiable and ρ-Lipschitz (∀x ∈ X, ‖∇f(x)‖ ≤ ρ); X ⊆ R^n closed, convex; min_{x ∈ X} f(x).

    APPROACH: each iterate provides an upper bound and a lower bound on the optimum value f(x*):

    UPPER: f(x^(t)) ≥ f(x*)
    LOWER: f(x*) ≥ f(x^(t)) + ∇f(x^(t))^T (x* − x^(t))

    (The differentiability assumption can be weakened: subgradients suffice.)

    Take convex combinations of the upper bounds and of the lower bounds with weights λ_t:

    UPPER: (1 / Σ_t λ_t) Σ_{t=1}^T λ_t f(x^(t)) ≥ f(x*)
    LOWER: f(x*) ≥ (1 / Σ_t λ_t) Σ_{t=1}^T λ_t ( f(x^(t)) + ∇f(x^(t))^T (x* − x^(t)) )

    HOW TO UPDATE THE ITERATES? HOW TO CHOOSE THE WEIGHTS?

  • Reduction to Online Linear Minimization

    Fix the weights λ_t to be uniform for simplicity. Subtracting the lower bound from the upper bound:

    DUALITY GAP: (1/T) Σ_{t=1}^T f(x^(t)) − f(x*) ≤ (1/T) Σ_{t=1}^T ∇f(x^(t))^T (x^(t) − x*)

    The right-hand side is a sum of LINEAR functions of the iterates — exactly the online setup:

    ALGORITHM: plays x^(t) ∈ X
    ADVERSARY: reveals the loss vector ℓ^(t) = ∇f(x^(t))

    Recall that by assumption the loss vector is a gradient: ‖ℓ^(t)‖ = ‖∇f(x^(t))‖ ≤ ρ. And

    (1/T) Σ_{t=1}^T ∇f(x^(t))^T (x^(t) − x*) = REGRET

  • Final Bound

    Running the online algorithm on the losses ℓ^(t) = ∇f(x^(t)) and averaging the iterates, the duality gap is bounded by the regret. The error bound with a σ-strongly-convex regularizer F is:

    ε_MD ≤ ρ √( 2 (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) / σ ) / √T

    RESULTING ALGORITHM: MIRROR DESCENT — ASYMPTOTICALLY OPTIMAL BY AN INFORMATION COMPLEXITY LOWER BOUND.

  • Non-Smooth Optimization over Simplex

    With the negative-entropy regularizer F and ‖∇f(x^(t))‖_∞ ≤ 1:

    ε_MD ≤ √(2 log n) / √T

    RESULTING ALGORITHM: MIRROR DESCENT OVER THE SIMPLEX = MWU
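A minimal sketch of entropy mirror descent over the simplex. The instance is my own toy choice, not from the talk: minimizing the non-smooth function f(p) = max_i p_i, whose optimum is the uniform distribution with value 1/n; the step size and iteration count are illustrative.

```python
import math

def mirror_descent_simplex(subgrad, n, T, step):
    """Entropy mirror descent over the simplex: multiplicative updates on the
    subgradients. Returns the averaged iterate, which carries the guarantee."""
    p = [1.0 / n] * n
    avg = [0.0] * n
    for _ in range(T):
        g = subgrad(p)
        w = [pi * math.exp(-step * gi) for pi, gi in zip(p, g)]
        z = sum(w)
        p = [wi / z for wi in w]
        avg = [ai + pi / T for ai, pi in zip(avg, p)]
    return avg

def subgrad_max(p):
    # A subgradient of f(p) = max_i p_i is the indicator of a maximal coordinate.
    i = max(range(len(p)), key=lambda j: p[j])
    return [1.0 if j == i else 0.0 for j in range(len(p))]
```

Starting from the uniform distribution, the iterates oscillate in a small band around the optimum, and the averaged iterate stays close to uniform.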

  • APPLICATIONS IN ALGORITHM DESIGN

  • Warm-up Example: Linear Programming

    LP feasibility problem: given A ∈ R^{m×n}, does there exist x ∈ X with b − Ax ≤ 0?

    Easy constraints (x ∈ X): maintain feasible. Hard constraints (b − Ax ≤ 0): require fixing.

    Convert into a non-smooth optimization problem over the simplex:

    min_{p ∈ Δ_m} max_{x ∈ X} p^T (b − Ax)

    The non-differentiable objective f(p) = max_{x ∈ X} p^T (b − Ax) is the best response to the dual solution p. It admits subgradients: for all p, a best response x_p with p^T (b − Ax_p) ≤ 0 gives (b − Ax_p) ∈ ∂f(p) — the subgradient is the slack in the constraints.

    If we can pick x_p such that ‖b − Ax_p‖_∞ ≤ ρ, then

    ε_MD ≤ ρ √(2 log m) / √T,   so   T = O(ρ² log m / ε²).
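A toy instantiation of this dual scheme. This is entirely my own illustrative construction: the domain X = [0, 1]^n, the threshold best-response oracle, and the instance are not from the talk.

```python
def lp_feasibility_mwu(A, b, T, eps):
    """MWU-based sketch for LP feasibility over the box X = [0,1]^n:
    is there x in X with Ax >= b?  The dual player runs MWU over the m
    constraints; the primal oracle best-responds; the averaged iterate is
    returned as a candidate solution."""
    m, n = len(A), len(A[0])
    w = [1.0] * m
    x_avg = [0.0] * n
    for _ in range(T):
        s = sum(w)
        p = [wi / s for wi in w]
        # Oracle: maximize p^T (Ax - b) over the box, i.e. set x_j = 1 iff the
        # p-weighted column sum is positive.
        col = [sum(p[i] * A[i][j] for i in range(m)) for j in range(n)]
        x = [1.0 if cj > 0 else 0.0 for cj in col]
        x_avg = [aj + xj / T for aj, xj in zip(x_avg, x)]
        # The subgradient is the constraint slack: well-satisfied constraints
        # lose weight, violated ones gain relative weight.
        slack = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        w = [wi * (1 - eps) ** si for wi, si in zip(w, slack)]
    return x_avg
```

On a feasible instance, the averaged iterate approximately satisfies all the hard constraints.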

  • MWU and s-t Maxflow

    Maximum flow feasibility for value F over an undirected graph G with incidence matrix B:

    ∀e ∈ E, F |f_e| / c_e ≤ 1,   B^T f = e_s − e_t   (we will enforce the latter)

    Turn into a non-smooth minimization problem over the simplex:

    f(p) = min_{B^T f = e_s − e_t} Σ_{e ∈ E} p_e ( F |f_e| / c_e − 1 )

    The best response f_p is a shortest s-t path with lengths p_e / c_e. For any p, if f_p has length > 1, the problem is infeasible. Otherwise, the following is a subgradient:

    ∂f(p)_e = F |(f_p)_e| / c_e − 1

    Unfortunately, the width can be large: ‖∂f(p)‖_∞ ≤ F / c_min, giving [PST 91] T = O(F log n / (ε² c_min)).

  • Width Reduction: Make the Primal Nicer

    PROBLEM: the width ‖∂f(p)‖_∞ ≤ F / c_min is optimal for this specific formulation.

    SOLUTION: regularize the primal — this NEEDS A PRIMAL ARGUMENT:

    f(p) = min_{B^T f = e_s − e_t} F Σ_{e ∈ E} (f_e / c_e) ( p_e + ε/m ) − 1

    REGULARIZATION ERROR: ε F
    NEW WIDTH: ‖∂f(p)‖_∞ ≤ m / ε
    ITERATION BOUND: [GK 98] T = O(m log n / ε²)

  • Electrical Flow Approach [CKMST]

    A different formulation yields the basis for the CKMST algorithm:

    ∀e ∈ E, F² f_e² / c_e² ≤ 1,   B^T f = e_s − e_t   (we will enforce the latter)

    Non-smooth optimization problem:

    f(p) = min_{B^T f = e_s − e_t} Σ_{e ∈ E} p_e ( F² f_e² / c_e² − 1 )

    The best response is an electrical flow f_p. Original width: ‖∂f(p)‖_∞ ≤ m.

    Regularize the primal:

    f(p) = min_{B^T f = e_s − e_t} F² Σ_{e ∈ E} (f_e² / c_e²) ( p_e + ε/m ) − 1,   giving   ‖∂f(p)‖_∞ ≤ √(m / ε).

  • Conclusion: Take-away Messages

    Regularization is a powerful tool for the design of fast algorithms.

    Most iterative algorithms can be understood as regularized updates: MWUs, width reduction, interior point, gradient descent, …

    These methods perform well in practice; regularization also helps eliminate noise.

    ULTIMATE GOAL: the development of a library of iterative methods for fast graph algorithms. Regularization plays a fundamental role in this effort.

  • THE END — THANK YOU