Page 1: Exponentiated Gradient versus Gradient Descent for Linear

Exponentiated Gradient versus Gradient Descent for Linear Predictors

Jyrki Kivinen and Manfred Warmuth

Presented By: Maitreyi N

Page 2: Exponentiated Gradient versus Gradient Descent for Linear

Linear Predictors

A good linear predictor will satisfy the bound:

Loss_L(A, S) = O(inf_{u∈U} Loss_L(u, S))

The bound can be improved to:

Loss_L(A, S) = (1 + o(1)) inf_{u∈U} Loss_L(u, S)

where o(1) → 0 as the number of trials ℓ → ∞.

Page 3: Exponentiated Gradient versus Gradient Descent for Linear

Gradient Descent

This algorithm uses the update rule:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t

This is the gradient of the squared Euclidean distance:

d(w, s) = (1/2) ‖w − s‖₂²
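A minimal sketch of this step for the square loss (NumPy; the function name, vectors, and learning rate are illustrative):

```python
import numpy as np

def gd_update(w, x, y, eta):
    # predict with the current weights, then step against the
    # gradient of the squared error (y_hat - y)^2
    y_hat = float(np.dot(w, x))
    return w - 2 * eta * (y_hat - y) * x

w = gd_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), y=1.0, eta=0.1)
```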

Page 4: Exponentiated Gradient versus Gradient Descent for Linear

Exponentiated Gradient

This algorithm uses the update rule:

w_{t+1,i} = r_{t,i} w_{t,i} / Σ_{j=1}^N r_{t,j} w_{t,j},  where r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

This is the gradient of the Relative Entropy:

d_re(w, s) = Σ_{i=1}^N w_i ln(w_i / s_i)
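A minimal sketch of the multiplicative step for the square loss (illustrative names and values):

```python
import numpy as np

def eg_update(w, x, y, eta):
    # multiply each weight by r_i = exp(-2*eta*(y_hat - y)*x_i),
    # then renormalize so the weights stay a probability vector
    y_hat = float(np.dot(w, x))
    r = np.exp(-2 * eta * (y_hat - y) * x)
    w_new = w * r
    return w_new / w_new.sum()

w_new = eg_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), y=1.0, eta=0.1)
```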

Page 5: Exponentiated Gradient versus Gradient Descent for Linear

Algorithm GDL(s, η)

Parameters:
L: a loss function from R × R to [0, ∞),
s: a start vector in R^N, and
η: a learning rate in [0, ∞).

Initialization: Before the first trial, set w1=s.

Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .

Update: Upon receiving the t th outcome yt, update the weights according to the rule

w_{t+1} = w_t − η L′_{y_t}(ŷ_t) x_t
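The whole GD_L trial loop can be sketched as follows, assuming the square loss (whose derivative in the prediction is 2(ŷ − y)); names and values are illustrative:

```python
import numpy as np

def run_gdl(s, eta, trials, loss_deriv=lambda y, y_hat: 2 * (y_hat - y)):
    # GD_L: set w1 = s; on each trial predict y_hat = w . x,
    # then update w <- w - eta * L'_y(y_hat) * x
    w = np.asarray(s, dtype=float).copy()
    for x, y in trials:
        y_hat = float(np.dot(w, x))
        w = w - eta * loss_deriv(y, y_hat) * np.asarray(x)
    return w

w = run_gdl([0.5, 0.5], 0.1, [(np.array([1.0, 0.0]), 1.0)])
```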

Page 6: Exponentiated Gradient versus Gradient Descent for Linear

Algorithm EGL(s, η)

Parameters:
L: a loss function from R × R to [0, ∞),
s: a start vector with Σ_{i=1}^N s_i = 1, and
η: a learning rate in [0, ∞).

Initialization: Before the first trial, set w1=s.

Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .

Update: Upon receiving the t th outcome yt, update the weights according to the rule

w_{t+1,i} = w_{t,i} e^{−η L′_{y_t}(ŷ_t) x_{t,i}} / Σ_{j=1}^N w_{t,j} e^{−η L′_{y_t}(ŷ_t) x_{t,j}}
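The update-rule image did not survive the transcript; assuming the standard EG_L rule from the paper (exponentiate the negated loss gradient, then renormalize), the trial loop can be sketched as:

```python
import numpy as np

def run_egl(s, eta, trials, loss_deriv=lambda y, y_hat: 2 * (y_hat - y)):
    # EG_L: set w1 = s (entries sum to 1); on each trial multiply
    # w_i by exp(-eta * L'_y(y_hat) * x_i) and renormalize to sum 1
    w = np.asarray(s, dtype=float).copy()
    for x, y in trials:
        y_hat = float(np.dot(w, x))
        w = w * np.exp(-eta * loss_deriv(y, y_hat) * np.asarray(x))
        w = w / w.sum()
    return w

w = run_egl([0.5, 0.5], 0.1, [(np.array([1.0, 0.0]), 1.0)])
```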

Page 7: Exponentiated Gradient versus Gradient Descent for Linear

EG± : EG with negative weights

EG is analogous to the Weighted Majority Algorithm:
- Uses multiplicative update rules
- Is based on minimizing relative entropy
- Unfortunately, it can represent only positive concepts

EG± can represent any concept in the entire sample space.

- It has proven relative bounds; absolute bounds are not proven.
- Works by splitting the weight vector into positive and negative weights, with separate update rules.

Page 8: Exponentiated Gradient versus Gradient Descent for Linear

EG± Algorithm:

Page 9: Exponentiated Gradient versus Gradient Descent for Linear

Update:

Page 10: Exponentiated Gradient versus Gradient Descent for Linear

EG±

EG: Update rule

w_{t+1,i} = r_{t,i} w_{t,i} / Σ_{j=1}^N r_{t,j} w_{t,j},  r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

EG±: Update rule

w⁺_{t+1,i} = w⁺_{t,i} r⁺_{t,i} / Σ_{j=1}^N (w⁺_{t,j} r⁺_{t,j} + w⁻_{t,j} r⁻_{t,j})

w⁻_{t+1,i} = w⁻_{t,i} r⁻_{t,i} / Σ_{j=1}^N (w⁺_{t,j} r⁺_{t,j} + w⁻_{t,j} r⁻_{t,j})

where r⁺_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}} and r⁻_{t,i} = 1 / r⁺_{t,i}
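A sketch of one EG± step for the square loss, assuming total weight U = 1 (illustrative names and values):

```python
import numpy as np

def eg_pm_update(wp, wm, x, y, eta):
    # EG± step: positive and negative halves share one normalizer,
    # so the combined weight mass stays 1
    y_hat = float(np.dot(wp - wm, x))
    rp = np.exp(-2 * eta * (y_hat - y) * x)
    rm = 1.0 / rp                    # r- is the reciprocal of r+
    z = np.sum(wp * rp + wm * rm)    # joint normalizer over both halves
    return wp * rp / z, wm * rm / z

wp, wm = eg_pm_update(np.array([0.25, 0.25]), np.array([0.25, 0.25]),
                      np.array([1.0, 0.0]), y=1.0, eta=0.1)
```

Because the prediction uses w⁺ − w⁻, the step above shifts mass from the negative to the positive half of coordinate 0, pushing ŷ toward the outcome y = 1.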

Page 11: Exponentiated Gradient versus Gradient Descent for Linear

Variable Learning Rates

GDV

Weight update rule becomes:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t / ‖x_t‖₂²

EGV±

Weight update rule becomes:

r⁺_{t,i} = exp(−2η(ŷ_t − y_t) U x_{t,i} / ‖x_t‖∞²),  r⁻_{t,i} = 1 / r⁺_{t,i}
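A sketch of a GD step whose effective rate is scaled per trial by the squared 2-norm of the current instance (square loss assumed; names and values illustrative):

```python
import numpy as np

def gdv_update(w, x, y, eta):
    # GDV: divide the GD step by ||x_t||_2^2 so the effective
    # learning rate adapts to the size of each instance
    y_hat = float(np.dot(w, x))
    return w - 2 * eta * (y_hat - y) * x / np.dot(x, x)

w = gdv_update(np.array([0.0, 0.0]), np.array([2.0, 0.0]), y=1.0, eta=0.5)
```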

Page 12: Exponentiated Gradient versus Gradient Descent for Linear

Approximated EG Algorithms

Use the approximation

e^{−av} ≈ e^{−av₀} (1 − a(v − v₀))

So the update rule becomes

w_{t+1,i} = w_{t,i} (1 − η L′_{y_t}(ŷ_t)(x_{t,i} − ŷ_t))

The approximation leads to oscillation of the weight vector for certain weight distributions.
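A sketch of the approximated update for the square loss. No renormalization is needed: since Σ_i w_i (x_i − ŷ) = 0 when the weights sum to 1, the first-order correction preserves the total weight.

```python
import numpy as np

def approx_eg_update(w, x, y, eta):
    # replace exp(-eta*L'*x_i) by its first-order expansion around
    # the prediction; weights still sum to 1 because the correction
    # terms (x_i - y_hat) have weighted mean zero
    y_hat = float(np.dot(w, x))
    return w * (1 - 2 * eta * (y_hat - y) * (x - y_hat))

w = approx_eg_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), y=1.0, eta=0.1)
```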

Page 13: Exponentiated Gradient versus Gradient Descent for Linear

Worst Case Loss Bounds

Gradient Descent

Loss(GD(s,η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/c) ‖u − s‖₂² X₂²

EG

Loss(EG(s,η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/c) R² d_re(u, s)

where c > 0 and η = 2c/(R²(2 + c))

Page 14: Exponentiated Gradient versus Gradient Descent for Linear

Worst Case Loss Bounds

EG±

Loss(EG±(U, s, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 4/c) U² X² d_re(u′/U, s′)

where R = 2UX, η = 2c/(R²(2 + c)), and u′, s′ are the split (positive/negative) versions of u and s

Page 15: Exponentiated Gradient versus Gradient Descent for Linear

Other Algorithms

Gradient projection algorithm (GP):
- Has similar bounds to GD
- Uses the constraint that the weights must sum to 1

Exponentiated Gradient algorithm with Unnormalized weights (EGU)

When all outcomes, inputs and comparison vectors are positive, it has the bound:

Loss(EGU(s, Y, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 1/c) XY d_reu(u, s)

where d_reu is the unnormalized relative entropy.
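A sketch of the unnormalized step for the square loss, omitting the scaling parameter Y (names and values illustrative):

```python
import numpy as np

def egu_update(w, x, y, eta):
    # EGU: same exponentiated step as EG, but the weights are
    # never renormalized; they simply stay positive
    y_hat = float(np.dot(w, x))
    return w * np.exp(-2 * eta * (y_hat - y) * x)

w = egu_update(np.array([1.0, 1.0]), np.array([1.0, 0.0]), y=2.0, eta=0.1)
```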

Page 16: Exponentiated Gradient versus Gradient Descent for Linear

Experiments

- Have a fixed target concept u ∈ R^N; u gives the weight of each input variable
- Use ℓ instances of input x_t, drawn from a probability measure on R^N
- Random noise is added to the inputs
- Run each algorithm on the (same) inputs
- Plot cumulative losses for each algorithm
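The setup above can be sketched as follows; the learning rates, trial count, noise level, and input distribution are illustrative, not the paper's exact experimental settings:

```python
import numpy as np

def run_experiment(algos, u, n_trials=200, noise=0.1, seed=0):
    # feed every algorithm the same noisy linear data
    # and total up each one's cumulative square loss
    rng = np.random.default_rng(seed)
    N = len(u)
    ws = {name: np.full(N, 1.0 / N) for name in algos}
    losses = {name: 0.0 for name in algos}
    for _ in range(n_trials):
        x = rng.uniform(0.0, 1.0, N)
        y = float(np.dot(u, x)) + rng.normal(0.0, noise)
        for name, update in algos.items():
            y_hat = float(np.dot(ws[name], x))
            losses[name] += (y_hat - y) ** 2
            ws[name] = update(ws[name], x, y)
    return losses

def gd(w, x, y, eta=0.05):
    return w - 2 * eta * (float(np.dot(w, x)) - y) * x

def eg(w, x, y, eta=0.05):
    v = w * np.exp(-2 * eta * (float(np.dot(w, x)) - y) * x)
    return v / v.sum()

u = np.zeros(20)
u[0] = 1.0                 # sparse target: one relevant variable out of 20
totals = run_experiment({"GD": gd, "EG": eg}, u)
```

A sparse target like this is the regime where the paper's results show EG's logarithmic dependence on N paying off.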

Page 17: Exponentiated Gradient versus Gradient Descent for Linear

Results

Pages 18–21: Results (further cumulative-loss plots)

Page 22: Exponentiated Gradient versus Gradient Descent for Linear

GD vs. EG

- Random errors confuse GD much more
- When the number of relevant variables is constant:
  - Loss(GD) grows linearly in N
  - Loss(EG) grows logarithmically in N
- GD does better when all variables are relevant and the input is consistent (few or no errors)

Page 23: Exponentiated Gradient versus Gradient Descent for Linear

Conclusion

- Worst-case loss bounds exist only for the square loss; loss bounds for the relative entropy loss are still needed.
- GD has provably optimal bounds.
- Lower bounds for EG and EG± are still required.
- EG and EG± perform better in error-prone learning environments.