Page 1: Exponentiated Gradient versus Gradient Descent for Linear

Exponentiated Gradient versus Gradient Descent for Linear Predictors

Jyrki Kivinen and Manfred Warmuth

Presented By: Maitreyi N

Page 2: Exponentiated Gradient versus Gradient Descent for Linear

Linear Predictors

A good linear predictor will satisfy the bound:

Loss_L(A, S) = O(inf_{u∈U} Loss_L(u, S))

The bound can be improved to:

Loss_L(A, S) = (1 + o(1)) inf_{u∈U} Loss_L(u, S)

where o(1) → 0 as the number of trials ℓ → ∞.

Page 3: Exponentiated Gradient versus Gradient Descent for Linear

Gradient Descent

This algorithm uses the update rule:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t

This is the gradient of the squared Euclidean distance:

d(w, s) = (1/2) ‖w − s‖₂²
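A minimal sketch of this step for the square loss (NumPy; the function name, vectors, and learning rate are illustrative):

```python
import numpy as np

def gd_update(w, x, y, eta):
    # predict with the current weights, then step against the
    # gradient of the squared error (y_hat - y)^2
    y_hat = float(np.dot(w, x))
    return w - 2 * eta * (y_hat - y) * x

w = gd_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), y=1.0, eta=0.1)
```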

Page 4: Exponentiated Gradient versus Gradient Descent for Linear

Exponentiated Gradient

This algorithm uses the update rule:

w_{t+1,i} = r_{t,i} w_{t,i} / Σ_{j=1}^N r_{t,j} w_{t,j},  where r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

This is the gradient of the Relative Entropy:

d_re(w, s) = Σ_{i=1}^N w_i ln(w_i / s_i)
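A minimal sketch of the multiplicative step for the square loss (illustrative names and values):

```python
import numpy as np

def eg_update(w, x, y, eta):
    # multiply each weight by r_i = exp(-2*eta*(y_hat - y)*x_i),
    # then renormalize so the weights stay a probability vector
    y_hat = float(np.dot(w, x))
    r = np.exp(-2 * eta * (y_hat - y) * x)
    w_new = w * r
    return w_new / w_new.sum()

w_new = eg_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), y=1.0, eta=0.1)
```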

Page 5: Exponentiated Gradient versus Gradient Descent for Linear

Algorithm GDL(s, η)

Parameters:
L: a loss function from R × R to [0, ∞),
s: a start vector in R^N, and
η: a learning rate in [0, ∞).

Initialization: Before the first trial, set w1=s.

Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .

Update: Upon receiving the t th outcome yt, update the weights according to the rule

w_{t+1} = w_t − η L′_{y_t}(ŷ_t) x_t
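The whole GD_L trial loop can be sketched as follows, assuming the square loss (whose derivative in the prediction is 2(ŷ − y)); names and values are illustrative:

```python
import numpy as np

def run_gdl(s, eta, trials, loss_deriv=lambda y, y_hat: 2 * (y_hat - y)):
    # GD_L: set w1 = s; on each trial predict y_hat = w . x,
    # then update w <- w - eta * L'_y(y_hat) * x
    w = np.asarray(s, dtype=float).copy()
    for x, y in trials:
        y_hat = float(np.dot(w, x))
        w = w - eta * loss_deriv(y, y_hat) * np.asarray(x)
    return w

w = run_gdl([0.5, 0.5], 0.1, [(np.array([1.0, 0.0]), 1.0)])
```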

Page 6: Exponentiated Gradient versus Gradient Descent for Linear

Algorithm EGL(s, η)

Parameters:
L: a loss function from R × R to [0, ∞),
s: a start vector with Σ_{i=1}^N s_i = 1, and
η: a learning rate in [0, ∞).

Initialization: Before the first trial, set w1=s.

Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .

Update: Upon receiving the t th outcome yt, update the weights according to the rule

w_{t+1,i} = w_{t,i} e^{−η L′_{y_t}(ŷ_t) x_{t,i}} / Σ_{j=1}^N w_{t,j} e^{−η L′_{y_t}(ŷ_t) x_{t,j}}
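The update-rule image did not survive the transcript; assuming the standard EG_L rule from the paper (exponentiate the negated loss gradient, then renormalize), the trial loop can be sketched as:

```python
import numpy as np

def run_egl(s, eta, trials, loss_deriv=lambda y, y_hat: 2 * (y_hat - y)):
    # EG_L: set w1 = s (entries sum to 1); on each trial multiply
    # w_i by exp(-eta * L'_y(y_hat) * x_i) and renormalize to sum 1
    w = np.asarray(s, dtype=float).copy()
    for x, y in trials:
        y_hat = float(np.dot(w, x))
        w = w * np.exp(-eta * loss_deriv(y, y_hat) * np.asarray(x))
        w = w / w.sum()
    return w

w = run_egl([0.5, 0.5], 0.1, [(np.array([1.0, 0.0]), 1.0)])
```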

Page 7: Exponentiated Gradient versus Gradient Descent for Linear

EG± : EG with negative weights

EG is analogous to the Weighted Majority Algorithm:
- Uses multiplicative update rules
- Is based on minimizing relative entropy
- Unfortunately, it can represent only positive concepts

EG± can represent any concept in the entire sample space.

- It has proven relative bounds; absolute bounds are not proven.
- Works by splitting the weight vector into positive and negative weights, with separate update rules.

Page 8: Exponentiated Gradient versus Gradient Descent for Linear

EG± Algorithm:

Page 9: Exponentiated Gradient versus Gradient Descent for Linear

Update:

Page 10: Exponentiated Gradient versus Gradient Descent for Linear

EG±

EG: Update rule

w_{t+1,i} = r_{t,i} w_{t,i} / Σ_{j=1}^N r_{t,j} w_{t,j},  r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

EG±: Update rule

w⁺_{t+1,i} = w⁺_{t,i} r⁺_{t,i} / Σ_{j=1}^N (w⁺_{t,j} r⁺_{t,j} + w⁻_{t,j} r⁻_{t,j})

w⁻_{t+1,i} = w⁻_{t,i} r⁻_{t,i} / Σ_{j=1}^N (w⁺_{t,j} r⁺_{t,j} + w⁻_{t,j} r⁻_{t,j})

where r⁺_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}} and r⁻_{t,i} = 1 / r⁺_{t,i}
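A sketch of one EG± step for the square loss, assuming total weight U = 1 (illustrative names and values):

```python
import numpy as np

def eg_pm_update(wp, wm, x, y, eta):
    # EG± step: positive and negative halves share one normalizer,
    # so the combined weight mass stays 1
    y_hat = float(np.dot(wp - wm, x))
    rp = np.exp(-2 * eta * (y_hat - y) * x)
    rm = 1.0 / rp                    # r- is the reciprocal of r+
    z = np.sum(wp * rp + wm * rm)    # joint normalizer over both halves
    return wp * rp / z, wm * rm / z

wp, wm = eg_pm_update(np.array([0.25, 0.25]), np.array([0.25, 0.25]),
                      np.array([1.0, 0.0]), y=1.0, eta=0.1)
```

Because the prediction uses w⁺ − w⁻, the step above shifts mass from the negative to the positive half of coordinate 0, pushing ŷ toward the outcome y = 1.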

Page 11: Exponentiated Gradient versus Gradient Descent for Linear

Variable Learning Rates

GDV

Weight update rule becomes:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t / ‖x_t‖₂²

EGV±

Weight update rule becomes:

r⁺_{t,i} = exp(−2η(ŷ_t − y_t) U x_{t,i} / ‖x_t‖∞²),  r⁻_{t,i} = 1 / r⁺_{t,i}
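A sketch of a GD step whose effective rate is scaled per trial by the squared 2-norm of the current instance (square loss assumed; names and values illustrative):

```python
import numpy as np

def gdv_update(w, x, y, eta):
    # GDV: divide the GD step by ||x_t||_2^2 so the effective
    # learning rate adapts to the size of each instance
    y_hat = float(np.dot(w, x))
    return w - 2 * eta * (y_hat - y) * x / np.dot(x, x)

w = gdv_update(np.array([0.0, 0.0]), np.array([2.0, 0.0]), y=1.0, eta=0.5)
```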

Page 12: Exponentiated Gradient versus Gradient Descent for Linear

Approximated EG Algorithms

Use the approximation

e^{−av} ≈ e^{−av₀} (1 − a(v − v₀))

So the update rule becomes

w_{t+1,i} = w_{t,i} (1 − η L′_{y_t}(ŷ_t)(x_{t,i} − ŷ_t))

The approximation leads to oscillation of the weight vector for certain weight distributions.
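A sketch of the approximated update for the square loss. No renormalization is needed: since Σ_i w_i (x_i − ŷ) = 0 when the weights sum to 1, the first-order correction preserves the total weight.

```python
import numpy as np

def approx_eg_update(w, x, y, eta):
    # replace exp(-eta*L'*x_i) by its first-order expansion around
    # the prediction; weights still sum to 1 because the correction
    # terms (x_i - y_hat) have weighted mean zero
    y_hat = float(np.dot(w, x))
    return w * (1 - 2 * eta * (y_hat - y) * (x - y_hat))

w = approx_eg_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), y=1.0, eta=0.1)
```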

Page 13: Exponentiated Gradient versus Gradient Descent for Linear

Worst Case Loss Bounds

Gradient Descent

Loss(GD(s,η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/c) ‖u − s‖₂² X₂²

EG

Loss(EG(s,η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/c) R² d_re(u, s)

where c > 0 and η = 2c/(R²(2 + c))

Page 14: Exponentiated Gradient versus Gradient Descent for Linear

Worst Case Loss Bounds

EG±

Loss(EG±(U, s, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 4/c) U² X² d_re(u′/U, s′)

where R = 2UX, η = 2c/(R²(2 + c)), and u′, s′ are the split (positive/negative) versions of u and s

Page 15: Exponentiated Gradient versus Gradient Descent for Linear

Other Algorithms

Gradient projection algorithm (GP):
- Has similar bounds to GD
- Uses the constraint that the weights must sum to 1

Exponentiated Gradient algorithm with Unnormalized weights (EGU)

When all outcomes, inputs and comparison vectors are positive, it has the bound:

Loss(EGU(s, Y, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 1/c) XY d_reu(u, s)

where d_reu is the unnormalized relative entropy.
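A sketch of the unnormalized step for the square loss, omitting the scaling parameter Y (names and values illustrative):

```python
import numpy as np

def egu_update(w, x, y, eta):
    # EGU: same exponentiated step as EG, but the weights are
    # never renormalized; they simply stay positive
    y_hat = float(np.dot(w, x))
    return w * np.exp(-2 * eta * (y_hat - y) * x)

w = egu_update(np.array([1.0, 1.0]), np.array([1.0, 0.0]), y=2.0, eta=0.1)
```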

Page 16: Exponentiated Gradient versus Gradient Descent for Linear

Experiments

- Have a fixed target concept u ∈ R^N; u gives the weight of each input variable
- Use ℓ instances of input x_t, drawn from a probability measure on R^N
- Random noise is added to the inputs
- Run each algorithm on the (same) inputs
- Plot cumulative losses for each algorithm
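The setup above can be sketched as follows; the learning rates, trial count, noise level, and input distribution are illustrative, not the paper's exact experimental settings:

```python
import numpy as np

def run_experiment(algos, u, n_trials=200, noise=0.1, seed=0):
    # feed every algorithm the same noisy linear data
    # and total up each one's cumulative square loss
    rng = np.random.default_rng(seed)
    N = len(u)
    ws = {name: np.full(N, 1.0 / N) for name in algos}
    losses = {name: 0.0 for name in algos}
    for _ in range(n_trials):
        x = rng.uniform(0.0, 1.0, N)
        y = float(np.dot(u, x)) + rng.normal(0.0, noise)
        for name, update in algos.items():
            y_hat = float(np.dot(ws[name], x))
            losses[name] += (y_hat - y) ** 2
            ws[name] = update(ws[name], x, y)
    return losses

def gd(w, x, y, eta=0.05):
    return w - 2 * eta * (float(np.dot(w, x)) - y) * x

def eg(w, x, y, eta=0.05):
    v = w * np.exp(-2 * eta * (float(np.dot(w, x)) - y) * x)
    return v / v.sum()

u = np.zeros(20)
u[0] = 1.0                 # sparse target: one relevant variable out of 20
totals = run_experiment({"GD": gd, "EG": eg}, u)
```

A sparse target like this is the regime where the paper's results show EG's logarithmic dependence on N paying off.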

Page 17: Exponentiated Gradient versus Gradient Descent for Linear

Results

Pages 18–21: Results (further cumulative-loss plots)

Page 22: Exponentiated Gradient versus Gradient Descent for Linear

GD vs. EG

- Random errors confuse GD much more
- When the number of relevant variables is constant:
  - Loss(GD) grows linearly in N
  - Loss(EG) grows logarithmically in N
- GD does better when all variables are relevant and the input is consistent (few or no errors)

Page 23: Exponentiated Gradient versus Gradient Descent for Linear

Conclusion

- Worst-case loss bounds exist only for the square loss; loss bounds for the relative entropy loss are still needed.
- GD has provably optimal bounds.
- Lower bounds for EG and EG± are still required.
- EG and EG± perform better in error-prone learning environments.