
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou

University of Manchester


Log-linear models in NLP

• Maximum entropy models
  – Text classification (Nigam et al., 1999)
  – History-based approaches (Ratnaparkhi, 1998)
• Conditional random fields
  – Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
• Structured prediction
  – Parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.

Log-linear models

• Log-linear (a.k.a. maximum entropy) model:

$$p(y|x; \mathbf{w}) = \frac{1}{Z(x)} \exp\left( \sum_i w_i f_i(x, y) \right)$$

where $f_i$ is a feature function, $w_i$ its weight, and the partition function is

$$Z(x) = \sum_{y} \exp\left( \sum_i w_i f_i(x, y) \right)$$

• Training
  – Maximize the conditional likelihood of the training data:

$$L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j | x_j; \mathbf{w}) - R(\mathbf{w})$$

where $R(\mathbf{w})$ is a regularization term.
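To make the definition concrete, here is a minimal Python sketch of the model over a small explicit label set; the feature functions, weights, and names are illustrative, not from the paper:

```python
import math

# Minimal sketch of a log-linear model (illustrative names).
# features is a list of callables f_i(x, y); w the matching weights.
def cond_prob(y, x, labels, features, w):
    """p(y | x; w) = exp(sum_i w_i * f_i(x, y)) / Z(x)."""
    def score(label):
        return math.exp(sum(wi * f(x, label) for wi, f in zip(w, features)))
    z = sum(score(label) for label in labels)  # partition function Z(x)
    return score(y) / z
```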

Regularization

• To avoid overfitting to the training data
  – Penalize the weights of the features
• L1 regularization:

$$L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j | x_j; \mathbf{w}) - C \sum_i |w_i|$$

  – Most of the weights become zero
  – Produces sparse (compact) models
  – Saves memory and storage
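As a sketch, the regularized objective can be evaluated as follows; cond_prob stands for any function returning $p(y|x;\mathbf{w})$ (e.g. the earlier sketch), and all names are illustrative:

```python
import math

# Sketch of the L1-regularized training objective (illustrative names).
def l1_objective(data, w, C, cond_prob):
    """L(w) = sum_j log p(y_j | x_j; w) - C * sum_i |w_i|."""
    log_likelihood = sum(math.log(cond_prob(y, x, w)) for x, y in data)
    return log_likelihood - C * sum(abs(wi) for wi in w)
```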

Training log-linear models

• Numerical optimization methods
  – Gradient descent (steepest descent or hill climbing)
  – Quasi-Newton methods (e.g. BFGS, OWL-QN)
  – Stochastic gradient descent (SGD)
  – etc.
• Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.

Gradient Descent (Hill Climbing)

[Figure: objective surface over two weights $w_1$ and $w_2$, climbed with full-gradient steps.]

Stochastic Gradient Descent (SGD)

[Figure: objective surface over two weights $w_1$ and $w_2$, climbed with noisy single-sample steps.]

• Compute an approximate gradient using one training sample

Stochastic Gradient Descent (SGD)

• Weight update procedure
  – Very simple (similar to the perceptron algorithm):

$$\mathbf{w}^{k+1} = \mathbf{w}^{k} + \eta_k \frac{\partial}{\partial \mathbf{w}} \left( \log p(y_j | x_j; \mathbf{w}) - \frac{C}{N} \sum_i |w_i| \right)$$

where $\eta_k$ is the learning rate.

• The L1 term $\sum_i |w_i|$ is not differentiable at zero.

Using subgradients

• Weight update procedure:

$$w_i^{k+1} = w_i^{k} + \eta_k \frac{\partial}{\partial w_i} \left( \log p(y_j | x_j; \mathbf{w}) - \frac{C}{N} \sum_{i'} |w_{i'}| \right)$$

with the subgradient of the L1 term taken as

$$\frac{\partial |w_i|}{\partial w_i} = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{if } w_i = 0 \\ -1 & \text{if } w_i < 0 \end{cases}$$
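A minimal sketch of this update for one sample; grad_log_p (the gradient of the log-likelihood for the current sample) and the other names are illustrative:

```python
# Sketch of the naive subgradient update for one training sample
# (grad_log_p and all names are illustrative).
def naive_update(w, grad_log_p, eta, C, N):
    """w_i += eta * (d log p / d w_i - (C/N) * sign(w_i)) for all i."""
    for i in range(len(w)):
        sign = (w[i] > 0) - (w[i] < 0)      # subgradient of |w_i|
        w[i] += eta * (grad_log_p[i] - (C / N) * sign)
```

Note that the loop runs over every feature, not just the features active in the current sample; the next slide points out the problems this causes.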

Using subgradients

• Problems with this update
  – The L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
  – Few weights become zero as a result of training.

Clipping-at-zero approach

• Carpenter (2008)
• Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
• Enables lazy update

Clipping-at-zero approach

First take a gradient step using the current sample:

$$w_i^{k+\frac{1}{2}} = w_i^{k} + \eta_k \frac{\partial}{\partial w_i} \log p(y_j | x_j; \mathbf{w}^{k})$$

then apply the L1 penalty, clipping at zero:

$$\text{if } w_i^{k+\frac{1}{2}} > 0 \text{ then } w_i^{k+1} = \max\left(0,\; w_i^{k+\frac{1}{2}} - \frac{C}{N}\eta_k\right)$$

$$\text{else if } w_i^{k+\frac{1}{2}} < 0 \text{ then } w_i^{k+1} = \min\left(0,\; w_i^{k+\frac{1}{2}} + \frac{C}{N}\eta_k\right)$$
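A sketch of the clipping-at-zero update for a single weight, with illustrative names:

```python
# Sketch of the clipping-at-zero update for one weight (names illustrative).
def clip_update(w_i, grad_i, eta, C, N):
    """Gradient step, then shrink toward zero by (C/N)*eta, clipping at 0."""
    w_half = w_i + eta * grad_i             # w_i^{k+1/2}
    if w_half > 0:
        return max(0.0, w_half - (C / N) * eta)
    elif w_half < 0:
        return min(0.0, w_half + (C / N) * eta)
    return w_half                           # already exactly zero
```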

Number of non-zero features

• Text chunking

  Quasi-Newton              18,109
  SGD (Naive)              455,651
  SGD (Clipping-at-zero)    87,792

• Named entity recognition

  Quasi-Newton              30,710
  SGD (Naive)            1,032,962
  SGD (Clipping-at-zero)   279,886

• Part-of-speech tagging

  Quasi-Newton              50,870
  SGD (Naive)            2,142,130
  SGD (Clipping-at-zero)   323,199

Why it does not produce sparse models

• In SGD, weights are not updated smoothly: the noisy per-sample gradients repeatedly push a weight away from zero
  – Fails to become zero!
  – The L1 penalty is wasted away

Cumulative L1 penalty

• $u_k$: the absolute value of the total L1 penalty that should have been applied to each weight:

$$u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t$$

• $q_i^k$: the total L1 penalty that has actually been applied to weight $w_i$:

$$q_i^{k} = \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right)$$

Applying L1 with cumulative penalty

First take a gradient step using the current sample:

$$w_i^{k+\frac{1}{2}} = w_i^{k} + \eta_k \frac{\partial}{\partial w_i} \log p(y_j | x_j; \mathbf{w}^{k})$$

then apply the cumulative penalty, clipping at zero:

$$\text{if } w_i^{k+\frac{1}{2}} > 0 \text{ then } w_i^{k+1} = \max\left(0,\; w_i^{k+\frac{1}{2}} - \left(u_k + q_i^{k-1}\right)\right)$$

$$\text{else if } w_i^{k+\frac{1}{2}} < 0 \text{ then } w_i^{k+1} = \min\left(0,\; w_i^{k+\frac{1}{2}} + \left(u_k - q_i^{k-1}\right)\right)$$

• Penalize each weight according to the difference between $u_k$ and $q_i^{k-1}$

Implementation

• 10 lines of code! (a minimal sketch follows below)
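A minimal Python sketch of the cumulative-penalty update; the function and variable names are illustrative, not the authors' code. Here u holds the running total $u_k$, q[i] the per-weight total $q_i$, and w[i] holds $w_i^{k+\frac{1}{2}}$ on entry:

```python
# Sketch of the cumulative-penalty update for one weight (illustrative
# names; w[i] is w_i^{k+1/2} on entry, w_i^{k+1} on exit).
def apply_penalty(i, w, q, u):
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q.get(i, 0.0)))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q.get(i, 0.0)))
    q[i] = q.get(i, 0.0) + (w[i] - z)   # record penalty actually applied
```

In the scheme described above, u grows by (C / N) * eta_k at every step, and apply_penalty only needs to be called for the features used in the current sample, which is what makes the lazy update possible.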

Experiments

• Model: conditional random fields (CRFs)
• Baseline: OWL-QN (Andrew and Gao, 2007)
• Tasks
  – Text chunking (shallow parsing)
    • CoNLL 2000 shared task data
    • Recognize base syntactic phrases (e.g. NP, VP, PP)
  – Named entity recognition
    • NLPBA 2004 shared task data
    • Recognize names of genes, proteins, etc.
  – Part-of-speech (POS) tagging
    • WSJ corpus (sections 0-18 for training)

CoNLL 2000 chunking task: objective

[Figure: training objective as a function of passes.]

CoNLL 2000 chunking: non-zero features

[Figure: number of non-zero features as a function of passes.]

CoNLL 2000 chunking

Method                        Passes   Obj.   # Features   Time (sec)   F-score
OWL-QN                           160  -1.583      18,109          598     93.62
SGD (Naive)                       30  -1.671     455,651        1,117     93.64
SGD (Clipping + Lazy Update)      30  -1.671      87,792          144     93.65
SGD (Cumulative)                  30  -1.653      28,189          149     93.68
SGD (Cumulative + ED)             30  -1.622      23,584          148     93.66

• The performance of the produced model (F-score) is essentially unchanged
• Training is 4 times faster than OWL-QN
• The model is 4 times smaller than with the clipping-at-zero approach
• The objective is also slightly better

NLPBA 2004 named entity recognition

Method                        Passes   Obj.   # Features   Time (sec)   F-score
OWL-QN                           160  -2.448      30,710        2,253     71.76
SGD (Naive)                       30  -2.537   1,032,962        4,528     71.20
SGD (Clipping + Lazy Update)      30  -2.538     279,886          585     71.20
SGD (Cumulative)                  30  -2.479      31,986          631     71.40
SGD (Cumulative + ED)             30  -2.443      25,965          631     71.63

Part-of-speech tagging on WSJ

Method                        Passes   Obj.   # Features   Time (sec)   Accuracy
OWL-QN                           124  -1.941      50,870        5,623      97.16
SGD (Naive)                       30  -2.013   2,142,130       18,471      97.18
SGD (Clipping + Lazy Update)      30  -2.013     323,199        1,680      97.18
SGD (Cumulative)                  30  -1.987      62,043        1,777      97.19
SGD (Cumulative + ED)             30  -1.954      51,857        1,774      97.17

Discussions

• Convergence
  – Demonstrated empirically
  – The penalties applied are not i.i.d., so standard SGD convergence arguments do not apply directly
• Learning rate
  – The need for tuning can be annoying
  – Rule of thumb: exponential decay (passes = 30, alpha = 0.85); a sketch follows below
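A minimal sketch of the exponential-decay schedule, assuming the common form $\eta_k = \eta_0 \, \alpha^{k/N}$ so the rate shrinks by a factor of alpha once per pass; eta0 and the names are illustrative:

```python
# Sketch of an exponentially decaying learning rate (assumed form:
# eta_k = eta0 * alpha ** (k / N), i.e. multiplied by alpha each pass).
def learning_rate(k, n_samples, eta0=1.0, alpha=0.85):
    """Learning rate for the k-th sample update (eta0 is illustrative)."""
    return eta0 * alpha ** (k / n_samples)
```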

Conclusions

• Stochastic gradient descent training for L1-regularized log-linear models
  – Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
• 3 to 4 times faster than OWL-QN
• Extremely easy to implement