TRANSCRIPT
Dynamic Feature Selection for Dependency Parsing
He He, Hal Daumé III and Jason Eisner
EMNLP 2013, Seattle
Structured Prediction in NLP

Examples of structured outputs:
● Machine Translation: "Fruit flies like a banana ." → "果 蝇 喜欢 香蕉 。" (Chinese translation)
● Part-of-Speech Tagging: Fruit/N flies/N like/V a/Det banana/N
● Parsing: a dependency tree over "$ Fruit flies like a banana"
● ... plus summarization, named entity resolution, and many more

Two challenges:
● Exponentially increasing search space
● Millions of features for scoring
Structured Prediction in NLP

[Figure: lattice of candidate tags and dependency edges over "$ Fruit flies like a banana", built up over several slides.]

Feature templates per edge: token, left token, right token, in-between token, stem, form, bigram, tag, coarse tag, length, direction, ...
Plus conjunctions such as (head_token + mod_token) X (head_tag + mod_tag) — the feature set is HUGE.

Do you need all features everywhere?
→ Dynamic decisions.
Case Study: Dependency Parsing

[Figure: speedup of Ours vs. Baseline for Bulgarian, Chinese, English, German, Japanese, Portuguese, and Swedish.]
2x to 6x speedup with little loss in accuracy.
Graph-based Dependency Parsing

Example sentence: "$ This time , the firms were ready ."

Scoring: each candidate edge, e.g. firms → were, is scored with features such as
  – length: 1, direction: right
  – head_token: firms, modifier_token: were
  – head_tag: noun
  – ... and hundreds more!
(A toy sketch of this scoring is shown below.)
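As a rough illustration (not the MSTParser implementation; the template functions and weight dictionary below are hypothetical), a first-order edge score is just the summed weights of the features the edge fires:

def score_edge(weights, templates, sentence, head, mod):
    # Each template maps (sentence, head, mod) to a feature string,
    # e.g. "head_token=firms|mod_token=were" or "length=1|direction=right".
    total = 0.0
    for template in templates:
        feature = template(sentence, head, mod)
        total += weights.get(feature, 0.0)   # unseen features contribute 0
    return total

# Two toy templates:
templates = [
    lambda s, h, m: "head_token=%s|mod_token=%s" % (s[h], s[m]),
    lambda s, h, m: "length=%d|direction=%s" % (abs(h - m), "right" if m > h else "left"),
]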
Decoding: find the highest-scoring tree.
[Figure: the best dependency tree over "$ This time , the firms were ready ."]
MST Dependency Parsing (1st-order projective)

[Figure: total time per average sentence, split into "find edge scores" vs. "find highest-scoring tree".]
● Find the highest-scoring tree: O(n³)
● Find edge scores: ~268 feature templates, ~76M features — this is where most of the time goes
Add features only when necessary!

Example: two candidate edges, scored incrementally as feature groups are added:
● score(the → firms): 0.63, then +0.7 → 1.33 — declared a WINNER after two groups
● score(This → ready): -0.23, then +0.1, -1.2, -0.55 → -1.88 — declared a LOSER after four groups

But this is a structured problem! We should not look at scores independently.
Dynamic Dependency Parsing

1. Find the highest-scoring tree after adding some features (fast non-projective decoding).
2. Only edges in the current best tree can win:
   – winners are chosen by a classifier (≤ n decisions)
   – losers are killed because they fight with the winners
3. Add the next feature group to the undetermined edges.

Max # of iterations = # of feature groups. (A sketch of this loop is shown below.)
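A minimal sketch of the loop, under assumed interfaces (the decoding, scoring, and classification routines are passed in as callables and are not spelled out here; only the same-child conflict rule is shown, the cycle and crossing rules are omitted):

def dynamic_parse(edges, feature_groups, score_group,
                  decode_nonprojective, decode_projective, classify):
    # edges: iterable of (head, mod) pairs; scores start at 0 and grow per group.
    scores = {e: 0.0 for e in edges}
    winners, losers = set(), set()
    for k, group in enumerate(feature_groups):
        gray = [e for e in edges if e not in winners and e not in losers]
        if not gray:
            break
        for e in gray:                                        # 3. add the next feature group
            scores[e] += score_group(e, group)
        tree = decode_nonprojective(scores, winners, losers)  # 1. fast decoding
        for e in tree:                                        # 2. only tree edges can win
            if e not in winners and classify(e, scores, k):
                winners.add(e)
        # losers fight with the winners (here: only the same-child rule)
        losers |= {e for e in gray if e not in winners
                   and any(w[1] == e[1] and w != e for w in winners)}
    return decode_projective(scores, winners, losers)         # final projective decode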
31
.This time , the firms were ready$
+ first feature group 51 gray edges with unknown fate...5 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
32
.This time , the firms were ready$
51 gray edges with unknown fate...5 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Non-projective decoding to find new 1-best tree
33
.This time , the firms were ready$
50 gray edges with unknown fate...5 features per gray edge
Classifier picks winners among the blue edgesCurrent 1-best tree
Winner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
34
.This time , the firms were ready$
44 gray edges with unknown fate...5 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
35
.This time , the firms were ready$
44 gray edges with unknown fate...5 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
36
.This time , the firms were ready$
+ next feature group 44 gray edges with unknown fate...27 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
37
.This time , the firms were ready$
+ next feature group 44 gray edges with unknown fate...27 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Non-projective decoding to find new 1-best tree
38
.This time , the firms were ready$
42 gray edges with unknown fate...27 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Classifier picks winners among the blue edges
39
.This time , the firms were ready$
31 gray edges with unknown fate...27 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
40
.This time , the firms were ready$
31 gray edges with unknown fate...27 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
41
.This time , the firms were ready$
+ next feature group 31 gray edges with unknown fate...74 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
42
.This time , the firms were ready$
31 gray edges with unknown fate...74 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Non-projective decoding to find new 1-best tree
43
.This time , the firms were ready$
28 gray edges with unknown fate...74 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Classifier picks winners among the blue edges
44
.This time , the firms were ready$
8 gray edges with unknown fate...74 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
45
.This time , the firms were ready$
8 gray edges with unknown fate...74 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
46
.This time , the firms were ready$
+ next feature group 8 gray edges with unknown fate...107 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
47
.This time , the firms were ready$
8 gray edges with unknown fate...107 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Non-projective decoding to find new 1-best tree
48
.This time , the firms were ready$
7 gray edges with unknown fate...107 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Classifier picks winners among the blue edges
49
.This time , the firms were ready$
3 gray edges with unknown fate...107 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
50
.This time , the firms were ready$
3 gray edges with unknown fate...107 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Remove losers in conflict with the winners
51
.This time , the firms were ready$
+ last feature group 3 gray edges with unknown fate...268 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
52
.This time , the firms were ready$
0 gray edge with unknown fate...268 features per gray edge
Current 1-best treeWinner edge
Loser edge
Undetermined edge
(permanently in 1-best tree)
Projective decoding to find final 1-best tree
What Happens During the Average Parse?

[Figure: per-stage statistics over an average parse.]
● Most edges win or lose early
● Some edges win late
● Later features are helpful
● Linear increase in runtime
Summary: How Early Decisions Are Made

● Winners
  – Will definitely appear in the 1-best tree
● Losers
  – Have the same child as a winning edge
  – Form a cycle with winning edges
  – Cross a winning edge (optional)
  – Share the root ($) with a winning edge (optional)
● Undetermined
  – Add the next feature group to the remaining gray edges

(A sketch of the loser tests is shown below.)
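A sketch of how these tests might look (my reading of the rules, not the released code; edges are (head, modifier) index pairs with 0 as the root $; the optional crossing test is omitted):

def is_loser(edge, winners):
    head, mod = edge
    parent = {m: h for (h, m) in winners}     # winner edges as a partial parent map
    # Same child as a winning edge
    if mod in parent and parent[mod] != head:
        return True
    # Would form a cycle with winning edges: following winner parents
    # upward from `head` eventually reaches `mod`
    node = head
    while node in parent:
        node = parent[node]
        if node == mod:
            return True
    # Optional: the root $ (index 0) already has a winning child
    if head == 0 and any(h == 0 and (h, m) != edge for (h, m) in winners):
        return True
    return False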
Feature Template Ranking

● Forward selection
  – Individually: A = 0.60, B = 0.49, C = 0.55 → pick A (rank 1)
  – A&B = 0.80, A&C = 0.85 → pick C (rank 2)
  – A&C&B = 0.90 → pick B (rank 3)
● Grouping (cumulative accuracy as the ranked templates are added):
  – head cPOS + mod cPOS + in-between punct #    0.49
  – in-between cPOS                              0.59
  – head POS + mod POS + in-between conj #       0.71
  – head POS + mod POS + in-between POS + dist   0.72
  – head token + mod cPOS + dist                 0.80
  – ⋮
  A group boundary is drawn roughly every + ~0.1 of accuracy.

Partition Feature List Into Groups
(A sketch of the greedy selection and grouping is shown below.)
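A hedged sketch of the greedy selection and grouping (the evaluate function, which trains and scores a parser restricted to a given template set, is assumed; the 0.1 cut-off mirrors the "+ ~0.1" boundaries above):

def rank_templates(templates, evaluate):
    ranked, remaining = [], list(templates)
    while remaining:
        best = max(remaining, key=lambda t: evaluate(ranked + [t]))
        ranked.append(best)              # e.g. A first, then C, then B
        remaining.remove(best)
    return ranked

def group_templates(ranked, evaluate, gain=0.1):
    groups, current, used = [], [], []
    last = evaluate([])                  # accuracy with no templates (the starting point)
    for t in ranked:
        current.append(t)
        used.append(t)
        acc = evaluate(used)
        if acc - last >= gain:           # close the group after ~0.1 accuracy gained
            groups.append(current)
            current, last = [], acc
    if current:
        groups.append(current)
    return groups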
How to pick the winners?

● Learn a classifier
● Features
  – Currently added parsing features
  – Meta-features: confidence of a prediction
● Training examples
  – Input: each blue edge in the current 1-best tree (the tree edges not yet decided)
  – Output: is the edge in the gold tree? If so, we want it to win!

(A sketch of building these training examples is shown below.)
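One way to materialize those training examples (a sketch; gold_tree maps each modifier to its gold head, current_tree is the 1-best tree at this stage, and extract_features stands in for whatever parsing and meta-features are available):

def make_examples(current_tree, winners, gold_tree, extract_features, stage):
    examples = []
    for head, mod in current_tree:       # the "blue" edges of the current 1-best tree
        if (head, mod) in winners:
            continue                     # already decided
        x = extract_features(head, mod, stage)
        y = 1 if gold_tree.get(mod) == head else 0   # should this edge win?
        examples.append((x, y))
    return examples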
Classifier Features

● Currently added parsing features
● Meta-features ("dynamic features")
  – Scores of the edge so far, e.g. ..., 0.5, 0.8, 0.85 (scores are normalized by the sigmoid function)
  – Margins to the highest-scoring competing edge
  – Index of the next feature group

[Figure: the edge the → firms and its competing edges in "$ This time the firms were .", with normalized scores 0.72, 0.65, 0.30, 0.23, 0.12.]

(A sketch of these meta-features is shown below.)
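A sketch of the meta-features for one candidate edge (assumed representation: history holds the edge's raw scores after each stage so far, competitors holds the raw scores of edges competing for the same modifier; the sigmoid squashing and the margin follow the slide):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def meta_features(history, competitors, next_group_index):
    feats = {}
    for i, s in enumerate(history):
        feats["score_stage_%d" % i] = sigmoid(s)           # normalized score history
    best_rival = max(sigmoid(s) for s in competitors) if competitors else 0.0
    feats["margin"] = sigmoid(history[-1]) - best_rival    # gap to best competing edge
    feats["next_group"] = next_group_index
    return feats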
How To Train With Dynamic Features

● Training examples are not fixed in advance!
● Winners/losers from stages < k affect:
  – The set of edges to classify at stage k
  – The dynamic features of those edges at stage k
● Bad decisions can cause future errors

Reinforcement / Imitation Learning
● Dataset Aggregation (DAgger) (Ross et al., 2011)
  – Iterates between training and running a model
  – Learns to recover from past mistakes

(A DAgger-style training loop is sketched below.)
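A rough DAgger-style loop for this setting (a sketch under assumptions: run_parser runs the dynamic parser with the current classifier and yields the decision points it encounters; oracle_label says whether an edge is in the gold tree; train_classifier is any supervised learner, e.g. LibLinear):

def dagger_train(sentences, gold_trees, run_parser, oracle_label,
                 train_classifier, n_iters=5):
    dataset = []                         # aggregated over all iterations
    classifier = None                    # iteration 0 may fall back to the oracle policy
    for _ in range(n_iters):
        for sent, gold in zip(sentences, gold_trees):
            # Run the current policy; collect the states it actually visits.
            for state, edge in run_parser(sent, classifier):
                dataset.append((state, oracle_label(edge, gold)))
        classifier = train_classifier(dataset)   # retrain on everything seen so far
    return classifier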
Upper Bound of Our Performance

● "Labels": gold edges always win
  – 96.47% UAS using only 2.9% of the first-order features

[Figure: two parse diagrams over "$ This time , the firms were ready ."]
How To Train Our Parser

1. Train parsers (non-projective, projective) using all features
2. Rank and group feature templates
3. Iteratively train a classifier to decide winners/losers
Experiment

● Data
  – Penn Treebank: English
  – CoNLL-X: Bulgarian, Chinese, German, Japanese, Portuguese, Swedish
● Parser
  – MSTParser (McDonald et al., 2006)
● Dynamically-trained classifier
  – LibLinear (Fan et al., 2008)
Dynamic Feature Selection Beats Static Forward Selection

[Figure: our strategy ("add features as needed") vs. the static baseline ("always add the next feature group to all edges").]
Experiment: 1st-order — 2x to 6x speedup

[Figure: speedup of DynFS vs. Baseline for Bulgarian, Chinese, English, German, Japanese, Portuguese, and Swedish.]

Experiment: 1st-order — ~0.2% loss in accuracy

[Figure: relative accuracy of DynFS vs. Baseline per language.]
relative accuracy = accuracy of the pruning parser / accuracy of the full parser
Second-order Dependency Parsing

● Features depend on the siblings as well (e.g. the pair of modifiers ready and . under were)
● First-order: O(n²) substructures to score
● Second-order: O(n³) substructures to score
  – ~380 feature templates, ~96M features
● Decoding: still O(n³)

[Figure: a dependency tree over "$ This time , the firms were ready ." with a sibling part highlighted.]
(A sketch of sibling parts is shown below.)
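To make "features depend on the siblings" concrete, here is a simplified enumeration of second-order sibling parts for a given tree (a sketch only; real second-order parsers distinguish left and right modifiers and score all candidate triples during decoding, not just those in one tree):

def sibling_parts(tree):
    # tree: dict mapping modifier index -> head index.
    # Yields (head, previous_sibling, modifier); None marks the first modifier.
    by_head = {}
    for mod, head in sorted(tree.items()):
        by_head.setdefault(head, []).append(mod)
    for head, mods in by_head.items():
        prev = None
        for mod in mods:
            yield (head, prev, mod)
            prev = mod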
Experiment: 2nd-order — 2x to 8x speedup

[Figure: speedup of DynFS vs. Baseline per language.]

Experiment: 2nd-order — ~0.3% loss in accuracy

[Figure: relative accuracy of DynFS vs. Baseline per language.]
Ours vs. Vine Pruning (Rush and Petrov, 2012)

● Vine pruning: a very fast parser that speeds up using orthogonal techniques
  – Starts with short edges (fully scored)
  – Adds long edges if needed
● Ours
  – Starts with all edges (partially scored)
  – Quickly removes unneeded edges
● Could be combined for further speedup!
vs. Vine Pruning: 1st-order — comparable performance

[Figure: speedup per language for DynFS, VineP, and Baseline.]
[Figure: relative accuracy per language for DynFS, VineP, and Baseline.]

vs. Vine Pruning: 2nd-order

[Figure: speedup per language for DynFS, VineP, and Baseline.]
[Figure: relative accuracy per language for DynFS, VineP, and Baseline.]
Conclusion

● Feature computation is expensive in structured prediction
● Commitments should be made dynamically
● Early commitment to edges reduces both search and scoring time
● The approach can be used in other feature-rich models for structured prediction
Backup Slides
Static dictionary pruning (Rush and Petrov, 2012)

Dictionary of (head tag, modifier tag, direction) counts, e.g.
  VB CD →: 18,  VB CD ←: 3,  NN VBG →: 22,  NN VBG ←: 11, ...

[Figure: the pruned edge graph over "$ This time , the firms were ready ."]
Reinforcement Learning 101

● Markov Decision Process (MDP)
  – State: all the information that helps us make decisions
  – Action: things we choose to do
  – Reward: criterion for evaluating actions
  – Policy: the "brain" that makes the decisions
● Goal
  – Maximize the expected future reward
Policy Learning

π(edge + context) = add / lock   (e.g. the edge the → firms)

● Markov Decision Process (MDP)
  – reward = accuracy + λ·speed
● Reinforcement learning
  – Delayed reward
  – Long time to converge
● Imitation learning
  – Mimic the oracle
  – Reduces to a supervised classification problem
Imitation Learning

● Oracle
  – (Near) optimal performance
  – Generates the target action in any given state
  – e.g. lock the edge the → firms; add more features to the edge between time and the; ...
● Train a binary classifier to imitate these decisions
Dataset Aggregation (DAgger)

● Collecting data from the oracle only → different distribution at training and test time
● Iterative policy training
  – Corrects the learner's mistakes
  – Obtains a policy that performs well under its own distribution
Experiment (1st-order)

[Figure: feature cost of DynFS per language.]
cost = (# feature templates used) / (total # feature templates on the statically pruned graph)

Experiment (2nd-order)

[Figure: feature cost of DynFS per language.]
Second-order Parsing

[Figure: second-order parsing illustration (backup).]