
Page 1

Global Inference in Learning for Natural Language Processing

Vasin Punyakanok

Department of Computer Science

University of Illinois at Urbana-Champaign

Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak


Page 2

Story Comprehension

1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.


Page 3

Stand Alone Ambiguity Resolution

Context-Sensitive Spelling Correction
Illinois' bored of education → board

Word Sense Disambiguation
...Nissan car and truck plant is...
...divide life into plant and animal kingdom...

Part-of-Speech Tagging
(This DT) (can N) (will MD) (rust V)    [candidate tags: DT, MD, V, N]

Coreference Resolution
The dog bit the kid. He was taken to a hospital.
The dog bit the kid. He was taken to a veterinarian.


Page 4

Textual Entailment

Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.

Yahoo acquired Overture.

Question Answering: Who acquired Overture?


Page 5

Inference and Learning

Global decisions in which several local decisions play a role, but where there are mutual dependencies among their outcomes.

Learned classifiers for different sub-problems

Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain & context specific constraints.

Global inference for the best assignment to all variables of interest.


Page 6

Text Chunking

x = The guy standing there is so tall
y = [NP The guy] [VP standing] [ADVP there] [VP is] [ADJP so tall]


Page 7

Full Parsing

x = The guy standing there is so tall
y = (S (NP (NP The guy) (VP standing (ADVP there))) (VP is (ADJP so tall)))


Page 8

Outline

Semantic Role Labeling
  Problem
  Global Inference with Integer Linear Programming

Some Issues with Learning and Inference
  Global vs. Local Training
  Utility of Constraints in the Inference

Conclusion


Page 9

Semantic Role Labeling

I left my pearls to my daughter in my will .

[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .

A0: Leaver
A1: Things left
A2: Benefactor
AM-LOC: Location


Page 10

Semantic Role Labeling

PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to the Penn Treebank II. (Almost) all the labels are on constituents of the parse trees.

Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files.

13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.


Page 11

Semantic Role Labeling


Page 12

The Approach

Pruning
  Use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04])
Argument Identification
  Use a binary classifier to identify arguments
Argument Classification
  Use a multiclass classifier to classify the arguments
Inference
  Infer the final output satisfying the linguistic and structural constraints

Running example: I left my nice pearls to her


Page 13

Learning

Both the argument identifier and the argument classifier are trained as phrase-based classifiers.

Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]

Learning algorithm: SNoW, a sparse network of linear functions; the weights are learned by a regularized Winnow multiplicative update rule with averaged weight vectors.

Probability conversion is done via softmax: p_i = exp(act_i) / Σ_j exp(act_j)
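A minimal sketch of this conversion (the function and variable names are illustrative, not SNoW's API):

    import math

    def softmax(activations):
        """Convert raw activations act_i into probabilities
        p_i = exp(act_i) / sum_j exp(act_j), as above."""
        m = max(activations)  # subtract the max for numerical stability
        exps = [math.exp(a - m) for a in activations]
        z = sum(exps)
        return [e / z for e in exps]

    # Example: activations for a few candidate argument types
    print(softmax([2.0, 1.0, -0.5]))  # probabilities summing to 1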


Page 14

Inference

The output of the argument classifier often violates some constraints, especially when the sentence is long.

Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming (ILP).

Input:
  The probability estimates (from the argument classifier)
  Structural and linguistic constraints

This allows incorporating expressive (non-sequential) constraints on the variables (the argument types).


Page 15

Integer Linear Programming Inference

For each argument a_i and type t (including null), set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.

The goal is to maximize

  Σ_{i,t} score(a_i = t) · a_{i,t}

subject to the (linear) constraints. Any Boolean constraint can be encoded this way.

If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
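To make the encoding concrete, here is a hedged sketch of the same objective, together with two of the constraints from the next slides, using the off-the-shelf PuLP solver. The candidate arguments, scores, and overlap pairs are invented stand-ins for the classifier's actual output:

    import pulp

    # Hypothetical classifier scores score[(i, t)] for candidate a_i, type t.
    TYPES = ["A0", "A1", "A2", "null"]
    score = {(0, "A0"): 0.7, (0, "A1"): 0.2, (0, "A2"): 0.05, (0, "null"): 0.05,
             (1, "A0"): 0.1, (1, "A1"): 0.6, (1, "A2"): 0.1,  (1, "null"): 0.2,
             (2, "A0"): 0.3, (2, "A1"): 0.3, (2, "A2"): 0.1,  (2, "null"): 0.3}
    args = [0, 1, 2]
    overlapping = [(1, 2)]  # pretend candidates 1 and 2 overlap

    prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
    # Boolean variable x[i, t] = 1 iff a_i is classified as type t.
    x = {(i, t): pulp.LpVariable(f"x_{i}_{t}", cat="Binary")
         for i in args for t in TYPES}

    # Objective: sum over i, t of score(a_i = t) * x_{i,t}
    prob += pulp.lpSum(score[i, t] * x[i, t] for i in args for t in TYPES)

    # Each candidate takes exactly one label (possibly null).
    for i in args:
        prob += pulp.lpSum(x[i, t] for t in TYPES) == 1
    # No overlapping arguments: at least one of an overlapping pair is null.
    for i, j in overlapping:
        prob += x[i, "null"] + x[j, "null"] >= 1
    # No duplicate core argument classes (shown here for A0 only).
    prob += pulp.lpSum(x[i, "A0"] for i in args) <= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print({i: next(t for t in TYPES if x[i, t].value() == 1) for i in args})

Since the variables are Boolean and everything stays linear, any Boolean constraint over the argument types can be compiled into this form.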


Page 16

Linear Constraints

No overlapping or embedding arguments:

  if a_i and a_j overlap or embed:  a_{i,null} + a_{j,null} ≥ 1


Page 17

Constraints

Constraints:
  No overlapping or embedding arguments
  No duplicate argument classes for A0-A5
  Exactly one V argument per predicate
  If there is a C-V, there must be a V-A1-C-V pattern
  If there is an R-arg, there must be an arg somewhere
  If there is a C-arg, there must be an arg somewhere before it
  Each predicate can take only core arguments that appear in its frame file; more specifically, we check only the minimum and maximum ids
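All of these can be compiled into linear inequalities over the Boolean variables a_{i,t}. As a sketch of one possible encoding for the referential constraints (illustrative, not necessarily the exact encoding used in the system):

  a_{i,R-A0} ≤ Σ_j a_{j,A0}        (an R-A0 requires an A0 somewhere)
  a_{i,C-A0} ≤ Σ_{j<i} a_{j,A0}    (a C-A0 requires an A0 before it)
  Σ_i a_{i,A0} ≤ 1                 (no duplicate A0)
  Σ_i a_{i,V} = 1                  (exactly one V)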


Page 18

SRL Results (CoNLL-2005)

Training: sections 02-21; Development: section 24; Test WSJ: section 23; Test Brown: from the Brown corpus (very small)

                          Precision   Recall     F1
  Development  Collins      73.89     70.11    71.95
               Charniak     75.40     74.13    74.76
  Test WSJ     Collins      77.09     72.00    74.46
               Charniak     78.10     76.15    77.11
  Test Brown   Collins      68.03     63.34    65.60
               Charniak     67.15     63.57    65.31


Page 19

Inference with Multiple SRL systems

The goal is to maximize

  Σ_{i,t} score(a_i = t) · a_{i,t}

subject to the (linear) constraints; any Boolean constraint can be encoded this way.

Here score(a_i = t) = Σ_k f_k(a_i = t), summing over the component systems k.

If system k has no opinion on a_i, use a prior instead.
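A small sketch of this combination; `systems` (a list of per-system score tables) and `prior` are hypothetical stand-ins for the real components:

    def combined_score(systems, prior, candidate, arg_type):
        """score(a_i = t) = sum_k f_k(a_i = t); if system k has no
        opinion on candidate a_i, fall back to a prior instead."""
        total = 0.0
        for fk in systems:
            opinion = fk.get((candidate, arg_type))  # None if no opinion
            total += opinion if opinion is not None else prior[arg_type]
        return total

    # Example: two systems, the second of which never scored candidate 1.
    systems = [{(0, "A0"): 0.8, (1, "A0"): 0.3},
               {(0, "A0"): 0.6}]
    prior = {"A0": 0.25}
    print(combined_score(systems, prior, 1, "A0"))  # 0.3 + 0.25 = 0.55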


Page 20

Results with Multiple Systems (CoNLL-2005)

[Bar chart: F1 on Devel, WSJ, and Brown for the individual systems (Col, Char, Char-2 through Char-5) and the combined system; the combined system reaches 77.35 F1 on Devel, 79.44 on WSJ, and 67.75 on Brown.]


Page 21

Outline

Semantic Role Labeling
  Problem
  Global Inference with Integer Linear Programming

Some Issues with Learning and Inference
  Global vs. Local Training
  Utility of Constraints in the Inference

Conclusion


Page 22

Learning and Inference

L+I: training w/o constraints; at testing, inference with constraints
IBT: Inference-Based Training (inference with the constraints is applied during training)

[Figure: input variables x1-x7 in X feed five local classifiers f1(x)-f5(x); their outputs y1-y5 in Y are mutually constrained.]

Learning the components together!

Which one is better? When and Why?


Page 23

Comparisons of Learning Approaches

Coupling (IBT): optimizes the true global objective function (this should be better in the limit).

Decoupling (L+I): more efficient; reusability of the classifiers; modularity in training; no global examples required.


Page 24

Claims

When the local classification problems are "easy", L+I outperforms IBT.

Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.

We will show experimental results and theoretical intuition to support these claims.


Page 25

Perceptron-based Global Learning

[Figure: inputs x1-x7 feed local classifiers f1(x)-f5(x). True global labeling: Y = (-1, 1, 1, -1, -1). Local predictions: Y' = (-1, 1, 1, 1, 1). After applying the constraints: Y' = (-1, 1, 1, 1, -1).]
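A compact sketch of training in this style: a structured perceptron whose update is driven by the constrained global prediction rather than by each local decision separately. The exhaustive argmax and all names are illustrative; a real system would replace the enumeration with ILP or search:

    import numpy as np
    from itertools import product

    def ibt_perceptron(examples, constraint, phi, n_feats, epochs=20):
        """Inference-based training (IBT) sketch.  `constraint` filters
        the output space down to C(Y); `phi(x, y)` is a joint feature
        vector for input x and global labeling y."""
        w = np.zeros(n_feats)
        for _ in range(epochs):
            for x, y_true in examples:
                # Constrained global prediction: argmax over C(Y).
                space = [y for y in product([-1, 1], repeat=len(y_true))
                         if constraint(y)]
                y_pred = max(space, key=lambda y: w @ phi(x, y))
                if y_pred != tuple(y_true):
                    # Promote the true labeling, demote the prediction.
                    w += phi(x, y_true) - phi(x, y_pred)
        return w

In L+I, by contrast, each f_i would be trained on its local labels alone, and the constraints would appear only inside the test-time argmax.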


Page 26

Simulation

There are 5 local binary linear classifiers; the global classifier is also linear:

  h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)

Constraints are randomly generated.

The hypothesis is linearly separable at the global level, given that the constraints are known; the separability level at the local level is varied.


Page 27

Bound Prediction

  Local:  err ≤ ε_opt + ( (d log m + log 1/δ) / m )^{1/2}
  Global: err ≤ 0 + ( (c·d log m + c²·d + log 1/δ) / m )^{1/2}

[Plots: the bounds and simulated data, for local separability ε_opt = 0, 0.1, 0.2]

L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.


Page 28

Relative Merits: SRL

[Plot: performance of L+I vs. IBT as the difficulty of the local learning problem increases (varied via the number of features), from easy to hard.]

L+I is better. When the problem is artificially made harder, the tradeoff is clearer.


Page 29

Summary

When the local classification problems are "easy", L+I outperforms IBT.

Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.

Why does inference help at all?


Page 30

About Constraints

We always assume that global coherency is good. Constraints do help in real-world applications.

Performance, however, is usually measured on the local predictions.

Depending on the performance metric, constraints can hurt.


Page 31

Results: Contribution of Expressive Constraints [Roth & Yih 05]

Basic: learning with statistical constraints only; additional constraints added at evaluation time (for efficiency).

F1 as constraints are added cumulatively (diff = gain over the row below):

                     CRF-D            CRF-ML
                     F1     diff      F1     diff
  + disallow         73.91  +0.13     71.94  +0.22
  + verb pos         73.78  +0.07     71.72  +0.01
  + argument         73.71  +0.07     71.71  -0.07
  + cand             73.64  +3.90     71.78  +4.68
  + no dup           69.74  +0.60     67.10  +0.64
  basic (Viterbi)    69.14            66.46


Page 32

Assumptions

y = ⟨y_1, …, y_l⟩

Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.

Inference is a linear summation:

  h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
  h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)

C(Y) always contains the correct outputs. No assumption is made on the structure of the constraints.
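Under these assumptions both inference rules can be written down directly. A toy sketch by exhaustive enumeration over Y = {-1, 1}^n (the score functions and the constraint are invented for illustration):

    from itertools import product

    def h_un(f, n):
        """Unconstrained inference: argmax over all y of sum_i f_i(y_i)
        (the input x is folded into the f_i here)."""
        return max(product([-1, 1], repeat=n),
                   key=lambda y: sum(fi(yi) for fi, yi in zip(f, y)))

    def h_con(f, n, C):
        """Constrained inference: the same objective, restricted to C(Y)."""
        return max((y for y in product([-1, 1], repeat=n) if C(y)),
                   key=lambda y: sum(fi(yi) for fi, yi in zip(f, y)))

    # Three non-interactive classifiers and one constraint.
    f = [lambda y: 0.9 * y, lambda y: 0.2 * y, lambda y: -0.1 * y]
    C = lambda y: sum(y) <= -1            # at most one +1 label
    print(h_un(f, 3))      # (1, 1, -1): the unconstrained optimum
    print(h_con(f, 3, C))  # (1, -1, -1): the best output satisfying C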


Page 33

Performance Metrics

Zero-one loss: mistakes are counted in terms of global mistakes; y is wrong if any y_i is wrong.

Hamming loss: mistakes are counted in terms of local mistakes.
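The two metrics as a small sketch:

    def zero_one_loss(y_true, y_pred):
        """Global mistake: y is wrong if any local y_i is wrong."""
        return int(any(t != p for t, p in zip(y_true, y_pred)))

    def hamming_loss(y_true, y_pred):
        """Local mistakes: count how many y_i are wrong."""
        return sum(t != p for t, p in zip(y_true, y_pred))

    y, y_hat = (-1, 1, 1, -1), (-1, 1, -1, 1)
    print(zero_one_loss(y, y_hat))  # 1: the global output is wrong
    print(hamming_loss(y, y_hat))   # 2: two local mistakes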


Page 34

Zero-One Loss

Constraints cannot hurt: constraints never alter a global output that is already correct, since the correct output stays in C(Y).

This is not true for Hamming loss.


Page 35

Boolean Cube

4-bit binary outputs

[Figure: the Boolean cube of all 16 outputs, 0000 through 1111, arranged by number of mistakes (Hamming loss 0-4) against classifier score.]


Page 36

Hamming Loss

[Figure: the same Boolean cube, with the unconstrained prediction h_un = 0011 marked on the Hamming loss vs. score axes.]


Page 37

Best Classifiers

[Figure: the Boolean cube on Hamming loss vs. score axes, with the gaps between output scores marked (1, 2, 3, 1 + 2, 4).]


Page 38

When Constraints Cannot Hurt

δ_i: the distance between the correct label and the 2nd best label
ε_i: the distance between the predicted label and the correct label

F_correct = { i | f_i is correct }
F_wrong = { i | f_i is wrong }

Constraints cannot hurt if

  ∀ i ∈ F_correct : δ_i > Σ_{j ∈ F_wrong} ε_j
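A sketch of checking this sufficient condition on one example; the margins and distances are invented numbers:

    def constraints_cannot_hurt(delta, eps, correct):
        """For every correct local classifier i, its margin delta_i to the
        2nd best label must exceed the total distance of the wrong ones."""
        wrong_total = sum(e for j, e in eps.items() if j not in correct)
        return all(delta[i] > wrong_total for i in correct)

    # Classifiers 0 and 1 are correct with comfortable margins;
    # classifier 2 is wrong but its prediction is close to the truth.
    delta = {0: 1.5, 1: 2.0}   # margins of the correct classifiers
    eps = {2: 0.8}             # distance of the wrong prediction
    print(constraints_cannot_hurt(delta, eps, correct={0, 1}))  # True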


Page 39

An Empirical Investigation

SRL system, CoNLL-2005 WSJ test set:

                    Without Constraints   With Constraints
  Local Accuracy         82.38%               84.08%


Page 40

An Empirical Investigation

                           Number   Percentage
  Total Predictions        19822      100.00
  Incorrect Predictions     3492       17.62
  Correct Predictions      16330       82.38
  Violate the condition     1833        9.25


Page 41

Good Classifiers

[Figure: score landscape of a good classifier on the Boolean cube: outputs with higher Hamming loss receive lower scores.]


Page 42

Bad Classifiers

[Figure: score landscape of a bad classifier on the Boolean cube: some outputs with high Hamming loss still receive high scores.]


Page 43

Average Distance vs Gain in Hamming Loss

Good: high loss → low score (low gain)


Page 44

Good Classifiers

[Figure: score landscape of a good classifier on the Boolean cube: outputs with higher scores have lower Hamming loss.]


Page 45

Bad Classifiers

[Figure: score landscape of a bad classifier on the Boolean cube: some outputs with high scores still have high Hamming loss.]


Page 46

Average Gain in Hamming Loss vs Distance

Good: high score → low loss (high gain)


Page 47

Utility of Constraints

Constraints improve the performance because the classifiers are good.

Good classifiers:
  When the classifier is correct, it allows a large margin between the correct label and the 2nd best label.
  When the classifier is wrong, the correct label is not far away from the predicted one.


Page 48

Conclusions

Showed how global inference can be used: Semantic Role Labeling.

Tradeoffs between coupling vs. decoupling of learning and inference; investigation of the utility of constraints.

The analyses are very preliminary; future work includes:
  Average-case analysis of the tradeoffs between coupling vs. decoupling of learning and inference
  Better understanding of using constraints:
    More interactive classifiers
    Different performance metrics, e.g., F1
    Relation with margin