TRANSCRIPT
Global Inference in Learning for Natural Language Processing
Vasin Punyakanok
Department of Computer Science
University of Illinois at Urbana-Champaign
Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak
Page 2
Story Comprehension
1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
Page 3
Stand Alone Ambiguity Resolution
Context Sensitive Spelling Correction: Illinois' bored of education (bored → board)
Word Sense Disambiguation: ...Nissan Car and truck plant is... vs. ...divide life into plant and animal kingdom...
Part of Speech Tagging: (This DT) (can N) (will MD) (rust V), where the candidate tags are DT, MD, V, N
Coreference Resolution:
The dog bit the kid. He was taken to a hospital.
The dog bit the kid. He was taken to a veterinarian.
Page 4
Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
Yahoo acquired Overture.
Question Answering: Who acquired Overture?
Page 5
Inference and Learning
Global decisions in which several local decisions play a role, but there are mutual dependencies among their outcomes.
Learned classifiers for different sub-problems
Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain & context specific constraints.
Global inference for the best assignment to all variables of interest.
Page 6
Text Chunking
[Chunking diagram]
x = The guy standing there is so tall
y = chunk labels over the sentence (NP, VP, ADVP, ADJP spans)
Page 7
Full Parsing
[Parse tree diagram]
x = The guy standing there is so tall
y = the full parse tree (S, NP, VP, ADVP, ADJP nodes)
Page 8
Outline
Semantic Role Labeling Problem
  Global Inference with Integer Linear Programming
Some Issues with Learning and Inference
  Global vs. Local Training
  Utility of Constraints in the Inference
Conclusion
Page 9
Semantic Role Labeling
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location
Page 10
Semantic Role Labeling
PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Treebank II. (Almost) all the labels are on the constituents of the parse trees.
Core arguments: A0-A5 and AA; their semantics differ for each verb, as specified in the PropBank frame files
13 types of adjuncts labeled as AM-arg where arg specifies the adjunct type
Page 11
Semantic Role Labeling
Page 12
The Approach
Pruning: use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04])
Argument Identification: use a binary classifier to identify arguments
Argument Classification: use a multiclass classifier to classify arguments
Inference: infer the final output satisfying linguistic and structural constraints
Running example: I left my nice pearls to her
Page 13
Learning
Both argument identifier and argument classifier are trained phrase-based classifiers.
Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc.
[some make use of a full syntactic parse]
Learning Algorithm: SNoW, a sparse network of linear functions
Weights learned by a regularized Winnow multiplicative update rule, with averaged weight vectors
Probability conversion is done via softmax: p_i = exp(act_i) / Σ_j exp(act_j)
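A minimal sketch of this softmax conversion (not the SNoW implementation itself; the max-shift for numerical stability is my addition, the slide does not specify it):

```python
import numpy as np

def softmax(activations):
    """p_i = exp(act_i) / sum_j exp(act_j), computed with a max-shift for stability."""
    acts = np.asarray(activations, dtype=float)
    acts = acts - acts.max()        # shift so the largest activation is 0
    exps = np.exp(acts)
    return exps / exps.sum()

# e.g. softmax([2.0, 1.0, -0.5]) returns class probabilities summing to 1
```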
Page 14
Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
Input: the probability estimates (from the argument classifier) and the structural and linguistic constraints.
Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 15
Integer Linear Programming Inference
For each argument ai and type t (including null)
Set up a Boolean variable: ai,t indicating if ai is classified as t
Goal is to maximize
Σ_{i,t} score(a_i = t) · a_{i,t}
subject to the (linear) constraints; any Boolean constraint can be encoded this way.
If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
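A small sketch of this formulation using the PuLP modeling library. The candidate arguments, the label set, and the random scores below are hypothetical placeholders; only the variables, objective, and the "exactly one type per candidate" constraint follow the slide.

```python
import random
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

args = ["a0", "a1", "a2"]                      # hypothetical argument candidates
types = ["A0", "A1", "A2", "AM-LOC", "null"]   # hypothetical label set (null = no argument)

random.seed(0)
# score[(a, t)] would come from the argument classifier, e.g. log P(a = t).
score = {(a, t): random.random() for a in args for t in types}

prob = LpProblem("srl_inference", LpMaximize)
# Boolean variable a_{i,t}: candidate a is assigned type t.
x = {(a, t): LpVariable(f"x_{i}_{j}", cat=LpBinary)
     for i, a in enumerate(args) for j, t in enumerate(types)}

# Objective: maximize sum_{i,t} score(a_i = t) * a_{i,t}
prob += lpSum(score[a, t] * x[a, t] for a in args for t in types)

# Each candidate takes exactly one type (possibly null).
for a in args:
    prob += lpSum(x[a, t] for t in types) == 1

prob.solve()
print({a: t for a in args for t in types if x[a, t].value() == 1})
```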
Page 16
Linear Constraints
No overlapping or embedding arguments
If a_i and a_j overlap or embed: a_{i,null} + a_{j,null} ≥ 1
Page 17
Constraints
No overlapping or embedding arguments
No duplicate argument classes for A0-A5
Exactly one V argument per predicate
If there is a C-V, there must be a V-A1-C-V pattern
If there is an R-arg, there must be an arg somewhere
If there is a C-arg, there must be an arg somewhere before it
Each predicate can take only the core arguments that appear in its frame file; more specifically, we check only the minimum and maximum argument ids
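Continuing the hypothetical PuLP sketch from the ILP slide (it reuses prob, x, args, and lpSum defined there), the first two constraints in this list could be encoded as linear inequalities like this; the overlapping pair is a made-up example, since real pairs come from the candidates' spans:

```python
# No duplicate argument classes for A0-A5: each core label is used at most once.
for t in ["A0", "A1", "A2"]:                   # extend to A3-A5 where present
    prob += lpSum(x[a, t] for a in args) <= 1

# No overlapping or embedding arguments: for any overlapping pair (ai, aj),
# at least one of the two candidates must be labeled null.
overlapping_pairs = [("a0", "a1")]             # hypothetical overlap derived from spans
for ai, aj in overlapping_pairs:
    prob += x[ai, "null"] + x[aj, "null"] >= 1
```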
Page 18
SRL Results (CoNLL-2005)
Training: sections 02-21
Development: section 24
Test WSJ: section 23
Test Brown: from the Brown corpus (very small)
              Parser     Precision   Recall    F1
Development   Collins      73.89      70.11    71.95
              Charniak     75.40      74.13    74.76
Test WSJ      Collins      77.09      72.00    74.46
              Charniak     78.10      76.15    77.11
Test Brown    Collins      68.03      63.34    65.60
              Charniak     67.15      63.57    65.31
Page 19
Inference with Multiple SRL systems
Goal is to maximize
Σ_{i,t} score(a_i = t) · a_{i,t}
subject to the (linear) constraints; any Boolean constraint can be encoded this way.
score(a_i = t) = Σ_k f_k(a_i = t)
If system k has no opinion on ai, use a prior instead
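A hedged sketch of this score combination; the dictionary-based interface is my own, the slide only specifies summing over the systems and backing off to a prior when a system has no opinion on a candidate:

```python
def combined_score(a, t, system_scores, prior):
    """score(a_i = t) = sum_k f_k(a_i = t); if system k has no opinion on a_i,
    its contribution is replaced by a prior score for type t."""
    total = 0.0
    for f_k in system_scores:          # f_k: dict mapping candidate -> {type: score}
        total += f_k[a][t] if a in f_k else prior[t]
    return total
```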
Page 20
Results with Multiple Systems (CoNLL-2005), F1
[Bar chart comparing the individual systems (Col, Char, Char-2 through Char-5) and the Combined system on Devel, WSJ, and Brown; the Combined system reaches F1 of 77.35 (Devel), 79.44 (WSJ), and 67.75 (Brown)]
Page 21
Outline
Semantic Role Labeling Problem
  Global Inference with Integer Linear Programming
Some Issues with Learning and Inference
  Global vs. Local Training
  Utility of Constraints in the Inference
Conclusion
Page 22
L+I (Learning plus Inference): training w/o constraints; testing: inference with constraints
IBT: Inference-Based Training
Learning and Inference
[Diagram: input variables x1-x7 (X) feed local classifiers f1(x)-f5(x), which produce output variables y1-y5 (Y)]
Learning the components together!
Which one is better? When and Why?
Page 23
Comparisons of Learning Approaches
Coupling (IBT): optimizes the true global objective function (this should be better in the limit)
Decoupling (L+I): more efficient; reusability of classifiers; modularity in training; no global examples required
Page 24
Claims
When the local classification problems are “easy”, L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.
We will show experimental results and theoretical intuition to support these claims.
Page 25
Perceptron-based Global Learning
[Diagram: input variables x1-x7 feed local classifiers f1(x)-f5(x), producing the local predictions Y' = (-1, 1, 1, 1, 1). The true global labeling is Y = (-1, 1, 1, -1, -1). Applying the constraints changes the prediction to Y' = (-1, 1, 1, 1, -1).]
Page 26
Simulation
There are 5 local binary linear classifiers; the global classifier is also linear:
h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
Constraints are randomly generated.
The hypothesis is linearly separable at the global level, given that the constraints are known.
The separability level at the local level is varied.
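A rough sketch of what such a simulation could look like; the feature dimension, the ±1 encoding, and the random-subset stand-in for the randomly generated constraints are all assumptions on my part:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_local, d = 5, 10                         # 5 local binary classifiers, assumed feature dim

W = rng.normal(size=(n_local, d))          # one linear classifier per output bit

# Stand-in for randomly generated constraints: keep a random subset of the
# 2^5 global labelings as the legitimate set C(Y).
all_y = list(product((-1, 1), repeat=n_local))
C_Y = [all_y[i] for i in rng.choice(len(all_y), size=12, replace=False)]

def h(x):
    """Global classifier: h(x) = argmax_{y in C(Y)} sum_i f_i(x, y_i),
    with f_i(x, y_i) = y_i * (w_i . x_i)."""
    acts = np.einsum("id,id->i", W, x)     # local activations w_i . x_i
    return max(C_Y, key=lambda y: float(np.dot(y, acts)))

print(h(rng.normal(size=(n_local, d))))
```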
Page 27
Bound Prediction
Local: ε ≤ ε_opt + ((d log m + log 1/δ) / m)^{1/2}
Global: ε ≤ ε_0 + ((c·d·log m + c²·d + log 1/δ) / m)^{1/2}
[Plots: theoretical bounds and simulated data for ε_opt = 0, 0.1, 0.2]
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
Page 28
Relative Merits: SRL
[Plot: relative performance of L+I and IBT vs. the difficulty of the learning problem (# features), from easy to hard]
L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
Page 29
Summary
When the local classification problems are “easy”, L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.
Why does inference help at all?
Page 30
About Constraints
We always assume that global coherency is good. Constraints do help in real-world applications.
Performance is usually measured on the local predictions.
Depending on the performance metric, constraints can hurt
Page 31
Results: Contribution of Expressive Constraints [Roth & Yih 05]
Basic: learning with statistical constraints only; additional constraints are added at evaluation time (for efficiency).

F1                  CRF-D    diff    CRF-ML   diff
basic (Viterbi)     69.14     -      66.46     -
+ no dup            69.74   +0.60    67.10   +0.64
+ cand              73.64   +3.90    71.78   +4.68
+ argument          73.71   +0.07    71.71   -0.07
+ verb pos          73.78   +0.07    71.72   +0.01
+ disallow          73.91   +0.13    71.94   +0.22
Page 32
Assumptions
y = ⟨y1, …, yl⟩
Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.
Inference is a linear summation:
h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
C(Y) always contains the correct outputs.
No assumption on the structure of the constraints.
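A toy sketch of the two predictors on binary outputs; the scores and the constraint set below are made up, only the two argmax forms follow the slide:

```python
from itertools import product

def h_un(scores):
    """Unconstrained: the argmax over all of Y decomposes into independent local argmaxes."""
    return tuple(max((-1, 1), key=lambda yi: s[yi]) for s in scores)

def h_con(scores, C):
    """Constrained: argmax over y in C(Y) of sum_i f_i(x, y_i)."""
    return max(C, key=lambda y: sum(s[yi] for s, yi in zip(scores, y)))

# scores[i][y_i] plays the role of f_i(x, y_i); three toy output bits here.
scores = [{-1: 0.2, 1: 0.9}, {-1: 0.6, 1: 0.4}, {-1: 0.1, 1: 0.8}]
C = [y for y in product((-1, 1), repeat=3) if sum(y) <= -1]   # an arbitrary constraint set
print(h_un(scores), h_con(scores, C))   # the two predictions differ here
```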
Page 33
Performance Metrics
Zero-one loss: mistakes are counted in terms of global mistakes; y is wrong if any y_i is wrong.
Hamming loss: mistakes are counted in terms of local mistakes.
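The two metrics in code form (a trivial sketch; y_true and y_pred are sequences of local labels):

```python
def zero_one_loss(y_true, y_pred):
    """Global mistake: 1 if any local prediction is wrong, else 0."""
    return int(any(t != p for t, p in zip(y_true, y_pred)))

def hamming_loss(y_true, y_pred):
    """Local mistakes: the number of positions where the prediction is wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred))

# e.g. y_true = (0, 0, 1, 1), y_pred = (0, 1, 1, 1): zero-one loss 1, Hamming loss 1
```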
Page 34
Zero-One Loss
Constraints cannot hurt: constraints never change a global output that is already correct (the correct output is always in C(Y)).
This is not true for Hamming Loss
Page 35
Boolean Cube: 4-bit binary outputs
[Diagram: the 16 outputs 0000 through 1111 arranged by their Hamming distance from the correct output 0000 (0 to 4 mistakes), with Hamming loss on one axis and classifier score on the other]
Page 36
Hamming Loss
[Diagram: the same Boolean cube; the unconstrained prediction h_un is marked at 0011, i.e. two local mistakes relative to the correct output 0000]
Page 37
Best Classifiers
[Diagram: the Boolean cube annotated with classifier scores and distances]
Page 38
When Constraints Cannot Hurt
δ_i: the distance between the correct label and the 2nd best label
γ_i: the distance between the predicted label and the correct label
F_correct = { i | f_i is correct }
F_wrong = { i | f_i is wrong }
Constraints cannot hurt if
∀ i ∈ F_correct : δ_i > Σ_{j ∈ F_wrong} γ_j
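A direct reading of this condition in code; the δ/γ names are my reconstruction of the garbled symbols, so treat this as a sketch:

```python
def constraints_cannot_hurt(delta, gamma, F_correct, F_wrong):
    """Sufficient condition from the slide: for every correct classifier i,
    its margin delta[i] (correct label vs. 2nd best) must exceed the total
    gamma over the wrong classifiers (predicted label vs. correct label)."""
    total_wrong = sum(gamma[j] for j in F_wrong)
    return all(delta[i] > total_wrong for i in F_correct)
```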
Page 39
An Empirical Investigation
SRL system, CoNLL-2005 WSJ test set

                    Without Constraints    With Constraints
Local Accuracy           82.38%                 84.08%
Page 40
An Empirical Investigation
                         Number    Percentage
Total Predictions         19822       100.00
Incorrect Predictions      3492        17.62
Correct Predictions       16330        82.38
Violate the condition      1833         9.25
Page 41
Good Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for good classifiers]
Page 42
Bad Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for bad classifiers]
Page 43
Average Distance vs. Gain in Hamming Loss
[Plot] For good classifiers: high loss → low score (low gain)
Page 44
Good Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for good classifiers]
Page 45
Bad Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for bad classifiers]
Page 46
Average Gain in Hamming Loss vs. Distance
[Plot] For good classifiers: high score → low loss (high gain)
Page 47
Utility of Constraints
Constraints improve the performance because the classifiers are good
Good classifiers:
When the classifier is correct, it allows a large margin between the correct label and the 2nd best label.
When the classifier is wrong, the correct label is not far from the predicted one.
Page 48
Conclusions
Showed how global inference can be used: Semantic Role Labeling
Tradeoffs between coupling vs. decoupling learning and inference
Investigation of the utility of constraints
The analyses are very preliminary. Open directions:
Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference
Better understanding of using constraints
More interactive classifiers
Different performance metrics, e.g. F1
Relation with margin