TRANSCRIPT
Global Inference in Learning for Natural Language Processing
Vasin Punyakanok
Department of Computer Science
University of Illinois at Urbana-Champaign
Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak
Page 2
Story Comprehension
1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
Page 3
Stand Alone Ambiguity Resolution
Context Sensitive Spelling Correction: Illinois' bored of education (bored → board)
Word Sense Disambiguation: ...Nissan Car and truck plant is... vs. ...divide life into plant and animal kingdom...
Part of Speech Tagging: (This DT) (can N) (will MD) (rust V), where the candidate tags are DT, MD, V, N
Coreference Resolution:
The dog bit the kid. He was taken to a hospital.
The dog bit the kid. He was taken to a veterinarian.
Page 4
Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
Yahoo acquired Overture.
Question Answering: Who acquired Overture?
Page 5
Inference and Learning
Global decisions in which several local decisions play a role, but there are mutual dependencies among their outcomes.
Learned classifiers for different sub-problems
Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain & context specific constraints.
Global inference for the best assignment to all variables of interest.
Page 6
Text Chunking
[Chunking diagram]
x = The guy standing there is so tall
y = chunk labels over the sentence (NP, VP, ADVP, ADJP spans)
Page 7
Full Parsing
[Parse tree diagram]
x = The guy standing there is so tall
y = the full parse tree (S, NP, VP, ADVP, ADJP nodes)
Page 8
Outline
Semantic Role Labeling Problem
  Global Inference with Integer Linear Programming
Some Issues with Learning and Inference
  Global vs. Local Training
  Utility of Constraints in the Inference
Conclusion
Page 9
Semantic Role Labeling
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location
Page 10
Semantic Role Labeling
PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Treebank II. (Almost) all the labels are on the constituents of the parse trees.
Core arguments: A0-A5 and AA; their semantics differ for each verb, as specified in the PropBank frame files
13 types of adjuncts labeled as AM-arg where arg specifies the adjunct type
Page 11
Semantic Role Labeling
Page 12
The Approach
Pruning: use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04])
Argument Identification: use a binary classifier to identify arguments
Argument Classification: use a multiclass classifier to classify arguments
Inference: infer the final output satisfying linguistic and structural constraints
Running example: I left my nice pearls to her
Page 13
Learning
Both argument identifier and argument classifier are trained phrase-based classifiers.
Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc.
[some make use of a full syntactic parse]
Learning Algorithm: SNoW, a sparse network of linear functions
Weights learned by a regularized Winnow multiplicative update rule, with averaged weight vectors
Probability conversion is done via softmax: p_i = exp(act_i) / Σ_j exp(act_j)
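A minimal sketch of this softmax conversion (not the SNoW implementation itself; the max-shift for numerical stability is my addition, the slide does not specify it):

```python
import numpy as np

def softmax(activations):
    """p_i = exp(act_i) / sum_j exp(act_j), computed with a max-shift for stability."""
    acts = np.asarray(activations, dtype=float)
    acts = acts - acts.max()        # shift so the largest activation is 0
    exps = np.exp(acts)
    return exps / exps.sum()

# e.g. softmax([2.0, 1.0, -0.5]) returns class probabilities summing to 1
```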
Page 14
Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
Input: the probability estimates (from the argument classifier) and the structural and linguistic constraints.
Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 15
Integer Linear Programming Inference
For each argument ai and type t (including null)
Set up a Boolean variable: ai,t indicating if ai is classified as t
Goal is to maximize
Σ_{i,t} score(a_i = t) · a_{i,t}
subject to the (linear) constraints; any Boolean constraint can be encoded this way.
If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
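A small sketch of this formulation using the PuLP modeling library. The candidate arguments, the label set, and the random scores below are hypothetical placeholders; only the variables, objective, and the "exactly one type per candidate" constraint follow the slide.

```python
import random
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

args = ["a0", "a1", "a2"]                      # hypothetical argument candidates
types = ["A0", "A1", "A2", "AM-LOC", "null"]   # hypothetical label set (null = no argument)

random.seed(0)
# score[(a, t)] would come from the argument classifier, e.g. log P(a = t).
score = {(a, t): random.random() for a in args for t in types}

prob = LpProblem("srl_inference", LpMaximize)
# Boolean variable a_{i,t}: candidate a is assigned type t.
x = {(a, t): LpVariable(f"x_{i}_{j}", cat=LpBinary)
     for i, a in enumerate(args) for j, t in enumerate(types)}

# Objective: maximize sum_{i,t} score(a_i = t) * a_{i,t}
prob += lpSum(score[a, t] * x[a, t] for a in args for t in types)

# Each candidate takes exactly one type (possibly null).
for a in args:
    prob += lpSum(x[a, t] for t in types) == 1

prob.solve()
print({a: t for a in args for t in types if x[a, t].value() == 1})
```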
Page 16
Linear Constraints
No overlapping or embedding arguments
If a_i and a_j overlap or embed: a_{i,null} + a_{j,null} ≥ 1
Page 17
Constraints
No overlapping or embedding arguments
No duplicate argument classes for A0-A5
Exactly one V argument per predicate
If there is a C-V, there must be a V-A1-C-V pattern
If there is an R-arg, there must be an arg somewhere
If there is a C-arg, there must be an arg somewhere before it
Each predicate can take only the core arguments that appear in its frame file; more specifically, we check only the minimum and maximum argument ids
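Continuing the hypothetical PuLP sketch from the ILP slide (it reuses prob, x, args, and lpSum defined there), the first two constraints in this list could be encoded as linear inequalities like this; the overlapping pair is a made-up example, since real pairs come from the candidates' spans:

```python
# No duplicate argument classes for A0-A5: each core label is used at most once.
for t in ["A0", "A1", "A2"]:                   # extend to A3-A5 where present
    prob += lpSum(x[a, t] for a in args) <= 1

# No overlapping or embedding arguments: for any overlapping pair (ai, aj),
# at least one of the two candidates must be labeled null.
overlapping_pairs = [("a0", "a1")]             # hypothetical overlap derived from spans
for ai, aj in overlapping_pairs:
    prob += x[ai, "null"] + x[aj, "null"] >= 1
```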
Page 18
SRL Results (CoNLL-2005)
Training: sections 02-21
Development: section 24
Test WSJ: section 23
Test Brown: from the Brown corpus (very small)
              Parser     Precision   Recall    F1
Development   Collins      73.89      70.11    71.95
              Charniak     75.40      74.13    74.76
Test WSJ      Collins      77.09      72.00    74.46
              Charniak     78.10      76.15    77.11
Test Brown    Collins      68.03      63.34    65.60
              Charniak     67.15      63.57    65.31
Page 19
Inference with Multiple SRL systems
Goal is to maximize
Σ_{i,t} score(a_i = t) · a_{i,t}
subject to the (linear) constraints; any Boolean constraint can be encoded this way.
score(a_i = t) = Σ_k f_k(a_i = t)
If system k has no opinion on ai, use a prior instead
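A hedged sketch of this score combination; the dictionary-based interface is my own, the slide only specifies summing over the systems and backing off to a prior when a system has no opinion on a candidate:

```python
def combined_score(a, t, system_scores, prior):
    """score(a_i = t) = sum_k f_k(a_i = t); if system k has no opinion on a_i,
    its contribution is replaced by a prior score for type t."""
    total = 0.0
    for f_k in system_scores:          # f_k: dict mapping candidate -> {type: score}
        total += f_k[a][t] if a in f_k else prior[t]
    return total
```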
Page 20
Results with Multiple Systems (CoNLL-2005), F1
[Bar chart comparing the individual systems (Col, Char, Char-2 through Char-5) and the Combined system on Devel, WSJ, and Brown; the Combined system reaches F1 of 77.35 (Devel), 79.44 (WSJ), and 67.75 (Brown)]
Page 21
Outline
Semantic Role Labeling Problem
  Global Inference with Integer Linear Programming
Some Issues with Learning and Inference
  Global vs. Local Training
  Utility of Constraints in the Inference
Conclusion
Page 22
L+I (Learning plus Inference): training w/o constraints; testing: inference with constraints
IBT: Inference-Based Training
Learning and Inference
[Diagram: input variables x1-x7 (X) feed local classifiers f1(x)-f5(x), which produce output variables y1-y5 (Y)]
Learning the components together!
Which one is better? When and Why?
Page 23
Comparisons of Learning Approaches
Coupling (IBT): optimizes the true global objective function (this should be better in the limit)
Decoupling (L+I): more efficient; reusability of classifiers; modularity in training; no global examples required
Page 24
Claims
When the local classification problems are “easy”, L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.
We will show experimental results and theoretical intuition to support these claims.
Page 25
Perceptron-based Global Learning
[Diagram: input variables x1-x7 feed local classifiers f1(x)-f5(x), producing the local predictions Y' = (-1, 1, 1, 1, 1). The true global labeling is Y = (-1, 1, 1, -1, -1). Applying the constraints changes the prediction to Y' = (-1, 1, 1, 1, -1).]
Page 26
Simulation
There are 5 local binary linear classifiers; the global classifier is also linear:
h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
Constraints are randomly generated.
The hypothesis is linearly separable at the global level, given that the constraints are known.
The separability level at the local level is varied.
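A rough sketch of what such a simulation could look like; the feature dimension, the ±1 encoding, and the random-subset stand-in for the randomly generated constraints are all assumptions on my part:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_local, d = 5, 10                         # 5 local binary classifiers, assumed feature dim

W = rng.normal(size=(n_local, d))          # one linear classifier per output bit

# Stand-in for randomly generated constraints: keep a random subset of the
# 2^5 global labelings as the legitimate set C(Y).
all_y = list(product((-1, 1), repeat=n_local))
C_Y = [all_y[i] for i in rng.choice(len(all_y), size=12, replace=False)]

def h(x):
    """Global classifier: h(x) = argmax_{y in C(Y)} sum_i f_i(x, y_i),
    with f_i(x, y_i) = y_i * (w_i . x_i)."""
    acts = np.einsum("id,id->i", W, x)     # local activations w_i . x_i
    return max(C_Y, key=lambda y: float(np.dot(y, acts)))

print(h(rng.normal(size=(n_local, d))))
```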
Page 27
Bound Prediction
Local: ε ≤ ε_opt + ((d log m + log 1/δ) / m)^{1/2}
Global: ε ≤ ε_0 + ((c·d·log m + c²·d + log 1/δ) / m)^{1/2}
[Plots: theoretical bounds and simulated data for ε_opt = 0, 0.1, 0.2]
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
Page 28
Relative Merits: SRL
[Plot: relative performance of L+I and IBT vs. the difficulty of the learning problem (# features), from easy to hard]
L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
Page 29
Summary
When the local classification problems are “easy”, L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.
Why does inference help at all?
Page 30
About Constraints
We always assume that global coherency is good. Constraints do help in real-world applications.
Performance is usually measured on the local predictions.
Depending on the performance metric, constraints can hurt
Page 31
Results: Contribution of Expressive Constraints [Roth & Yih 05]
Basic: learning with statistical constraints only; additional constraints are added at evaluation time (for efficiency).

F1                  CRF-D    diff    CRF-ML   diff
basic (Viterbi)     69.14     -      66.46     -
+ no dup            69.74   +0.60    67.10   +0.64
+ cand              73.64   +3.90    71.78   +4.68
+ argument          73.71   +0.07    71.71   -0.07
+ verb pos          73.78   +0.07    71.72   +0.01
+ disallow          73.91   +0.13    71.94   +0.22
Page 32
Assumptions
y = ⟨y1, …, yl⟩
Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.
Inference is a linear summation:
h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
C(Y) always contains the correct outputs.
No assumption on the structure of the constraints.
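A toy sketch of the two predictors on binary outputs; the scores and the constraint set below are made up, only the two argmax forms follow the slide:

```python
from itertools import product

def h_un(scores):
    """Unconstrained: the argmax over all of Y decomposes into independent local argmaxes."""
    return tuple(max((-1, 1), key=lambda yi: s[yi]) for s in scores)

def h_con(scores, C):
    """Constrained: argmax over y in C(Y) of sum_i f_i(x, y_i)."""
    return max(C, key=lambda y: sum(s[yi] for s, yi in zip(scores, y)))

# scores[i][y_i] plays the role of f_i(x, y_i); three toy output bits here.
scores = [{-1: 0.2, 1: 0.9}, {-1: 0.6, 1: 0.4}, {-1: 0.1, 1: 0.8}]
C = [y for y in product((-1, 1), repeat=3) if sum(y) <= -1]   # an arbitrary constraint set
print(h_un(scores), h_con(scores, C))   # the two predictions differ here
```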
Page 33
Performance Metrics
Zero-one loss: mistakes are counted in terms of global mistakes; y is wrong if any y_i is wrong.
Hamming loss: mistakes are counted in terms of local mistakes.
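The two metrics in code form (a trivial sketch; y_true and y_pred are sequences of local labels):

```python
def zero_one_loss(y_true, y_pred):
    """Global mistake: 1 if any local prediction is wrong, else 0."""
    return int(any(t != p for t, p in zip(y_true, y_pred)))

def hamming_loss(y_true, y_pred):
    """Local mistakes: the number of positions where the prediction is wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred))

# e.g. y_true = (0, 0, 1, 1), y_pred = (0, 1, 1, 1): zero-one loss 1, Hamming loss 1
```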
Page 34
Zero-One Loss
Constraints cannot hurt: constraints never change a global output that is already correct (the correct output is always in C(Y)).
This is not true for Hamming Loss
Page 35
Boolean Cube: 4-bit binary outputs
[Diagram: the 16 outputs 0000 through 1111 arranged by their Hamming distance from the correct output 0000 (0 to 4 mistakes), with Hamming loss on one axis and classifier score on the other]
Page 36
Hamming Loss
[Diagram: the same Boolean cube; the unconstrained prediction h_un is marked at 0011, i.e. two local mistakes relative to the correct output 0000]
Page 37
Best Classifiers
[Diagram: the Boolean cube annotated with classifier scores and distances]
Page 38
When Constraints Cannot Hurt
δ_i: the distance between the correct label and the 2nd best label
γ_i: the distance between the predicted label and the correct label
F_correct = { i | f_i is correct }
F_wrong = { i | f_i is wrong }
Constraints cannot hurt if
∀ i ∈ F_correct : δ_i > Σ_{j ∈ F_wrong} γ_j
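A direct reading of this condition in code; the δ/γ names are my reconstruction of the garbled symbols, so treat this as a sketch:

```python
def constraints_cannot_hurt(delta, gamma, F_correct, F_wrong):
    """Sufficient condition from the slide: for every correct classifier i,
    its margin delta[i] (correct label vs. 2nd best) must exceed the total
    gamma over the wrong classifiers (predicted label vs. correct label)."""
    total_wrong = sum(gamma[j] for j in F_wrong)
    return all(delta[i] > total_wrong for i in F_correct)
```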
Page 39
An Empirical Investigation
SRL system, CoNLL-2005 WSJ test set

                    Without Constraints    With Constraints
Local Accuracy           82.38%                 84.08%
Page 40
An Empirical Investigation
                         Number    Percentage
Total Predictions         19822       100.00
Incorrect Predictions      3492        17.62
Correct Predictions       16330        82.38
Violate the condition      1833         9.25
Page 41
Good Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for good classifiers]
Page 42
Bad Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for bad classifiers]
Page 43
Average Distance vs. Gain in Hamming Loss
[Plot] For good classifiers: high loss → low score (low gain)
Page 44
Good Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for good classifiers]
Page 45
Bad Classifiers
[Diagram: the Boolean cube, Hamming loss vs. classifier score for bad classifiers]
Page 46
Average Gain in Hamming Loss vs. Distance
[Plot] For good classifiers: high score → low loss (high gain)
Page 47
Utility of Constraints
Constraints improve the performance because the classifiers are good
Good classifiers:
When the classifier is correct, it allows a large margin between the correct label and the 2nd best label.
When the classifier is wrong, the correct label is not far from the predicted one.
Page 48
Conclusions
Showed how global inference can be used: Semantic Role Labeling
Tradeoffs between coupling vs. decoupling learning and inference
Investigation of the utility of constraints
The analyses are very preliminary. Open directions:
Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference
Better understanding of using constraints
More interactive classifiers
Different performance metrics, e.g. F1
Relation with margin