Structured Learning: Introduction
Hung-yi Lee
Source: speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015/structured...
TRANSCRIPT
Structured Learning
• We need a more powerful function f: X → Y
• Input and output are both objects with structure
• Objects: sequences, lists, trees, …
• X is the space of one kind of object; Y is the space of another kind of object
Example Applications
• Translation
  • X: Mandarin sentence (sequence) → Y: English sentence (sequence)
• Summarization
  • X: long document → Y: summary (short paragraph)
• Retrieval
  • X: keyword → Y: search result (a list of web pages)
• Speech recognition
  • X: speech signal (sequence) → Y: text (sequence)
• Syntactic parsing
  • X: sentence → Y: parse tree (tree structure)
• … just to name a few
Unified Framework
• Find a function F: X × Y → R
  • F(x,y) evaluates how compatible the objects x and y are
• Step 1: Training
  • Find F
• Step 2: Inference (Testing)
  • Given an object x, find
    ỹ = arg max_{y∈Y} F(x,y)
• Original target: f: X → Y
  Now: f(x) = arg max_{y∈Y} F(x,y)
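The two steps above can be sketched in a few lines, assuming a toy compatibility function F backed by a lookup table and a tiny enumerable candidate set Y (both are illustrative stand-ins; real tasks have enormous Y spaces):

```python
# Minimal sketch of the unified framework: inference is an arg max of a
# compatibility score F(x, y) over candidate outputs y.  The table-based F
# and the tiny candidate set are toy assumptions for illustration only.

def make_f(score_table):
    """Build F(x, y) from a lookup table; unseen pairs get a very low score."""
    def F(x, y):
        return score_table.get((x, y), float("-inf"))
    return F

def infer(F, x, candidates):
    """Step 2 (inference): return the y in Y most compatible with x."""
    return max(candidates, key=lambda y: F(x, y))

# Scores taken from the translation example in the slides.
scores = {
    ("How are you", "你好嗎"): 10,
    ("How are you", "嗨"): 4,
    ("How are you", "哈哈哈"): 0,
    ("How are you", "天氣好嗎"): -5,
}
F = make_f(scores)
y_best = infer(F, "How are you", ["你好嗎", "嗨", "哈哈哈", "天氣好嗎"])
print(y_best)  # 你好嗎
```

In practice Step 1 (learning F) is the hard part; here F is hard-coded just to make the inference step concrete.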
Unified Framework - Translation
• Find a function F: X × Y → R
  • F(x,y) evaluates how compatible x and y are
• x: English sentence; y: Chinese sentence
• F(x,y): how suitable it is if x is translated into y

  x = "How are you", y = "你好嗎"     F(x,y) = 10
  x = "How are you", y = "嗨"         F(x,y) = 4
  x = "How are you", y = "哈哈哈"     F(x,y) = 0
  x = "How are you", y = "天氣好嗎"   F(x,y) = -5
Unified Framework - Translation
• Step 2: Inference: given input x = "I am sorry", enumerate all possible Chinese sentences y
  e.g. 對不起、天氣真好、哈哈哈、人帥真好 ……

  x = "I am sorry", y = "對不起"     F(x,y) = 10
  x = "I am sorry", y = "天氣很好"   F(x,y) = 1
  x = "I am sorry", y = "人帥真好"   F(x,y) = 0
  x = "I am sorry", y = "哈哈哈"     F(x,y) = -1

• The y with the largest F(x,y) is the translation result.
Unified Framework - Summarization
• Task description
  • Given a long document
  • Select a set of sentences from the document, and concatenate the sentences to form a short paragraph
• X: long document = {s1, s2, s3, … si, …} (si: the i-th sentence)
• Y: summary, e.g. {s1, s3, s5}
Unified Framework - Summarization
• Step 1: Training: learn F(x,y) from document-summary pairs, e.g. for documents d1 and d2 with their correct summaries, F(x,y) should be large.
• Step 2: Inference: for a new document d', evaluate F(x,y) over candidate sentence sets, e.g. {s2, s4, s6}, {s3, s6, s9}, {s1, s3, s5}, and output the candidate with the largest F.
Unified Framework - Retrieval
• Task description
  • The user inputs a keyword Q
  • The system returns a list of web pages
• X: keyword, e.g. "Obama"
• Y: a list of web pages (the search result), e.g. d10011, d98776, …
Unified Framework - Retrieval
• Step 1: Training: learn F(x,y) from keyword-ranking pairs, e.g. the correct rankings for x = "Obama", x = "Haruhi", x = "Bush" should get large F(x,y).
• Step 2: Inference: for a keyword such as x = "Haruhi", evaluate F(x,y) over candidate rankings (different orderings of the pages, e.g. [d103, d304, …] vs. [d103, d305, …]) and return the ranking with the largest F.
Unified Framework vs. Statistics
• F(x,y) = P(x,y)?
• Find a function F: X × Y → R ↔ Estimate the probability P: X × Y → [0,1]
• Step 1: Training ↔ Step 1: Estimation
  • Estimate the probability P(x,y)
• Step 2: Inference
  • Given an object x:
    ỹ = arg max_{y∈Y} P(y|x) = arg max_{y∈Y} P(x,y)/P(x) = arg max_{y∈Y} P(x,y)
Unified Framework vs. Statistics
• F(x,y) = P(x,y)?
• Strength of probability: meaningful
• Drawbacks of probability:
  • Probability cannot explain everything
  • The 0-1 constraint is not necessary
Unified Framework
• Solve any task in two steps
• Easier than putting an elephant into a refrigerator
Really? No, we have to answer three problems.
Problem 1
• Evaluation: what does F(x,y) look like?
• How does F(x,y) compute the "compatibility" of objects x and y?
• Translation: F(x = "How are you", y = "你好嗎")
• Summarization: F(x = a long document, y = a short paragraph)
• Retrieval: F(x = "Obama", y = a search result)
Problem 2
• Inference: how to solve the "arg max" problem
  ỹ = arg max_{y∈Y} F(x,y)
• The space Y can be extremely large!
  • Translation: Y = all possible Mandarin sentences …
  • Summarization: Y = all combinations of sentence sets in a document …
  • Retrieval: Y = all possible web page rankings …
Problem 3
• Training: given training data, how to find F(x,y)
• Training data: {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …}
• Principle: we should find F(x,y) such that
  F(x^1, ŷ^1) > F(x^1, y) for all y ≠ ŷ^1
  F(x^2, ŷ^2) > F(x^2, y) for all y ≠ ŷ^2
  ……
  F(x^r, ŷ^r) > F(x^r, y) for all y ≠ ŷ^r
3 Problems
• Problem 1: Evaluation: what does F(x,y) look like?
• Problem 2: Inference: how to solve the "arg max" problem
  ỹ = arg max_{y∈Y} F(x,y)
• Problem 3: Training: given training data, how to find F(x,y)
Structured Linear Model: Reduce 3 Problems to 2

Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• F(x,y) is a weighted sum of characteristics (features) of the pair (x,y):
  F(x,y) = w1·φ1(x,y) + w2·φ2(x,y) + w3·φ3(x,y) + …
• In vector form:
  F(x,y) = w · φ(x,y)
  where w = [w1, w2, w3, …] is learned from data and φ(x,y) = [φ1(x,y), φ2(x,y), φ3(x,y), …] is the feature vector.
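The weighted sum above is just a dot product; a minimal sketch, with a made-up 3-dimensional feature vector standing in for a real φ:

```python
# Sketch of the structured linear model: F(x, y) is a dot product between a
# weight vector w (learned from data in practice) and a hand-crafted feature
# vector φ(x, y).  The numbers here are arbitrary toy values.

def linear_F(w, phi):
    """F(x, y) = w · φ(x, y), written out as an explicit weighted sum."""
    return sum(w_i * phi_i for w_i, phi_i in zip(w, phi))

w = [0.5, -1.0, 2.0]        # w1, w2, w3
phi_xy = [3.0, 1.0, 0.0]    # φ1(x,y), φ2(x,y), φ3(x,y)
print(linear_F(w, phi_xy))  # 0.5*3 - 1.0*1 + 2.0*0 = 0.5
```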
Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• Example: Translation: x is an English sentence, y is a Chinese sentence
  φ1(x,y): how many times "你" "好" occur consecutively in y
  φ2(x,y): how many times "好" "你" occur consecutively in y
  φ3(x,y): 1 if "I" is in x and "我" is in y, 0 otherwise
  φ4(x,y): 1 if "I" is in x and "你" is in y, 0 otherwise
  φ5(x,y): 1 if "I am" is in x and "我是" is in y, 0 otherwise
  φ6(x,y): 1 if "I" is the 1st word of x and "我" is the 1st word of y, 0 otherwise
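A rough implementation of the example features above; the character-level handling of the Chinese side and the exact feature definitions are our reading of the slide, not a reference translation system:

```python
# Toy feature extractor for the translation features φ1..φ6 sketched above.
# Assumes x is a whitespace-tokenized English sentence and y is a Chinese
# string treated character by character (an illustrative simplification).

def count_consecutive(y, a, b):
    """How many times characters a, b occur consecutively in y."""
    return sum(1 for i in range(len(y) - 1) if y[i] == a and y[i + 1] == b)

def phi(x, y):
    x_words = x.split()
    return [
        count_consecutive(y, "你", "好"),                      # φ1
        count_consecutive(y, "好", "你"),                      # φ2
        1 if "I" in x_words and "我" in y else 0,              # φ3
        1 if "I" in x_words and "你" in y else 0,              # φ4
        1 if "I am" in x and "我是" in y else 0,               # φ5
        1 if x_words[:1] == ["I"] and y[:1] == "我" else 0,    # φ6
    ]

print(phi("I am sorry", "我是抱歉"))  # [0, 0, 1, 0, 1, 1]
```

Real systems use millions of such sparse indicator features; the point is only that each φi is a cheap, hand-designed function of the (x, y) pair.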
Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• Example: Summarization: x is a long document, y is a short paragraph
  φ1(x,y): the length of y
  φ2(x,y): whether the sentence containing the word "important" is in y
  φ3(x,y): whether the sentence containing the word "definition" is in y
  φ4(x,y): how succinct y is
  φ5(x,y): how representative y is
Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• Example: Retrieval: x is the input keyword, y is the search result
  φ1(x,y): how much different information y covers (diversity)
  φ2(x,y): the degree of relevance with respect to x of the top-1 web page in y
  φ3(x,y): whether the top-1 web page is more relevant than the top-2 web page
Structured Linear Model: Problem 2
• Inference: how to solve the "arg max" problem
  ỹ = arg max_{y∈Y} F(x,y)
• With F(x,y) = w · φ(x,y), this becomes
  ỹ = arg max_{y∈Y} w · φ(x,y)
• Assume we have solved this problem.
Structured Linear Model: Problem 3
• Training: given training data, how to learn F(x,y)
• F(x,y) = w · φ(x,y), so what we have to learn is w
• Training data: {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …}
• We should find w such that, for all training examples r and all incorrect labels y ∈ Y - {ŷ^r}:
  w · φ(x^r, ŷ^r) > w · φ(x^r, y)
Structured Linear Model: Problem 3
• Example (points in feature space), with training examples x^1 = "How are you" and x^2 = "I am sorry":
  φ(x^1, ŷ^1) for the correct pair x^1 = "How are you", ŷ^1 = "你好嗎?"
  φ(x^1, y) for incorrect pairs, e.g. y = "我很好?", y = "人帥真好"
  φ(x^2, ŷ^2) for the correct pair x^2 = "I am sorry", ŷ^2 = "抱歉"
  φ(x^2, y) for incorrect pairs, e.g. y = "顆顆", y = "太強了"
Structured Linear Model: Problem 3
• We want a w whose inner product with each correct feature vector beats all the incorrect ones:
  w · φ(x^1, ŷ^1) > w · φ(x^1, y)
  w · φ(x^2, ŷ^2) > w · φ(x^2, y)
Solution of Problem 3
• Difficult? Not as difficult as expected.

Algorithm
• Input: training data set {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …}
• Output: weight vector w
• Algorithm: Initialize w = 0
• do
  • For each training example (x^r, ŷ^r)
    • Find the label ỹ^r maximizing w · φ(x^r, y):
      ỹ^r = arg max_{y∈Y} w · φ(x^r, y)   (this is Problem 2)
    • If ỹ^r ≠ ŷ^r, update w:
      w → w + φ(x^r, ŷ^r) - φ(x^r, ỹ^r)
• until w is no longer updated
• We are done!
• Will it terminate?
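The algorithm above (the structured perceptron of Collins, 2002, cited later) can be sketched directly, for the special case where Y is small enough to enumerate so the arg max is brute force. The feature map and tiny dataset below are made up purely for the demo:

```python
# Structured perceptron sketch.  Assumes a small enumerable label set Y so
# that the arg max (Problem 2) can be solved by brute force.

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def vec_add(w, v, sign=1):
    return [a + sign * b for a, b in zip(w, v)]

def train_structured_perceptron(data, phi, Y, dim, max_epochs=100):
    """data: list of (x, y_hat) pairs; phi(x, y) -> feature list of length dim."""
    w = [0.0] * dim
    for _ in range(max_epochs):
        updated = False
        for x, y_hat in data:
            # Problem 2: find the current best label under w.
            y_tilde = max(Y, key=lambda y: dot(w, phi(x, y)))
            if y_tilde != y_hat:
                # Move w toward the correct features, away from the mistake.
                w = vec_add(w, phi(x, y_hat), +1)
                w = vec_add(w, phi(x, y_tilde), -1)
                updated = True
        if not updated:   # a full pass with no mistakes: done
            break
    return w

# Toy problem (hypothetical): x is a word, y a tag in {"A", "B"}, and φ is
# one indicator feature per (word, tag) pair.
Y = ["A", "B"]
words = ["cat", "dog"]
def phi(x, y):
    return [1.0 if (x, y) == (xi, yi) else 0.0 for xi in words for yi in Y]

data = [("cat", "A"), ("dog", "B")]
w = train_structured_perceptron(data, phi, Y, dim=4)
pred = max(Y, key=lambda y: dot(w, phi("cat", y)))
print(pred)  # A
```

For real structured outputs the brute-force arg max is replaced by a task-specific search (e.g. Viterbi for sequences); the update rule is unchanged.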
Algorithm - Example
• Two training examples in feature space: the correct points φ(x^1, ŷ^1), φ(x^2, ŷ^2) and the incorrect points φ(x^1, y), φ(x^2, y). We want a w with
  w · φ(x^1, ŷ^1) > w · φ(x^1, y) and w · φ(x^2, ŷ^2) > w · φ(x^2, y).
Algorithm - Example
• Initialize w = 0.
• Pick (x^1, ŷ^1):
  ỹ^1 = arg max_{y∈Y} w · φ(x^1, y)
• Because w = 0 at this time, w · φ(x^1, y) is always 0, so every y is an arg max; randomly pick one point as ỹ^1.
• Since ỹ^1 ≠ ŷ^1, update w:
  w → w + φ(x^1, ŷ^1) - φ(x^1, ỹ^1)
Algorithm - Example
• Pick (x^2, ŷ^2):
  ỹ^2 = arg max_{y∈Y} w · φ(x^2, y)
• Since ỹ^2 ≠ ŷ^2, update w:
  w → w + φ(x^2, ŷ^2) - φ(x^2, ỹ^2)
Algorithm - Example
• Pick (x^1, ŷ^1) again:
  ỹ^1 = arg max_{y∈Y} w · φ(x^1, y)
  Now ỹ^1 = ŷ^1, so do not update w.
• Pick (x^2, ŷ^2) again:
  ỹ^2 = arg max_{y∈Y} w · φ(x^2, y)
  Now ỹ^2 = ŷ^2, so do not update w.
• w is no longer updated, so we are done: the final w satisfies
  w · φ(x^1, ŷ^1) > w · φ(x^1, y) and w · φ(x^2, ŷ^2) > w · φ(x^2, y).
Assumption: Separable
• There exists a weight vector ŵ with ||ŵ|| = 1 such that, for all training examples r and all incorrect labels y ∈ Y - {ŷ^r}:
  ŵ · φ(x^r, ŷ^r) ≥ ŵ · φ(x^r, y) + δ
• That is, the target exists: some ŵ separates the correct feature vectors from the incorrect ones with margin δ > 0.
Proof of Termination
• w is updated once it sees a mistake:
  w^k = w^{k-1} + φ(x^n, ŷ^n) - φ(x^n, ỹ^n)    (w^0 = 0, then w^0 → w^1 → w^2 → …)
• Claim: the angle ρ_k between ŵ and w^k gets smaller as k increases, i.e.
  cos ρ_k = (ŵ / ||ŵ||) · (w^k / ||w^k||)
  gets larger and larger.
• Take the inner product of the update with ŵ (using the relation of w^k and w^{k-1}):
  ŵ · w^k = ŵ · w^{k-1} + ŵ · φ(x^n, ŷ^n) - ŵ · φ(x^n, ỹ^n)
          ≥ ŵ · w^{k-1} + δ    (by separability)
• Therefore:
  ŵ · w^1 ≥ ŵ · w^0 + δ = δ    (ŵ · w^0 = 0)
  ŵ · w^2 ≥ ŵ · w^1 + δ ≥ 2δ
  ……
  ŵ · w^k ≥ kδ
• So the numerator of cos ρ_k grows at least linearly in k. So what? We still have to bound the denominator ||w^k||.
Proof of Termination
• Assume the distance between any two feature vectors is smaller than R. Then:
  ||w^k||² = ||w^{k-1} + φ(x^n, ŷ^n) - φ(x^n, ỹ^n)||²
           = ||w^{k-1}||² + 2 w^{k-1} · (φ(x^n, ŷ^n) - φ(x^n, ỹ^n)) + ||φ(x^n, ŷ^n) - φ(x^n, ỹ^n)||²
• The middle term is ≤ 0 because ỹ^n was a mistake (w^{k-1} scored ỹ^n at least as high as ŷ^n), and the last term is < R², so
  ||w^k||² ≤ ||w^{k-1}||² + R²
• Therefore:
  ||w^1||² ≤ ||w^0||² + R² = R²
  ||w^2||² ≤ ||w^1||² + R² ≤ 2R²
  ……
  ||w^k||² ≤ kR²
Proof of Termination
• Combining the two bounds (ŵ · w^k ≥ kδ and ||w^k|| ≤ √k R):
  cos ρ_k = (ŵ · w^k) / (||ŵ|| ||w^k||) ≥ kδ / (√k R) = √k (δ / R)
• But cos ρ_k ≤ 1, so
  √k (δ / R) ≤ 1, hence k ≤ (R / δ)²
• The number of updates k is bounded, so the algorithm terminates.
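The bound can be checked numerically on a toy separable problem (the 2-D feature vectors below are chosen by hand purely for the demo): run the perceptron updates, then compare the update count with (R/δ)²:

```python
import math

# Toy separable problem in 2-D feature space: for one training example, the
# correct feature vector and two incorrect ones (all made up for the demo).
correct = (3.0, 2.0)
wrongs = [(1.0, 0.0), (0.0, 1.0)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Perceptron updates: while some wrong y scores at least as high as the
# correct one, update w by (correct - wrong).
w = (0.0, 0.0)
updates = 0
while True:
    mistake = next((v for v in wrongs if dot(w, v) >= dot(w, correct)), None)
    if mistake is None:
        break
    w = (w[0] + correct[0] - mistake[0], w[1] + correct[1] - mistake[1])
    updates += 1

# Margin delta for the (hand-picked) unit separator w_hat, and radius R.
w_hat = (1 / math.sqrt(2), 1 / math.sqrt(2))
delta = min(dot(w_hat, correct) - dot(w_hat, v) for v in wrongs)
R = max(math.dist(correct, v) for v in wrongs)
print(updates, (R / delta) ** 2)
```

Here δ is measured with one particular unit vector ŵ, so (R/δ)² is an upper bound on the best possible bound; the update count still has to stay below it.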
Margin: δ measures how easy it is to separate the red points (the correct φ(x^r, ŷ^r)) from the blue ones (the incorrect φ(x^r, y)); R is the largest distance between feature vectors.
Normalization: multiplying all features by 2 doubles both δ and R, so the bound (R/δ)² is unchanged. Larger margin, fewer updates.
Structured Linear Model: Reduce 3 Problems to 2
• Problem 1: Evaluation: how to define F(x,y)
• Problem 2: Inference: how to find the y with the largest F(x,y)
• Problem 3: Training: how to learn F(x,y)
With F(x,y) = w · φ(x,y), these reduce to:
• Problem A: Feature: how to define φ(x,y)
• Problem B: Inference: how to find the y with the largest w · φ(x,y)
Reference
• Collins, Michael. "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms." Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2002.
  http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf
Complement 1
Ring a bell?
DSP
Complement 2

Gradient Descent - Review
• 1. There is a target function C(θ)
  • We want to find the parameter set θ that minimizes the target function
• 2. Gradient descent
  • At iteration i:
    • Compute the gradient: ∇C(θ^i)
    • Update parameters: θ^{i+1} = θ^i - η∇C(θ^i)
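The two-step loop above, made concrete on a simple quadratic target (the function and learning rate are arbitrary choices for illustration):

```python
# Gradient descent on C(θ) = (θ - 3)²: compute the gradient, then step
# against it, exactly as in the review above.

def grad_C(theta):          # dC/dθ = 2(θ - 3)
    return 2 * (theta - 3.0)

theta = 0.0
eta = 0.1                   # learning rate η
for _ in range(100):        # iterations
    theta = theta - eta * grad_C(theta)

print(round(theta, 4))      # ≈ 3.0, the minimizer of C
```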
Target Function
• What do we want to minimize? For each example n, w · φ(x^n, ŷ^n) should beat w · φ(x^n, y) for all y ≠ ŷ^n (perfect); the cost measures how badly this fails (bad):
  C^n(w) = max_{y∈Y} [w · φ(x^n, y)] - w · φ(x^n, ŷ^n)
  C(w) = Σ_{n=1}^{N} C^n(w)
• What is the minimum possible value of C(w)? Zero: C^n(w) ≥ 0 because the max over all y is at least w · φ(x^n, ŷ^n), with equality when the correct label scores highest.
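The cost above is easy to compute when Y is enumerable; a sketch with a made-up two-label problem and a stand-in feature map:

```python
# Computing the cost C(w) = Σ_n [max_y w·φ(x^n,y) - w·φ(x^n,ŷ^n)] for a toy
# problem.  C^n(w) is zero exactly when the correct label scores at least as
# high as every other label.

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def cost_n(w, phi, x, y_hat, Y):
    """C^n(w) = max_y w·φ(x,y) - w·φ(x,ŷ) >= 0."""
    return max(dot(w, phi(x, y)) for y in Y) - dot(w, phi(x, y_hat))

def cost(w, phi, data, Y):
    """C(w) = sum over training examples of C^n(w)."""
    return sum(cost_n(w, phi, x, y_hat, Y) for x, y_hat in data)

# Toy setup: two labels, one indicator feature per label (hypothetical).
Y = ["A", "B"]
def phi(x, y):
    return [1.0, 0.0] if y == "A" else [0.0, 1.0]

data = [("x1", "A")]
print(cost([2.0, 1.0], phi, data, Y))  # correct label wins: cost 0.0
print(cost([1.0, 2.0], phi, data, Y))  # wrong label wins by 1: cost 1.0
```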
Computing Gradient
• C^n(w) = max_{y∈Y} [w · φ(x^n, y)] - w · φ(x^n, ŷ^n) contains a max, so C^n is piecewise linear in w.
• In a region of w-space where the max is attained by one fixed y' (i.e. max_{y∈Y} w · φ(x^n, y) = w · φ(x^n, y')):
  C^n(w) = w · φ(x^n, y') - w · φ(x^n, ŷ^n)
  ∇C^n(w) = φ(x^n, y') - φ(x^n, ŷ^n)
• In general, with ỹ^n = arg max_{y∈Y} w · φ(x^n, y):
  ∇C^n(w) = φ(x^n, ỹ^n) - φ(x^n, ŷ^n)
• In the region where the correct label ŷ^n attains the max, ∇C^n(w) = 0; C^n is not differentiable on the boundaries where the maximizer changes.
Update Parameters
• Stochastic gradient descent: pick an example n, compute ỹ^n = arg max_{y∈Y} w · φ(x^n, y), then update
  w → w - η∇C^n(w) = w - η[φ(x^n, ỹ^n) - φ(x^n, ŷ^n)] = w + η[φ(x^n, ŷ^n) - φ(x^n, ỹ^n)]
• With η = 1, this is exactly the update used by the algorithm earlier.
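One such SGD step, sketched with the same toy two-label setup as before (labels and feature map are stand-ins); with η = 1 the step is the perceptron update:

```python
# One SGD step on C^n(w): subtract η[φ(x^n, ỹ^n) - φ(x^n, ŷ^n)], where ỹ^n
# is the current arg max.  With η = 1 this is the structured perceptron step.

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def sgd_step(w, phi, x, y_hat, Y, eta=1.0):
    y_tilde = max(Y, key=lambda y: dot(w, phi(x, y)))     # arg max (inference)
    grad = [a - b for a, b in zip(phi(x, y_tilde), phi(x, y_hat))]
    return [wi - eta * gi for wi, gi in zip(w, grad)]     # w - η∇C^n(w)

Y = ["A", "B"]
def phi(x, y):
    return [1.0, 0.0] if y == "A" else [0.0, 1.0]

w = sgd_step([0.0, 1.0], phi, "x1", "A", Y)   # current w prefers the wrong "B"
print(w)  # [1.0, 0.0]
```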
Appendix

Feeling abstract? Here are practical applications:
• Translation
  • http://www.cs.cmu.edu/~nasmith/papers/gimpel+smith.naacl12.pdf
• Summarization
  • Hung-yi Lee, Yu-yu Chou, Yow-Bang Wang, Lin-shan Lee, "Supervised Spoken Document Summarization Jointly Considering Utterance Importance and Redundancy by Structured Support Vector Machine", InterSpeech, 2012.
• Retrieval
  • Y. Yue, T. Finley, F. Radlinski, and T. Joachims, "A Support Vector Method for Optimizing Average Precision", SIGIR, 2007.
  • Chao-hong Meng, Hung-yi Lee, Lin-shan Lee, "Improved Lattice-based Spoken Document Retrieval by Directly Learning from the Evaluation Measures", ICASSP, 2009.
Discussion
• Is a linear model too simple?
  • Design very complex features φ(x,y)
  • Deep learning?
• The perceptron is also used in binary classification
  • What is the relation between binary classification and structured learning?
  • How about other binary classification methods, for example SVM?
  • Can you use the unified framework to address non-structured problems?

Discussion
• Why not use a neural network? Yes, you can.
  • Example: CTC with a recurrent neural network, where structured prediction is reduced to learning a compositionality operator and embedding things in a latent space
• Reverse question: why not do deep learning by structured learning? Yes, you can do it.
  • http://www.cs.toronto.edu/~cuty/HNNSO.pdf
What is Structured Learning?
• This article gives a definition of structured learning:
  • http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf
• Dependency between the outputs is a defining characteristic.
Examples
• https://pystruct.github.io/auto_examples/plot_potts_model.html
  • Potts model (similar to Ising); Ising: to use graph cut, the weights have limitations
• https://pystruct.github.io/auto_examples/image_segmentation.html#image-segmentation-py
  • Image segmentation by superpixels
• https://pystruct.github.io/auto_examples/plot_directional_grid.html
  • Denoising (uses QPBO for inference)
• https://pystruct.github.io/auto_examples/plot_grid_crf.html#plot-grid-crf-py
  • Grid CRF

More Examples
• Summarization
• Multi-class with restricted output
  • Semantic Role Labeling via Integer Linear Programming Inference
  • http://www.yaroslavvb.com/papers/punyakanok-learning.pdf
• 3D points
  • http://pr.cs.cornell.edu/sceneunderstanding/data/data.php
• CFG
• Image segmentation
• Alignment: also used for proteins (DNA)
  • Sequence
  • http://www.cs.cornell.edu/~cnyu/papers/recomb07_revised.pdf
  • http://www.cs.cornell.edu/people/tj/publications/joachims_03b.pdf
Semantic Role Labeling
• Based on the PropBank dataset [Palmer et al. 05]
• Input: text with pre-processing
• Five possible decisions for each candidate; create a binary variable for each decision, only one of which is true for each candidate.
Reference
• A Tutorial on Energy-Based Learning
• http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
• A website
• http://www.cs.nyu.edu/~yann/research/ebm/
• A reference
• http://www.slideshare.net/zukun/machine-learning-of-structured-outputs
Reference
• http://www.phontron.com/slides/nlp-programming-en-11-struct.pdf
• Brief and simple
• http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf
Reference
• http://www.cs.cornell.edu/People/tj/publications/joachims_06b.pdf
• The earlier version uses parsing as an example
• Another example is co-reference
More Reference
• http://peekaboo-vision.blogspot.tw/2012/06/structured-svm-and-structured.html