Structured Learning: Introduction
Hung-yi Lee
Source: speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015/structured...
TRANSCRIPT
Structured Learning
• We need a more powerful function f: X → Y
• Input and output are both objects with structure
• Objects: sequences, lists, trees, …
• X is the space of one kind of object; Y is the space of another kind of object
Example Applications
• Translation
  • X: Mandarin sentence (sequence) → Y: English sentence (sequence)
• Summarization
  • X: long document → Y: summary (short paragraph)
• Retrieval
  • X: keyword → Y: search result (a list of web pages)
• Speech recognition
  • X: speech signal (sequence) → Y: text (sequence)
• Syntactic parsing
  • X: sentence → Y: parse tree (tree structure)
• … just to name a few
Unified Framework
• Find a function F: X × Y → R
  • F(x,y) evaluates how compatible the objects x and y are
• Step 1: Training
  • Find F
• Step 2: Inference (Testing)
  • Given an object x, find
    ỹ = arg max_{y∈Y} F(x,y)
• Original target: f: X → Y
  Now: f(x) = arg max_{y∈Y} F(x,y)
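The two steps above can be sketched in a few lines, assuming a toy compatibility function F backed by a lookup table and a tiny enumerable candidate set Y (both are illustrative stand-ins; real tasks have enormous Y spaces):

```python
# Minimal sketch of the unified framework: inference is an arg max of a
# compatibility score F(x, y) over candidate outputs y.  The table-based F
# and the tiny candidate set are toy assumptions for illustration only.

def make_f(score_table):
    """Build F(x, y) from a lookup table; unseen pairs get a very low score."""
    def F(x, y):
        return score_table.get((x, y), float("-inf"))
    return F

def infer(F, x, candidates):
    """Step 2 (inference): return the y in Y most compatible with x."""
    return max(candidates, key=lambda y: F(x, y))

# Scores taken from the translation example in the slides.
scores = {
    ("How are you", "你好嗎"): 10,
    ("How are you", "嗨"): 4,
    ("How are you", "哈哈哈"): 0,
    ("How are you", "天氣好嗎"): -5,
}
F = make_f(scores)
y_best = infer(F, "How are you", ["你好嗎", "嗨", "哈哈哈", "天氣好嗎"])
print(y_best)  # 你好嗎
```

In practice Step 1 (learning F) is the hard part; here F is hard-coded just to make the inference step concrete.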
Unified Framework - Translation
• Find a function F: X × Y → R
  • F(x,y) evaluates how compatible x and y are
• x: English sentence; y: Chinese sentence
• F(x,y): how suitable it is if x is translated into y

  x = "How are you", y = "你好嗎"     F(x,y) = 10
  x = "How are you", y = "嗨"         F(x,y) = 4
  x = "How are you", y = "哈哈哈"     F(x,y) = 0
  x = "How are you", y = "天氣好嗎"   F(x,y) = -5
Unified Framework - Translation
• Step 2: Inference: given input x = "I am sorry", enumerate all possible Chinese sentences y
  e.g. 對不起、天氣真好、哈哈哈、人帥真好 ……

  x = "I am sorry", y = "對不起"     F(x,y) = 10
  x = "I am sorry", y = "天氣很好"   F(x,y) = 1
  x = "I am sorry", y = "人帥真好"   F(x,y) = 0
  x = "I am sorry", y = "哈哈哈"     F(x,y) = -1

• The y with the largest F(x,y) is the translation result.
Unified Framework - Summarization
• Task description
  • Given a long document
  • Select a set of sentences from the document, and concatenate the sentences to form a short paragraph
• X: long document = {s1, s2, s3, … si, …} (si: the i-th sentence)
• Y: summary, e.g. {s1, s3, s5}
Unified Framework - Summarization
• Step 1: Training: learn F(x,y) from document-summary pairs, e.g. for documents d1 and d2 with their correct summaries, F(x,y) should be large.
• Step 2: Inference: for a new document d', evaluate F(x,y) over candidate sentence sets, e.g. {s2, s4, s6}, {s3, s6, s9}, {s1, s3, s5}, and output the candidate with the largest F.
Unified Framework - Retrieval
• Task description
  • The user inputs a keyword Q
  • The system returns a list of web pages
• X: keyword, e.g. "Obama"
• Y: a list of web pages (the search result), e.g. d10011, d98776, …
Unified Framework - Retrieval
• Step 1: Training: learn F(x,y) from keyword-ranking pairs, e.g. the correct rankings for x = "Obama", x = "Haruhi", x = "Bush" should get large F(x,y).
• Step 2: Inference: for a keyword such as x = "Haruhi", evaluate F(x,y) over candidate rankings (different orderings of the pages, e.g. [d103, d304, …] vs. [d103, d305, …]) and return the ranking with the largest F.
Unified Framework vs. Statistics
• F(x,y) = P(x,y)?
• Find a function F: X × Y → R ↔ Estimate the probability P: X × Y → [0,1]
• Step 1: Training ↔ Step 1: Estimation
  • Estimate the probability P(x,y)
• Step 2: Inference
  • Given an object x:
    ỹ = arg max_{y∈Y} P(y|x) = arg max_{y∈Y} P(x,y)/P(x) = arg max_{y∈Y} P(x,y)
Unified Framework vs. Statistics
• F(x,y) = P(x,y)?
• Strength of probability: meaningful
• Drawbacks of probability:
  • Probability cannot explain everything
  • The 0-1 constraint is not necessary
Unified Framework
• Solve any task in two steps
• Easier than putting an elephant into a refrigerator
Really? No, we have to answer three problems.
Problem 1
• Evaluation: what does F(x,y) look like?
• How does F(x,y) compute the "compatibility" of objects x and y?
• Translation: F(x = "How are you", y = "你好嗎")
• Summarization: F(x = a long document, y = a short paragraph)
• Retrieval: F(x = "Obama", y = a search result)
Problem 2
• Inference: how to solve the "arg max" problem
  ỹ = arg max_{y∈Y} F(x,y)
• The space Y can be extremely large!
  • Translation: Y = all possible Mandarin sentences …
  • Summarization: Y = all combinations of sentence sets in a document …
  • Retrieval: Y = all possible web page rankings …
Problem 3
• Training: given training data, how to find F(x,y)
• Training data: {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …}
• Principle: we should find F(x,y) such that
  F(x^1, ŷ^1) > F(x^1, y) for all y ≠ ŷ^1
  F(x^2, ŷ^2) > F(x^2, y) for all y ≠ ŷ^2
  ……
  F(x^r, ŷ^r) > F(x^r, y) for all y ≠ ŷ^r
3 Problems
• Problem 1: Evaluation: what does F(x,y) look like?
• Problem 2: Inference: how to solve the "arg max" problem
  ỹ = arg max_{y∈Y} F(x,y)
• Problem 3: Training: given training data, how to find F(x,y)
Structured Linear Model: Reduce 3 Problems to 2

Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• F(x,y) is a weighted sum of characteristics (features) of the pair (x,y):
  F(x,y) = w1·φ1(x,y) + w2·φ2(x,y) + w3·φ3(x,y) + …
• In vector form:
  F(x,y) = w · φ(x,y)
  where w = [w1, w2, w3, …] is learned from data and φ(x,y) = [φ1(x,y), φ2(x,y), φ3(x,y), …] is the feature vector.
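The weighted sum above is just a dot product; a minimal sketch, with a made-up 3-dimensional feature vector standing in for a real φ:

```python
# Sketch of the structured linear model: F(x, y) is a dot product between a
# weight vector w (learned from data in practice) and a hand-crafted feature
# vector φ(x, y).  The numbers here are arbitrary toy values.

def linear_F(w, phi):
    """F(x, y) = w · φ(x, y), written out as an explicit weighted sum."""
    return sum(w_i * phi_i for w_i, phi_i in zip(w, phi))

w = [0.5, -1.0, 2.0]        # w1, w2, w3
phi_xy = [3.0, 1.0, 0.0]    # φ1(x,y), φ2(x,y), φ3(x,y)
print(linear_F(w, phi_xy))  # 0.5*3 - 1.0*1 + 2.0*0 = 0.5
```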
Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• Example: Translation: x is an English sentence, y is a Chinese sentence
  φ1(x,y): how many times "你" "好" occur consecutively in y
  φ2(x,y): how many times "好" "你" occur consecutively in y
  φ3(x,y): 1 if "I" is in x and "我" is in y, 0 otherwise
  φ4(x,y): 1 if "I" is in x and "你" is in y, 0 otherwise
  φ5(x,y): 1 if "I am" is in x and "我是" is in y, 0 otherwise
  φ6(x,y): 1 if "I" is the 1st word of x and "我" is the 1st word of y, 0 otherwise
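A rough implementation of the example features above; the character-level handling of the Chinese side and the exact feature definitions are our reading of the slide, not a reference translation system:

```python
# Toy feature extractor for the translation features φ1..φ6 sketched above.
# Assumes x is a whitespace-tokenized English sentence and y is a Chinese
# string treated character by character (an illustrative simplification).

def count_consecutive(y, a, b):
    """How many times characters a, b occur consecutively in y."""
    return sum(1 for i in range(len(y) - 1) if y[i] == a and y[i + 1] == b)

def phi(x, y):
    x_words = x.split()
    return [
        count_consecutive(y, "你", "好"),                      # φ1
        count_consecutive(y, "好", "你"),                      # φ2
        1 if "I" in x_words and "我" in y else 0,              # φ3
        1 if "I" in x_words and "你" in y else 0,              # φ4
        1 if "I am" in x and "我是" in y else 0,               # φ5
        1 if x_words[:1] == ["I"] and y[:1] == "我" else 0,    # φ6
    ]

print(phi("I am sorry", "我是抱歉"))  # [0, 0, 1, 0, 1, 1]
```

Real systems use millions of such sparse indicator features; the point is only that each φi is a cheap, hand-designed function of the (x, y) pair.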
Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• Example: Summarization: x is a long document, y is a short paragraph
  φ1(x,y): the length of y
  φ2(x,y): whether the sentence containing the word "important" is in y
  φ3(x,y): whether the sentence containing the word "definition" is in y
  φ4(x,y): how succinct y is
  φ5(x,y): how representative y is
Structured Linear Model: Problem 1
• Evaluation: what does F(x,y) look like?
• Example: Retrieval: x is the input keyword, y is the search result
  φ1(x,y): how much different information y covers (diversity)
  φ2(x,y): the degree of relevance with respect to x of the top-1 web page in y
  φ3(x,y): whether the top-1 web page is more relevant than the top-2 web page
Structured Linear Model: Problem 2
• Inference: how to solve the "arg max" problem
  ỹ = arg max_{y∈Y} F(x,y)
• With F(x,y) = w · φ(x,y), this becomes
  ỹ = arg max_{y∈Y} w · φ(x,y)
• Assume we have solved this problem.
Structured Linear Model: Problem 3
• Training: given training data, how to learn F(x,y)
• F(x,y) = w · φ(x,y), so what we have to learn is w
• Training data: {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …}
• We should find w such that, for all training examples r and all incorrect labels y ∈ Y - {ŷ^r}:
  w · φ(x^r, ŷ^r) > w · φ(x^r, y)
Structured Linear Model: Problem 3
• Example (points in feature space), with training examples x^1 = "How are you" and x^2 = "I am sorry":
  φ(x^1, ŷ^1) for the correct pair x^1 = "How are you", ŷ^1 = "你好嗎?"
  φ(x^1, y) for incorrect pairs, e.g. y = "我很好?", y = "人帥真好"
  φ(x^2, ŷ^2) for the correct pair x^2 = "I am sorry", ŷ^2 = "抱歉"
  φ(x^2, y) for incorrect pairs, e.g. y = "顆顆", y = "太強了"
Structured Linear Model: Problem 3
• We want a w whose inner product with each correct feature vector beats all the incorrect ones:
  w · φ(x^1, ŷ^1) > w · φ(x^1, y)
  w · φ(x^2, ŷ^2) > w · φ(x^2, y)
Solution of Problem 3
• Difficult? Not as difficult as expected.

Algorithm
• Input: training data set {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …}
• Output: weight vector w
• Algorithm: Initialize w = 0
• do
  • For each training example (x^r, ŷ^r)
    • Find the label ỹ^r maximizing w · φ(x^r, y):
      ỹ^r = arg max_{y∈Y} w · φ(x^r, y)   (this is Problem 2)
    • If ỹ^r ≠ ŷ^r, update w:
      w → w + φ(x^r, ŷ^r) - φ(x^r, ỹ^r)
• until w is no longer updated
• We are done!
• Will it terminate?
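The algorithm above (the structured perceptron of Collins, 2002, cited later) can be sketched directly, for the special case where Y is small enough to enumerate so the arg max is brute force. The feature map and tiny dataset below are made up purely for the demo:

```python
# Structured perceptron sketch.  Assumes a small enumerable label set Y so
# that the arg max (Problem 2) can be solved by brute force.

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def vec_add(w, v, sign=1):
    return [a + sign * b for a, b in zip(w, v)]

def train_structured_perceptron(data, phi, Y, dim, max_epochs=100):
    """data: list of (x, y_hat) pairs; phi(x, y) -> feature list of length dim."""
    w = [0.0] * dim
    for _ in range(max_epochs):
        updated = False
        for x, y_hat in data:
            # Problem 2: find the current best label under w.
            y_tilde = max(Y, key=lambda y: dot(w, phi(x, y)))
            if y_tilde != y_hat:
                # Move w toward the correct features, away from the mistake.
                w = vec_add(w, phi(x, y_hat), +1)
                w = vec_add(w, phi(x, y_tilde), -1)
                updated = True
        if not updated:   # a full pass with no mistakes: done
            break
    return w

# Toy problem (hypothetical): x is a word, y a tag in {"A", "B"}, and φ is
# one indicator feature per (word, tag) pair.
Y = ["A", "B"]
words = ["cat", "dog"]
def phi(x, y):
    return [1.0 if (x, y) == (xi, yi) else 0.0 for xi in words for yi in Y]

data = [("cat", "A"), ("dog", "B")]
w = train_structured_perceptron(data, phi, Y, dim=4)
pred = max(Y, key=lambda y: dot(w, phi("cat", y)))
print(pred)  # A
```

For real structured outputs the brute-force arg max is replaced by a task-specific search (e.g. Viterbi for sequences); the update rule is unchanged.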
Algorithm - Example
• Two training examples in feature space: the correct points φ(x^1, ŷ^1), φ(x^2, ŷ^2) and the incorrect points φ(x^1, y), φ(x^2, y). We want a w with
  w · φ(x^1, ŷ^1) > w · φ(x^1, y) and w · φ(x^2, ŷ^2) > w · φ(x^2, y).
Algorithm - Example
• Initialize w = 0.
• Pick (x^1, ŷ^1):
  ỹ^1 = arg max_{y∈Y} w · φ(x^1, y)
• Because w = 0 at this time, w · φ(x^1, y) is always 0, so every y is an arg max; randomly pick one point as ỹ^1.
• Since ỹ^1 ≠ ŷ^1, update w:
  w → w + φ(x^1, ŷ^1) - φ(x^1, ỹ^1)
Algorithm - Example
• Pick (x^2, ŷ^2):
  ỹ^2 = arg max_{y∈Y} w · φ(x^2, y)
• Since ỹ^2 ≠ ŷ^2, update w:
  w → w + φ(x^2, ŷ^2) - φ(x^2, ỹ^2)
Algorithm - Example
• Pick (x^1, ŷ^1) again:
  ỹ^1 = arg max_{y∈Y} w · φ(x^1, y)
  Now ỹ^1 = ŷ^1, so do not update w.
• Pick (x^2, ŷ^2) again:
  ỹ^2 = arg max_{y∈Y} w · φ(x^2, y)
  Now ỹ^2 = ŷ^2, so do not update w.
• w is no longer updated, so we are done: the final w satisfies
  w · φ(x^1, ŷ^1) > w · φ(x^1, y) and w · φ(x^2, ŷ^2) > w · φ(x^2, y).
Assumption: Separable
• There exists a weight vector ŵ with ||ŵ|| = 1 such that, for all training examples r and all incorrect labels y ∈ Y - {ŷ^r}:
  ŵ · φ(x^r, ŷ^r) ≥ ŵ · φ(x^r, y) + δ
• That is, the target exists: some ŵ separates the correct feature vectors from the incorrect ones with margin δ > 0.
Proof of Termination
• w is updated once it sees a mistake:
  w^k = w^{k-1} + φ(x^n, ŷ^n) - φ(x^n, ỹ^n)    (w^0 = 0, then w^0 → w^1 → w^2 → …)
• Claim: the angle ρ_k between ŵ and w^k gets smaller as k increases, i.e.
  cos ρ_k = (ŵ / ||ŵ||) · (w^k / ||w^k||)
  gets larger and larger.
• Take the inner product of the update with ŵ (using the relation of w^k and w^{k-1}):
  ŵ · w^k = ŵ · w^{k-1} + ŵ · φ(x^n, ŷ^n) - ŵ · φ(x^n, ỹ^n)
          ≥ ŵ · w^{k-1} + δ    (by separability)
• Therefore:
  ŵ · w^1 ≥ ŵ · w^0 + δ = δ    (ŵ · w^0 = 0)
  ŵ · w^2 ≥ ŵ · w^1 + δ ≥ 2δ
  ……
  ŵ · w^k ≥ kδ
• So the numerator of cos ρ_k grows at least linearly in k. So what? We still have to bound the denominator ||w^k||.
Proof of Termination
• Assume the distance between any two feature vectors is smaller than R. Then:
  ||w^k||² = ||w^{k-1} + φ(x^n, ŷ^n) - φ(x^n, ỹ^n)||²
           = ||w^{k-1}||² + 2 w^{k-1} · (φ(x^n, ŷ^n) - φ(x^n, ỹ^n)) + ||φ(x^n, ŷ^n) - φ(x^n, ỹ^n)||²
• The middle term is ≤ 0 because ỹ^n was a mistake (w^{k-1} scored ỹ^n at least as high as ŷ^n), and the last term is < R², so
  ||w^k||² ≤ ||w^{k-1}||² + R²
• Therefore:
  ||w^1||² ≤ ||w^0||² + R² = R²
  ||w^2||² ≤ ||w^1||² + R² ≤ 2R²
  ……
  ||w^k||² ≤ kR²
Proof of Termination
• Combining the two bounds (ŵ · w^k ≥ kδ and ||w^k|| ≤ √k R):
  cos ρ_k = (ŵ · w^k) / (||ŵ|| ||w^k||) ≥ kδ / (√k R) = √k (δ / R)
• But cos ρ_k ≤ 1, so
  √k (δ / R) ≤ 1, hence k ≤ (R / δ)²
• The number of updates k is bounded, so the algorithm terminates.
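The bound can be checked numerically on a toy separable problem (the 2-D feature vectors below are chosen by hand purely for the demo): run the perceptron updates, then compare the update count with (R/δ)²:

```python
import math

# Toy separable problem in 2-D feature space: for one training example, the
# correct feature vector and two incorrect ones (all made up for the demo).
correct = (3.0, 2.0)
wrongs = [(1.0, 0.0), (0.0, 1.0)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Perceptron updates: while some wrong y scores at least as high as the
# correct one, update w by (correct - wrong).
w = (0.0, 0.0)
updates = 0
while True:
    mistake = next((v for v in wrongs if dot(w, v) >= dot(w, correct)), None)
    if mistake is None:
        break
    w = (w[0] + correct[0] - mistake[0], w[1] + correct[1] - mistake[1])
    updates += 1

# Margin delta for the (hand-picked) unit separator w_hat, and radius R.
w_hat = (1 / math.sqrt(2), 1 / math.sqrt(2))
delta = min(dot(w_hat, correct) - dot(w_hat, v) for v in wrongs)
R = max(math.dist(correct, v) for v in wrongs)
print(updates, (R / delta) ** 2)
```

Here δ is measured with one particular unit vector ŵ, so (R/δ)² is an upper bound on the best possible bound; the update count still has to stay below it.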
Margin: δ measures how easy it is to separate the red points (the correct φ(x^r, ŷ^r)) from the blue ones (the incorrect φ(x^r, y)); R is the largest distance between feature vectors.
Normalization: multiplying all features by 2 doubles both δ and R, so the bound (R/δ)² is unchanged. Larger margin, fewer updates.
Structured Linear Model: Reduce 3 Problems to 2
• Problem 1: Evaluation: how to define F(x,y)
• Problem 2: Inference: how to find the y with the largest F(x,y)
• Problem 3: Training: how to learn F(x,y)
With F(x,y) = w · φ(x,y), these reduce to:
• Problem A: Feature: how to define φ(x,y)
• Problem B: Inference: how to find the y with the largest w · φ(x,y)
Reference
• Collins, Michael. "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms." Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2002.
  http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf
Complement 1
Ring a bell?
DSP
Complement 2

Gradient Descent - Review
• 1. There is a target function C(θ)
  • We want to find the parameter set θ that minimizes the target function
• 2. Gradient descent
  • At iteration i:
    • Compute the gradient: ∇C(θ^i)
    • Update parameters: θ^{i+1} = θ^i - η∇C(θ^i)
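The two-step loop above, made concrete on a simple quadratic target (the function and learning rate are arbitrary choices for illustration):

```python
# Gradient descent on C(θ) = (θ - 3)²: compute the gradient, then step
# against it, exactly as in the review above.

def grad_C(theta):          # dC/dθ = 2(θ - 3)
    return 2 * (theta - 3.0)

theta = 0.0
eta = 0.1                   # learning rate η
for _ in range(100):        # iterations
    theta = theta - eta * grad_C(theta)

print(round(theta, 4))      # ≈ 3.0, the minimizer of C
```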
Target Function
• What do we want to minimize? For each example n, w · φ(x^n, ŷ^n) should beat w · φ(x^n, y) for all y ≠ ŷ^n (perfect); the cost measures how badly this fails (bad):
  C^n(w) = max_{y∈Y} [w · φ(x^n, y)] - w · φ(x^n, ŷ^n)
  C(w) = Σ_{n=1}^{N} C^n(w)
• What is the minimum possible value of C(w)? Zero: C^n(w) ≥ 0 because the max over all y is at least w · φ(x^n, ŷ^n), with equality when the correct label scores highest.
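The cost above is easy to compute when Y is enumerable; a sketch with a made-up two-label problem and a stand-in feature map:

```python
# Computing the cost C(w) = Σ_n [max_y w·φ(x^n,y) - w·φ(x^n,ŷ^n)] for a toy
# problem.  C^n(w) is zero exactly when the correct label scores at least as
# high as every other label.

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def cost_n(w, phi, x, y_hat, Y):
    """C^n(w) = max_y w·φ(x,y) - w·φ(x,ŷ) >= 0."""
    return max(dot(w, phi(x, y)) for y in Y) - dot(w, phi(x, y_hat))

def cost(w, phi, data, Y):
    """C(w) = sum over training examples of C^n(w)."""
    return sum(cost_n(w, phi, x, y_hat, Y) for x, y_hat in data)

# Toy setup: two labels, one indicator feature per label (hypothetical).
Y = ["A", "B"]
def phi(x, y):
    return [1.0, 0.0] if y == "A" else [0.0, 1.0]

data = [("x1", "A")]
print(cost([2.0, 1.0], phi, data, Y))  # correct label wins: cost 0.0
print(cost([1.0, 2.0], phi, data, Y))  # wrong label wins by 1: cost 1.0
```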
Computing Gradient
• C^n(w) = max_{y∈Y} [w · φ(x^n, y)] - w · φ(x^n, ŷ^n) contains a max, so C^n is piecewise linear in w.
• In a region of w-space where the max is attained by one fixed y' (i.e. max_{y∈Y} w · φ(x^n, y) = w · φ(x^n, y')):
  C^n(w) = w · φ(x^n, y') - w · φ(x^n, ŷ^n)
  ∇C^n(w) = φ(x^n, y') - φ(x^n, ŷ^n)
• In general, with ỹ^n = arg max_{y∈Y} w · φ(x^n, y):
  ∇C^n(w) = φ(x^n, ỹ^n) - φ(x^n, ŷ^n)
• In the region where the correct label ŷ^n attains the max, ∇C^n(w) = 0; C^n is not differentiable on the boundaries where the maximizer changes.
Update Parameters
• Stochastic gradient descent: pick an example n, compute ỹ^n = arg max_{y∈Y} w · φ(x^n, y), then update
  w → w - η∇C^n(w) = w - η[φ(x^n, ỹ^n) - φ(x^n, ŷ^n)] = w + η[φ(x^n, ŷ^n) - φ(x^n, ỹ^n)]
• With η = 1, this is exactly the update used by the algorithm earlier.
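One such SGD step, sketched with the same toy two-label setup as before (labels and feature map are stand-ins); with η = 1 the step is the perceptron update:

```python
# One SGD step on C^n(w): subtract η[φ(x^n, ỹ^n) - φ(x^n, ŷ^n)], where ỹ^n
# is the current arg max.  With η = 1 this is the structured perceptron step.

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def sgd_step(w, phi, x, y_hat, Y, eta=1.0):
    y_tilde = max(Y, key=lambda y: dot(w, phi(x, y)))     # arg max (inference)
    grad = [a - b for a, b in zip(phi(x, y_tilde), phi(x, y_hat))]
    return [wi - eta * gi for wi, gi in zip(w, grad)]     # w - η∇C^n(w)

Y = ["A", "B"]
def phi(x, y):
    return [1.0, 0.0] if y == "A" else [0.0, 1.0]

w = sgd_step([0.0, 1.0], phi, "x1", "A", Y)   # current w prefers the wrong "B"
print(w)  # [1.0, 0.0]
```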
Appendix

Feeling abstract? Here are practical applications:
• Translation
  • http://www.cs.cmu.edu/~nasmith/papers/gimpel+smith.naacl12.pdf
• Summarization
  • Hung-yi Lee, Yu-yu Chou, Yow-Bang Wang, Lin-shan Lee, "Supervised Spoken Document Summarization Jointly Considering Utterance Importance and Redundancy by Structured Support Vector Machine", InterSpeech, 2012.
• Retrieval
  • Y. Yue, T. Finley, F. Radlinski, and T. Joachims, "A Support Vector Method for Optimizing Average Precision", SIGIR, 2007.
  • Chao-hong Meng, Hung-yi Lee, Lin-shan Lee, "Improved Lattice-based Spoken Document Retrieval by Directly Learning from the Evaluation Measures", ICASSP, 2009.
Discussion
• Is a linear model too simple?
  • Design very complex features φ(x,y)
  • Deep learning?
• The perceptron is also used in binary classification
  • What is the relation between binary classification and structured learning?
  • How about other binary classification methods, for example SVM?
  • Can you use the unified framework to address non-structured problems?

Discussion
• Why not use a neural network? Yes, you can.
  • Example: CTC with a recurrent neural network, where structured prediction is reduced to learning a compositionality operator and embedding things in a latent space
• Reverse question: why not do deep learning by structured learning? Yes, you can do it.
  • http://www.cs.toronto.edu/~cuty/HNNSO.pdf
What is Structured Learning?
• This article gives a definition of structured learning:
  • http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf
• Dependency between the outputs is a defining characteristic.
Examples
• https://pystruct.github.io/auto_examples/plot_potts_model.html
  • Potts model (similar to Ising); Ising: to use graph cut, the weights have limitations
• https://pystruct.github.io/auto_examples/image_segmentation.html#image-segmentation-py
  • Image segmentation by superpixels
• https://pystruct.github.io/auto_examples/plot_directional_grid.html
  • Denoising (uses QPBO for inference)
• https://pystruct.github.io/auto_examples/plot_grid_crf.html#plot-grid-crf-py
  • Grid CRF

More Examples
• Summarization
• Multi-class with restricted output
  • Semantic Role Labeling via Integer Linear Programming Inference
  • http://www.yaroslavvb.com/papers/punyakanok-learning.pdf
• 3D points
  • http://pr.cs.cornell.edu/sceneunderstanding/data/data.php
• CFG
• Image segmentation
• Alignment: also used for proteins (DNA)
  • Sequence
  • http://www.cs.cornell.edu/~cnyu/papers/recomb07_revised.pdf
  • http://www.cs.cornell.edu/people/tj/publications/joachims_03b.pdf
Semantic Role Labeling
• Based on the PropBank dataset [Palmer et al. 05]
• Input: text with pre-processing
• Five possible decisions for each candidate; create a binary variable for each decision, only one of which is true for each candidate.
Reference
• A Tutorial on Energy-Based Learning
• http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
• A website
• http://www.cs.nyu.edu/~yann/research/ebm/
• A reference
• http://www.slideshare.net/zukun/machine-learning-of-structured-outputs
Reference
• http://www.phontron.com/slides/nlp-programming-en-11-struct.pdf
• Brief and simple
• http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf
Reference
• http://www.cs.cornell.edu/People/tj/publications/joachims_06b.pdf
• The earlier version uses parsing as an example
• Another example is co-reference
More Reference
• http://peekaboo-vision.blogspot.tw/2012/06/structured-svm-and-structured.html