
From Logistic Regression to Linear-Chain CRF

Yow-Bang (Darren) Wang
12/20/2012

Outline
● Introduction
● Logistic Regression
● Log-Linear Model
● Linear-Chain CRF
○ Example: Part of Speech (POS) Tagging
● CRF Training and Testing
○ Example: Part of Speech (POS) Tagging
● Example: Speech Disfluency Detection

Introduction

Introduction
We can approach the theory of CRF from:
1. Maximum Entropy
2. Probabilistic Graphical Model
3. Logistic Regression ← today's talk

Linear Regression
● Input x: real-valued features (RV)
● Output y: Gaussian distribution (RV)
● Model parameter θ = (a0, a1, …, aN), with y = a0 + Σn an·xn + Gaussian noise
● ML (conditional likelihood) estimation of θ:
θ* = argmaxθ p(Y | X; θ), where {X, Y} are the training data.
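As a concrete illustration (not from the slides): with Gaussian noise, maximizing the conditional likelihood reduces to ordinary least squares. A minimal NumPy sketch, with invented data and variable names:

```python
# ML estimate of theta for linear regression with Gaussian noise = least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, N=3 features
true_theta = np.array([2.0, -1.0, 0.5, 3.0])   # a1..aN followed by the bias a0
y = X @ true_theta[:3] + true_theta[3] + rng.normal(scale=0.1, size=100)

X1 = np.hstack([X, np.ones((100, 1))])         # append the constant "1" input node
theta_ml, *_ = np.linalg.lstsq(X1, y, rcond=None)  # argmax of conditional likelihood
print(theta_ml)                                # ~ [2, -1, 0.5, 3]
```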

Linear Regression
● Input x: real-valued features (RV)
● Output y: Gaussian distribution (RV)
● Represented with a graphical model:
[Figure: input nodes 1, x1, …, xN connect to the output node y with weights a0, a1, …, aN]

Logistic Regression

Logistic Regression
● Input x: real-valued features (RV)
● Output y: Bernoulli distribution (RV)
● Model parameter θ = (a0, a1, …, aN):
log [ p(y=1|x) / (1 − p(y=1|x)) ] = a0 + Σn an·xn

Q: Why this form?
A: Both sides have range of values (−∞, ∞)

No analytical solution for ML → gradient descent
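A minimal sketch of that gradient-based fit, on synthetic data with illustrative names; the update follows the standard log-likelihood gradient Xᵀ(y − p):

```python
# Fit theta by gradient ascent on the conditional log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])  # features + bias input
y = (X @ np.array([1.5, -2.0, 0.3]) > 0).astype(float)         # synthetic labels

theta = np.zeros(3)
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ theta)                    # p(y=1 | x; theta)
    theta += lr * X.T @ (y - p) / len(y)      # gradient of the log-likelihood
```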

Logistic Regression
● Input x: real-valued features (RV)
● Output y: Bernoulli distribution (RV)
● Represented with a graphical model:
[Figure: input nodes 1, x1, …, xN feed weights a0, a1, …, aN into a sigmoid node that outputs p]

Logistic Regression
Advantages of Logistic Regression:
1. Correlated features x don't lead to problems (in contrast to Naive Bayes)
2. Well-calibrated probabilities (in contrast to SVM)
3. Not sensitive to unbalanced training data (the number of "Y=1" examples mainly affects the bias weight a0)

Multinomial Logistic Regression
● Input x: real-valued features (RV), N-dimensional
● Output y: multinomial distribution (RV), M classes
● Represented with a graphical model:
[Figure: input nodes 1, x1, …, xN feed a softmax layer producing outputs p1, …, pM (a neural network with 2 layers!)]
● pm: probability of the m-th class
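A small sketch of the "2-layer network" view, with an assumed weight matrix W holding one row of weights per class (bias included):

```python
# Multinomial logistic regression: a linear layer followed by a softmax.
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

N, M = 4, 3                              # N input features, M classes
W = np.zeros((M, N + 1))                 # weights per class, incl. bias column
x = np.array([0.2, -1.0, 0.5, 3.0])
p = softmax(W @ np.append(x, 1.0))       # p[m] = probability of the m-th class
print(p)                                 # uniform here, since W is all zeros
```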

Log-Linear Model

Log-Linear Model
An interpretation: the Log-Linear Model is a Structured Logistic Regression
● Structured: allows non-numerical input and output by defining proper feature functions
● Special case: logistic regression

General form:
p(y | x; w) = exp(Σj wj·Fj(x,y)) / Σy' exp(Σj wj·Fj(x,y'))
● Fj(x,y): j-th feature function

Log-Linear Model
Note:
1. "Feature" vs. "feature function"
○ Feature: corresponds only to the input
○ Feature function: corresponds to both input and output
2. Must sum over all possible labels y' in the denominator → normalization into [0, 1].

General form:
p(y | x; w) = exp(Σj wj·Fj(x,y)) / Σy' exp(Σj wj·Fj(x,y'))
● Fj(x,y): j-th feature function
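A minimal sketch of this general form, using two toy feature functions and a two-label set that are assumptions for illustration, not from the talk:

```python
# Log-linear model: feature functions score (input, output) pairs;
# the denominator normalizes over every possible label y'.
import math

LABELS = ["noun", "verb"]

def F(x, y):
    """Return the vector [F_1(x,y), F_2(x,y)] for a single word x."""
    return [1.0 if (x.endswith("s") and y == "noun") else 0.0,
            1.0 if (x.endswith("ing") and y == "verb") else 0.0]

def p(y, x, w):
    score = lambda lab: math.exp(sum(wj * fj for wj, fj in zip(w, F(x, lab))))
    return score(y) / sum(score(yp) for yp in LABELS)   # sum over all y'

print(p("noun", "cats", w=[2.0, 1.0]))   # ~0.88
```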

Linear-Chain CRF

Conditional Random Field (CRF)
From the probabilistic graphical model perspective:
● CRF is a Markov Random Field with some disjoint RVs observed and some hidden.
[Figure: a Markov Random Field over nodes x, y, z, p, q, r; some nodes observed, some hidden]

Linear-Chain CRF
From the probabilistic graphical model perspective:
● Linear-chain CRF: a specific structure of CRF
[Figure: a chain of hidden nodes, each connected to an observed node]

We often refer to "linear-chain CRF" as simply "CRF".

Linear-Chain CRF
From the log-linear model point of view: a linear-chain CRF is a log-linear model in which
1. The length L of the output y can vary.
2. Each feature function is the sum of "low-level feature functions":
Fj(x,y) = Σi=1..L fj(x, yi, yi-1, i)
[Figure: hidden tags y1 … yL aligned above the observed sequence x]

“We can have a fixed set of feature-functions Fj for log-linear training, even though the training examples are not fixed-length.” [1]
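A small sketch of this idea, assuming a hypothetical low-level feature function fj: the same Fj is computed for sequences of any length L.

```python
# F_j is the sum of f_j(x, y_i, y_{i-1}, i) over all positions, so a fixed
# set of F_j works even when training sequences have different lengths.
def f_j(x, y_i, y_prev, i):
    # fires when a verb directly follows a pronoun (illustrative choice)
    return 1.0 if (y_prev == "pronoun" and y_i == "verb") else 0.0

def F_j(x, y):
    # y_0 is a conventional START tag; works for any length L
    return sum(f_j(x, y[i], y[i - 1] if i > 0 else "START", i)
               for i in range(len(y)))

x = "He sat on the mat .".split()
y = ["pronoun", "verb", "preposition", "article", "noun", "punct"]
print(F_j(x, y))   # 1.0
```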

Example: Part of Speech (POS) Tagging
Input (observed) x: word sequence
Output (hidden) y: POS tag sequence
● For example:
x = "He sat on the mat."
y = "pronoun verb preposition article noun"
[Figure: "He sat on the mat." aligned with the tags pron. v. prep. art. n.]

Example: Part of Speech (POS) Tagging
Input (observed) x: word sequence
Output (hidden) y: POS tag sequence
● With CRF we hope the correct tag sequence is the most probable:
y* = argmaxy p(y | x; w)

CRF:
p(y | x; w) = exp(Σj wj·Fj(x,y)) / Σy' exp(Σj wj·Fj(x,y'))
, where Fj(x,y) = Σi fj(x, yi, yi-1, i)

Example: Part of Speech (POS) Tagging
An example of a low-level feature function fj(x,yi,yi-1,i):
● "The i-th word in x is capitalized, and POS tag yi = proper noun." [TRUE(1) or FALSE(0)]

If wj is large and positive: given x and all other conditions fixed, y is more probable when fj(x,yi,yi-1,i) is activated.

CRF:
p(y | x; w) = exp(Σj wj·Fj(x,y)) / Σy' exp(Σj wj·Fj(x,y'))
, where Fj(x,y) = Σi fj(x, yi, yi-1, i)

Note a feature function may not use all the given information.
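A sketch of this particular low-level feature function; the tag name "NNP" for proper noun is an assumption for illustration. Note it ignores yi-1 entirely, which is exactly the point above:

```python
def f_cap_propernoun(x, y_i, y_prev, i):
    """1 if the i-th word in x is capitalized and y_i is a proper noun."""
    return 1.0 if (x[i][0].isupper() and y_i == "NNP") else 0.0

x = ["Darren", "gave", "a", "talk"]
print(f_cap_propernoun(x, "NNP", "START", 0))  # 1.0 -- activated
print(f_cap_propernoun(x, "VBD", "NNP", 1))    # 0.0 -- not activated
```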

CRF Training and Testing

Training
Stochastic Gradient Ascent
● Partial derivative of the conditional log-likelihood:
∂/∂wj log p(y | x; w) = Fj(x,y) − Ey'~p(y'|x;w)[ Fj(x,y') ]

● Update weight by
wj ← wj + λ·( Fj(x,y) − Ey'[ Fj(x,y') ] )

Training
Note: if the j-th feature function is not activated by this training example
→ we don't need to update it!
→ usually only a few weights need to be updated in each iteration
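A minimal sketch of one such update step, assuming the observed feature totals Fj(x,y) and the model expectations have already been computed (the expectations would come from a forward-backward pass, omitted here):

```python
def sgd_step(w, F_observed, F_expected, lam=0.1):
    """w_j += lam * (F_j(x, y) - E_{y'}[F_j(x, y')]), skipping inactive j."""
    for j, (fo, fe) in enumerate(zip(F_observed, F_expected)):
        if fo == 0.0 and fe == 0.0:
            continue                 # feature not activated: no update needed
        w[j] += lam * (fo - fe)
    return w

w = [0.0, 0.0, 0.0]
print(sgd_step(w, F_observed=[1.0, 0.0, 2.0], F_expected=[0.4, 0.0, 1.5]))
```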

Testing
For 1-best derivation:
y* = argmaxy p(y | x; w) = argmaxy Σi g(yi-1, yi)
, where g(yi-1, yi) = Σj wj·fj(x, yi, yi-1, i)
[Figure: the M×M table of g(yi-1, yi), rows and columns indexed by the tags N, V, Adj, …]

Example: Part of Speech (POS) Tagging
For 1-best derivation:
1. Pre-compute g(yi-1, yi) as a table for each i
2. Perform dynamic programming to find the best sequence y:
U(i, yi) = maxyi-1 [ U(i−1, yi-1) + g(yi-1, yi) ]
● Complexity: O(M²LD)
(M²: build a table; L: for each element in the sequence; D: number of feature functions)
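A minimal sketch of this decoding step, assuming the g tables have already been computed; random scores stand in for real feature weights, and tag index 0 doubles as a START state purely for the demo:

```python
# Viterbi decoding over precomputed tables g[i][y_prev][y_cur].
import numpy as np

def viterbi(g):
    """g: list of L arrays, each M x M. Returns the best tag sequence."""
    L, M = len(g), g[0].shape[1]
    U = np.full((L, M), -np.inf)     # U[i, y] = best score ending in tag y
    back = np.zeros((L, M), dtype=int)
    U[0] = g[0][0]                   # transitions out of the START state
    for i in range(1, L):
        for y in range(M):
            scores = U[i - 1] + g[i][:, y]
            back[i, y] = scores.argmax()
            U[i, y] = scores.max()   # O(M) per cell -> O(M^2 L) overall
    # follow the back-pointers from the best final tag
    y = [int(U[-1].argmax())]
    for i in range(L - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]

g = [np.log(np.random.default_rng(0).random((3, 3))) for _ in range(5)]
print(viterbi(g))
```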

Testing
For probability estimation:
● must also sum over all possible y (e.g. all possible POS sequences) for the denominator…
● Can be calculated by matrix multiplication!
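A sketch contrasting brute-force enumeration of the denominator with the matrix-multiplication computation; the setup assumptions (g tables, START row) are the same as in the Viterbi sketch, and the two results agree:

```python
# Denominator Z(x) = sum over all y' of exp(total score), computed either by
# enumerating all M^L sequences or by one M x M matrix product per position.
import numpy as np
from itertools import product

def Z_bruteforce(g):
    L, M = len(g), g[0].shape[1]
    total = 0.0
    for y in product(range(M), repeat=L):
        score = g[0][0, y[0]] + sum(g[i][y[i-1], y[i]] for i in range(1, L))
        total += np.exp(score)
    return total

def Z_matmul(g):
    v = np.exp(g[0][0])              # exp scores out of START
    for gi in g[1:]:
        v = v @ np.exp(gi)           # one matrix product per position
    return v.sum()

g = [np.log(np.random.default_rng(1).random((3, 3))) for _ in range(4)]
print(Z_bruteforce(g), Z_matmul(g))  # the two values agree
```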

Example: Speech Disfluency Detection

Example: Speech Disfluency Detection
One of the applications of CRF in speech recognition: Boundary/Disfluency Detection [5]
● Repetition: "It is is Tuesday."
● Hesitation: "It is uh… Tuesday."
● Correction: "It is Monday, I mean, Tuesday."
● etc.

Possible clues: prosody
● Pitch
● Duration
● Energy
● Pause
● etc.

"It is uh… Tuesday."
● Pitch reset?
● Long duration?
● Low energy?
● Pause existence?

Example: Speech Disfluency Detection
One of the applications of CRF in speech recognition: Boundary/Disfluency Detection [5]
● CRF input x: prosodic features
● CRF output y: boundary/disfluency labels
[Figure: pipeline from Speech Recognition → CRF → Rescoring]

Reference
[1] Charles Elkan, "Log-linear Models and Conditional Random Fields"
○ Tutorial at CIKM08 (ACM International Conference on Information and Knowledge Management)
○ Video: http://videolectures.net/cikm08_elkan_llmacrf/
○ Lecture notes: http://cseweb.ucsd.edu/~elkan/250B/cikmtutorial.pdf
[2] Hanna M. Wallach, "Conditional Random Fields: An Introduction"
[3] Jeremy Morris, "Conditional Random Fields: An Overview"
○ Presented at OSU Clippers 2008, January 11, 2008
[4] J. Lafferty, A. McCallum, F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data", 2001.
[5] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, M. Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies", IEEE Transactions on Audio, Speech, and Language Processing, 2006.