TRANSCRIPT
Sequence Modelling with Features: Linear-Chain Conditional Random Fields
COMP-599
Oct 3, 2016
Outline
Sequence modelling: review and miscellaneous notes
Hidden Markov models: shortcomings
Generative vs. discriminative models
Linear-chain CRFs
• Inference and learning algorithms with linear-chain CRFs
Hidden Markov Model
Graph specifies how the joint probability decomposes:

P(Q, O) = P(Q_1) ∏_{t=1}^{T-1} P(Q_{t+1} | Q_t) ∏_{t=1}^{T} P(O_t | Q_t)

Here P(Q_1) is the initial state probability, P(Q_{t+1} | Q_t) are the state transition probabilities, and P(O_t | Q_t) are the emission probabilities.

[Figure: HMM as a graphical model, with hidden states Q_1 … Q_5 and their emitted observations O_1 … O_5]
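To make the decomposition concrete, here is a minimal Python sketch that computes P(Q, O) for a toy tag/word sequence; the states, words, and probability values are invented for illustration and are not from the slides.

    # Toy HMM: joint probability of a state sequence Q and observation sequence O.
    # P(Q, O) = P(Q_1) * prod_t P(Q_{t+1} | Q_t) * prod_t P(O_t | Q_t)
    init = {"DT": 0.6, "NN": 0.4}                              # initial state probabilities
    trans = {("DT", "DT"): 0.3, ("DT", "NN"): 0.7,             # state transition probabilities
             ("NN", "DT"): 0.7, ("NN", "NN"): 0.3}
    emit = {("DT", "the"): 0.5, ("DT", "dog"): 0.0,            # emission probabilities
            ("NN", "the"): 0.0, ("NN", "dog"): 0.1}

    def joint_prob(states, words):
        p = init[states[0]]
        for t in range(len(states) - 1):
            p *= trans[(states[t], states[t + 1])]
        for q, o in zip(states, words):
            p *= emit[(q, o)]
        return p

    print(joint_prob(["DT", "NN"], ["the", "dog"]))            # 0.6 * 0.7 * 0.5 * 0.1 = 0.021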
EM and Baum-Welch
EM finds a local optimum in P(O | θ).
You can show that after each step of EM:

P(O | θ_{i+1}) ≥ P(O | θ_i)
However, this is not necessarily a global optimum.
Dealing with Local Optima
Random restarts
• Train multiple models with different random initializations
• Model selection on a development set to pick the best one
Biased initialization
• Bias your initialization using some external source of knowledge (e.g., external corpus counts, a clustering procedure, or expert knowledge about the domain)
• Further training will hopefully improve results
Caveats
Baum-Welch with no labelled data generally gives poor results, at least for linguistic structure (~40% accuracy, according to Johnson (2007)).
Semi-supervised learning: combine small amounts of labelled data with a larger corpus of unlabelled data.
In Practice
Per-token (i.e., per-word) accuracy results on the WSJ corpus:
• Most frequent tag baseline: ~90–94%
• HMMs (Brants, 2000): 96.5%
• Stanford tagger (Manning, 2011): 97.32%
Other Sequence Modelling Tasks
Chunking (a.k.a. shallow parsing)
• Find syntactic chunks in the sentence; not hierarchical
[NP The chicken] [V crossed] [NP the road] [P across] [NP the lake].
Named-Entity Recognition (NER)
• Identify elements in text that correspond to some high-level categories (e.g., PERSON, ORGANIZATION, LOCATION)
[ORG McGill University] is located in [LOC Montreal, Canada].
• Problem: need to detect spans of multiple words
First Attempt
Simply label words by their category:
ORG ORG ORG - ORG - - - LOC
McGill, UQAM, UdeM, and Concordia are located in Montreal.
What is the problem with this scheme?
IOB Tagging
In addition to its category, label whether a word is inside (I), outside (O), or at the beginning (B) of a span.
For n categories, 2n+1 total tags.
B-ORG I-ORG O O O B-LOC I-LOC
McGill University is located in Montreal, Canada
B-ORG B-ORG B-ORG O B-ORG O O O B-LOC
McGill, UQAM, UdeM, and Concordia are located in Montreal.
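A minimal sketch of this conversion from labelled spans to IOB tags; the span format and the function name to_iob are assumptions made for illustration.

    # Convert labelled spans into per-token IOB tags.
    # spans: list of (start, end_exclusive, category) over token indices.
    def to_iob(tokens, spans):
        tags = ["O"] * len(tokens)
        for start, end, cat in spans:
            tags[start] = "B-" + cat              # beginning of the span
            for i in range(start + 1, end):
                tags[i] = "I-" + cat              # inside the span
        return tags

    tokens = ["McGill", "University", "is", "located", "in", "Montreal,", "Canada"]
    print(to_iob(tokens, [(0, 2, "ORG"), (5, 7, "LOC")]))
    # ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC', 'I-LOC']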
Sequence Modelling with Linguistic Features
Shortcomings of Standard HMMs
How do we add more features to HMMs?
Might be useful for POS tagging:
• Word position within the sentence (1st, 2nd, last, …)
• Capitalization
• Word prefixes and suffixes (-tion, -ed, -ly, -er, re-, de-)
• Features that depend on more than the current word or the previous words
Possible to Do with HMMs
Add more emissions at each timestep
Clunky
Is there a better way to do this?
[Figure: a single state Q_1 with multiple emissions O_11, O_12, O_13, O_14, representing word identity, capitalization, a prefix feature, and word position]
Discriminative Models
HMMs are generative models
• Build a model of the joint probability distribution P(Q, O)
• Let's rename the variables
• Generative models specify P(X, Y; θ_gen)
If we are only interested in POS tagging, we can instead train a discriminative model
• Model specifies P(Y | X; θ_disc)
• Now a task-specific model for sequence labelling; cannot use it for generating new samples of word and POS sequences
Generative or Discriminative?
Naive Bayes:

P(y | x) = P(y) ∏_i P(x_i | y) / P(x)

Logistic regression:

P(y | x) = 1 / (1 + exp(-(a_1 x_1 + a_2 x_2 + … + a_n x_n + b)))
Discriminative Sequence Model
The parallel to an HMM in the discriminative case: linear-chain conditional random fields (linear-chain CRFs) (Lafferty et al., 2001)

P(Y | X) = (1 / Z(X)) exp( ∑_t ∑_k θ_k f_k(y_t, y_{t-1}, x_t) )

Z(X) is a normalization constant:

Z(X) = ∑_Y exp( ∑_t ∑_k θ_k f_k(y_t, y_{t-1}, x_t) )

Here ∑_k is a sum over all features, ∑_t is a sum over all time-steps, and ∑_Y is a sum over all possible sequences of hidden states.
Intuition
Standard HMM: product of probabilities; these probabilities are defined over the identity of the states and words
• Transition from state DT to NN: P(y_{t+1} = NN | y_t = DT)
• Emit word the from state DT: P(x_t = the | y_t = DT)
Linear-chain CRF: replace the products by numbers that are NOT probabilities, but linear combinations of weights and feature values.
Features in CRFs
Standard HMM probabilities as CRF features:
• Transition from state DT to state NN: f_{DT→NN}(y_t, y_{t-1}, x_t) = 1(y_{t-1} = DT) 1(y_t = NN)
• Emit the from state DT: f_{DT→the}(y_t, y_{t-1}, x_t) = 1(y_t = DT) 1(x_t = the)
• Initial state is DT: f_{DT}(y_1, x_1) = 1(y_1 = DT)
Indicator function: 1(condition) = 1 if condition is true, 0 otherwise
Features in CRFs
Additional features that may be useful (sketched in code below):
• Word is capitalized: f_{cap}(y_t, y_{t-1}, x_t) = 1(y_t = ?) 1(x_t is capitalized)
• Word ends in -ed: f_{-ed}(y_t, y_{t-1}, x_t) = 1(y_t = ?) 1(x_t ends with ed)
• Exercise: propose more features
Inference with LC-CRFs
Dynamic programming still works: modify the forward and Viterbi algorithms to work with the weight-feature products.

Forward algorithm: P(X) (HMM) vs. Z(X) (LC-CRF)
Viterbi algorithm: argmax_Y P(X, Y | θ) (HMM) vs. argmax_Y P(Y | X, θ) (LC-CRF)
Forward Algorithm for HMMs
Create trellis α_j(t) for j = 1 … N, t = 1 … T

α_j(1) = π_j b_j(o_1)  for j = 1 … N
for t = 2 … T:
  for j = 1 … N:
    α_j(t) = ∑_{i=1}^{N} α_i(t-1) a_{ij} b_j(o_t)

P(O | θ) = ∑_{j=1}^{N} α_j(T)

Runtime: O(N²T)
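A minimal sketch of this recursion in Python; the dictionary-based parameter layout (pi, a, b) is an assumption for illustration.

    # Forward algorithm for an HMM.
    # pi[j]: initial probability of state j; a[i][j]: transition i -> j; b[j][o]: emission of o from j.
    def forward(obs, states, pi, a, b):
        T = len(obs)
        alpha = [{} for _ in range(T)]
        for j in states:                              # initialization, t = 1
            alpha[0][j] = pi[j] * b[j][obs[0]]
        for t in range(1, T):                         # recursion, t = 2 ... T
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in states) * b[j][obs[t]]
        return sum(alpha[T - 1][j] for j in states)   # P(O | theta)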
Forward Algorithm for LC-CRFs
Create trellis α_j(t) for j = 1 … N, t = 1 … T

α_j(1) = exp( ∑_k θ_k f_k^{init}(y_1 = j, x_1) )  for j = 1 … N
for t = 2 … T:
  for j = 1 … N:
    α_j(t) = ∑_{i=1}^{N} α_i(t-1) exp( ∑_k θ_k f_k(y_t = j, y_{t-1} = i, x_t) )

Z(X) = ∑_{j=1}^{N} α_j(T)

Runtime: O(N²T)
Having Z(X) allows us to compute P(Y | X).
Note: the transition and emission probabilities are replaced by exponentials of weighted sums of features.
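The same recursion with each probability replaced by the exponential of a weighted feature sum; a minimal sketch that reuses the feature-function style from above, with an assumed start symbol standing in for y_0.

    import math

    # Forward algorithm for a linear-chain CRF; returns the normalizer Z(X).
    def crf_forward(words, states, features, theta, start="<START>"):
        def score(y, y_prev, x):
            # exp(sum_k theta_k * f_k(y, y_prev, x)): the CRF's replacement for a*b
            return math.exp(sum(theta[k] * f(y, y_prev, x) for k, f in features.items()))

        T = len(words)
        alpha = [{} for _ in range(T)]
        for j in states:                               # t = 1: initial features
            alpha[0][j] = score(j, start, words[0])
        for t in range(1, T):                          # t = 2 ... T
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * score(j, i, words[t]) for i in states)
        return sum(alpha[T - 1][j] for j in states)    # Z(X)

In practice this computation is usually carried out in log space to avoid overflow from the exponentials.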
Viterbi Algorithm for HMMs
Create trellis δ_j(t) for j = 1 … N, t = 1 … T

δ_j(1) = π_j b_j(o_1)  for j = 1 … N
for t = 2 … T:
  for j = 1 … N:
    δ_j(t) = max_i δ_i(t-1) a_{ij} b_j(o_t)

Take max_j δ_j(T)

Runtime: O(N²T)
Viterbi Algorithm for LC-CRFs
Create trellis δ_j(t) for j = 1 … N, t = 1 … T

δ_j(1) = exp( ∑_k θ_k f_k^{init}(y_1 = j, x_1) )  for j = 1 … N
for t = 2 … T:
  for j = 1 … N:
    δ_j(t) = max_i δ_i(t-1) exp( ∑_k θ_k f_k(y_t = j, y_{t-1} = i, x_t) )

Take max_j δ_j(T)

Runtime: O(N²T)
Remember that we need backpointers.
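A minimal sketch of the LC-CRF Viterbi recursion with backpointers, using the same assumed score function as in the forward sketch above.

    import math

    # Viterbi for a linear-chain CRF: returns the highest-scoring tag sequence.
    def crf_viterbi(words, states, features, theta, start="<START>"):
        def score(y, y_prev, x):
            return math.exp(sum(theta[k] * f(y, y_prev, x) for k, f in features.items()))

        T = len(words)
        delta = [{} for _ in range(T)]
        backptr = [{} for _ in range(T)]
        for j in states:
            delta[0][j] = score(j, start, words[0])
        for t in range(1, T):
            for j in states:
                best_i = max(states, key=lambda i: delta[t - 1][i] * score(j, i, words[t]))
                delta[t][j] = delta[t - 1][best_i] * score(j, best_i, words[t])
                backptr[t][j] = best_i                     # remember where the max came from
        best = max(states, key=lambda j: delta[T - 1][j])  # best final state
        tags = [best]
        for t in range(T - 1, 0, -1):                      # follow the backpointers
            best = backptr[t][best]
            tags.append(best)
        return list(reversed(tags))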
Training LC-CRFs
Unlike for HMMs, there is no analytic MLE solution.
Use an iterative method to improve the data likelihood:
• Gradient descent
• A version of Newton's method to find where the gradient is 0
Convexity
Fortunately, ℓ(θ) is a concave function (equivalently, its negation is a convex function). That means that we will find the global maximum of ℓ(θ) with gradient descent.
Gradient Ascent
Walk in the direction of the gradient to maximize ℓ(θ)
• a.k.a. gradient descent on a loss function

θ_new = θ_old + γ ∇ℓ(θ)

γ is a learning rate that specifies how large a step to take.
There are more sophisticated ways to do this update:
• Conjugate gradient
• L-BFGS (approximates second-derivative information)
Gradient Descent Summary
Descent vs. ascent
Convention: think about the problem as a minimization problem
Minimize the negative log likelihood

Initialize θ = (θ_1, θ_2, …, θ_K) randomly
Do for a while:
  Compute ∇ℓ(θ), which will require dynamic programming (i.e., the forward algorithm)
  θ ← θ − γ ∇ℓ(θ)
Gradient of Log-Likelihood
Find the gradient of the log likelihood of the training corpus:

ℓ(θ) = log ∏_i P(Y^(i) | X^(i))
Interpretation of Gradient
Overall, the gradient is the difference between:

∑_i ∑_t f_k(y_t^(i), y_{t-1}^(i), x_t^(i))

the empirical distribution of feature f_k in the training corpus,
and:

∑_i ∑_t ∑_{y, y′} f_k(y, y′, x_t^(i)) P(y, y′ | X^(i))

the expected distribution of f_k as predicted by the current model.
Interpretation of Gradient
When the corpus likelihood is maximized, the gradient is zero, so the difference is zero.
Intuitively, this means that finding the parameter estimates by gradient descent is equivalent to telling our model to predict the features so that they follow the same distribution as in the gold standard.
Regularization
To avoid overfitting, we can encourage the weights to be close to zero.
Add a term to the corpus log likelihood:

ℓ′(θ) = log ∏_i P(Y^(i) | X^(i)) − ∑_k θ_k² / (2σ²)

σ controls the amount of regularization.
This results in an extra term in the gradient:

−θ_k / σ²
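As a minimal sketch, the extra gradient term can be added on top of whatever gradient of the negative log likelihood has already been computed; grad, theta, and sigma are assumed names.

    # Add the L2 term theta_k / sigma^2 to each partial derivative of the
    # negative log likelihood (the sign flips relative to the slide because
    # we minimize the negative of the regularized likelihood l'(theta)).
    def add_l2_term(grad, theta, sigma):
        for k in theta:
            grad[k] = grad.get(k, 0.0) + theta[k] / sigma ** 2
        return grad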
Stochastic Gradient Descent
In the standard version of the algorithm, the gradient is computed over the entire training corpus.
• Weights are updated only once per iteration through the training corpus
Alternative: calculate the gradient over a small mini-batch of the training corpus and update the weights immediately
• Many weight updates per iteration through the training corpus
• Usually results in much faster convergence to the final solution, without loss in performance
Stochastic Gradient Descent
Initialize θ = (θ_1, θ_2, …, θ_K) randomly
Do for a while:
  Randomize the order of samples in the training corpus
  For each mini-batch in the training corpus:
    Compute ∇ℓ′(θ) over this mini-batch
    θ ← θ − γ ∇ℓ′(θ)
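A minimal sketch of this mini-batch loop in Python; the gradient computation grad_fn (which would internally run the forward algorithm) is assumed to be provided.

    import random

    # Mini-batch SGD on the negative regularized log likelihood.
    # grad_fn(theta, batch) is assumed to return {feature name: partial derivative}.
    def sgd(theta, corpus, grad_fn, gamma=0.1, batch_size=32, epochs=10):
        for _ in range(epochs):
            random.shuffle(corpus)                        # randomize sample order
            for start in range(0, len(corpus), batch_size):
                batch = corpus[start:start + batch_size]
                grad = grad_fn(theta, batch)
                for k in grad:                            # one update per mini-batch
                    theta[k] -= gamma * grad[k]
        return theta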