Foundations of Statistical NLP
Chapter 9. Markov Models
한 기 덕
Contents
Introduction
Markov Models
Hidden Markov Models
– Why use HMMs
– General form of an HMM
– The Three Fundamental Questions for HMMs
Fundamental Questions for HMMs
Implementation, Properties, and Variants
Introduction
Markov Model
– Markov processes/chains/models were first developed by Andrei A. Markov.
– First linguistic use: modeling the letter sequences in Russian literature (1913).
– Current use: a general statistical tool.
VMM (Visible Markov Model)
– The words in sentences depend on their syntax, which is directly observable.
HMM (Hidden Markov Model)
– Operates at a higher level of abstraction by postulating additional "hidden" structure.
Markov Models
Markov assumption
– Future elements of the sequence are independent of past elements, given the present element.
Limited horizon
– X_1, ..., X_T: a sequence of random variables; s_k: a state in the state space S.
Time invariant (stationary)
– The transition probabilities do not change over time.
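In standard textbook form, the two conditions are:

P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)   (limited horizon)

P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)   (time invariance)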
Markov Models (Cont’)
Notation
– Stochastic transition matrix A: a_ij = P(X_{t+1} = s_j | X_t = s_i)
– Initial state probabilities: π_i = P(X_1 = s_i)
Application: linear sequences of events
– Modeling valid phone sequences in speech recognition
– Modeling sequences of speech acts in dialog systems
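Under this notation, the probability of a state sequence is π_{X_1} multiplied by the transition probabilities along the sequence. A minimal Python sketch of that computation; the toy states and probabilities here are illustrative assumptions, not values from the slides.

pi = {"a": 1.0, "e": 0.0}            # initial state probabilities (assumed)
a = {"a": {"a": 0.6, "e": 0.4},      # stochastic transition matrix (assumed)
     "e": {"a": 0.3, "e": 0.7}}

def sequence_prob(states):
    """P(X_1 ... X_T) = pi[X_1] * product over t of a[X_t][X_{t+1}]."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= a[prev][nxt]
    return p

print(sequence_prob(["a", "e", "a"]))  # 1.0 * 0.4 * 0.3 = 0.12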
Markov Chain
– Circle: a state and its name
– Arrows connecting states: possible transitions
– Arc label: the probability of each transition
Visible Markov Model
We know which states the machine is passing through.
mth-order Markov model
– For n ≥ 3, n-gram models violate the limited horizon condition.
– We can still reformulate any n-gram model as a visible Markov model by simply encoding the (n-1)-gram history in the state (see the sketch below).
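A minimal sketch of this state-encoding trick for a trigram model: each state is a word pair, so trigram probabilities become ordinary first-order transitions. The toy vocabulary and probabilities are illustrative assumptions.

# Trigram probabilities P(w | w1, w2), keyed by the two-word history.
trigram = {("the", "big"): {"dog": 0.5, "cat": 0.5}}

# Equivalent first-order transitions between pair-states: the state
# (w1, w2) moves to (w2, w) with the trigram probability of w.
transitions = {
    (w1, w2): {(w2, w): p for w, p in nxt.items()}
    for (w1, w2), nxt in trigram.items()
}
print(transitions[("the", "big")])  # {('big', 'dog'): 0.5, ('big', 'cat'): 0.5}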
Hidden Markov Model
We don’t know the state sequence that the model passes through, only some probabilistic function of it.
Example 1: the crazy soft drink machine
– Two states: cola preferring (CP) and iced tea preferring (IP)
– As a VMM: the machine would always put out a cola in state CP.
– As an HMM: each state has emission probabilities, i.e., the output probability is conditioned on the From state.
Crazy soft drink machine
Problem
– What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola preferring state? (A sketch of the calculation follows.)
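A minimal Python sketch of this calculation. The transition and emission probabilities below are the ones given in Manning and Schütze's crazy-soft-drink-machine example; since the slide's own tables did not carry over, treat them as assumed values.

trans = {"CP": {"CP": 0.7, "IP": 0.3},   # assumed textbook values
         "IP": {"CP": 0.5, "IP": 0.5}}
emit = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
        "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

# State-emission model: emit from the current state, then transition.
# P(lem, ice_t | start in CP)
#   = P(lem | CP) * sum over next states s of P(CP -> s) * P(ice_t | s)
prob = emit["CP"]["lem"] * sum(
    trans["CP"][s] * emit[s]["ice_t"] for s in ("CP", "IP"))
print(prob)  # 0.3 * (0.7*0.1 + 0.3*0.7) = 0.084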
Crazy soft drink machine (Cont’)
[Slide figure: step-by-step calculation of the output-sequence probability.]
Why use HMMs?
Underlying events probabilistically generate surface events
– Example: the parts of speech underlying the words in a text.
Linear interpolation of n-grams can itself be encoded as an HMM (see the formula below).
Hidden state
– The choice of whether to use the unigram, bigram, or trigram probabilities.
Two keys
– The conversion works by adding epsilon transitions (transitions that emit no output).
– Separate interpolation parameters λ_i^{ab} exist for each history, but we don't adjust them separately.
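The linearly interpolated trigram estimator that this HMM encodes has the standard form

P(w_n \mid w_{n-2}, w_{n-1}) =
  \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1}),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1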
Notation
– Set of states: S = {s_1, ..., s_N}
– Output alphabet: K = {k_1, ..., k_M}
– Initial state probabilities: Π = {π_i}, i ∈ S
– State transition probabilities: A = {a_ij}, i, j ∈ S
– Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K
– State sequence: X = (X_1, ..., X_{T+1})
– Output sequence: O = (o_1, ..., o_T)
General form of an HMM
Arc-emission HMM
– The symbol emitted at time t depends on both the state at time t and the state at time t+1.
State-emission HMM (e.g., the crazy drink machine)
– The symbol emitted at time t depends just on the state at time t.
Figure 9.4 A program for a Markov process.
The Three Fundamental Questions for HMMs
– Given a model μ = (A, B, Π), how do we efficiently compute P(O|μ), the probability of an observation sequence O?
– Given the observation sequence O and a model μ, how do we choose the state sequence (X_1, ..., X_{T+1}) that best explains the observations?
– Given an observation sequence O and a space of possible models, how do we find the model μ that best explains the observed data?
Finding the probability of an observation
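In the textbook's arc-emission notation, the direct calculation sums over all state sequences:

P(O \mid \mu) = \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)
             = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}

Evaluated directly, this needs on the order of (2T+1) N^{T+1} multiplications, which is why the forward procedure on the next slide matters.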
The forward procedure
– A cheap dynamic-programming algorithm: it requires only 2N²T multiplications.
– Forward variables α_i(t) store the probability of the observations so far together with being in state i at time t (see the sketch below).
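A minimal Python sketch of the forward procedure for a state-emission HMM (the drink-machine style); the function and parameter names are illustrative, not from the slides.

def forward(obs, states, init, trans, emit):
    """Return P(obs | model) by dynamic programming.
    alpha[t][s] = P(obs[0..t] emitted and state s at time t)."""
    alpha = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            s: sum(alpha[t - 1][r] * trans[r][s] for r in states) * emit[s][obs[t]]
            for s in states
        })
    # P(obs) is the total probability over all final states.
    return sum(alpha[-1].values())

With the assumed drink-machine tables from the earlier sketch and init = {"CP": 1.0, "IP": 0.0}, forward(["lem", "ice_t"], ("CP", "IP"), init, trans, emit) again yields 0.084.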
The backward procedure
– The backward variable gives the total probability of seeing the rest of the observation sequence from a given state.
– Use of a combination of forward and backward probabilities is vital for solving the third problem, parameter reestimation.
– Backward variables and the forward-backward combination are defined below.
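In the textbook's arc-emission notation, the backward variables are

\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu), \qquad \beta_i(T+1) = 1

\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{i j o_t}\, \beta_j(t+1), \qquad 1 \le t \le T

and combining forward and backward variables gives, for any 1 \le t \le T+1,

P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)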
Finding the best state sequence
– There is more than one way to define the state sequence that best explains the observations.
– One choice: for each t, find the X_t that individually maximizes P(X_t | O, μ).
– This may yield a quite unlikely state sequence overall.
– The Viterbi algorithm, which finds the most likely complete path, is more efficient.
Viterbi algorithm
– Find the most likely complete path: argmax_X P(X | O, μ).
– For a fixed O, it is sufficient to maximize P(X, O | μ).
– Definition: δ_j(t) is the probability of the best path that emits o_1 ... o_{t-1} and ends in state j at time t; backpointers ψ record how each best path was reached (see the sketch below).
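A minimal Python sketch of the Viterbi algorithm for a state-emission HMM; parameter names are illustrative. Run on O = (lem, ice_t, cola) with the assumed drink-machine tables from earlier, it reproduces the variable calculations summarized on the next slide.

def viterbi(obs, states, init, trans, emit):
    """Return (most likely state path, its probability)."""
    # delta[t][s]: probability of the best path that emits obs[0..t]
    # and ends in state s; psi[t][s]: best predecessor of s at time t.
    delta = [{s: init[s] * emit[s][obs[0]] for s in states}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for s in states:
            best = max(states, key=lambda r: delta[t - 1][r] * trans[r][s])
            psi[t][s] = best
            delta[t][s] = delta[t - 1][best] * trans[best][s] * emit[s][obs[t]]
    last = max(states, key=lambda s: delta[-1][s])  # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):            # follow backpointers
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]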
Variable calculations for O = (lem, ice_t, cola)
[Slide table: the δ and ψ values computed by the Viterbi algorithm at each time step.]
Parameter estimation
– Given a certain observation sequence, find the values of the model parameters μ = (A, B, Π) using Maximum Likelihood Estimation.
– Locally maximize P(O | μ) by an iterative hill-climbing algorithm (Baum-Welch, also called forward-backward), which is usually effective for HMMs.
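In the textbook notation, the reestimation step computes expected transition counts from the forward and backward variables and then renormalizes:

p_t(i, j) = P(X_t = i, X_{t+1} = j \mid O, \mu)
          = \frac{\alpha_i(t)\, a_{ij}\, b_{i j o_t}\, \beta_j(t+1)}
                 {\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}

\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j), \qquad \hat{\pi}_i = \gamma_i(1)

\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}, \qquad
\hat{b}_{ijk} = \frac{\sum_{t : o_t = k} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}

Each iteration replaces μ by the reestimated model; the likelihood P(O | μ) never decreases.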
Implementation, Properties, Variants
Implementation
– Obvious issue: we keep on multiplying very small numbers, which underflows. Use the log function instead (see the sketch at the end).
Variants
– Estimating that many parameters reliably is very hard; variants such as tying parameters reduce the number to estimate.
Multiple input observations
– Training can use more than one observation sequence.
Initialization of parameter values
– Try to start the search near the global maximum, since hill-climbing only finds a local one.
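A minimal Python sketch of the log-space trick mentioned under Implementation: products of probabilities become sums of logs, and sums of probabilities (as in the forward procedure) use log-sum-exp. The function names are illustrative.

import math

def log_product(probs):
    """log(p1 * p2 * ...) as a sum of logs; avoids underflow."""
    return sum(math.log(p) for p in probs)

def logsumexp(log_probs):
    """log(sum(exp(x) for x in log_probs)) computed stably,
    for forward-style sums over states."""
    m = max(log_probs)
    return m + math.log(sum(math.exp(x - m) for x in log_probs))

# Example: 0.01 ** 1000 underflows to 0.0 in ordinary floating point,
# but the log of the product stays exact (about -4605.2):
print(log_product([0.01] * 1000))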