© Padhraic Smyth, UC Irvine
A Review of Hidden Markov Models for Context-Based Classification
ICML'01 Workshop on Temporal and Spatial Learning
Williams College, June 28th 2001
Padhraic Smyth, Information and Computer Science
University of California, Irvine
www.datalab.uci.edu
Outline
• Context in classification
• Brief review of hidden Markov models
• Hidden Markov models for classification
• Simulation results: how useful is context? (with Dasha Chudova, UCI)
Historical Note
• "Classification in Context" was well studied in pattern recognition in the 60's and 70's
  – e.g., recursive Markov-based algorithms were proposed before hidden Markov algorithms and models were fully understood
• Applications in
  – OCR for word-level recognition
  – remote-sensing pixel classification
Papers of Note
Raviv, J., "Decision-making in Markov chains applied to the problem of pattern recognition," IEEE Info Theory, 3(4), 1967
Hanson, Riseman, and Fisher, "Context in word recognition," Pattern Recognition, 1976
Toussaint, G., "The use of context in pattern recognition," Pattern Recognition, 10, 1978
Mohn, Hjort, and Storvik, "A simulation study of some contextual classification methods for remotely sensed data," IEEE Trans. Geo. Rem. Sens., 25(6), 1987
Context-Based Classification Problems
• Medical Diagnosis
  – classification of a patient's state over time
• Fraud Detection
  – detection of stolen credit cards
• Electronic Nose
  – detection of landmines
• Remote Sensing
  – classification of pixels into ground cover
Modeling Context
• Common Theme = Context
  – class labels (and features) are "persistent" in time/space
[Figure: graphical model with hidden class labels C1, C2, C3, ..., CT along the time axis, each emitting an observed feature X1, X2, X3, ..., XT]
Feature Windows
• Predict Ct using a window, e.g., f(Xt, Xt-1, Xt-2)
– e.g., NETtalk application
Alternative: Probabilistic Modeling
• E.g., assume p(Ct | history) = p(Ct | Ct-1)
– first order Markov assumption on the classes
Brief review of hidden Markov models (HMMs)
Graphical Models
• Basic Idea: p(U) <=> an annotated graph
– Let U be a set of random variables of interest
– 1-1 mapping from U to nodes in a graph
– graph encodes “independence structure” of model
– numerical specifications of p(U) are stored locally at the nodes
Acyclic Directed Graphical Models (aka belief/Bayesian networks)

[Figure: three-node directed graph with A → C and B → C]

In general,
  p(X1, X2, ..., XN) = ∏_i p(Xi | parents(Xi))
and for this example,
  p(A, B, C) = p(C | A, B) p(A) p(B)
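A small illustration of "numerical specifications stored locally at the nodes" for the A → C ← B example; the probability tables below are made-up placeholder numbers, not values from the talk.

```python
import numpy as np

# Local tables, one per node (all numbers are illustrative placeholders).
pA = np.array([0.3, 0.7])                       # p(A)
pB = np.array([0.6, 0.4])                       # p(B)
pC_given_AB = np.array([[[0.9, 0.1],            # p(C | A=0, B=0)
                         [0.5, 0.5]],           # p(C | A=0, B=1)
                        [[0.4, 0.6],            # p(C | A=1, B=0)
                         [0.2, 0.8]]])          # p(C | A=1, B=1)

def p_abc(a, b, c):
    """Joint probability assembled from the locally stored factors."""
    return pC_given_AB[a, b, c] * pA[a] * pB[b]

# Sanity check: the joint sums to 1 over all 8 configurations.
total = sum(p_abc(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12
```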
Undirected Graphical Models (UGs)
• Undirected edges reflect correlational dependencies
  – e.g., particles in physical systems, pixels in an image
• Also known as Markov random fields, Boltzmann machines, etc.
  p(X1, X2, ..., XN) = ∏_i potential(clique i)   (up to a normalization constant)
Examples of 3-way Graphical Models

Markov chain (A → B → C):
  p(A, B, C) = p(C | B) p(B | A) p(A)

Independent causes (A → C ← B):
  p(A, B, C) = p(C | A, B) p(A) p(B)
Hidden Markov Graphical Model
• Assumption 1:
– p(Ct | history) = p(Ct | Ct-1)
– first order Markov assumption on the classes
• Assumption 2:
– p(Xt | history, Ct ) = p(Xt | Ct )
– Xt only depends on current class Ct
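A minimal sketch of what these two assumptions buy: the joint distribution is fully specified by an initial class distribution, a class transition matrix, and a per-class feature model. The two-class Gaussian numbers below are placeholders chosen for illustration, not parameters from the talk.

```python
import numpy as np
from scipy.stats import norm

pi0 = np.array([0.5, 0.5])                 # p(C_1)
A = np.array([[0.9, 0.1],                  # p(C_t | C_{t-1}); rows sum to 1
              [0.1, 0.9]])
means, sigma = np.array([0.0, 1.0]), 1.0   # p(X_t | C_t) = N(means[C_t], sigma^2)

def log_joint(x, c):
    """log p(x_1..x_T, c_1..c_T) under Assumptions 1 and 2."""
    lp = np.log(pi0[c[0]]) + norm.logpdf(x[0], means[c[0]], sigma)
    for t in range(1, len(x)):
        lp += np.log(A[c[t - 1], c[t]]) + norm.logpdf(x[t], means[c[t]], sigma)
    return lp
```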
Hidden Markov Graphical Model
[Figure: the HMM graphical model; hidden classes C1 ... CT in a chain, each emitting an observed feature Xt]

Notes:
  – all temporal dependence is modeled through the class variable C
  – this is the simplest possible model
  – avoids modeling p(X | other X's)
Generalizations of HMMs
[Figure: hidden weather-state chain C1 ... CT with two sets of observed variables: spatial rainfall R1 ... RT and atmospheric measurements A1 ... AT]

Hidden state model relating atmospheric measurements to local rainfall: a "weather state" couples multiple variables in time and space (Hughes and Guttorp, 1996).

Graphical models = a language for spatio-temporal modeling
Exact Probability Propagation (PP) Algorithms
• Basic PP Algorithm (Pearl, 1988; Lauritzen and Spiegelhalter, 1988)
  – Assume the graph has no loops
  – Declare 1 node (any node) to be a root
  – Schedule two phases of message-passing
    • nodes pass messages up to the root
    • messages are distributed back to the leaves
  – (if there are loops, convert the loopy graph to an equivalent tree)
Properties of the PP Algorithm
• Exact
  – p(node | all data) is recoverable at each node
    • i.e., we get exact posteriors from local message-passing
  – modification: MPE = most likely instantiation of all nodes jointly
• Efficient
  – complexity: exponential in the size of the largest clique
  – brute force: exponential in all variables
PP Algorithm for a HMM
[Figure: the HMM chain of hidden classes C1 ... CT with observed features X1 ... XT]

Let CT be the root
Absorb evidence from X's (which are fixed)
Forward pass: pass evidence forward from C1
Backward pass: pass evidence backward from CT

(This is the celebrated "forward-backward" algorithm for HMMs)
Comments on F-B Algorithm
• Complexity = O(T m²), where m is the number of classes
• Has been reinvented several times
  – e.g., the BCJR algorithm for error-correcting codes
• Real-time recursive version
  – run the algorithm forward to the current time t
  – can propagate backwards to "revise" history
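For concreteness, here is a minimal scaled forward-backward sketch for a discrete-class HMM with one-dimensional Gaussian features; the O(T m²) cost shows up in the matrix-vector products inside the two loops. The names (pi0, A, means, sigma) and the Gaussian emission choice are illustrative assumptions, not details from the talk.

```python
import numpy as np
from scipy.stats import norm

def forward_backward(x, pi0, A, means, sigma):
    """Return the smoothed posteriors p(C_t = k | x_1..x_T).

    x     : (T,) observed feature sequence
    pi0   : (m,) initial class probabilities
    A     : (m, m) transition matrix, A[i, j] = p(C_t = j | C_{t-1} = i)
    means : (m,) class-conditional Gaussian means (shared std dev `sigma`)
    """
    T, m = len(x), len(pi0)
    B = norm.pdf(np.asarray(x)[:, None], loc=means[None, :], scale=sigma)  # p(x_t | C_t)

    alpha = np.zeros((T, m))      # scaled forward messages
    c = np.zeros(T)               # scaling constants (avoid underflow)
    alpha[0] = pi0 * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta = np.ones((T, m))        # scaled backward messages
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta          # proportional to p(C_t | all data)
    return gamma / gamma.sum(axis=1, keepdims=True)
```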
HMMs and Classification
Forward-Backward Algorithm
• Classification
  – the algorithm produces p(Ct | all other data) at each node
  – to minimize 0-1 loss
    • choose the most likely class at each t
• Most likely class sequence?
  – not the same as the sequence of most likely classes
  – can be found instead with Viterbi / dynamic programming
    • replace the sums in F-B with "max"
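A matching Viterbi sketch ("replace the sums with max"), assuming the same Gaussian-emission setup as the forward-backward sketch above; working in the log domain is an implementation choice made here, not something specified in the talk.

```python
import numpy as np
from scipy.stats import norm

def viterbi(x, pi0, A, means, sigma):
    """Most likely joint class sequence via dynamic programming."""
    T, m = len(x), len(pi0)
    logB = norm.logpdf(np.asarray(x)[:, None], loc=means[None, :], scale=sigma)
    logA = np.log(A)

    delta = np.zeros((T, m))                 # best log-score ending in each class
    back = np.zeros((T, m), dtype=int)       # backpointers
    delta[0] = np.log(pi0) + logB[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA      # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```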
Supervised HMM learning
• Use your favorite classifier to learn p(C | X)
  – i.e., ignore the temporal aspect of the problem (temporarily)
• Now, estimate p(Ct | Ct-1) from labeled training data
• We have a fully operational HMM
  – no need to use EM for learning if class labels are provided (i.e., do "supervised HMM learning")
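A sketch of the second step, estimating p(Ct | Ct-1) from labeled data by counting consecutive label pairs and normalizing rows; the function name, the Laplace-smoothing option, and the list-of-arrays input format are assumptions made here for illustration.

```python
import numpy as np

def estimate_transitions(label_seqs, m, smoothing=1.0):
    """Row-normalized counts of consecutive (C_{t-1}, C_t) label pairs.

    label_seqs : list of integer label arrays, one per training sequence
    m          : number of classes
    """
    counts = np.full((m, m), smoothing)       # optional Laplace smoothing
    for prev_seq in label_seqs:
        for prev, cur in zip(prev_seq[:-1], prev_seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```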
Fault Diagnosis Application (Smyth, Pattern Recognition, 1994)
[Figure: HMM with hidden fault classes C1 ... CT and observed features X1 ... XT]

Fault detection in 34m antenna systems:
  Classes: {normal, short-circuit, tacho problem, ...}
  Features: AR coefficients measured every 2 seconds
  Classes are persistent over time
Approach and Results
• Classifiers
  – Gaussian model and neural network
  – trained on labeled "instantaneous window" data
• Markov component
  – transition probabilities estimated from MTBF data
• Results
  – the discriminative neural net was much better than the Gaussian model
  – the Markov component reduced the error rate (all false alarms) from 2% to 0%
Classification with and without the Markov context

We will compare what happens when
  (a) we just make decisions based on p(Ct | Xt) ("ignore context")
  (b) we use the full Markov context (i.e., use forward-backward to "integrate" temporal information)
(a concrete comparison is sketched below)

[Figure: the HMM chain of hidden classes C1 ... CT and observed features X1 ... XT, as before]
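To make (a) and (b) concrete, a small sketch reusing the forward_backward function above: the context-free decision uses only the per-time likelihoods (assuming equal class priors, an assumption made here), while the contextual decision uses the smoothed posteriors.

```python
import numpy as np
from scipy.stats import norm

def classify_without_context(x, means, sigma):
    # (a) argmax_c p(x_t | C_t = c): each time step is decided in isolation
    lik = norm.pdf(np.asarray(x)[:, None], loc=means[None, :], scale=sigma)
    return lik.argmax(axis=1)

def classify_with_context(x, pi0, A, means, sigma):
    # (b) argmax_c p(C_t = c | x_1..x_T), via the forward_backward sketch above
    return forward_backward(x, pi0, A, means, sigma).argmax(axis=1)
```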
[Figure: a two-component Gaussian mixture; left panel shows the individual component densities (Component 1, Component 2), right panel shows the mixture density p(x) vs x]
[Figure sequence: "Gaussian vs HMM Classification". Top panel: the observed sequence with the true states overlaid. Middle panel: posterior class probabilities from the HMM (forward-backward) and from the context-free Gaussian classifier. Bottom panels: the resulting HMM decoding and Gaussian decoding of the state sequence over t = 1 ... 100]
Simulation Experiments
Systematic Simulations
Simulation Setup
1. Two Gaussian classes, at mean 0 and mean 1
   => vary the "separation" (the distance between the means in units of sigma) by varying the sigma of the Gaussians
2. Markov dependence: A = [p  1-p ; 1-p  p]
   Vary p (the self-transition probability) = "strength of context"
Look at the Bayes error with and without context
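A Monte Carlo sketch of this setup, reusing the classify_* functions above: sample a long class sequence from the symmetric transition matrix, sample Gaussian features, and compare error rates with and without context. It estimates the errors by sampling rather than computing the Bayes error exactly, and all defaults are illustrative.

```python
import numpy as np

def simulate(p=0.9, sep=1.0, T=100_000, seed=0):
    """Estimate error rates with and without context.

    Uses the equivalent parameterization of class means 0 and `sep` with
    unit sigma, so `sep` is the separation in sigma units. Relies on the
    classify_without_context / classify_with_context sketches above.
    """
    rng = np.random.default_rng(seed)
    A = np.array([[p, 1 - p], [1 - p, p]])
    means, sigma, pi0 = np.array([0.0, sep]), 1.0, np.array([0.5, 0.5])

    # Sample the hidden class chain, then the observations.
    c = np.zeros(T, dtype=int)
    for t in range(1, T):
        c[t] = rng.choice(2, p=A[c[t - 1]])
    x = rng.normal(means[c], sigma)

    err_no_ctx = (classify_without_context(x, means, sigma) != c).mean()
    err_ctx = (classify_with_context(x, pi0, A, means, sigma) != c).mean()
    return err_no_ctx, err_ctx
```

As p approaches 1 the contextual error should fall well below the context-free error, mirroring the curves summarized in the figures that follow.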
[Figure: Class 1 and Class 2 densities p(x), separation = 3 sigma; Bayes error = 0.08]
[Figure: Class 1 and Class 2 densities p(x), separation = 1 sigma; Bayes error = 0.31]
[Figure: "Bayes Error vs. Markov Probability"; Bayes error rate as a function of the self-transition probability (0.5 to 1.0), for separations 0.1, 1, 2, and 4]
[Figure: "Bayes Error vs. Gaussian Separation"; Bayes error rate as a function of separation (0 to 4), for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]
[Figure: "% Reduction in Bayes Error vs. Gaussian Separation"; percent decrease in Bayes error as a function of separation, for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]
In summary….
• Context reduces error
  – greater Markov dependence => greater reduction
• The reduction is dramatic for p > 0.9
  – e.g., even with minimal Gaussian separation, the Bayes error can be reduced to zero!
Approximate Methods
• Forward-only
  – necessary in many applications
• "Two nearest-neighbors"
  – only use information from C(t-1) and C(t+1)
• How suboptimal are these methods? (a sketch of the forward-only approximation follows below)
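A sketch of the forward-only (filtering) approximation in the same notation as the forward_backward sketch above: it conditions only on x_1..x_t, so decisions can be made online; the function name and argument layout are assumptions made here.

```python
import numpy as np
from scipy.stats import norm

def forward_only(x, pi0, A, means, sigma):
    """Filtered posteriors p(C_t | x_1..x_t); argmax gives the online decision."""
    T, m = len(x), len(pi0)
    B = norm.pdf(np.asarray(x)[:, None], loc=means[None, :], scale=sigma)
    alpha = np.zeros((T, m))
    alpha[0] = pi0 * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]   # predict with A, update with p(x_t | C_t)
        alpha[t] /= alpha[t].sum()
    return alpha
```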
[Figure: "Bayes Error vs. Markov Probability"; Bayes error rate vs. the log-odds of the self-transition probability for FwBw, Fw, and NN2, separation = 1]
[Figure: "Bayes Error vs. Markov Probability"; Bayes error rate vs. the log-odds of the self-transition probability for FwBw, Fw, and NN2, separation = 0.25]
[Figure: "Bayes Error vs. Gaussian Separation"; error vs. separation for FwBw, Fw, and NN2 at self-transition = 0.99, together with a Bayes error reference curve]
[Figure: "Bayes Error vs. Gaussian Separation"; error vs. separation for FwBw, Fw, and NN2 at self-transition = 0.9, together with a Bayes error reference curve]
In summary (for approximations)….
• Forward only:
  – "tracks" the forward-backward reductions
  – generally gets much more than 50% of the gap between F-B and the context-free Bayes error
• 2-neighbors
  – typically worse than forward only
  – much worse for small separation
  – much worse for very high transition probabilities
    • does not converge to zero Bayes error
Extensions to “Simple” HMMs
Semi-Markov models: the duration in each state need not be geometric
Segmental Markov models: outputs within each state have a non-constant mean (a regression function)
Dynamic belief networks: allow arbitrary dependencies among classes and features
Stochastic grammars, spatial landmark models, etc.
[See the afternoon talks at this workshop for other approaches]
Conclusions
• Context is increasingly important in many classification applications
• Graphical models
  – HMMs are a simple and practical approach
  – graphical models provide a general-purpose language for context
• Theory/Simulation
  – the effect of context on the error rate can be dramatic
[Additional figures:
  – "Absolute Reduction in Bayes Error vs. Gaussian Separation" for self-transition probabilities 0.5, 0.9, 0.94, and 0.99
  – "Bayes Error vs. Markov Probability" (log-odds of self-transition) for FwBw, Fw, and NN2 at separation = 3
  – "Bayes Error vs. Gaussian Separation" for FwBw, Fw, and NN2 at self-transition = 0.7, with a Bayes error reference curve
  – "Absolute Reduction in Bayes Error vs. Gaussian Separation" for FwBw, Fw, and NN2 at self-transition = 0.99
  – "Percent Decrease in Bayes Error vs. Gaussian Separation" for FwBw, Fw, and NN2 at self-transition = 0.99]
Sketch of the PP algorithm in action

[Figure sequence: the PP algorithm illustrated on a small tree, with message-passing steps numbered 1 through 4]