Discriminative Training and Machine Learning Approaches
Machine Learning Lab, Dept. of CSIE, NCKU
Chih-Pin Liao
Discriminative Training
Our Concerns
Feature extraction and HMM modeling should be jointly performed.
A common objective function should be considered.
To alleviate model confusion and improve recognition performance, we should estimate HMMs using a discriminative criterion built from statistical theory.
Model parameters should be calculated rapidly without applying a descent algorithm.
MCE is a popular discriminative training algorithm developed for speech recognition and extended to other pattern recognition applications.
Rather than maximizing the likelihood of the observed data, MCE aims to directly minimize classification errors.
A gradient descent algorithm is used to estimate the HMM parameters.
Minimum Classification Error (MCE)
Procedure of training discriminative models using observations X:
Discriminant function: $g_j(X, \Lambda) = \log P(X \mid \lambda_j)$
Anti-discriminant function: $G_j(X, \Lambda) = \log \Bigl[ \frac{1}{C-1} \sum_{c \neq j} P(X \mid \lambda_c)^{\eta} \Bigr]^{1/\eta}$
Misclassification measure: $d_j(X, \Lambda) = -g_j(X, \Lambda) + G_j(X, \Lambda)$
where $\Lambda = \{\lambda_j\}$ denotes the set of HMM parameters.
MCE Training Procedure
The loss function is calculated by mapping $d_j(X, \Lambda)$ into a range between zero and one through a sigmoid function:
$$\ell_j(X, \Lambda) = \frac{1}{1 + \exp(-d_j(X, \Lambda))}$$
Minimize the expected loss, or classification error, to find the discriminative model:
$$\hat{\Lambda} = \arg\min_{\Lambda} E_X[\ell(X, \Lambda)] = \arg\min_{\Lambda} E_X\Bigl[\sum_{j=1}^{C} \ell_j(X, \Lambda)\, \mathbb{1}(X \in \mathcal{C}_j)\Bigr]$$
Expected Loss
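As a concrete illustration of the quantities above, here is a minimal NumPy sketch; the per-class log-likelihoods `log_liks` are assumed to be already computed (e.g., HMM forward scores), and `eta` and `gamma` are assumed smoothing/slope hyperparameters not fixed by the slides.

```python
import numpy as np

def mce_loss(log_liks, j, eta=1.0, gamma=1.0):
    """MCE quantities for target class j (eta, gamma: assumed hyperparameters)."""
    g_j = log_liks[j]                          # discriminant: log P(X | lambda_j)
    others = np.delete(log_liks, j)            # competing classes c != j
    # anti-discriminant: smoothed log-average of competing likelihoods
    G_j = np.log(np.mean(np.exp(eta * others))) / eta
    d_j = -g_j + G_j                           # misclassification measure
    loss = 1.0 / (1.0 + np.exp(-gamma * d_j))  # sigmoid loss in (0, 1)
    return d_j, loss

# Example: three classes, target class 0 is correctly the most likely
d, l = mce_loss(np.array([-10.2, -12.5, -11.8]), j=0)
print(d, l)  # d < 0 and loss < 0.5 when the target class wins
```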
Hypothesis Test
A new training criterion is derived from hypothesis testing theory.
We test a null hypothesis against an alternative hypothesis.
The optimal solution is obtained by a likelihood ratio test according to the Neyman-Pearson lemma.
A higher likelihood ratio implies stronger confidence in accepting the null hypothesis.
$$\mathrm{LR} = \frac{P(X \mid H_0)}{P(X \mid H_1)}$$
Likelihood Ratio Test
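In code, the test is a simple threshold comparison in the log domain; the threshold `tau` below is an assumed operating point (the Neyman-Pearson lemma would set it from the desired false-alarm rate).

```python
import numpy as np

def lr_test(log_p_h0, log_p_h1, tau=1.0):
    """Accept H0 when LR = P(X|H0) / P(X|H1) exceeds the threshold tau."""
    llr = log_p_h0 - log_p_h1            # log likelihood ratio
    return llr > np.log(tau), llr

accept, llr = lr_test(-10.2, -11.8)      # LLR = 1.6 > 0, so accept H0
```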
Null and alternative hypotheses:
$H_0$: observations X are from the target HMM state j
$H_1$: observations X are not from the target HMM state j
We develop discriminative HMM parameters for the target state against the non-target states.
The problem turns out to be verifying the goodness of the data alignment to the corresponding HMM states.
Hypotheses in HMM Training
Maximum Confidence Hidden Markov Model
MCHMM is estimated by maximizing the log likelihood ratio, or the confidence measure,
$$\hat{\Lambda}_{\mathrm{MC}} = \arg\max_{\Lambda} \mathrm{LLR}(X \mid \Lambda) = \arg\max_{\Lambda}\, \bigl[\log P(X \mid H_0, \Lambda) - \log P(X \mid H_1, \Lambda)\bigr]$$
where the parameter set consists of the HMM parameters and the transformation matrix, $\Lambda = \{\omega_{jk}, \mu_{jk}, \Sigma_{jk}, W\}$.
Maximum Confidence HMM
The expectation-maximization (EM) algorithm is applied to tackle the missing data problem for maximum confidence estimation.
E-step: compute the auxiliary function
$$Q(\Lambda' \mid \Lambda) = E[\mathrm{LLR}(X, S \mid \Lambda') \mid X, \Lambda] = \sum_{S} P(S \mid X, \Lambda)\, \mathrm{LLR}(X, S \mid \Lambda')$$
where, per frame, the anti-likelihood of state $j$ is built from the smoothed average $\frac{1}{C-1}\sum_{c \neq j} P(x_t \mid \lambda_c)$ over the competing states.
Hybrid Parameter Estimation
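The E-step requires the state posteriors $P(s_t = j \mid X, \Lambda)$. Below is a minimal forward-backward sketch for a discrete-observation HMM; a Gaussian-mixture HMM would replace the emission lookup with mixture densities and also accumulate the per-mixture posteriors $\zeta_t(j,k)$. All names and shapes are illustrative assumptions.

```python
import numpy as np

def state_posteriors(pi, A, B, obs):
    """Forward-backward posteriors gamma[t, j] = P(s_t = j | X).

    pi  : (N,)   initial state probabilities
    A   : (N, N) transitions A[i, j] = P(s_t = j | s_{t-1} = i)
    B   : (N, M) emissions   B[j, o] = P(x_t = o | s_t = j)
    obs : (T,)   integer observation sequence
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                   # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):          # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # (for long sequences, scale alpha/beta or work in the log domain)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```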
The expectation function decomposes into a term over the HMM parameters and a term over the transformation matrix:
$$Q(\Lambda' \mid \Lambda) = Q_g(\{\omega_{jk}, \mu_{jk}, \Sigma_{jk}\}) + Q(W)$$
For Gaussian mixture observation densities with transformed features $W x_t$ and posteriors $\zeta_t(j,k) = P(s_t = j, m_t = k \mid X, \Lambda)$,
$$Q_g = \sum_{t=1}^{T}\sum_{k=1}^{K} \zeta_t(j,k)\Bigl[\log \omega_{jk} + \log|W| - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_{jk}| - \tfrac{1}{2}(W x_t - \mu_{jk})^{\top}\Sigma_{jk}^{-1}(W x_t - \mu_{jk})\Bigr]$$
$$\qquad - \frac{1}{C-1}\sum_{t=1}^{T}\sum_{c \neq j}\sum_{k=1}^{K} \zeta_t(c,k)\Bigl[\log \omega_{ck} + \log|W| - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_{ck}| - \tfrac{1}{2}(W x_t - \mu_{ck})^{\top}\Sigma_{ck}^{-1}(W x_t - \mu_{ck})\Bigr]$$
Expectation Function
Define the discriminative counts $\tilde{\zeta}_t(j,k) = \zeta_t(j,k) - \frac{1}{C-1}\sum_{c \neq j}\zeta_t(c,k)$. The MC estimates of the mixture weights and means are
$$\hat{\omega}_{jk} = \frac{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)}{\sum_{t=1}^{T}\sum_{k=1}^{K} \tilde{\zeta}_t(j,k)}$$
$$\hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)\, W x_t}{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)}$$
MC Estimates of HMM Parameters
The MC estimate of the covariance matrices is
$$\hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T}\Bigl[\zeta_t(j,k)(W x_t - \hat{\mu}_{jk})(W x_t - \hat{\mu}_{jk})^{\top} - \frac{1}{C-1}\sum_{c \neq j}\zeta_t(c,k)(W x_t - \hat{\mu}_{ck})(W x_t - \hat{\mu}_{ck})^{\top}\Bigr]}{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)}$$
MC Estimates of HMM Parameters (cont.)
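In code, the re-estimates amount to replacing the usual ML soft counts with the discriminative counts $\tilde{\zeta}_t(j,k)$. A minimal sketch with assumed array shapes follows; for brevity it centers the competing-state terms on the target means (the slide's covariance estimate centers each term on its own mixture mean), and it omits the flooring/smoothing a real implementation needs, since discriminative counts can be negative.

```python
import numpy as np

def mc_reestimate(zeta, j, X, W):
    """MC re-estimates for the mixtures of target state j.

    zeta : (T, C, K) posteriors zeta[t, c, k] = P(s_t = c, m_t = k | X, Lambda)
    X    : (T, d)    raw observation vectors x_t
    W    : (d, d)    current feature transformation matrix
    """
    T, C, K = zeta.shape
    Z = X @ W.T                                   # transformed features W x_t
    comp = (zeta.sum(axis=1) - zeta[:, j, :]) / (C - 1)
    counts = zeta[:, j, :] - comp                 # discriminative counts (T, K)
    denom = counts.sum(axis=0)
    omega = denom / denom.sum()                   # mixture weights
    mu = (counts.T @ Z) / denom[:, None]          # means (K, d)
    Sigma = np.empty((K, Z.shape[1], Z.shape[1]))
    for k in range(K):
        D = Z - mu[k]
        Sigma[k] = (counts[:, k, None] * D).T @ D / denom[k]
    return omega, mu, Sigma
```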
The transformation matrix is updated by gradient ascent:
$$W^{(i+1)} = W^{(i)} + \epsilon\, \frac{\partial Q_g(W)}{\partial W}\bigg|_{W = W^{(i)}}$$
where the $\log|W|$ terms contribute $T\,(W^{\top})^{-1}$ to the gradient and the quadratic terms contribute the accumulated statistics $-\Sigma_{jk}^{-1}(W x_t - \mu_{jk})\, x_t^{\top}$, summed over $j = 1, \dots, C$ and $k = 1, \dots, K$.
MC Estimate of Transformation Matrix
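A sketch of one such gradient-ascent step under the $Q_g$ above; `eps` is an assumed step size, and for brevity only the target-state terms appear in the gradient (the full gradient subtracts the analogous competing-state statistics weighted by $1/(C-1)$).

```python
import numpy as np

def update_W(W, X, counts, mu, Sigma_inv, eps=1e-4):
    """One gradient-ascent step on Q_g(W) for the target-state terms.

    X         : (T, d)    raw observations
    counts    : (T, K)    discriminative counts for the target state
    mu        : (K, d)    mixture means
    Sigma_inv : (K, d, d) inverse mixture covariances
    """
    Z = X @ W.T
    grad = counts.sum() * np.linalg.inv(W).T      # from the log|W| terms
    for k in range(mu.shape[0]):
        D = Z - mu[k]                             # residuals W x_t - mu_k
        # from the quadratic terms: -Sigma_k^{-1} (W x_t - mu_k) x_t^T
        grad -= (Sigma_inv[k] @ (counts[:, k] * D.T)) @ X
    return W + eps * grad
```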
[Figure: MC training procedure]
1. Extract training features from face images and apply uniform segmentation.
2. Estimate the initial HMM parameters.
3. Initialize W, then estimate the transformation matrix W with the GPD algorithm, iterating $W^{(t+1)} = W^{(t)} + \epsilon\, \partial Q(W \mid \Lambda)/\partial W$ until W converges.
4. Extract features with the estimated W from the observations and transform the HMM parameters with W.
5. Run Viterbi decoding; if not converged, return to step 3. On convergence, output the MC-based HMM parameters.
MC Classification Rule
Let Y denote an input test image. We apply the same criterion to identify the most likely category corresponding to Y:
$$\hat{c}_{\mathrm{MC}} = \arg\max_{c} \mathrm{LLR}(Y \mid \Lambda_c)$$
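In code, the rule is a per-class LLR argmax; the per-class scoring functions below are assumed to come from the trained target and anti-target models.

```python
import numpy as np

def mc_classify(Y, log_p_target, log_p_anti):
    """Pick the class with the largest log likelihood ratio LLR(Y | Lambda_c).

    log_p_target[c] : callable returning log P(Y | H0, Lambda_c)
    log_p_anti[c]   : callable returning log P(Y | H1, Lambda_c)
    """
    llr = [f(Y) - g(Y) for f, g in zip(log_p_target, log_p_anti)]
    return int(np.argmax(llr))
```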
Summary
A new maximum confidence HMM framework was proposed.
The hypothesis test principle was used for building the training criterion.
Discriminative feature extraction and HMM modeling were performed under the same criterion.
Chien, Jen-Tzung and Liao, Chih-Pin, "Maximum Confidence Hidden Markov Modeling for Face Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 606-616, April 2008.
Machine Learning Approaches
Introduction
Conditional Random Fields (CRFs):
- relax the usual conditional independence assumption of the likelihood model
- enforce the homogeneity of the labeling variables conditioned on the observation
Due to the weak assumptions of the CRF model and its discriminative nature, it:
- allows arbitrary relationships among the data
- may require fewer resources to train its parameters
CRF models perform better than Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) on:
- language and text processing problems
- object recognition problems
- image and video segmentation
- tracking problems in video sequences
Generative & Discriminative Models
Two Classes of Models
Generative model (HMM): models the distribution of states,
$$P(S \mid X) \propto P(X \mid S)\, P(S)$$
Direct model (MEMM and CRF): models the posterior probability directly,
$$\hat{S} = \arg\max_{S} P(S \mid X)$$
[Figure: graphical structures of the HMM, MEMM, and CRF over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
Comparisons of the Two Kinds of Models
Generative model (HMM):
- uses the Bayes rule approximation
- assumes that observations are independent
- multiple overlapping features are not modeled
- the state sequence is found through the recursive Viterbi algorithm
$$\delta_t(s) = \max_{s' \in S} \delta_{t-1}(s')\, P(s \mid s')\, P(x_t \mid s)$$
Direct model (MEMM and CRF):
- models the posterior probability directly
- dependencies among observations are flexibly modeled
- the state sequence is found through the recursive Viterbi algorithm
$$\delta_t(s) = \max_{s' \in S} \delta_{t-1}(s')\, P(s \mid s', x_t)$$
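To make the contrast concrete, here are the two recursions side by side in log-domain NumPy; the array shapes and the `P_trans` callback are assumptions, and backpointers are omitted for brevity.

```python
import numpy as np

def viterbi_hmm(pi, A, B, obs):
    """delta_t(s) = max_{s'} delta_{t-1}(s') P(s|s') P(x_t|s)."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    for o in obs[1:]:
        delta = np.max(delta[:, None] + np.log(A), axis=0) + np.log(B[:, o])
    return delta                       # final scores; argmax gives the best end state

def viterbi_memm(pi, P_trans, obs):
    """delta_t(s) = max_{s'} delta_{t-1}(s') P(s|s', x_t).

    P_trans(x) returns the (N, N) matrix P(s_t = s | s_{t-1} = s', x_t = x).
    """
    delta = np.log(pi)                 # prior over the initial state
    for o in obs:
        delta = np.max(delta[:, None] + np.log(P_trans(o)), axis=0)
    return delta
```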
Hidden Markov Model & Maximum Entropy Markov Model
HMM for Human Motion Recognition
An HMM is defined by:
- the transition probability $p(s_t \mid s_{t-1})$
- the observation probability $p(x_t \mid s_t)$
[Figure: HMM graphical structure]
Maximum Entropy Markov Model
An MEMM is defined by $p(s_t \mid s_{t-1}, x_t)$, which replaces the transition and observation probabilities of the HMM.
[Figure: MEMM graphical structure]
Maximum Entropy Criterion
Definition of the feature functions:
$$f_{\langle b, s' \rangle}(c_t, s_t) = \begin{cases} 1 & \text{if } b(c_t) \text{ is true and } s_t = s' \\ 0 & \text{otherwise} \end{cases}$$
where $b$ is a binary predicate on the context $c_t = \{x_t, s_{t-1}\}$.
Constrained optimization problem: find the model such that $\forall f_i:\ E[f_i] = \tilde{E}[f_i]$, where the empirical expectation is
$$\tilde{E}[f_i] = \sum_{c, s} \tilde{p}(c, s)\, f_i(c, s) = \frac{1}{N}\sum_{j=1}^{N} f_i(c_j, s_j)$$
and the model expectation is
$$E[f_i] = \sum_{c \in \mathcal{C},\, s \in \mathcal{V}} \tilde{p}(c)\, p(s \mid c)\, f_i(c, s)$$
Solution of MEMM
Lagrange multipliers are used for the constrained optimization:
$$\Lambda(p, \lambda) = H(p(s \mid c)) + \sum_i \lambda_i \bigl(E[f_i] - \tilde{E}[f_i]\bigr)$$
where $\lambda = \{\lambda_i\}$ are the model parameters and
$$H(p(s \mid c)) = -\sum_{c \in \mathcal{C},\, s \in \mathcal{V}} \tilde{p}(c)\, p(s \mid c) \log p(s \mid c)$$
The solution is obtained as the exponential model
$$p(s \mid c) = \frac{1}{Z_\lambda(c)} \exp\Bigl(\sum_i \lambda_i f_i(c, s)\Bigr), \qquad Z_\lambda(c) = \sum_{s \in S} \exp\Bigl(\sum_i \lambda_i f_i(c, s)\Bigr)$$
GIS Algorithm
Optimize the Maximum Mutual Information (MMI) criterion.
Step 1: Calculate the empirical expectation
$$\tilde{E}[f_i] = \frac{1}{N} \sum_{j=1}^{N} f_i(c_j, s_j)$$
Step 2: Start from an initial value $\lambda_i^{(0)} = 1$.
Step 3: Calculate the model expectation
$$E[f_i] = \frac{1}{N} \sum_{c \in \mathcal{C},\, s \in \mathcal{V}} p(s \mid c)\, f_i(c, s)$$
Step 4: Update the model parameters
$$\lambda_i^{(\mathrm{new})} = \lambda_i^{(\mathrm{current})} + \log \frac{\tilde{E}[f_i]}{E[f_i]}$$
Repeat steps 3 and 4 until convergence.
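A compact GIS sketch for the maximum-entropy model above. It assumes integer-coded contexts and labels, binary features, and uses the maximum per-event feature sum as the GIS constant (a common practical choice; the slide's unscaled update corresponds to a constant of 1). All names are illustrative.

```python
import numpy as np

def gis(feats, data, n_labels, n_iter=100):
    """Generalized Iterative Scaling for p(s|c) = exp(sum_i lam_i f_i(c,s)) / Z(c).

    feats : list of binary feature functions f_i(context, label) -> 0 or 1
    data  : list of (context, label) pairs with integer-coded contexts/labels
    """
    F = len(feats)
    contexts = sorted({c for c, _ in data})
    # feature table Phi[c, s, i] = f_i(c, s) over all events
    Phi = np.array([[[f(c, s) for f in feats] for s in range(n_labels)]
                    for c in contexts], dtype=float)
    C_sharp = Phi.sum(axis=2).max()        # GIS constant: max feature sum per event
    # Step 1: empirical expectation E~[f_i]
    emp = np.mean([[f(c, s) for f in feats] for c, s in data], axis=0)
    p_c = np.array([sum(1 for c2, _ in data if c2 == c) for c in contexts])
    p_c = p_c / len(data)                  # empirical context distribution
    lam = np.ones(F)                       # Step 2: initial value lam_i = 1
    for _ in range(n_iter):
        scores = Phi @ lam                 # (|contexts|, n_labels)
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)  # model p(s|c)
        # Step 3: model expectation E[f_i]
        model = np.einsum('c,cs,csf->f', p_c, p, Phi)
        # Step 4: update (assumes every feature fires somewhere in the data)
        lam += np.log(emp / model) / C_sharp
    return lam
```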
Conditional Random Field
Conditional Random Field
Definition: let $G = (V, E)$ be a graph such that $S = (S_v)_{v \in V}$. When conditioned on $X$, the variables $S_v$ obey the Markov property
$$p(S_v \mid X, S_w, w \neq v) = p(S_v \mid X, S_w, w \sim v)$$
where $w \sim v$ means $w$ is a neighbor of $v$. Then $(X, S)$ is a conditional random field.
[Figure: linear-chain CRF graphical structure]
CRF Model Parameters
The undirected graphical structure can be used to factorize $p(S \mid X)$ into a normalized product of potential functions.
Considering the graph as a linear-chain structure,
$$p(S \mid X, \lambda, \mu) \propto \exp\Bigl(\sum_{e \in E,\, i} \lambda_i f_i(e, S|_e, X) + \sum_{v \in V,\, j} \mu_j g_j(v, S|_v, X)\Bigr)$$
Model parameter set: $\{\lambda_1, \lambda_2, \dots;\ \mu_1, \mu_2, \dots\}$
Feature function set: $\{f_1, f_2, \dots;\ g_1, g_2, \dots\}$
CRF Parameter Estimation
We can rewrite and maximize the posterior probability
$$p(S \mid X) = \frac{1}{Z(X)} \exp\Bigl(\sum_k \gamma_k F_k(S, X)\Bigr)$$
where $\{\gamma_1, \gamma_2, \dots\} = \{\lambda_1, \lambda_2, \dots;\ \mu_1, \mu_2, \dots\}$ and $\{F_1, F_2, \dots\} = \{f_1, f_2, \dots;\ g_1, g_2, \dots\}$.
The log posterior probability over the training pairs $(S^{(j)}, X^{(j)})$ is given by
$$L(\gamma) = \sum_j \Bigl[\sum_k \gamma_k F_k(S^{(j)}, X^{(j)}) - \log Z(X^{(j)})\Bigr]$$
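For a linear-chain CRF, $Z(X)$ is computed exactly by a forward pass over the chain. A minimal log-domain sketch follows; `unary` and `trans` are assumed to be the already-formed scores $\sum_j \mu_j g_j$ and $\sum_i \lambda_i f_i$, shared across positions.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_posterior(unary, trans, labels):
    """log p(S|X) for a linear-chain CRF in the log domain.

    unary  : (T, N) per-position label scores
    trans  : (N, N) transition scores between adjacent labels
    labels : (T,)   candidate label sequence S
    """
    T = unary.shape[0]
    score = unary[0, labels[0]]            # unnormalized score of the labeling
    for t in range(1, T):
        score += trans[labels[t - 1], labels[t]] + unary[t, labels[t]]
    alpha = unary[0].copy()                # forward recursion for log Z(X)
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return score - logsumexp(alpha)
```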
Parameter Updating by GIS Algorithm
Differentiating the log posterior probability with respect to parameter $\gamma_k$,
$$\frac{\partial L(\gamma)}{\partial \gamma_k} = E_{\tilde{p}(S, X)}[F_k(S, X)] - E_{p(S \mid X, \gamma)}[F_k(S, X)]$$
Setting this derivative to zero yields the constraint in the maximum entropy model.
This estimation has no closed-form solution, so we can use the GIS algorithm.
CRF vs. MEMM
Differences:
- Objective function: CRF maximizes the posterior probability $p(S \mid X)$ with a Gibbs distribution; MEMM maximizes entropy under constraints, modeling $p(s_t \mid s_{t-1}, x_t)$.
- Complexity of calculating the normalization term: CRF, full $O(|s|^N)$, with DP $O(|s|^2 N)$, N-best $O(k)$, top one $O(1)$; MEMM, local normalization in $O(|s| N)$.
- Inference in the model: Viterbi-style recursion in both.
Similarities:
- Feature functions defined on state & observation and state & state pairs.
- Parameters are the weights of the feature functions.
- Both take the form of a Gibbs distribution.
Summary and Future Works
We construct a complex CRF with cycles for better modeling of contextual dependency; a graphical model algorithm is applied.
In the future, a variational inference algorithm will be developed to improve the calculation of the conditional probability.
The posterior probability can then be calculated directly by an approximation approach.
Liao, Chih-Pin and Chien, Jen-Tzung, "Graphical Modeling of Conditional Random Fields for Human Motion Recognition," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 1969-1972.
Thanks for your attention and discussion.