Discriminative Training and Machine Learning Approaches
Machine Learning Lab, Dept. of CSIE, NCKU
Chih-Pin Liao
Discriminative Training
Our Concerns
Feature extraction and HMM modeling should be jointly performed.
A common objective function should be considered.
To alleviate model confusion and improve recognition performance, we should estimate HMMs using a discriminative criterion built from statistical theory.
Model parameters should be calculated rapidly without applying a descent algorithm.
MCE is a popular discriminative training algorithm developed for speech recognition and extended to other pattern recognition applications.
Rather than maximizing the likelihood of the observed data, MCE aims to directly minimize classification errors.
A gradient descent algorithm is used to estimate the HMM parameters.
Minimum Classification Error (MCE)
Procedure of training discriminative models using observations X:
Discriminant function: $g_j(X, \Lambda) = \log P(X \mid \lambda_j)$
Anti-discriminant function: $G_j(X, \Lambda) = \log \Bigl[ \frac{1}{C-1} \sum_{c \neq j} P(X \mid \lambda_c)^{\eta} \Bigr]^{1/\eta}$
Misclassification measure: $d_j(X, \Lambda) = -g_j(X, \Lambda) + G_j(X, \Lambda)$
where $\Lambda = \{\lambda_j\}$ denotes the set of HMM parameters.
MCE Training Procedure
The loss function is calculated by mapping $d_j(X, \Lambda)$ into a range between zero and one through a sigmoid function:
$$\ell_j(X, \Lambda) = \frac{1}{1 + \exp(-d_j(X, \Lambda))}$$
Minimize the expected loss, or classification error, to find the discriminative model:
$$\hat{\Lambda} = \arg\min_{\Lambda} E_X[\ell(X, \Lambda)] = \arg\min_{\Lambda} E_X\Bigl[\sum_{j=1}^{C} \ell_j(X, \Lambda)\, \mathbb{1}(X \in \mathcal{C}_j)\Bigr]$$
Expected Loss
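As a concrete illustration of the quantities above, here is a minimal NumPy sketch; the per-class log-likelihoods `log_liks` are assumed to be already computed (e.g., HMM forward scores), and `eta` and `gamma` are assumed smoothing/slope hyperparameters not fixed by the slides.

```python
import numpy as np

def mce_loss(log_liks, j, eta=1.0, gamma=1.0):
    """MCE quantities for target class j (eta, gamma: assumed hyperparameters)."""
    g_j = log_liks[j]                          # discriminant: log P(X | lambda_j)
    others = np.delete(log_liks, j)            # competing classes c != j
    # anti-discriminant: smoothed log-average of competing likelihoods
    G_j = np.log(np.mean(np.exp(eta * others))) / eta
    d_j = -g_j + G_j                           # misclassification measure
    loss = 1.0 / (1.0 + np.exp(-gamma * d_j))  # sigmoid loss in (0, 1)
    return d_j, loss

# Example: three classes, target class 0 is correctly the most likely
d, l = mce_loss(np.array([-10.2, -12.5, -11.8]), j=0)
print(d, l)  # d < 0 and loss < 0.5 when the target class wins
```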
Hypothesis Test
A new training criterion is derived from hypothesis testing theory.
We test a null hypothesis against an alternative hypothesis.
The optimal solution is obtained by a likelihood ratio test according to the Neyman-Pearson lemma.
A higher likelihood ratio implies stronger confidence in accepting the null hypothesis.
$$\mathrm{LR} = \frac{P(X \mid H_0)}{P(X \mid H_1)}$$
Likelihood Ratio Test
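In code, the test is a simple threshold comparison in the log domain; the threshold `tau` below is an assumed operating point (the Neyman-Pearson lemma would set it from the desired false-alarm rate).

```python
import numpy as np

def lr_test(log_p_h0, log_p_h1, tau=1.0):
    """Accept H0 when LR = P(X|H0) / P(X|H1) exceeds the threshold tau."""
    llr = log_p_h0 - log_p_h1            # log likelihood ratio
    return llr > np.log(tau), llr

accept, llr = lr_test(-10.2, -11.8)      # LLR = 1.6 > 0, so accept H0
```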
Null and alternative hypotheses:
$H_0$: observations X are from the target HMM state j
$H_1$: observations X are not from the target HMM state j
We develop discriminative HMM parameters for the target state against the non-target states.
The problem turns out to be verifying the goodness of the data alignment to the corresponding HMM states.
Hypotheses in HMM Training
Maximum Confidence Hidden Markov Model
MCHMM is estimated by maximizing the log likelihood ratio, or the confidence measure,
$$\hat{\Lambda}_{\mathrm{MC}} = \arg\max_{\Lambda} \mathrm{LLR}(X \mid \Lambda) = \arg\max_{\Lambda}\, \bigl[\log P(X \mid H_0, \Lambda) - \log P(X \mid H_1, \Lambda)\bigr]$$
where the parameter set consists of the HMM parameters and the transformation matrix, $\Lambda = \{\omega_{jk}, \mu_{jk}, \Sigma_{jk}, W\}$.
Maximum Confidence HMM
The expectation-maximization (EM) algorithm is applied to tackle the missing data problem for maximum confidence estimation.
E-step: compute the auxiliary function
$$Q(\Lambda' \mid \Lambda) = E[\mathrm{LLR}(X, S \mid \Lambda') \mid X, \Lambda] = \sum_{S} P(S \mid X, \Lambda)\, \mathrm{LLR}(X, S \mid \Lambda')$$
where, per frame, the anti-likelihood of state $j$ is built from the smoothed average $\frac{1}{C-1}\sum_{c \neq j} P(x_t \mid \lambda_c)$ over the competing states.
Hybrid Parameter Estimation
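The E-step requires the state posteriors $P(s_t = j \mid X, \Lambda)$. Below is a minimal forward-backward sketch for a discrete-observation HMM; a Gaussian-mixture HMM would replace the emission lookup with mixture densities and also accumulate the per-mixture posteriors $\zeta_t(j,k)$. All names and shapes are illustrative assumptions.

```python
import numpy as np

def state_posteriors(pi, A, B, obs):
    """Forward-backward posteriors gamma[t, j] = P(s_t = j | X).

    pi  : (N,)   initial state probabilities
    A   : (N, N) transitions A[i, j] = P(s_t = j | s_{t-1} = i)
    B   : (N, M) emissions   B[j, o] = P(x_t = o | s_t = j)
    obs : (T,)   integer observation sequence
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                   # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):          # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # (for long sequences, scale alpha/beta or work in the log domain)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```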
The expectation function decomposes into a term over the HMM parameters and a term over the transformation matrix:
$$Q(\Lambda' \mid \Lambda) = Q_g(\{\omega_{jk}, \mu_{jk}, \Sigma_{jk}\}) + Q(W)$$
For Gaussian mixture observation densities with transformed features $W x_t$ and posteriors $\zeta_t(j,k) = P(s_t = j, m_t = k \mid X, \Lambda)$,
$$Q_g = \sum_{t=1}^{T}\sum_{k=1}^{K} \zeta_t(j,k)\Bigl[\log \omega_{jk} + \log|W| - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_{jk}| - \tfrac{1}{2}(W x_t - \mu_{jk})^{\top}\Sigma_{jk}^{-1}(W x_t - \mu_{jk})\Bigr]$$
$$\qquad - \frac{1}{C-1}\sum_{t=1}^{T}\sum_{c \neq j}\sum_{k=1}^{K} \zeta_t(c,k)\Bigl[\log \omega_{ck} + \log|W| - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_{ck}| - \tfrac{1}{2}(W x_t - \mu_{ck})^{\top}\Sigma_{ck}^{-1}(W x_t - \mu_{ck})\Bigr]$$
Expectation Function
Define the discriminative counts $\tilde{\zeta}_t(j,k) = \zeta_t(j,k) - \frac{1}{C-1}\sum_{c \neq j}\zeta_t(c,k)$. The MC estimates of the mixture weights and means are
$$\hat{\omega}_{jk} = \frac{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)}{\sum_{t=1}^{T}\sum_{k=1}^{K} \tilde{\zeta}_t(j,k)}$$
$$\hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)\, W x_t}{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)}$$
MC Estimates of HMM Parameters
The MC estimate of the covariance matrices is
$$\hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T}\Bigl[\zeta_t(j,k)(W x_t - \hat{\mu}_{jk})(W x_t - \hat{\mu}_{jk})^{\top} - \frac{1}{C-1}\sum_{c \neq j}\zeta_t(c,k)(W x_t - \hat{\mu}_{ck})(W x_t - \hat{\mu}_{ck})^{\top}\Bigr]}{\sum_{t=1}^{T} \tilde{\zeta}_t(j,k)}$$
MC Estimates of HMM Parameters (cont.)
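In code, the re-estimates amount to replacing the usual ML soft counts with the discriminative counts $\tilde{\zeta}_t(j,k)$. A minimal sketch with assumed array shapes follows; for brevity it centers the competing-state terms on the target means (the slide's covariance estimate centers each term on its own mixture mean), and it omits the flooring/smoothing a real implementation needs, since discriminative counts can be negative.

```python
import numpy as np

def mc_reestimate(zeta, j, X, W):
    """MC re-estimates for the mixtures of target state j.

    zeta : (T, C, K) posteriors zeta[t, c, k] = P(s_t = c, m_t = k | X, Lambda)
    X    : (T, d)    raw observation vectors x_t
    W    : (d, d)    current feature transformation matrix
    """
    T, C, K = zeta.shape
    Z = X @ W.T                                   # transformed features W x_t
    comp = (zeta.sum(axis=1) - zeta[:, j, :]) / (C - 1)
    counts = zeta[:, j, :] - comp                 # discriminative counts (T, K)
    denom = counts.sum(axis=0)
    omega = denom / denom.sum()                   # mixture weights
    mu = (counts.T @ Z) / denom[:, None]          # means (K, d)
    Sigma = np.empty((K, Z.shape[1], Z.shape[1]))
    for k in range(K):
        D = Z - mu[k]
        Sigma[k] = (counts[:, k, None] * D).T @ D / denom[k]
    return omega, mu, Sigma
```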
The transformation matrix is updated by gradient ascent:
$$W^{(i+1)} = W^{(i)} + \epsilon\, \frac{\partial Q_g(W)}{\partial W}\bigg|_{W = W^{(i)}}$$
where the $\log|W|$ terms contribute $T\,(W^{\top})^{-1}$ to the gradient and the quadratic terms contribute the accumulated statistics $-\Sigma_{jk}^{-1}(W x_t - \mu_{jk})\, x_t^{\top}$, summed over $j = 1, \dots, C$ and $k = 1, \dots, K$.
MC Estimate of Transformation Matrix
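A sketch of one such gradient-ascent step under the $Q_g$ above; `eps` is an assumed step size, and for brevity only the target-state terms appear in the gradient (the full gradient subtracts the analogous competing-state statistics weighted by $1/(C-1)$).

```python
import numpy as np

def update_W(W, X, counts, mu, Sigma_inv, eps=1e-4):
    """One gradient-ascent step on Q_g(W) for the target-state terms.

    X         : (T, d)    raw observations
    counts    : (T, K)    discriminative counts for the target state
    mu        : (K, d)    mixture means
    Sigma_inv : (K, d, d) inverse mixture covariances
    """
    Z = X @ W.T
    grad = counts.sum() * np.linalg.inv(W).T      # from the log|W| terms
    for k in range(mu.shape[0]):
        D = Z - mu[k]                             # residuals W x_t - mu_k
        # from the quadratic terms: -Sigma_k^{-1} (W x_t - mu_k) x_t^T
        grad -= (Sigma_inv[k] @ (counts[:, k] * D.T)) @ X
    return W + eps * grad
```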
[Figure: MC training procedure]
1. Extract training features from face images and apply uniform segmentation.
2. Estimate the initial HMM parameters.
3. Initialize W, then estimate the transformation matrix W with the GPD algorithm, iterating $W^{(t+1)} = W^{(t)} + \epsilon\, \partial Q(W \mid \Lambda)/\partial W$ until W converges.
4. Extract features with the estimated W from the observations and transform the HMM parameters with W.
5. Run Viterbi decoding; if not converged, return to step 3. On convergence, output the MC-based HMM parameters.
MC Classification Rule
Let Y denote an input test image. We apply the same criterion to identify the most likely category corresponding to Y:
$$\hat{c}_{\mathrm{MC}} = \arg\max_{c} \mathrm{LLR}(Y \mid \Lambda_c)$$
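In code, the rule is a per-class LLR argmax; the per-class scoring functions below are assumed to come from the trained target and anti-target models.

```python
import numpy as np

def mc_classify(Y, log_p_target, log_p_anti):
    """Pick the class with the largest log likelihood ratio LLR(Y | Lambda_c).

    log_p_target[c] : callable returning log P(Y | H0, Lambda_c)
    log_p_anti[c]   : callable returning log P(Y | H1, Lambda_c)
    """
    llr = [f(Y) - g(Y) for f, g in zip(log_p_target, log_p_anti)]
    return int(np.argmax(llr))
```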
Summary
A new maximum confidence HMM framework was proposed.
The hypothesis test principle was used for building the training criterion.
Discriminative feature extraction and HMM modeling were performed under the same criterion.
Chien, Jen-Tzung and Liao, Chih-Pin, "Maximum Confidence Hidden Markov Modeling for Face Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 606-616, April 2008.
Machine Learning Approaches
Introduction
Conditional Random Fields (CRFs):
- relax the usual conditional independence assumption of the likelihood model
- enforce the homogeneity of the labeling variables conditioned on the observation
Due to the weak assumptions of the CRF model and its discriminative nature, it:
- allows arbitrary relationships among the data
- may require fewer resources to train its parameters
CRF models perform better than Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) on:
- language and text processing problems
- object recognition problems
- image and video segmentation
- tracking problems in video sequences
Generative & Discriminative Models
Two Classes of Models
Generative model (HMM): models the distribution of states,
$$P(S \mid X) \propto P(X \mid S)\, P(S)$$
Direct model (MEMM and CRF): models the posterior probability directly,
$$\hat{S} = \arg\max_{S} P(S \mid X)$$
[Figure: graphical structures of the HMM, MEMM, and CRF over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
Comparisons of the Two Kinds of Models
Generative model (HMM):
- uses the Bayes rule approximation
- assumes that observations are independent
- multiple overlapping features are not modeled
- the state sequence is found through the recursive Viterbi algorithm
$$\delta_t(s) = \max_{s' \in S} \delta_{t-1}(s')\, P(s \mid s')\, P(x_t \mid s)$$
Direct model (MEMM and CRF):
- models the posterior probability directly
- dependencies among observations are flexibly modeled
- the state sequence is found through the recursive Viterbi algorithm
$$\delta_t(s) = \max_{s' \in S} \delta_{t-1}(s')\, P(s \mid s', x_t)$$
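To make the contrast concrete, here are the two recursions side by side in log-domain NumPy; the array shapes and the `P_trans` callback are assumptions, and backpointers are omitted for brevity.

```python
import numpy as np

def viterbi_hmm(pi, A, B, obs):
    """delta_t(s) = max_{s'} delta_{t-1}(s') P(s|s') P(x_t|s)."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    for o in obs[1:]:
        delta = np.max(delta[:, None] + np.log(A), axis=0) + np.log(B[:, o])
    return delta                       # final scores; argmax gives the best end state

def viterbi_memm(pi, P_trans, obs):
    """delta_t(s) = max_{s'} delta_{t-1}(s') P(s|s', x_t).

    P_trans(x) returns the (N, N) matrix P(s_t = s | s_{t-1} = s', x_t = x).
    """
    delta = np.log(pi)                 # prior over the initial state
    for o in obs:
        delta = np.max(delta[:, None] + np.log(P_trans(o)), axis=0)
    return delta
```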
Hidden Markov Model & Maximum Entropy Markov Model
HMM for Human Motion Recognition
An HMM is defined by:
- the transition probability $p(s_t \mid s_{t-1})$
- the observation probability $p(x_t \mid s_t)$
[Figure: HMM graphical structure]
Maximum Entropy Markov Model
An MEMM is defined by $p(s_t \mid s_{t-1}, x_t)$, which replaces the transition and observation probabilities of the HMM.
[Figure: MEMM graphical structure]
Maximum Entropy Criterion
Definition of the feature functions:
$$f_{\langle b, s' \rangle}(c_t, s_t) = \begin{cases} 1 & \text{if } b(c_t) \text{ is true and } s_t = s' \\ 0 & \text{otherwise} \end{cases}$$
where $b$ is a binary predicate on the context $c_t = \{x_t, s_{t-1}\}$.
Constrained optimization problem: find the model such that $\forall f_i:\ E[f_i] = \tilde{E}[f_i]$, where the empirical expectation is
$$\tilde{E}[f_i] = \sum_{c, s} \tilde{p}(c, s)\, f_i(c, s) = \frac{1}{N}\sum_{j=1}^{N} f_i(c_j, s_j)$$
and the model expectation is
$$E[f_i] = \sum_{c \in \mathcal{C},\, s \in \mathcal{V}} \tilde{p}(c)\, p(s \mid c)\, f_i(c, s)$$
Solution of MEMM
Lagrange multipliers are used for the constrained optimization:
$$\Lambda(p, \lambda) = H(p(s \mid c)) + \sum_i \lambda_i \bigl(E[f_i] - \tilde{E}[f_i]\bigr)$$
where $\lambda = \{\lambda_i\}$ are the model parameters and
$$H(p(s \mid c)) = -\sum_{c \in \mathcal{C},\, s \in \mathcal{V}} \tilde{p}(c)\, p(s \mid c) \log p(s \mid c)$$
The solution is obtained as the exponential model
$$p(s \mid c) = \frac{1}{Z_\lambda(c)} \exp\Bigl(\sum_i \lambda_i f_i(c, s)\Bigr), \qquad Z_\lambda(c) = \sum_{s \in S} \exp\Bigl(\sum_i \lambda_i f_i(c, s)\Bigr)$$
GIS Algorithm
Optimize the Maximum Mutual Information (MMI) criterion.
Step 1: Calculate the empirical expectation
$$\tilde{E}[f_i] = \frac{1}{N} \sum_{j=1}^{N} f_i(c_j, s_j)$$
Step 2: Start from an initial value $\lambda_i^{(0)} = 1$.
Step 3: Calculate the model expectation
$$E[f_i] = \frac{1}{N} \sum_{c \in \mathcal{C},\, s \in \mathcal{V}} p(s \mid c)\, f_i(c, s)$$
Step 4: Update the model parameters
$$\lambda_i^{(\mathrm{new})} = \lambda_i^{(\mathrm{current})} + \log \frac{\tilde{E}[f_i]}{E[f_i]}$$
Repeat steps 3 and 4 until convergence.
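A compact GIS sketch for the maximum-entropy model above. It assumes integer-coded contexts and labels, binary features, and uses the maximum per-event feature sum as the GIS constant (a common practical choice; the slide's unscaled update corresponds to a constant of 1). All names are illustrative.

```python
import numpy as np

def gis(feats, data, n_labels, n_iter=100):
    """Generalized Iterative Scaling for p(s|c) = exp(sum_i lam_i f_i(c,s)) / Z(c).

    feats : list of binary feature functions f_i(context, label) -> 0 or 1
    data  : list of (context, label) pairs with integer-coded contexts/labels
    """
    F = len(feats)
    contexts = sorted({c for c, _ in data})
    # feature table Phi[c, s, i] = f_i(c, s) over all events
    Phi = np.array([[[f(c, s) for f in feats] for s in range(n_labels)]
                    for c in contexts], dtype=float)
    C_sharp = Phi.sum(axis=2).max()        # GIS constant: max feature sum per event
    # Step 1: empirical expectation E~[f_i]
    emp = np.mean([[f(c, s) for f in feats] for c, s in data], axis=0)
    p_c = np.array([sum(1 for c2, _ in data if c2 == c) for c in contexts])
    p_c = p_c / len(data)                  # empirical context distribution
    lam = np.ones(F)                       # Step 2: initial value lam_i = 1
    for _ in range(n_iter):
        scores = Phi @ lam                 # (|contexts|, n_labels)
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)  # model p(s|c)
        # Step 3: model expectation E[f_i]
        model = np.einsum('c,cs,csf->f', p_c, p, Phi)
        # Step 4: update (assumes every feature fires somewhere in the data)
        lam += np.log(emp / model) / C_sharp
    return lam
```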
Conditional Random Field
Conditional Random Field
Definition: let $G = (V, E)$ be a graph such that $S = (S_v)_{v \in V}$. When conditioned on $X$, the variables $S_v$ obey the Markov property
$$p(S_v \mid X, S_w, w \neq v) = p(S_v \mid X, S_w, w \sim v)$$
where $w \sim v$ means $w$ is a neighbor of $v$. Then $(X, S)$ is a conditional random field.
[Figure: linear-chain CRF graphical structure]
CRF Model Parameters
The undirected graphical structure can be used to factorize $p(S \mid X)$ into a normalized product of potential functions.
Considering the graph as a linear-chain structure,
$$p(S \mid X, \lambda, \mu) \propto \exp\Bigl(\sum_{e \in E,\, i} \lambda_i f_i(e, S|_e, X) + \sum_{v \in V,\, j} \mu_j g_j(v, S|_v, X)\Bigr)$$
Model parameter set: $\{\lambda_1, \lambda_2, \dots;\ \mu_1, \mu_2, \dots\}$
Feature function set: $\{f_1, f_2, \dots;\ g_1, g_2, \dots\}$
CRF Parameter Estimation
We can rewrite and maximize the posterior probability
$$p(S \mid X) = \frac{1}{Z(X)} \exp\Bigl(\sum_k \gamma_k F_k(S, X)\Bigr)$$
where $\{\gamma_1, \gamma_2, \dots\} = \{\lambda_1, \lambda_2, \dots;\ \mu_1, \mu_2, \dots\}$ and $\{F_1, F_2, \dots\} = \{f_1, f_2, \dots;\ g_1, g_2, \dots\}$.
The log posterior probability over the training pairs $(S^{(j)}, X^{(j)})$ is given by
$$L(\gamma) = \sum_j \Bigl[\sum_k \gamma_k F_k(S^{(j)}, X^{(j)}) - \log Z(X^{(j)})\Bigr]$$
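For a linear-chain CRF, $Z(X)$ is computed exactly by a forward pass over the chain. A minimal log-domain sketch follows; `unary` and `trans` are assumed to be the already-formed scores $\sum_j \mu_j g_j$ and $\sum_i \lambda_i f_i$, shared across positions.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_posterior(unary, trans, labels):
    """log p(S|X) for a linear-chain CRF in the log domain.

    unary  : (T, N) per-position label scores
    trans  : (N, N) transition scores between adjacent labels
    labels : (T,)   candidate label sequence S
    """
    T = unary.shape[0]
    score = unary[0, labels[0]]            # unnormalized score of the labeling
    for t in range(1, T):
        score += trans[labels[t - 1], labels[t]] + unary[t, labels[t]]
    alpha = unary[0].copy()                # forward recursion for log Z(X)
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return score - logsumexp(alpha)
```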
Parameter Updating by GIS Algorithm
Differentiating the log posterior probability with respect to parameter $\gamma_k$,
$$\frac{\partial L(\gamma)}{\partial \gamma_k} = E_{\tilde{p}(S, X)}[F_k(S, X)] - E_{p(S \mid X, \gamma)}[F_k(S, X)]$$
Setting this derivative to zero yields the constraint in the maximum entropy model.
This estimation has no closed-form solution, so we can use the GIS algorithm.
CRF vs. MEMM
Differences:
- Objective function: CRF maximizes the posterior probability $p(S \mid X)$ with a Gibbs distribution; MEMM maximizes entropy under constraints, modeling $p(s_t \mid s_{t-1}, x_t)$.
- Complexity of calculating the normalization term: CRF, full $O(|s|^N)$, with DP $O(|s|^2 N)$, N-best $O(k)$, top one $O(1)$; MEMM, local normalization in $O(|s| N)$.
- Inference in the model: Viterbi-style recursion in both.
Similarities:
- Feature functions defined on state & observation and state & state pairs.
- Parameters are the weights of the feature functions.
- Both take the form of a Gibbs distribution.
Summary and Future Works
We construct a complex CRF with cycles for better modeling of contextual dependency; a graphical model algorithm is applied.
In the future, a variational inference algorithm will be developed to improve the calculation of the conditional probability.
The posterior probability can then be calculated directly by an approximation approach.
Liao, Chih-Pin and Chien, Jen-Tzung, "Graphical Modeling of Conditional Random Fields for Human Motion Recognition," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 1969-1972.
Thanks for your attention and discussion.