

Discriminative Learning of Generative Models for Sequence Classification and Motion Tracking

Minyoung Kim

January 2007


Contents

1 Introduction
1.1 Probabilistic Model-Based Approach
1.2 Generative vs. Discriminative Models

2 Discriminative Learning of Generative Models
2.1 Conditional Likelihood Maximization
2.1.1 CML Optimization
2.1.2 Example: Classification with Mixtures of Gaussians
2.1.3 Evaluation on Real Data
2.2 Margin Maximization

3 Discriminative Learning of Dynamical Systems
3.1 Linear Dynamical Systems
3.2 Discriminative Dynamic Models
3.2.1 Conditional Random Fields
3.2.2 Maximum Entropy Markov Models
3.3 Discriminative Learning of LDS
3.3.1 Conditional Likelihood Maximization (CML)
3.3.2 Slicewise Conditional Likelihood Maximization
3.3.3 Extension to Nonlinear Dynamical Systems
3.4 Related Work
3.5 Evaluation
3.5.1 Synthetic Data
3.5.2 Human Motion Data

4 Recursive Method for Discriminative Learning
4.1 Discriminative Mixture Learning
4.2 Related Work
4.3 Experiments
4.3.1 Synthetic Experiment
4.3.2 Experiments on Real Data

5 Future Work and Conclusion


List of Figures

1.1 Graphical Representation: Naive Bayes and Logistic Regression
1.2 Test errors vs. sample sizes (m) for Naive Bayes (solid lines) and Logistic Regression (dashed lines) on UCI datasets. Excerpted from [32].
1.3 Graphical Representation of HMM and CRF for Sequence Tagging
1.4 Test error scatter plots on synthetic data comparing HMMs and CRFs in sequence tagging. The open squares represent datasets generated from α < 1/2, and the solid circles from α > 1/2. Excerpted from [24].
2.1 Asymptotic behavior of the ML/CML learning: depending on the initial model, the ML and the CML sometimes reach good or bad models.
2.2 The generative models for static classification (TAN) and sequence classification (HMM).
2.3 Static Classification on UCI Data
2.4 Digit prototypes from the generative learning and the Max-Margin discriminative learning. Excerpted from [45].
3.1 Graphical Models: HMM (or LDS), CRF, and MEMM.
3.2 Visualization of estimated sequences for synthetic data. It shows the estimated states (for dim-1) at t = 136 ∼ 148. The ground truth is depicted by a solid (cyan) line, ML by dotted (blue), CML by dotted-dashed (red), and SCML by dashed (black).
3.3 Skeleton snapshots for walking (a−f), picking up a ball (g−l), and running (m−s): the ground truth is depicted by solid (cyan) lines, ML by dotted (blue), SCML by dashed (black), and the latent-variable nonlinear model (LVN) by dotted-dashed (red).


4.1 Data is generated by the distributions in the top panel (+ class in blue/dashed and − class in red/solid). The middle panel shows weights for the second component, both discriminative $w^{Dis}(c,a)$ and generative $w^{Gen}(c,a)$. The bottom panel displays the individual mixture components of the learned models. The generatively learned component $f^{Gen}_2(c,a)$ is contrasted with the discriminatively learned one, $f^{Dis}_2(c,a)$.
4.2 Example sequences generated by the true model.
4.3 Test error scatter plots comparing 7 models from Table 4.2. Each point corresponds to one of the 5 classification problems. For instance, a congregation of points below the main diagonal in the BxCML vs. ML case suggests that BxCML outperforms ML in most of the experimental evaluations. The (red) rectangles indicate the plots comparing BxCML with others.


List of Tables

2.1 Sequence Classification Test Accuracies (%): For the datasets evaluated with random-fold validation (Gun/Point and GT Gait), the averages and the standard deviations are included. The other datasets contain average leave-1-out test errors. Note that GT Gait and USF Set2 are the multi-class datasets. See Sec. ?? for details.
2.2 Sequence Tagging Test Accuracies (%): leave-1-out test errors.
2.3 MNIST Digit Classification Test Error (%). Excerpted from [45].
3.1 Test errors and log-perplexities for synthetic data.
3.2 Average test errors. The error types are abbreviated as 3 letters: the first indicates smoothed (S) or filtered (F), followed by 2 letters meaning that the error is measured in either the joint angle space (JA) or the 3D articulation point space (3P) (e.g., SJA = smoothed error in the joint angle space). The unit scale for the 3D point space is deemed as the height of the human model ∼ 25.
4.1 Average test errors (%), log-likelihoods (LL), and conditional log-likelihoods (CLL) on the test data are shown. BBN does not have LL or CLL since it is a non-generative classifier.
4.2 Test errors (%): For the datasets evaluated with random-fold validation (Gun/Point and GT Gait), the averages and the standard deviations are included. The other datasets contain average leave-1-out test errors. "–" indicates redundant since a multi-class method is to be applied for binary class data. (Note that GT Gait and USF Set2 are the multi-class datasets.) The boldfaced numbers indicate the lowest, within the margin of significance, test errors for a given dataset.


Abstract

I consider the issue of learning generative probabilistic models (e.g., Bayesian Networks) for the problems of classification and regression. As the generative models now serve as target-predicting functions, the learning problem can be treated differently from traditional density estimation. Unlike the likelihood-maximizing generative learning that fits a model to the overall data, discriminative learning is an alternative estimation method that optimizes objectives much more closely related to the prediction task (e.g., the conditional likelihood of target variables given input attributes). The contribution of this work is three-fold. First, for the family of general generative models, I provide a unifying parametric gradient-based optimization method for discriminative learning.

In the second part, not restricted to the classification problem with discrete targets, the method is applied to the continuous multivariate state domain, resulting in dynamical systems learned discriminatively. This is a very appealing approach toward structured state prediction problems such as motion tracking, in that the discriminative models in discrete domains (e.g., Conditional Random Fields or Maximum Entropy Markov Models) can be problematic to extend to handle continuous targets properly. For the CMU motion capture data, I evaluate the generalization performance of the proposed methods on the 3D human pose tracking problem from monocular videos.

Despite the improved prediction performance of discriminative learning, the parametric gradient-based optimization may have certain drawbacks such as computational overhead and sensitivity to the choice of the initial model. In the third part, I address these issues by introducing a novel recursive method for discriminative learning. The proposed method estimates a mixture of generative models, where the component to be added at each stage is selected in a greedy fashion, by the criterion of maximizing the conditional likelihood of the new mixture. The approach is highly efficient as it reduces to the generative learning of the base generative models on weighted data. Moreover, it is less sensitive to the initial model choice because it enhances the mixture model recursively. The improved classification performance of the proposed method is demonstrated in an extensive set of evaluations on time-series sequence data, including human motion classification problems.


Chapter 1

Introduction

One of the fundamental problems in machine learning is to predict the unknown or unseen nature y of the observation x. Depending on the structures of x and y, the problem has its own name, with related applications in the fields of pattern recognition, computer vision, natural language processing, and bioinformatics. In this proposal, I am particularly interested in the problems summarized as follows:

1. Static Classification: For a continuous or discrete vector x, classify x to y ∈ {1, ..., K}. E.g., face detection/recognition and medical diagnosis.

2. Sequence Classification: For a sequence of continuous or discrete vectors $x = x_1, \ldots, x_T$, classify x to y ∈ {1, ..., K}. E.g., DNA barcoding, human identification from motion, and motion classification.

3. Sequence Tagging: For a sequence of continuous or discrete vectors $x = x_1, \ldots, x_T$, estimate the discrete label sequence $y = y_1, \ldots, y_T$, where $y_t \in \{1, \ldots, K\}$. E.g., part-of-speech tagging, named entity recognition, and protein secondary structure prediction.

4. Sequence Tracking: For a sequence of continuous vectors $x = x_1, \ldots, x_T$, estimate the continuous state sequence $y = y_1, \ldots, y_T$, where $y_t \in \mathbb{R}^k$. E.g., dynamic pose estimation and motion tracking.

Throughout the paper, the setting is assumed supervised, meaning that one is given complete pairs of examples, also called train data, $D = \{(x^i, y^i)\}_{i=1}^{n}$. All these problems can be formulated in a unifying probabilistic framework: the n pairs of target/observation data $D = \{(x^i, y^i)\}_{i=1}^{n}$ are drawn independently and identically from some unknown distribution P(x, y), where x ∈ X is called the input or attribute features and y ∈ Y is called the output or targets. The goal is to predict an output y ∈ Y for a new observation x ∈ X. Equivalently, one needs to learn a predictor function y = g(x) from the data D.

Why is the probabilistic formulation of the problem reasonable? Due to the nature of real-world problems, it is nearly impossible to specify all factors that decide the output y deterministically. Moreover, x could contain some measurement noise. To see this, a good illustrative example can be found in [6]: we know that a person is dehydrated. The cause of the dehydration is what we need to predict, that is, y = cause, and Y = a set of possible causes = {low water intake in hot weather, severe diarrhea, ...}. Suppose we measure only x = the water content of the human body. There are certainly other factors we need to know to decide the exact cause of dehydration (e.g., the body/room temperatures or the person's activities a few hours earlier, etc.). In this way, for the confined knowledge or features x, we will rather have a distribution on (X, Y), instead of a deterministic relationship between x and y. Note also that, even though one knows the true generating process P(x, y), any deterministic classifier y = g(x) is imperfect, namely, the optimal Bayes error, $\min_{g:\mathcal{X}\to\mathcal{Y}} P(g(x) \neq y)$, is non-zero.

1.1 Probabilistic Model-Based Approach

In the field of statistical learning, several approaches have been studied to tackle the problem. There are roughly two schools of approaches: probabilistic model-based approaches and non-probabilistic function estimation. The latter tries to find a deterministic classifier or a regression function y = g(x) in the hypothesis space G, which is a subset of the entire function set F = {f : X → Y}. In static classification, support vector machines (SVMs) [53], C4.5 decision trees [38], and AdaBoost boosting algorithms [9] are well-known non-probabilistic approaches. These methods can also be generalized to the problems of sequence classification and sequence tagging¹. SVM, for instance, has recently been extended to these problems, where for sequence classification, one can design a kernel for unequal-length observation sequences (e.g., [16, 15]), while for sequence tagging, each entire label sequence is treated as one of the exponentially (in sequence length) many classes (e.g., [3]).

On the other hand, the probabilistic approach tries to model the distribution P(x, y) or P(y|x). Then the problem turns into joint or conditional density estimation from the data, which has long been studied in modern statistics. The approach is also studied in depth in the machine learning community, known as graphical models. [20] is an excellent tutorial that deals with issues such as inference and parameter/structure learning for graphical models. In the graphical representation of probabilistic models, the (conditional) independencies among the random variables (nodes) are specified by the absence of edges. The edges could be either directed (Bayesian Networks) or undirected (Markov Networks). Thus the graphical structure defines a set of conditional independencies that the model has to conform to, forming a certain family of distributions. The model structure is either given from prior knowledge of the application domain, or learned from data. The structure learning is a big issue and often resorts to exhaustive or (greedy) incremental model search [43, 2, 40].

¹For sequence tracking, where the output is a sequence of continuous vectors, one may have to form a regression setting (e.g., ε-tube-like support vector regression [42]). However, as far as I know, there is no work so far that properly deals with the continuous multivariate sequence structure by the non-probabilistic approach.

Once the structure is specified, one assigns the parameters to the edges to define the local conditional distributions. In this proposal, I assume that the model structure is given somehow (maybe incorrectly), and focus on estimating the parameters of the model from the data. In the probabilistic model-based approach, the output for a new input x is predicted by the Maximum-A-Posteriori (MAP) decision rule, namely, $y = \arg\max_{y \in \mathcal{Y}} P(y|x)$. The posterior P(y|x) is evaluated either by probabilistic inference for the joint model P(x, y) or directly from the conditional model P(y|x). The joint models are called generative models, while the conditional models are called discriminative models.

1.2 Generative vs. Discriminative Models

The discriminative model appears to be more economical than the generative model since the former focuses only on the quantity ultimately necessary for the prediction task, i.e., P(y|x). The generative model expends unnecessary modeling effort by first estimating the full joint model P(x, y), then inferring the posterior distribution. However, an obvious benefit of the generative model is the ability to synthesize samples. Recently, there have been some works comparing the two approaches analytically and empirically. In this section, I illustrate two examples that frequently arise in static classification and sequence tagging.

Naive Bayes vs. Logistic Regression

In static classification, Naive Bayes (NB) and logistic regression are very widely used generative and discriminative models, respectively. As shown in Fig. 1.1, the structure of NB indicates that the d attributes $a_j$ (j = 1, ..., d) are independent given the class c. Logistic regression has a similar structure, yet it models only the conditional distribution of c given $a = [a_1, \ldots, a_d]^T$, assuming the attributes are always given (shaded nodes). Here logistic regression does not have any interaction terms, while introducing such terms amounts to adding edges between attribute nodes². Despite the simple model structures, both models are known to yield good prediction performance. For these simple generative and discriminative models, I show how they are learned conventionally (i.e., maximum likelihood learning), and compare their generalization performance on the standard UCI datasets. For simplicity, we assume binary classification (c ∈ {0, 1}) and binary attributes ($a_j \in \{0, 1\}$ for j = 1, ..., d).

NB is specified by defining the local conditional distributions, $\pi = P(c=1)$, $\mu_{j,1} = P(a_j = 1|c=1)$, and $\mu_{j,0} = P(a_j = 1|c=0)$, for j = 1, ..., d. The parameters of NB are $\theta = \{\pi, \{\mu_{j,c} : j = 1,\ldots,d,\; c = 0, 1\}\}$. Note that the parameters are probabilities, thus $0 \leq \pi \leq 1$ and $0 \leq \mu_{j,c} \leq 1$, for all j and c.

²Equivalently, for the generative model, this corresponds to having Tree-Augmented Naive Bayes (TAN) instead of NB (see Fig. 2.2(a)).


[Figure 1.1: panels (a) Naive Bayes and (b) Logistic Regression, each over class c and attributes a_1, a_2, ..., a_d]

Figure 1.1: Graphical Representation: Naive Bayes and Logistic Regression

The joint likelihood of NB is

$$P(c, a|\theta) = \pi^{c}\,(1-\pi)^{1-c} \cdot \prod_{j=1}^{d} \mu_{j,c}^{a_j}\,(1-\mu_{j,c})^{1-a_j}. \qquad (1.1)$$

For the train data $D = \{(c^i, a^i)\}_{i=1}^{n}$, the log-likelihood $LL(\theta; D)$ can be written as

$$LL(\theta; D) = \sum_{i=1}^{n} \log P(c^i, a^i|\theta)$$
$$= \sum_{i=1}^{n} \Big[ c^i \log \pi + (1-c^i)\log(1-\pi) + \sum_{j=1}^{d} \big( a^i_j \log \mu_{j,c^i} + (1-a^i_j)\log(1-\mu_{j,c^i}) \big) \Big]. \qquad (1.2)$$

It is easy to see that $LL(\theta; D)$ is a concave (negative convex) function of $\theta$, and the maximum solution is given analytically. To get the maximum likelihood estimator, one can differentiate $LL(\theta; D)$ with respect to the NB parameters and set the derivatives to zero, that is,

$$\frac{\partial LL}{\partial \pi} = \frac{\sum_i c^i}{\pi} - \frac{\sum_i (1-c^i)}{1-\pi} = 0, \qquad \pi^* = \frac{\sum_i c^i}{n},$$
$$\frac{\partial LL}{\partial \mu_{j,1}} = \frac{\sum_{i:c^i=1} a^i_j}{\mu_{j,1}} - \frac{\sum_{i:c^i=1} (1-a^i_j)}{1-\mu_{j,1}} = 0, \qquad \mu^*_{j,1} = \frac{\sum_{i:c^i=1} a^i_j}{\sum_{i:c^i=1} 1},$$
$$\frac{\partial LL}{\partial \mu_{j,0}} = \frac{\sum_{i:c^i=0} a^i_j}{\mu_{j,0}} - \frac{\sum_{i:c^i=0} (1-a^i_j)}{1-\mu_{j,0}} = 0, \qquad \mu^*_{j,0} = \frac{\sum_{i:c^i=0} a^i_j}{\sum_{i:c^i=0} 1}. \qquad (1.3)$$

Once the NB parameters are learned, the class prediction for a new observation a is made by the MAP decision rule, namely, $c^* = \arg\max_c P(c|a, \theta)$.
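To make the closed-form estimator concrete, here is a minimal NumPy sketch of Eq. (1.3) together with the MAP prediction rule. It assumes binary labels and binary attributes stored as 0/1 arrays; the function names, the clipping constant, and the toy data are illustrative choices, not part of the text.

```python
import numpy as np

def fit_naive_bayes_ml(A, c, eps=1e-9):
    """Closed-form ML estimates of Eq. (1.3). A: (n, d) binary attributes, c: (n,) binary labels."""
    pi = c.mean()                                   # pi* = sum_i c^i / n
    mu1 = A[c == 1].mean(axis=0)                    # mu*_{j,1}
    mu0 = A[c == 0].mean(axis=0)                    # mu*_{j,0}
    # clip away exact 0/1 so the logs used for prediction stay finite
    return pi, np.clip(mu1, eps, 1 - eps), np.clip(mu0, eps, 1 - eps)

def predict_map(A, pi, mu1, mu0):
    """MAP decision rule c* = argmax_c P(c | a, theta), computed via joint log-likelihoods."""
    ll1 = np.log(pi) + A @ np.log(mu1) + (1 - A) @ np.log(1 - mu1)
    ll0 = np.log(1 - pi) + A @ np.log(mu0) + (1 - A) @ np.log(1 - mu0)
    return (ll1 > ll0).astype(int)

# toy usage with random binary data
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(200, 5))
c = rng.integers(0, 2, size=200)
pi, mu1, mu0 = fit_naive_bayes_ml(A, c)
print(predict_map(A[:5], pi, mu1, mu0))
```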

On the other hand, (linear) logistic regression models the conditional distribution

$$P(c|a, \{w, b\}) = \frac{1}{1 + \exp\big( (-1)^{c} \cdot (w^{T} a + b) \big)}. \qquad (1.4)$$


Figure 1.2: Test errors vs. sample sizes (m) for Naive Bayes (solid lines) and Logistic Regression (dashed lines) on UCI datasets. Excerpted from [32].

The parameters of logistic regression are $\{w, b\}$, where $w = [w_1, \ldots, w_d]^T \in \mathbb{R}^d$ and $b \in \mathbb{R}$. The standard learning maximizes the log-likelihood (or actually the conditional log-likelihood) $LL(\{w, b\}; D)$, which is

$$LL(\{w, b\}; D) = -\sum_{i=1}^{n} \log\Big[ 1 + \exp\Big( (-1)^{c^i} \cdot \Big( \sum_{j=1}^{d} w_j a^i_j + b \Big) \Big) \Big]. \qquad (1.5)$$

The log-likelihood function is strictly concave (having a unique maximum) in w and b; however, there is no analytical solution. Instead, gradient-based methods (conjugate gradient or quasi-Newton) are used, which have been shown to converge reasonably fast [30].
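For comparison with the closed-form NB estimates, the following is a minimal batch gradient-ascent sketch for the conditional log-likelihood of Eq. (1.5). It uses plain fixed-step gradient steps rather than the conjugate gradient or quasi-Newton methods mentioned above; the learning rate and iteration count are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_cll(A, c, lr=0.1, iters=500):
    """Gradient ascent on Eq. (1.5): LL = -sum_i log(1 + exp((-1)^{c_i} (w.a_i + b)))."""
    n, d = A.shape
    w, b = np.zeros(d), 0.0
    s = np.where(c == 1, -1.0, 1.0)            # (-1)^{c_i} for c_i in {0, 1}
    for _ in range(iters):
        z = s * (A @ w + b)                    # argument of exp for each example
        g = sigmoid(z) * s                     # per-example derivative of -LL term w.r.t. (w.a_i + b)
        w += lr * (-(A * g[:, None]).sum(axis=0)) / n   # ascend LL
        b += lr * (-g.sum()) / n
    return w, b

# prediction then follows Eq. (1.4): P(c=1 | a) = 1 / (1 + exp(-(w.a + b)))
```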

In terms of classification performance, it is widely conjectured that discriminative models are better than generative models even when both models have the same representational power in prediction functions. It is not difficult to show that NB and logistic regression share the same type of posterior distributions; more precisely, logistic regression is just a reparameterization of NB for P(c|a)³.

Recently there was work on an analytical and empirical comparison of NB and logistic regression [32]. The conclusion is that in most cases (1) NB has the lower test error for small sample sizes, and (2) as the sample size increases, logistic regression eventually overtakes NB. In other words, logistic regression usually has lower asymptotic error than NB, while NB reaches its asymptotic error much faster than logistic regression. Fig. 1.2 illustrates these results on the UCI datasets.

Hidden Markov Models vs. Conditional Random Fields

One of the most popular generative sequence models is the Hidden Markov Model (HMM).

³See Eq. (2.2) for details.


[Figure 1.3: panels (a) HMM and (b) CRF, each over states $s_{t-1}, s_t, s_{t+1}$ and features $f_1(y_t), f_2(y_t), \ldots, f_d(y_t)$]

Figure 1.3: Graphical Representation of HMM and CRF for Sequence Tagging

In HMMs, the sequential observations $y = y_1, \ldots, y_T$ are modeled in a way that (1) a sequence of hidden states $s = s_1, \ldots, s_T$ is introduced, conforming to a 1st-order Markov transition between adjacent states $s_t$ and $s_{t+1}$, and (2) each hidden state $s_t$ then emits the observation $y_t$. The hidden states of HMMs should be discrete, while the observation could be either discrete or continuous (multivariate). Hence, the transition probabilities $P(s_{t+1}|s_t)$ must be multinomial distributions. The emission distributions $P(y_t|s_t)$ are multinomial for discrete $y_t$, while for continuous multivariate $y_t$, they are usually assumed to be Gaussians or mixtures of Gaussians.

When HMMs are used for sequence tagging problems, the hidden state sequence s is the target variable to predict for the observed sequence y. Here I give a concrete example of HMM learning and prediction for sequence tagging problems. For the sake of simplicity, I assume that we only measure d binary features of the observation $y_t$. That is, we have binary predicates $f_j(y_t) \in \{0, 1\}$, $j = 1, \ldots, d$, as the observation at time t. I further assume that these d binary features are independent of one another given $s_t$, namely, $P(f_1(y_t), \ldots, f_d(y_t)|s_t) = \prod_{j=1}^{d} P(f_j(y_t)|s_t)$. Each state $s_t$ can take a value from $\{1, \ldots, K\}$. This HMM is shown in Fig. 1.3(a). The parameter vector for the HMM is $\theta = [\pi, A, \mu]^T$, where $\pi_l = P(s_1 = l)$, $A_{l,l'} = P(s_t = l|s_{t-1} = l')$, and $\mu_{j,l} = P(f_j(y_t) = 1|s_t = l)$, for $l', l \in \{1, \ldots, K\}$ and $j = 1, \ldots, d$. The joint probability is:

$$P(s_1, \ldots, s_T, y_1, \ldots, y_T|\theta) = \pi_{s_1} \cdot b_{s_1}(y_1;\theta) \cdot \prod_{t=2}^{T} A_{s_t, s_{t-1}} \cdot b_{s_t}(y_t;\theta), \qquad (1.6)$$

where

$$b_l(y_t;\theta) = P(f_1(y_t), \ldots, f_d(y_t)|s_t = l) = \prod_{j=1}^{d} \Big[ \mu_{j,l}^{f_j(y_t)} \cdot (1-\mu_{j,l})^{1-f_j(y_t)} \Big], \quad \text{for } l = 1, \ldots, K.$$

For the train data $D = \{(s^i, y^i)\}_{i=1}^{n}$, the joint log-likelihood $LL(\theta; D)$ can be written as:

$$LL(\theta; D) = \sum_{i=1}^{n} \Big[ \log \pi_{s_1^i} + \sum_{t=2}^{T_i} \log A_{s_t^i, s_{t-1}^i} + \sum_{t=1}^{T_i} \log b_{s_t^i}(y_t^i) \Big]$$
$$= \sum_{i=1}^{n} \Big[ \log \pi_{s_1^i} + \sum_{t=2}^{T_i} \log A_{s_t^i, s_{t-1}^i} + \sum_{t=1}^{T_i} \sum_{j=1}^{d} \big( f_j(y_t^i) \log \mu_{j,s_t^i} + (1 - f_j(y_t^i)) \log(1 - \mu_{j,s_t^i}) \big) \Big],$$

where $T_i$ denotes the length of the i-th sequence⁴. The maximum likelihood estimator of the HMM has an analytical solution:

$$\pi_l^* = \frac{\sum_{i=1}^{n} I(s_1^i = l)}{n}, \qquad A_{l,l'}^* = \frac{\sum_{i=1}^{n} \sum_{t=2}^{T_i} I(s_t^i = l, s_{t-1}^i = l')}{\sum_{i=1}^{n} \sum_{t=2}^{T_i} I(s_{t-1}^i = l')},$$
$$\mu_{j,l}^* = \frac{\sum_{i=1}^{n} \sum_{t=1}^{T_i} f_j(y_t^i)\, I(s_t^i = l)}{\sum_{i=1}^{n} \sum_{t=1}^{T_i} I(s_t^i = l)}, \quad \text{for } j = 1, \ldots, d, \;\; l, l' \in \{1, \ldots, K\}, \qquad (1.7)$$

where $I(p) = 1$ when the predicate p is true, and 0 when p is false.
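The estimator in Eq. (1.7) amounts to simple counting over the labeled sequences. Below is a minimal count-based sketch, assuming states are coded 0, ..., K−1 and each observation is represented by its d binary features; the array layout and function name are assumptions for illustration.

```python
import numpy as np

def fit_hmm_ml(state_seqs, feat_seqs, K, d, eps=1e-9):
    """ML estimates (Eq. 1.7) for a fully observed HMM with binary emission features.
    state_seqs: list of (T_i,) int arrays with states in {0, ..., K-1}
    feat_seqs:  list of (T_i, d) binary arrays, feat_seqs[i][t, j] = f_j(y_t^i)."""
    pi = np.zeros(K)
    A = np.zeros((K, K))              # A[l, l'] = P(s_t = l | s_{t-1} = l')
    mu_num = np.zeros((d, K))
    mu_den = np.zeros(K)
    for s, f in zip(state_seqs, feat_seqs):
        pi[s[0]] += 1
        for t in range(1, len(s)):
            A[s[t], s[t - 1]] += 1
        for t in range(len(s)):
            mu_num[:, s[t]] += f[t]
            mu_den[s[t]] += 1
    pi /= pi.sum()
    A /= np.maximum(A.sum(axis=0, keepdims=True), eps)   # normalize over l for each previous state l'
    mu = mu_num / np.maximum(mu_den, eps)                 # mu[j, l] = P(f_j = 1 | s = l)
    return pi, A, mu
```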

For a new observation sequence y, the label sequence prediction by the learned HMM is done by the well-known HMM Viterbi decoding. It solves the optimization problem $\arg\max_{s} P(s|y)$, or equivalently $\arg\max_{s} P(s, y)$, using dynamic programming: the quantity $\delta_t(s_t)$ defined as

$$\delta_t(s_t) = \max_{s_1, \ldots, s_{t-1}} P(s_1, \ldots, s_{t-1}, s_t, y_1, \ldots, y_t), \quad \text{for } t = 1, \ldots, T, \qquad (1.8)$$

can be evaluated recursively, that is (for each $j = 1, \ldots, K$),

$$\delta_1(s_1 = j) = P(y_1|s_1 = j) \cdot P(s_1 = j) = \pi_j \cdot b_j(y_1),$$
$$\delta_{t+1}(s_{t+1} = j) = \max_{s_t} \big[ P(y_{t+1}|s_{t+1} = j) \cdot P(s_{t+1} = j|s_t) \cdot \delta_t(s_t) \big] = \max_{s_t} \big[ b_j(y_{t+1}) \cdot A_{j,s_t} \cdot \delta_t(s_t) \big], \quad \text{for } t = 1, \ldots, T-1. \qquad (1.9)$$

The optimal label sequence is found by backtracking: $s_T^* = \arg\max_{s_T} \delta_T(s_T)$, and $s_t^* = \arg\max_{s_t} b_{s_{t+1}^*}(y_{t+1}) \cdot A_{s_{t+1}^*, s_t} \cdot \delta_t(s_t)$ for $t = T-1, \ldots, 1$.
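A log-space sketch of the Viterbi recursion of Eqs. (1.8)-(1.9) with backtracking is given below. It assumes the (pi, A, mu) parameterization from the preceding HMM discussion; working in log space is a numerical-stability choice, not something required by the derivation.

```python
import numpy as np

def viterbi(feats, pi, A, mu, eps=1e-12):
    """Most likely state sequence under the HMM, i.e. argmax_s P(s, y).
    feats: (T, d) binary features; pi: (K,); A[l, l'] = P(s_t=l | s_{t-1}=l'); mu: (d, K)."""
    T, K = feats.shape[0], pi.shape[0]
    # log emission b_l(y_t) = sum_j f_j log mu_{j,l} + (1 - f_j) log(1 - mu_{j,l})
    logb = feats @ np.log(mu + eps) + (1 - feats) @ np.log(1 - mu + eps)   # (T, K)
    logA = np.log(A + eps)
    delta = np.log(pi + eps) + logb[0]          # base case of Eq. (1.9)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = logA + delta[None, :]          # scores[j, l] = log A[j, l] + delta_{t-1}(l)
        back[t] = scores.argmax(axis=1)         # best previous state for each current state j
        delta = logb[t] + scores.max(axis=1)
    path = np.zeros(T, dtype=int)               # backtracking
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```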

Recently, discriminative models for sequence tagging problems have been studied intensively. Conditional random fields (CRFs) are the most successful conditional models, and they have been shown to outperform HMMs in many applications including part-of-speech tagging, named-entity recognition, and protein secondary structure prediction [24, 25]. CRFs are (undirected) log-linear models that condition on the observation sequence y. One can form CRFs with arbitrarily complex structures by including necessary features or edges, however, at the expense of inference costs. Here, for a fair comparison, the CRF is assumed to contain only the features related to the transition and the emission of the HMM, as shown in Fig. 1.3(b).

⁴So we assume that the sequence lengths are in general unequal.

The features of the CRF at time t are the (1st-order) transition $I(s_t = l, s_{t-1} = l')$ and the emission $I(s_t = l, f_j(y_t) = 1)$, for $j = 1, \ldots, d$, $l, l' \in \{1, \ldots, K\}$, where $I(\cdot)$ is a predicate indicator as defined previously. This creates a CRF functionally equivalent to the HMM⁵. As a log-linear model, one associates a linear coefficient (a parameter) with each feature. Assuming homogeneous parameters, let $\lambda_{j,l}$ be the parameter for the emission feature $I(s_t = l, f_j(y_t) = 1)$, and $\eta_{l,l'}$ for the transition feature $I(s_t = l, s_{t-1} = l')$, for all t. The clique potential function at time t is defined as:

$$M_t(s_t, s_{t-1}|y) = \exp\Big( \sum_{j=1}^{d} \sum_{l=1}^{K} \lambda_{j,l} \cdot I(s_t = l, f_j(y_t) = 1) + \sum_{l=1}^{K} \sum_{l'=1}^{K} \eta_{l,l'} \cdot I(s_t = l, s_{t-1} = l') \Big). \qquad (1.10)$$

The conditional model is represented as a product of these clique potential functions, normalized to make it a distribution over s, that is,

$$P(s|y) = \frac{1}{Z(y)} \prod_t M_t(s_t, s_{t-1}|y), \quad \text{where } Z(y) = \sum_{s} \prod_t M_t(s_t, s_{t-1}|y). \qquad (1.11)$$

At first glance, what is called the partition function, Z(y), looks infeasible to compute since the sum is over all label sequences, which are exponentially many (i.e., $K^T$). However, the linear chain structure enables an efficient forward/backward recursion. The forward message $\alpha_t(s_t|y)$ at time t is defined as

$$\alpha_t(s_t|y) = \sum_{s_1 \cdots s_{t-1}} \prod_{t'=1}^{t} M_{t'}(s_{t'}, s_{t'-1}|y). \qquad (1.12)$$

From the definition, it is easy to derive the following recursion formula:

$$\alpha_t(s_t|y) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}|y) \cdot M_t(s_t, s_{t-1}|y). \qquad (1.13)$$

Once the forward messages are evaluated for all t, the partition function can be obtained from $Z(y) = \sum_{s_T} \alpha_T(s_T|y)$. For the inference in the CRF, one further needs to define the backward message $\beta_t(s_t|y)$ as

$$\beta_t(s_t|y) = \sum_{s_{t+1} \cdots s_T} \prod_{t'=t+1}^{T} M_{t'}(s_{t'}, s_{t'-1}|y). \qquad (1.14)$$

⁵In essence, the CRF with the HMM features is just a reparameterization of the HMM.


The backward recursion is similarly derived as

$$\beta_{t-1}(s_{t-1}|y) = \sum_{s_t} \beta_t(s_t|y) \cdot M_t(s_t, s_{t-1}|y). \qquad (1.15)$$

The inference in the CRF is completed by the posteriors derived as

$$P(s_t|y) = \frac{1}{Z(y)} \alpha_t(s_t|y) \cdot \beta_t(s_t|y),$$
$$P(s_t, s_{t-1}|y) = \frac{1}{Z(y)} \alpha_{t-1}(s_{t-1}|y) \cdot M_t(s_t, s_{t-1}|y) \cdot \beta_t(s_t|y). \qquad (1.16)$$

The learning of the CRF maximizes the (conditional) likelihood of the train data D with respect to the CRF parameters $\lambda = \{\lambda_{j,l}\}_{j,l}$ and $\eta = \{\eta_{l,l'}\}_{l,l'}$. There is no analytical solution for this; however, the log-likelihood function,

$$LL(\lambda, \eta; D) = \sum_{i=1}^{n} \Big[ \sum_{t=1}^{T_i} \Big( \sum_{j,l} \lambda_{j,l} \cdot I(s_t^i = l, f_j(y_t^i) = 1) + \sum_{l,l'} \eta_{l,l'} \cdot I(s_t^i = l, s_{t-1}^i = l') \Big) - \log Z(y^i) \Big],$$

is concave in $\lambda$ and $\eta$ because of the log-convexity of the partition function. Thus CRF learning is usually done by numerical optimization such as Iterative Scaling algorithms [4, 5] or gradient search. Recently, it has been empirically shown that conjugate gradient ascent or quasi-Newton variants (e.g., BFGS) often exhibit faster convergence than iterative scaling methods [44]. In gradient-based CRF learning, the gradient of the log-likelihood is evaluated using the posteriors obtained from the inference, for instance,

$$\frac{\partial LL(\lambda, \eta; D)}{\partial \lambda_{j,l}} = \sum_{i=1}^{n} \Big[ \sum_{t=1}^{T_i} I(s_t^i = l, f_j(y_t^i) = 1) - E_{P(s_t|y^i)}\Big[ \sum_{t=1}^{T_i} I(s_t = l, f_j(y_t^i) = 1) \Big] \Big]. \qquad (1.17)$$

The Viterbi decoding for the CRF is similar to the HMM case. By defining $\delta_t(s_t = l) = \max_{s_1, \ldots, s_{t-1}} P(s_1, \ldots, s_{t-1}, s_t = l|y)$, the recursion is derived as

$$\delta_{t+1}(s_{t+1} = l) = \max_{s_t} \; \delta_t(s_t) \cdot \frac{\beta_{t+1}(s_{t+1} = l|y) \cdot M_{t+1}(s_{t+1} = l, s_t|y)}{\beta_t(s_t|y)}. \qquad (1.18)$$

In [24], a comparison between HMMs and CRFs with HMM features has been made for synthetic and real data. The synthetic experiment conducted in the paper is interesting. In their setting, the true data generating process is modeled as a mixture of 1st- and 2nd-order transitions, i.e., $P_\alpha(s_t|s_{t-1}, s_{t-2}) = \alpha \cdot P_2(s_t|s_{t-1}, s_{t-2}) + (1-\alpha) \cdot P_1(s_t|s_{t-1})$.


Figure 1.4: Test error scatter plots on synthetic data comparing HMMs and CRFs in sequence tagging. The open squares represent datasets generated from α < 1/2, and the solid circles from α > 1/2. Excerpted from [24].

This makes the assumed models (HMMs and CRFs) suboptimal to the true structure. Here α is a controllable parameter indicating the degree of the 2nd-order transition. Thus α = 0 generates data with a purely 1st-order transition (then we have a correct structure), while α = 1 generates extreme 2nd-order data. The scatter plot in Fig. 1.4 shows the test errors for the two models. The points (datasets) away from the diagonal for a model indicate that the model is superior to the other model. From this result, one can conclude that the CRF is more robust than the HMM for data with an increasing 2nd-order transition effect. Considering the common modeling practice of approximating complex long-range dependencies by simpler structures, CRFs are more robust and appealing models than HMMs [24].
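As a rough illustration of this kind of synthetic setup (not the exact generator of [24]), the sketch below samples a state sequence from the mixed-order transition $P_\alpha$ with randomly drawn 1st- and 2nd-order transition tables; the shapes, the Dirichlet initialization, and the omission of the emission step are all assumptions.

```python
import numpy as np

def sample_states(P1, P2, alpha, T, rng):
    """Sample s_1..s_T from P_alpha(s_t | s_{t-1}, s_{t-2}) = alpha*P2 + (1-alpha)*P1.
    P1: (K, K) with P1[l', :] = P(s_t | s_{t-1}=l'); P2: (K, K, K) indexed [l'', l', :]."""
    K = P1.shape[0]
    s = [rng.integers(K), rng.integers(K)]               # arbitrary first two states
    for t in range(2, T):
        p = alpha * P2[s[t - 2], s[t - 1]] + (1 - alpha) * P1[s[t - 1]]
        s.append(rng.choice(K, p=p))
    return np.array(s)

rng = np.random.default_rng(0)
K = 5
P1 = rng.dirichlet(np.ones(K), size=K)                   # each row is a distribution over next states
P2 = rng.dirichlet(np.ones(K), size=(K, K))
states = sample_states(P1, P2, alpha=0.7, T=200, rng=rng)
```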

From the above examples, we have seen that even though they have equivalent functional representations in the posterior family P(y|x), the discriminative models give better generalization performance. However, one of the major drawbacks of discriminative models is that modeling the regression-type P(y|x) directly is usually difficult, especially when the input structure is complex and when there exist many hidden variables. In contrast, in the conventional modeling of generative models, namely P(x, y) = P(y) · P(x|y), modeling the target-specific conditionals P(x|y) is much easier and more intuitive. Moreover, the generative models have the unique benefit of being able to sample data. Motivated by this, the discriminative learning of generative models has been studied recently, which is discussed in the next chapter.

The paper is organized as follows: In Ch. 2, I provide a unifying parametric gradient-based optimization method for the discriminative learning of general generative models. The prediction performance of the discriminative learning is evaluated for the static and the structured classification problems on synthetic and real datasets. In Ch. 3, the method is applied to the continuous multivariate state domain, yielding discriminative dynamical systems. For the 3D human pose tracking problem from monocular videos, the generalization performance of the proposed methods is demonstrated. In Ch. 4, in order to address certain drawbacks such as the computational overhead and the sensitivity to the choice of the initial model, a novel recursive discriminative learning algorithm is introduced. The improved classification performance of the proposed method is demonstrated in an extensive set of evaluations on time-series sequence data, including human motion classification problems. Ch. 5 summarizes the paper and suggests future research work.


Chapter 2

Discriminative Learning of Generative Models

In the previous chapter, we have shown that discriminative models are more robust and better in prediction than generative models in many cases. However, generative models such as Bayesian networks (BNs) are attractive in a number of data-driven modeling tasks. Among their advantages are the ability to easily incorporate domain knowledge, factorize complex problems into self-contained models, handle missing data and latent factors, and offer interpretability of results. Moreover, discriminative models often suffer from the complexity of modeling the regression-like conditional distribution directly.

Recently, there have been efforts toward learning generative models discriminatively. Two popular discriminative learning methods are conditional likelihood maximization (CML) and margin maximization (Max-Margin). These approaches try to find a parameter vector among the family of generative models $P(x, y|\theta)$ indexed by $\theta \in \Theta$, optimizing certain objectives. In detail, for the train data $D = \{(x^i, y^i)\}_{i=1}^{n}$, CML has the conditional likelihood objective $\arg\max_{\theta} \sum_{i=1}^{n} \log P(y^i|x^i, \theta)$, while Max-Margin tries to maximize the margin between the train target and the other candidate target variables (e.g., maximize the gaps for the inequalities $P(y^i|x^i, \theta) \geq \max_{y \in \mathcal{Y}, y \neq y^i} P(y|x^i, \theta)$, for all i). Both methods are shown to outperform the generative learning (i.e., the traditional maximum likelihood (ML) learning) for a broad range of real data. It is also known that the discriminatively learned generative models focus on the decision boundaries, while the generative learning concentrates on the centers of the data.

Unfortunately, apart from the maximum likelihood estimator (MLE), few theoretical properties (e.g., asymptotic behavior, unbiasedness) are known for the discriminative estimators. In addition, the methods usually do not result in closed-form solutions, instead resorting to gradient-based optimization with non-unique solutions¹. However, the discriminative learning is empirically valuable for better prediction, particularly in settings where the structure of the model is incorrect.

¹Recently, it has been shown that the max-margin objective can be formulated as a convex optimization under certain reparameterization [45]. This issue is discussed in Sec. 2.2.

In this chapter, I will provide the general gradient-based optimization method for CML learning, as well as the latest issue of the convex formulation in Max-Margin learning. In gradient-based CML optimization, there has been some work deriving the gradient of the conditional log-likelihood in static classification for specific generative models, e.g., NB or TAN [12, 36]. However, this is the first to provide the derivation in a unifying framework (including sequence classification, tagging, and sequence tracking) for general generative models. The generative models are assumed to be directed graphical models (i.e., Bayesian Networks), possibly containing some hidden variables z. By assuming that the local conditionals of the model are in the exponential family, the complete joint distribution P(x, y, z) is also in the exponential family, which makes $\nabla_\theta \log P(x, y, z|\theta)$ representable in a closed form². Furthermore, the exact inferences P(z|x, y) and P(y|x) are assumed to be tractable. Popular models such as NB, TAN, Gaussians, HMMs, Linear Dynamical Systems, and their mixtures satisfy these assumptions.

2.1 Conditional Likelihood Maximization

The CML learning tries to find a parameter vector $\theta$ (within the parameter space $\Theta$ of generative models) that maximizes the conditional likelihood. For the train data $D = \{(x^i, y^i)\}_{i=1}^{n}$, the conditional log-likelihood objective is defined as

$$CLL(\theta; D) = \sum_{i=1}^{n} \log P(y^i|x^i, \theta) = \sum_{i=1}^{n} \big[ \log P(x^i, y^i|\theta) - \log P(x^i|\theta) \big]. \qquad (2.1)$$

The CLL objective is directly related to the prediction. In fact, working with the family of generative models, one implicitly defines a family of conditional distributions $\mathcal{C}_\Theta = \big\{ P(y|x, \theta) = \frac{P(x, y|\theta)}{\int_y P(x, y|\theta)} : \theta \in \Theta \big\}$. The induced family of conditional distributions is as powerful as those of discriminative models. In particular, it is easy to see that the generative models in the previous chapter have families of conditional distributions equivalent to the corresponding discriminative models. For example, the conditional distribution inferred from Naive Bayes can be written as

$$P(c=1|a) = \Big[ 1 + \exp\Big( \log\frac{1-\pi}{\pi} + \sum_j \Big[ a_j \cdot \log\frac{\mu_{j,1}}{\mu_{j,0}} + (1-a_j) \cdot \log\frac{1-\mu_{j,1}}{1-\mu_{j,0}} \Big] \Big) \Big]^{-1}.$$

²Unlike undirected graphical models, this is a unique property of directed graphical models, where the partition functions have simple forms.


Equivalently, logistic regression is a reparameterization of Naive Bayes, namely,

$$w_j = -\log\frac{\mu_{j,1}}{\mu_{j,0}} + \log\frac{1-\mu_{j,1}}{1-\mu_{j,0}}, \quad \text{for } j = 1, \ldots, d,$$
$$b = -\log\frac{1-\pi}{\pi} - \sum_j \log\frac{1-\mu_{j,1}}{1-\mu_{j,0}}. \qquad (2.2)$$

Noticing that the CLL objective is the standard maximum likelihood estimator for discriminative models, the CML learning does this for the induced family of conditional distributions from generative models. In this sense, the CML learning of generative models can be regarded as an implicit way to realize discriminative models. In the above correspondence, unlike the strictly concave objective of logistic regression, the CML solution for Naive Bayes is not unique since there exists a many-to-one mapping from Naive Bayes parameters to logistic regression parameters. In general, the CLL objective of the generative model has many global and local optima. Thus, compared to discriminative models, the discriminative learning of generative models retains the benefits of generative models and achieves prediction performance superior to the generative learning, at the expense of non-convex optimization.

2.1.1 CML Optimization

The CLL objective can be locally optimized using a parametric gradient search. In [12, 36], a derivation of the CLL gradient for Naive Bayes and TAN models in static classification has been introduced. Here I provide a unifying way to evaluate the parametric gradient of the CLL for general generative models. From Eq. (2.1), the gradient of the CLL with respect to $\theta$ is given as:

$$\frac{\partial CLL(\theta; D)}{\partial \theta} = \sum_{i=1}^{n} \Big[ \frac{\partial}{\partial \theta} \log P(x^i, y^i|\theta) - \frac{\partial}{\partial \theta} \log P(x^i|\theta) \Big]. \qquad (2.3)$$

The first term, $\frac{\partial}{\partial \theta} \log P(x, y|\theta)$, the gradient of the joint log-likelihood, is straightforward to evaluate if P(x, y) has no hidden variables (assuming that all the conditional densities in the BN belong to exponential families). The presence of hidden variables z, on the other hand, trivially results in the expectation of the gradient of the complete log-likelihood (including z), namely,

$$\frac{\partial}{\partial \theta} \log P(x, y|\theta) = \frac{1}{P(x, y|\theta)} \cdot \frac{\partial}{\partial \theta} \int_z P(z, x, y|\theta)$$
$$= \frac{1}{P(x, y|\theta)} \cdot \int_z P(z, x, y|\theta) \cdot \frac{\partial}{\partial \theta} \log P(z, x, y|\theta)$$
$$= \int_z P(z|x, y, \theta) \cdot \frac{\partial}{\partial \theta} \log P(z, x, y|\theta) = E_{P(z|x, y, \theta)}\Big[ \frac{\partial}{\partial \theta} \log P(z, x, y|\theta) \Big]. \qquad (2.4)$$


The expectation can be computed easily as long as the inference for P(z|x, y) is tractable. The second term of Eq. (2.3), $\frac{\partial}{\partial \theta} \log P(x|\theta)$, the derivative of the measurement log-likelihood, is in the same manner the expectation (over y) of the gradient of the joint log-likelihood, obtained by treating the target y as hidden. That is,

$$\frac{\partial}{\partial \theta} \log P(x|\theta) = E_{P(y|x, \theta)}\Big[ \frac{\partial}{\partial \theta} \log P(x, y|\theta) \Big]. \qquad (2.5)$$

From Eq. (2.4) and Eq. (2.5), the gradient of the CLL is then represented as:

$$\frac{\partial CLL(\theta; D)}{\partial \theta} = \sum_{i=1}^{n} E_{P(z|x^i, y^i, \theta)}\Big[ \frac{\partial}{\partial \theta} \log P(z, x^i, y^i|\theta) \Big] - \sum_{i=1}^{n} E_{P(z, y|x^i, \theta)}\Big[ \frac{\partial}{\partial \theta} \log P(z, x^i, y|\theta) \Big], \quad (z \text{ present}), \qquad (2.6)$$

$$\frac{\partial CLL(\theta; D)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log P(x^i, y^i|\theta) - \sum_{i=1}^{n} E_{P(y|x^i, \theta)}\Big[ \frac{\partial}{\partial \theta} \log P(x^i, y|\theta) \Big], \quad (z \text{ absent}). \qquad (2.7)$$
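As a small illustration of Eq. (2.7) in the hidden-variable-free case, the sketch below computes the CLL gradient for class-conditional Gaussians with fixed identity covariances, where the only parameters are the class means. This toy model and the function names are assumptions for illustration, not a model used in the experiments; the Fisher score with respect to a class mean is simply $I(y=c)(x - \mu_c)$, so the gradient takes the empirical-minus-expected form of Eq. (2.7).

```python
import numpy as np

def class_posteriors(X, mus, priors):
    """P(c | x) for class-conditional unit-covariance Gaussians."""
    # log P(x, c) up to a constant shared across classes
    logjoint = -0.5 * ((X[:, None, :] - mus[None]) ** 2).sum(-1) + np.log(priors)
    logjoint -= logjoint.max(axis=1, keepdims=True)
    post = np.exp(logjoint)
    return post / post.sum(axis=1, keepdims=True)

def cll_gradient_means(X, y, mus, priors):
    """Eq. (2.7): Fisher scores at the observed labels minus their expectation under P(y | x)."""
    K = mus.shape[0]
    onehot = np.eye(K)[y]                      # I(y_i = c)
    post = class_posteriors(X, mus, priors)    # P(c | x_i, theta)
    grads = np.zeros_like(mus)
    for c in range(K):
        w = onehot[:, c] - post[:, c]          # empirical minus expected indicator
        grads[c] = (w[:, None] * (X - mus[c])).sum(axis=0)
    return grads
```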

Some comments can be given here. First, the generative learning tries to make the gradient of the joint log-likelihood vanish. That is, the MLE satisfies $\sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log P(x^i, y^i|\theta) = 0$. This means the sum of the Fisher scores on the train data has to be 0, which corresponds to fitting the model globally to the data. On the other hand, the CML learning tries to make the gradient of the CLL 0, which is equivalent to minimizing the difference between the sum of the Fisher scores on the data and the sum of the expected (by the model) Fisher scores. Hence the CML learning focuses on the target posterior P(y|x) of the model, making it as close to the empirical conditional as possible.

Secondly, the gradient derivation in Eq. (2.4) gives a new insight into the Expectation-Maximization (EM) algorithm for generative learning with hidden variables. The first term of Eq. (2.1), $\sum_{i=1}^{n} \log P(x^i, y^i|\theta)$, is the very objective of generative learning. In the presence of hidden variables, setting the gradient $\sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log P(x^i, y^i|\theta)$ to 0 gives no analytical solution. EM follows an iterative update scheme: (1) (E-step) for the current iterate (parameter vector) $\theta$, compute $P(z|x^i, y^i, \theta)$, and (2) (M-step) solve $E\big[\sum_i \frac{\partial \log P(z, x^i, y^i|\theta)}{\partial \theta}\big] = 0$ for $\theta$ as the next iterate, where the latter expectation is with respect to the posterior obtained from the E-step. Alternatively, one can form a lower-bound maximization setting for EM by Jensen's inequality, using the concavity of the log(·) function. Therefore, the iteration guarantees monotonic improvement of the objective for each update.

However, the EM algorithm cannot be directly applied to the CML optimization since Jensen's inequality for the lower bound does not hold, due to the negative sign on the second term in Eq. (2.1). Recently, there was an effort to form an upper bound on the second term by applying what is called the reverse Jensen's inequality, using the log-convexity of the partition function of exponential family distributions [17]. Unfortunately, the derivation is quite complicated and the bound is often too loose. Instead, gradient ascent optimization such as conjugate gradient search or quasi-Newton type optimization (e.g., BFGS) has been shown to perform well [44].

2.1.2 Example: Classification with Mixtures of Gaussians

We now see a concrete example of the generative and the discriminative CML learning for mixtures of 2D Gaussians in static classification. The generative model used for binary classification represents a mixture of two Gaussians for each class. So it has a class variable c taking either 1 or 2, a measurement in the 2D plane $x \in \mathbb{R}^2$, and a hidden variable $z \in \{1, 2\}$ indicating a particular component of the mixtures. The joint likelihood of the model is written as $P(x, c) = P(c) \cdot P(x|c) = P(c) \cdot \sum_{z=1}^{2} P(z|c) \cdot P(x|z, c)$. For simplicity, the class prior and the mixing proportions are assumed to be known as $P(c=1) = 0.5$ and $P(z=1|c=1) = P(z=1|c=2) = 0.5$. The class conditional is a mixture of two bivariate Gaussians, that is,

$$P(x|c) = 0.5 \cdot \mathcal{N}(x; \mu_{c,1}, \Sigma_{c,1}) + 0.5 \cdot \mathcal{N}(x; \mu_{c,2}, \Sigma_{c,2}).$$

The parameters of the model are $\mu_{c,z} (\in \mathbb{R}^2)$ and $\Sigma_{c,z} (\in \mathbb{R}^{2\times 2}_{+})$ for $c = 1, 2$, $z = 1, 2$, where $\mathbb{R}^{d\times d}_{+}$ is the set of $(d \times d)$ symmetric positive definite matrices. The complete log-likelihood is given as:

$$\log P(z, x, c) = \log P(c) + \log P(z|c) + \log P(x|z, c) = \log\Big(\frac{0.25}{2\pi}\Big) - \frac{1}{2}\log|\Sigma_{c,z}| - \frac{1}{2}(x - \mu_{c,z})^T \Sigma_{c,z}^{-1} (x - \mu_{c,z}).$$

For the train data $\{(x^i, c^i)\}_{i=1}^{n}$, the generative learning maximizes $\sum_{i=1}^{n} \log P(x^i, c^i)$. The E-step computes the posterior distributions of the hidden variable z for the current model,

$$q_i(z) = P(z|x^i, c^i) = \frac{P(z, x^i, c^i)}{\sum_{z'} P(z', x^i, c^i)}, \quad \text{for } z = 1, 2, \;\; i = 1, \ldots, n.$$

The M-step maximizes $\sum_{i=1}^{n} \sum_{z=1}^{2} q_i(z) \cdot \log P(z, x^i, c^i)$, or equivalently, solves the equation that the sum of the expected Fisher scores be 0, namely,

$$\sum_{i=1}^{n} \sum_{z=1}^{2} q_i(z) \cdot \frac{\partial}{\partial \theta} \log P(z, x^i, c^i) = 0.$$

It is easy to see that the Fisher score can be evaluated in a closed form:

$$\frac{\partial}{\partial \theta} \log P(z, x, c) = \begin{cases} \Sigma_{c,z}^{-1}(x - \mu_{c,z}) & \text{if } \theta = \mu_{c,z} \\[4pt] \frac{1}{2}\Sigma_{c,z} - \frac{1}{2}(x - \mu_{c,z})(x - \mu_{c,z})^T & \text{if } \theta = \Sigma_{c,z}^{-1} \\[4pt] 0 & \text{otherwise.} \end{cases}$$


In this way, the M-step has a closed-form solution, for instance,

$$\mu_{1,1}^{new} = \frac{\sum_{i:c^i=1} q_i(1) \cdot x^i}{\sum_{i:c^i=1} q_i(1)}, \qquad \Sigma_{1,1}^{new} = \frac{\sum_{i:c^i=1} q_i(1) \cdot (x^i - \mu_{1,1}^{new})(x^i - \mu_{1,1}^{new})^T}{\sum_{i:c^i=1} q_i(1)}.$$
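Here is a minimal sketch of one EM iteration for the per-class two-component mixture above, with the E-step computing $q_i(z)$ and the closed-form weighted M-step of the text. Classes and components are indexed 0/1 rather than 1/2, the known prior and mixing weights of 0.5 are used implicitly (they cancel in the E-step), and no numerical safeguards are included.

```python
import numpy as np

def em_step(X, c, mus, Sigmas):
    """One EM iteration for the 2-component-per-class Gaussian mixture of Sec. 2.1.2.
    X: (n, 2); c: (n,) in {0, 1}; mus: (2, 2, 2) [class, component, dim]; Sigmas: (2, 2, 2, 2)."""
    n = X.shape[0]
    q = np.zeros((n, 2))                              # q_i(z) = P(z | x_i, c_i)
    for z in range(2):
        Si = np.linalg.inv(Sigmas[c, z])              # per-example inverse covariance, (n, 2, 2)
        diff = X - mus[c, z]
        quad = np.einsum('ni,nij,nj->n', diff, Si, diff)
        q[:, z] = np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(Sigmas[c, z]))
    q /= q.sum(axis=1, keepdims=True)                 # E-step
    for cls in range(2):                              # M-step: closed-form weighted estimates
        idx = (c == cls)
        for z in range(2):
            w = q[idx, z]
            mus[cls, z] = (w[:, None] * X[idx]).sum(0) / w.sum()
            d = X[idx] - mus[cls, z]
            Sigmas[cls, z] = (w[:, None, None] * (d[:, :, None] * d[:, None, :])).sum(0) / w.sum()
    return mus, Sigmas
```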

The CML learning maximizes $CLL = \sum_{i=1}^{n} \log P(c^i|x^i)$. Following the derivation in Eq. (2.6), the parametric gradient of the CLL objective is:

$$\frac{\partial CLL}{\partial \theta} = \sum_{i=1}^{n} E_{P(z|x^i, c^i)}\Big[ \frac{\partial}{\partial \theta} \log P(z, x^i, c^i) \Big] - \sum_{i=1}^{n} E_{P(c, z|x^i)}\Big[ \frac{\partial}{\partial \theta} \log P(z, x^i, c) \Big].$$

The second posterior, $r_i(c, z) = P(c, z|x^i)$, can be computed easily by:

$$r_i(c, z) = \frac{P(z, x^i, c)}{\sum_{c', z'} P(z', x^i, c')}.$$

In this way, the CLL gradient with respect to the parameters is:

$$\frac{\partial CLL}{\partial \mu_{c,z}} = \sum_{i:c^i=c} q_i(z) \cdot \Sigma_{c,z}^{-1}(x^i - \mu_{c,z}) - \sum_{i=1}^{n} r_i(c, z) \cdot \Sigma_{c,z}^{-1}(x^i - \mu_{c,z}),$$

$$\frac{\partial CLL}{\partial \Sigma_{c,z}^{-1}} = \sum_{i:c^i=c} q_i(z) \cdot \Big[ \frac{1}{2}\Sigma_{c,z} - \frac{1}{2}(x^i - \mu_{c,z})(x^i - \mu_{c,z})^T \Big] - \sum_{i=1}^{n} r_i(c, z) \cdot \Big[ \frac{1}{2}\Sigma_{c,z} - \frac{1}{2}(x^i - \mu_{c,z})(x^i - \mu_{c,z})^T \Big].$$

Note that for $\theta = \Sigma_{c,z}^{-1}$, special care needs to be taken to guarantee symmetric positive definite matrices during the gradient ascent updates. One of the simplest ways is to do a Cholesky-like reparameterization $\Sigma_{c,z}^{-1} = Q^T Q$, which results in $\frac{\partial CLL}{\partial Q} = 2Q \cdot \frac{\partial CLL}{\partial \Sigma_{c,z}^{-1}}$ by the chain rule.
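The following sketch assembles the CML gradients above under the Cholesky-like reparameterization $\Sigma_{c,z}^{-1} = Q^T Q$; a gradient-ascent step would simply add a small multiple of the returned gradients to the means and the Q factors. Constants shared by all (c, z) pairs are dropped from the log-joint since they cancel in the posteriors; the array layouts and names are illustrative, not the author's implementation, and classes/components are indexed 0/1.

```python
import numpy as np

def log_joint(X, mus, Qs):
    """log P(z, x, c) for all (c, z), up to the shared constant log(0.25 / (2*pi)).
    X: (n, 2); mus: (2, 2, 2) [class, component, dim]; Qs: (2, 2, 2, 2) with Sigma^{-1} = Q^T Q."""
    n = X.shape[0]
    lj = np.zeros((n, 2, 2))                         # lj[i, c, z]
    for c in range(2):
        for z in range(2):
            Q, d = Qs[c, z], X - mus[c, z]
            qd = d @ Q.T                             # Q (x - mu), row-wise
            lj[:, c, z] = np.log(abs(np.linalg.det(Q))) - 0.5 * (qd ** 2).sum(1)
    return lj

def cml_gradient(X, c, mus, Qs):
    """CLL gradients of Sec. 2.1.2, returned w.r.t. the means and the Q factors."""
    lj = log_joint(X, mus, Qs)
    lq = lj[np.arange(len(c)), c]                    # q_i(z): normalize over z at the observed class
    q = np.exp(lq - lq.max(1, keepdims=True)); q /= q.sum(1, keepdims=True)
    lr = lj.reshape(len(c), 4)                       # r_i(c, z): normalize over both c and z
    r = np.exp(lr - lr.max(1, keepdims=True)); r /= r.sum(1, keepdims=True)
    r = r.reshape(len(c), 2, 2)
    g_mu, g_Q = np.zeros_like(mus), np.zeros_like(Qs)
    for cls in range(2):
        for z in range(2):
            w = np.where(c == cls, q[:, z], 0.0) - r[:, cls, z]   # empirical minus expected weight
            d = X - mus[cls, z]
            Sinv = Qs[cls, z].T @ Qs[cls, z]
            Sigma = np.linalg.inv(Sinv)
            g_mu[cls, z] = (w[:, None] * (d @ Sinv)).sum(0)
            outer = (w[:, None, None] * (d[:, :, None] * d[:, None, :])).sum(0)
            g_Sinv = 0.5 * (w.sum() * Sigma - outer)
            g_Q[cls, z] = 2.0 * Qs[cls, z] @ g_Sinv               # chain rule from the text
    return g_mu, g_Q
```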

To compare the prediction performance of the two learning methods, a synthetic experiment is conducted. The data is sampled from 8 2D Gaussians with unit spherical covariances, 4 of them generating the class c = 1 while the other 4 Gaussians generate the class c = 2, as shown in Fig. 2.1(a). The number of samples is chosen to be large (about 1,000) to see the asymptotic behavior of the learned models. The mixture model (2 components for each class) discussed above is obviously suboptimal with respect to the true data generating process. However, for a certain parameter vector, the classification error becomes almost 0, as depicted in Fig. 2.1(b). In fact, the model in Fig. 2.1(b) is one of the global maxima for the CML learning (CLL = −0.0014). Note that this is not a global maximum for the generative learning (LL = −6738.04), since the model in Fig. 2.1(c) records LL = −6314.23. The generative learning can find the model in Fig. 2.1(c) (depending on the initial model choice); however, as the CLL score indicates (CLL = −591.97), the classification error is about 0.5, no better than a random guess.


[Figure 2.1 panels: (a) Data Samples; (b) CML (CLL = −0.0014, LL = −6738.04, ERR = 0); (c) ML (CLL = −591.97, LL = −6314.23, ERR = 0.5); (d) Initial-01; (e) CML (CLL = −23.81, LL = −11650, ERR = 0.003); (f) ML (CLL = −882.13, LL = −7174.7, ERR = 0.428); (g) Initial-02; (h) CML (CLL = −0.80871, LL = −87033, ERR = 0); (i) ML (CLL = −407.9, LL = −6536.5, ERR = 0.236); (j) Initial-03; (k) CML (CLL = −347.72, LL = −16864, ERR = 0.2); (l) ML (CLL = −591.97, LL = −6314.23, ERR = 0.5)]

Figure 2.1: Asymptotic behavior of the ML/CML learning: depending on the initial model, the ML and the CML sometimes reach good or bad models.


[Figure 2.2 panels: (a) TAN over class c and attributes a_1, a_2, ..., a_d; (b) HMMs for Classification, with class c, hidden states s_{t-1}, s_t, s_{t+1}, and observations x_{t-1}, x_t, x_{t+1}]

Figure 2.2: The generative models for static classification (TAN) and sequence classification (HMM).

Another important issue is that the learned models depend sensitively on the initial model. This is due to the iterative update schemes of the EM algorithm for the ML learning and the gradient-based optimization for the CML learning. For each row (from the second to the last) in Fig. 2.1, the leftmost plot shows the randomly chosen initial model. The CML learning reaches the asymptotic error rate of 0 for the first two initial choices, with decision boundaries that are not the same as the intuitive one in Fig. 2.1(b). The generative learning yields worse prediction even though it achieves higher (joint) likelihood scores. However, for the last initial model, the discriminative learning fails to reach a global optimum, with a lower CLL score and a higher error rate. Even in this case, the CML learning results in lower prediction error than the ML learning.

2.1.3 Evaluation on Real Data

In order to see the generalization performance of the discriminative learning approach, the evaluation is conducted on real data for the problems of static/sequence classification and tagging. The evaluation for the sequence tracking is discussed later in Ch. 3, since the CML learning for dynamical systems (the conventional generative models for continuous multivariate state sequences) is novel and worth discussing separately.

For the static classification, I use two datasets, Chess and M-of-N-3-7-10, from the UCI machine learning repository [14]. Since these datasets have a set of discrete attributes, one usually employs Naive Bayes or Tree-Augmented Naive Bayes (TAN) as the underlying generative models. TAN is obtained by adding extra edges (among the attributes) to Naive Bayes, while each attribute is restricted to have at most two edges (including the one from the class variable c). An example is shown in Fig. 2.2(a). In the experiment, as well as NB, I used 4 TAN structures with increasing order of complexity: adding 1, 2, 3, or d−1


[Figure 2.3 panels: (a) Chess; (b) M-of-N-3-7-10. Test accuracy (%) vs. model structure (NB, NB+1, NB+2, NB+3, Full TAN) for ML and CML.]

Figure 2.3: Static Classification on UCI Data

edges to NB, where d is the number of attributes. The last model is also called a Full TAN. Which edge is to be added is determined by the empirical CMI score, meaning that the edge which maximizes the conditional mutual information (CMI) score is selected at each stage in a greedy manner [19].
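For concreteness, the sketch below computes the empirical class-conditional mutual information score for a pair of binary attributes and greedily picks the best-scoring pair; it shows only the scoring step, and omits the bookkeeping that restricts each attribute to at most two parents when building the TAN. Names are illustrative.

```python
import numpy as np

def conditional_mutual_info(A, c, j, k, eps=1e-12):
    """Empirical I(a_j; a_k | c) for binary attributes A (n, d) and binary class c (n,)."""
    cmi = 0.0
    for cv in (0, 1):
        idx = (c == cv)
        if not idx.any():
            continue
        pc = idx.mean()
        for x in (0, 1):
            for y in (0, 1):
                pxy = ((A[idx, j] == x) & (A[idx, k] == y)).mean()  # P(a_j=x, a_k=y | c)
                px = (A[idx, j] == x).mean()
                py = (A[idx, k] == y).mean()
                if pxy > 0:
                    cmi += pc * pxy * np.log(pxy / (px * py + eps))
    return cmi

def best_edge(A, c):
    """Greedily pick the attribute pair with the largest empirical CMI score."""
    d = A.shape[1]
    scores = {(j, k): conditional_mutual_info(A, c, j, k)
              for j in range(d) for k in range(j + 1, d)}
    return max(scores, key=scores.get)
```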

For each of the 5 structures, the model is learned generatively (ML) and discriminatively (CML), and the test classification accuracies are shown in Fig. 2.3. For the two datasets, the CML learning is superior to the ML learning. Moreover, the discriminative learning yields lower test errors consistently throughout the different model structures (it is less sensitive to the structure). This is an attractive property because structure learning is very difficult in general.

For sequence classification, where we classify or group entire sequences, time-series datasets including human walking sequences are used. The details of the datasets are described in Sec. ??. To handle the sequentially structured input, the generative model is designed to have an HMM for each class. Note that, unlike the sequence tagging case, the state variables of the HMM are now hidden. The graphical representation of the model is shown in Fig. 2.2(b). As shown by the test accuracies in Table 2.1, the CML learning outperforms the generative learning consistently throughout the datasets.

As discussed in Sec. 1.2, HMMs are used for the sequence tagging problem. The state variables of the HMMs are no longer hidden variables, but the target outputs. The datasets used are from computer vision (the Kiosk speaker detection dataset) and information extraction (the FAQ dataset). In Kiosk, given the sequence of measurements (binary features from face detectors and audio cues), the goal is to label each time frame as 1 or 2, indicating whether


Table 2.1: Sequence Classification Test Accuracies (%): For the datasets evaluated with random-fold validation (Gun/Point and GT Gait), the averages and the standard deviations are included. The other datasets contain average leave-1-out test errors. Note that GT Gait and USF Set2 are the multi-class datasets. See Sec. ?? for details.

        Gun/Point       Australian Sign Lang.   GT Gait (K = 5)   USF Set1   USF Set2 (K = 7)
ML      63.78 ± 9.62    91.33                   88.50 ± 4.78      79.76      44.64
CML     73.94 ± 5.23    94.55                   96.62 ± 3.68      82.89      49.11

Table 2.2: Sequence Tagging Test Accuracies (%): leave-one-out cross validation.

        Kiosk          FAQ-BSD        FAQ-Fetish
  ML    90.14 ± 3.86   92.48 ± 4.10   92.86 ± 7.91
  CML   95.94 ± 1.73   97.86 ± 3.46   97.76 ± 2.49
  CRF   94.27 ± 2.19   97.60 ± 2.16   98.66 ± 0.92

For the Kiosk data, with 5 labeled sequences of length around 2,000, the leave-one-out test is performed. The FAQ dataset was first used in the information extraction community [28]. For an FAQ document on a particular topic (e.g., BSD or Fetish), each sentence in the document is labeled with one of the tags {head, question, answer, tail}. The measurements are 24 binary features extracted from the sentences (e.g., Starts with a capital letter?, Contains a numeric character?, or Length less than 3?). Table 2.2 shows the test accuracies from the leave-one-out cross validation. As before, CML is better than ML. Even compared with the discriminative model (CRF), the CML learning of generative models is never inferior.

Despite the good generalization performance, the major drawback of the parametric gradient-based CML learning is that the performance is sensitive to the choice of the initial model, often getting stuck in local optima. Furthermore, the optimization can be computationally demanding if the model is complex with many parameters. Later, in Ch. 4, I suggest an efficient discriminative learning algorithm based on recursive mixture density estimation. Before that, recent work on a convex formulation of Max-Margin learning, another objective for discriminative learning, is briefly discussed.


2.2 Margin Maximization

Even though the approach in [45] can be applied to sequence classification and tagging, it is convenient to confine the discussion to multivariate static classification in order to convey the main point. Multivariate classification maps an input vector x ∈ R^d to a label y ∈ {1, . . . , K}. Any classifier y = g(x) can be represented by a set of real-valued confidence functions b_c(x) for c = 1, . . . , K. The prediction is then a vote, namely y = arg max_c b_c(x). For instance, the (linear) SVM has linear confidence functions, while any probabilistic model-based approach takes the class posteriors P(c|x) as confidence functions.

For the training data {(x_i, y_i)}_{i=1}^n, Max-Margin learning in general tries to maximize the gap between the confidence assigned to the target label in the data and the confidence assigned to the other candidate labels. Formally, (b_{y_i}(x_i) − b_c(x_i)) should be as large as possible for all c ≠ y_i (i = 1, . . . , n). The SVM, formulated as a quadratic program, can be subsumed in this framework. Inspired by the SVM, the main contribution of [45] is to apply the Max-Margin principle to generative models (e.g., Gaussians, mixtures of Gaussians, and HMMs), resulting in a convex program.

We begin with Gaussian models for classification. The model has a class prior P(c) and class conditionals P(x|c) = N(x; µ_c, Σ_c). The classifier induced from the model is derived as:

y = arg max_c P(c|x) = arg max_c log P(c, x) = arg max_c ( log P(c) + log P(x|c) )
  = arg min_c ( (x − µ_c)^T Ψ_c (x − µ_c) + θ_c ),     (2.8)

where Ψ_c = Σ_c^{-1} and θ_c = log |Σ_c| − 2 log P(c). The original parameter set (µ_c, Ψ_c, θ_c) is reparameterized to Φ_c such that

Φ_c = [ Ψ_c          −Ψ_c µ_c
        −µ_c^T Ψ_c   µ_c^T Ψ_c µ_c + θ_c ].     (2.9)

This reparameterization is useful in that the classification function reduces to the simple form y = arg min_c z^T Φ_c z, where z = [x^T, 1]^T. Note that Φ_c is positive definite and the two parameter sets are one-to-one convertible. In the Max-Margin framework, each training sample is constrained to be at least one unit away from the decision boundary of each competing class, i.e.,

z_i^T (Φ_c − Φ_{y_i}) z_i ≥ 1,   ∀c ≠ y_i,  i = 1, . . . , n.     (2.10)

Similarly to the SVM formulation, encoding a preference for smaller parameters results in the following semidefinite program:

min  Σ_c trace(Ψ_c)
s.t. 1 + z_i^T (Φ_{y_i} − Φ_c) z_i ≤ 0,   ∀c ≠ y_i,  i = 1, . . . , n,
     Φ_c ⪰ 0,   c = 1, . . . , K.     (2.11)


By introducing slack variables ξ to allow non-separable data, the problem becomes:

min  Σ_{i,c} ξ_{i,c} + γ Σ_c trace(Ψ_c)
s.t. 1 + z_i^T (Φ_{y_i} − Φ_c) z_i ≤ ξ_{i,c},
     ξ_{i,c} ≥ 0,   ∀c ≠ y_i,  i = 1, . . . , n,
     Φ_c ⪰ 0,   c = 1, . . . , K,     (2.12)

where γ ≥ 0 is a balancing hyperparameter. This is still an instance of semidefinite programming.
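To make the optimization concrete, the following is a minimal sketch of the slack-variable SDP in Eq. (2.12), written with the CVXPY modeling library (an illustration, not part of the original work). The data arrays X, y, the number of classes K, and the trade-off gamma are placeholder assumptions; any SDP-capable solver installed with CVXPY can be used.

    import numpy as np
    import cvxpy as cp

    def maxmargin_gaussians(X, y, K, gamma=1.0):
        n, d = X.shape
        Z = np.hstack([X, np.ones((n, 1))])            # augmented inputs z = [x; 1]
        Phi = [cp.Variable((d + 1, d + 1), PSD=True) for _ in range(K)]
        Xi = cp.Variable((n, K), nonneg=True)          # slack variables xi_{i,c}

        constraints = []
        for i in range(n):
            zz = np.outer(Z[i], Z[i])                  # constant matrix z_i z_i^T
            for c in range(K):
                if c == y[i]:
                    continue
                # 1 + z_i^T (Phi_{y_i} - Phi_c) z_i <= xi_{i,c}
                margin = cp.sum(cp.multiply(Phi[y[i]] - Phi[c], zz))
                constraints.append(1 + margin <= Xi[i, c])

        # trace(Psi_c) is the trace of the upper-left d x d block of Phi_c
        objective = cp.Minimize(cp.sum(Xi) + gamma * sum(cp.trace(P[:d, :d]) for P in Phi))
        cp.Problem(objective, constraints).solve()
        return [P.value for P in Phi]

    def predict(Phis, x):
        # classification rule y = argmin_c z^T Phi_c z
        z = np.append(x, 1.0)
        return int(np.argmin([z @ P @ z for P in Phis]))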

The formulation is also extended to the Gaussian mixture model, whose class conditional is modeled as a mixture of M Gaussians. The first step is to estimate a proxy label m_i for the i-th sample, indicating which component of the mixture is activated. This can be done by inference with the maximum likelihood estimator. Then, for the completely labeled sample (x_i, y_i, m_i), one can follow the same derivation as in the Gaussian case. Now the constraints are:

z_i^T (Φ_{c,m} − Φ_{y_i,m_i}) z_i ≥ 1,   ∀c ≠ y_i,  ∀m,  i = 1, . . . , n.     (2.13)

However, this yields a large number of constraints (nKM). Using the softmax inequality, −log Σ_m e^{−a_m} ≤ min_m a_m, Eq. (2.13) can be replaced by the stricter constraints:

−log( Σ_m exp(−z_i^T Φ_{c,m} z_i) ) − z_i^T Φ_{y_i,m_i} z_i ≥ 1,   ∀c ≠ y_i,  i = 1, . . . , n.     (2.14)

Similarly to Eq. (2.12), the optimization problem is then:

min  Σ_{i,c} ξ_{i,c} + γ Σ_{c,m} trace(Ψ_{c,m})
s.t. 1 + z_i^T Φ_{y_i,m_i} z_i + log( Σ_m exp(−z_i^T Φ_{c,m} z_i) ) ≤ ξ_{i,c},
     ξ_{i,c} ≥ 0,   ∀c ≠ y_i,  i = 1, . . . , n,
     Φ_{c,m} ⪰ 0,   c = 1, . . . , K,  m = 1, . . . , M.     (2.15)

This is no longer an instance of semidefinite programming, but it is still convex. The evaluation is conducted for handwritten digit classification on the MNIST dataset [26]. Varying the mixture order M, the test errors are reported for generative EM learning and discriminative Max-Margin learning. Max-Margin yields significantly lower test errors, while being less sensitive to the model structure.

For the mixture model with M = 4, the prototype digits (i.e., the mean vectors of the mixture components) of the generatively learned model and the Max-Margin model are illustrated in Fig. 2.4. The EM prototypes are representative images (the most likely ones), while the Max-Margin prototypes are samples whose labels are hard to decide, which must lie near the decision boundaries.


Table 2.3: MNIST Digit Classification Test Error (%). Excerpted from [45].

  M    EM    Max-Margin
  1    3.0   1.4
  2    2.6   1.4
  4    2.1   1.2
  8    1.8   1.5

Figure 2.4: Digit prototypes from the generative learning and the Max-Margin discriminative learning. Excerpted from [45].


Chapter 3

Discriminative Learning of Dynamical Systems

I consider the problem of tracking, or state estimation, of time-series motion sequences. The problem can be formulated as estimating a continuous multivariate state sequence, x = x_1 · · · x_T, from the measurement sequence, y = y_1 · · · y_T, where x_t ∈ R^d and y_t ∈ R^k. Its applications in computer vision include 3D tracking of human motion and pose estimation for moving objects from sequences of monocular or multi-camera images.

Learning of dynamic models for tracking is often accomplished by optimizing the likelihood of the measurement sequence, P(y). The increased availability of high-precision motion capture tools and data opens a new possibility for learning models that directly optimize a tracker's prediction accuracy, P(x|y). However, the study of discriminative learning methods for tracking has only recently emerged in the computer vision community.

A problem resembling state estimation in tracking, where x_t is a discrete label instead of a continuous multivariate variable, is known as sequence tagging or segmentation. The most popular generative model in this realm is the Hidden Markov Model (HMM). Traditional Maximum Likelihood (ML) learning of generative models such as HMMs is not directly compatible with the ultimate goal of label prediction (namely, x given y), as it optimizes the fit of the model to the data jointly in x and y. Recently, discriminative models such as Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMMs) were introduced to address the label prediction problem directly, resulting in superior performance to the generative models [24, 28].

Despite the broad success of discriminative models in the discrete state domain, the use of discriminative dynamic models for continuous multivariate state estimation is not widespread. One reason for this is that a natural reparameterization-based transformation of generative dynamical systems into conditional models may violate density integrability constraints and can often produce unstable dynamic systems. For example, an extension of the Linear Dynamical System (LDS) to a CRF imposes irregular constraints on the CRF parameters to ensure finiteness of the log-partition function, making convex or general gradient-based optimization complex and prone to numerical failure.

As an alternative to CRF-based models in continuous state sequence domains, we propose to learn generative dynamic models discriminatively. This approach has been well studied in classification settings: learning generative models such as Tree-Augmented Naive Bayes (TAN) or HMMs discriminatively, by maximizing conditional likelihoods, yields better prediction performance than the traditional maximum likelihood estimator [32, 12, 36, 19, 23]. The main contribution of this work is to extend this approach to dynamic models and the motion tracking problem. Namely, we learn dynamic models that directly optimize the accuracy of pose predictions rather than jointly increasing the likelihood of the object's visual appearance and pose.

I introduce two discriminative learning algorithms for generative probabilistic dynamical systems, P(x, y). One maximizes the conditional log-likelihood of the entire state sequence x, that is, arg max log P(x|y), while the other targets the individual state slices x_t, namely, arg max (1/T) Σ_{t=1}^T log P(x_t|y). These objectives are not convex in general; however, gradient-based optimization yields superior prediction performance to that of the standard ML algorithm. In addition, I devise computationally efficient methods for gradient evaluation as part of the proposed framework.

For several human motions, we compare the prediction performance of the competing models, including nonlinear and latent variable dynamic models. The proposed discriminative learning algorithms for LDS can provide significantly lower prediction error than the standard maximum likelihood estimator, often comparable to estimates of computationally more expensive and parameter-sensitive nonlinear or latent variable models. Thus the discriminative LDS offers a highly desired combination of high estimation accuracy and low computational complexity.

The chapter is organized as follows: In the next section the LDS is briefly reviewed. In Sec. 3.2, it is discussed why discriminative models can be problematic in the continuous multivariate state domain. In Sec. 3.3, the proposed discriminative learning algorithms for LDS are described, followed by how they can be extended to nonlinear models. After reviewing related work in Sec. 3.4, the evaluation on motion tracking data appears in Sec. 3.5.

3.1 Linear Dynamical Systems

The LDS assumes the transition and emission densities to be linear Gaussian, conforming to the graphical representation in Fig. 3.1(a). The conditional densities of the LDS are defined as:

x_1 ∼ N(x_1; m_0, V_0),   x_t|x_{t−1} ∼ N(x_t; A x_{t−1}, Γ),   y_t|x_t ∼ N(y_t; C x_t, Σ).     (3.1)
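As a concrete illustration (not part of the original text), the following minimal sketch samples a state/measurement sequence from the LDS of Eq. (3.1); the parameter values passed in are arbitrary placeholders.

    import numpy as np

    def sample_lds(m0, V0, A, Gam, C, Sig, T, rng=np.random.default_rng(0)):
        d, k = A.shape[0], C.shape[0]
        x = np.zeros((T, d)); y = np.zeros((T, k))
        x[0] = rng.multivariate_normal(m0, V0)                 # x_1 ~ N(m0, V0)
        y[0] = rng.multivariate_normal(C @ x[0], Sig)          # y_1 | x_1
        for t in range(1, T):
            x[t] = rng.multivariate_normal(A @ x[t - 1], Gam)  # x_t | x_{t-1}
            y[t] = rng.multivariate_normal(C @ x[t], Sig)      # y_t | x_t
        return x, y

    # example usage with toy dimensions d = 2, k = 1
    x, y = sample_lds(np.zeros(2), np.eye(2), 0.9 * np.eye(2), 0.1 * np.eye(2),
                      np.ones((1, 2)), 0.5 * np.eye(1), T=100)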



Figure 3.1: Graphical Models: HMM (or LDS), CRF, and MEMM.

The LDS parameter set is Θ_lds = {m_0, V_0, A, Γ, C, Σ}. The joint log-likelihood, LL = log P(x, y) (for brevity, we will often drop the dependency on Θ in the notation), is, up to a constant:

LL = −(1/2) [ (x_1 − m_0)' V_0^{-1} (x_1 − m_0) + log |V_0|
     + Σ_{t=2}^T (x_t − A x_{t−1})' Γ^{-1} (x_t − A x_{t−1}) + (T − 1) log |Γ|
     + Σ_{t=1}^T (y_t − C x_t)' Σ^{-1} (y_t − C x_t) + T log |Σ| ],     (3.2)

where X' indicates the transpose of the matrix X.

The task of inference is to compute the filtered state densities, P(x_t|y_1, . . . , y_t), and the smoothed densities, P(x_t|y). The linear Gaussian assumption of the LDS implies Gaussian posteriors that can be evaluated in linear time using the well-known Kalman filtering or RTS smoothing methods. We denote the means and covariances of these posterior densities by:

m̂_t ≜ E[x_t|y_1 . . . y_t],   V̂_t ≜ V(x_t|y_1 . . . y_t),
m_t ≜ E[x_t|y],   V_t ≜ V(x_t|y),   Σ_{t,t−1} ≜ Cov(x_t, x_{t−1}|y).     (3.3)

To learn the LDS, one needs to find the Θ_lds that optimizes a desired objective function. In the supervised setting, which we assume throughout, given the training data D = {(x^i, y^i)}_{i=1}^n, generative learning maximizes the joint log-likelihood, Σ_{i=1}^n LL(x^i, y^i), which has a closed-form solution obtained by setting the gradients in Eq. (3.4) to 0. For instance, the emission matrix is

C* = [ Σ_{i=1}^n Σ_{t=1}^{T_i} y_t^i (x_t^i)' ] · [ Σ_{i=1}^n Σ_{t=1}^{T_i} x_t^i (x_t^i)' ]^{-1},

where T_i is the length of the i-th sequence.


∂LL/∂m_0     = V_0^{-1} (x_1 − m_0),
∂LL/∂V_0^{-1} = (1/2) V_0 − (1/2)(x_1 − m_0)(x_1 − m_0)',
∂LL/∂A       = Γ^{-1} · Σ_{t=2}^T (x_t x'_{t−1} − A x_{t−1} x'_{t−1}),
∂LL/∂Γ^{-1}  = ((T − 1)/2) Γ − (1/2) Σ_{t=2}^T (x_t − A x_{t−1})(x_t − A x_{t−1})',
∂LL/∂C       = Σ^{-1} · Σ_{t=1}^T (y_t x'_t − C x_t x'_t),
∂LL/∂Σ^{-1}  = (T/2) Σ − (1/2) Σ_{t=1}^T (y_t − C x_t)(y_t − C x_t)'.     (3.4)
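For illustration, the following sketch computes the closed-form ML estimates obtained by setting the gradients in Eq. (3.4) to zero for fully observed training sequences. The data format (a list of (x, y) arrays) and the pooling of the initial-state statistics across sequences are assumptions of this sketch, which expects several training sequences.

    import numpy as np

    def ml_lds(data):
        # data: list of (x, y) with x of shape (T_i, d) and y of shape (T_i, k)
        d = data[0][0].shape[1]
        x1 = np.stack([x[0] for x, _ in data])                    # initial states
        m0 = x1.mean(axis=0)
        V0 = np.cov(x1.T, bias=True).reshape(d, d)
        # A* = (sum x_t x_{t-1}') (sum x_{t-1} x_{t-1}')^{-1}
        A = sum(x[1:].T @ x[:-1] for x, _ in data) @ \
            np.linalg.inv(sum(x[:-1].T @ x[:-1] for x, _ in data))
        # C* = (sum y_t x_t') (sum x_t x_t')^{-1}
        C = sum(y.T @ x for x, y in data) @ \
            np.linalg.inv(sum(x.T @ x for x, _ in data))
        # noise covariances from the residuals
        Gam = sum((x[1:] - x[:-1] @ A.T).T @ (x[1:] - x[:-1] @ A.T) for x, _ in data) \
              / sum(x.shape[0] - 1 for x, _ in data)
        Sig = sum((y - x @ C.T).T @ (y - x @ C.T) for x, y in data) \
              / sum(x.shape[0] for x, _ in data)
        return m0, V0, A, Gam, C, Sig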

ML learning of the generative model is intended to fit the model to the data jointly in x and y. However, in tracking we are often more interested in finding a model that yields high accuracy in predicting x from y, an objective not achieved by ML learning in general. It is therefore tempting to employ discriminative models that explicitly focus on the desired goal. In the discrete state domain, CRFs and MEMMs are such models, shown to outperform generative models like HMMs. Unfortunately, as discussed in the next section, developing CRF- or MEMM-like discriminative models in the continuous multivariate state domain can be a challenge.

3.2 Discriminative Dynamic Models

Analogous to extending HMMs to CRFs and MEMMs, I will extend the LDS to conditional models that have the same representational capacity as the LDS. This, for instance, reduces to exploiting 2nd-order moments (e.g., x_t x'_{t−1}) as local features for the CRF, and a linear Gaussian local conditional density P(x_t|x_{t−1}, y_t) for the MEMM.

3.2.1 Conditional Random Fields

The CRF models the conditional probability of x given y. Since P(x|y) ∝ P(x, y), the log-conditional log P(x|y) has the same form as Eq. (3.2), except that the terms not involving x (e.g., y'_t Σ^{-1} y_t) can be removed, as they are marginalized out into the log-partition function. We reparameterize Θ_lds into CRF parameters so that the latter become linear coefficients for the CRF features. Specifically, the new CRF parameter set Θ_crf = {Λ_b, Λ_1, Λ, Λ_T, Λ_A, Λ_C} satisfies:

Λ_b = V_0^{-1} m_0,   Λ_A = Γ^{-1} A,   Λ_C = Σ^{-1} C,
Λ_1 = −(1/2)(V_0^{-1} + A' Γ^{-1} A + C' Σ^{-1} C),
Λ   = −(1/2)(Γ^{-1} + A' Γ^{-1} A + C' Σ^{-1} C),
Λ_T = −(1/2)(Γ^{-1} + C' Σ^{-1} C).     (3.5)

Then the LDS-counterpart CRF model can be written as:

P(x|y, Θ_crf) = exp( Φ(x, y; Θ_crf) ) / Z(y; Θ_crf),   where

Φ(x, y; Θ_crf) = Λ'_b x_1 + x'_1 Λ_1 x_1 + Σ_{t=2}^{T−1} x'_t Λ x_t + x'_T Λ_T x_T
               + Σ_{t=2}^T x'_t Λ_A x_{t−1} + Σ_{t=1}^T y'_t Λ_C x_t,

Z(y; Θ_crf) = ∫_x exp( Φ(x, y; Θ_crf) ).     (3.6)

Note that Λ_b ∈ R^{d×1}, Λ_1, Λ, Λ_T, Λ_A ∈ R^{d×d}, and Λ_C ∈ R^{k×d}. Below, we abuse notation by defining Λ_t ≜ Λ for 2 ≤ t ≤ T − 1, which lets us compactly represent Λ_1, Λ, and Λ_T together as Λ_t for 1 ≤ t ≤ T.

The (conditional) log-likelihood log P(x|y, Θ_crf) is concave in Θ_crf because Φ(x, y; Θ_crf) is linear in Θ_crf and the log-partition function log Z(y; Θ_crf) is convex. However, the reparameterization produces unexpected constraints on the CRF parameter space. The full set of constraints is not immediately obvious; it includes constraints such as the symmetry and negative definiteness of Λ_t, and further constraints are revealed during the inference phase.

In the assumed chain-structured CRF, shown in Fig. 3.1(b), the potential function M_t(·) defined on the clique at time t can be written as:

M_1(x_1|y) = exp( x'_1 Λ_1 x_1 + Λ'_b x_1 + y'_1 Λ_C x_1 ),
M_t(x_t, x_{t−1}|y) = exp( x'_t Λ_t x_t + x'_t Λ_A x_{t−1} + y'_t Λ_C x_t ),   t ≥ 2.     (3.7)

With the initial condition α_1(x_1|y) = M_1(x_1|y), the forward message is defined recursively (for t ≥ 2) as

α_t(x_t|y) = ∫_{x_{t−1}} α_{t−1}(x_{t−1}|y) · M_t(x_t, x_{t−1}|y).     (3.8)

Since α_t(x_t|y) is an unnormalized Gaussian, it can be represented by a triple (r_t, P_t, q_t) ∈ (R, R^{d×d}, R^d), where α_t(x_t|y) = r_t exp( x'_t P_t x_t + q'_t x_t ). For a feasible Θ_crf,

r_t = r_{t−1} |−π P_{t−1}^{-1}|^{1/2} exp( −(1/4) q'_{t−1} P_{t−1}^{-1} q_{t−1} ),
q_t = Λ'_C y_t − (1/2) Λ_A P_{t−1}^{-1} q_{t−1},   for 2 ≤ t ≤ T, and
P_t = Λ − (1/4) Λ_A P_{t−1}^{-1} Λ'_A,   for 2 ≤ t ≤ T − 1,     (3.9)

with the boundary conditions r_1 = 1, P_1 = Λ_1, q_1 = Λ_b + Λ'_C y_1, and P_T = Λ_T − (1/4) Λ_A P_{T−1}^{-1} Λ'_A. Because Z(y) = ∫_{x_T} α_T(x_T|y), every P_t must be negative definite to guarantee not only a proper (integrable) density with a finite log-partition function, but also proper forward messages α_t(·). As the recursion in Eq. (3.9) shows, however, these conditions may produce irregular constraints on Θ_crf.
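The following minimal numerical sketch (an illustration, not from the original text) runs the P_t recursion of Eq. (3.9) for an arbitrary parameter setting and checks whether every P_t remains negative definite; it merely demonstrates how easily the feasibility condition can be violated.

    import numpy as np

    def check_forward_messages(Lam1, Lam, LamA, T=50):
        P = Lam1                                        # P_1 = Lambda_1
        for t in range(2, T):
            P = Lam - 0.25 * LamA @ np.linalg.inv(P) @ LamA.T
            if np.max(np.linalg.eigvalsh(P)) >= 0:      # P_t must stay negative definite
                return False, t                         # infeasible parameter setting
        return True, T

    d = 3
    rng = np.random.default_rng(0)
    LamA = rng.normal(size=(d, d))       # arbitrary (placeholder) transition coefficients
    Lam1 = -np.eye(d)                    # negative definite start
    Lam = -0.5 * np.eye(d)
    print(check_forward_messages(Lam1, Lam, LamA))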

The backward recursion, which can be derived similarly, adds further constraints on the parameters. As a result, specifying the feasible parameter space of continuous conditional dynamic models is difficult. This, in turn, makes the seemingly convex optimization infeasible in practice.

3.2.2 Maximum Entropy Markov Models

The MEMM has the graphical structure depicted in Fig. 3.1(c). Despite the well-known label bias problem, its simple learning procedure, which does not require forward/backward recursions, is very attractive. Given complete data {(x, y)}, the likelihood function can be factored into terms involving individual slices (x_{t−1}, y_t, x_t) and subsequently treated as a set of independent slice instances. Learning an MEMM is equivalent to training a static classifier or regression function P(x_t|x_{t−1}, y_t) on iid data with outputs {x_t} and inputs {(x_{t−1}, y_t)}.

MEMM with the linear Gaussian conditional, namely,

x_t|x_{t−1}, y_t ∼ N(x_t; A_x x_{t−1} + A_y y_t + e, W),     (3.10)

can be seen as a counterpart of the LDS. Prediction is done by the recursion P(x_t|y) = ∫_{x_{t−1}} P(x_t|x_{t−1}, y_t) · P(x_{t−1}|y). Note that in MEMMs the smoothed posterior P(x_t|y) equals the filtered posterior P(x_t|y_1, . . . , y_t), effectively removing the influence of future samples on current state estimates. The mean estimate m_t = E[x_t|y] is:

m_t = A_x m_{t−1} + A_y y_t + e.     (3.11)

Eq. (3.11) points to another deficiency of linear MEMMs. The next state estimate is linearly related to the previous state mean, where the coefficient A_x is determined by multivariate linear regression with the data treated slicewise independently. If the learned A_x is unstable (i.e., some eigenvalue of A_x has absolute magnitude exceeding 1), the state estimates become unbounded. As a result, the state estimation error can be significantly amplified in this MEMM setting.
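A small numerical sketch (illustrative only) of the mean recursion in Eq. (3.11): with a toy A_x whose spectral radius slightly exceeds 1, the state estimates grow without bound regardless of the measurements.

    import numpy as np

    def memm_mean_track(Ax, Ay, e, Y, m0):
        m, means = m0, [m0]
        for y_t in Y:
            m = Ax @ m + Ay @ y_t + e           # m_t = A_x m_{t-1} + A_y y_t + e
            means.append(m)
        return np.array(means)

    Ax = np.array([[1.05, 0.0], [0.0, 0.5]])    # eigenvalue 1.05 > 1: unstable
    Ay = 0.1 * np.eye(2)
    e = np.zeros(2)
    Y = np.random.default_rng(0).normal(size=(200, 2))
    track = memm_mean_track(Ax, Ay, e, Y, m0=np.zeros(2))
    print(np.abs(track).max())                  # grows roughly like 1.05**200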

This behavior may be reduced when nonlinear or non-Gaussian noise models are used. In [47], for instance, a complex nonlinear regression function (a Bayesian Mixture of Experts) was applied to the 3D human body pose estimation problem. However, the failure of the simple linear MEMM points to the prevalent role of the local functions over the MEMM's overall discriminative model structure. In other words, the success of an MEMM may depend strongly on the performance of the employed static regression functions.

3.3 Discriminative Learning of LDS

The analysis of traditional conditional dynamic models points to possible modes of failure when such models are applied to continuous state domains. To address these deficiencies, I suggest learning the generative LDS model with discriminative cost functionals. As discriminative learning of TAN or HMMs has been shown to outperform generative learning in classification settings, the same approach can be brought to benefit the task of motion tracking in continuous domains. We propose two discriminative objectives for the problem of discriminative learning of LDS. The optimal parameters are estimated by an efficient gradient search on the two objectives. We also show how the discriminative learning task can be extended to a general family of nonlinear dynamic models.

3.3.1 Conditional Likelihood Maximization (CML)

The goal of CML learning is to find LDS parameters that maximize the conditional likelihood of x given y, an objective directly related to our goal of accurate state prediction. The conditional log-likelihood objective for the data (x, y) is defined as:

CLL = log P(x|y) = log P(x, y) − log P(y).     (3.12)

The CLL objective is, in general, non-convex in the model parameter space. However, it can be locally optimized using a general gradient search. The gradient of CLL with respect to Θ_lds is:

∂CLL/∂Θ_lds = ∂ log P(x, y)/∂Θ_lds − ∂ log P(y)/∂Θ_lds.     (3.13)

The first term, the gradient of the complete log-likelihood (the Fisher score), is given in Eq. (3.4). The second term, the gradient of the observation log-likelihood, is essentially the expected Fisher score w.r.t. the posterior density.


That is,

∂ log P(y)/∂Θ_lds = ∫_x P(x|y) · ∂ log P(x, y)/∂Θ_lds = E_{P(x|y)}[ ∂ log P(x, y)/∂Θ_lds ].     (3.14)

Hence, the CLL gradient is the difference between the Fisher score on the data (x, y) and the expected Fisher score under the model given y only. Because the Fisher score, as shown in Eq. (3.4), is a sum of 2nd-order moments (i.e., terms involving x_t x'_t, x_t x'_{t−1}, or x_t), the expected Fisher score can easily be computed once we have the posterior P(x|y). For example, using the fact that E[XY'] = E[X]E[Y]' + Cov(X, Y), the gradient w.r.t. the transition covariance is:

∂ log P(y)/∂Γ^{-1} = E_{P(x|y)}[ ∂LL/∂Γ^{-1} ]
  = ((T − 1)/2) Γ − (1/2) Σ_{t=2}^T [ (m_t m'_t + V_t) − (m_t m'_{t−1} + Σ_{t,t−1}) A'
    − A (m_t m'_{t−1} + Σ_{t,t−1})' + A (m_{t−1} m'_{t−1} + V_{t−1}) A' ].     (3.15)
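For concreteness, the following sketch evaluates the expected-Fisher-score term of Eq. (3.15), assuming a Kalman smoother has already produced the smoothed means, covariances, and lag-one cross-covariances; the array layout used here is an assumption of the sketch.

    import numpy as np

    def dlogPy_dGammaInv(A, Gam, m, V, S):
        # m: (T, d) smoothed means; V: (T, d, d) smoothed covariances;
        # S: (T, d, d) with S[t] = Cov(x_t, x_{t-1} | y)
        T, d = m.shape
        acc = np.zeros((d, d))
        for t in range(1, T):
            Ett = np.outer(m[t], m[t]) + V[t]            # E[x_t x_t']
            Ett1 = np.outer(m[t], m[t - 1]) + S[t]       # E[x_t x_{t-1}']
            Et1t1 = np.outer(m[t - 1], m[t - 1]) + V[t - 1]
            acc += Ett - Ett1 @ A.T - A @ Ett1.T + A @ Et1t1 @ A.T
        return 0.5 * (T - 1) * Gam - 0.5 * acc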

3.3.2 Slicewise Conditional Likelihood Maximization

The goal of CML is to find a model that minimizes the joint estimation error for the entire sequence x = x_1, . . . , x_T. In most motion tracking problems, however, it is more natural to consider the prediction error at each time slice independently. In the discrete state domain, this notion is directly related to minimizing the Hamming distance between the target and the inferred states. In the continuous domain, we consider Slicewise Conditional Likelihood Maximization (SCML) with the following objective:

SCLL = (1/T) Σ_{t=1}^T log P(x_t|y).     (3.16)

SCLL has been introduced as an alternative objective for CRFs in the discrete-domain sequence tagging problem [21]. Note that evaluating the objective itself requires forward/backward or Kalman filtering/smoothing. SCML learning is subsequently based on gradient optimization.

I will extend the approach of [21] to LDS models. For notational clarity, the training data are distinguished from the random variables by denoting the former as x̄ and the latter as x. It is easy to see that the SCLL gradient can be written as:

∂SCLL/∂Θ_lds = (1/T) Σ_{t=1}^T ∂ log P(x̄_t, y)/∂Θ_lds − ∂ log P(y)/∂Θ_lds.     (3.17)


Since the second term is handled in Eq. (3.14), we focus on the first term of Eq. (3.17). It can be shown that the first term, excluding the factor (1/T), is equivalent to:

Σ_{t=1}^T ∫_{x\x_t} P(x \ x_t | x̄_t, y) · ∂ log P(x, y)/∂Θ_lds |_{x_t = x̄_t},     (3.18)

where x \ x_t denotes set-minus, i.e., x excluding x_t. Recalling that the Fisher score, ∂ log P(x, y)/∂Θ_lds, is a sum of 2nd-order moment terms, let f(x_j, x_{j−1}) be one of them. This enables us to evaluate E[f(x_j, x_{j−1})] individually (w.r.t. the unnormalized density Σ_{t=1}^T P(x \ x_t | x̄_t, y)); later, all the expectations of the terms comprising the Fisher score are summed to obtain the quantity in Eq. (3.18).

For f(x_j, x_{j−1}), j = 2, . . . , T, the expectation E[f(x_j, x_{j−1})] with respect to Σ_{t=1}^T P(x \ x_t | x̄_t, y) is:

E_{P(x_j|x̄_{j−1}, y)}[ f(x_j, x_{j−1}) ] + E_{P(x_{j−1}|x̄_j, y)}[ f(x_j, x_{j−1}) ]
  + Σ_{t=1}^{j−2} E_{P(x_j, x_{j−1}|x̄_t, y)}[ f(x_j, x_{j−1}) ]
  + Σ_{t=j+1}^T E_{P(x_j, x_{j−1}|x̄_t, y)}[ f(x_j, x_{j−1}) ].     (3.19)

The first and second terms are expectations w.r.t. the posteriors given the neighboring (next or previous) state. It is not difficult to show that both are Gaussian, namely,

P(x_{t+1}|x_t, y) = N(x_{t+1}; F_{t+1} x_t + b_{t+1}, R_{t+1}),   and
P(x_t|x_{t+1}, y) = N(x_t; G_t x_{t+1} + c_t, S_t),     (3.20)

where F_{t+1} = Σ_{t+1,t} V_t^{-1}, G_t = Σ'_{t+1,t} V_{t+1}^{-1}, b_{t+1} = m_{t+1} − F_{t+1} m_t, c_t = m_t − G_t m_{t+1}, R_{t+1} = V_{t+1} − F_{t+1} Σ'_{t+1,t}, and S_t = V_t − G_t Σ_{t+1,t}.

The third term of Eq. (3.19) is the expectation with respect to the posterior P(x_j, x_{j−1}|x̄_t, y), conditioned on a state two or more slices earlier (note that j > t). This requires another forward recursion over j, which together with the Kalman filter forms a two-pass forward algorithm for SCML learning. Similarly, the fourth term of Eq. (3.19) gives rise to a second-pass backward recursion. For space reasons, I derive only the forward recursion here; the backward recursion is analogous. First, the following lemma, based on a Gaussian identity, is needed:

Lemma 1. P(x_{t+1}|x_t, y) · P(x_t|x_{t−1}, y) is a Gaussian in x_t and x_{t+1}, where µ^1_t ≜ E[x_t|x_{t−1}, y] = F_t x_{t−1} + b_t, µ^2_t ≜ E[x_{t+1}|x_{t−1}, y] = F_{t+1} µ^1_t + b_{t+1}, and V(x_t|x_{t−1}, y) = R_t.

Now we are ready to define the second-pass forward message as α_j(x_j, x_{j−1}) = Σ_{t=1}^{j−2} P(x_j, x_{j−1}|x̄_t, y), for j = 3, . . . , T. It is a sum of (j − 2) Gaussians, for the following reason. Initially, for j = 3, α_3(x_3, x_2) = P(x_3, x_2|x̄_1, y), or equivalently P(x_3|x_2, y) · P(x_2|x̄_1, y), is a Gaussian by Lemma 1. Suppose that α_{j−1}(x_{j−1}, x_{j−2}) is a sum of (j − 3) Gaussians. In the forward recursion,

α_j(x_j, x_{j−1}) = P(x_j|x_{j−1}, y) · ∫_{x_{j−2}} α_{j−1}(x_{j−1}, x_{j−2}) + P(x_j|x_{j−1}, y) · P(x_{j−1}|x̄_{j−2}, y),     (3.21)

the first term of the RHS is a sum of (j − 3) Gaussians by the inductive assumption and Lemma 1, while the second term is another Gaussian by Lemma 1. In particular, it can be shown that the m-th Gaussian component of α_j(x_j, x_{j−1}) has mean [µ^1_j(m); µ^2_j(m)] and covariance [Σ^{11}_j(m), Σ^{12}_j(m); Σ^{21}_j(m), Σ^{22}_j(m)] satisfying the recursion:

µ^1_j(m) = µ^2_{j−1}(m),   µ^2_j(m) = F_j µ^2_{j−1}(m) + b_j,
Σ^{22}_j(m) = F_j Σ^{22}_{j−1}(m) F'_j + R_j,   Σ^{11}_j(m) = Σ^{22}_{j−1}(m),
Σ^{21}_j(m) = Σ^{12}_j(m)' = F_j Σ^{22}_{j−1}(m),     (3.22)

for m = 1, . . . , j − 3, and for the last (j − 2)-th component,

µ^1_j(j−2) = µ^1_{j−1},   µ^2_j(j−2) = µ^2_{j−1},
Σ^{22}_j(j−2) = F_j R_{j−1} F'_j + R_j,   Σ^{11}_j(j−2) = R_{j−1},
Σ^{21}_j(j−2) = Σ^{12}_j(j−2)' = F_j R_{j−1}.     (3.23)

In the same manner, the backward message, defined as

β_j(x_j, x_{j−1}) = Σ_{t=j+1}^T P(x_j, x_{j−1}|x̄_t, y),

turns out to be a sum of (T − j) Gaussians. By summing the expectations with respect to these Gaussians, Eq. (3.19) can be computed, ultimately yielding the SCLL gradient in Eq. (3.17).

3.3.3 Extension to Nonlinear Dynamical Systems

CML and SCML learning can similarly be applied to nonlinear dynamical systems (NDS). In an NDS, the posterior can be evaluated via Extended Kalman filtering/smoothing based on an approximated linear model (e.g., [11]) or using various particle filter methods, depending on the dimensionality of the state space. Since the Fisher score for an NDS is no longer a sum of 2nd-order moments but a complex nonlinear function, evaluating the expectation E[f(x_t, x_{t−1})] becomes difficult. However, following [11], we can approximate a nonlinear dynamic function with an RBF network:

x_t|x_{t−1} ∼ N(x_t; A_k k(x_{t−1}) + A x_{t−1}, Γ),
y_t|x_t ∼ N(y_t; C_k k(x_t) + C x_t, Σ),     (3.24)


where k(x_t) ≜ [k(x_t, u_1), . . . , k(x_t, u_L)]' is a vector of RBF kernels evaluated at known centers {u_l}_{l=1}^L. For k(x_t, u_l) = exp( −(1/2)(x_t − u_l)' S_l^{-1} (x_t − u_l) ), where S_l is the kernel covariance, the nonlinear part of the Fisher score takes specific forms such as k(x_t) k(x_t)', k(x_t) k(x_{t−1})', or x_t k(x_t)', each of which has a closed-form expectation w.r.t. a Gaussian (approximated) posterior. As a result, the gradient terms necessary for CML/SCML optimization in dynamic RBF nonlinear models also possess closed-form expressions.
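As a small illustration (not from the original text), the following sketch evaluates the RBF feature vector k(x) of Eq. (3.24) and the corresponding NDS transition mean; the centers u_l and the per-center inverse covariances are placeholder inputs.

    import numpy as np

    def rbf_features(x, centers, S_inv):
        # k(x, u_l) = exp(-1/2 (x - u_l)' S_l^{-1} (x - u_l)); centers: (L, d), S_inv: (L, d, d)
        diffs = x - centers
        return np.exp(-0.5 * np.einsum('ld,lde,le->l', diffs, S_inv, diffs))

    def nds_transition_mean(x_prev, Ak, A, centers, S_inv):
        # mean of x_t | x_{t-1} under Eq. (3.24): A_k k(x_{t-1}) + A x_{t-1}
        return Ak @ rbf_features(x_prev, centers, S_inv) + A @ x_prev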

In the evaluation, it is verified that for the LDS the discriminative algorithms improve significantly over generative learning. For the NDS, however, the improvement is not as significant as in the linear case. In other words, the choice of learning objective for nonlinear models appears less critical. However, the generalization performance of nonlinear models can be very sensitive to the choice of kernel centers and kernel hyperparameters. In Sec. 3.5, I demonstrate that discriminatively learned linear models can be comparable even to well-tuned nonlinear models.

3.4 Related Work

While discriminative learning of discrete-state dynamic models such as HMMs, CRFs, and MEMMs has received significant attention recently, learning of similar models in the continuous space has rarely been explored. In the robotics community, [1] empirically studied several objectives for learning continuous-state dynamical systems. In contrast to the ad-hoc optimization method of [1], this work is the first to provide efficient gradient optimization algorithms for discriminative objectives, extending the method of [21] to dynamical systems in continuous multivariate domains.

Recent work on the human motion tracking problem can be roughly categorized into dynamic model based approaches ([18, 33, 35]), nonlinear manifold embedding ([8, 39, 46, 54]), and Gaussian process based latent variable models ([51, 31, 52]), to name a few. In our approach, we consider a generative family of models and show that it can be used for accurate and computationally efficient pose estimation if coupled with a proper learning objective.

Within the discriminative paradigm, [47] successfully employed an MEMM-like model with Bayesian mixtures of experts for 3D pose estimation. In general, MEMMs are sensitive to label bias [24]. Their ability to infer states from observations depends mostly on the modeling capacity of the regression functions and not on the choice of a discriminative dynamic model objective. Unlike MEMMs, the discriminatively learned generative dynamic models could also be used for motion synthesis.

3.5 Evaluation

The discriminative dynamical system modeling approach is evaluated in a set of experiments that include synthetic data as well as the CMU motion capture dataset (http://mocap.cs.cmu.edu/). The proposed models are denoted CML and SCML, the LDS models learned via the methods in Sec. 3.3.1 and Sec. 3.3.2, respectively. ML is the standard maximum likelihood estimator for the LDS. I also include comparisons with nonlinear and latent-variable dynamic models, as described in Sec. 3.5.2.

                   ML            CML           SCML
  Error            1.79 ± 0.26   1.59 ± 0.22   1.30 ± 0.12
  Log-Perplexity   4.76 ± 0.40   4.49 ± 0.34   3.80 ± 0.25

Table 3.1: Test errors and log-perplexities for synthetic data.

3.5.1 Synthetic Data

I synthesize data from a devised model that is structurally more complex than an LDS. The model has second-order dynamics and emission, specifically x_t = (1/2) A_1 x_{t−1} + (1/2) A_2 x_{t−2} + v_t and y_t = (1/2) C_1 x_t + (1/2) C_2 x_{t−1} + w_t, where v_t and w_t are Gaussian white noises. The purpose of this experiment is to see how the learning algorithms behave under an incorrect model structure, emphasizing the fact that it is usually difficult to determine the correct model structure in many applications.

The evaluation is done by leave-one-out cross validation for 10 sampled sequences of lengths ∼ N(150, 20²), where dim(x_t) = 3 and dim(y_t) = 2. The test errors and log-perplexities of the three learning methods are given in Table 3.1. Here the estimation error is defined as the averaged norm-2 difference, (1/T) Σ_{t=1}^T ||x̄_t − m_t||_2, where x̄ is the ground truth and m is the estimated state sequence. The log-perplexity is defined as −(1/T) Σ_{t=1}^T log P(x̄_t|y, Θ). The perplexity captures the variance of the estimate, which is not characterized by the norm-2 error. Smaller numbers are better for both measures. The estimated sequences are also visualized in Fig. 3.2.

The results show that the prediction performance is improved by the proposed methods, with the improvement more pronounced for SCML than for CML. This also implies that discriminative learning can be useful for enhancing the limited performance of generatively trained models with (possibly) suboptimal structures.

3.5.2 Human Motion Data

I evaluate the performance of the proposed methods on the task of 3D pose estimation from real human motion data. The CMU motion capture dataset provides the ground-truth body poses (3D joint angles), which makes it possible to compare competing methods quantitatively. Here we include three different motions: walking, picking up a ball, and running. For each motion, 5 or 6 sequences from one subject are gathered to perform leave-one-out validation.



Figure 3.2: Visualization of estimated sequences for synthetic data. It shows the estimated states (for dim-1) at t = 136 ∼ 148. The ground truth is depicted by a solid (cyan) line, ML by dotted (blue), CML by dotted-dashed (red), and SCML by dashed (black).

The measurement is a 10-dimensional Alt-Moment feature vector extracted from the monocular silhouette image (e.g., [50]).

In particular, I will demonstrate how comparable the performance of the proposed algorithms on the LDS is to that of generatively learned nonlinear models. The two nonlinear models used in the evaluation are briefly discussed next.

The first model is the NDS described in Eq. (3.24). Since it is computationally demanding to use all poses x_t in the training data as RBF kernel centers u_l, we instead adopt a sparse greedy kernel selection technique. It adds one pose at a time from the pose pool (containing all training poses), selecting the pose that maximizes a certain objective (e.g., data likelihood). Deciding the number of poses (kernel centers) to add is crucial for generalization performance. In the experiments, we tried several candidates (e.g., 5%, 10%, or 20% of the pool) and report the performance of the best one on test data. The kernel covariance S_l for each center u_l is estimated such that the neighboring points of u_l have kernel values equal to one half of the peak value [11]. This generates reasonably smooth kernels. Further optimization of the kernel hyperparameters was not performed, as it commonly yields only minor performance improvements while requiring significant computational overhead.

The second model is the latent variable nonlinear dynamic model, denoted LVN. As it is widely believed that realizable poses lie in a low-dimensional space, it is useful to introduce latent variables z_t embedded from the poses x_t. One possible way to devise an LVN is to place the dynamics on z_t, assuming x_t and y_t are generated nonlinearly (with RBF kernels) from z_t. Learning the LVN can be done by an EM algorithm on the linearly approximated model, as introduced in [11]. The initial subspace mapping for the LVN is determined by PCA dimensionality reduction on the training poses. Similarly to the NDS, the number of kernels is determined empirically among several candidates, and the best one is highlighted in the results.


  Motions   Err.   ML      CML     SCML    NDS     LVN
  Walk      SJA    19.20   18.31   17.19   18.91   18.01
            FJA    22.57   22.73   20.78   20.84   19.05
            S3P    15.28   14.79   13.53   14.62   13.99
            F3P    20.02   20.28   17.07   16.59   14.96
  Pick-up   SJA    35.03   33.15   30.56   33.50   32.23
            FJA    42.28   38.89   36.99   41.25   32.10
            S3P    22.60   21.27   19.33   21.14   20.49
            F3P    25.20   24.36   23.83   25.35   20.40
  Run       SJA    23.35   22.11   19.39   21.26   19.08
            FJA    21.87   22.09   20.92   21.86   19.76
            S3P    21.52   19.85   16.96   18.41   16.97
            F3P    20.40   20.43   18.43   18.42   17.65

Table 3.2: Average test errors. The error types are abbreviated with 3 letters: the first indicates smoothed (S) or filtered (F), followed by 2 letters indicating whether the error is measured in the joint angle space (JA) or the 3D articulation point space (3P) (e.g., SJA = smoothed error in the joint angle space). The unit scale of the 3D point space is such that the height of the human model is ∼ 25.

Table 3.2 shows the average test (norm-2) errors of the competing methods. We recorded the smoothed (x_t|y) and filtered (x_t|y_1, . . . , y_t) estimation errors in both the (joint angle) pose space and the 3D articulation point space. The latter can easily be evaluated by mapping the estimated joint angles onto the body skeleton model provided with the dataset. As shown, the proposed algorithms have significantly lower prediction errors than ML learning, while exhibiting comparable (or often superior) performance to the best nonlinear models, possibly with latent variables.

It should be noted that the filtered estimation errors of the proposed methods are not as outstanding as the smoothed ones. This is probably due to their smoothing-based objectives. It is interesting, yet left as future work, to examine the performance of modified objectives based on filtering. Comparing the two discriminative algorithms, SCML yields consistently superior performance to CML for all motions. This is expected, since the SCLL objective is more closely related to the ultimate error measure. Note also that inference (tracking) for CML or SCML is standard Kalman filtering/smoothing, which is much faster than approaches based on particles or nonlinear optimization (e.g., [47, 51, 31, 52]). In Fig. 3.3, selected frames of the estimated body skeletons are illustrated to compare SCML with the standard linear and nonlinear models.


Figure 3.3: Skeleton snapshots for walking (a−f), picking up a ball (g−l), and running (m−s). The ground truth is depicted by solid (cyan) lines, ML by dotted (blue), SCML by dashed (black), and the latent variable nonlinear model (LVN) by dotted-dashed (red).


Chapter 4

Recursive Method for Discriminative Learning

In the previous chapter, we have seen that CML discriminative learning achieves better prediction performance than traditional ML fitting. Unfortunately, the major drawback of the parametric gradient-based CML learning is that its performance is usually sensitive to the choice of the initial model. Furthermore, the optimization can be computationally demanding if the model is complex with many parameters. In this chapter, I suggest a novel discriminative learning algorithm for the general classification problem of assigning a class label c ∈ {1, . . . , K} to an input a.

Let f(c, a) denote a generative model used for classification; examples are shown in Fig. 2.2. (For notational convenience, f(c, a) is used interchangeably to denote either the generative model or its likelihood P(c, a) at a data point (c, a).) Instead of learning f(c, a) directly with a discriminative cost function, the proposed method estimates a mixture of generative models, namely F(c, a) = Σ_{m=1}^M α_m f_m(c, a), where α_m ≥ 0 and Σ_m α_m = 1. In a greedy fashion, the mixture component (i.e., the generative model f) to be added at each stage is selected by the criterion that maximizes the conditional likelihood of the newly augmented mixture. Hence the approach exploits the properties of a mixture, alleviating the complex task of discriminative learning.

Theoretically formulated as functional gradient boosting, the procedure yields data weights with which the new component f is learned. This particular weighting scheme effectively emphasizes the data points on the decision boundary, a desirable property for successful classification. At the same time, it focuses on the insufficiently modeled points, a characteristic of traditional density estimators and a property useful in general data fitting.

A crucial benefit of this method is efficiency: finding a new f requires ML learning on the weighted data, a tractable task for a large family of distributions. Thus this approach is particularly suited to domains with complex component models (e.g., HMMs in time-series classification). In addition, the recursive approach is amenable to optimal mixture order estimation and exhibits lower sensitivity to initial parameter choices.

In the experimental evaluation in Sec. 4.3, the newly proposed approach is shown to yield performance comparable to or better than that of many standard methods (including non-generative discriminative approaches such as kernel-based classifiers) in an extensive set of sequence classification problems.

4.1 Discriminative Mixture Learning

The proposed approach is based on the functional gradient boosting framework, studied in [29, 34] for the task of unsupervised density modeling. In this framework, the mixture model is learned in a greedy recursive (boosted) manner: at each stage we add a new component f(c, a) to the current mixture F so as to optimize a certain objective. Two potential advantages of this approach over standard EM-based generative mixture learning are (1) no need for a pre-determined mixture order M, and (2) decreased sensitivity to the initial parameter choice.

Formally, given an objective J(F) for the mixture F, we search for a new component f such that, when F is replaced by ((1 − ε)F + εf) for some small positive ε, J((1 − ε)F + εf) is maximally increased. Due to the convex combination constraint of a mixture, f should maximize the projection of the functional gradient of J(F) onto (f − F). This results in the optimization problem f* = arg max_f ⟨f − F, ∇J(F)⟩, or equivalently:

f* = arg max_f Σ_{i=1}^n w(c_i, a_i) · f(c_i, a_i),     (4.1)

where w(c, a) = ∇_{F(c,a)} J(F) = ∂J(F)/∂F(c, a). Thus ∇_{F(c,a)} J(F) serves as a weight for the data point (c, a), with which the new f is learned.

When the objective is the joint log-likelihood (generative learning), J_Gen(F) = Σ_{i=1}^n log F(c_i, a_i), the functional gradient is ∂J_Gen(F)/∂F(c, a) = 1/F(c, a), yielding the generative data weight w_Gen(c, a) = 1/F(c, a) for (c, a). On the other hand, the conditional log-likelihood,

J_Dis(F) = Σ_{i=1}^n log F(c_i|a_i) = Σ_{i=1}^n log [ F(c_i, a_i) / F(a_i) ],     (4.2)

gives rise to discriminative mixture learning. The functional gradient of Eq. (4.2) at the point (or dimension index) (c_i, a_i) becomes:

∂J_Dis(F)/∂F(c_i, a_i) = ∂/∂F(c_i, a_i) log [ F(c_i, a_i) / F(a_i) ]
  = [ F(a_i) / F(c_i, a_i) ] · ∂/∂F(c_i, a_i) [ F(c_i, a_i) / F(a_i) ]
  = [ F(a_i) / F(c_i, a_i) ] · [ F(a_i) − F(c_i, a_i) · ∂F(a_i)/∂F(c_i, a_i) ] / F(a_i)²
  = [ 1 − F(c_i|a_i) ] / F(c_i, a_i) = ( [ F(c_i|a_i) / F(¬c_i|a_i) ] · F(a_i) )^{-1},


Algorithm 1: Discriminative Mixture Learning.
  input : A set of samples D = {(c_i, a_i)}_{i=1}^n.
  output: A mixture model F(c, a) = Σ_m α_m f_m(c, a).
  begin
    Select initial f;  F ← f
    for m = 2, 3, . . . do
      Select f* by solving Eq. (4.1) with w = w_Dis.
      Select α* by solving Eq. (4.3).
      Update F ← (1 − α*) F + α* f*.
    end
  end

where F(¬c_i|a_i) = Σ_{c≠c_i} F(c|a_i) = 1 − F(c_i|a_i). The discriminative data weight for (c, a) is therefore w_Dis(c, a) = (1 − F(c|a)) / F(c, a).

The discriminative weight indicates that the new f is learned with data weighted inversely proportional to (F(c_i|a_i)/(1 − F(c_i|a_i))) · F(a_i). Hence data points unexplained by the model, i.e., F(a_i) → 0, and points incorrectly classified by the current mixture, i.e., F(c_i|a_i)/(1 − F(c_i|a_i)) → 0, are focused on in the next stage. This is an intuitively appealing property. In contrast, generative mixture learning would only focus on unexplained points, with weights 1/F(c_i, a_i).

Once the optimal component f* has been selected, its optimal contribution to the mixture, α*, can be obtained as:

α* = arg max_{α∈[0,1]} Σ_{i=1}^n log [ ( (1 − α) F(c_i, a_i) + α f*(c_i, a_i) ) / ( (1 − α) F(a_i) + α f*(a_i) ) ].     (4.3)

The complete recursive discriminative mixture modeling algorithm is outlined in Algorithm 1. The first component can be selected using ML learning. The optimization in Eq. (4.1) is a log-of-sum rather than a sum-of-logs; it can be carried out via a lower-bound maximization technique, by recursively completing a few iterations of ML learning of f on the weighted data with weights q_i = w_Dis(c_i, a_i) · f(c_i, a_i). The optimal α can be found with any line search method. This means that the complexity of discriminative mixture model learning is of the order O(M · (C_0 · N_ML + N_LS)), where N_ML is the complexity of ML learning, C_0 is the number of iterations of the recursive ML, and N_LS is the complexity of the line search. In practice, the ML recursions are dominant, resulting in an overall complexity of O(M C_0 N_ML) with C_0 ≈ 1. Hence, the complexity of the discriminative mixture learning algorithm is a constant factor times that of simple generative learning of the base model on weighted data.
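To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 (an illustration, not the original implementation). The helper fit_weighted_ml(data, weights) is a hypothetical routine that performs (weighted) ML learning of the base generative model and returns a function evaluating f(c, a); a single weighted ML fit stands in for the few lower-bound iterations described above, and the line search for α* is replaced by a simple grid search.

    import numpy as np

    def joint_and_marginal(F_funcs, alphas, c, a, classes):
        Fj = sum(al * f(c, a) for al, f in zip(alphas, F_funcs))               # F(c, a)
        Fm = sum(al * sum(f(cc, a) for cc in classes) for al, f in zip(alphas, F_funcs))  # F(a)
        return Fj, Fm

    def discriminative_mixture(data, classes, fit_weighted_ml, M=5):
        n = len(data)
        F_funcs = [fit_weighted_ml(data, np.ones(n))]    # initial component by plain ML
        alphas = [1.0]
        for _ in range(1, M):
            # discriminative weights w_Dis(c, a) = (1 - F(c|a)) / F(c, a)
            w = np.empty(n)
            for i, (c, a) in enumerate(data):
                Fj, Fm = joint_and_marginal(F_funcs, alphas, c, a, classes)
                w[i] = (1.0 - Fj / Fm) / Fj
            f_new = fit_weighted_ml(data, w)             # approximate solution of Eq. (4.1)

            def cll(alpha):                              # objective of Eq. (4.3)
                total = 0.0
                for c, a in data:
                    Fj, Fm = joint_and_marginal(F_funcs, alphas, c, a, classes)
                    fj = f_new(c, a)
                    fm = sum(f_new(cc, a) for cc in classes)
                    total += np.log(((1 - alpha) * Fj + alpha * fj) /
                                    ((1 - alpha) * Fm + alpha * fm))
                return total

            grid = np.linspace(0.01, 0.99, 25)
            a_star = grid[np.argmax([cll(al) for al in grid])]
            alphas = [(1 - a_star) * al for al in alphas] + [a_star]
            F_funcs.append(f_new)
        return F_funcs, alphas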

To illustrate the behavior of this discriminative algorithm, consider the simple example in Fig. 4.1 with two classes, each modeled by a mixture of three Gaussians from which 200 samples were drawn (top). The central lobe of each class models the majority of the samples. Of the two side lobes, one is irrelevant for classification while the other carries crucial samples.


Figure 4.1: Data is generated by the distributions in the top panel (+ class in blue/dashed and − class in red/solid). The middle panel shows the weights for the second component, both discriminative w_Dis(c, a) and generative w_Gen(c, a). The bottom panel displays the individual mixture components of the learned models; the generatively learned component f_2^Gen(c, a) is contrasted with the discriminatively learned one, f_2^Dis(c, a).


Here the base model f(c, a) is assumed to have a single Gaussian density for each class (i.e., f(a|c) is Gaussian), a suboptimal structure relative to the true data-generating process. The initial mixture component f_1(c, a), learned using ML, models the majority of the samples (shown at the bottom). In the middle, we depict the next-stage weight distributions determined by f_1(c, a) for the two learning criteria. In discriminative learning, the points close to the boundary are incorrectly classified by f_1(c, a) and receive high weights w_Dis(c, a), while the unexplained points away from the boundary are not considered, because of their irrelevance for classification. The new mixture components are then added close to the decision boundary (f_2^Dis at the bottom). In generative learning, on the other hand, higher weights are assigned to the unexplained samples (w_Gen(c, a) in the middle), which selects a component corresponding to the main lobes, away from the boundary, hence yielding a less discriminative mixture model (f_2^Gen at the bottom).

4.2 Related Work

There have been similar approaches that recursively refine base classifiers by forming an ensemble of classifiers. The AdaBoost algorithm of [9] is a good example; it minimizes the exponential loss, which relates to the Max-Margin principle. Recently, in [19] the boosting framework was applied to generative models by treating f(c, a) as a (weak) hypothesis, namely c = g(a) = arg max_c f(c|a). At each stage, AdaBoost's weights w on the data (c, a) are used to learn the next hypothesis f via weighted ML learning: arg max_f Σ_{i=1}^n w_i · log f(c_i, a_i). Hence this approach (called Boosted Bayesian Networks, or BBN) is computationally very efficient while inheriting certain benefits from AdaBoost, such as good generalization via Max-Margin. However, the resulting ensemble cannot simply be interpreted as a generative model, since the learned weak hypotheses f are grouped only for the classification task. The proposed approach, on the other hand, gives rise to a mixture model that enjoys the benefits of generative models.

Approaches to estimating mixtures of BNs have emerged in recent years [49, 41, 29]. The proposed recursive boosting algorithm for discriminative mixture learning is based on functional gradient optimization of convex additive models. While similar gradient approaches have been introduced in the past [10, 27], they either provided only heuristic methods for the component search or did not focus on mixtures of generative models. In [34], a mixture fitting problem, reduced to joint log-likelihood cost functional optimization in the supervised setting, was solved in a non-heuristic way. The proposed algorithm generalizes the framework of [34] to the classification setting with an appropriate data weighting scheme for the discriminative cost functional.


4.3 Experiments

To evaluate the utility of the proposed method, I conduct experiments on synthetic and real data. Here I focus on the task of classifying structured measurements (i.e., sequence classification). This is in general a more difficult problem than static multivariate data classification, and standard gradient-based methods such as CML may not be preferred due to the complex model structures.

Throughout the experiments, Gaussian-emission HMMs (GHMMs) are used to model the class conditional densities f(a|c) for real multivariate sequences a. The competing methods are denoted as: (1) ML (EM-based ML learning of f(c, a)), (2) CML (parametric gradient-based CML learning of f(c, a)), (3) BBN (Boosted Bayesian Networks of [19]), (4) BxML (generative mixture learning), and (5) BxCML (discriminative mixture learning).

4.3.1 Synthetic Experiment

2-dimensional sequences are generated from the following process: class 1 is a mixture of two GHMMs, f_1E and f_1H, and class 2 is another mixture of two GHMMs, f_2E and f_2H. The parameters of f_1E and f_2E are chosen so that they generate sequences that look very different; it is thus easy to distinguish sequences sampled from these two. On the other hand, f_1H and f_2H are made to generate sequences similar to each other (hard to classify); they emit sequences near the classification boundary. Example sequences generated from this model are depicted in Fig. 4.2. Note that our model f(c, a) has a suboptimal structure, since it has a single GHMM per class. All GHMMs, in the true model and the learned models, have a fixed order (number of hidden states) of 2.

Figure 4.2: Example sequences generated by the true model.

The experiment is conducted by random 5-fold validation with 50-sequence training sets and held-out 100-sequence test sets. The sequences are 30 samples long.


Table 4.1: Average test errors (%), log-likelihoods (LL), and conditional log-likelihoods (CLL) on the test data. BBN has no LL or CLL since it is a non-generative classifier.

           Test Error   LL on Test   CLL on Test
  ML       19.60        -165.21      -1.97
  CML       9.80        -174.99      -1.11
  BBN       6.40        N/A          N/A
  BxML      4.20        -139.12      -0.44
  BxCML     0.60        -154.62      -0.02

The first component of the mixture models (BxML and BxCML) is chosen as the ML model. The maximum number of iterations for BxML and BxCML is set to 4; however, BxCML often stops earlier, when the conditional log-likelihood score reaches a value sufficiently close to 0. We also run BBN for 10 iterations, sufficient for convergence.

The average test errors and the joint/conditional log-likelihood scores on the test data are shown in Table 4.1. BxCML has the lowest classification error, meaning that it effectively boosts the incorrect base model structure. Overall, the methods that utilize a discriminative objective tend to perform better than their generative counterparts. BxCML also improves the joint log-likelihood score over that of ML, implying that the discriminative mixture model can still enjoy the benefits of generative models, such as richness in synthesis.

4.3.2 Experiments on Real Data

I next demonstrate the benefits of the proposed method in a comprehensive set of evaluations on real-world time-series sequence classification problems. Five classification problems from 4 datasets are described. All the experimental results are summarized in Table 4.2 and Fig. 4.3. In what follows, I briefly review two competing discriminative approaches (SVMs and Nearest Neighbors (NN)) that we compare against in our experiments. Subsequently, I outline how the multi-class problems are treated.

FSVM and 1-NN/DTW

One way to approach the sequence classification problem relies on between-sequence distance measures. A central issue is the task of defining a distance measure (a kernel for the SVM and a Euclidean distance for NN) between pairs of possibly unequal-length sequences.

(1) SVM with Fisher kernel (FSVM): The Fisher kernel between two sequences a and a′ is defined as the RBF evaluated on the distance between their Fisher scores with respect to the underlying generative model. More specifically,

46

Page 53: Discriminative Learning of Generative Models for Sequence ...€¦ · 1.1 Graphical Representation: Naive Bayes and Logistic Regression . 4 1.2 Test errors vs. sample sizes (m) for

assuming binary classification (multi-class problems can be reduced to many binary problems), k(a, a′) = exp(−(U_a − U_{a′})^T (U_a − U_{a′}) / (2σ²)), where U_a = ∇_θ log f(a | c = +). Here f(a | c = +) is usually learned by ML using the examples of the positive class only. The RBF scale σ² is determined as the median, over the training sequences of the positive class, of the distance between each such sequence's Fisher score and the closest Fisher score from the negative class in the training data [15]. We use an SVM with this Fisher kernel. In the experiments, the SVM hyperparameters are selected by 5-fold cross validation.
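To make the construction concrete, here is a minimal sketch of the kernel and the median heuristic for σ², assuming the Fisher scores U_a have already been computed (e.g., by differentiating the GHMM log-likelihood with respect to its parameters). The function names are illustrative, and whether the median distance should additionally be squared is an assumption noted in a comment.

```python
import numpy as np

def fisher_rbf_kernel(U, U_prime, sigma2):
    """RBF kernel on the Euclidean distance between two Fisher score vectors."""
    d = U - U_prime
    return np.exp(-d.dot(d) / (2.0 * sigma2))

def median_sigma2(U_pos, U_neg):
    """Heuristic scale: for each positive-class Fisher score, take the distance
    to the closest negative-class score, then use the median of those distances.
    (Squaring the median is an alternative reading of the same heuristic.)"""
    dists = [min(np.linalg.norm(u - v) for v in U_neg) for u in U_pos]
    return np.median(dists)
```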

(2) NN with dynamic time warping (1-NN/DTW): For two unequal-length sequences, dynamic time warping (DTW) finds the best warping path that minimizes the Euclidean distance between the aligned sequences using dynamic programming. With the warped Euclidean distance measure, we employ 1-NN to classify new sequences. We include only 1-NN since we have verified that the choice of k ≥ 2 in k-NN rarely impacts the classification performance in our experiments.
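As a reference point, here is a minimal dynamic-programming sketch of DTW for multivariate sequences and the resulting 1-NN rule. It uses the unconstrained warping recursion and sums frame-wise Euclidean distances along the path; any band or slope constraints used in the actual experiments are not modeled.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(T1*T2) dynamic time warping between sequences of shape (T, D).
    Returns the cost of the best warping path under frame-wise Euclidean costs."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

def nn_dtw_classify(query, train_seqs, train_labels):
    """1-NN classification with the DTW distance."""
    dists = [dtw_distance(query, s) for s in train_seqs]
    return train_labels[int(np.argmin(dists))]
```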

Treating Multi-class Problems

In multi-class settings we apply both direct multi-class solutions and binarization. For FSVM we ignore direct multi-class solutions due to difficulties in direct treatment. The binarization is usually done in either a one-vs-others or a one-vs-one manner. In the one-vs-others setting, multi-class labels are predicted using the winner-takes-all (WTA) strategy from the outputs of the binarized problems. In the one-vs-one setting, we employ the pairwise coupling (PWC) of [13]. Note that for FSVM, the SVM outputs have to be transformed into Platt's probabilistic outputs [37] before we apply PWC [7]; this transformation is not necessary for the generative probabilistic models. In the notation, we denote the SVM one-vs-one PWC scheme by FSVM(PWC) and the one-vs-others scheme by FSVM(WTA).

For the generative models, we evaluate both direct multi-class solutions and the PWC for one-vs-one. For instance, for CML, we denote the former by CML and the latter by CML(PWC). For BBN, we used (1) the direct multi-class treatment (AdaBoost.M1), denoted by BBN, and (2) one-vs-one binarization with max-win voting, denoted by BBN(MWV).
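For concreteness, the following is a minimal sketch of the two simplest prediction rules mentioned above: winner-takes-all over one-vs-others scores, and max-win voting over one-vs-one decisions. The pairwise coupling of [13] additionally converts the pairwise outputs into a consistent set of class probabilities and is not reproduced here; all names below are illustrative.

```python
import numpy as np

def wta_predict(scores):
    """One-vs-others: scores[k] is the real-valued output of the k-th
    binary (class-k vs. rest) classifier; predict the argmax."""
    return int(np.argmax(scores))

def mwv_predict(pairwise_winners, num_classes):
    """One-vs-one max-win vote: pairwise_winners[(i, j)] is the class (i or j)
    chosen by the binary classifier trained on classes i and j."""
    votes = np.zeros(num_classes)
    for (_pair, winner) in pairwise_winners.items():
        votes[winner] += 1
    return int(np.argmax(votes))
```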

Datasets

The datasets used for evaluation are summarized as follows:

(1) Gun/Point: The task is to distinguish whether a gun is drawn or a finger is pointed [22]. The motions are represented by 1D sequences recording the x-coordinates of the centroid of a subject's right hand. This is the only dataset of equi-length (150) sequences. The evaluation is performed by 10 random-fold validation.

(2) Australian Sign Language (ASL): This dataset contains about 100 signs generated by 5 signers with different levels of skill [14]. In this experiment, we consider only 10 selected signs (e.g., "hello", "sorry", etc.). The sequences have features corresponding to the hand position, hand orientation, finger flexion, and more. The sequence lengths are very diverse, ranging from 17 to 196. We formulate binary classification problems distinguishing one sign from another, yielding 45 (= C(10, 2)) problems. For each problem, 40 samples (20 from each sign) are gathered and the leave-one-out test is performed.

(3) Georgia-Tech Speed-Control Gait (GT Gait): We also test the proposed method on the human gait recognition problem. The database is originally intended for studying distinctive characteristics (e.g., stride length or cadence) of the human gait over different speeds [48]. Apart from the original purpose of the data, we are interested in recognizing subjects regardless of their walking speeds. We take sequences from 5 subjects at all their walking speeds, forming a 5-class problem with 36 sequences in each class. The original dataset provides high-quality 3D motion capture features on which most of the competing models perform equally well. To make the classification task more difficult we consider two modifications: (1) from the original 1-cycle sequences, we take sub-sequences randomly; (2) only the features related to the lower body part are used. The evaluation is performed by 10 random-fold validation.

(4) USF Human ID Gait Data: The database (available at http://figment.csee.usf.edu/GaitBaseline) consists of about 100 subjects walking periodically along elliptical paths in front of the cameras. We focus on the task of motion-based subject identification. From the processed human silhouette video frames we computed the 7th-order Hu moments, which are translation- and rotation-invariant descriptors of binary images. We randomly choose 7 humans from the database, each represented by 16 sequences. We consider the following two problem settings. Set1 (Distinguish two subjects): We select sequences of only two humans and distinguish the two subjects. Thus we have 21 (= C(7, 2)) binary classification problems, where each one contains 32 sequences (16 from each subject). Set2 (Recognize all subjects): We classify all 7 human IDs. This is a more difficult multi(7)-class problem. Both sets are evaluated using leave-one-out validation.
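As a sketch of this feature-extraction step, the snippet below maps each binary silhouette frame to its seven Hu moments using OpenCV; the log-magnitude rescaling at the end is a common convention and an assumption here, not something stated in the text.

```python
import cv2
import numpy as np

def silhouette_to_hu(frame):
    """Map one binary silhouette image (2D uint8 array) to its 7 Hu moments,
    which are invariant to translation and rotation (and scale)."""
    m = cv2.moments(frame, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()  # shape (7,)
    # Optional log-magnitude transform to compress dynamic range (an assumption).
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def video_to_sequence(frames):
    """Stack per-frame Hu-moment vectors into a (T, 7) observation sequence."""
    return np.stack([silhouette_to_hu(f) for f in frames])
```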

Discussion

Results of our experiments on the four datasets are summarized in Table 4.2. The results suggest that the discriminatively-trained mixture model, BxCML, is among the class of best-performing models, performing on par with or better than state-of-the-art methods such as FSVM. This points to the critical benefit of BxCML, which couples the increased modeling capacity of mixture models with the discriminative learning objective.

For the Gun/Point dataset, with binary-class equi-length sequences, the purely discriminative classifiers (FSVM and 1-NN/DTW) outperform traditional generative models trained both generatively and discriminatively (ML and CML). On the other hand, for the ASL dataset, which contains diverse-length sequences, all generative models yield superior performance to the example-based classifiers. This is possibly due to the sensitivity of the kernel methods (FSVM and 1-NN/DTW) to the choice of kernel parameters, which becomes a critical but difficult-to-solve problem for datasets with diverse-length sequences.

Generative models, on the other hand, naturally account for varying-length sequences. However, their representational power may need to be increased via the mixture modeling formalism in order to account for variability not captured by traditional HMMs, as suggested by the good performance of mixture models on the Gun/Point dataset.

It is important to note, however, that despite the representational capacity of mixtures, the role of a proper optimization objective can be crucial. For the GT Gait dataset, we can see that generative models with discriminative objectives (CML and BxCML) are significantly better than those with generative objectives (ML and BxML). The improved performance of CML compared to BxML implies that discriminative learning of models with even inferior structures can yield superior classifiers. Overall, the comparison of BxCML and BxML suggests that the impact of discriminative learning of the mixtures can be significant.

BBN has a potential similar to BxCML's to focus on modeling the decision boundary. Our experiments indicate that BxCML is never inferior to BBN, perhaps pointing to deficiencies in the approximation step of the weighted ML optimization in BBN. The weighted ML training in the BxCML approach, on the other hand, does not involve a similar approximation assumption. Additionally, BxCML results in a completely generative model F(c,a) that could possess attractive data-synthesis properties, as indicated by our result on the synthetic data.

While the discriminative mixture model outperforms other approaches, multi-class problems, as indicated by USF Set2, raise an important modeling issue. In particular, USF Set2 suggests that, for some models, binarization yields better performance than the direct multi-class treatment. This issue does not daunt the generatively learned generative models (ML/BxML), as the optimization of the joint likelihood implies no discrimination between the true and the competing class variables. BxCML, due to discriminative learning, may exhibit a large difference between binarization and the direct multi-class treatment. This behavior is due to the numerator of the weight, (1 − F(c|a)), which penalizes the complement classes (¬c) equally for an incorrectly predicted point (c,a). Similar issues have been observed in binary-class AdaBoost approaches and will be addressed in future work.


Table 4.2: Test errors (%): For the datasets evaluated with random-fold validation (Gun/Point and GT Gait), the averages and the standard deviations are included. The other datasets report average leave-one-out test errors. "–" indicates a redundant entry, since the multi-class method would be applied to binary-class data. (Note that GT Gait and USF Set2 are the multi-class datasets.) The boldfaced numbers indicate the lowest test errors, within the margin of significance, for a given dataset.

Method        Gun/Point        ASL      GT Gait         USF Set1   USF Set2
ML            36.22 ± 9.62     8.67     11.50 ± 4.78    20.24      55.36
ML(PWC)       –                –        11.50 ± 4.78    –          55.36
CML           26.06 ± 5.23     5.45      3.38 ± 3.68    17.11      50.89
CML(PWC)      –                –         3.63 ± 3.51    –          39.29
BBN           28.78 ± 13.75    4.90     10.13 ± 3.61    17.11      55.36
BBN(MWV)      –                –         3.50 ± 3.05    –          42.86
BxML          19.28 ± 6.15     6.33     11.87 ± 5.11    19.35      48.21
BxML(PWC)     –                –        14.25 ± 4.90    –          50.00
BxCML         17.28 ± 5.67     5.18      5.75 ± 2.78    13.84      54.46
BxCML(PWC)    –                –         6.87 ± 4.09    –          35.71
FSVM(PWC)     22.67 ± 6.58     10.90     7.12 ± 4.17    12.95      39.29
FSVM(WTA)     –                –         2.87 ± 2.29    –          44.64
1-NN/DTW      22.33 ± 5.75     12.06     8.38 ± 3.68    22.17      54.46
Avg           24.66 ± 7.54     7.64      7.75 ± 3.88    17.54      48.15


[Plot: a grid of 21 pairwise test-error scatter plots, one per pair of the models ML, CML, BBN, BxML, BxCML, FSVM, and 1-NN/DTW, with both axes ranging from 0 to 60%.]

Figure 4.3: Test error scatter plots comparing 7 models from Table 4.2. Each point corresponds to one of the 5 classification problems. For instance, the congregation of points below the main diagonal in the BxCML vs. ML case suggests that BxCML outperforms ML in most of the experimental evaluations. The (red) rectangles indicate the plots comparing BxCML with the others.


Chapter 5

Future Work and Conclusion

In this work, I provided a unifying parametric gradient-based optimization method for the discriminative learning of general generative models. The CML learning of generative models, regarded as an implicit way to realize discriminative models, is shown to yield superior prediction performance to generative learning, often comparable to that of the discriminative models.

Applied to dynamical systems for the problem of motion tracking, the discriminative learning provided significantly lower prediction error than the standard maximum likelihood estimator, often comparable to nonlinear models. As future work, I plan to extend the methods to deal with settings where the motion capture data is assumed noisy (e.g., severe occlusions). In addition, I will apply the proposed approaches to piece-wise linear models such as switching LDS (e.g., [35]), which can handle problematic motions that may contain rapid changes in motion types.

To address the drawbacks of the parametric gradient-based optimization, such as the computational overhead and the sensitivity to the initial model choice, I introduced a novel discriminative method for learning mixtures of generative models. The proposed method is computationally efficient, making it suitable for domains described by complex generative models and settings such as the spaces of time-series sequences.

Another interesting topic is the Max-Margin learning discussed at the end of Ch. 2. As far as I know, there is no prior work comparing the two discriminative learning algorithms (CML and Max-Margin) in a systematic way. One may develop a unifying framework for the two different objectives. Even though Max-Margin learning can be formulated as an instance of convex programming, its major drawback is that it is not easy to apply to problems with continuous multivariate sequence outputs. In future work, I will also try to apply Max-Margin learning to dynamical systems, which may be formulated with ε-tube constraints as in support vector regression [42].


Bibliography

[1] P. Abbeel, A. Coates, M. Montemerlo, A. Y. Ng, and S. Thrun. Discriminative training of Kalman filters, 2005. Robotics: Science and Systems.

[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 1974.

[3] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines, 2003. International Conference on Machine Learning.

[4] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

[5] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

[6] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

[7] K. Duan and S. Keerthi. Which is the best multiclass SVM method? An empirical study, 2003. In Advances in Neural Information Processing Systems.

[8] A. Elgammal and C.-S. Lee. Inferring 3D body pose from silhouettes using activity manifold learning, 2004. Computer Vision and Pattern Recognition.

[9] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting, 1995. European Conference on Computational Learning Theory.

[10] J. Friedman. Greedy function approximation: A gradient boosting machine, 1999. Technical report, Dept. of Statistics, Stanford University.

[11] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm, 1999. In Advances in Neural Information Processing Systems.

[12] R. Greiner and W. Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers, 2002. Proceedings of the annual meeting of the American Association for Artificial Intelligence.

[13] T. Hastie and R. Tibshirani. Classification by pairwise coupling, 1998. In Advances in Neural Information Processing Systems.


[14] S. Hettich and S. D. Bay. The UCI KDD archive [http://kdd.ics.uci.edu], 1999. Irvine, CA. University of California, Department of Information and Computer Science.

[15] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies, 1999. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology.

[16] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers, 1998. In Advances in Neural Information Processing Systems.

[17] T. Jebara and A. Pentland. On reversing Jensen's inequality, 2000. In Advances in Neural Information Processing Systems.

[18] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296–1311, 2001.

[19] Y. Jing, V. Pavlovic, and J. M. Rehg. Efficient discriminative learning of Bayesian Network Classifier via boosted augmented Naive Bayes, 2005. International Conference on Machine Learning.

[20] M. I. Jordan and C. Bishop. Introduction to graphical models, 2001. In progress.

[21] S. Kakade, Y. Teh, and S. Roweis. An alternate objective function for Markovian fields, 2002. International Conference on Machine Learning.

[22] E. Keogh and T. Folias. The UCR time series data mining archive [http://www.cs.ucr.edu/∼eamonn/TSDMA/index.html], 2002. Riverside, CA. University of California - Computer Science & Engineering Department.

[23] M. Kim and V. Pavlovic. Discriminative learning of mixture of Bayesian Network Classifiers for sequence classification, 2006. Computer Vision and Pattern Recognition.

[24] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data, 2001. International Conference on Machine Learning.

[25] J. Lafferty, X. Zhu, and Y. Liu. Kernel Conditional Random Fields: Representation and clique selection, 2004. International Conference on Machine Learning.

[26] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition, 1995. In F. Fogelman and P. Gallinari, editors, International Conference on Artificial Neural Networks.

[27] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses, 1999. In Advances in Large Margin Classifiers, MIT Press.


[28] A. McCallum, D. Freitag, and F. Pereira. Maximum Entropy Markov Models for information extraction and segmentation, 2000. International Conference on Machine Learning.

[29] C. Meek, B. Thiesson, and D. Heckerman. Staged mixture modelling and boosting, 2002. Uncertainty in Artificial Intelligence.

[30] T. P. Minka. A comparison of numerical optimizers for logistic regression, 2003.

[31] K. Moon and V. Pavlovic. Impact of dynamics on subspace embedding and tracking of sequences, 2006. Computer Vision and Pattern Recognition.

[32] A. Y. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes, 2002. In Advances in Neural Information Processing Systems.

[33] B. North, A. Blake, M. Isard, and J. Rittscher. Learning and classification of complex dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1016–1034, 2000.

[34] V. Pavlovic. Model-based motion clustering using boosted mixture modeling, 2004. Computer Vision and Pattern Recognition.

[35] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion, 2000. In Advances in Neural Information Processing Systems.

[36] F. Pernkopf and J. Bilmes. Discriminative versus generative parameter and structure learning of Bayesian Network Classifiers, 2005. International Conference on Machine Learning.

[37] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, 1999. In A. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans, eds., Advances in Large Margin Classifiers, MIT Press.

[38] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, 1993.

[39] A. Rahimi, T. Darrell, and B. Recht. Learning appearance manifolds from video, 2005. Computer Vision and Pattern Recognition.

[40] J. Rissanen. Hypothesis selection and testing by the MDL principle. The Computer Journal, 42(4):260–269, 1999.

[41] S. Rosset and E. Segal. Boosting density estimation, 2002. In Advances in Neural Information Processing Systems.

[42] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[43] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[44] F. Sha and F. Pereira. Shallow parsing with conditional random fields, 2003. Proceedings of Human Language Technology-NAACL.


[45] F. Sha and L. K. Saul. Large margin hidden Markov models for automatic speech recognition, 2007. In Advances in Neural Information Processing Systems.

[46] C. Sminchisescu and A. Jepson. Generative modeling for continuous non-linearly embedded visual inference, 2004. International Conference on Machine Learning.

[47] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative density propagation for 3D human motion estimation, 2005. Computer Vision and Pattern Recognition.

[48] R. Tanawongsuwan and A. Bobick. Performance analysis of time-distance gait parameters under different speeds, 2003. 4th International Conference on Audio and Video Based Biometric Person Authentication, Guildford, UK.

[49] B. Thiesson, C. Meek, D. M. Chickering, and D. Heckerman. Learning mixtures of DAG models, 1998. Uncertainty in Artificial Intelligence.

[50] T.-P. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth space of feasible solutions, 2005. In Proceedings of the IEEE Workshop on Learning in Computer Vision and Pattern Recognition.

[51] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets, 2005. International Conference on Computer Vision.

[52] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models, 2006. Computer Vision and Pattern Recognition.

[53] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[54] Q. Wang, G. Xu, and H. Ai. Learning object intrinsic structure for robust visual tracking, 2003. Computer Vision and Pattern Recognition.
