
Discriminative Learning of Generative Models for Sequence Classification and Motion Tracking

    Minyoung Kim

    January 2007

Contents

1 Introduction
  1.1 Probabilistic Model-Based Approach
  1.2 Generative vs. Discriminative Models

2 Discriminative Learning of Generative Models
  2.1 Conditional Likelihood Maximization
    2.1.1 CML Optimization
    2.1.2 Example: Classification with Mixtures of Gaussians
    2.1.3 Evaluation on Real Data
  2.2 Margin Maximization

3 Discriminative Learning of Dynamical Systems
  3.1 Linear Dynamical Systems
  3.2 Discriminative Dynamic Models
    3.2.1 Conditional Random Fields
    3.2.2 Maximum Entropy Markov Models
  3.3 Discriminative Learning of LDS
    3.3.1 Conditional Likelihood Maximization (CML)
    3.3.2 Slicewise Conditional Likelihood Maximization
    3.3.3 Extension to Nonlinear Dynamical Systems
  3.4 Related Work
  3.5 Evaluation
    3.5.1 Synthetic Data
    3.5.2 Human Motion Data

4 Recursive Method for Discriminative Learning
  4.1 Discriminative Mixture Learning
  4.2 Related Work
  4.3 Experiments
    4.3.1 Synthetic Experiment
    4.3.2 Experiments on Real Data

5 Future Work and Conclusion


List of Figures

1.1 Graphical Representation: Naive Bayes and Logistic Regression.
1.2 Test errors vs. sample sizes (m) for Naive Bayes (solid lines) and Logistic Regression (dashed lines) on UCI datasets. Excerpted from [32].
1.3 Graphical Representation of HMM and CRF for Sequence Tagging.
1.4 Test error scatter plots on synthetic data comparing HMMs and CRFs in sequence tagging. The open squares represent datasets generated with α < 1/2, and the solid circles those with α > 1/2. Excerpted from [24].
2.1 Asymptotic behavior of ML/CML learning: depending on the initial model, both ML and CML sometimes reach good or bad models.
2.2 The generative models for static classification (TAN) and sequence classification (HMM).
2.3 Static Classification on UCI Data.
2.4 Digit prototypes from generative learning and max-margin discriminative learning. Excerpted from [45].
3.1 Graphical Models: HMM (or LDS), CRF, and MEMM.
3.2 Visualization of estimated sequences for synthetic data, showing the estimated states (for dim-1) at t = 136–148. The ground truth is depicted by the solid (cyan) line, ML by dotted (blue), CML by dotted-dashed (red), and SCML by dashed (black).
3.3 Skeleton snapshots for walking (a–f), picking up a ball (g–l), and running (m–s). The ground truth is depicted by solid (cyan) lines, ML by dotted (blue), SCML by dashed (black), and the latent-variable nonlinear model (LVN) by dotted-dashed (red).


4.1 Data is generated by the distributions in the top panel (+ class in blue/dashed and − class in red/solid). The middle panel shows the weights for the second component, both discriminative w^Dis(c, a) and generative w^Gen(c, a). The bottom panel displays the individual mixture components of the learned models; the generatively learned component f_2^Gen(c, a) is contrasted with the discriminatively learned one, f_2^Dis(c, a).
4.2 Example sequences generated by the true model.
4.3 Test error scatter plots comparing the 7 models from Table 4.2. Each point corresponds to one of the 5 classification problems. For instance, the congregation of points below the main diagonal in the BxCML vs. ML case suggests that BxCML outperforms ML in most of the experimental evaluations. The (red) rectangles indicate the plots comparing BxCML with the others.


List of Tables

2.1 Sequence Classification Test Accuracies (%): for the datasets evaluated with random-fold validation (Gun/Point and GT Gait), the averages and standard deviations are included. The other datasets report average leave-1-out test errors. Note that GT Gait and USF Set2 are multi-class datasets. See Sec. ?? for details.
2.2 Sequence Tagging Test Accuracies (%): leave-1-out test errors.
2.3 MNIST Digit Classification Test Error (%). Excerpted from [45].
3.1 Test errors and log-perplexities for synthetic data.
3.2 Average test errors. The error types are abbreviated with 3 letters: the first indicates smoothed (S) or filtered (F), followed by 2 letters indicating whether the error is measured in the joint angle space (JA) or the 3D articulation point space (3P) (e.g., SJA = smoothed error in the joint angle space). The unit scale for the 3D point space is taken as the height of the human model, ∼25.
4.1 Average test errors (%), log-likelihoods (LL), and conditional log-likelihoods (CLL) on the test data. BBN does not have LL or CLL since it is a non-generative classifier.
4.2 Test errors (%): for the datasets evaluated with random-fold validation (Gun/Point and GT Gait), the averages and standard deviations are included. The other datasets report average leave-1-out test errors. "–" indicates redundancy, since a multi-class method would be applied to binary-class data. (Note that GT Gait and USF Set2 are multi-class datasets.) The boldfaced numbers indicate the lowest test errors, within the margin of significance, for a given dataset.


Abstract

I consider the problem of learning generative probabilistic models (e.g., Bayesian networks) for classification and regression. Since the generative models now serve as target-predicting functions, the learning problem can be treated differently from traditional density estimation. Unlike likelihood-maximizing generative learning, which fits a model to the data as a whole, discriminative learning is an alternative estimation method that optimizes objectives more closely related to the prediction task (e.g., the conditional likelihood of the target variables given the input attributes). The contribution of this work is three-fold. First, for the family of general generative models, I provide a unifying parametric gradient-based optimization method for discriminative learning.
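To make the contrast concrete, the following minimal sketch (my own illustration, not code from this work) compares the two estimates for a toy two-class model with unit-variance class-conditional Gaussians and equal priors: generative ML fits each class mean to its own data, while discriminative CML performs gradient ascent on the conditional log-likelihood sum_i log p(y_i | x_i). A numerical gradient is used only for brevity; a practical implementation would use the closed-form gradient.

# Minimal sketch (illustrative only): generative ML vs. discriminative CML for a
# two-class model with unit-variance class-conditional Gaussians and equal priors.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.5, 200), rng.normal(1.0, 1.5, 200)])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

def cond_log_lik(mu, x, y):
    """Conditional log-likelihood sum_i log p(y_i | x_i) with class means mu[0], mu[1]."""
    log_joint = -0.5 * (x[:, None] - mu[None, :]) ** 2            # log p(x|y=c) + const
    log_post = log_joint - np.logaddexp(log_joint[:, 0], log_joint[:, 1])[:, None]
    return log_post[np.arange(len(x)), y].sum()

# Generative (ML) learning: each class mean is fit to that class's data alone.
mu_ml = np.array([x[y == 0].mean(), x[y == 1].mean()])

# Discriminative (CML) learning: gradient ascent on the conditional log-likelihood,
# initialized at the ML solution (numerical gradient used only for brevity).
mu_cml, lr, eps = mu_ml.copy(), 1e-3, 1e-5
for _ in range(500):
    grad = np.array([(cond_log_lik(mu_cml + eps * np.eye(2)[d], x, y) -
                      cond_log_lik(mu_cml - eps * np.eye(2)[d], x, y)) / (2 * eps)
                     for d in range(2)])
    mu_cml += lr * grad

print("ML  means:", mu_ml,  "CLL:", cond_log_lik(mu_ml, x, y))
print("CML means:", mu_cml, "CLL:", cond_log_lik(mu_cml, x, y))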

In the second part, the method is extended beyond classification with discrete targets to continuous multivariate state domains, resulting in discriminatively learned dynamical systems. This is a very appealing approach to structured state prediction problems such as motion tracking, since discriminative models designed for discrete domains (e.g., Conditional Random Fields or Maximum Entropy Markov Models) are difficult to extend properly to continuous targets. On the CMU motion capture data, I evaluate the generalization performance of the proposed methods on the problem of 3D human pose tracking from monocular video.
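To fix the idea, with notation of my own choosing rather than the thesis's: for a dynamical system with target state sequence y_{1:T} and observation sequence x_{1:T}, generative ML fits the joint density, while the discriminative criteria condition on the observations. The slicewise variant scores each time slice's marginal posterior separately (written here with the smoothed marginal; the filtered form p(y_t | x_{1:t}) is an equally natural choice).

\begin{align*}
\theta_{\mathrm{ML}}   &= \arg\max_{\theta} \sum_{n} \log p\bigl(x^{(n)}_{1:T},\, y^{(n)}_{1:T} \mid \theta\bigr), \\
\theta_{\mathrm{CML}}  &= \arg\max_{\theta} \sum_{n} \log p\bigl(y^{(n)}_{1:T} \mid x^{(n)}_{1:T},\, \theta\bigr), \\
\theta_{\mathrm{SCML}} &= \arg\max_{\theta} \sum_{n} \sum_{t=1}^{T} \log p\bigl(y^{(n)}_{t} \mid x^{(n)}_{1:T},\, \theta\bigr).
\end{align*}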

Despite the improved prediction performance of discriminative learning, parametric gradient-based optimization has certain drawbacks, such as computational overhead and sensitivity to the choice of the initial model. In the third part, I address these issues by introducing a novel recursive method for discriminative learning. The proposed method estimates a mixture of generative models, where the component added at each stage is selected greedily by the criterion of maximizing the conditional likelihood of the new mixture. The approach is highly efficient, as each stage reduces to generative learning of a base generative model on weighted data. Moreover, because the mixture is enhanced recursively, the method is less sensitive to the choice of the initial model. The improved classification performance of the proposed method is demonstrated in an extensive set of evaluations on time-series sequence data, including human motion classification problems.
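A rough sketch of the recursive scheme, under my own simplifying assumptions (1-D class-conditional Gaussian components, a fixed mixing schedule, and a heuristic weighting rule; the actual proposal selects both the new component and the mixture update by maximizing the conditional likelihood of the enlarged mixture):

# Hedged sketch (not the thesis algorithm verbatim): recursive discriminative
# learning of a mixture of simple generative components for binary classification.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1.0, 1.0, 300), rng.normal(1.0, 1.0, 300)])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(300, dtype=int)])

def fit_component(x, y, w):
    """Weighted ML fit of one base model: class priors and class means (unit variance)."""
    return {c: (w[y == c].sum() / w.sum(),
                np.average(x[y == c], weights=w[y == c])) for c in (0, 1)}

def comp_joint(comp, x):
    """p(x, y=c) for c = 0, 1 under one component; returns an (n, 2) array."""
    return np.stack([comp[c][0] * np.exp(-0.5 * (x - comp[c][1]) ** 2) / np.sqrt(2 * np.pi)
                     for c in (0, 1)], axis=1)

mix = np.zeros((len(x), 2))   # the mixture's joint density p(x, y), built stage by stage
w = np.ones(len(x))           # uniform data weights for the first stage
for stage in range(5):
    comp = fit_component(x, y, w)             # generative learning on weighted data
    alpha = 1.0 / (stage + 1)                  # fixed schedule (assumption); the proposal
                                               # instead maximizes the new mixture's CLL
    mix = (1 - alpha) * mix + alpha * comp_joint(comp, x)
    post = mix[np.arange(len(x)), y] / mix.sum(axis=1)   # current p(y_i | x_i)
    w = 1.0 - post + 1e-12                     # upweight poorly predicted cases (heuristic)

print("training error:", np.mean((mix[:, 1] > mix[:, 0]).astype(int) != y))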

Chapter 1

    Introduction

One of the fundamental problems in machine learning is to predict the unknown or unseen property y of an observation x. Depending on the structure of x and y, the problem takes on different names, with applications in pattern recognition, computer vision, natural language processing, and bioinformatics. In this proposal, I am particularly interested in the problems summarized as follows.