

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2017.2666148, IEEE Transactions on Pattern Analysis and Machine Intelligence


Fluid Dynamic Models for Bhattacharyya-based Discriminant Analysis

Yung-Kyun Noh, Jihun Hamm, Frank Chongwoo Park, Byoung-Tak Zhang, and Daniel D. Lee

Abstract—Classical discriminant analysis attempts to discover a low-dimensional subspace where class label information is maximally preserved under projection. Canonical methods for estimating the subspace optimize an information-theoretic criterion that measures the separation between the class-conditional distributions. Unfortunately, direct optimization of the information-theoretic criteria is generally non-convex and intractable in high-dimensional spaces. In this work, we propose a novel, tractable algorithm for discriminant analysis that considers the class-conditional densities as interacting fluids in the high-dimensional embedding space. We use the Bhattacharyya criterion as a potential function that generates forces between the interacting fluids, and derive a computationally tractable method for finding the low-dimensional subspace that optimally constrains the resulting fluid flow. We show that this model properly reduces to the optimal solution for homoscedastic data as well as for heteroscedastic Gaussian distributions with equal means. We also extend this model to discover optimal filters for discriminating Gaussian processes and provide experimental results and comparisons on a number of datasets.

Index Terms—Discriminant Analysis, Dimensionality Reduction, Fluid Dynamics, Gauss Principle of Least Constraint, Gaussian Processes


1 INTRODUCTION

Algorithms for classifying vector data utilize an inner product on the vector space, from which a distance metric is induced to train a classifier. Unfortunately, classification algorithms in high-dimensional spaces can be prone to overfitting when there is a lack of sufficient training data. Even more critically in high-dimensional spaces, operations such as taking the inner product can incur high computational costs. For these and other reasons, it has become almost de rigueur to initially perform dimensionality reduction by projecting the data onto a lower-dimensional subspace in order to extract a more parsimonious representation of the data. For good performance, it is critical that the low-dimensional subspace be chosen to retain as much discriminative information in the data distributions as possible after the projection.

Algorithms for discovering interesting low-dimensional projections of data have been in use for many decades [1], [2], [3]. Projection pursuit is a canonical method for finding a low-dimensional subspace such that the projected data maximize certain statistical properties. One example of such an approach is Fisher Discriminant Analysis (FDA). For the special case of separating two classes of homoscedastic Gaussian data (i.e., the covariances of the class-conditional distributions are the same), it can be shown that the simple criterion used in FDA produces the optimal projection, in the sense that no other subspace can better separate the data in terms of decreasing the Bayes classification error. However, when the data are heteroscedastic (i.e., the covariances of the classes are different), or consist of more than two classes, FDA typically fails to find the optimal subspace.

Recent work on discriminant analysis has focused on trying to find better projection subspaces for these more difficult cases. Instead of using a simple heuristic as employed in FDA, these algorithms attempt to optimize a more sophisticated criterion describing the separation of the projected data. Such criteria are typically motivated by information-theoretic measures that correspond to minimizing the Bayes error; example criteria include the Bhattacharyya/Chernoff coefficient [4], KL-divergence [5], [6], and mutual information [7], [8], [9]. It is worth noting that in these examples, the optimization criterion is a nonlinear function of the projected means and variances. Not only is the optimization problem non-convex (implying the presence of local minima), but it also does not generally yield tractable solutions for the optimal projection matrix. Therefore, previous methods tend to relax the optimization objective by modifying the objective function into an analytically tractable form [4], [8], using graph embedding optimization [10], or breaking the problem into a set of convex problems [11]. However, none of these relaxations can be interpreted in terms of the Bayes optimal solution in the general heteroscedastic Gaussian situation.

In this work, we focus specifically on the Bhattacharyya criterion due to its nice analytical properties, and we analyze the subspace solution minimizing this criterion. In the process, we take into account two known analytical solutions given by Fukunaga [12]. The two solutions appear as special cases involving class-conditional Gaussian densities: (i) when the covariance matrices are equal (homoscedastic), and (ii) when the two means are the same. These two cases give rise to optimal solutions that can be obtained via an eigenvector decomposition. In these situations, it is straightforward to show that the spectral solutions also minimize the Bayes classification error.

Although the optimal solution is clear for these two special cases, when dealing with the general Bhattacharyya criterion, the corresponding analytic solution cannot be derived.


Fig. 1. Low dimensional fluid flow fields to find the linear subspace that maximally separates the class-conditional densities. The optimal projection directions are intuitively related to the largest flow directions. Schematically, in the left figure, the class densities repel each other in the low-dimensional direction to maximize the separation. In the right figure, where the two densities overlap, the densities are squeezed in directions to minimize the overlap. We show how to analytically estimate these optimal directions and how they can be used for discriminant analysis.

Rather than directly optimizing the projected statistics, we consider a radically different physics-based approach. We model the data as a fluid mechanical system where dense fluids interact with each other according to a potential function given by the Bhattacharyya coefficient. The fluids act to decrease their potential energy, giving rise to a force field that tries to separate the interacting fluids. The resulting fluid flow contains important information about the directions that increase the separation of the data classes. By applying an appropriate low-dimensional physical constraint on the fluid flow, the optimal motion subspace that preserves the flow energy can be derived, as shown in Fig. 1. The optimal flow-constrained low-dimensional subspace can be determined in terms of an eigenvector decomposition. In the limit of Gaussian densities with equal covariances or equal means, the resulting subspace is equivalent to the Bayes-optimal subspace for classification.

The idea of using a mechanical system analogy has appeared in previous work on discriminant analysis. For example, a force analogy and an information-theoretic potential function have been used to model the separation of different density functions [13], [14]. This method required gradient-based algorithms applied to pairwise potentials, which are computationally burdensome and cannot be directly related to the optimal homoscedastic and equal mean solutions [12]. Rather than considering pairwise potentials, the present work models the force field as arising from interacting continuous densities using the Gauss Principle of Least Constraint. The resulting force field is then consistent with continuum fluid flow models, and can be related to the optimal analytic solutions for the homoscedastic and equal mean cases.

In experiments on benchmark datasets, the proposed algorithm achieves among the best accuracies on most datasets. For a more thorough comparison, we also include a kernelized discriminant analysis method; on many datasets the proposed method is not outperformed even by this kernelized algorithm.

The proposed fluid discriminant model is also extended to discriminate infinite-dimensional Gaussian processes. We give the analogue of Fukunaga's two analytic solutions in the Gaussian process setting, and present a method for obtaining an optimal linear filter that discriminates between two Gaussian processes using the fluid flow model. We present discrimination results showing the efficacy of this method on motion capture data and stock-market index sequence classification.

The remainder of the paper is organized as follows. Section 2 reviews discriminant analysis based on the Bhattacharyya criterion and Fukunaga's solutions. In Section 3, we derive our discrimination algorithm by optimizing a constrained low-rank fluid flow under the Bhattacharyya interaction potential; for the special cases, we compare our model solutions with the analytic solutions. In Section 4, we compare and contrast experimental results on a number of machine learning datasets, and show the application to Gaussian processes in Section 5. Finally, we conclude with a discussion in Section 6.

2 DISCRIMINANT ANALYSIS FOR CLASSIFICATION

In this section, we review previous work on discriminant analysis from the perspective of minimizing the Bayes classification error. In particular, methods of minimizing surrogate objective functions are introduced, including the Jensen-Shannon divergence and the Bhattacharyya coefficient, and a brief analysis of direct minimization of the Bhattacharyya coefficient is provided.

2.1 Previous discriminant analysis methods

One canonical method of seeking the optimal projection preserving class labels is Fisher's linear discriminant analysis (FDA) [15], [16], [17]. In FDA, a subspace is obtained where the mean separation is maximized while the variance within classes is minimized at the same time. This method is able to achieve the Bayes optimal solution for homoscedastic situations and is also closely related to optimizing the Bhattacharyya coefficient [12]. However, many pathological cases can be found where FDA completely fails to find any relevant solution, such as the heteroscedastic situation with equal means.

A more relevant approach to this problem is to directly minimize the theoretical Bayes error within the projected subspace,

$$J_1 = \frac{1}{2}\int \min[p_1(x), p_2(x)]\,dx, \qquad (1)$$


for two classes having projected distributions p1(x) and p2(x), and a uniform prior. Discriminant analysis with labeled data seeks to find a low dimensional subspace where the projected data can be easily classified. Given a subspace, a Bayes-optimal classifier will achieve the error in Eq. (1) on the projected data distributions.

A common formulation for obtaining a projection space with lower Bayes error considers a projection matrix W ∈ R^{D×d}, where D and d (D > d) are the respective dimensions of the original space and the projected space. When the projected distributions are modeled as Gaussians, the means are W^⊤µ1 and W^⊤µ2, and the covariance matrices are W^⊤Σ1W and W^⊤Σ2W, respectively, where µ1 and µ2 are the means and Σ1 and Σ2 are the covariance matrices of the data in the original D-dimensional space. The projection matrix can be written as the collection of independent column vectors W = [w1, w2, ...] for w_i ∈ R^D, i = 1, ..., d. Unfortunately, directly minimizing J1 with respect to W is intractable because of the non-analytic behavior of the min function.

To address this issue, information-theoretic approaches have been introduced. The Jensen-Shannon (JS) divergence, quadratic mutual information, and the Bhattacharyya distance have been suggested as the actual objective functions to optimize due to their relationship with the optimal Bayes error. The JS divergence and the Bhattacharyya distance yield upper bounds on the Bayes error, and the upper bound is minimized by maximizing the JS divergence [8], [18] or by maximizing the Bhattacharyya distance [18], [19]. It has also been shown that maximizing these surrogate criteria also minimizes a lower bound on the Bayes error [19], [20], [21], [22], which means that the Bayes error cannot be small unless the JS divergence or Bhattacharyya distance is large enough. The relationship between the Bayes error and quadratic mutual information is given in [23].

Unfortunately, direct optimization of the JS divergence is still intractable even for two Gaussian densities because of the non-integrable log-sum terms. Instead of using the exact JS divergence, a modified objective function approximating the Gaussian mixture as a single Gaussian has been used [7], [8], [24], resulting in a tractable approximate solution. An alternative surrogate criterion, quadratic mutual information, which uses Renyi entropy instead of Shannon entropy, has also been introduced as a nonparametric surrogate for the Bayes error [14], [23]. The resulting criterion uses pairwise potentials, and the algorithm relies upon local optimization methods such as gradient descent.

Some approaches use the Kullback-Leibler (KL) divergence between the projected class-conditional densities [5], [25], [6]. Similar to FDA, this line of work seeks to find the subspace with maximally separated classes, but does not fully incorporate heteroscedastic information due to the asymmetry of the KL divergence. Nonparametric methods have also been employed; Nonparametric Component Analysis (NCA) uses nearest neighbor information to compute the separation between classes in the projected data [26].

The Bhattacharyya distance is integrable, and for a number of cases its analytic form is well known and interpretable. Direct optimization using gradient methods has previously been tried [24], but the inherent non-convex structure of the problem precludes guarantees on the optimality of the resulting solution. Modifications to this objective function have also been attempted to make the optimization more tractable [4]. In our work, we focus on the Bhattacharyya distance, but utilize it to define flow fields for discriminant analysis. This approach can then be related to minimizing the Bayes error in a number of situations. When the subspace with minimum Bayes error has been found, a classifier can achieve good performance due to better generalization in the low dimensional space.

2.2 Minimizing the Bhattacharyya criterion

The Bhattacharyya coefficient $\int\sqrt{p_1(x)p_2(x)}\,dx$ can be related to the Bayes error through the inequality $\min[p_1(x), p_2(x)] \le \sqrt{p_1(x)p_2(x)}$. The negative log of the Bhattacharyya coefficient is called the Bhattacharyya distance, and its integrated form can be represented by the sum of two simple terms:

$$J_2(W) = \frac{1}{8}\,\mathrm{tr}\!\left[(W^\top S_W W)^{-1} W^\top S_B W\right] + \frac{1}{2}\ln\frac{\left|\frac{1}{2}(W^\top\Sigma_1 W + W^\top\Sigma_2 W)\right|}{|W^\top\Sigma_1 W|^{1/2}\,|W^\top\Sigma_2 W|^{1/2}} \qquad (2)$$

where $S_W$ and $S_B$ are defined as $S_W = \frac{\Sigma_1+\Sigma_2}{2}$ and $S_B = \Delta\mu\Delta\mu^\top$ for $\Delta\mu = \mu_1 - \mu_2$.

This criterion can be considered as the separation between two projected distributions and is a bound on the non-integrable J1. In general, optimizing this criterion requires numerical techniques such as gradient ascent [24], [27]. However, there are some special cases where the global optimum of the Bhattacharyya criterion can be analytically derived. In these cases, previous work showed that the solutions can be obtained by solving an eigenvector problem. We first review this analysis and describe how the projection bases can be obtained sequentially by deflation.
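To make the criterion concrete, here is a minimal numerical sketch (ours, not from the original paper) that evaluates Eq. (2) for a candidate projection W, assuming the class means and covariances have already been estimated:

```python
import numpy as np

def bhattacharyya_distance(W, mu1, mu2, Sigma1, Sigma2):
    """Evaluate the projected Bhattacharyya distance J2(W) of Eq. (2)."""
    dmu = W.T @ (mu1 - mu2)                      # projected mean difference
    S1, S2 = W.T @ Sigma1 @ W, W.T @ Sigma2 @ W  # projected covariances
    Sw = 0.5 * (S1 + S2)                         # projected average covariance
    term1 = 0.125 * dmu @ np.linalg.solve(Sw, dmu)
    _, logdet_Sw = np.linalg.slogdet(Sw)
    _, logdet_S1 = np.linalg.slogdet(S1)
    _, logdet_S2 = np.linalg.slogdet(S2)
    term2 = 0.5 * (logdet_Sw - 0.5 * logdet_S1 - 0.5 * logdet_S2)
    return term1 + term2
```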

2.2.1 Analytic solutions for the Bhattacharyya criterion

There are two cases when exact solutions optimizing the Bhattacharyya criterion can be analytically derived [12]: the homoscedastic condition where Σ1 = Σ2, and the equal mean case where µ1 = µ2. When the two covariances are the same, the optimal solution reduces to finding W which maximizes the first term tr[(W^⊤ S_W W)^{-1} W^⊤ S_B W] of the Bhattacharyya distance in Eq. (2). This is exactly the criterion used in Fisher Discriminant Analysis (FDA), whose solution is given by the generalized eigenvector problem,

$$S_B W = S_W W \Lambda, \qquad (3)$$

with diagonal eigenvalue matrix Λ. On the other hand, when the two means are the same, the first term disappears, and we can maximize the second term: $\ln\frac{\left|\frac{1}{2}(W^\top\Sigma_1 W + W^\top\Sigma_2 W)\right|}{|W^\top\Sigma_1 W|^{1/2}|W^\top\Sigma_2 W|^{1/2}}$. A simple calculation shows that the optimal solution is another eigenvector problem:

$$(\Sigma_1^{-1}\Sigma_2 + \Sigma_2^{-1}\Sigma_1 + 2I)\,W = W\Lambda. \qquad (4)$$

It was previously noted that $\Sigma_1^{-1}\Sigma_2$ and $\Sigma_2^{-1}\Sigma_1$ share the same eigenvectors due to their inverse relationship, and the solution can be obtained by solving the eigenvector problem $\Sigma_2 W = \Sigma_1 W \Lambda'$ and choosing eigenvectors according


to the magnitude of the values $\lambda'_i + \frac{1}{\lambda'_i}$ [12]. It can be shown that this solution optimizes the Bayes error criterion as well.

Thus, in these two special cases, we can determine the subspace that optimally minimizes the Bayes classification error. The subspace is found by minimizing the objective J2, and this reduces to a generalized eigenvector problem in these cases.
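For illustration, the two special-case solutions can be computed with a standard generalized eigensolver. The following sketch (ours) implements Eq. (3) for the homoscedastic case and the Σ2W = Σ1WΛ′ form of the equal mean case, ranking eigenvectors by λ′ + 1/λ′ as described above:

```python
import numpy as np
from scipy.linalg import eigh

def homoscedastic_subspace(mu1, mu2, Sigma1, Sigma2, d):
    """Eq. (3): S_B W = S_W W Lambda (the FDA criterion)."""
    Sw = 0.5 * (Sigma1 + Sigma2)
    dmu = (mu1 - mu2).reshape(-1, 1)
    Sb = dmu @ dmu.T
    vals, vecs = eigh(Sb, Sw)                  # generalized symmetric eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:d]]

def equal_mean_subspace(Sigma1, Sigma2, d):
    """Solve Sigma2 W = Sigma1 W Lambda' and rank by lambda' + 1/lambda'."""
    vals, vecs = eigh(Sigma2, Sigma1)
    score = vals + 1.0 / vals
    return vecs[:, np.argsort(score)[::-1][:d]]
```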

2.2.2 Deflation methods

Here we show how the optimal W can be sequentially determined using a deflation scheme under certain conditions. The columns of W must obey certain properties such as orthogonality. The deflation method recursively obtains the solution vectors by progressively decreasing the dimensionality of the system, projecting out previous solution vectors. For example, having one eigenvector solution $w_i$ of a symmetric matrix $M \in R^{D\times D}$ with an eigenvalue $\alpha_i$, we can use the deflated matrix $M - \alpha_i w_i w_i^\top$ to find other eigenvectors, because M can be decomposed into the sum of separated terms $M = \sum_i \alpha_i w_i w_i^\top$. The canonical use of deflation is to find a set of d vectors $w_1, \ldots, w_d$ that maximizes $\sum_{i=1}^{d} \frac{w_i^\top M w_i}{w_i^\top w_i}$, whose solution is the set of d leading eigenvectors. By finding the first leading eigenvector $w_1$ and deflating the matrix, we can reduce the problem to maximizing $\sum_{i=2}^{d} \frac{w_i^\top (M - \alpha_1 w_1 w_1^\top) w_i}{w_i^\top w_i}$ with d − 1 vectors, which can be applied recursively to find the resulting set of d eigenvectors [28].

Unfortunately, for the general Bhattacharyya criterion, the deflation method cannot be applied because the objective function cannot be separated into a sum over the columns of W.¹ Therefore, the general solution needs to be written as an optimization over the (d,D) Grassmann manifold [29], [30].

However, in the two special cases, the optimal bases can be obtained using the deflation method. For the homoscedastic case, the Bhattacharyya objective function can be decomposed using separate solution vectors $w_i$ as

$$\mathrm{Tr}\left[(W^\top S_W W)^{-1} W^\top S_B W\right] = \sum_{i=1}^{d}\frac{w_i^\top S_B w_i}{w_i^\top S_W w_i}. \qquad (5)$$

In this case, the deflation method eliminates the component $w_i w_i^\top$ from $S_B$ using the previously found $w_i$ vector, resulting in a deflated $S_B$: $S_B \leftarrow S_B - \left(\frac{w_i^\top S_B w_i}{w_i^\top S_W w_i}\right) S_W w_i w_i^\top S_W$. In the homoscedastic case, $S_B$ is updated recursively, and the leading w vectors are found while keeping them $S_W$- and $S_B$-orthogonal.

For the equal mean case, we can express the following determinant using the product of eigenvalues,

$$\left|(W^\top\Sigma_1 W)^{-1}W^\top\Sigma_2 W + I\right| = \prod_{i=1}^{d}\left(1 + \frac{w_i^\top\Sigma_2 w_i}{w_i^\top\Sigma_1 w_i}\right). \qquad (6)$$

1. In other words, the objective function $J_2(W)$ of the projection matrix W cannot be reformulated as $J_2(W) = \sum_{i=1}^{d} j_2(w_i)$, a sum of separate objective functions $j_2(\cdot)$ over the columns $w_i$.

This decomposition enables the equal mean solution to be expressed as

$$\ln\left(\left|(W^\top\Sigma_1 W)^{-1}W^\top\Sigma_2 W + I\right|\cdot\left|(W^\top\Sigma_2 W)^{-1}W^\top\Sigma_1 W + I\right|\right) = \sum_{i=1}^{d}\ln\left(\frac{w_i^\top\Sigma_2 w_i}{w_i^\top\Sigma_1 w_i} + \frac{w_i^\top\Sigma_1 w_i}{w_i^\top\Sigma_2 w_i} + 2\right) \qquad (7)$$

for W both Σ1- and Σ2-orthogonal. As discussed in the previous section, maximization of Eq. (7) yields a set of $w_i$ which optimizes either $\sum_{i=1}^{d}\frac{w_i^\top\Sigma_2 w_i}{w_i^\top\Sigma_1 w_i}$ or $\sum_{i=1}^{d}\frac{w_i^\top\Sigma_1 w_i}{w_i^\top\Sigma_2 w_i}$. To maximize the first objective, the deflation defines a new $\Sigma_2 \leftarrow \Sigma_2 - \left(\frac{w_i^\top\Sigma_2 w_i}{w_i^\top\Sigma_1 w_i}\right)\Sigma_1 w_i w_i^\top\Sigma_1$ recursively to find the following $w_i$. Alternatively, for the second objective, the deflation updates Σ1 using $\Sigma_1 \leftarrow \Sigma_1 - \left(\frac{w_i^\top\Sigma_1 w_i}{w_i^\top\Sigma_2 w_i}\right)\Sigma_2 w_i w_i^\top\Sigma_2$.
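The homoscedastic deflation above can be summarized in a few lines. The sketch below (ours, assuming S_W is positive definite) repeatedly extracts the leading generalized eigenvector of Eq. (5) and deflates S_B with the update just described:

```python
import numpy as np
from scipy.linalg import eigh

def deflated_homoscedastic_bases(SB, SW, d):
    """Sequentially find d vectors maximizing Eq. (5), deflating S_B after each."""
    SB = SB.copy()
    ws = []
    for _ in range(d):
        vals, vecs = eigh(SB, SW)
        w = vecs[:, np.argmax(vals)]                 # leading generalized eigenvector
        ratio = (w @ SB @ w) / (w @ SW @ w)
        SB = SB - ratio * np.outer(SW @ w, SW @ w)   # S_B <- S_B - ratio * S_W w w' S_W
        ws.append(w)
    return np.stack(ws, axis=1)
```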

3 INTERACTING FLUID MODEL

We have seen that for discriminating heteroscedastic, non-equal mean Gaussians, there is no general analytic solution that directly optimizes the Bhattacharyya criterion. In this section, we propose a tractable method for discriminant analysis based on an analogous physical model. This model is able to capture differences in both the means and covariances of the class conditional distributions. The distributions are modeled as continuous fluids which interact via a force field that minimizes the Bhattacharyya coefficient. A low dimensional subspace is determined by considering the dominant directions of this force field using the Gauss principle of least constraint. We show that this solution converges to the optimal Bayes solutions when the distributions are homoscedastic or have equal means, and successfully interpolates between these two special cases for generic data distributions.

3.1 Fluid densities

In our model, probability density functions are interpreted as continuous mass densities where each point in the distribution can flow in different directions. The resulting flow will attempt to reduce the Bhattacharyya coefficient, and thus will contain information about the dominant directions that separate the interacting distributions.

The mass density functions are modeled as Gaussians having means µc and covariance matrices Σc for classes c ∈ {1, 2}:

$$\rho_c(x) = \frac{1}{\sqrt{2\pi}^{\,D}\,|\Sigma_c|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_c)^\top\Sigma_c^{-1}(x-\mu_c)} \qquad (8)$$

for x ∈ R^D. The fluid mass is distributed over space according to this density. Under flow, the mass must be conserved, giving rise to the following equation of continuity at every point x:

$$\frac{\partial\rho_c}{\partial t} + \nabla\cdot(\rho_c v) = 0. \qquad (9)$$


3.2 Force field

We define the following potential energy functional:

$$U(\rho_1,\rho_2) = \int\sqrt{\rho_1\rho_2}\,dx, \qquad (10)$$

which is directly related to the Bhattacharyya coefficient. Minimizing this potential functional minimizes the overlap between the interacting fluids and maximizes the separation of the classes.

The fluids will flow via induced forces to minimize this potential energy. The resulting force field can be written using the equation of continuity (9) and the relation dU = −∫(F · ds) dx, where ds is the displacement vector that the local mass has moved under F, and dx is the volume element for integration. Assuming an infinitesimal change in the mass distribution ρ2(x) with fixed ρ1(x) gives the resulting change in potential:

$$\frac{dU}{dt} = \int\frac{\partial}{\partial t}\left(\sqrt{\rho_1\rho_2}\right)dx \qquad (11)$$
$$\;\;\;\;= \int\frac{1}{2}\sqrt{\frac{\rho_1}{\rho_2}}\,\frac{\partial\rho_2}{\partial t}\,dx \qquad (12)$$

Time derivatives can be related to spatial derivatives using the equation of continuity $\frac{\partial\rho_2}{\partial t} + \nabla\cdot(\rho_2 v) = 0$:

$$\frac{dU}{dt} = \frac{1}{2}\int\sqrt{\frac{\rho_1}{\rho_2}}\left[-\nabla\cdot(\rho_2 v)\right]dx. \qquad (13)$$

Integration by parts yields the equation

$$\frac{dU}{dt} = \frac{1}{2}\int\left(\rho_2\nabla\sqrt{\frac{\rho_1}{\rho_2}}\right)\cdot v\,dx \qquad (14)$$

where the local fluid velocity is given by v. The power, or instantaneous rate of change of energy, can be ascribed to motion in a force field: dU/dt = −F · v. Thus, the force field acting on class 2 at position x is given by:

$$F_2(x) = -\frac{1}{2}\,\rho_2\,\nabla\sqrt{\frac{\rho_1}{\rho_2}}. \qquad (15)$$

For Gaussian distributions, the analytic form can be computed:

$$F_2(x) = \frac{1}{4}\,C_1\, e^{-\frac{1}{4}(x-\mu'_+)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})(x-\mu'_+)}\,(\Sigma_1^{-1}-\Sigma_2^{-1})(x-\mu'_-) \qquad (16)$$

with constants

$$C_1 = \frac{e^{-\frac{1}{4}\Delta\mu^\top(\Sigma_1+\Sigma_2)^{-1}\Delta\mu}}{(2\pi)^{D/2}|\Sigma_1|^{1/4}|\Sigma_2|^{1/4}} \qquad (17)$$
$$\mu'_+ = \left(\Sigma_1^{-1}+\Sigma_2^{-1}\right)^{-1}\left(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\right) \qquad (18)$$
$$\mu'_- = \left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)^{-1}\left(\Sigma_1^{-1}\mu_1-\Sigma_2^{-1}\mu_2\right), \qquad (19)$$

using ∆µ = µ1 − µ2. The force field describes a force vector at every point in space. In Fig. 2, we illustrate the resulting force field for different configurations of Gaussian density functions. When two density functions are separated as in Fig. 2(a), the force field tries to minimize the overlap by translationally repelling the densities. However, when the densities significantly overlap with unequal covariances as in Fig. 2(b), the force field compresses and squeezes the fluid distributions.

The same analysis can be applied to derive the force field F1(x) acting on the class 1 density, and it is easy to see that F1(x) = −F2(x) for every point x, thus obeying Newton's third law.
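For reference, a direct numerical transcription (ours) of Eqs. (16)-(19); it assumes that Σ1^{-1} − Σ2^{-1} is invertible so that µ′− is well defined:

```python
import numpy as np

def force_on_class2(x, mu1, mu2, Sigma1, Sigma2):
    """Evaluate the force field F2(x) of Eq. (16) for two Gaussian fluids."""
    D = len(x)
    S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    dmu = mu1 - mu2
    C1 = np.exp(-0.25 * dmu @ np.linalg.solve(Sigma1 + Sigma2, dmu)) / (
        (2 * np.pi) ** (D / 2)
        * np.linalg.det(Sigma1) ** 0.25 * np.linalg.det(Sigma2) ** 0.25)   # Eq. (17)
    mu_plus = np.linalg.solve(S1inv + S2inv, S1inv @ mu1 + S2inv @ mu2)    # Eq. (18)
    mu_minus = np.linalg.solve(S1inv - S2inv, S1inv @ mu1 - S2inv @ mu2)   # Eq. (19)
    quad = (x - mu_plus) @ (S1inv + S2inv) @ (x - mu_plus)
    return 0.25 * C1 * np.exp(-0.25 * quad) * ((S1inv - S2inv) @ (x - mu_minus))
```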

3.3 Gauss principle of least constraint

When constraints are applied to a physical system with many interacting particles, the Gauss principle of least constraint is a useful method to derive the equations of motion for the particles. The principle states that the motion follows the trajectory having the least amount of constraint force. With interacting forces and constraints, the principle describes the resulting dynamics as minimizing the objective function $\frac{1}{2}\sum_i m_i\left|\frac{F_i}{m_i}-\ddot{x}_i\right|^2$, where $\ddot{x}_i$ is the acceleration of mass $m_i$ satisfying the imposed constraints.

The constraint force is the difference between the resultant force governing the actual acceleration and the applied force, $m\ddot{x} - F$, and the objective is a weighted combination of the squared magnitudes of the constraint forces. For continuous fluids, the objective becomes an integral of constraint forces over space, where the mass density function ρ(x) is used in place of point masses:

$$L = \frac{1}{2}\int\rho(x)\left|\ddot{x} - \frac{F(x)}{\rho(x)}\right|^2 dx \qquad (20)$$

The resulting constrained motion of the fluid is determined by optimizing Eq. (20).

3.4 Uniform translational movement

We first assume the simple constraint that the fluid is a rigid body having only translational motion. In this case, the acceleration flow field is constant over all of space, $\ddot{x} = w$. Interestingly, minimizing the objective function

$$L(w) = \frac{1}{2}\int\rho_c(x)\left(w - \frac{F_c(x)}{\rho_c(x)}\right)^2 dx \qquad (21)$$

with respect to the uniform acceleration field w for class c = 2 yields the FDA direction:

$$w = \int F\,dx = C_2\left(\frac{\Sigma_1+\Sigma_2}{2}\right)^{-1}\Delta\mu, \qquad (22)$$

and for class 1, $w = C_2\left(\frac{\Sigma_1+\Sigma_2}{2}\right)^{-1}(-\Delta\mu)$, in the opposite direction to class 2. Here, the constant is $C_2 = \frac{1}{4}\left(\frac{|\Sigma_1|^{1/2}|\Sigma_2|^{1/2}}{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}\right)^{1/2} e^{-\frac{1}{4}\Delta\mu^\top(\Sigma_1+\Sigma_2)^{-1}\Delta\mu}$. From this example, we see that rigid-body translational motion under the Bhattacharyya potential is equivalent to the FDA solution.
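A short numerical transcription (ours) of Eq. (22): integrating the force on class 2 gives a vector proportional to the FDA direction.

```python
import numpy as np

def translational_acceleration(mu1, mu2, Sigma1, Sigma2):
    """Rigid-body acceleration of class 2 from Eq. (22)."""
    dmu = mu1 - mu2
    Savg = 0.5 * (Sigma1 + Sigma2)
    C2 = 0.25 * np.sqrt(np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2))
                        / np.linalg.det(Savg)) \
         * np.exp(-0.25 * dmu @ np.linalg.solve(Sigma1 + Sigma2, dmu))
    return C2 * np.linalg.solve(Savg, dmu)
```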

3.5 Low rank affine acceleration

The next constraint we consider is a low-rank affine constraint. This constraint introduces a low-rank rectangular matrix W ∈ R^{D×d} for d < D to define the acceleration field as $\ddot{x} = W a(x)$, where a(x) ∈ R^d. The amount of motion a_c(x) of class c ∈ {1, 2} is an affine function that can be expressed as a_c(x) = U_c^⊤ x_e, where x_e is an extended vector of x, x_e = [x^⊤ 1]^⊤, and U_c ∈ R^{(D+1)×d} is the acceleration coefficient matrix. This constrains the transformation matrix W U_c^⊤ to be a low-rank affine transformation.



Fig. 2. The force field F2(x) on class 2 induced by class 1 for different configurations of two Gaussians. The field is a repulsive force pushing class 2 in (a), and compresses the fluid in (b). This behavior illustrates the importance of considering both the locations and shapes of the distributions in our fluid model.

In our setting, W is shared by the different classes. The Gauss principle then yields the objective function

$$L = \frac{1}{2}\sum_{c=1}^{2}\int\rho_c\left(W U_c^\top x_e - \frac{F_c}{\rho_c}\right)^2 dx. \qquad (23)$$

With the force field defined in (15), the low-rank matrix W can be obtained by minimizing this objective function. The solution subspace is spanned by the leading eigenvectors of the following symmetric matrix,

$$\sum_{c=1}^{2}\langle F_c x_e^\top\rangle\,\langle x_e x_e^\top\rangle_{\rho_c}^{-1}\,\langle x_e F_c^\top\rangle. \qquad (24)$$

Sufficient statistics for this computation are given by the correlations $\langle x_e x_e^\top\rangle_{\rho_c} = \int\rho_c\,x_e x_e^\top dx$ and $\langle x_e F_c^\top\rangle = \int x_e F_c^\top dx$.

The multiway extension of this model to C > 2 classes is straightforward. The Gauss principle is extended to the sum of all constraint forces on the C classes with a shared projection matrix W. The force on each density c ∈ {1, . . . , C} is given by summing the interactions from the other classes:

$$L = \frac{1}{2}\sum_{c=1}^{C}\int\rho_c(x)\left(W U_c^\top x_e - \frac{F_c(x)}{\rho_c(x)}\right)^2 dx$$
$$F_c = \sum_{c'\neq c} F_{c'\to c} = -\frac{1}{2}\,\rho_c\sum_{c'\neq c}\nabla\sqrt{\frac{\rho_{c'}}{\rho_c}}$$

The extended analysis results in the optimal matrix W given by the leading eigenvectors of the following symmetric matrix:

$$\sum_{c=1}^{C}\langle F_c x_e^\top\rangle\,\langle x_e x_e^\top\rangle_{\rho_c}^{-1}\,\langle x_e F_c^\top\rangle \qquad (25)$$

The statistics $\langle x_e x_e^\top\rangle_{\rho_c}^{-1}$ and $\langle F_c x_e^\top\rangle$ of class c in (25) can be represented using the means and covariances $\mu_i$ and $\Sigma_i$, $i \in \{1,\ldots,C\}$, as

$$\langle x_e x_e^\top\rangle_{\rho_c}^{-1} = \begin{pmatrix}\Sigma_c + \mu_c\mu_c^\top & \mu_c\\ \mu_c^\top & 1\end{pmatrix}^{-1} = \begin{pmatrix}\Sigma_c^{-1} & -\Sigma_c^{-1}\mu_c\\ -\mu_c^\top\Sigma_c^{-1} & \mu_c^\top\Sigma_c^{-1}\mu_c + 1\end{pmatrix} \qquad (26)$$

and

$$\langle x_e F_c^\top\rangle = \sum_{k\neq c} 2C_2\begin{pmatrix}\Sigma_c - \Sigma_k + \{\Sigma_c(\Sigma_k+\Sigma_c)^{-1}\mu_k + \Sigma_k(\Sigma_k+\Sigma_c)^{-1}\mu_c\}(-\Delta\mu)^\top\\ (-\Delta\mu)^\top\end{pmatrix}(\Sigma_k+\Sigma_c)^{-1} \qquad (27)$$

where the constant $C_2$ is the same as in Eq. (22).

3.5.1 Special cases

Previously, we noted two special scenarios where optimal analytic solutions exist for Gaussian density functions: one is the homoscedastic case; the other is the heteroscedastic case with equal means. We compare the solution of our fluid model in Eq. (24) to the analytic optimal solutions.

For two classes, the fluid solution satisfies $\langle x_e F_1^\top\rangle = -\langle x_e F_2^\top\rangle$, and we can show that Eq. (24) can be represented as the sum of two terms involving µ1, µ2, Σ1, and Σ2:

$$4C_2^2\Big[(\Sigma_1+\Sigma_2)^{-1}\left(\Sigma_1\Sigma_2^{-1} + \Sigma_2\Sigma_1^{-1} - 2I\right) + \left(\Delta\mu^\top(\Sigma_1+\Sigma_2)^{-1}\Delta\mu + 2\right)\cdot(\Sigma_1+\Sigma_2)^{-1}\Delta\mu\Delta\mu^\top(\Sigma_1+\Sigma_2)^{-1}\Big] \qquad (28)$$
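Putting the two-class pieces together, a minimal sketch (ours) of the fluid discriminant: form the symmetric matrix of Eq. (28) (dropping the overall factor 4C2², which does not change the eigenvectors) and take its leading eigenvectors as the projection W.

```python
import numpy as np

def fluid_discriminant_subspace(mu1, mu2, Sigma1, Sigma2, d):
    """Leading eigenvectors of the two-class fluid-model matrix of Eq. (28)."""
    D = len(mu1)
    dmu = (mu1 - mu2).reshape(-1, 1)
    Sinv = np.linalg.inv(Sigma1 + Sigma2)
    covariance_part = Sinv @ (Sigma1 @ np.linalg.inv(Sigma2)
                              + Sigma2 @ np.linalg.inv(Sigma1) - 2 * np.eye(D))
    mean_part = float(dmu.T @ Sinv @ dmu + 2) * (Sinv @ dmu @ dmu.T @ Sinv)
    M = covariance_part + mean_part
    M = 0.5 * (M + M.T)                      # symmetrize against round-off
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:d]]
```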

If the two covariance matrices Σ1 and Σ2 are the same, the first term disappears, and the solution becomes the eigenvector of this rank-1 symmetric matrix:

$$(\Sigma_1+\Sigma_2)^{-1}\Delta\mu\Delta\mu^\top(\Sigma_1+\Sigma_2)^{-1} \qquad (29)$$



Fig. 3. For different configurations of two-dimensional Gaussians in (a) and (b), the solutions of the fluid model are shown as a plot of angles θ compared to the optimal angles obtained by directly optimizing the Bayes and Bhattacharyya criteria. In both examples (a) and (b), the mean separation between the two classes is varied, and the solutions converge to the FDA solution as this separation goes to infinity. On the other hand, when the mean separation goes to zero, the solutions converge to the optimal equal mean analysis. The fluid model is able to smoothly interpolate between these two regimes for intermediate values of the mean separation.

The eigenvector of the only nonzero eigenvalue is w = (Σ1 + Σ2)^{-1}∆µ, which is equivalent to the FDA solution.

Otherwise, when µ1 = µ2, all terms including ∆µ disappear, and the resulting orthogonal eigenvectors are also the solutions of the following rearranged equation:

$$\left(\Sigma_1\Sigma_2^{-1} + \Sigma_2\Sigma_1^{-1} - 2I\right)W = (\Sigma_1+\Sigma_2)\,W\Lambda. \qquad (30)$$

This equation has a similar form to the eigenvector equation for the equal mean case analyzed previously. If we compare to (4), we can see the solutions are equivalent if Σ1 and Σ2 commute and the sum of Σ1 and Σ2 is isotropic: Σ1 + Σ2 = αI. If the symmetric matrices Σ1 and Σ2 commute, there is an orthonormal matrix W (satisfying W^⊤W = I) that diagonalizes both Σ1 and Σ2. Therefore, we can write W^⊤Σ1W = D1 and W^⊤Σ2W = D2 for two diagonal matrices D1 and D2. Solving Eq. (30) then amounts to finding the w_i's, the columns of W, with eigenvalues $\frac{1}{d_{1i}+d_{2i}}\left(\frac{d_{1i}}{d_{2i}} + \frac{d_{2i}}{d_{1i}} - 2\right)$, where $d_{1i}$ and $d_{2i}$ are the i-th diagonal elements of D1 and D2. Up to an additive constant this equals $\frac{1}{d_{1i}} + \frac{1}{d_{2i}}$ when $d_{1i} + d_{2i} = \alpha$, i.e., $w_i^\top(\Sigma_1+\Sigma_2)w_i = \alpha$, giving the same solution as Fukunaga's equal mean solution in Section 2.

In Fig. 3, we plot the optimal discriminating direction (angle) of our algorithm for Gaussian distributions in two dimensions. For the two configuration examples in Fig. 3(a) and (b), the optimal directions are plotted while varying the distance between the means of the two classes. Along with our fluid model solution, the directions of maximal Bhattacharyya distance and minimal Bayes error are plotted for comparison; these optimal angles are exhaustively computed by scanning all possible directions. When the mean separation is small, the solution of our method approximates the equal mean solution, and as the distance between the two means grows larger, the solution of our method approaches the FDA solution. In the intermediate regime, our method smoothly approximates the direction of optimal Bhattacharyya distance and optimal Bayes error, while its formulation as an eigenvector problem makes the solution computationally tractable.

3.5.2 Nonlinear extension using the “kernel trick”

The nonlinear extension of the discriminant analysis by kernelization is straightforward using the representer theorem [31], [32], [33], [34]. We consider the solution matrix W = ΦA ∈ R^{f×d} with the data matrix in the feature space associated with a function φ(x) ∈ R^f, Φ = [φ(x1), . . . , φ(xN)] ∈ R^{f×N}, and the matrix of mixing coefficients A ∈ R^{N×d}. If we consider the matrix in Eq. (24), which we will denote as P, we obtain the mixing coefficients A maximizing

$$\mathrm{tr}\left[(W^\top W)^{-1}W^\top P W\right] \qquad (31)$$
$$= \mathrm{tr}\left[(A^\top\Phi^\top\Phi A)^{-1}A^\top\Phi^\top P\Phi A\right], \qquad (32)$$

which is obtained by solving the generalized eigenvector problem $(\Phi^\top P\Phi)A = (\Phi^\top\Phi)A\Lambda$ with diagonal matrix Λ containing the eigenvalues.

For kernelization, we want to express the equation using only the kernel $K = \Phi^\top\Phi$ with elements $K_{ij} = \phi(x_i)^\top\phi(x_j)$ and remove all representations using the raw data Φ. If we consider the data matrix of class c, $\Phi_c \in R^{f\times N_c}$, the estimated mean and covariance can be expressed with regularization as

$$\mu_c = \frac{1}{N_c}\Phi_c\mathbf{1} \qquad (33)$$
$$\Sigma_c = \frac{1}{N_c}\Phi_c\Phi_c^\top - \mu_c\mu_c^\top + \sigma^2 I \qquad (34)$$
$$\;\;\;= \frac{1}{N_c}\Phi_c E_c\Phi_c^\top + \sigma^2 I \qquad (35)$$

where $E_c = I_{N_c} - \mathbf{1}_{N_c}\mathbf{1}_{N_c}^\top/N_c$, with the number of data of class c, $N_c$, the identity matrix $I_{N_c} \in R^{N_c\times N_c}$, and the


uniform column vector $\mathbf{1}_{N_c} \in R^{N_c}$ with all elements 1. We note that $E_c E_c = E_c$ and $E_c^\top = E_c$. We use the matrix inversion lemma to express $\Phi^\top\Sigma_c^{-1}\Phi$:

$$\Phi^\top\Sigma_c^{-1}\Phi = \Phi^\top\left[\frac{1}{\sigma^2}I - \frac{1}{\sigma^4}\Phi_c E_c\left(N_c I + \frac{1}{\sigma^2}E_c\Phi_c^\top\Phi_c E_c\right)^{-1}E_c\Phi_c^\top\right]\Phi = \frac{1}{\sigma^2}K - \frac{1}{\sigma^4}K_c E_c\left(N_c I + \frac{1}{\sigma^2}E_c K_{cc}E_c\right)^{-1}E_c K_c^\top, \qquad (36)$$

and express $\Phi^\top(\Sigma_1+\Sigma_2)^{-1}\Phi$. Writing $U = \left[\frac{\Phi_1 E_1}{\sqrt{N_1}}\ \frac{\Phi_2 E_2}{\sqrt{N_2}}\right]$, so that $\Sigma_1+\Sigma_2 = 2\sigma^2 I + UU^\top$,

$$\Phi^\top(\Sigma_1+\Sigma_2)^{-1}\Phi = \Phi^\top\left(2\sigma^2 I + UU^\top\right)^{-1}\Phi \qquad (37)$$
$$= \frac{1}{2\sigma^2}\,\Phi^\top\left\{I - U\left(2\sigma^2 I + U^\top U\right)^{-1}U^\top\right\}\Phi$$
$$= \frac{1}{2\sigma^2}\left\{K - \left[\tfrac{K_1 E_1}{\sqrt{N_1}}\ \tfrac{K_2 E_2}{\sqrt{N_2}}\right]\left(2\sigma^2 I + \begin{pmatrix}E_1 K_{11}E_1/N_1 & E_1 K_{12}E_2/\sqrt{N_1 N_2}\\ E_2 K_{12}^\top E_1/\sqrt{N_1 N_2} & E_2 K_{22}E_2/N_2\end{pmatrix}\right)^{-1}\left[\tfrac{K_1 E_1}{\sqrt{N_1}}\ \tfrac{K_2 E_2}{\sqrt{N_2}}\right]^\top\right\}, \qquad (38)$$

for $K = \Phi^\top\Phi$, $K_c = \Phi^\top\Phi_c$, $K_{cc} = \Phi_c^\top\Phi_c$, and $K_{12} = \Phi_1^\top\Phi_2$, with c = 1, 2, where the data matrix Φ = [Φ1 Φ2] is the concatenation of the data matrices of the two classes Φ1 and Φ2.

Using equations (33), (35), (36), and (37), we can express Eq. (32) with K and its submatrices. For kernelization, all inner products of the raw data within Φ have to be replaced by the elements of K. Then, as explained in [31], we use the kernel trick to obtain a nonlinear mapping by replacing K by the elements of a positive definite function.
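As a sketch (ours), once Φ^⊤PΦ has been assembled from Eqs. (33)-(38) (assumed precomputed below as PK), the mixing coefficients A follow from the generalized eigenproblem of Eq. (32); the small ridge on K is our own addition to keep the problem well posed.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_mixing_coefficients(PK, K, d, ridge=1e-8):
    """Solve (Phi' P Phi) A = (Phi' Phi) A Lambda, i.e. PK A = K A Lambda."""
    K_reg = K + ridge * np.eye(K.shape[0])       # regularized kernel Gram matrix
    PK_sym = 0.5 * (PK + PK.T)                   # enforce symmetry against round-off
    vals, vecs = eigh(PK_sym, K_reg)
    return vecs[:, np.argsort(vals)[::-1][:d]]   # columns are the coefficients A
```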

4 EXPERIMENTS

The proposed method is evaluated on both synthetic and benchmark datasets. The data are projected onto the estimated subspaces of the fluid model and the other methods, then classification is performed in the subspace. The methods used for comparison are Fisher Discriminant Analysis (FDA), Approximate Information Discriminant Analysis (AIDA) [7], and the Approximate Chernoff Criterion (ACC) [4], which provide analytic solutions, and Pareto Discriminant Analysis (Pareto) [5], which uses gradient optimization.

For the synthetic data experiments, we used data generated from six different Gaussian configurations presented in [5]. The synthetic data consist of 20-dimensional Gaussian density functions generated from the following equation: x = T_c b + µ_c + ε for x ∈ R^20 in the c-th class, with T_c ∈ R^{20×7}, b ∼ N_7(0, I), and ε ∼ N_20(0, I), where N_k denotes a k-dimensional Gaussian. T_c ∈ R^{20×7} is a random matrix whose elements are independently generated from N(0, 5). The means of the classes are µ1 = (2N(0,1) + 4)·1, µ2 = 0_20, µ3 = 2(N(0,1) − 4)[0_10 1_10]^⊤, µ4 = (N(0,1) + 4)[1_10 0_10]^⊤, and µ5 = (2N(0,1) + 4)[1_5 0_5 1_5 0_5]^⊤. We consider the situation where each class has only 20 training samples.
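The synthetic data can be reproduced with a short generator. The sketch below (ours) follows the construction x = T_c b + µ_c + ε described above; we assume N(0, 5) denotes variance 5.

```python
import numpy as np

def sample_synthetic_class(mu_c, rng, n=20, D=20, latent=7):
    """Draw n samples x = T_c b + mu_c + eps for one synthetic class (Section 4)."""
    T_c = rng.normal(0.0, np.sqrt(5.0), size=(D, latent))  # elements ~ N(0, 5), variance 5 assumed
    b = rng.standard_normal(size=(n, latent))               # b ~ N_7(0, I)
    eps = rng.standard_normal(size=(n, D))                  # eps ~ N_20(0, I)
    return b @ T_c.T + mu_c + eps

rng = np.random.default_rng(0)
mu2 = np.zeros(20)                                           # class 2 has zero mean
mu4 = (rng.normal() + 4) * np.r_[np.ones(10), np.zeros(10)]  # class 4 mean
X2, X4 = sample_synthetic_class(mu2, rng), sample_synthetic_class(mu4, rng)
```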

Fig. 4. Classification results using projected nearest neighbors for five classes

The multi-way classification results with five classes are presented in Fig. 4. In this result, we obtained a 10-dimensional subspace and the projected data in this space. The overall subspace is the aggregation of subspaces obtained from every pair of classes, and results are shown for 1-nearest neighbor classification in the projected subspace.

The results show that the subspace obtained from the fluid model outperforms most of the other methods, including classification in the original 20-dimensional space. In particular, the other methods except ACC do not outperform nearest neighbor classification in the original space, and thus do not demonstrate the advantages of dimensionality reduction methods.

We also present the results for two-class classification, and the results for all class pairs are shown in Fig. 5. The results again show that only the fluid model, along with ACC, outperforms classification in the original high-dimensional space. The other methods generally outperform FDA, but fail to beat nearest neighbor classification without dimensionality reduction.

In Fig. 6, classification results are presented for a number of benchmark datasets. For datasets from UCI and Delve, 5-nearest neighbor classification is performed in the subspace obtained from the different methods, and the leave-one-out accuracy is reported in Fig. 6. FDA can only produce a one-dimensional subspace for two-class data, and the corresponding result for the one-dimensional space is shown, whereas the other methods are shown at varying dimensionalities and the accuracies are plotted. Subclass Discriminant Analysis (SDA) [35] and Kernel SDA (KSDA) [36] assume clusters within each class, and they also have limitations on the number of dimensionalities according to the number of subclasses. In general, the classification accuracies are expected to initially increase with subspace dimensionality, and then decrease due to poor generalization performance. Compared to simple FDA, better discriminant analysis can increase the accuracy by incorporating covariance differences.

In our experiments in Fig. 6, the kernelized algorithm KSDA is outperformed by the proposed algorithm on nine datasets. In such cases, the fluid model mostly yields one of the highest accuracies among all algorithms, within the standard error margin of the highest accuracy.



Fig. 5. Two-way classification results across all class pairs


5 GAUSSIAN PROCESSES

In this section, we further extend our fluid model to the problem of finding optimal filters that discriminate data from different stochastic processes. We focus this analysis on discovering optimal filters for linear processes with Gaussian noise. We generalize our previous analysis to an interpretation of discriminating filters. The general problem of optimally discriminating two arbitrary Gaussian processes is intractable, similar to the finite dimensional discrimination problem of Section 2. We explain how to extend our tractable fluid model to discriminate different stochastic dynamical systems modeled by Gaussian processes.

The dynamical systems are expressed by the following linear system, resulting in a Gaussian process. We consider an infinite-length sequence X = [. . . , x_{m−1}, x_m, x_{m+1}, . . .] along with the following k-th order recursive update rule,

$$x_m = \sum_{i=1}^{k} a_i x_{m-i} + b_m + \epsilon_m \qquad (39)$$

where each ε_m ∈ R follows a Gaussian distribution N(0, σ²), and b_m is an input value at time m. Then the sequence X is a Gaussian process, and we want to calculate the mean function µ(m) = ⟨x_m⟩ and the covariance function f(m,n) = ⟨x_m x_n⟩ − ⟨x_m⟩⟨x_n⟩.

To obtain µ(m) and f(m,n), we introduce a state update rule for the state vector $x_m = (x_m, x_{m-1}, \cdots, x_{m-k+1})^\top$ of size k:

$$x_{m+1} = \begin{pmatrix}a_1 & a_2 & \cdots & a_k\\ 1 & & & \\ & \ddots & & \\ & & 1 & 0\end{pmatrix}x_m + \begin{pmatrix}b_m\\ 0\\ \vdots\\ 0\end{pmatrix} + \begin{pmatrix}\epsilon_m\\ 0\\ \vdots\\ 0\end{pmatrix} \equiv A x_m + b_m + \vec{\epsilon}_m. \qquad (40)$$

Then the recursive equation Eq. (40) can be used to obtain a general expression for $x_m$:

$$x_m = \sum_{i=1}^{\infty} A^{i-1}\left(b_{m-i} + \vec{\epsilon}_{m-i}\right). \qquad (41)$$

This equation leads to the mean function µ(m) and the translationally invariant covariance function f(m,n):

$$\mu(m) = \langle x_m\rangle \qquad (42)$$
$$\;\;\;\;= \sum_{i=1}^{\infty} A^{i-1} b_{m-i} \qquad (43)$$

and

$$f(m,n) = \langle x_m x_n^\top\rangle - \langle x_m\rangle\langle x_n\rangle^\top = \begin{cases}\sigma^2 A^{m-n}\left(\sum_{i=1}^{\infty} A^{i-1}(A^{i-1})^\top\right), & m > n\\[4pt] \sigma^2\left(\sum_{i=1}^{\infty} A^{i-1}(A^{i-1})^\top\right)(A^{n-m})^\top, & n > m.\end{cases} \qquad (44)$$

We consider two systems with different A and b_m parameters and try to discriminate samples generated from these systems. In the following section, we discuss how we can find filters for a low dimensional representation of the Gaussian process for better classification.
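The state-space form of Eq. (40) is straightforward to simulate. The following sketch (ours) builds the companion matrix A and generates one sample sequence; the usage lines reproduce, in spirit, the first-order example of Section 5.1.1.

```python
import numpy as np

def simulate_ar_process(a, b, sigma=1.0, rng=None):
    """Simulate x_m = sum_i a_i x_{m-i} + b_m + eps_m via the companion form of Eq. (40)."""
    rng = rng or np.random.default_rng(0)
    k, T = len(a), len(b)
    A = np.zeros((k, k))
    A[0, :] = a                       # first row carries the AR coefficients a_1..a_k
    if k > 1:
        A[1:, :-1] = np.eye(k - 1)    # sub-diagonal shifts the state
    state, xs = np.zeros(k), np.empty(T)
    for m in range(T):
        drive = np.zeros(k)
        drive[0] = b[m] + sigma * rng.standard_normal()
        state = A @ state + drive
        xs[m] = state[0]
    return xs

m = np.arange(300)
b1 = np.where((m > 100) & (m <= 200), 0.05, 0.0)   # step input of Eq. (45)
x1 = simulate_ar_process([0.99], b1)
```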

5.1 Low dimensional filters

The one-dimensional projection for infinite sequence data can be considered as a filter whose inner product with the sequence yields a scalar output. Thus, the projection vector can be expressed as a filter function w(m). We want to determine the optimal filter that can be used to classify samples from the different processes. If the samples are generated by Gaussian processes, an analysis analogous to Fukunaga's in Section 2 can be applied. The special cases where the processes are either homoscedastic or have equal mean are especially enlightening.

5.1.1 FDA analysis

We first consider the case when the two covariance functions are the same and the mean functions are different. In this case, the optimal discriminating filter between the dynamical systems is the FDA solution, given by $w(m) = \int\left(f_1(m,n)+f_2(m,n)\right)^{-1}\left[\mu_1(n)-\mu_2(n)\right]dn$, which is analogous to the finite-dimensional form $(\Sigma_1+\Sigma_2)^{-1}\Delta\mu$. The inverse function satisfies $\int f^{-1}(l,m)f(m,n)\,dm = \delta(l,n)$ and can be obtained through a Fourier transform: $f^{-1}(m,n)$ is the inverse Fourier transform of $1/F(\omega)$, where $F(\omega)$ is the Fourier transform of the translationally invariant f(m,n).
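Since f is translationally invariant, applying (f1 + f2)^{-1} amounts to a deconvolution in the frequency domain. A rough sketch (ours), treating finite-length sequences as periodic:

```python
import numpy as np

def fda_filter(delta_mu, f_sum):
    """FDA filter w = (f1 + f2)^{-1} (mu1 - mu2) computed by FFT deconvolution.

    delta_mu : mean-difference sequence mu1(m) - mu2(m)
    f_sum    : one row of the stationary summed covariance function f1 + f2
    """
    F = np.fft.fft(f_sum)
    W = np.fft.fft(delta_mu) / (F + 1e-12)   # small offset guards against division by ~0
    return np.real(np.fft.ifft(W))
```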

Fig. 7 shows the discrimination of homoscedastic first-order (k = 1) dynamical systems. In this case, the mean and covariance functions in Eq. (43) and Eq. (44) reduce to $\mu(m) = \sum_{i=1}^{\infty} a_1^{i-1} b_{m-i}$ and $f(m,n) = \frac{\sigma^2 a_1^{|m-n|}}{1 - a_1^2}$. If we use the same update parameter a_1 for both systems, the resulting Gaussian processes possess the same covariance function f(m,n). In Fig. 7, the two systems have different inputs

$$b_m = b_1(m) = \begin{cases}.05, & 100 < m \le 200\\ 0, & \text{otherwise}\end{cases} \qquad (45)$$


Fig. 6. Classification results on benchmark data for various subspace dimensionalities. Panels: (a) Hillvalley with noise, (b) Blood, (c) Parkinsons, (d) Hillvalley without noise, (e) Delve twonorm, (f) Delve ringnorm, (g) Breast cancer, (h) Image, (i) German, (j) Ionosphere, (k) Ozone one hour, (l) Waveform.

for class 1, and

$$b_m = b_2(m) = \begin{cases}-.05, & 100 < m \le 200\\ 0, & \text{otherwise}\end{cases} \qquad (46)$$

for class 2, which have opposite signs but start and end at the same sequence number. We used the same a_1 = .99 for both processes. Fig. 7(a) shows the mean sequences of these two systems along with two sequence samples generated from them. One possible filter to discriminate the samples is the simple difference of the means, as shown in Fig. 7(b). The optimal solution in this case is the FDA solution, also shown in Fig. 7(b), and it looks much different from the mean difference. The FDA solution is actually a deconvolved form of the mean difference filter, and only compares the samples near where the input b_m starts and where it finishes.



Fig. 7. (a) Sample signals and means of two homoscedastic Gaussian processes. (b) Fluid model filter (equivalent to FDA in this case) and mean difference filter. The fluid model filter has two positive and negative peaks indicating the two points of input change (see Section 5.1.1). (c) ROC curve for classification using the fluid model filter and the mean difference filter.

(a) (b) (c)

Fig. 8. (a) Sample signals and the mean of two heteroscedastic Gaussian processes (b) Fluid model filter and FDA filter (c) ROC curve forclassification using fluid model filter, mean difference filter, and FDA filter

where it finishes. The ROC curve for discriminating between the two projected processes is shown in Fig. 7(c), which shows that the FDA filter yields much improved performance in distinguishing between the two processes.
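A numerical illustration of this Fig. 7 setup can be assembled directly from the quantities above. The sketch below assumes unit noise variance, 400-sample sequences, 500 samples per class, and a first-order recursion $x_m = a_1 x_{m-1} + b_m + \varepsilon_m$; the exact placement of the input term relative to Eq. (39) and all variable names are assumptions for illustration.

```python
# Minimal sketch of the Fig. 7 experiment under assumed settings (sigma = 1, N = 400).
import numpy as np

rng = np.random.default_rng(0)
N, a1, sigma = 400, 0.99, 1.0

def simulate_ar1(b, n_samples):
    """Draw sequences from an assumed first-order system x_m = a1 x_{m-1} + b_m + noise."""
    x = np.zeros((n_samples, N))
    for m in range(1, N):
        x[:, m] = a1 * x[:, m - 1] + b[m] + sigma * rng.standard_normal(n_samples)
    return x

m = np.arange(N)
b1 = np.where((m > 100) & (m <= 200), 0.05, 0.0)   # step input of Eq. (45)
b2 = -b1                                            # step input of Eq. (46)
X1, X2 = simulate_ar1(b1, 500), simulate_ar1(b2, 500)

# Empirical FDA filter: pooled inverse covariance applied to the mean difference,
# the finite-sample analogue of f^{-1}(m, n)[mu1(n) - mu2(n)].
dmu = X1.mean(axis=0) - X2.mean(axis=0)
S = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))
w_fda = np.linalg.solve(S + 1e-6 * np.eye(N), dmu)

# Projections X1 @ w_fda and X2 @ w_fda separate the classes far better than
# projections onto the raw mean difference dmu.
```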

5.1.2 Equal mean analysis

The finite-dimensional equal-mean solution can also be easily extended when the covariance functions are translationally invariant, such that $f_c(m,n) = f_c(m-n)$. These functions are analogous to circulant covariance matrices, whose eigenvectors are the Fourier modes satisfying

$$\int\!\!\int e^{-i\omega m}\, e^{i\omega n}\, f_c(m,n)\, dm\, dn = \lambda_c(\omega).$$

Therefore, the solution of the equal-mean analysis is the Fourier mode $w(m) = e^{i\omega m}$ corresponding to the $\omega$ maximizing $\frac{\lambda_1(\omega)}{\lambda_2(\omega)} + \frac{\lambda_2(\omega)}{\lambda_1(\omega)}$. From this analysis, we see that the optimal filter is a single Fourier mode for translationally invariant processes with equal means.
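For discretized stationary sequences, this criterion is straightforward to evaluate: the DFT of each covariance lag sequence gives the spectrum $\lambda_c(\omega)$, and the best mode is selected by the ratio criterion. A minimal sketch under a circular-wrapping assumption (function and variable names are illustrative):

```python
# Minimal sketch: pick the Fourier mode maximizing lambda1/lambda2 + lambda2/lambda1.
import numpy as np

def equal_mean_filter(cov_lag_1, cov_lag_2):
    """cov_lag_c: (N,) stationary covariance of class c at lags 0..N-1 (circular)."""
    lam1 = np.fft.fft(cov_lag_1).real       # eigenvalue spectrum lambda_1(omega)
    lam2 = np.fft.fft(cov_lag_2).real       # eigenvalue spectrum lambda_2(omega)
    score = lam1 / lam2 + lam2 / lam1       # equal-mean separation criterion per mode
    k = int(np.argmax(score))               # most discriminative frequency index
    N = len(cov_lag_1)
    w = np.exp(2j * np.pi * k * np.arange(N) / N)   # w(m) = e^{i omega m}
    return w, score[k]
```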

5.1.3 Fluid model

In the general case, we consider dynamical processes where both the mean and covariance functions are different. As in the finite-dimensional heteroscedastic problem, finding the optimal filter is generally intractable. However, we can use an analysis of the fluid model to find a tractable solution.

In this situation, the elements of $\langle F x_e^{\top}\rangle \langle x x^{\top}\rangle_{\rho} \langle x_e F^{\top}\rangle$ can be expressed as

$$\int\!\!\int e^{-i\omega_1 m}\, e^{i\omega_2 n}\, H(\omega_1, \omega_2)\, d\omega_1\, d\omega_2,$$

where $H(\omega_1, \omega_2)$ is composed of two diagonal matrices whose components are $F_1(\omega)$ and $F_2(\omega)$, and the Hermitian matrix $M(\omega_1)M(\omega_2)^{\dagger}$ with $M(\omega) = \int \Delta\mu(t)\, e^{-i\omega t}\, dt$, a Fourier transform. The corresponding optimization can be solved by the eigendecomposition of $H(\omega_1, \omega_2)$, an infinite-dimensional representation of Eq. (28).

We show an example of discriminating two different Gaussian processes with different means and covariances in Fig. 8. The two input functions are

$$b_m = b_1(m) = \begin{cases} 0.0001, & 100 < m \le 200 \\ 0, & \text{otherwise} \end{cases} \qquad (47)$$

for class 1, and

$$b_m = b_2(m) = \begin{cases} -0.0001, & 100 < m \le 200 \\ 0, & \text{otherwise} \end{cases} \qquad (48)$$

for class 2, which are smaller in magnitude than before, with $a_1 = 0.99$ for class 1 and $a_1 = 0.97$ for class 2. In this scenario, the mean difference is very small compared to the difference in covariances, as shown in Fig. 8(a). Classification results for the fluid filter, the mean-difference filter, and the FDA filter are presented in Fig. 8(c). For this example, the mean-difference filter performs much better than the FDA filter. The optimal fluid model filter in Fig. 8(b) is a combination of Fourier modes and the FDA filter, and shows the best classification performance in Fig. 8(c).

5.2 Motion capture discrimination

We consider the problem of separating two different motion sequences in motion capture data. From the CMU motion


capture dataset (http://mocap.cs.cmu.edu), we selected two motions (running and walking), as shown in Fig. 9(a-b), and obtained a temporal scalar sequence by projecting the positions onto the maximum-variance spatial axis. In this dataset, all running and walking sequences have different lengths, and $b_m = b$ in Eq. (39) is considered constant. The order $k$ is set to 10, and the parameters $a_1, \ldots, a_k$ and $b$ are determined using least squares. From the Gaussian process fluid analysis, the projections of the sequences onto the subspace of the two most important discriminant filters are shown in Fig. 9(c). The first (vertical) axis captures the difference in means, and the second (horizontal) axis captures the difference due to covariance, which is related to the motion frequency.
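The parameter fitting described here reduces to an ordinary least-squares regression on lagged values of each sequence. Below is a minimal sketch assuming the model $x_m = \sum_i a_i x_{m-i} + b + \varepsilon_m$ with a constant input $b$ and $k = 10$, as in the text; the function name and indexing conventions are assumptions for illustration.

```python
# Minimal sketch: least-squares fit of (a_1, ..., a_k, b) for the assumed model
# x_m = sum_i a_i x_{m-i} + b + noise, with constant input b and order k = 10.
import numpy as np

def fit_ar_parameters(x, k=10):
    """Return (a_1..a_k, b) estimated by ordinary least squares from one sequence."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Design matrix: k lagged copies of the sequence plus a constant column for b.
    lags = [x[k - i - 1 : n - i - 1] for i in range(k)]
    A = np.column_stack(lags + [np.ones(n - k)])
    y = x[k:]
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[:k], coeffs[-1]
```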

5.3 Stock market index

The fluid Gaussian process model is applied to fixed-length sequence data with $b_m$ parameters. The discrimination task is to predict the behavior of the stock market index for the next day. We use the KOSPI index for 9 years, from Dec. 7, 1998, to Jul. 10, 2007, and try to determine whether the index will increase by more than 2% by the end of the next day. The prediction is performed using index data from the previous day between 9:20 and 14:40. Each sequence generated from this dataset is a 232-dimensional vector. We applied our discriminant analysis method by estimating the parameters $a_1, \ldots, a_k$ and $b_m$ with order $k = 15$. We evaluated the performance on test data from the following 4 years, from Jul. 11, 2007, to Jul. 8, 2011.

The KOSPI index data from one day are presented in Fig. 10(a). The objective is to predict, based on this pattern, whether the index will increase by more than 2% by the end of the next day. The data are preprocessed by subtracting each day's mean index, and the pattern of intraday index change is classified using filters from the fluid model.

The two optimal filters are computed, and we report classification results using the first filter only as well as using both filters. For comparison, we also report results from FDA, a linear SVM, and an SVM with RBF kernel. Fig. 10(c) shows the ROC curves for this classification. Here, the linear SVM result is very poor, and the simple FDA filter does not perform well either. The SVM with RBF kernel shows decent performance with an AUC of .6512, while the fluid model gives an AUC of .6975 with two features and .6527 with one feature. The shapes of the first and second filters of the fluid model are displayed in Fig. 10(b). They can be interpreted as combining information from the mean difference with discriminative periodic information in the stock index sequences.
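The evaluation pipeline can be summarized in a short sketch: subtract each day's mean, project onto the learned filters, and score the projections with ROC/AUC. The array names, the filters argument, and the use of a logistic regression on the projected features are assumptions for illustration; the paper does not spell out its exact decision rule on the projections.

```python
# Minimal sketch of the evaluation protocol under assumed names and a simple
# classifier on the projected features (not necessarily the paper's decision rule).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def preprocess(X):
    """Subtract each day's mean index so that only the intraday pattern remains."""
    return X - X.mean(axis=1, keepdims=True)

def evaluate(X_train, y_train, X_test, y_test, filters):
    """Project day patterns onto the discriminant filters and report test AUC."""
    W = np.column_stack(filters)                 # e.g., the first one or two filters
    Z_tr = preprocess(X_train) @ W
    Z_te = preprocess(X_test) @ W
    clf = LogisticRegression().fit(Z_tr, y_train)
    return roc_auc_score(y_test, clf.predict_proba(Z_te)[:, 1])
```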

6 CONCLUSIONS

We have demonstrated how to obtain discriminant features using a fluid dynamics model with a Bhattacharyya potential function. The resulting solution with low-dimensional constrained flows is related to the well-known Bayes-optimal projections for Gaussian distributions. The algorithm is computationally tractable, requiring only an eigendecomposition of the resulting force fields in the fluid model.


The fluid model has been extended to the problem of discriminating sequence data using a Gaussian process formulation. The resulting filters were applied to two different datasets: motion capture and stock market index data. The ROC comparison on stock market prediction shows how the proposed extension yields a flexible discriminant analysis model that outperforms other standard techniques.

These results demonstrate the utility of modeling discriminant analysis as fluid flow with low-dimensional flow constraints. We anticipate that this model will be further developed using other non-Gaussian functions and applied to a wide variety of problems in the future.

ACKNOWLEDGMENTS

YKN is supported by grants from NSRI, BK21Plus, MITIP-10048320, and AFOSR; JHH by OFRN-C4ISR and NSF IIS EAGER 1550757; FCP by BK21Plus and MITIP-10048320; BTZ by IITP-R0126-16-1072, KEIT-10060086, and KEIT-10044009; and DDL by the U.S. NSF, ONR, ARL, AFOSR, DOT, and DARPA.



[Figure 9 plot data omitted; panels (a) and (b) use 3-D axes x, y, z; the panel (c) legend distinguishes walk, run, walk untrained, and run untrained samples.]

Fig. 9. (a) A sample of walking motion; (b) a sample of running motion; (c) embeddings in the first two discriminative features.

Fig. 10. (a) A sample of the KOSPI index for one day as a function of time; (b) the first and second optimal filters from the fluid model for discriminating whether the index will have increased by more than 2% by the end of the next day; (c) ROC curves for classification using different filters: the first filter of the fluid model, the first two filters of the fluid model, FDA, linear SVM, and SVM with RBF kernel.




Yung-Kyun Noh
Yung-Kyun Noh is currently a BK21Plus Assistant Professor in the School of Mechanical and Aerospace Engineering at Seoul National University (SNU). His research interests are metric learning and dimensionality reduction in machine learning, and he is especially interested in applying the statistical theory of nearest neighbors to real, large datasets. He received his B.S. in Physics from POSTECH in 1998 and his Ph.D. in Computer Science from SNU in 2011. From 1998 to 2003, he was a member of the Marketing Research Group at MPC Ltd. in Korea, where he worked as a researcher and a system engineer. He worked in the GRASP Robotics Laboratory at the University of Pennsylvania from 2007 to 2012 as a visiting researcher. Before joining the current BK21Plus program, he was a postdoctoral fellow in the same department and a Research Professor in the Department of Computer Science at KAIST.

Jihun Hamm
Jihun Hamm is currently a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University. He received his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2002, during which time he also completed his military service. He received his Ph.D. in the same field from the University of Pennsylvania in 2008, with a focus on dimensionality reduction problems in machine learning. After completing his Ph.D. program, he spent four years in postdoctoral research on applications of machine learning in the medical sciences and psychology. His current research interests include machine learning problems such as learning long-term and structural dependence of daily human activities, and classification of medical conditions from various data. He has a best paper award from one of the top conferences in medical imaging, and has served as a reviewer for journals (JMLR, IEEE TPAMI, IEEE TNN, PR, IJPR, and MedIA) and conferences (NIPS, ICML, and MICCAI).

Frank Chongwoo Park
Frank Chongwoo Park received his B.S. in electrical engineering from MIT in 1985 and his Ph.D. in applied mathematics from Harvard University in 1991. From 1991 to 1995 he was an assistant professor of mechanical and aerospace engineering at the University of California, Irvine. Since 1995 he has been a professor of mechanical and aerospace engineering at Seoul National University. His research interests are in robotics, vision and image processing, and related areas of applied mathematics. He has been an IEEE Robotics and Automation Society Distinguished Lecturer, and has served on the editorial boards of the Springer Handbook of Robotics, Springer Advanced Tracts in Robotics (STAR), Robotica, and the ASME Journal of Mechanisms and Robotics. He has held adjunct faculty positions at the NYU Courant Institute and the Interactive Computing Department at Georgia Tech. He is a fellow of the IEEE and the current editor-in-chief of the IEEE Transactions on Robotics.

Byoung-Tak Zhang
Byoung-Tak Zhang is currently a Professor in the Department of Computer Science and Engineering, Seoul National University (SNU). He is also affiliated with the Brain Science and Cognitive Science Programs and is Director of the Institute for Cognitive Science. He received his Ph.D. in computer science from the University of Bonn, Germany, in 1992, and his B.S. and M.S. in computer science and engineering from SNU in 1986 and 1988, respectively. Prior to joining SNU, he was a Research Fellow at the German National Research Center for Information Technology (GMD) from 1992 to 1995. He has been a Visiting Professor at MIT CSAIL (2003-2004), the BMBF Excellence Centers for Cognitive Interaction Technology and Cognitive Technical Systems (2010-2011), and the Princeton Neuroscience Institute (2014-2015). His research interests include biointelligence models of learning, evolution, and development and their application to artificial intelligence and cognitive science. He serves as an Associate Editor for BioSystems, Advances in Natural Computation, and the IEEE Transactions on Evolutionary Computation (1997-2010).

Daniel D. Lee
Daniel Lee is the UPS Foundation Chair Professor in the School of Engineering and Applied Science at the University of Pennsylvania. He received his B.A. summa cum laude in Physics from Harvard University in 1990 and his Ph.D. in Condensed Matter Physics from the Massachusetts Institute of Technology in 1995. Before coming to Penn, he was a researcher at AT&T and Lucent Bell Laboratories in the Theoretical Physics and Biological Computation departments. He is a Fellow of the IEEE and AAAI and has received the National Science Foundation CAREER award and the University of Pennsylvania Lindback award for distinguished teaching. He was also a fellow of the Hebrew University Institute of Advanced Studies in Jerusalem, an affiliate of the Korea Advanced Institute of Science and Technology, and organized the US-Japan National Academy of Engineering Frontiers of Engineering symposium. As director of the GRASP Laboratory and co-director of the CMU-Penn University Transportation Center, his group focuses on understanding general computational principles in biological systems and on applying that knowledge to build autonomous systems.