
Learning Topology of a Labeled Data Set with

the Supervised Generative Gaussian Graph

Gaillard Pierre 1, Aupetit Michaël 1 and Govaert Gérard 2

1- French Atomic Energy Commission - BP12 91680 Bruyères-le-Châtel - France

2- University of Technology of Compiegne - France

Abstract. Discovering the topology of a set of labeled data in a Euclidean space can help to design better decision systems. In this work, we propose a supervised generative model based on the Delaunay Graph of some prototypes representing the labeled data.

1 Introduction

Basic supervised learning problems involve a given set of M labeled training data {x_i, c_i | i = 1, ..., M}, where x_i ∈ R^D is a "feature" vector and c_i ∈ {1, ..., K} is its associated class label. The ultimate goal of classification problems is to design a classifier which predicts the class of new feature vectors with a minimum error rate. However, prediction is only the last step of the learning process, which can be enriched through the exploratory analysis of the data, and specifically the extraction of the topology of the classes. Indeed, several topological characteristics of the classes could be useful, among which: (1) their connectedness, to evaluate the complexity of the classification problem [1]; (2) their intrinsic dimension, to select relevant features [2]. In order to extract this topological information, we first assume that the data are drawn from some labeled principal manifolds [3] corrupted with some additive noise. One way to capture the structure of the data is to model their distribution in terms of latent or hidden variables [4]. The main generative models dealing with unsupervised manifold learning are the Generative Topographic Mapping [4] and the Probabilistic Principal Component Analyzers [5]. In the first approach, the intrinsic dimension is fixed a priori, allowing visualization, while in the second approach, the intrinsic dimension is captured but the connectedness is lost. In order to overcome these limits, another generative model is proposed in [6], based on the Delaunay Graph (DG) of some prototypes representing the data. This model, called the Generative Gaussian Graph (GGG), assumes no a priori knowledge about the topology and allows learning the connectedness of sets of points. We propose to extend the GGG to the supervised case in order to extract the topology of the classes. We observe that the GGG can be viewed as a generalization of a Gaussian Mixture model (GM) and that the GM has been transposed to supervised learning [7]. Our approach follows the same path to extend the GGG to the supervised case.

Section 2 briefly reviews the GM and its supervised version, as well as the GGG. In section 3, we introduce the new algorithm allowing us to represent the topology of a labeled data set. Then we test it on artificial and real data in section 4, before the conclusion in section 5.


2 State of the art

2.1 The Gaussian Mixture

Mixture modeling can be regarded as a flexible way to represent a probability density function with a parametric model. A normal mixture density is defined by a finite weighted sum of Gaussian components of the following form: p(x|π, w, Σ) = ∑_{j=1}^{N} π_j g_j(x|w_j, Σ_j), where N is the number of components and g_j is a Gaussian density with mean w_j and covariance matrix Σ_j ∈ Σ. π_j ∈ π is the probability that an observation belongs to the j-th component, such that π_j ≥ 0 and ∑_{j=1}^{N} π_j = 1. This model can be viewed as a two-step data-generating process: (step 1) drawing of the component j with probability π_j; (step 2) drawing of the data following the density g_j of the component j. Therefore in this model, the principal manifold is assumed to be a set of points w, which have been corrupted with additive Gaussian noise g_j to lead to the observed data.

In the context of supervised learning, Miller and Uyar [7] suggest learning the allocation of the mixture components to the classes during the training. They introduce an additional parameter β_{cj} ∈ β to the GM, which represents the conditional probability of assigning the mixture component j to the class c. Moreover, since the components are common to the different classes, the model can easily represent a possible common structure of the different classes (e.g. high overlapping of the classes). The model, called the Generalized Gaussian Mixture (GGM), takes the form: p(x, c|π, β, w, Σ) = ∑_{j=1}^{N} π_j β_{cj} g(x|w_j, Σ_j), with the new constraints β_{cj} ≥ 0 ∀j, c and ∑_{c=1}^{K} β_{cj} = 1 ∀j.
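For illustration, the following minimal sketch evaluates the joint GGM density p(x, c|π, β, w, Σ). It assumes NumPy is available and restricts the covariances to spherical ones (Σ_j = σ_j²·I); all function and variable names are ours, not part of [7].

import numpy as np

def gaussian_density(x, w, sigma2):
    """Isotropic Gaussian density g(x | w, sigma2 * I) in dimension D."""
    D = x.shape[0]
    diff = x - w
    return np.exp(-diff @ diff / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (D / 2.0)

def ggm_joint_density(x, c, pi, beta, W, sigma2):
    """Joint density p(x, c) of a GGM with spherical components.

    pi     : (N,)   component priors, pi_j >= 0, sum_j pi_j = 1
    beta   : (K, N) class weights, beta[c, j] = P(class c | component j)
    W      : (N, D) component means
    sigma2 : (N,)   per-component spherical variances
    """
    return sum(pi[j] * beta[c, j] * gaussian_density(x, W[j], sigma2[j])
               for j in range(W.shape[0]))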

2.2 The Generative Gaussian Graph

In this section, we use the same notations as provided in [6]. The GGG assumes that the data are generated by some points and some segments constituting the principal manifolds, which have been corrupted with an additive spherical Gaussian noise with zero mean and unknown variance σ². The underlying model is based on two Gaussian elements, namely the Gaussian-points and the Gaussian-segments, which define a Gaussian mixture model. Given a set of N_0 prototypes w located over the data distribution using a vector quantization technique, the Delaunay Graph (DG) of the prototypes is constructed. With a weighted sum of the N_0 vertices and the N_1 edges of the DG, convolved with an isotropic Gaussian distribution with variance σ², the data generation process can be seen as a generalization of the usual Gaussian mixture model and is defined by:

p(x_i|π, w, σ, DG) = ∑_{d=0}^{1} ∑_{j=1}^{N_d} π_j^d g_d(x_i|j, σ),

where π_j^0 (resp. π_j^1) is the probability that a datum x_i was drawn from the Gaussian-point associated with w_j (resp. the Gaussian-segment associated with the j-th edge of the DG). The densities at point x_i induced by the j-th Gaussian-point and by the j-th Gaussian-segment [w_{a_j}, w_{b_j}] of length L_j are respectively defined as:

g_0(x_i|j, σ) = (2πσ²)^{−D/2} exp(−(x_i − w_j)²/(2σ²)) and g_1(x_i|j, σ) = (1/L_j) ∫_{w_{a_j}}^{w_{b_j}} g_0(x_i|w, σ) dw.
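The segment integral has a closed form: decomposing x_i into its orthogonal projection q onto the supporting line and the residual, the line integral of g_0 reduces to a difference of error functions (the same terms Q_ij, q_ij and I_1 reappear in the update rules of section 3.1). Below is a minimal NumPy/SciPy sketch of g_0 and g_1 under this decomposition; the variable names are illustrative.

import numpy as np
from scipy.special import erf

def g0(x, w, sigma2):
    """Gaussian-point density: isotropic Gaussian centred on the vertex w."""
    D = x.shape[0]
    d2 = np.sum((x - w) ** 2)
    return np.exp(-d2 / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (D / 2.0)

def g1(x, wa, wb, sigma2):
    """Gaussian-segment density: g0 averaged along the segment [wa, wb].

    Uses ||x - w(t)||^2 = ||x - q||^2 + (t - Q)^2, where q is the orthogonal
    projection of x onto the supporting line and Q its abscissa measured from
    wa, so the 1-D integral along the segment reduces to an erf term.
    """
    D = x.shape[0]
    L = np.linalg.norm(wb - wa)                  # segment length L_j
    Q = np.dot(x - wa, wb - wa) / L              # abscissa of the projection
    q = wa + (wb - wa) * Q / L                   # projected point q_ij
    d2 = np.sum((x - q) ** 2)                    # squared orthogonal distance
    sigma = np.sqrt(sigma2)
    one_d_integral = sigma * np.sqrt(np.pi / 2.0) * (
        erf(Q / (sigma * np.sqrt(2.0))) - erf((Q - L) / (sigma * np.sqrt(2.0))))
    return (np.exp(-d2 / (2.0 * sigma2))
            / (2.0 * np.pi * sigma2) ** (D / 2.0)) * one_d_integral / L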


3 The Supervised Generative Gaussian Graph

In this paper, the data are assumed to be generated by some points and some segments constituting the principal manifolds, and then to be corrupted with an additive spherical Gaussian noise with zero mean and unknown variance (i.e. Σ = σ²I). Moreover, we assume that the j-th Gaussian element of dimension d (written (d, j)) can generate data from the K different classes c with respective probabilities β_{cj}^d. Thus, we define the following model:

p(x, c|π, β, w, σ, DG) = ∑_{d=0}^{1} ∑_{j=1}^{N_d} π_j^d β_{cj}^d g_d(x|j, σ),

such that β_{cj}^d ≥ 0, ∑_{c=1}^{K} β_{cj}^d = 1 ∀j, ∀d, and π_j^d ≥ 0 with ∑_{d=0}^{1} ∑_{j=1}^{N_d} π_j^d = 1.
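Assembling the pieces, a minimal sketch of the SGGG joint density follows, reusing g0 and g1 from the sketch of section 2.2; the global indexing of points and edges and all names are our own illustrative conventions.

import numpy as np

def sggg_joint_density(x, c, pi0, pi1, beta0, beta1, W, edges, sigma2):
    """Joint density p(x, c | pi, beta, w, sigma, DG) of the SGGG.

    pi0   : (N0,)    priors of the Gaussian-points (vertices)
    pi1   : (N1,)    priors of the Gaussian-segments (edges of the DG)
    beta0 : (K, N0)  class weights of the Gaussian-points
    beta1 : (K, N1)  class weights of the Gaussian-segments
    W     : (N0, D)  prototype positions
    edges : list of (a, b) vertex index pairs, the N1 edges of the DG
    """
    p = 0.0
    for j in range(W.shape[0]):                  # d = 0: Gaussian-points
        p += pi0[j] * beta0[c, j] * g0(x, W[j], sigma2)
    for j, (a, b) in enumerate(edges):           # d = 1: Gaussian-segments
        p += pi1[j] * beta1[c, j] * g1(x, W[a], W[b], sigma2)
    return p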

3.1 The four-step learning process

1. Initialization: Given a set of prototypes w located over the data distribution using a GGM with identical variances and the EM algorithm [7], the DG of the prototypes is constructed and defines the initial graph. Then each edge and each vertex of the graph is the basis of a generative model, so that the graph generates a mixture of Gaussian density functions. The priors π are initialized to give equiprobability to each vertex and edge. The parameter β^0 is initialized with the values obtained by the GGM, while each component of β^1 is set to 1/K. Finally, we initialize σ with the noise value obtained by the GGM.
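As an illustration, the initial Delaunay Graph can be obtained, for instance, with SciPy's Delaunay triangulation; the simplex-to-edge extraction below is a sketch under that assumption (it requires D ≥ 2 and non-degenerate prototypes).

import itertools
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(W):
    """Return the edge set of the Delaunay Graph of the prototypes W (N0 x D)."""
    tri = Delaunay(W)
    edges = set()
    for simplex in tri.simplices:                    # each simplex has D + 1 vertices
        for a, b in itertools.combinations(simplex, 2):
            edges.add((int(min(a, b)), int(max(a, b))))   # undirected, canonical order
    return sorted(edges)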

2. Learning of the parameters: The learning objective was chosen to be the joint likelihood over the observed labeled data. This measure of quality w.r.t. the parameters of the model is defined as: L = ∏_{i=1}^{M} p(x_i, c_i|π, β, w, σ, DG). In order to maximize the likelihood we use the EM algorithm, which consists of t_max iterative steps updating π, β and σ, each of which ensures an increase of the likelihood. The updating rules take into account the positivity and sum-to-unity constraints on the parameters:

π_j^{d [new]} = (1/M) ∑_{i=1}^{M} p(d, j|x_i, c_i)

σ²^{[new]} = (1/(DM)) ∑_{i=1}^{M} [ ∑_{j=1}^{N_0} p(0, j|x_i, c_i) (x_i − w_j)²
    + ∑_{j=1}^{N_1} p(1, j|x_i, c_i) (2πσ²)^{−D/2} exp(−(x_i − q_ij)²/(2σ²)) (I_1[(x_i − q_ij)² + σ²] + I_2) / (L_j · g_1(x_i|j, σ)) ]

β_{cj}^{d [new]} = ∑_{i=1: c_i=c}^{M} p(d, j|x_i, c_i) / ∑_{i=1}^{M} p(d, j|x_i, c_i)          (1)

with I_2 = σ² ( (Q_ij − L_j) exp(−(Q_ij − L_j)²/(2σ²)) − Q_ij exp(−Q_ij²/(2σ²)) ),

I_1 = σ √(π/2) ( erf(Q_ij/(σ√2)) − erf((Q_ij − L_j)/(σ√2)) ),

where Q_ij = ⟨x_i − w_{a_j} | w_{b_j} − w_{a_j}⟩ / L_j, q_ij = w_{a_j} + (w_{b_j} − w_{a_j}) Q_ij / L_j, and

p(d, j|x_i, c_i) = π_j^d β_{c_i j}^d g_d(x_i|j, σ) / p(x_i, c_i|π, β, w, σ, DG)

is the posterior probability that the datum x_i was generated by the component (d, j).
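A minimal sketch of one EM iteration for π and β follows, using the posterior defined above; the σ² update of Eq. (1), which involves I_1, I_2, Q_ij and q_ij, is omitted for brevity. It reuses g0 and g1 from the sketch of section 2.2 and our own global component indexing; all names are illustrative.

import numpy as np

def em_step_pi_beta(X, C, pi, beta, W, edges, sigma2):
    """One EM update of the priors pi and class weights beta of the SGGG.

    X  : (M, D) data, C : (M,) class labels in {0, ..., K-1}
    pi : (N0 + N1,)    priors of all components (points first, then segments)
    beta : (K, N0 + N1)  class weights of all components
    Components are indexed globally: j < N0 -> Gaussian-point j,
    j >= N0 -> Gaussian-segment (edge) j - N0.
    """
    M, _ = X.shape
    N0, N = W.shape[0], pi.shape[0]

    def component_density(x, j):
        if j < N0:
            return g0(x, W[j], sigma2)
        a, b = edges[j - N0]
        return g1(x, W[a], W[b], sigma2)

    # E-step: posterior responsibilities p(d, j | x_i, c_i)
    resp = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            resp[i, j] = pi[j] * beta[C[i], j] * component_density(X[i], j)
        resp[i] /= resp[i].sum()                 # normalise by p(x_i, c_i)

    # M-step: closed-form updates respecting the sum-to-one constraints
    pi_new = resp.sum(axis=0) / M
    beta_new = np.zeros_like(beta)
    for c in range(beta.shape[0]):
        beta_new[c] = resp[C == c].sum(axis=0) / resp.sum(axis=0)
    return pi_new, beta_new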

3. Pruning: Finally, to get the supervised topology-representing graph from the generative model, we prune from the initial DG the edges which had no chance of generating the data, i.e. edges having a null or almost null prior at the end of the learning process: π_j^1 < ε. At that point, the remaining edges represent the connectedness of the joint density of all the classes.
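Under the conventions of the previous sketches, the pruning step itself is a one-liner (ε = 0.01 in our experiments):

def prune_edges(edges, pi1, eps=0.01):
    """Keep only the edges whose segment prior survived the EM training."""
    return [e for e, p in zip(edges, pi1) if p >= eps]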

4. Model selection: In statistical inference from data, selecting a parsimonious model among a collection of models is an important but difficult task [8]. The complexity of the Supervised Generative Gaussian Graph (SGGG) is defined by its number of vertices and edges. Since the complexity of our model is closely related to the number of prototypes, we choose the best GGM in the sense of the Bayesian Information Criterion (BIC) [8] to build the initial DG. Thus we select the GGM M with N_0 components which maximises:

BIC(M) = log ∏_{i=1}^{M} p(x_i, c_i|π_M, β_M, w_M, σ_M) − (v_M / 2) log(M),

where v_M is the number of free parameters of M: v_M = N_0 · (K + D).
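A minimal sketch of this criterion, assuming the ggm_joint_density function from the sketch of section 2.1 (a hypothetical name) and the parameter count v_M = N_0·(K + D):

import numpy as np

def bic_score(X, C, pi, beta, W, sigma2, K):
    """BIC of a GGM with N0 components: log-likelihood minus complexity penalty."""
    M = X.shape[0]
    N0, D = W.shape
    loglik = sum(np.log(ggm_joint_density(X[i], C[i], pi, beta, W, sigma2))
                 for i in range(M))
    v = N0 * (K + D)                      # number of free parameters v_M
    return loglik - 0.5 * v * np.log(M)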

4 Experiments

We drew a 2-D data sample from a set of class-manifolds: two quarter-circles, one 'Y-shaped' manifold and one point, with respective probabilities {0.2; 0.2; 0.5; 0.1} and β = {(1, 0), (0.5, 0.5), (0, 1), (1, 0)}. The observed data are obtained with an additive Gaussian noise with mean 0 and variance σ², and are represented in figure 1 (a). We use an artificial data set whose topology is known in order to verify the validity of the model. For all the experiments we use the same parameter values: t_max = 100, ε = 0.01. Figure 1 (b) shows that the SGGG allows recovering the topology of the four manifolds w.r.t. the class label, while the GGM (figure 1 (d)) does not give us any insight about the connectedness of the classes. Moreover, the SGGG informs us about the class-manifold overlapping thanks to the parameter β: a manifold overlapping of different classes is characterized by max_c(β_{cj}^d) ≠ 1.

Figure 2 describes an experiment where we verify the ability of the SGGG to learn the topology of a labeled data set under various noise conditions. We drew 30 different training sets for several variances of the noise. We use the learning process described in section 3 to build the SGGG, and we extract the topological characteristics of the model (number of connected components, degree of the vertices; see the sketch at the end of this section). Then, we compare the topology of the model with the original topology of the set of manifolds: for example, we check that the part of the model representing the 'Y-shaped' manifold is a set of connected single-class edges (i.e. max_c(β_{cj}^1) = 1) such that three vertices have a degree equal to 1 (the extreme points of the 'Y'), one has a degree equal to 3 (the crossing of the 'Y') and all the other vertices have a degree equal to 2. We observe (figure 2) that the model can recover the correct topology w.r.t. the classes for the quarter-circles and the point, even with a high noise variance. However, the 'Y-shape' is often modeled by a 'V-shape' when the noise variance increases. With noisy data, the accuracy of the model decreases but it is still relatively robust.

We also tested the SGGG on real data. Figure 3 represents the natural and artificial seismic events in France in 2000 and the resulting SGGG.
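The sketch referred to above extracts the number of connected components and the vertex degrees of the pruned graph, here with SciPy's sparse-graph routines (an assumption on our part; names are illustrative):

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def topology_summary(n_vertices, edges):
    """Number of connected components and vertex degrees of the pruned graph."""
    if edges:
        a, b = np.array(edges).T
    else:
        a = b = np.array([], dtype=int)
    adj = coo_matrix((np.ones(len(a)), (a, b)),
                     shape=(n_vertices, n_vertices)).tocsr()
    n_components, labels = connected_components(adj, directed=False)
    degrees = np.zeros(n_vertices, dtype=int)
    np.add.at(degrees, a, 1)                 # each edge contributes to both endpoints
    np.add.at(degrees, b, 1)
    return n_components, labels, degrees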


[Figure 1: four 2-D scatter plots, panels (a) to (d), described in the caption below.]

Fig. 1: Principle of the Supervised Generative Graph. We drew 1000 data from the four principal manifolds (drawn in figure 1 (d)) corrupted with an additive Gaussian noise with variance σ² = 0.81. The first and second classes are respectively represented by 'black +' and 'grey o'. The top-left quarter-circle is mixed '+' and 'o', the right one and the point are 'o', and the Y-shaped manifold is '+'. (a) The prototypes are located with a GGM (identical variances) and the EM algorithm. They are connected with edges from the Delaunay graph. (b) The optimal SGGG obtained after optimization of the likelihood according to σ, β and π, and pruning of the edges associated with a null or almost null prior. The edges representing only one class (resp. several classes), i.e. max_c(β_{cj}^d) = 1, are in black (resp. dotted) lines. (c-d) Iso-density curves at values (0.0005, 0.001, 0.05, 0.01) for each class, given by the SGGG and the best GGM (free covariances) selected with the appropriate BIC criterion. The variance of the noise σ² estimated by the SGGG is equal to 0.98. The variance is overestimated because non-linear manifolds are approximated by piecewise linear manifolds. In order to evaluate the accuracy of the generative model, we estimated the Normalized Negative Log-likelihood (NNL) on an independent sample of 5000 data for both models: NNL_SGGG = 5.0154 and NNL_GGM = 5.5447.

[Figure 2: one example training set for each noise variance σ² = 0.25, 1, 2 and 3. The recovery rates over the 30 random training sets are, in percent:]

σ²      top-left quarter-circle   top-right quarter-circle   'Y-shaped'   point
0.25    87                        97                         67           100
1       87                        97                         60           100
2       83                        83                         40           100
3       77                        80                         20           97

Fig. 2: SGGG with various noise conditions. The noise variances are set to σ² = 0.25, 1, 2 and 3. The figures represent one data set out of the 30 different data sets which have been randomly drawn for each noise value. The results, given in percent in the table above, represent the number of models which have correctly modeled the four principal manifolds (the top-left quarter-circle, the top-right quarter-circle, the 'Y-shaped' manifold and the point).


[Figure 3: two maps of France (longitude −5 to 10, latitude 42 to 52), described in the caption below.]

Fig. 3: Seismic data: (left) Natural (grey 'o') vs artificial (black 'x') seismic events of magnitude greater than 2.0 in France or close to its borders. (right) The resulting SGGG. The edges representing only one class (resp. several classes) are in continuous (resp. dotted) lines. We can see in particular that the SGGG represents the mountain chains (the Alps and the Pyrenees) with some Gaussian-segments. We keep working on this model to take the background noise into account in order to improve the results.

5 Conclusion

Extracting the topology of a set of labeled data is expected to provide important information for designing a better decision system. Following the principle of the Generalized Gaussian Mixture [7], we propose to extend the Generative Gaussian Graph [6] to the supervised case in order to extract the topology of labeled data sets. The graph obtained with the Supervised Generative Gaussian Graph represents the connectedness of the joint density of all the classes, from which we can learn about the intra-class and inter-class connectedness and the manifold overlapping of the different classes. One way to follow up this work is to build a 2-D graph allowing us to synthesize this extracted topological information [1].

References

[1] Aupetit M. and Catz T. High-dimensional labeled data analysis with topology representing graphs. Neurocomputing, Elsevier, 63, 2005.

[2] Verveer P. and Duin R. An evaluation of intrinsic dimensionality estimators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 1995.

[3] Tibshirani R. Principal curves revisited. Statistics and Computing, 2:183–190, 1992.

[4] Bishop C., Svensen M., and Williams C. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.

[5] Tipping M. and Bishop C. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.

[6] Aupetit M. Learning topology with the generative Gaussian graph and the EM algorithm. Advances in Neural Information Processing Systems, 18, 2006.

[7] Miller D. and Uyar S. A mixture of experts classifier with learning based on both labelled and unlabelled data. Neural Information Processing Systems, 9, 1997.

[8] Bouchard G. and Celeux G. Selection of generative models in classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 2006.
