
ASTON UNIVERSITY

AM30MP Mathematics Final Year Project

A latent feature probabilistic principal component analysis model

Author: Adam Farooq (741418)

Date Submitted: 25th April 2018


Abstract

In recent years the amount of high-dimensional data has grown exponentially, which has led to difficulties in both data processing and data visualisation. Dimensionality reduction techniques can be used to overcome the problems associated with high-dimensional data; these techniques aim to project the high-dimensional data onto some low-dimensional subspace such that the information lost in the process is minimised. A problem with many linear dimensionality reduction techniques is that they rest on simple assumptions which the data often does not satisfy. In this report we therefore propose a new linear dimensionality reduction technique which aims to relax the assumptions that existing linear dimensionality reduction techniques make. Before introducing it, we first review how existing linear dimensionality reduction techniques work and how they fail when the data does not follow the simple assumptions of the model. Using the methods discussed in the mathematics report, we propose a Bayesian approach to dimensionality reduction in which we assume that the data follows a latent feature structure; in the context of dimensionality reduction, this means we assume that the high-dimensional data can be projected onto many low-dimensional subspaces simultaneously. We then show how existing linear dimensionality reduction techniques are restricted, whereas the proposed model is flexible and can also give us information on the optimal solution without any prior information. Finally, we discuss how the proposed model can be extended into a full nonparametric Bayesian model, which will allow us to model situations where the complexity of the high-dimensional data is unknown and grows with the size of the data.


Acknowledgement

I would like to take this opportunity to thank my supervisor, Dr. Max Little, for supporting me throughout this academic year. His guidance has not only helped me to complete both my report and project, but has also allowed me to grow both professionally and academically. I would also like to give a sincere thank you to Dr. Yordan Raykov, whose help has allowed me to overcome many challenges during this project. I look forward to working with you both in my PhD. Finally, I would like to thank both Santos Romero and Tauseef Rehman for proofreading this report.



Contents

List of Figures

1 Introduction
  1.1 Context
  1.2 Dimensional reduction

2 Principal component analysis
  2.1 When PCA fails
  2.2 Other problems with PCA

3 Probabilistic principal component analysis
  3.1 Latent variable models
      3.1.1 Factor analysis
  3.2 Probabilistic PCA model
  3.3 Links to traditional PCA
  3.4 Mixtures of PPCA
      3.4.1 Application of MPPCA

4 Nonparametric latent feature models
  4.1 A nonparametric Bayesian approach
  4.2 Latent feature models
      4.2.1 Nonparametric latent feature models
  4.3 Indian Buffet process
      4.3.1 Deriving the IBP
  4.4 Example: Linear Gaussian latent feature model
      4.4.1 Synthetic data

5 Latent feature probabilistic principal component analysis
  5.1 The latent feature PPCA model
  5.2 Interpreting results
  5.3 Inference
  5.4 Synthetic data
      5.4.1 PCA approach
      5.4.2 MPPCA approach
      5.4.3 LF-PPCA approach
      5.4.4 Results
      5.4.5 Different means

6 Summary and future direction
  6.1 Summary
  6.2 Future direction


Appendices

A Properties of a factor analysis model

B Posterior of latent variable
  B.1 Dummy example
  B.2 Posterior

C EM algorithm for PPCA

D EM algorithm for MPPCA

E Linear Gaussian latent feature model

F Latent feature PPCA

G Inference algorithms
  G.1 Linear Gaussian latent feature model
  G.2 Latent feature PPCA model

H MATLAB® code for the LF-PPCA model
  H.1 Main script
  H.2 UpdateZ(·) function
  H.3 logPYZ(·) function
  H.4 logPY(·) function

References


List of Figures

1  Demonstration of PCA
2  Demonstration of PCA failing under the single plane assumption
3  Demonstration of PCA re-projection
4  Demonstration of PCA re-projection failing
5  Demonstration of how PCA fails with outliers
6  Demonstration of the MPPCA model
7  Linear Gaussian latent feature model weight matrix
8  Linear Gaussian latent feature model observation example
9  Linear Gaussian latent feature model number of classes over each iteration
10 Linear Gaussian latent feature model log-likelihood over each iteration
11 Linear Gaussian latent feature model weight matrix posterior
12 LF-PPCA example matrix Z
13 LF-PPCA example matrix F
14 LF-PPCA example of observation in 2 subspaces
15 Synthetic latent feature data
16 Detailed synthetic latent feature data
17 PCA performance on synthetic latent feature data
18 MPPCA performance on synthetic latent feature data
19 LF-PPCA performance on synthetic latent feature data
20 Synthetic latent feature data with different means
21 When a single LF-PPCA model fails


1 Introduction

1.1 Context

In the mathematics report [1] we described a prior for a nonparametric Bayesian latent feature model using a stochastic process called the Beta-Bernoulli process and the Indian Buffet process (which is the marginal process of the Beta-Bernoulli process); see section 4 for a quick summary. We aim to use the work from the mathematics report [1] to build a new linear dimensionality reduction model (see section 1.2) which works on data that has a latent feature structure.

1.2 Dimensional reduction

In recent years we have seen a huge rise in the computational power available to us; however, the amount of data has also increased. For example, the development of new techniques in the post-genomic era has generated enormous amounts of high-dimensional biological data, which has led to an exponential growth of many biological databases [16]. As the number of dimensions increases within a dataset, the computational workload becomes more demanding; this boils down to a combinatorics problem: if there are $D$ dimensions and each dimension can take $K$ states, then there are $K^D$ different combinations. The problems associated with high-dimensional data are often referred to as the 'curse of dimensionality'; see [6] for more details.

As the number of dimensions of the data grows, the computational power needed to process the data also needs to grow at an exponential rate, and there is no guarantee that every dimension is useful. With so many dimensions we may just be measuring the same underlying pattern in several different ways [4]. In other words, we may believe that high-dimensional data are multiple, indirect measurements of an underlying source which typically cannot be directly measured [3]. Therefore, we can use dimensionality reduction techniques to reduce the dimension of the data while minimising the loss of information in the process.

We are also restricted to visualising data in one, two and three dimensions, and most datasets are well above three dimensions, making the task of data visualisation on most datasets impractical. Dimensionality reduction techniques therefore provide us with a tool to lower the dimension of the dataset so that we can visualise it; a simple example of this can be seen in Figure 1.


2 Principal component analysis

Principal component analysis (PCA) is a very popular technique for dimensionality reduction¹; it uses an orthogonal transformation to transform observed data into linearly uncorrelated principal components. We can define PCA as follows: let $y_n$ be a $D$-dimensional data point, where $n \in \{1, ..., N\}$, and let $w_j$, $j \in \{1, ..., Q\}$, be the $Q$ principal axes, which are orthonormal and retain maximum variance under projection, where $Q \ll D$². The vectors $w_j$ are given by the $Q$ eigenvectors with the largest associated eigenvalues of the sample covariance matrix $S$, defined as
\[
S = \frac{1}{N}\sum_{n=1}^{N} (y_n - \bar{y})(y_n - \bar{y})^T ,
\]
such that $S w_j = \lambda_j w_j$, where $\bar{y}$ is the sample mean. We can then define the vector $x_n = W^T(y_n - \bar{y})$, where $x_n$ can be viewed as the 'lower $Q$-dimensional representation of $y_n$'; alternatively, $x_n$ is the subspace³ representation of $y_n$.

To demonstrate how PCA works we will see how we can reduce the dimension of data from 2D to 1D. Let us first assume our data $y_n$ is in 2D, and our goal is to lower the dimension to 1D, which gives our $x_n$; this means that we will project the data onto a single principal component. Using the notation from above, $D = 2$ and $Q = 1$, so the matrix $W$ is now just a $2 \times 1$ vector; the results can be seen in Figure 1.

¹ All dimensionality reduction techniques mentioned in this report are linear dimensionality reduction techniques, which assume that data can be mapped linearly from an observed high-dimensional space to a low-dimensional subspace.

² For the sake of visualisation, we will use $D = 2$ and $Q = 1$ throughout this report.

³ Formal definition of a subspace: let $V$ be a vector space; then $W$ is said to be a subspace of $V$ if $W$ is a subset of $V$ and the following hold: (1) if $w_1, w_2 \in W$, then $w_1 + w_2 \in W$; (2) for any scalar $\alpha$, if $w \in W$ then $\alpha w \in W$ [18]; see [18] for more examples.
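To make the construction above concrete, the following MATLAB sketch performs the 2D-to-1D projection just described on synthetic data; the data-generating line and all variable names are illustrative and are not taken from the report's own code.

% Minimal sketch of the 2D-to-1D PCA projection described above.
N = 100;
t = randn(N, 1);
Y = t * [2, 1] + 0.1 * randn(N, 2);   % N x 2 data scattered about a line
ybar = mean(Y, 1);                    % sample mean (1 x D)
Yc = Y - ybar;                        % centred data
S = (Yc' * Yc) / N;                   % D x D sample covariance matrix S
[V, L] = eig(S);                      % eigenvectors and eigenvalues of S
[~, idx] = sort(diag(L), 'descend');
W = V(:, idx(1));                     % principal axis w_1 (D x Q, here 2 x 1)
X = Yc * W;                           % N x 1 subspace representation x_n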


Figure 1: Left: The red points represent the data points in a 2D plane; the blue line aims to capture maximum variance under the projection from the 2D plane to a line in the 1D subspace. Right: The resulting 1D projection of the 2D data from the left image onto the blue line.

2.1 When PCA fails

So far, we have assumed that the high-dimensional data has a subspace representation such that all data lies on the same plane⁴ within the same subspace. This can be seen in Figure 1, where we can easily see that the 2D data lies on a single line in a 1D subspace. However, this is not always true; high-dimensional data often does not follow this simple assumption, as we can see in Figure 2. It is clear that in Figure 2 the 2D data does not lie on a single line within a 1D subspace; this confuses PCA, which aims to fit a single 1D line (the blue line) to capture the maximum amount of variance. As a result, two data points which are 'significantly' far apart in the 2D plane are projected 'near' one another in the 1D subspace; this is highlighted by the blue and black crosses in Figure 2. One possible solution is to use mixtures of probabilistic principal component analysers, which we will explore in section 3.4.

⁴ Can be a line, plane or hyperplane.


Figure 2: Left: The red points represent the data points in a 2D plane; the blue line aims to capture maximum variance under the projection from 2D to 1D. Right: The resulting 1D projection of the 2D data from the left image onto the blue line. The crosses indicate two different data points, both in the 2D plane and in the 1D projected subspace.

Once we are able to learn the projection matrix $W$, we can also reconstruct the original higher-dimensional data point $y_n$ from its subspace representation $x_n$; let us call this re-projection. In Figure 3 we can see how accurately we can project the lower-dimensional data back into its higher-dimensional form; the original and the re-projected data are very similar. However, this does not always work well. Let us assume that our original data $y_n$ is shifted by some scalar $\alpha$ in all $D$ dimensions $\forall n$; then our re-projection fails to capture the $\alpha$ shift in the data. This can be seen in Figure 4, where each data point $y_n$ $\forall n$ has been shifted by 100 units in every dimension; for example, before the shift $y_1 = [6.11, 11.24]^T$ and after the shift $y_1^{\text{shift}} = [106.11, 111.24]^T$.
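As an illustration of re-projection and of the mean-shift failure, the MATLAB sketch below reconstructs $\hat{y}_n = \bar{y} + W x_n$ and then applies the same learnt mean and axis to the shifted data. Re-using the original $\bar{y}$ and $W$ without re-fitting is one way the behaviour shown in Figure 4 can arise, and is an assumption of this sketch rather than a description of the exact experiment in the report.

% Sketch of re-projection and of the mean-shift failure; assumes Y, ybar and W
% from the previous sketch are in the workspace.
X    = (Y - ybar) * W;                 % project to the 1D subspace
Yhat = ybar + X * W';                  % re-projection back into 2D
reprojErr = mean(sqrt(sum((Y - Yhat).^2, 2)));        % small residual (Figure 3)

alpha  = 100;                          % shift every dimension by 100 units
Yshift = Y + alpha;
Xs     = (Yshift - ybar) * W;          % same ybar and W re-used (no re-fitting)
Yshat  = ybar + Xs * W';
shiftErr = mean(sqrt(sum((Yshift - Yshat).^2, 2)));   % large residual (Figure 4)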


Figure 3: Left: The red points represent the data points in a 2D plane; the blue points are the re-projection of the original 2D data. Right: The resulting 1D projection of the original 2D data from the left image.


Figure 4: Left: The red points represent the data points in a 2D plane; the blue points are the re-projection of the original 2D data. Right: The resulting 1D projection of the original 2D data from the left image.

Real data often has many outliers, which can cause PCA to incorrectly project data from the high-dimensional space onto the low-dimensional subspace; this can be seen in Figure 5, which shows how three outliers in the 2D plane impact the resulting 1D projection of a dataset consisting of one hundred observations (in red). Firstly, the 1D projection of the data suggests that 'Point 32' is closer to the actual data than 'Point 23' and 'Point 57'; however, in terms of the actual Euclidean distance, 'Point 57' is the closest of the three to the actual data. We should also observe that 'Point 23' and 'Point 32' are equally far away from the actual data; hence the 1D projection provides us with false information. By observing the resulting 1D projection in Figure 5, we may conclude with some confidence that 'Point 23' is in fact an outlier and remove it from the dataset, yet we may fail to see that 'Point 32' and 'Point 57' are also outliers (as they appear really close to the rest of the observations), which will result in us keeping some outliers within our dataset.


Figure 5: Left: The red points represent the actual data points in a 2D plane; the blue circle, the black cross and the blue cross all represent points within the data which are actual outliers. Right: The resulting 1D projection of the original 2D data from the left image.

2.2 Other problems with PCA

Although this method is widely used, it lacks an associated probability density or generative model [2]. If we had some form of density estimation with PCA, then the corresponding likelihood would allow us to compare it with other density-estimation techniques [2], which would of course facilitate statistical testing; it would also allow us to deal with any missing data [17]. We can also use a Bayesian approach, which allows us to compute marginal likelihoods; this is a more robust way of incorporating added uncertainty when computing probabilities of out-of-sample data [1]. Another advantage of a Bayesian approach is that we can also add any prior information we may have into the model. The PCA model is also restricted to linear dimensionality reduction, which implies that the data follows a global linear structure; however, data often follows a non-linear structure. This has led to the development of global non-linear models, but these often have a computational complexity of $O(N^3)$ [19]. Another, more efficient way to model globally non-linear structured data is to assume that the data follows a locally linear structure; therefore, once we have some generative principal component analysis model, we can extend the single PCA model into a model with mixtures of PCA, which can be viewed as a collection of local linear sub-models [2]; see Figure 8 of [2] for a demonstration. Therefore, in the next section we will aim to introduce the probabilistic version of PCA.


3 Probabilistic principal component analysis

Before we introduce the probabilistic version of PCA, we must first understand the concept of latent variable models.

3.1 Latent variable models

A latent variable model aims to relate an observed $D$-dimensional vector $y_n$ to a corresponding $Q$-dimensional vector $x_n$, where $x_n$ is referred to as the latent variable; in the general case we want $D > Q$, such that we can describe high-dimensional data in a more parsimonious way [2]. The model can be described as

\[
y_n = f(x_n; W) + \epsilon \tag{1}
\]

where $f(\cdot\,;\cdot)$ is a function such that $f: x_n \mapsto y_n$, $W$ is a matrix of parameters and $\epsilon$ is some independent noise.

3.1.1 Factor analysis

The factor analysis model is a type of latent variable model which, firstly, has a linear function $f(\cdot\,;\cdot)$ and, secondly, has a probability distribution over the latent variables and over the independent noise; traditionally this distribution is a Gaussian. The factor analysis model can be seen below:

\[
\begin{aligned}
y_n &= W x_n + \mu + \epsilon \\
x_n &\sim \mathcal{N}(0, I) \\
\epsilon &\sim \mathcal{N}(0, \Psi)
\end{aligned}
\tag{2}
\]

where $y_n$ is the observed $D \times 1$ vector. The matrix $W$ is a $D \times Q$ matrix which contains the factor loadings [2]. The latent variable $x_n$ is a $Q \times 1$ vector. The vector $\mu$ is a $D \times 1$ vector whose elements are the means of each dimension $d$ of $y_n$, where $d \in \{1, ..., D\}$ and $n \in \{1, ..., N\}$. The independent noise $\epsilon$ is also a $D \times 1$ vector, with $\Psi$ being a $D \times D$ diagonal matrix.

By using the model described in equation (2), we can see that the probability distribution over $y_n$ is $y_n \sim \mathcal{N}(\mu, C)$ (see Appendix A for the derivation), where $C = \Psi + WW^T$ is a covariance matrix. The main idea is that the dependencies between the data variables $y_n$ are explained by a smaller number of latent variables $x_n$, while $\epsilon$ represents variance unique to each observation variable.
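A minimal MATLAB sketch of drawing a single observation from the factor analysis model in equation (2) is given below; the dimensions and parameter values are illustrative only.

% Hedged sketch: one draw from the factor analysis model in equation (2).
D = 5; Q = 2;
W     = randn(D, Q);                     % factor loadings
mu    = randn(D, 1);                     % mean vector
Psi   = diag(0.1 + rand(D, 1));          % diagonal noise covariance Psi
x     = randn(Q, 1);                     % x_n ~ N(0, I)
noise = sqrt(diag(Psi)) .* randn(D, 1);  % eps ~ N(0, Psi), Psi diagonal
y     = W * x + mu + noise;              % observed vector y_n
C     = Psi + W * W';                    % marginal covariance: y_n ~ N(mu, C)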


3.2 Probabilistic PCA model

The probabilistic version of PCA uses a slightly modified version of the factor analysis framework (see equation (2)). The probabilistic PCA (PPCA) model can be seen below:

\[
\begin{aligned}
y_n &= W x_n + \mu + \epsilon \\
x_n &\sim \mathcal{N}(0, I) \\
\epsilon &\sim \mathcal{N}(0, \sigma^2 I)
\end{aligned}
\tag{3}
\]

where $y_n$ is the observed $D \times 1$ vector. The matrix $W$ is a $D \times Q$ matrix. The latent variable $x_n$ is a $Q \times 1$ vector. The vector $\mu$ is a $D \times 1$ vector whose elements are the means of each dimension $d$ of $y_n$, where $d \in \{1, ..., D\}$ and $n \in \{1, ..., N\}$. The independent noise $\epsilon$ is also a $D \times 1$ vector, with isotropic noise [2].

The probability distribution over $y_n$ for a given $x_n$ is
\[
P(y_n \mid x_n) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_n - W x_n - \mu)^T (y_n - W x_n - \mu) \right\} \tag{4}
\]
The prior distribution over $x_n$ is
\[
P(x_n) = \frac{1}{(2\pi)^{Q/2}} \exp\left\{ -\frac{1}{2} x_n^T x_n \right\} \tag{5}
\]
The marginal distribution over $y_n$ is
\[
P(y_n) = \frac{1}{(2\pi)^{D/2}} |C|^{-1/2} \exp\left\{ -\frac{1}{2} (y_n - \mu)^T C^{-1} (y_n - \mu) \right\} \tag{6}
\]
where $C = \sigma^2 I + WW^T$; this can be derived using the concepts discussed in Appendix A. The posterior distribution of the latent variable $x_n$ given the observed $y_n$ is
\[
P(x_n \mid y_n) = \frac{|\sigma^{-2} M|^{1/2}}{(2\pi)^{Q/2}} \exp\left\{ -\frac{1}{2} \left(x_n - M^{-1} W^T (y_n - \mu)\right)^T (\sigma^{-2} M) \left(x_n - M^{-1} W^T (y_n - \mu)\right) \right\} \tag{7}
\]
where $M = W^T W + \sigma^2 I$; see Appendix B for the full derivation.

The latent variables $x_n$ (which we believe generated the observed data $y_n$) are unknown, so there is no straightforward closed-form solution for the parameters $W$ and $\sigma^2$ of the model. Therefore, we must use some iterative algorithm such as expectation-maximisation, variational Bayes or Markov chain Monte Carlo methods to do inference with PPCA; see Appendix C for the expectation-maximisation approach.
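Once $W$ and $\sigma^2$ have been estimated, the posterior in equation (7) is cheap to evaluate; the hedged MATLAB sketch below computes its mean and covariance for a single data point, with all parameter values chosen purely for illustration.

% Hedged sketch of the PPCA posterior in equation (7): mean M^{-1} W'(y - mu)
% and covariance sigma^2 M^{-1}, with M = W'W + sigma^2 I.
D = 4; Q = 2; sigma2 = 0.1;
W  = randn(D, Q);
mu = zeros(D, 1);
y  = randn(D, 1);                  % an observed data point y_n
M  = W' * W + sigma2 * eye(Q);     % Q x Q matrix M
xMean = M \ (W' * (y - mu));       % posterior mean of x_n
xCov  = sigma2 * inv(M);           % posterior covariance of x_n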


3.3 Links to traditional PCA

In traditional PCA, the high-dimensional data point $y_n$ is represented by a single point $x_n$ in the latent space (subspace). With PPCA, however, the high-dimensional data point $y_n$ is represented in the subspace by a Gaussian posterior distribution with mean $M^{-1} W^T (y_n - \mu)$ (see equation (7)). It should also be noted that as the term $\sigma^2$ in the PPCA model approaches zero, the PPCA model (equation (3)) becomes the traditional PCA model; see section 3.3 of [2] for more information. Furthermore, section 3.4 of [2] shows how the PPCA model is closer to the traditional PCA model than it is to the factor analysis model.

3.4 Mixtures of PPCA

In Figure 2 we showed how PCA fails if the subspace representation of the data does not lie on a single line (for simplicity we will assume that the subspace is 1D). In this section we will introduce the concept of having a mixture of probabilistic principal component analysers (MPPCA). Let us first assume that there exist $K$ different PPCA mixture components within our data, and let $\pi_k$ denote the proportion of the data points which are modelled by the $k$th PPCA mixture, where $k \in \{1, ..., K\}$ and $\sum_{k=1}^{K} \pi_k = 1$; note that we assume each data point can be modelled by only one of the $K$ PPCA mixtures that exist in this model. Therefore, for the $k$th PPCA mixture we need to define a mixing proportion $\pi_k$, a matrix $W_k$, a mean vector $\mu_k$ and a variance $\sigma_k^2$; these parameters can be seen in equation (3). The MPPCA model also allows us to set a different dimension for each mixture's subspace (often denoted $d_k$); for example, if $K = 2$, $d_1 = 1$ and $d_2 = 2$, then data points modelled by the first PPCA mixture can be projected into a 1D subspace and data points modelled by the second PPCA mixture can be projected into a 2D subspace. As in the case of PPCA, the MPPCA model has no straightforward closed-form solution for the parameters, so we must use some iterative algorithm such as expectation-maximisation, variational Bayes or Markov chain Monte Carlo methods to do inference with MPPCA; see Appendix D for the expectation-maximisation approach.
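For concreteness, the MATLAB sketch below evaluates the MPPCA mixture density $p(y_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y_n; \mu_k, C_k)$ with $C_k = \sigma_k^2 I + W_k W_k^T$, the marginal implied by mixing $K$ PPCA models of equation (3); the parameter values are illustrative rather than learnt, and mvnpdf requires the Statistics and Machine Learning Toolbox.

% Hedged sketch: evaluating the MPPCA mixture density at a single point y.
D = 2; K = 2; d = 1;
piK = [0.5, 0.5];                     % mixing proportions pi_k
Wk  = {randn(D, d), randn(D, d)};     % per-mixture projection matrices W_k
muK = {zeros(D, 1), zeros(D, 1)};     % per-mixture means mu_k
s2K = [0.05, 0.05];                   % per-mixture variances sigma_k^2
y   = randn(D, 1);
p = 0;
for k = 1:K
    Ck = s2K(k) * eye(D) + Wk{k} * Wk{k}';          % covariance C_k
    p  = p + piK(k) * mvnpdf(y', muK{k}', Ck);      % add k-th contribution
end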

3.4.1 Application of MPPCA

In Figure 6 you can see the results from the MPPCA model with $K = 2$ and $d_1 = d_2 = 1$; this model now assumes that the 2D data has two different projection lines in the 1D subspace. This model aims to solve the issue we faced in Figure 2, where traditional PCA projected two points 'near' each other even though they were 'significantly' far apart in the 2D plane.


Figure 6: Left: The red points represent the data points in a 2D plane; the blue line aims to capture maximum variance under the first projection from 2D to 1D and the green line aims to capture maximum variance under the second projection from 2D to 1D. Right: A plot of the data points from the left, coloured to show which projection (or PPCA mixture) each data point is modelled with.

This model requires prior information on the number of PPCA mixtures ($K$) and their respective subspace dimensions ($d_k$) needed to model our data. Extracting this information from a simple plot is easy when the data is 2D or 3D; however, most data is well above that. Therefore, we need a model that grows in complexity as more data is observed; we will explore this in the next section.


4 Nonparametric latent feature models

A fundamental issue with both the PCA (or PPCA) and MPPCA models is choosing the number of principal components (and, in the MPPCA case, the number of PPCA mixtures); to overcome this we need a flexible model which can find the optimal solution. Therefore, in this section we will introduce the foundations of nonparametric latent feature models, which will be useful in deriving a more flexible dimensionality reduction model.

4.1 A nonparametric Bayesian approach

In the mathematics report [1] we discussed how the Bayesian framework allows us to compute marginal likelihoods; this is a more robust way of incorporating added uncertainty when computing probabilities of out-of-sample data. For example, sophisticated models are created using past data within a Bayesian framework to make predictions about future stock prices [7]. We also discussed how a Bayesian model can be parametric, i.e. the vector of parameters $\theta = [\theta_1, \theta_2, ..., \theta_K]$ is fixed at $K$. A Bayesian model can also be nonparametric; this just means that the parameter space is infinite-dimensional, so the vector of parameters $\theta = [\theta_1, \theta_2, ...]$ has no upper bound and grows as more data is observed [9].

The nonparametric Bayesian framework is therefore a useful method when modelling complex data. A simple parametric model can under-fit the observed data (see Figure 2), as not enough parameters are defined to 'capture' the underlying structure of the observed data. A complex parametric model, on the other hand, can over-fit the observed data, meaning the parameters also 'capture' any noise in the data used to train the model; this results in the model having poor generalisation [8] on new (unobserved) data. A nonparametric model falls in between a simple parametric model and a complex parametric model; the infinite dimensions allow the model to 'capture' the underlying structure as more data is observed.

4.2 Latent feature models

In the mathematics report [1] we discussed how a latent class model assumes that an object belongs to a single class, and this class assignment then determines the distribution of the object. On the other hand, we discussed how a latent feature model assumes that an object can belong to multiple classes, also called features, where the properties of that object are generated from a distribution determined by the features (these latent feature values can be either discrete [10] or continuous [2]). Using this we discussed how a latent feature model is preferred over a latent class model in certain situations (see section 3.3 of [1]).

We can use a similar argument for why a latent feature model should be used instead of a latent class model when working with MPPCA. Observe the data in the left panel of Figure 6: it can be argued that the data which lies in the centre, where the two 1D lines cross, can belong to either 'Projection 1' or 'Projection 2'; although the MPPCA allows us to project the data onto either of the two 1D planes, it restricts us to assuming that a data point belongs to a single mixture of PPCA.

4.2.1 Nonparametric latent feature models

The first step in defining a nonparametric Bayesian model is to define the prior for the model. We can assume these latent features are stored in a matrix $F$; this matrix can be broken down into two independent components:

• The first component is a matrix $A$, which determines the value of each feature for each data point.

• The second component is a sparse binary matrix $Z$, indicating whether a data point possesses a feature. For example, in the case of MPPCA, if $z_{n,k} = 1$ then data point $n$ can be modelled using the $k$th PPCA mixture.

It can also be shown that the matrix $F$ is an element-wise product of the matrices $A$ and $Z$ [5]; therefore, the prior for matrix $Z$ is independent of the prior for matrix $A$. This means the prior on the matrix $F$ can be expressed as $P(F) = P(Z)P(A)$. To define a prior for a matrix $F$ with potentially an infinite number of features (nonparametric Bayesian), one has to define a prior on an infinite sparse binary matrix $Z$ and an infinite matrix $A$ [1].

4.3 Indian Buffet process

To work with a nonparametric Bayesian latent feature model, we must first define a prior over the infinite sparse binary matrix $Z$. In [1] we discussed how the Beta-Bernoulli process (see section 2 of [1]) can be used, but this requires one to define all the parameters of the infinite-dimensional parameter space. We then introduced a different stochastic process called the Indian Buffet process (IBP), which is often explained using a cuisine metaphor, hence the name; it can be used to marginalise over the surplus dimensions of the infinite-dimensional parameter space, allowing us to use only a finite subset of the available parameters to explain the sample [11].

4.3.1 Deriving the IBP

Let us assume there exists an Indian buffet with an infinite number of dishes. The first customer enters the buffet and takes a serving from each dish, stopping after a $\text{Poisson}(\alpha)$ number of dishes. Then the second customer enters; he/she moves along the buffet selecting the $d$th dish with probability $\frac{m_d}{2}$, where $m_d$ is the number of customers who have tried dish $d$; then, after reaching the end of all previously chosen dishes, he/she tries a $\text{Poisson}(\frac{\alpha}{2})$ number of new dishes. In general, the $n$th customer goes along the buffet of dishes selecting the $d$th dish with probability $\frac{m_d}{n}$, and then at the end tries a $\text{Poisson}(\frac{\alpha}{n})$ number of new dishes.

The results of the IBP can be represented using a sparse binary matrix $Z$ which has $N$ rows (after $N$ customers have entered the buffet) and an infinite number of columns (as the buffet has an infinite number of dishes). If the $n$th customer tried the $d$th dish then $z_{n,d} = 1$. The probability of any sparse binary matrix $Z$ being produced by this process is
\[
P(Z) = \exp(-\alpha H_N) \cdot \left( \frac{\alpha^{D^+}}{\prod_{j=1}^{N} D^j_{\text{new}}!} \right) \left( \prod_{d=1}^{D^+} \frac{(m_d - 1)!\,(N - m_d)!}{N!} \right) \tag{8}
\]
where $H_N = \sum_{j=1}^{N} \frac{1}{j}$, $D^+$ is the number of dishes that have been tried and $D^j_{\text{new}}$ is the number of new dishes tried by the $j$th customer. The proof of equation (8) can be seen in Appendix F of [1]. Notice that although we say the sparse binary matrix generated using the IBP has an infinite number of columns, after a finite number of customers only a finite number of dishes will have been tried; therefore, we can ignore the remaining (infinite number of) dishes and only need to keep a finite number of columns.
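A hedged MATLAB sketch of drawing $Z$ from the IBP prior, following the buffet metaphor above, is given below; $\alpha$ and $N$ are illustrative, and poissrnd requires the Statistics and Machine Learning Toolbox.

% Hedged sketch of one draw of the sparse binary matrix Z from the IBP prior
% (customers = rows, dishes = columns).
alpha = 0.5; N = 100;
Z = zeros(N, 0);                   % start with no dishes
m = zeros(1, 0);                   % dish counts m_d
for n = 1:N
    for d = 1:size(Z, 2)
        if rand < m(d) / n         % existing dish d chosen w.p. m_d / n
            Z(n, d) = 1;
            m(d) = m(d) + 1;
        end
    end
    kNew = poissrnd(alpha / n);    % number of new dishes for customer n
    if kNew > 0
        Z(n, end+1:end+kNew) = 1;  % grow Z with the new dishes
        m(end+1:end+kNew) = 1;
    end
end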

4.4 Example: Linear Gaussian latent feature model

We can use the linear Gaussian latent feature model to understand how a nonparametric Bayesian latent feature model works. In order to do this, we first consider a finite model.

Let us assume that an observed data point $x_n$ is a $1 \times D$ vector generated from a Gaussian distribution with mean $z_n A$ and covariance matrix $\Sigma_X = \sigma_X^2 I$, where $z_n$ is a $1 \times K$ binary latent vector and $A$ is a $K \times D$ matrix of weights. If we stack all $x_n$ for $n \in \{1, ..., N\}$ in a matrix, such that $X = [x_1, x_2, ..., x_N]^T$ and $Z = [z_1, z_2, ..., z_N]^T$, then the distribution of $X$ given $A$, $Z$ and $\sigma_X$ is simply a matrix Gaussian, which can be defined as
\[
P(X \mid A, Z, \sigma_X) = \frac{1}{(2\pi\sigma_X^2)^{ND/2}} \exp\left\{ -\frac{1}{2\sigma_X^2} \operatorname{tr}\!\left( (X - ZA)^T (X - ZA) \right) \right\} \tag{9}
\]
where the function $\operatorname{tr}(\cdot)$ gives the trace of a matrix (the sum of its diagonal elements [12]). The matrix $A$ also follows a matrix Gaussian with its own variance term $\sigma_A$, which can be defined as
\[
P(A \mid \sigma_A) = \frac{1}{(2\pi\sigma_A^2)^{KD/2}} \exp\left\{ -\frac{1}{2\sigma_A^2} \operatorname{tr}\!\left( A^T A \right) \right\} \tag{10}
\]
We can simplify this model by integrating over $A$; see Appendix E for more details. The likelihood term of the model can then be written as
\[
P(X \mid Z, \sigma_X, \sigma_A) = \frac{ \exp\left\{ -\frac{1}{2\sigma_X^2} \operatorname{tr}\!\left( X^T \left( I - Z M Z^T \right) X \right) \right\} }{ (2\pi)^{ND/2} \, \sigma_X^{(N-K)D} \, \sigma_A^{KD} \, \left| Z^T Z + \frac{\sigma_X^2}{\sigma_A^2} I \right|^{D/2} } \tag{11}
\]
where $M = \left( Z^T Z + \frac{\sigma_X^2}{\sigma_A^2} I \right)^{-1}$. As we use the IBP as the prior for the matrix $Z$, we can use equation (11) to infer the updates to $Z$ using
\[
P(z_{n,k} \mid Z_{-(n,k)}, X, \sigma_X, \sigma_A) \propto P(X \mid Z, \sigma_X, \sigma_A) \, P(z_{n,k} \mid z_{-n,k}) \tag{12}
\]


where $Z_{-(n,k)}$ is the matrix $Z$ without the entry $z_{n,k}$, and likewise $z_{-n,k}$ is the set of assignments of the other data points to feature $k$, excluding the $n$th data point, where
\[
P(z_{n,k} = 1 \mid z_{-n,k}) = \frac{m_{-n,k}}{N} \tag{13}
\]
and $m_{-n,k}$ is a count of the data points which possess feature $k$, excluding the $n$th data point.

So far we have worked with a fixed value of $K$, which implies a finite linear Gaussian latent feature model; for this model to work in the nonparametric case, it should still be well defined as $K \to \infty$. In section 5.3 of [5], Griffiths and Ghahramani show how the likelihood (in our case equation (11)) for the infinite model is well defined.

4.4.1 Synthetic data

In this section we will demonstrate how we can use the infinite linear Gaussian latent feature model by running it on synthetic data, which was created in the following way (a short sketch of this procedure is given after the list):

1. We set the values of the parameters: $N = 100$, $K = 4$ and $D = 36$.

2. We randomly generated a $100 \times 4$ binary matrix $Z$.

3. We produced a $4 \times 36$ weight matrix $A$, with all non-zero elements set to one.

4. We generated the synthetic dataset $X = ZA$.

5. We added some independent Gaussian noise to the dataset $X$.
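A short MATLAB sketch of this generation procedure follows; the particular binary patterns placed in the rows of $A$ are illustrative stand-ins for the four feature images shown in Figure 7, not the exact ones used in the report.

% Hedged sketch of the synthetic-data recipe above.
N = 100; K = 4; D = 36;
Z = double(rand(N, K) < 0.5);    % random N x K binary matrix
A = zeros(K, D);
A(1, 1:9)   = 1;                 % four sparse binary rows, one per feature
A(2, 10:18) = 1;
A(3, 19:27) = 1;
A(4, 28:36) = 1;
X = Z * A + 0.5 * randn(N, D);   % observations X = ZA plus Gaussian noise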

The weight matrix $A$ is difficult to visualise directly; however, in Figure 7 we reshape each row of $A$ into a $6 \times 6$ image to visualise it.

Figure 7: Each box represents one of the four different latent feature structures, where the white colour represents the non-zero elements and the black represents the zero elements.

We can also use a $6 \times 6$ representation to view data points. The 13th observation has the binary vector $z_{13} = [0, 1, 0, 1]$; in Figure 8 we can see a visual representation of the data point $x_{13}$, and we can also see what difference the noise makes.


Figure 8: Left: A $6 \times 6$ visual representation of $z_{13}A$; the yellow squares indicate 1 and the purple squares indicate 0. Right: A $6 \times 6$ visual representation of $x_{13}$, which is simply $z_{13}A$ (from the left) with additive independent Gaussian noise; the bright (yellowish) colours represent entries close to 1, and the dark (blueish) colours represent entries close to 0.

The linear Gaussian latent feature model was initialised with $K = 1$ (with $z_{n,1} = 1$ with probability 0.5), $\sigma_X = 0.5$, $\sigma_A = 0.5$ and $\alpha = 0.5$ (the parameter for the IBP; see section 4.3). The inference was done using Metropolis-Hastings with 1000 iterations. Figure 9 shows the value of $K$ over the 1000 iterations, with the sampler correctly identifying the true value of $K = 4$. Figure 10 shows the value of the log-likelihood (the log of equation (11)) over the 1000 iterations.

The posterior mean of the weight matrix $A$ given the matrix $X$ and the newly learnt matrix $Z$ is
\[
\mathbb{E}[A \mid X, Z] = \left( Z^T Z + \frac{\sigma_X^2}{\sigma_A^2} I \right)^{-1} Z^T X \tag{14}
\]
The expression in equation (14) was derived using equation (52) alongside the working in Appendix B. The results obtained by equation (14) are extremely important in evaluating how well the model has converged; this can be seen in Figure 11, where the results look very similar to those of Figure 7. This means that by observing the data $X$, the linear Gaussian latent feature model was able to learn both the matrix $A$ and the matrix $Z$.
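Assuming $X$ is the data matrix and $Z$ the binary matrix learnt by the sampler, equation (14) is a single regularised least-squares solve; a minimal MATLAB sketch follows, using the $\sigma_X = \sigma_A = 0.5$ values quoted above.

% Hedged sketch of equation (14): posterior mean of A given X and the learnt Z.
sigX = 0.5; sigA = 0.5;
K = size(Z, 2);
Apost = (Z' * Z + (sigX^2 / sigA^2) * eye(K)) \ (Z' * X);   % E[A | X, Z]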


Figure 9: A plot of the number of latent classes (K) identified over each iteration.

Figure 10: A plot of the value for the log-likelihood (log of equation (11)) over each iteration.


Figure 11: Each box represents one of the four latent feature structures learnt by our model; the bright (yellowish) colours represent entries close to 1, and the dark (blueish) colours represent entries close to 0.

The inference for the linear Gaussian latent feature model was done using a Markov chain Monte Carlo (MCMC) method; this can be seen in Algorithm 1 of Appendix G.1.

This application of the linear Gaussian latent feature model demonstrates that this nonparametric Bayesian approach can efficiently learn the latent structure of the data without it having to be fixed a priori [5]. The features used in this example were constructed in such a way that each 'object' appeared in a fixed orientation and in a fixed location, which sidesteps a basic challenge of computer vision [5]; this implies that we would need a more complex likelihood function than a simple Gaussian (equation (11)) if we wanted to apply this model in a computer vision scenario.

Although this model is too simple for computer vision, it provides the framework which allows us to derive a flexible dimensionality reduction model. Let us first observe the similarities between the linear Gaussian latent feature model (equation (9)) and the PPCA model (equation (4)): both of these models assume that the observed data is drawn from some Gaussian distribution (given their respective parameters). In section 4.2 we also highlighted how a flexible dimensionality reduction model would allow data points to be projected down onto multiple subspaces; now observe that the linear Gaussian latent feature model allows data points to have multiple features. Therefore, we can use this multiple-feature structure of the linear Gaussian latent feature model to encode the multiple-subspace projection information. In the next section we will aim to combine the PPCA model with the structure of the linear Gaussian latent feature model to derive a flexible dimensionality reduction model.


5 Latent feature probabilistic principal component analysis

In section 4.2 we discussed how latent feature models in the context of PPCA would allow us to have an unrestricted model which would, firstly, allow us to learn the optimal number of principal components and, secondly, allow us to project data onto multiple subspaces simultaneously. We also discussed in section 3.4 that although the MPPCA model solves many issues of the traditional PCA model, it requires us to have prior information about the number of PPCA mixtures that exist in the data ($K$) and their respective dimensions, which restricts the flexibility of the model. We could also use the Bayesian information criterion (BIC) to determine the dimensions of the PPCA model; however, this can be time-consuming as it requires us to run the PPCA model multiple times to obtain the optimal choice for the dimension of the subspace; see [20] for more details. Therefore, in this section we will introduce a nonparametric PPCA model which aims to solve these problems in a rigorous and parsimonious way.

5.1 The latent feature PPCA model

In this section we will introduce the latent feature PPCA (LF-PPCA) model. Let us first assume that $y_n$ is a $D$-dimensional observed vector, and let us also assume that each data point $y_n$, $\forall n \in \{1, 2, ..., N\}$, can live in multiple subspaces. In the traditional PCA case we assumed that each $D$-dimensional observed vector $y_n$ lived in a single $Q$-dimensional subspace (giving $x_n$), where $D > Q$; but now we assume that each $D$-dimensional observed vector $y_n$ can live in a subspace whose dimension is not restricted to $Q$ but can in fact be up to $D$, meaning we now have $D$ independent 1D subspaces onto which our data can be projected. Therefore, for each $y_n$ there exists a $D$-dimensional vector $f_n$ which encodes the subspace information of $y_n$; for example, if we assume that $y_n$ has a $Q$-dimensional subspace, then the vector $f_n$ has $Q$ non-zero elements and $D - Q$ zero elements.

The vector $f_n$ is simply the Hadamard product (element-wise product) of two independent $D$-dimensional vectors $x_n$ and $z_n$, such that $f_n = (x_n \odot z_n)$, where $x_n$ is a $D$-dimensional vector drawn from a multivariate Gaussian (with mean zero and some variance $\sigma_x$) and $z_n$ is a latent feature binary vector. We can use the vector $z_n$ to indicate the dimension of the subspace for $y_n$, and the vector $x_n$ to represent the corresponding values over each subspace; a detailed example can be seen in section 5.2. Note that we can also refer to the vector $x_n$ as the latent variables for data point $n$. The model can then be described as

\[
y_n = W f_n + \mu + \epsilon \tag{15}
\]

where $y_n$ is a $D \times 1$ vector, $W$ is a $D \times D$ matrix with an orthonormal basis, $f_n$ is a $D \times 1$ vector, $\mu$ is a $D \times 1$ mean vector and $\epsilon$ is some isotropic noise with covariance $\sigma_y^2 I$. Without loss of generality we can assume that the mean of $y_n$, $\forall n \in \{1, 2, ..., N\}$, is zero; hence we can remove $\mu$ from equation (15). This simply means that the mean of the data, $\mu$, has no effect on estimating the matrix $W$.


The probability distribution over $y_n$ given $W$, $x_n$, $z_n$ is
\[
P(y_n \mid W, x_n, z_n, \sigma_y) = \frac{1}{(2\pi\sigma_y^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma_y^2} \left( y_n - (W \odot z_n^T) x_n \right)^T \left( y_n - (W \odot z_n^T) x_n \right) \right\} \tag{16}
\]
where $W f_n = W(x_n \odot z_n) = (W \odot z_n^T) x_n$. The prior distribution over $x_n$ is
\[
P(x_n) = \frac{1}{(2\pi\sigma_x^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma_x^2} x_n^T x_n \right\} \tag{17}
\]

Towards the end of section 4.2.1 we discussed that in order to define the prior over a nonparametric latent feature model, one must define a prior over two independent infinite matrices (or, in our case, vectors); however, this can complicate the model. One way to simplify the model is to integrate over the latent feature coefficients $x_n$ (see Appendix F); note that we did something similar in the linear Gaussian latent feature model (section 4.4), and see section 8 of [19] for more information on why we would integrate over the latent variables. The vector $x_n$ carries important information about the value of each data point in each 1D subspace, so one might think that by integrating over it we lose important information. But once we learn the latent vectors $z_n$, we can find the posterior distribution of each $x_n$ and draw samples from it; so in reality no information is lost, and the model has become simpler. Once we integrate over the latent feature coefficients $x_n$, we obtain the following likelihood:
\[
P(y_n \mid W, z_n, \sigma_x, \sigma_y) = \frac{|M_n|^{1/2}}{(2\pi)^{D/2}\sigma_x^D} \exp\left\{ -\frac{1}{2\sigma_y^2} y_n^T \left( I - K_n M_n K_n^T \right) y_n \right\} \tag{18}
\]
where $M_n = \left( K_n^T K_n + \frac{\sigma_y^2}{\sigma_x^2} I \right)^{-1}$ and $K_n = W \odot z_n^T$. Once we learn the parameters and the latent vectors $z_n$, the posterior mean of $x_n$ given $y_n$, $W$, $z_n$, $\sigma_x$ and $\sigma_y$ is
\[
\mathbb{E}[x_n \mid y_n, W, z_n, \sigma_x, \sigma_y] = \left( K_n^T K_n + \frac{\sigma_y^2}{\sigma_x^2} I \right)^{-1} K_n^T y_n \tag{19}
\]
where $K_n = W \odot z_n^T$.
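A hedged MATLAB sketch of equations (18) and (19) for a single data point is given below; the function name is ours (this is presumably the role played by the logPYZ(·) routine in Appendix H), and implicit expansion (MATLAB R2016b or later) is assumed for the Hadamard product.

% Hedged sketch: collapsed log-likelihood of y_n (eq. 18) and posterior mean of
% x_n (eq. 19) for the LF-PPCA model, with K_n = W (Hadamard) z_n'.
function [ll, xMean] = lfppcaPoint(y, W, z, sigx, sigy)
    D  = numel(y);
    Kn = W .* z(:)';                           % zero out columns where z_d = 0
    Mn = inv(Kn' * Kn + (sigy^2 / sigx^2) * eye(D));
    ll = 0.5 * log(det(Mn)) - (D / 2) * log(2 * pi) - D * log(sigx) ...
         - (1 / (2 * sigy^2)) * (y' * (eye(D) - Kn * Mn * Kn') * y);
    xMean = Mn * (Kn' * y);                    % posterior mean of x_n, eq. (19)
end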

5.2 Interpreting results

In this section we will cover a 'dummy' example of how one can interpret the results of the LF-PPCA model. We will use the same notation as in the previous section, but now, instead of working with individual data points, we work with the entire dataset, such that $Y = [y_1^T, ..., y_N^T]$, $F = [f_1^T, ..., f_N^T]$, $Z = [z_1^T, ..., z_N^T]$ and $X = [x_1^T, ..., x_N^T]$. Let us assume the dimension of the data is $D = 4$; this means that there are four independent 1D subspaces onto which our data can potentially be projected. Now let us assume we have obtained


the $Z$ matrix shown in Figure 12. Note that we use 'Feature 1' interchangeably with 'subspace 1', where 'subspace 1' is one of the four 1D subspaces onto which our data can be projected.

Figure 12: A graphical representation of the latent feature matrix Z, where a shaded box indicates that the data point possesses that feature, or in our case that it can be projected onto that particular subspace.

The matrix in Figure 12 indicates that data points $y_1$ and $y_2$ have an optimal lower-dimensional projection in a 2D subspace consisting of 'subspace 1' and 'subspace 3'. However, we can also interpret the results of matrix $Z$ in the following way: the data points $y_1$, $y_2$ and $y_N$ also have a lower-dimensional projection in the 1D 'subspace 1'. Hence, this model not only finds the optimal lower-dimensional subspace, but also provides us with information on which subspace is shared by which data points.

Once we have found the matrix $Z$, the matrix $W$ and the parameters $\sigma_x, \sigma_y$ (see section 5.3), we can simply find the corresponding value of each data point in each subspace. Using equation (19) we can find the corresponding matrix $X$, and by taking the Hadamard product of $Z$ and $X$ we can find the resulting feature matrix $F$, which in this case can be seen in Figure 13.


Figure 13: A graphical representation of the matrix F, where each row represents the value of each data point within each subspace.

Note that the 0 entries in Figure 13 simply mean that the data point has no lower-dimensional representation within that subspace.

Figure 14: A graphical representation of data points y1 and y2 plotted in their optimal subspaces 1 and 3.


As data points $y_1$ and $y_2$ have an optimal lower-dimensional projection in a 2D subspace consisting of 'subspace 1' and 'subspace 3' (which are orthogonal), we can use the information from Figure 12 to plot $f_1$ and $f_2$, as shown in Figure 14.

5.3 Inference

In this section we will cover how to do inference in the model from equation (15); in order to do this we will explore how to update each parameter of the model. In section 5.1 we mentioned how we can integrate over the latent feature coefficients $x_n$ and obtain the likelihood described in equation (18), which means we only work with one infinite vector (the latent features $z_n$) instead of two infinite vectors (the latent features $z_n$ and the latent feature coefficients $x_n$); this simplifies the model; see [19] for more details. However, the latent feature coefficients $x_n$ contain important information about the subspace projection of the data point $y_n$, so one could argue that by integrating over them we lose important information. But once we have obtained the latent feature vectors $z_n$ using a model with the likelihood described by equation (18), we can infer the values of the latent feature coefficients $x_n$ by using equation (19), which gives the posterior mean of $x_n$ given $y_n$, $z_n$ and the parameters $W$, $\sigma_x$, $\sigma_y$.

If the observed data has $D$ dimensions, then we can assume that the subspace can also be up to $D$-dimensional. If we let $D^+$ represent the dimension of the subspace at the current iteration, such that $D \geq D^+$, then we first initialise the algorithm with $D^+ = 1$ and update it after each iteration using a restricted IBP; the restriction ensures that the IBP does not produce a matrix $Z = [z_1^T, ..., z_N^T]$ with more than $D$ columns, because we cannot infer more than $D$ orthonormal subspaces given that the observed data is in $D$ dimensions⁵.

We assume that $W$ is a $D \times D^+$ matrix which has an orthonormal basis and aims to capture the maximum variance of the data in each direction; therefore, we can simply construct it using the traditional PCA approach. The vectors $w_j$ are then given by the $D^+$ eigenvectors with the largest associated eigenvalues of the sample covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N}(y_n - \bar{y})(y_n - \bar{y})^T$, such that $S w_j = \lambda_j w_j$, where $\bar{y}$ is the sample mean. After each iteration, if the number $D^+$ increases to $D^+ + 1$, then we simply use the $D^+ + 1$ eigenvectors with the largest associated eigenvalues of the sample covariance matrix $S$.

To update the matrix $Z$, we can use an approach similar to the one we used in the linear Gaussian latent feature model (section 4.4), which can be described as
\[
P(z_{n,d} \mid Z_{-(n,d)}, y_n, W, \sigma_x, \sigma_y) \propto P(y_n \mid W, z_n, \sigma_x, \sigma_y) \, P(z_{n,d} \mid z_{-n,d}) \tag{20}
\]
where $Z_{-(n,d)}$ is the matrix $Z$ without the entry $z_{n,d}$, and likewise $z_{-n,d}$ is the set of assignments of the other data points to feature $d$, excluding the $n$th data point, where
\[
P(z_{n,d} = 1 \mid z_{-n,d}) = \frac{m_{-n,d}}{N} \tag{21}
\]

⁵ An alternative view of this is the case of clustering, where we cannot define more than $N$ clusters if we have only observed $N$ data points.


where $m_{-n,d}$ is a count of the data points which possess feature $d$, excluding the $n$th data point.

The algorithm used to do inference with the LF-PPCA model can be seen in Algorithm2 of Appendix G.2.
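A hedged MATLAB sketch of the $z_{n,d}$ update in equations (20) and (21) is given below; it covers only the resampling of existing features for one data point (the restricted-IBP step that proposes new features is omitted), it re-uses the lfppcaPoint(·) sketch from section 5.1, and the function name is ours rather than the UpdateZ(·) routine of Appendix H.

% Hedged sketch: collapsed Gibbs update of one row of Z, equations (20)-(21).
% mMinus(d) holds the counts m_{-n,d} for the current data point.
function z = updateZrow(y, W, z, mMinus, N, sigx, sigy)
    for d = 1:numel(z)
        z1 = z; z1(d) = 1;                     % feature d switched on
        z0 = z; z0(d) = 0;                     % feature d switched off
        logp1 = lfppcaPoint(y, W, z1, sigx, sigy) + log(mMinus(d) / N);
        logp0 = lfppcaPoint(y, W, z0, sigx, sigy) + log(1 - mMinus(d) / N);
        p1 = 1 / (1 + exp(logp0 - logp1));     % normalised P(z_{n,d} = 1 | ...)
        z(d) = double(rand < p1);
    end
end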

5.4 Synthetic data

In this section we will compare the LF-PPCA model with both the MPPCA model (section 3.4) and the traditional PCA model (section 2). Let us assume we have some data where each observation can lie on multiple planes within the subspace simultaneously; this can be seen in Figure 15.

Note that the data in Figure 15 looks very similar to the data from Figure 6. We can also see that there are two distinct 1D lines which generate the data; Figure 16 shows what each 1D plane in Figure 15 looks like.

Figure 15: Plot of 2D data where an observation can simultaneously lie on two different lines in the 1D subspace.


Figure 16: Left: The red points represent the data points in a 2D plane; the blue line shows the first projection from 1D to 2D and the green line shows the second projection from 1D to 2D. Middle: A plot of the original data points which were generated under Projection 1. Right: A plot of the original data points which were generated under Projection 2.

In the next few sections we will compare different methods we’ve discussed so far on thedata from Figure 16.

5.4.1 PCA approach

One limitation of PCA is that we must project the entire dataset from the higher-dimensional space into the lower-dimensional subspace, so there would be no real point in dimensionality reduction with PCA if we were to project the entire 2D dataset back into a 2D subspace; therefore we will aim to project the 2D data from Figure 16 into a 1D subspace. We can also observe that PCA with $K$ principal components (where $K < D$) is a special case of the LF-PPCA model with no noise and ones in the first $K$ columns of the latent feature matrix $Z = [z_1^T, ..., z_N^T]$.

Figure 17 shows the results we obtain from using PCA with a single principal component (highlighted by the blue line). It is clear that we get the issue we had in Figure 2 in section 2.1, where two data points which were 'significantly' far apart in the 2D plane were projected 'near' each other in the 1D subspace. Therefore, in this case the PCA approach fails to find the optimal solution. Note that we would also get a similar result if we were to use the PPCA approach instead of the PCA approach.


Figure 17: The red points are the data points in a 2D plane; the blue line aims to capture maximum variance under the projection from 2D to 1D.

5.4.2 MPPCA approach

One possible solution to the PCA approach from above is to assume that there is a mixture of PPCAs within the data; however, we must have some prior information about this. We initialised the MPPCA model with K = 2 and d = 1 (the dimension of the subspace), meaning there are two different 1D subspaces which generated the data.

Figure 18 shows the results we obtain from using MPPCA with K = 2 and d = 1. We can see that the resulting 1D projection lines which the MPPCA finds (left image in Figure 18) are almost identical to the two actual 1D projection lines from the left image in Figure 16. However, there are two major issues with classifying each data point to the projection it belongs to: firstly, some data points are clearly misclassified, i.e. data points which are clearly generated by 'Projection 1' are classed as data points generated from 'Projection 2'; secondly, each data point is assigned to a single projection, which of course is an assumption of the MPPCA that is not satisfied by the data.

Although the MPPCA model performs better than the PCA/PPCA model on the data from Figure 16, it still fails to find an optimal solution.


Figure 18: Left: The red points represent the data points in a 2D plane, the blue line aims to capture maximum variance under the first projection from 2D to 1D and the green line aims to capture maximum variance under the second projection from 2D to 1D. Right: A plot of the original data points classified by colour to show which projection each data point is modelled with.

5.4.3 LF-PPCA approach

With the LF-PPCA model we assume that the data lies on subspaces which are orthonormal; therefore, it is quite similar to the PCA approach. As we have data in 2D, the maximum number of subspace projections we can use is two 1D lines. This means that some data points have an optimal projection on one of the two single 1D lines, and the other data points have an optimal projection on both 1D lines.

The results of the LF-PPCA model can be seen in Figure 19. Like the MPPCA model, the latent feature model also finds 1D projection lines (left image in Figure 19) which are almost identical to the two actual 1D projection lines from the left image in Figure 16. Furthermore, the LF-PPCA model allows data points to belong to multiple 1D subspaces simultaneously, which allows the model to find the optimal solution for the given data.


Figure 19: Left: The red points represent the data points in a 2D plane, the blue line aims to capture maximum variance under the first projection from 2D to 1D and the green line aims to capture maximum variance under the second projection from 2D to 1D. Middle: A plot of the original data points which are classed as being generated under Projection 1. Right: A plot of the original data points which are classed as being generated under Projection 2.

5.4.4 Results

The data we used in this test (Figure 16) had a mean of 0, and we assumed that an observed data point in the 2D plane can belong to a single line within the 1D subspace or to two different lines within the 1D subspace simultaneously. The issue with the PCA model was that we could only assume that the data can be projected down into a single 1D subspace; hence it failed to recognise the two different 1D subspaces our data can belong to. The MPPCA model found the correct matrix W; however, it also assumes that a data point can be projected down into a single 1D subspace (belong to a single PPCA mixture), therefore it failed to correctly model the data. The LF-PPCA model gave us the flexibility to allow data points to lie on different projections within the 1D subspace simultaneously, which allowed us to model the data from Figure 16 correctly.

For each method we find the mean of the Euclidean distance between the observed y_n and its re-projection (see section 2.1 for the definition) y_n^{re-proj} for all n ∈ {1, ..., N}; using this we evaluate the performance of each method, and the results can be seen below in Table 1. It is clear that the LF-PPCA model performed the best, with an average Euclidean distance of 0.28 between the observed y_n and its re-projection y_n^{re-proj} over all data points, while the MPPCA model performed the worst, with an average Euclidean distance of 0.97


between the observed y_n and its re-projection y_n^{re-proj} over all data points. The PCA model performed in between the LF-PPCA and the MPPCA (an average Euclidean distance of 0.66 between the observed y_n and its re-projection y_n^{re-proj} over all data points); this is because PCA is in fact a special case of the LF-PPCA (the first column of the matrix Z = [z_1^T, ..., z_N^T] equal to one, and the second column set to zero), hence it correctly re-projected some data points (roughly over a half) and incorrectly re-projected the others, as not enough subspaces were defined.

Method      Mean Euclidean distance
LF-PPCA     0.2847
PCA         0.6647
MPPCA       0.9714

Table 1: The average Euclidean distance between observed data and its re-projection for each method.
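For completeness, the following is a minimal MATLAB sketch of one way of computing the re-projection error reported in Table 1 for the LF-PPCA model; it assumes the sampler output Z (D+ x N) and W (D x D+), together with sigma_y and sigma_x, are available, and it uses the posterior mean of x_n from Appendix F as the low-dimensional representation of each point.

% Minimal sketch: mean Euclidean re-projection error for the LF-PPCA model,
% assuming Y (D x N), Z (D+ x N), W (D x D+), sigma_y and sigma_x are given.
[D, N] = size(Y);
err = 0;
for n = 1:N
    Kn   = W .* Z(:, n)';                 % K_n = W (element-wise) z_n'
    Mn   = inv(Kn' * Kn + (sigma_y^2 / sigma_x^2) * eye(size(Kn, 2)));
    xhat = Mn * Kn' * Y(:, n);            % posterior mean of x_n (Appendix F)
    yre  = Kn * xhat;                     % re-projection of y_n
    err  = err + norm(Y(:, n) - yre) / N; % running mean of the Euclidean error
end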

5.4.5 Different means

So far we have assumed, without loss of generality, that the mean of y_n for all n ∈ {1, 2, ..., N} is zero; hence we can remove the vector μ from equation (15). This simply means that the mean of the data (μ) has no effect on estimating the matrix W, and we can show this in Figure 20. Figure 20 shows three different datasets, all with different values for the vector μ, which have a re-projection error of approximately 0.28 for each dataset; this simply reinforces that in the LF-PPCA model the mean of the data (μ) has no effect on estimating the matrix W.


Figure 20: Left: Plot of 2D data where an observation can simultaneously lie on two different lines in the 1D subspace; the mean of the entire dataset is (0, 0). Middle: Plot of 2D data where an observation can simultaneously lie on two different lines in the 1D subspace; the mean of the entire dataset is (10, 10). Right: Plot of 2D data where an observation can simultaneously lie on two different lines in the 1D subspace; the mean of the entire dataset is (−5, 10).
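The invariance to the mean discussed above can also be checked directly: the sample covariance, and hence the eigenvectors used to build W, are unchanged when a constant vector is added to every observation. A minimal MATLAB sketch, assuming Y is a 2 x N data matrix:

% Minimal sketch: shifting every observation by a constant vector does not
% change the sample covariance, so the estimated projection matrix W is the
% same for each of the three datasets in Figure 20.
Yshift = Y + [10; 10];                       % dataset with its mean shifted to (10, 10)
difference = norm(cov(Y') - cov(Yshift'));   % equal up to numerical rounding
disp(difference)                             % prints (approximately) zero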


6 Summary and future direction

In this section we will first summarise the findings of this project; we will then discuss how the work we have introduced can be extended.

6.1 Summary

In this report we introduced different types of dimensionality reduction techniques which assume that data can be mapped linearly from an observed high-dimensional space to a low-dimensional subspace (also called linear dimensionality reduction techniques). In section 2 we first introduced how a simple PCA model works, and we showed how it would fail if the data did not follow certain assumptions. Then, in section 3, we introduced a probabilistic version of PCA (PPCA), which allows us to have a mixture of principal component analysers (MPPCA); this can solve problems in which the traditional PCA fails. However, the MPPCA also requires us to have prior information on the number of mixtures and their respective subspace dimensions, which is an easy task if the data is 3D or below, but impractical otherwise. The MPPCA also assumes that each high-dimensional observation has a single lower-dimensional representation within some subspace (a data point can only belong to a single PPCA mixture), which is not always true.

We then proposed a new linear dimensionality reduction model called the LF-PPCA model in section 5, which aims to solve the issue which the MPPCA model has. We first assumed that the data has a latent feature structure as opposed to a latent class structure; in our case, this means that a high-dimensional observation can have many low-dimensional representations within different subspaces; however, we assumed that the subspaces are orthogonal. We finally ran some tests to compare how well the PCA, MPPCA and LF-PPCA performed on some data. Both the PCA and MPPCA had certain assumptions which the data did not satisfy; for example, the MPPCA assumed that each data point can only have a single projection in the low-dimensional subspace, hence they both failed to model the data correctly. On the other hand, the flexibility of the LF-PPCA model allowed it to outperform both the PCA and MPPCA models; however, the LF-PPCA model also has an assumption which would limit the model in some scenarios, see section 6.2 for more details.

6.2 Future direction

In this report we introduced the LF-PPCA model, which outperformed both the PCA and the MPPCA model in our tests. However, the current LF-PPCA model can only work if each subspace intersects at zero, meaning the mean of each subspace within the dataset is zero (see equation (17)). The LF-PPCA model would fail if the subspaces do not have a mean of zero; this can be seen below in Figure 21. In Figure 21 we have 2D data which can be projected onto two different 1D subspaces (call them 'subspace 1' and 'subspace 2'); the data points can be projected onto either a single subspace or both subspaces simultaneously, and the colour of a data point indicates the true subspace the data


point can be projected onto: the red points can only be projected onto 'subspace 1', the green data points can only be projected onto 'subspace 2', and the blue data points can be projected onto both 'subspace 1' and 'subspace 2'.

Figure 21: Plot of 2D data where an observation can simultaneously lie on two different lines in the 1D subspace, where each subspace has a different mean.

This means that the existing LF-PPCA model would fail if the simple assumption that all subspaces have mean zero is not satisfied. However, one potential way to model the data in Figure 21 is to use the MPPCA model with four PPCA mixtures; this is because we can observe three 'clusters' of data in the 2D plot, and one of these 'clusters' (the blue data points) may require more than one PPCA mixture for modelling. The first PPCA mixture will aim to project the red data points onto a 1D subspace, the second PPCA mixture will aim to project the green data points onto a 1D subspace, and the final two PPCA mixtures will aim to project the blue data points onto one of the two 1D subspaces (where each blue data point can only be projected onto a single 1D subspace). The first problem with the MPPCA approach is that the final two PPCA mixtures will fail to find the optimal solution for the blue data points, because the PPCA mixtures assume each data point can only be projected onto a single low-dimensional subspace, which in our case is not true; Figure 18 shows what happens when we use two PPCA mixtures on latent feature data. Another problem with the MPPCA approach is that we would need to assume that there are four PPCA mixtures within the data (Figure 21), and that the data follows a latent class structure, which is not true because the data follows a latent feature structure.

Therefore, we must extend the LF-PPCA model such that we can work with datasets which do not satisfy the assumption that each subspace has a mean of zero.


Appendices

A Properties of a factor analysis model

In this section we will derive the important properties of the probability distribution over y_n from equation (2). Let us assume we have the following model

\[
\mathbf{y}_n = \mathbf{W}\mathbf{x}_n + \boldsymbol{\mu} + \boldsymbol{\varepsilon}, \qquad
\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}) \tag{22}
\]

We can then find the mean of y_n as

\[
\mathbb{E}[\mathbf{y}_n] = \mathbb{E}[\mathbf{W}\mathbf{x}_n + \boldsymbol{\mu} + \boldsymbol{\varepsilon}]
= \boldsymbol{\mu} + \mathbf{W}\,\mathbb{E}[\mathbf{x}_n] + \mathbb{E}[\boldsymbol{\varepsilon}]
= \boldsymbol{\mu} \tag{23}
\]

We can also find the covariance of y_n as

\[
\begin{aligned}
\mathbb{E}\!\left[(\mathbf{y}_n - \mathbb{E}[\mathbf{y}_n])(\mathbf{y}_n - \mathbb{E}[\mathbf{y}_n])^T\right]
&= \mathbb{E}\!\left[(\mathbf{W}\mathbf{x}_n + \boldsymbol{\mu} + \boldsymbol{\varepsilon} - \boldsymbol{\mu})(\mathbf{W}\mathbf{x}_n + \boldsymbol{\mu} + \boldsymbol{\varepsilon} - \boldsymbol{\mu})^T\right] \\
&= \mathbb{E}\!\left[\mathbf{W}\mathbf{x}_n\mathbf{x}_n^T\mathbf{W}^T + \boldsymbol{\varepsilon}\mathbf{x}_n^T\mathbf{W}^T + \mathbf{W}\mathbf{x}_n\boldsymbol{\varepsilon}^T + \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\right] \\
&= \mathbf{W}\,\mathbb{E}[\mathbf{x}_n\mathbf{x}_n^T]\,\mathbf{W}^T + \mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] \\
&= \mathbf{W}\mathbf{W}^T + \boldsymbol{\Psi}
\end{aligned} \tag{24}
\]

where the cross terms vanish because x_n and ε are independent with zero mean.
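A quick numerical check of equations (23) and (24) can be made by sampling from the model; the following MATLAB sketch uses arbitrary small values of D, Q, W, μ and Ψ purely for illustration.

% Minimal sketch: Monte Carlo check that E[y] = mu and Cov[y] = W*W' + Psi,
% using arbitrary illustrative values for the model parameters.
D = 3; Q = 2; Nsamp = 1e5;
W   = randn(D, Q);
mu  = [1; -2; 0.5];
Psi = diag([0.1; 0.2; 0.3]);
X   = randn(Q, Nsamp);                        % x_n ~ N(0, I)
E   = chol(Psi, 'lower') * randn(D, Nsamp);   % eps ~ N(0, Psi)
Y   = W * X + mu + E;
disp(norm(mean(Y, 2) - mu))                   % close to zero
disp(norm(cov(Y') - (W * W' + Psi)))          % close to zero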


B Posterior of latent variable

We need to use Bayes' rule to derive the posterior distribution described in equation (7). To do this, let us first look at a 'dummy example', and then we will compare our results with the dummy example to evaluate the posterior distribution.

B.1 Dummy example

Let us assume we have a random variable A which is a D-dimensional vector drawn from a multivariate Gaussian distribution centred at λ with covariance matrix Σ. Then we can express the probability distribution over A as

\[
P(\mathbf{A} \mid \boldsymbol{\lambda}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2}(\mathbf{A} - \boldsymbol{\lambda})^T \boldsymbol{\Sigma}^{-1} (\mathbf{A} - \boldsymbol{\lambda}) \right\} \tag{25}
\]

The exponent within the exponential term can be expanded as

\[
-\frac{1}{2}(\mathbf{A} - \boldsymbol{\lambda})^T \boldsymbol{\Sigma}^{-1} (\mathbf{A} - \boldsymbol{\lambda})
= -\frac{1}{2}\left( \mathbf{A}^T\boldsymbol{\Sigma}^{-1}\mathbf{A} - \mathbf{A}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda} - \boldsymbol{\lambda}^T\boldsymbol{\Sigma}^{-1}\mathbf{A} + \boldsymbol{\lambda}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda} \right) \tag{26}
\]

We will use this to derive the posterior of the latent variables in the next section.

B.2 Posterior

We can express the posterior of x_n using Bayes' rule (posterior ∝ prior × likelihood), which can be written as

\[
\begin{aligned}
P(\mathbf{x}_n \mid \mathbf{y}_n) &\propto \frac{1}{(2\pi)^{Q/2}} \exp\left\{ -\frac{1}{2}\mathbf{x}_n^T\mathbf{x}_n \right\}
\frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma^2}(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu}) \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\mathbf{x}_n^T\mathbf{x}_n - \frac{1}{2\sigma^2}(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu}) \right\}
\end{aligned} \tag{27}
\]

We can now simplify the terms in the exponent of the exponential:

\[
\begin{aligned}
&-\frac{1}{2\sigma^2}\left( \sigma^2\mathbf{x}_n^T\mathbf{x}_n + (\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu}) \right) \\
&= -\frac{1}{2\sigma^2}\left( \sigma^2\mathbf{x}_n^T\mathbf{x}_n + \left((\mathbf{y}_n - \boldsymbol{\mu})^T - \mathbf{x}_n^T\mathbf{W}^T\right)\left((\mathbf{y}_n - \boldsymbol{\mu}) - \mathbf{W}\mathbf{x}_n\right) \right) \\
&= -\frac{1}{2\sigma^2}\left( (\mathbf{y}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \boldsymbol{\mu}) - (\mathbf{y}_n - \boldsymbol{\mu})^T\mathbf{W}\mathbf{x}_n - \mathbf{x}_n^T\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu}) + \mathbf{x}_n^T(\mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I})\mathbf{x}_n \right)
\end{aligned} \tag{28}
\]

We will now compare the results from equation (26) with equation (28) to derive the posterior of x_n. Let us observe the following similarities between the two:

\[
\mathbf{A}^T\boldsymbol{\Sigma}^{-1}\mathbf{A} = \mathbf{x}_n^T(\mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I})\frac{1}{\sigma^2}\mathbf{x}_n \tag{29}
\]

Using this, we can say that the covariance of the posterior for x_n is (W^T W + σ²I)^{-1}σ². Now let us observe the next similarity,

\[
\boldsymbol{\lambda}^T\boldsymbol{\Sigma}^{-1}\mathbf{A} = (\mathbf{y}_n - \boldsymbol{\mu})^T\mathbf{W}\mathbf{x}_n\frac{1}{\sigma^2} \tag{30}
\]

which simplifies to

\[
\boldsymbol{\lambda}^T = (\mathbf{y}_n - \boldsymbol{\mu})^T\mathbf{W}(\mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I})^{-1} \tag{31}
\]

meaning

\[
\boldsymbol{\lambda} = (\mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I})^{-1}\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu}) \tag{32}
\]

Using this, we can say that the mean of the posterior for x_n is (W^T W + σ²I)^{-1}W^T(y_n − μ). We can therefore express the posterior distribution over x_n as

\[
P(\mathbf{x}_n \mid \mathbf{y}_n) = \frac{\left|\sigma^{-2}\mathbf{M}\right|^{1/2}}{(2\pi)^{Q/2}} \exp\left\{ -\frac{1}{2}\left(\mathbf{x}_n - \mathbf{M}^{-1}\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu})\right)^T (\sigma^{-2}\mathbf{M}) \left(\mathbf{x}_n - \mathbf{M}^{-1}\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu})\right) \right\} \tag{33}
\]

where M = (W^T W + σ²I).
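The posterior in equation (33) is straightforward to evaluate numerically; the following MATLAB sketch computes its mean and covariance for a given W, μ, σ and observation y_n (all assumed to be in scope).

% Minimal sketch: mean and covariance of the posterior p(x_n | y_n) in
% equation (33), assuming W (D x Q), mu (D x 1), sigma and yn (D x 1) are given.
Q = size(W, 2);
M = W' * W + sigma^2 * eye(Q);        % M = W'W + sigma^2 I
post_mean = M \ (W' * (yn - mu));     % M^{-1} W' (y_n - mu)
post_cov  = sigma^2 * inv(M);         % sigma^2 M^{-1}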


C EM algorithm for PPCA

This appendix will go over the expectation-maximisation (EM) algorithm used to do inference with the PPCA model; we will continue to use the same notation as section 3.2. The complete-data log-likelihood is given by

\[
\mathcal{L}_c = \sum_{n=1}^{N} \ln\{P(\mathbf{y}_n, \mathbf{x}_n)\} \tag{34}
\]

which in this case would be the product of equations (4) and (5), that is

\[
P(\mathbf{x}_n, \mathbf{y}_n) = \frac{1}{(2\pi)^{Q/2}} \exp\left\{ -\frac{1}{2}\mathbf{x}_n^T\mathbf{x}_n \right\}
\frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma^2}(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \mathbf{W}\mathbf{x}_n - \boldsymbol{\mu}) \right\} \tag{35}
\]

E-step

In the E-step of the algorithm we take the expectation of equation (35) with respect to the distribution P(x_n | y_n) (equation (7)):

\[
\langle\mathcal{L}_c\rangle = -\sum_{n=1}^{N}\left( \frac{D}{2}\ln\sigma^2 + \frac{1}{2}\mathrm{tr}\!\left(\langle\mathbf{x}_n\mathbf{x}_n^T\rangle\right) + \frac{1}{2\sigma^2}(\mathbf{y}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \boldsymbol{\mu}) - \frac{1}{\sigma^2}\langle\mathbf{x}_n\rangle^T\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu}) + \frac{1}{2\sigma^2}\mathrm{tr}\!\left(\mathbf{W}^T\mathbf{W}\langle\mathbf{x}_n\mathbf{x}_n^T\rangle\right) \right) \tag{36}
\]

where the terms

\[
\langle\mathbf{x}_n\rangle = \mathbf{M}^{-1}\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu}), \qquad
\langle\mathbf{x}_n\mathbf{x}_n^T\rangle = \sigma^2\mathbf{M}^{-1} + \langle\mathbf{x}_n\rangle\langle\mathbf{x}_n\rangle^T \tag{37}
\]

and M = (σ²I + W^T W).

M-step

In the M-step we need to maximise ⟨L_c⟩ from equation (36) with respect to W and σ². To do this, we must differentiate ⟨L_c⟩ with respect to each parameter and then set the result to zero, which gives

\[
\mathbf{W} = \left[\sum_{n=1}^{N}(\mathbf{y}_n - \boldsymbol{\mu})\langle\mathbf{x}_n\rangle^T\right]\left[\sum_{n=1}^{N}\langle\mathbf{x}_n\mathbf{x}_n^T\rangle\right]^{-1} \tag{38}
\]

and

\[
\sigma^2 = \frac{1}{ND}\sum_{n=1}^{N}\left\{ (\mathbf{y}_n - \boldsymbol{\mu})^T(\mathbf{y}_n - \boldsymbol{\mu}) - 2\langle\mathbf{x}_n\rangle^T\mathbf{W}^T(\mathbf{y}_n - \boldsymbol{\mu}) + \mathrm{tr}\!\left(\langle\mathbf{x}_n\mathbf{x}_n^T\rangle\mathbf{W}^T\mathbf{W}\right) \right\} \tag{39}
\]

We first initialise the parameters W and σ², then calculate the log-likelihood using equation (36), and then update the parameters W and σ² using equations (38) and (39). We continue to iterate between equations (36) and (39) until we believe the model has converged. See Appendix A.5 in [2] for more information on the inference of the PPCA model using the EM algorithm.
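The E- and M-steps above translate directly into a short MATLAB loop; the following is a minimal sketch, assuming the data are stored column-wise in Y and that the latent dimension Q and the number of iterations NumIter are chosen by the user (the log-likelihood monitoring used to assess convergence is omitted).

% Minimal sketch of the EM algorithm for PPCA (equations (36)-(39)),
% assuming Y is D x N, Q is the latent dimension and NumIter is given.
[D, N] = size(Y);
mu = mean(Y, 2);
Yc = Y - mu;                                 % centred data
W  = randn(D, Q);  sigma2 = 1;               % initialise the parameters
for it = 1:NumIter
    % E-step: posterior moments of each x_n (equation (37))
    M   = sigma2 * eye(Q) + W' * W;
    Ex  = M \ (W' * Yc);                     % Q x N, columns are <x_n>
    Sxx = N * sigma2 * inv(M) + Ex * Ex';    % sum_n <x_n x_n'>
    % M-step: update W and sigma^2 (equations (38) and (39))
    Wnew   = (Yc * Ex') / Sxx;
    sigma2 = (sum(sum(Yc.^2)) - 2*sum(sum(Ex .* (Wnew' * Yc))) ...
              + trace(Sxx * (Wnew' * Wnew))) / (N * D);
    W = Wnew;
end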


D EM algorithm for MPPCA

This appendix will go over the expectation-maximisation algorithm used to do inference with mixtures of PPCA; we will continue to use the same notation as section 3.4, and we will follow a similar algorithm to Appendix C. Let R_{n,k} denote the posterior probability of data point y_n being generated from the kth PPCA mixture, which is given by

\[
R_{n,k} = \frac{P(\mathbf{y}_n \mid k)\,\pi_k}{P(\mathbf{y}_n)} \tag{40}
\]

where

\[
P(\mathbf{y}_n) = \sum_{k=1}^{K} P(\mathbf{y}_n \mid k)\,\pi_k \tag{41}
\]

and where P(y_n | k) is the probability of data point y_n being generated from the kth PPCA mixture, and π_k denotes the proportion of the data points which are modelled by the kth PPCA. The complete-data log-likelihood can then be expressed as

\[
\mathcal{L}_c = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{n,k} \ln\{\pi_k P(\mathbf{y}_n, \mathbf{x}_{n,k})\} \tag{42}
\]

where z_{n,k} is an indicator variable which equals one if data point y_n was generated using the kth PPCA mixture, and zero otherwise. Note that for this model each data point y_n now has K different latent variables; this can be seen in equation (42), where instead of a single x_n term we now have an x_{n,k} term, one for each of the K PPCA models.

E-step

In the E-step of the algorithm we take the expectation of equation (42) with respect to the distribution P(x_{n,k} | y_n) (which is similar to equation (7)):

\[
\langle\mathcal{L}_c\rangle = \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}\Bigg( \ln\pi_k - \frac{D}{2}\ln\sigma_k^2 - \frac{1}{2}\mathrm{tr}\!\left(\langle\mathbf{x}_{n,k}\mathbf{x}_{n,k}^T\rangle\right) - \frac{1}{2\sigma_k^2}(\mathbf{y}_n - \boldsymbol{\mu}_k)^T(\mathbf{y}_n - \boldsymbol{\mu}_k) + \frac{1}{\sigma_k^2}\langle\mathbf{x}_{n,k}\rangle^T\mathbf{W}_k^T(\mathbf{y}_n - \boldsymbol{\mu}_k) - \frac{1}{2\sigma_k^2}\mathrm{tr}\!\left(\mathbf{W}_k^T\mathbf{W}_k\langle\mathbf{x}_{n,k}\mathbf{x}_{n,k}^T\rangle\right) \Bigg) \tag{43}
\]

where the terms

\[
\langle\mathbf{x}_{n,k}\rangle = \mathbf{M}_k^{-1}\mathbf{W}_k^T(\mathbf{y}_n - \boldsymbol{\mu}_k), \qquad
\langle\mathbf{x}_{n,k}\mathbf{x}_{n,k}^T\rangle = \sigma_k^2\mathbf{M}_k^{-1} + \langle\mathbf{x}_{n,k}\rangle\langle\mathbf{x}_{n,k}\rangle^T \tag{44}
\]

and M_k = (σ_k²I + W_k^T W_k).

M-step

In the M-step we need to maximise ⟨L_c⟩ from equation (43) with respect to π_k, μ_k, W_k and σ_k². In order to do this, we must differentiate ⟨L_c⟩ with respect to each parameter and then set the result to zero. However, the maximisation with respect to π_k needs to take account of the constraint Σ_k π_k = 1. This can be done using a Lagrange multiplier λ (see [2]), so that we maximise

\[
\langle\mathcal{L}_c\rangle + \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right) \tag{45}
\]

The results of the M-step can be seen below:

\[
\pi_k = \frac{1}{N}\sum_{n} R_{n,k} \tag{46}
\]

\[
\boldsymbol{\mu}_k = \frac{\sum_{n} R_{n,k}\left(\mathbf{y}_n - \mathbf{W}_k\langle\mathbf{x}_{n,k}\rangle\right)}{\sum_{n} R_{n,k}} \tag{47}
\]

\[
\mathbf{W}_k = \left[\sum_{n} R_{n,k}(\mathbf{y}_n - \boldsymbol{\mu}_k)\langle\mathbf{x}_{n,k}\rangle^T\right]\left[\sum_{n} R_{n,k}\langle\mathbf{x}_{n,k}\mathbf{x}_{n,k}^T\rangle\right]^{-1} \tag{48}
\]

\[
\sigma_k^2 = \frac{1}{D\sum_{n} R_{n,k}}\left( \sum_{n=1}^{N} R_{n,k}(\mathbf{y}_n - \boldsymbol{\mu}_k)^T(\mathbf{y}_n - \boldsymbol{\mu}_k) - 2\sum_{n} R_{n,k}\langle\mathbf{x}_{n,k}\rangle^T\mathbf{W}_k^T(\mathbf{y}_n - \boldsymbol{\mu}_k) + \sum_{n} R_{n,k}\,\mathrm{tr}\!\left(\langle\mathbf{x}_{n,k}\mathbf{x}_{n,k}^T\rangle\mathbf{W}_k^T\mathbf{W}_k\right) \right) \tag{49}
\]

We first randomly generate a matrix Z, where z_{n,k} = 1 means data point n was generated using the kth PPCA mixture; we then generate the matrix R using equation (40), initialise W_k and σ_k², calculate the log-likelihood using equation (43), and then update the parameters π_k, μ_k, W_k and σ_k² using equations (46), (47), (48) and (49) respectively. We continue to iterate between equations (43) and (49) (except equation (45)) until we believe the model has converged. See Appendix C in [2] for more information on the inference of the MPPCA model using the EM algorithm.
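Since the responsibilities R_{n,k} in equation (40) drive the whole M-step, the following is a minimal MATLAB sketch of their computation; it assumes the current parameter values are stored as pik (1 x K), muk (D x K), a cell array Wk and sigma2k (1 x K), and it uses the marginal y_n ~ N(μ_k, W_k W_k^T + σ_k² I) implied by the PPCA model (see Appendix A with Ψ = σ²I).

% Minimal sketch: responsibilities R(n,k) from equations (40)-(41), assuming
% Y (D x N), mixing weights pik, means muk, loading matrices Wk{k} and noise
% variances sigma2k are given.
[D, N] = size(Y);
K = numel(pik);
logR = zeros(N, K);
for k = 1:K
    Ck = Wk{k} * Wk{k}' + sigma2k(k) * eye(D);       % marginal covariance of mixture k
    Yc = Y - muk(:, k);
    logR(:, k) = log(pik(k)) - 0.5*D*log(2*pi) - 0.5*log(det(Ck)) ...
                 - 0.5 * sum(Yc .* (Ck \ Yc), 1)';
end
logR = logR - max(logR, [], 2);                      % stabilise before exponentiating
R    = exp(logR);
R    = R ./ sum(R, 2);                               % normalise over the K mixtures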


E Linear Gaussian latent feature model

This appendix will derive the probability distribution P(X | Z, σ_X, σ_A) from equation (11). Before we can do this, we must first evaluate P(X | A, Z, σ_X) P(A | σ_A), which can be written as

\[
\begin{aligned}
P(\mathbf{X} \mid \mathbf{A}, \mathbf{Z}, \sigma_X)\,P(\mathbf{A} \mid \sigma_A)
&= \frac{1}{(2\pi\sigma_X^2)^{ND/2}} \exp\left\{ -\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left((\mathbf{X} - \mathbf{Z}\mathbf{A})^T(\mathbf{X} - \mathbf{Z}\mathbf{A})\right) \right\}
\times \frac{1}{(2\pi\sigma_A^2)^{KD/2}} \exp\left\{ -\frac{1}{2\sigma_A^2}\mathrm{tr}\!\left(\mathbf{A}^T\mathbf{A}\right) \right\} \\
&= \frac{1}{(2\pi\sigma_X^2)^{ND/2}}\frac{1}{(2\pi\sigma_A^2)^{KD/2}}
\exp\left\{ -\left( \frac{1}{2\sigma_X^2}\mathrm{tr}\!\left((\mathbf{X} - \mathbf{Z}\mathbf{A})^T(\mathbf{X} - \mathbf{Z}\mathbf{A})\right) + \frac{1}{2\sigma_A^2}\mathrm{tr}\!\left(\mathbf{A}^T\mathbf{A}\right) \right) \right\}
\end{aligned} \tag{50}
\]

Let us now focus on the exponent of the exponential term in equation (50):

\[
\begin{aligned}
&\frac{1}{\sigma_X^2}(\mathbf{X} - \mathbf{Z}\mathbf{A})^T(\mathbf{X} - \mathbf{Z}\mathbf{A}) + \frac{1}{\sigma_A^2}\mathbf{A}^T\mathbf{A} \\
&= \frac{1}{\sigma_X^2}\left(-\mathbf{A}^T\mathbf{Z}^T + \mathbf{X}^T\right)(\mathbf{X} - \mathbf{Z}\mathbf{A}) + \frac{1}{\sigma_A^2}\mathbf{A}^T\mathbf{A} \\
&= \frac{1}{\sigma_X^2}\left(-\mathbf{A}^T\mathbf{Z}^T\mathbf{X} + \mathbf{X}^T\mathbf{X} + \mathbf{A}^T\mathbf{Z}^T\mathbf{Z}\mathbf{A} - \mathbf{X}^T\mathbf{Z}\mathbf{A}\right) + \frac{1}{\sigma_A^2}\mathbf{A}^T\mathbf{A} \\
&= \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{X} - \frac{1}{\sigma_X^2}\mathbf{A}^T\mathbf{Z}^T\mathbf{X} - \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{Z}\mathbf{A} + \mathbf{A}^T\left(\frac{1}{\sigma_X^2}\mathbf{Z}^T\mathbf{Z} + \frac{1}{\sigma_A^2}\mathbf{I}\right)\mathbf{A} \\
&= \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\sigma_X^2}\left(-\mathbf{A}^T\mathbf{Z}^T\mathbf{X} + \mathbf{A}^T\left(\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}\right)\mathbf{A} - \mathbf{X}^T\mathbf{Z}\mathbf{A}\right)
\end{aligned} \tag{51}
\]

If we set M = (Z^T Z + (σ_X²/σ_A²) I)^{-1}, then equation (51) can be further simplified by completing the square in A:

\[
\begin{aligned}
&\frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\sigma_X^2}\left(-\mathbf{A}^T\mathbf{Z}^T\mathbf{X} + \mathbf{A}^T\mathbf{M}^{-1}\mathbf{A} - \mathbf{X}^T\mathbf{Z}\mathbf{A}\right) \\
&= \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{X} - \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{Z}\mathbf{M}\mathbf{Z}^T\mathbf{X}
+ \frac{1}{\sigma_X^2}\left(-\mathbf{A}^T + \mathbf{X}^T\mathbf{Z}\mathbf{M}^T\right)\mathbf{M}^{-1}\left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right) \\
&= \frac{1}{\sigma_X^2}\mathbf{X}^T\left(\mathbf{I} - \mathbf{Z}\mathbf{M}\mathbf{Z}^T\right)\mathbf{X}
+ \left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right)^T\left(\sigma_X^2\mathbf{M}\right)^{-1}\left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right)
\end{aligned} \tag{52}
\]

Now we can simply integrate P(X | A, Z, σ_X) P(A | σ_A) with respect to A to obtain P(X | Z, σ_X, σ_A):

\[
\begin{aligned}
P(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A)
&= \int_{-\infty}^{\infty} P(\mathbf{X} \mid \mathbf{Z}, \mathbf{A}, \sigma_X)\,P(\mathbf{A} \mid \sigma_A)\,d\mathbf{A} \\
&= \frac{1}{(2\pi)^{(N+K)D/2}\sigma_X^{ND}\sigma_A^{KD}}
\exp\left\{ -\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(\mathbf{X}^T\left(\mathbf{I} - \mathbf{Z}\mathbf{M}\mathbf{Z}^T\right)\mathbf{X}\right) \right\} \\
&\quad\times \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\mathrm{tr}\!\left(\left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right)^T\left(\sigma_X^2\mathbf{M}\right)^{-1}\left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right)\right) \right\} d\mathbf{A}
\end{aligned} \tag{53}
\]

Let Y = (M Z^T X − A); this means dY = −dA, and we can then re-write the integral from equation (53) as

\[
\begin{aligned}
&\int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\mathrm{tr}\!\left(\left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right)^T\left(\sigma_X^2\mathbf{M}\right)^{-1}\left(\mathbf{M}\mathbf{Z}^T\mathbf{X} - \mathbf{A}\right)\right) \right\} d\mathbf{A} \\
&= \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\mathrm{tr}\!\left(\mathbf{Y}^T\left(\sigma_X^2\mathbf{M}\right)^{-1}\mathbf{Y}\right) \right\} d\mathbf{Y}
= (2\pi)^{KD/2}\left|\sigma_X^2\mathbf{M}\right|^{D/2}
\end{aligned} \tag{54}
\]

Note that the result in equation (54) is derived using the Gaussian integral over R^{K×D}; see [14] for more details. Using the result from (54), we can re-write equation (53) as

\[
\begin{aligned}
P(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A)
&= \frac{(2\pi)^{KD/2}\left|\sigma_X^2\mathbf{M}\right|^{D/2}}{(2\pi)^{(N+K)D/2}\sigma_X^{ND}\sigma_A^{KD}}
\exp\left\{ -\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(\mathbf{X}^T\left(\mathbf{I} - \mathbf{Z}\mathbf{M}\mathbf{Z}^T\right)\mathbf{X}\right) \right\} \\
&= \frac{\sigma_X^{KD}\left|\mathbf{M}\right|^{D/2}}{(2\pi)^{ND/2}\sigma_X^{ND}\sigma_A^{KD}}
\exp\left\{ -\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(\mathbf{X}^T\left(\mathbf{I} - \mathbf{Z}\mathbf{M}\mathbf{Z}^T\right)\mathbf{X}\right) \right\} \\
&= \frac{\exp\left\{ -\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(\mathbf{X}^T\left(\mathbf{I} - \mathbf{Z}\mathbf{M}\mathbf{Z}^T\right)\mathbf{X}\right) \right\}}
{(2\pi)^{ND/2}\sigma_X^{(N-K)D}\sigma_A^{KD}\left|\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}\right|^{D/2}}
\end{aligned} \tag{55}
\]

Note that if M is a K × K square matrix and α is some scalar, then |αM| = α^K |M| [13]; this property was used to simplify equation (55).
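Equation (55) can be evaluated directly in log form; the following MATLAB sketch does so for a given X, Z, sigma_X and sigma_A, and this is the quantity needed in the Metropolis-Hastings ratio of Appendix G.1 (the function name logPXZ is only an illustrative choice).

function lp = logPXZ(X, Z, sigmaX, sigmaA)
% Minimal sketch: log of the collapsed likelihood P(X | Z, sigma_X, sigma_A)
% from equation (55).  X is N x D and Z is N x K.
[N, D] = size(X);
K = size(Z, 2);
M  = inv(Z' * Z + (sigmaX^2 / sigmaA^2) * eye(K));
lp = -0.5/sigmaX^2 * trace(X' * (eye(N) - Z * M * Z') * X) ...
     - 0.5*N*D*log(2*pi) - (N - K)*D*log(sigmaX) - K*D*log(sigmaA) ...
     + 0.5*D*log(det(M));
end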


F Latent feature PPCA

This appendix will derive the probability distribution P(y_n | W, z_n, σ_x, σ_y) from section 5. Before we can do this, we must first evaluate P(y_n | W, x_n, z_n, σ_y) P(x_n | σ_x), which can be written as

\[
\begin{aligned}
P(\mathbf{y}_n \mid \mathbf{W}, \mathbf{x}_n, \mathbf{z}_n, \sigma_y)\,P(\mathbf{x}_n \mid \sigma_x)
&= \frac{1}{(2\pi\sigma_y^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma_y^2}\left(\mathbf{y}_n - (\mathbf{W} \odot \mathbf{z}_n^T)\mathbf{x}_n\right)^T\left(\mathbf{y}_n - (\mathbf{W} \odot \mathbf{z}_n^T)\mathbf{x}_n\right) \right\} \\
&\quad\times \frac{1}{(2\pi\sigma_x^2)^{D/2}} \exp\left\{ -\frac{1}{2\sigma_x^2}\mathbf{x}_n^T\mathbf{x}_n \right\} \\
&= \frac{1}{(2\pi\sigma_y^2)^{D/2}}\frac{1}{(2\pi\sigma_x^2)^{D/2}}
\exp\left\{ -\frac{1}{2\sigma_y^2}\left(\mathbf{y}_n - (\mathbf{W} \odot \mathbf{z}_n^T)\mathbf{x}_n\right)^T\left(\mathbf{y}_n - (\mathbf{W} \odot \mathbf{z}_n^T)\mathbf{x}_n\right) - \frac{1}{2\sigma_x^2}\mathbf{x}_n^T\mathbf{x}_n \right\}
\end{aligned} \tag{56}
\]

Let us now focus on the exponent of the exponential term in equation (56):

\[
\frac{1}{\sigma_y^2}\left(\mathbf{y}_n - (\mathbf{W} \odot \mathbf{z}_n^T)\mathbf{x}_n\right)^T\left(\mathbf{y}_n - (\mathbf{W} \odot \mathbf{z}_n^T)\mathbf{x}_n\right) + \frac{1}{\sigma_x^2}\mathbf{x}_n^T\mathbf{x}_n \tag{57}
\]

Let K_n = W ⊙ z_n^T; then equation (57) becomes

\[
\begin{aligned}
&\frac{1}{\sigma_y^2}\left(\mathbf{y}_n^T - \mathbf{x}_n^T\mathbf{K}_n^T\right)\left(\mathbf{y}_n - \mathbf{K}_n\mathbf{x}_n\right) + \frac{1}{\sigma_x^2}\mathbf{x}_n^T\mathbf{x}_n \\
&= \frac{1}{\sigma_y^2}\mathbf{y}_n^T\mathbf{y}_n - \frac{1}{\sigma_y^2}\mathbf{x}_n^T\mathbf{K}_n^T\mathbf{y}_n - \frac{1}{\sigma_y^2}\mathbf{y}_n^T\mathbf{K}_n\mathbf{x}_n + \mathbf{x}_n^T\left(\frac{1}{\sigma_y^2}\mathbf{K}_n^T\mathbf{K}_n + \frac{1}{\sigma_x^2}\mathbf{I}\right)\mathbf{x}_n \\
&= \frac{1}{\sigma_y^2}\mathbf{y}_n^T\mathbf{y}_n + \frac{1}{\sigma_y^2}\left(-\mathbf{x}_n^T\mathbf{K}_n^T\mathbf{y}_n + \mathbf{x}_n^T\left(\mathbf{K}_n^T\mathbf{K}_n + \frac{\sigma_y^2}{\sigma_x^2}\mathbf{I}\right)\mathbf{x}_n - \mathbf{y}_n^T\mathbf{K}_n\mathbf{x}_n\right)
\end{aligned} \tag{58}
\]

If we set M_n = (K_n^T K_n + (σ_y²/σ_x²) I)^{-1}, then equation (58) can be further simplified by completing the square in x_n:

\[
\begin{aligned}
&\frac{1}{\sigma_y^2}\mathbf{y}_n^T\mathbf{y}_n + \frac{1}{\sigma_y^2}\left(-\mathbf{x}_n^T\mathbf{K}_n^T\mathbf{y}_n + \mathbf{x}_n^T\mathbf{M}_n^{-1}\mathbf{x}_n - \mathbf{y}_n^T\mathbf{K}_n\mathbf{x}_n\right) \\
&= \frac{1}{\sigma_y^2}\mathbf{y}_n^T\mathbf{y}_n - \frac{1}{\sigma_y^2}\mathbf{y}_n^T\mathbf{K}_n\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n
+ \frac{1}{\sigma_y^2}\left(-\mathbf{x}_n^T + \mathbf{y}_n^T\mathbf{K}_n\mathbf{M}_n^T\right)\mathbf{M}_n^{-1}\left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right) \\
&= \frac{1}{\sigma_y^2}\mathbf{y}_n^T\left(\mathbf{I} - \mathbf{K}_n\mathbf{M}_n\mathbf{K}_n^T\right)\mathbf{y}_n
+ \left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right)^T\left(\sigma_y^2\mathbf{M}_n\right)^{-1}\left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right)
\end{aligned} \tag{59}
\]

Now we can simply integrate P(y_n | W, x_n, z_n, σ_y) P(x_n | σ_x) with respect to x_n to obtain P(y_n | W, z_n, σ_x, σ_y):

\[
\begin{aligned}
P(\mathbf{y}_n \mid \mathbf{W}, \mathbf{z}_n, \sigma_x, \sigma_y)
&= \int_{-\infty}^{\infty} P(\mathbf{y}_n \mid \mathbf{W}, \mathbf{x}_n, \mathbf{z}_n, \sigma_y)\,P(\mathbf{x}_n \mid \sigma_x)\,d\mathbf{x}_n \\
&= \frac{1}{(2\pi\sigma_y^2)^{D/2}(2\pi\sigma_x^2)^{D/2}}
\exp\left\{ -\frac{1}{2\sigma_y^2}\mathbf{y}_n^T\left(\mathbf{I} - \mathbf{K}_n\mathbf{M}_n\mathbf{K}_n^T\right)\mathbf{y}_n \right\} \\
&\quad\times \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right)^T\left(\sigma_y^2\mathbf{M}_n\right)^{-1}\left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right) \right\} d\mathbf{x}_n
\end{aligned} \tag{60}
\]

Let s_n = (M_n K_n^T y_n − x_n); this means ds_n = −dx_n, and we can then re-write the integral from equation (60) as

\[
\begin{aligned}
&\int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right)^T\left(\sigma_y^2\mathbf{M}_n\right)^{-1}\left(\mathbf{M}_n\mathbf{K}_n^T\mathbf{y}_n - \mathbf{x}_n\right) \right\} d\mathbf{x}_n \\
&= \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\mathbf{s}_n^T\left(\sigma_y^2\mathbf{M}_n\right)^{-1}\mathbf{s}_n \right\} d\mathbf{s}_n
= (2\pi)^{D/2}\left|\sigma_y^2\mathbf{M}_n\right|^{1/2}
\end{aligned} \tag{61}
\]

Note that the result in equation (61) is derived using the Gaussian integral over R^D; see [14] for more details. Using the result from (61), we can re-write equation (60) as

\[
\begin{aligned}
P(\mathbf{y}_n \mid \mathbf{W}, \mathbf{z}_n, \sigma_x, \sigma_y)
&= \frac{(2\pi)^{D/2}\left|\sigma_y^2\mathbf{M}_n\right|^{1/2}}{(2\pi\sigma_y^2)^{D/2}(2\pi\sigma_x^2)^{D/2}}
\exp\left\{ -\frac{1}{2\sigma_y^2}\mathbf{y}_n^T\left(\mathbf{I} - \mathbf{K}_n\mathbf{M}_n\mathbf{K}_n^T\right)\mathbf{y}_n \right\} \\
&= \frac{\sigma_y^D\left|\mathbf{M}_n\right|^{1/2}}{(2\pi)^{D/2}\sigma_y^D\sigma_x^D}
\exp\left\{ -\frac{1}{2\sigma_y^2}\mathbf{y}_n^T\left(\mathbf{I} - \mathbf{K}_n\mathbf{M}_n\mathbf{K}_n^T\right)\mathbf{y}_n \right\} \\
&= \frac{\left|\mathbf{M}_n\right|^{1/2}}{(2\pi)^{D/2}\sigma_x^D}
\exp\left\{ -\frac{1}{2\sigma_y^2}\mathbf{y}_n^T\left(\mathbf{I} - \mathbf{K}_n\mathbf{M}_n\mathbf{K}_n^T\right)\mathbf{y}_n \right\}
\end{aligned} \tag{62}
\]

where M_n = (K_n^T K_n + (σ_y²/σ_x²) I)^{-1} and K_n = W ⊙ z_n^T.


G Inference algorithms

This Appendix will cover all the inference algorithms.

G.1 Linear Gaussian latent feature model

The algorithm used to run the linear Gaussian latent feature model can be seen in Algorithm 1, where the parameters of the model α, σ_X, σ_A are collectively referred to as Θ.

Algorithm 1: MCMC for linear Gaussian latent feature model

Input: X, Θ, NumIter
1  Initialise: Set K+ = 1
2  for j = 1 to NumIter do
3      for n = 1 to N do
4          for k = 1 to K+ do
5              z_{n,k} ~ z_{n,k} | z_{-n,k}, x_n, Θ
6          end
7          Propose adding a new class
8          Accept or reject the proposal
9          if proposal accepted then
10             Set K+ = K+ + 1
11         end
12     end
13     Re-sample Θ | X, Z using an MH proposal
14 end
Output: Z, Θ

The matrix Z is updated using equation (12), and each parameter of the model is updated separately using Metropolis-Hastings updates; an example for σ_X can be seen below (note that the same method is also used to update both α and σ_A).

The Metropolis-Hastings acceptance ratio for each update of the parameter σ_X is defined as

\[
\text{accept} = \frac{P(\mathbf{X} \mid \mathbf{Z}, \sigma_X^{\text{proposed}}, \sigma_A, \alpha)\,\tau(\sigma_X \mid \sigma_X^{\text{proposed}}, \omega)}{P(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A, \alpha)\,\tau(\sigma_X^{\text{proposed}} \mid \sigma_X, \omega)} \tag{63}
\]

where σ_X^{proposed} is the new proposed value for σ_X, and τ(·) is simply a proposal distribution, which for simplicity is a Gaussian distribution centred at the existing value of σ_X with a variance parameter ω; see [15] for more details on the Metropolis-Hastings algorithm.
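As an illustration, one such Metropolis-Hastings update for σ_X could look as follows in MATLAB; it assumes a function logPXZ(·) evaluating the log of equation (55) (such as the sketch in Appendix E, which is not part of the original code) and a symmetric Gaussian random-walk proposal with step size ω, so the proposal terms τ(·) cancel in the acceptance ratio.

% Minimal sketch of one Metropolis-Hastings update for sigma_X (equation (63)),
% assuming X, Z, sigma_X, sigma_A are in scope and logPXZ(.) evaluates the log
% of equation (55).  A symmetric Gaussian random walk is used, so the proposal
% densities tau(.) cancel in the ratio.
omega = 0.001;                                   % random-walk step size
sigmaX_prop = sigma_X + randn * omega;
log_ratio = logPXZ(X, Z, sigmaX_prop, sigma_A) - logPXZ(X, Z, sigma_X, sigma_A);
if log(rand) < min(log_ratio, 0)
    sigma_X = sigmaX_prop;                       % accept the proposed value
end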


G.2 Latent feature PPCA model

The algorithm used to run the LF-PPCA model can be seen in Algorithm 2, where the parameters of the model α, σ_x, σ_y are collectively referred to as Θ.

Algorithm 2: MCMC for latent feature PPCA model

Input: Y, Θ, NumIter
1  Initialise: Set D+ = 1, compute W with the largest eigenvector
2  for j = 1 to NumIter do
3      for n = 1 to N do
4          for d = 1 to D+ do
5              z_{n,d} ~ z_{n,d} | z_{-n,d}, y_n, W, Θ
6          end
7          if D+ < D then
8              Propose adding a new class
9              Accept or reject the proposal
10             if proposal accepted then
11                 Compute W with the D+ + 1 largest eigenvectors
12                 Set D+ = D+ + 1
13             end
14         end
15     end
16     Re-sample σ_x | Y, Z, W using an MH proposal
17     Re-sample σ_y | Y, Z, W using an MH proposal
18     Re-sample α | Y, Z, W using an MH proposal
19 end
Output: Z, Θ

The matrix Z is updated using equation (20), and 'largest eigenvectors' refers to the eigenvectors with the largest associated eigenvalues of the sample covariance matrix S, which can be defined as

\[
\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{y}_n - \bar{\mathbf{y}})(\mathbf{y}_n - \bar{\mathbf{y}})^T \quad\text{such that}\quad \mathbf{S}\mathbf{w}_j = \lambda_j\mathbf{w}_j
\]

where ȳ is the sample mean. Each parameter of the model is updated separately using Metropolis-Hastings updates; an example for σ_y can be seen below (note that the same method is also used to update both α and σ_x). The Metropolis-Hastings acceptance ratio for each update of the parameter σ_y is defined as

\[
\text{accept} = \frac{P(\mathbf{Y} \mid \mathbf{Z}, \mathbf{W}, \sigma_y^{\text{proposed}}, \sigma_x, \alpha)\,\tau(\sigma_y \mid \sigma_y^{\text{proposed}}, \omega)}{P(\mathbf{Y} \mid \mathbf{Z}, \mathbf{W}, \sigma_y, \sigma_x, \alpha)\,\tau(\sigma_y^{\text{proposed}} \mid \sigma_y, \omega)} \tag{64}
\]

where σ_y^{proposed} is the new proposed value for σ_y, and τ(·) is simply a proposal distribution, which for simplicity is a Gaussian distribution centred at the existing value of σ_y with a variance parameter ω; see [15] for more details on the Metropolis-Hastings algorithm.
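A minimal MATLAB sketch of the 'largest eigenvectors' step in Algorithm 2 is shown below; it sorts the eigenvalues of S explicitly, which is one way of making sure the D+ retained columns of W really do correspond to the largest eigenvalues.

% Minimal sketch: build W from the Dplus eigenvectors of the sample covariance
% matrix S with the largest eigenvalues, assuming Y is D x N and Dplus <= D.
ybar = mean(Y, 2);
Yc   = Y - ybar;
S    = (Yc * Yc') / size(Y, 2);            % sample covariance matrix
[V, L] = eig(S);
[~, order] = sort(diag(L), 'descend');     % eigenvalues in decreasing order
W = V(:, order(1:Dplus));                  % D x Dplus projection matrix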


H MATLAB® code for the LF-PPCA model

This appendix will show the MATLAB® code used to do inference with the LF-PPCA model and obtain the results in section 5.4.3; the pseudo-code can be seen in Algorithm 2 of Appendix G.2. The code has one main script which uses three different functions; the main script can be seen in Appendix H.1, and the functions can be seen in Appendix H.2, Appendix H.3 and Appendix H.4.

H.1 Main script

This section will show the MATLAB® code used for the main script.

%Initialise Parameters

sigma_y=0.4;

sigma_x=1.0;

alpha=0.2;

score=[];

%Randomly initialise matrix Z

z=round(rand(1,N));

%Generate matrix W using PCA approach

[Wnew,v]=eig(cov(y'));

Wdata=Wnew*v;

for i=1:100

%Update matrix Z

z=UpdateZ(y,z,Wdata,sigma_y, sigma_x,alpha);

%Update number of subspaces defined

k=size(z,1);

%Update each parameter of the model using MH updates

new_sigma_x=sigma_x+randn*0.001;

proposed=logPYZ(y,z,Wdata(:,1:k),sigma_y,new_sigma_x,alpha);

current=logPYZ(y,z,Wdata(:,1:k),sigma_y,sigma_x,alpha);

acc_ratio=proposed-current;

if (log(rand)<min(acc_ratio,0))

sigma_x=new_sigma_x;

end

new_sigma_y=sigma_y+randn*0.001;


proposed=logPYZ(y,z,Wdata(:,1:k),new_sigma_y,sigma_x,alpha);

current=logPYZ(y,z,Wdata(:,1:k),sigma_y,sigma_x,alpha);

acc_ratio=proposed-current;

if (log(rand)<min(acc_ratio,0))

sigma_y=new_sigma_y;

end

new_alpha=alpha+randn*0.001;

proposed=logPYZ(y,z,Wdata(:,1:k),sigma_y,sigma_x,new_alpha);

current=logPYZ(y,z,Wdata(:,1:k),sigma_y,sigma_x,alpha);

acc_ratio=proposed-current;

if (log(rand)<min(acc_ratio,0))

alpha=new_alpha;

end

%Store log-likelihood

score(i)=logPYZ(y,z,Wdata(:,1:k),sigma_y,sigma_x,alpha);

%Display the iteration value and K value for each loop

disp(i)

disp(k)

end

H.2 UpdateZ(·) function

This section will show the MATLAB® code for the function 'UpdateZ(·)', which is used in the main script from Appendix H.1.

function [ Z ] = UpdateZ(Y,Z,W,sigma_y, sigma_x,alpha)

N=size(Y,2);

D=size(Y,1);

K=size(Z,1);

for i=1:N

%Existing dishes in IBP

for k=1:K

Z = Existing(Y, Z,W(:,1:K), sigma_y, sigma_x,N, i, k);

end

%Sample new dish in IBP

if K<D


Z = NewDish(Y, Z',W(:,1:K), sigma_y, sigma_x, alpha ,N, K, D, i);

%Z = Z(sum(Z)>0,:);

K = size(Z,1);

end

end

%Existing dishes IBP function

function Z = Existing(Y, Z,W, sigma_y, sigma_x,N, i, k)

pZik = zeros(2,1);

mk = sum(Z(k,[1:i-1 i+1:end]));

if (mk > 0)

old = Z(k,i);

for a = 1:2

if(a == 2)

choose = log(mk/N);

else

choose = log(1-(mk/N));

end

Z(k,i) = a-1;

PXZ=logPY(Y,Z,W,sigma_y, sigma_x);

pZik(a)=PXZ+choose;

end

Z(k,i) = old;

p = 1/(1+exp(pZik(2)-pZik(1))); %difference between yes and no

if(p > rand)

Z(k,i) = 0;

else

Z(k,i) = 1;

end

end

end

%Sample new dish in IBP

function Z = NewDish(Y, Z,W, sigma_y, sigma_x, alpha ,N, K, D, i)

m = sum(Z(setdiff(1:N,i),:));


Z(i,m==0) = 0;

Prob_newK = zeros(D-K,1);

for newK = 0:D-K

Prob_newK(newK+1) = logPY(Y,Z',W(:,1:K),sigma_y, sigma_x)+(-alpha/N + (K+newK)*log(alpha/N) - gammaln(K+newK+1));

end

maxProb_newK = max(Prob_newK);

pdf = exp(Prob_newK-maxProb_newK);

pdf = pdf/sum(pdf);

total = pdf(1);

newK = 0;

add=1;

while(total<rand)

add=add+1;

total=total+pdf(add);

newK = newK+1;

end

if(newK>0)

kplus = size(Z,2);

Znew = zeros(size(Z) +[0 newK]);

Znew(1:size(Z,1),1:kplus) = Z;

Z = Znew;

Z(i,kplus+1:end) = 1;

end

Z=Z';

end

end

H.3 logPYZ(·) function

This section will show the MATLAB® code for the function 'logPYZ(·)', which is used in the main script from Appendix H.1.

function [PYZ] = logPYZ(Y,Z,W,sigma_y,sigma_x,alpha)

N=size(Y,2);

D=size(Y,1);

PYZ=0;

K=D;


mk = sum(Z');

HN = sum(1./(1:N));

Zprior = K*log(alpha) -alpha*HN + sum(gammaln(mk)+gammaln(N-mk+1)-gammaln(N+1));

for n=1:N

K=W.*Z(:,n)';
M=inv(K'*K + ((sigma_y^2)/(sigma_x^2))*eye(size(K,2)));

y=Y(:,n);

top=D*log(sigma_x)+(-1/(2*sigma_y^2))*(y'*(eye(D)-K*M*K')*y)+(D/2)*log(det(M));

bottom=(D/2)*log(2*pi) + D*log(sigma_y);

PYZ=PYZ+top-bottom;

end

PYZ=PYZ+Zprior;

end

H.4 logPY(·) function

This section will show the MATLAB® code for the function 'logPY(·)', which is used in the 'UpdateZ(·)' function from Appendix H.2.

function [PYZ] = logPY(Y,Z,W,sigma_y,sigma_x)

N=size(Y,2);

D=size(Y,1);

k=size(Z,1);

PYZ=0;

for n=1:N

K=W.*Z(:,n)';
M=inv(K'*K + ((sigma_y^2)/(sigma_x^2))*eye(k));
y=Y(:,n);
top=(-1/(2*sigma_y^2))*(y'*(eye(D)-K*M*K')*y)+0.5*log(det(M));

bottom=(D/2)*log(2*pi) + D*log(sigma_x);

PYZ=PYZ+top-bottom;

end

end


References

[1] Farooq, A., 2017. A nonparametric prior for latent feature models. Mathematics Report, Aston University.

[2] Tipping, M.E. and Bishop, C.M., 1999. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2), pp. 443-482.

[3] Ghodsi, A., 2006. Dimensionality reduction: a short tutorial. Department of Statistics and Actuarial Science, Univ. of Waterloo, Ontario, Canada, 37, p. 38.

[4] freeCodeCamp, 2018. The Curse of Dimensionality. [online] Available at: https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335 [Accessed 8 Apr. 2018].

[5] Griffiths, T.L. and Ghahramani, Z., 2011. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr), pp. 1185-1224.

[6] Chen, L., 2009. Curse of Dimensionality. In: Liu, L. and Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.

[7] Investopedia. Financial Forecasting: The Bayesian Method. [online] Available at: https://www.investopedia.com/articles/financial-theory/09/bayesian-methods-financial-modeling.asp [Accessed 31 October 2017].

[8] Machine Learning Mastery, 2016. Overfitting and Underfitting With Machine Learning Algorithms. [online] Available at: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/ [Accessed 31 October 2017].

[9] Orbanz, P. and Teh, Y.W., 2011. Bayesian nonparametric models. In Encyclopedia of Machine Learning (pp. 81-89). Springer US.

[10] Ghahramani, Z., 1995. Factorial learning and the EM algorithm. In Advances in Neural Information Processing Systems (pp. 617-624).

[11] Orbanz, P. and Teh, Y.W., 2011. Bayesian nonparametric models. In Encyclopedia of Machine Learning (pp. 81-89). Springer US.

[12] Mathworld.wolfram.com, 2018. Matrix Trace, from Wolfram MathWorld. [online] Available at: http://mathworld.wolfram.com/MatrixTrace.html [Accessed 8 Apr. 2018].

[13] Mathworld.wolfram.com, 2018. Determinant, from Wolfram MathWorld. [online] Available at: http://mathworld.wolfram.com/Determinant.html [Accessed 8 Apr. 2018].

[14] Straub, W.O., 2009. A brief look at Gaussian integrals. Technical Report.

[15] Yildirim, I., 2012. Bayesian inference: Metropolis-Hastings sampling. Department of Brain and Cognitive Sciences, Univ. of Rochester, Rochester, NY.

[16] Sorzano, C.O.S., Vargas, J. and Montano, A.P., 2014. A survey of dimensionality reduction techniques. arXiv preprint arXiv:1403.2877.

[17] Qu, L., Li, L., Zhang, Y. and Hu, J., 2009. PPCA-based missing data imputation for traffic flow volume: A systematical approach. IEEE Transactions on Intelligent Transportation Systems, 10(3), pp. 512-522.

[18] Brilliant.org, 2018. Subspace, Brilliant Math & Science Wiki. [online] Available at: https://brilliant.org/wiki/subspace/ [Accessed 15 Apr. 2018].

[19] Lawrence, N., 2005. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov), pp. 1783-1816.

[20] Minka, T.P., 2001. Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems (pp. 598-604).