
Source: people.ee.duke.edu/~lcarin/joint_model_ray1.pdf

Bayesian Joint Analysis of Heterogeneous Data

1Priyadip Ray, 2Lingling Zheng, 1Yingjian Wang, 2Joseph Lucas,3David Dunson and 1Lawrence Carin

1Electrical & Computer Engineering Department2Institute for Genomic Science and Policy

3Department of Statistics

Duke University

Durham, NC, USA

Abstract

A Bayesian factor model is proposed for integrating multiple disparate,

but related datasets. The approach is based on factoring the latent space

(feature space) into a shared component and a data-specific component, with

the dimension of these spaces inferred via a beta-Bernoulli process. For cases

in which there are space-time covariates, the factor scores and/or loadings are

modeled via a Gaussian process (GP), with inhomogeneity addressed through

a novel kernel stick-breaking process (KSBP) based mixture of GPs. Theoret-

ical properties of the KSBP-GP factor model are discussed, and an MCMC

algorithm is developed for posterior inference. The proposed approach is

first demonstrated by jointly analyzing multiple types of genomic data (gene

expression, copy number variations, and methylation) for ovarian cancer pa-

tients, showing that the model can uncover key attributes related to cancer;

these heterogeneous data allow consideration of model performance in the ab-

sence of space-time covariates. Analysis of space-time-dependent data is con-

sidered in the form of multi-year unemployment rates at various geographical

locations, with these data analyzed jointly with time-evolving stock prices of

companies in the S&P 500.

Key words: Data fusion, joint factor analysis, Gaussian process, DNA microarray

analysis, spatio-temporal data

1 Introduction

An important research problem in statistical signal processing and machine learning

concerns integration/fusion of multiple disparate, but statistically related datasets.


For example, in genomic signal processing, integration of DNA copy number vari-

ations and gene expressions may help identify key drivers in cancer mechanisms.

In econometrics and social sciences, joint analysis of multiple heterogeneous finan-

cial and social databases may reveal interesting underlying socio-economic trends.

Though the range of potential applications is immense, the increase in data di-

mension, data heterogeneity and the presence of noise often make such data fusion

problems challenging.

A key assumption employed when modeling such high-dimensional data is that

the intrinsic dimension of the data is much lower than the observed data dimen-

sion, i.e., the data lie in or are close to a low-dimensional subspace. For modeling

multiple disparate datasets, approaches often rely on the assumption that the data

are different manifestations of a single shared low-dimensional latent space (feature

space). The problem then lies in identifying this low-dimensional shared feature

space and the data-specific mappings from this shared space to the observed data.

Classical data analysis techniques for multiple datasets, such as canonical correla-

tion analysis (CCA) (Hotelling, 1936; Borga, 1998; Hardoon et al., 2004), compute

a low-dimensional shared linear embedding of a set of variables, such that the correlations among the variables are maximized in the embedded space. Probabilistic

approaches to CCA have been proposed in (Bach and Jordan, 2005; Wang, 2007;

Rai and Daume, 2009). For joint analysis of multiple data sets, (Bach and Jordan,

2005; Wang, 2007; Rai and Daume, 2009) assume the existence of underlying shared

latent variables and conditional independence of the data given the latent variables.

However, the assumption of a single shared latent space may be limiting, and a more

flexible approach is to factorize the latent space into a component that is shared

among all datasets and a component that is specific to each. Such models are more

likely to capture the shared features among all datasets while still preserving the

idiosyncratic features unique to each.

Bayesian and semi-Bayesian latent variable models have been developed to factorize

the latent space into a shared and data-specific part (Archambeau and Bach, 2008;

Klami and Kaski, 2008). However, in these approaches the number of latent factors is chosen a priori. Alternatively, one may consider multiple factor models, each

with a different number of factors, and perform model selection based on information

criteria such as AIC (Akaike, 1987) or BIC (Schwarz, 1978). However, as it is often

challenging to check modeling assumptions in high dimensions, a nonparametric


or semiparametric model is desirable. In this paper we propose a Bayesian factor

analysis approach for integrating multiple heterogeneous datasets, with the number

of factors inferred from the available data. Our proposed approach is based on

factoring the latent space into shared and data-specific components, employing a

beta-Bernoulli process (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007;

Paisley and Carin, 2009) to infer the dimension of these latent spaces.

We further extend the proposed approach to problems with multiple heteroge-

neous data sources exhibiting spatio-temporal dependencies. Gaussian Process (GP)

priors (Rasmussen and Williams, 2005) provide a particularly effective solution for

incorporating knowledge of the spatial locations and the time stamps in spatio-

temporal data. In (Luttinen and Ilin, 2009; Schmidt and Laurberg, 2008; Schmidt,

2009), the authors propose GP factor analysis (GPFA) techniques for modeling a

single spatio-temporal dataset, with spatial dependence incorporated in the factor

loadings and with temporal dependence incorporated in the factor scores. These

approaches are primarily parametric, i.e., the number of factor loadings is assumed

known a priori. Further, in the GPFA model considered in (Luttinen and Ilin, 2009),

the factor loadings capture the spatial correlation and each factor loading is drawn

from a single GP. However, the underlying assumption, that the correlation pattern

is spatially invariant, may not hold for many real datasets. For example, it is likely

that the spatial correlation pattern among the cities in densely populated regions is

different from that among less densely populated regions.

To model data with such non-stationary spatial covariance structure, we first

propose a new GP factor model, in which the factor loadings are drawn from a

mixture of GPs. Though not specifically in the context of factor models, mixture

of GPs have been applied previously to model data with non-stationary covariance

structure. Most mixture of GP approaches, such as (Shi et al., 2003), assume a

known number of mixture components. Nonparametric approaches, with the num-

ber of mixture components inferred from the data, have been proposed in (Tresp,

2001; Rasmussen and Ghahramani, 2002; Meeds and Osindero, 2006; Gramacy and

Lee, 2007). In (Gramacy and Lee, 2007), the authors propose a tree-based parti-

tioning approach to divide the input space and fit different base models to data

independently in each region. In (Rasmussen and Ghahramani, 2002; Meeds and

Osindero, 2006), the authors propose an input-dependent Dirichlet process mixture

of GPs to model non-stationary covariance functions.


We propose a novel GP mixture model based on the kernel stick breaking process

(KSBP) (Dunson and Park, 2008). Our approach is similar in spirit to (Rasmussen

and Ghahramani, 2002), but it has several advantages, discussed in depth below.

We also provide a detailed theoretical analysis of the properties of the imposed prior.

The proposed KSBP-GP approach (where a smaller subset of the data, instead of

the entire data, is associated with each GP) has the added advantage of improved

computational efficiency relative to a single GP. The computational efficiency of

our approach may be improved further by adopting various existing approximations

for GP regression. A detailed overview of such approximations may be found in

(Rasmussen and Williams, 2005; Candela et al., 2007).

To first isolate the component of the model that shares information across heterogeneous data sources, without consideration of space-time phenomena, we consider the joint anal-

ysis of genomic data for ovarian cancer patients. Three different datasets are con-

sidered: gene expression, copy number variations, and DNA methylation levels for

ovarian cancer patients. We demonstrate that the joint analysis of gene expres-

sion/copy number variations and gene expression/DNA methylation levels can po-

tentially identify genomic and epigenomic regulators influencing cancer pathophys-

iology outcomes. To clearly illustrate the performance of the KSBP-GP mixture

prior, we isolate it from the factor model and apply it separately to the classic motorcycle data (Silverman, 1985), which consists of measurements of the acceleration

of the head of a motorcycle rider in the first moments after impact.

To demonstrate the model on data with space-time dependencies, we next an-

alyze time-evolving unemployment rates at various geographical locations in the

United States. We consider two types of data: one contains time series of unem-

ployment rates at 83 counties in the state of Michigan (relatively uniform spatial

population density) and the other contains time series of unemployment rates at

187 metro cities across the entire United States (highly non-uniform spatial popu-

lation density). Further, by jointly analyzing other readily available auxiliary time-

dependent data sources that have significant correlation with the job market, we

may be able to improve the learning of space-time unemployment rates. An example of

such an auxiliary data source, considered in this paper, is the time-dependent stock

prices of companies listed in the S & P 500. The fundamental idea is to appro-

priately borrow statistical strength between these distinct but correlated data, to

obtain a better representation of each data type. We demonstrate that the KSBP-


GP joint factor model can exploit the shared statistical features across multiple data

types, to learn a better model for each data type, and considerably improve spatial

imputation of unemployment rates.

The remainder of the paper is organized as follows. In Section 2 we present the

proposed hierarchical Bayesian model for jointly analyzing heterogeneous data. In

Section 3 we discuss priors on factor loadings and in Section 4 we discuss priors

on factor scores. In Section 5 we provide theoretical properties of the proposed

KSBP-GP mixture prior. Section 6 outlines an MCMC inference algorithm, and

Section 7 provides experimental results for the proposed model on the analysis of

multiple genomic data, motorcycle data and econometric data. Finally we provide

concluding remarks in Section 8.

2 Joint Bayesian Factor Analysis

Let $\{X^{(r)}\}_{r=1}^{R}$ represent data from $R$ different modalities, where $X^{(r)} = (x^{(r)}_1, \ldots, x^{(r)}_M) \in \mathbb{R}^{N_r \times M}$. In sparse factor modeling, learning a single shared matrix of factor load-

ings for different signal classes has been proposed in (Mairal et al., 2008). However

for heterogeneous data such as that considered here, learning a shared set of factor

loadings is more difficult.

The joint factor model may be represented as

$$X^{(r)} = D^{(r)}\left(W^{(c)} + W^{(r)}\right) + E^{(r)} \qquad (1)$$

The matrix $D^{(r)} = (d^{(r)}_1, \ldots, d^{(r)}_K) \in \mathbb{R}^{N_r \times K}$ consists of the factor loadings specific to data modality $r$, the factor scores $W^{(r)} = (w^{(r)}_1, \ldots, w^{(r)}_M) \in \mathbb{R}^{K \times M}$ are specific to data from modality $r$, $W^{(c)} = (w^{(c)}_1, \ldots, w^{(c)}_M) \in \mathbb{R}^{K \times M}$ consists of the factor scores common to all modalities, and $E^{(r)} = (\varepsilon^{(r)}_1, \ldots, \varepsilon^{(r)}_M) \in \mathbb{R}^{N_r \times M}$ consists of the noise/residual specific to data of modality $r$.
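To make the construction concrete, the generative process in (1) can be sketched in a few lines; this is a minimal simulation, and the dimensions and noise scales below are illustrative rather than taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

R, K, M = 2, 5, 100            # modalities, latent factors, samples (illustrative)
N = [30, 40]                   # observed dimension N_r of each modality

W_c = rng.normal(size=(K, M))  # factor scores shared across all modalities
X = []
for r in range(R):
    D_r = rng.normal(size=(N[r], K))          # loadings specific to modality r
    W_r = 0.1 * rng.normal(size=(K, M))       # scores specific to modality r
    E_r = 0.01 * rng.normal(size=(N[r], M))   # noise/residual for modality r
    X.append(D_r @ (W_c + W_r) + E_r)         # equation (1)

print([x.shape for x in X])
```

Here the shared scores $W^{(c)}$ induce correlation between the two modalities, while $W^{(r)}$ and $E^{(r)}$ capture the modality-specific structure.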

Note that one may alternatively consider

$$X^{(r)} = D^{(rc)} W^{(c)} + D^{(r)} W^{(r)} + E^{(r)} \qquad (2)$$

where $D^{(rc)}$ corresponds to the factor loadings associated with the common factors, with $D^{(rc)}$ reflecting how these common factors are viewed by modality $r$; $D^{(r)}$ are the factor loadings associated with factors specific only to modality $r$. The framework

5

Page 6: Bayesian Joint Analysis of Heterogeneous Datapeople.ee.duke.edu › ~lcarin › joint_model_ray1.pdfBayesian Joint Analysis of Heterogeneous Data 1Priyadip Ray, 2Lingling Zheng, 1Yingjian

in (1), which we employ throughout, allows factor loadings to be shared between the common and modality-specific factors, and the manner in which $W^{(c)}$ and $W^{(r)}$ are modeled allows sufficient flexibility to yield (2) if the data so warrant.

We wish to impose the condition that any $x^{(r)}_i$ is a sparse linear combination of the factor loadings. Hence, the factor scores are represented as

$$w^{(r)}_i = s^{(r)}_i \circ b^{(r)}_i \quad \text{and} \quad w^{(c)}_i = s^{(c)}_i \circ b^{(c)}_i \qquad (3)$$

where $s^{(r)}_i \in \mathbb{R}^K$, $s^{(c)}_i \in \mathbb{R}^K$, $b^{(r)}_i \in \{0,1\}^K$, $b^{(c)}_i \in \{0,1\}^K$, and $\circ$ represents the Hadamard product (elementwise vector product).

The choice of priors for $D^{(r)}$, $s^{(r)}_i$ and $s^{(c)}_i$ is application-dependent and, in the case of $D^{(r)}$, also modality-dependent; these are discussed below for specific examples. The sparse binary vectors $b^{(r)}_i$ are drawn from the following beta-Bernoulli process (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007; Paisley and Carin, 2009):

$$b^{(r)}_i \sim \prod_{k=1}^{K} \mathrm{Bernoulli}(\pi_k), \qquad \pi \sim \prod_{k=1}^{K} \mathrm{Beta}(c\alpha, c(1-\alpha)) \qquad (4)$$

with $\pi_k$ representing the $k$th component of $\pi$ and $\alpha \in (0,1)$. In practice $K$ is finite, and the above equation represents a finite approximation to the beta-Bernoulli process, in which the number of non-zero components of each $b^{(r)}_i$ is a random variable drawn from $\mathrm{Binomial}(K, \alpha)$. If $\alpha$ is set to $\rho/K$, in the limit $K \to \infty$ this reduces to the number of non-zero components of $b^{(r)}_i$ being drawn from $\mathrm{Poisson}(\rho)$; this corresponds to the Indian buffet process (IBP) (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007; Paisley and Carin, 2009). We may therefore explicitly impose a prior belief on the number of non-zero components of $w^{(r)}_i$. The shared binary vectors $b^{(c)}_i$ are modeled in the same manner as $b^{(r)}_i$. The noise or residual in (1) is modeled as

$$\varepsilon^{(r)}_i \sim \mathcal{N}\left(0, {\gamma^{(r)}_\varepsilon}^{-1} I_{N_r}\right), \qquad \gamma^{(r)}_\varepsilon \sim \mathrm{Gamma}(a_0, b_0) \qquad (5)$$

where $I_{N_r}$ represents the $N_r \times N_r$ identity matrix.
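The sparse-score construction in (3)-(4) can be sketched as follows; the values of $K$, $M$, $c$ and $\alpha$ here are illustrative (the paper's actual hyperparameter settings appear in Section 6).

```python
import numpy as np

rng = np.random.default_rng(1)

K, M = 100, 500       # truncation level and number of score vectors (illustrative)
c, alpha = 1.0, 0.05  # beta-Bernoulli hyperparameters

pi = rng.beta(c * alpha, c * (1.0 - alpha), size=K)   # pi_k, shared across all i
B = rng.random(size=(K, M)) < pi[:, None]             # binary vectors b_i as columns
S = rng.normal(size=(K, M))                           # Gaussian scores s_i
W = S * B                                             # Hadamard product, equation (3)

# each column uses roughly Binomial(K, alpha) factors, i.e. about K * alpha of them
print(W.shape, float(B.sum(axis=0).mean()))
```

Because $\pi$ is shared across columns, the same few factors tend to be active for all score vectors, which is what makes the dimension of the latent space effectively inferable.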

The construction in (1) imposes the belief that there are underlying (low-dimensional) features, represented by the factor scores, that may be shared across modalities via $W^{(c)}$; however, each modality has a unique mapping from these low-dimensional factor scores to the high-dimensional data, reflected by $D^{(r)}$. Further, each modality


may also have idiosyncratic low-dimensional features, characterized by $W^{(r)}$. The common and idiosyncratic features are learned jointly, via the simultaneous analysis of all modalities. A unique feature of the above construction is that it allows complete sharing of some low-dimensional features across different data modalities, as well as partial sharing, i.e., a shared feature may be slightly perturbed via $W^{(r)}$ and shared across different modalities.

3 Imposing Structure on Factor Loadings

3.1 Simple construction

In the absence of covariates, the factor loadings may be drawn i.i.d. from a Gaussian

distribution (for ease of notation, we henceforth drop the modality index r, unless

referring to multiple data modalities simultaneously),

$$d_k \sim \mathcal{N}(0, \gamma_s^{-1} I_N), \qquad \gamma_s \sim \mathrm{Gamma}(a_5, b_5) \qquad (6)$$

3.2 Imposing sparsity

In many biological applications, it is desirable that the factor loading matrix is

sparse (Carvalho et al., 2008). To impose sparsity on the factor loadings, we employ

a Student-t sparseness-promoting prior (Tipping, 2001). In this construction $d_{jk}$, the $j$th component of $d_k$, is drawn

$$d_{jk} \sim \mathcal{N}(0, \tau_{jk}^{-1}), \qquad \tau_{jk} \sim \mathrm{Gamma}(a_1, b_1) \qquad (7)$$

However, there are multiple ways one may desire to impose sparsity, such as using the

spike-slab prior (Ishwaran and Rao, 2005; Carvalho et al., 2008; Chen et al., 2011).

This consists of a discrete-continuous mixture of a point mass at zero, referred to

as the ‘spike’, and any other distribution, such as the Gaussian distribution, known

as the ‘slab’. A hierarchical beta-Bernoulli construction of the spike-slab prior for

imposing sparsity on the factor loadings is provided in (Chen et al., 2011). We found

that the spike-slab prior works as well as the model presented above; however, for

the sake of brevity, we include only the results for the Student-t sparseness prior in

this paper.


3.3 Covariate dependent factor loadings

3.3.1 Stationary covariance structure

We may model the covariate-dependent factor loadings as being drawn from a GP, with the $k$th column of $D$, denoted $d_k$, drawn

$$d_k \sim \mathcal{N}(0, \Sigma) \qquad (8)$$

$$\Sigma(n,m) = \tau \exp\left(-\beta \|r_n - r_m\|_2^2\right) + \sigma \delta_{n,m} \qquad (9)$$

where $r_n$ and $r_m$ represent the $n$th and $m$th instances of the covariates, respectively. The kernel (9) embeds the covariates into the covariance matrix, and is characterized by three parameters: $\tau$ controls the signal variance, $\sigma$ controls the noise variance, and $\beta$ is the bandwidth or scale parameter, which controls the amount of smoothing.

Other popular kernel choices may be found in (Rasmussen and Williams, 2005) and

references therein. Equation (9) imposes the belief that data that have similar

covariates are likely to be correlated and the correlation decays with increasing

distance in the covariate space. This presumes a stationary covariance structure.

We will utilize this prior for spatio-temporal data of unemployment rates over an

approximately uniformly populated region.
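The construction in (8)-(9) can be sketched on a hypothetical one-dimensional covariate grid; the grid and kernel parameters below are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

r = np.linspace(0.0, 10.0, 50)[:, None]   # covariates r_1, ..., r_N (1-D grid)
tau, beta, sigma = 1.0, 0.5, 1e-4         # signal variance, bandwidth, noise variance

sq_dist = (r - r.T) ** 2                  # ||r_n - r_m||^2 for all pairs
Sigma = tau * np.exp(-beta * sq_dist) + sigma * np.eye(len(r))   # equation (9)

# draw one factor loading d_k ~ N(0, Sigma), equation (8)
d_k = rng.multivariate_normal(np.zeros(len(r)), Sigma)

print(d_k.shape)
```

Nearby covariates yield large off-diagonal entries of $\Sigma$, so the sampled loading varies smoothly over the covariate space, encoding the stationary-correlation belief stated above.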

3.3.2 Non-stationary covariance structure

For data with non-stationary covariance structure, we propose to divide the data

into clusters, with each cluster sharing a unique GP. We utilize a mixture-of-GP

approach based on the kernel stick-breaking process (KSBP) (Dunson and Park,

2008). The KSBP is based on the stick-breaking process (Sethuraman, 1994), which

involves sequential breaks of "sticks" of length $w_l$ from an original stick of unit length, with $\sum_{l=1}^{\infty} w_l = 1$. In a stick-breaking process, the cluster indicators are drawn as

$$z(i) \sim \sum_{l=1}^{\infty} w_l \delta_l, \quad i = 1, \ldots, N; \qquad w_l = V_l \prod_{j=1}^{l-1}(1 - V_j); \qquad V_l \sim \mathrm{Beta}(1, \gamma) \qquad (10)$$

Due to the properties of the beta distribution, for small $\gamma$ it is likely that only a relatively small set of "sticks" will have appreciable weight.
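The truncated stick-breaking weights in (10) can be sketched as follows; the truncation level and $\gamma$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

L, gamma = 50, 1.0                   # truncation level and concentration (illustrative)
V = rng.beta(1.0, gamma, size=L)     # stick proportions V_l ~ Beta(1, gamma)
# w_l = V_l * prod_{j<l} (1 - V_j), implemented with a shifted cumulative product
w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

print(float(w.sum()))   # approaches 1 as L grows; the remainder is the unbroken stick
```

By the telescoping identity, $\sum_{l=1}^{L} w_l = 1 - \prod_{l=1}^{L}(1 - V_l)$, so for moderate $L$ essentially all mass is accounted for, and small $\gamma$ concentrates it on the first few sticks.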

The primary difference between a stick-breaking process and a kernel stick-


breaking process is that the stick weights are further modulated by an additional bounded kernel $K(r; r^*_l, \phi_l) \in [0,1]$, which is a function of the covariates $r$. This imposes the belief that data that are closer in the covariate space will have similar stick weights $w_l(r)$, and hence are likely to share the same cluster. In a KSBP, the cluster indicators are drawn as

$$z(i) \sim \sum_{l=1}^{\infty} w_l(r_i)\, \delta_l, \qquad w_l(r) = V_l K(r; r^*_l, \phi_l) \prod_{j=1}^{l-1}\left[1 - V_j K(r; r^*_j, \phi_j)\right] \qquad (11)$$

where we employ a radial basis function (RBF) kernel,

$$K(r; r^*_l, \phi_l) = \exp\left(-\frac{\|r - r^*_l\|^2}{\phi_l}\right) \qquad (12)$$

and $V_l \sim \mathrm{Beta}(1, \gamma)$, $\gamma \sim \mathrm{Gamma}(a_4, b_4)$, $r^*_l \sim F$, and $\phi_l \sim H$. The expression $\delta_l$ corresponds to a unit point measure at the index $l$.

Here $F$ denotes the prior distribution over potential kernel-center locations, and would ideally vary continuously over the entire space; however, this may complicate computations. Hence, we choose a discrete prior, i.e., $F \sim \sum_h e_h \delta_{r^*_h}$, where $\{r^*_h\}$ constitutes a grid of potential basis locations ($e_h > 0$ and $\sum_h e_h = 1$). Similarly, the KSBP kernel widths, which may be interpreted as the regions of influence of the basis functions, are inferred from the data. For ease of posterior inference, a discrete prior $H \sim \sum_h p_h \delta_{\phi_h}$ is chosen, where $\{\phi_h\}$ represents a set of potential kernel widths ($p_h > 0$ and $\sum_h p_h = 1$). Details on inference of these parameters are provided in Appendix B.

For the case of covariate-dependent factor loadings, a single KSBP is drawn, with

associated covariates. For each factor loading $d_k$, the components of this vector are apportioned to clusters via the KSBP. Each such cluster corresponds to a unique GP, with associated GP parameters; all components of $d_k$ associated with a given cluster are drawn jointly from that GP. Specifically, the $l$th GP for $d_k$ is characterized

by covariance

$$\Sigma_{kl}(n,m) = \tau_l \exp\left(-\beta_l \|r_n - r_m\|_2^2\right) + \sigma_l \delta_{n,m}$$

for $n, m \in A_{kl}$, where $A_{kl} = \{n : z_k(n) = l\}$ for $n = 1, \ldots, N$, and $z_k(n)$ represents the cluster indicator for the $n$th element of $d_k$; $A_{kl}$ is the set of components of $d_k$ that are associated with GP component $l$, as apportioned via the KSBP. In practice


we truncate the number of GP mixture components to a large value $J$.
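Given cluster indicators from the KSBP, a factor loading is assembled cluster-by-cluster from its own GP; a sketch follows, in which the indicators and GP parameters are random stand-ins rather than inferred quantities.

```python
import numpy as np

rng = np.random.default_rng(5)

N, J = 100, 5
r = np.linspace(0.0, 10.0, N)
z = rng.integers(0, J, size=N)       # stand-in KSBP cluster indicators z_k(n)
tau = rng.gamma(2.0, 1.0, size=J)    # per-cluster GP parameters (illustrative draws)
beta = rng.gamma(2.0, 0.5, size=J)
sigma = 1e-4

d_k = np.zeros(N)
for l in range(J):
    idx = np.flatnonzero(z == l)     # A_kl: components of d_k assigned to GP l
    if idx.size == 0:
        continue
    rs = r[idx]
    Sigma_l = tau[l] * np.exp(-beta[l] * (rs[:, None] - rs[None, :]) ** 2) \
              + sigma * np.eye(idx.size)
    d_k[idx] = rng.multivariate_normal(np.zeros(idx.size), Sigma_l)

print(d_k.shape)
```

Because each cluster is handled with its own, smaller covariance matrix, this is also the source of the computational savings noted in the Introduction relative to a single GP over all $N$ components.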

We will utilize this prior for spatio-temporal data of unemployment rates of cities

over a large non-uniformly populated region. The KSBP-GP not only imposes

the belief that cities form clusters based on similarities in their unemployment

rates, but it additionally imposes that cities that are geographically proximate are

likely to belong to the same cluster. This is the fundamental difference from a

Dirichlet process mixture of GPs (Rasmussen and Ghahramani, 2002), which does

not incorporate spatial information in the clustering process; other related mixture

models are discussed in Muller et al. (1996); Gelfand et al. (2005); Duan et al.

(2007). The benefits of the KSBP-GP mixture over a DP-GP mixture, for modeling

non-stationary spatial data, are illustrated in the results provided later.

4 Imposing Structure on Factor Scores

In the simplest scenario (for example, for the genomic data considered in this paper), the factor scores may be drawn i.i.d. from a Gaussian distribution as

$$s^{(r)}_i \sim \mathcal{N}(0, \gamma_r^{-1} I_K), \qquad s^{(c)}_i \sim \mathcal{N}(0, \gamma_c^{-1} I_K) \qquad (13)$$

We impose broad gamma priors on $\gamma_r$ and $\gamma_c$: $\gamma_r \sim \mathrm{Gamma}(a_2, b_2)$ and $\gamma_c \sim \mathrm{Gamma}(a_3, b_3)$.

For data with covariates and with a stationary covariance structure, the scores

may be drawn in a similar manner to the factor loadings, as discussed earlier in

Section 3.3.1. Let $S^{(r)} = (s^{(r)}_1, \ldots, s^{(r)}_M) \in \mathbb{R}^{K \times M}$. The rows of $S^{(r)}$, of length $M$ and with the $k$th row denoted $s^{(r)}_{(k)}$, are drawn from a Gaussian process (for example, for time-dependent data, $s^{(r)}_{(k)}$ represents the time dependence of factor $k$ at $M$ time points). We employ the GP construction

$$s^{(r)}_{(k)} \sim \mathcal{N}(0, \Sigma') \qquad (14)$$

$$\Sigma'(p,q) = \tau^{(t)} \exp\left(-\beta^{(t)} \|t_p - t_q\|_2^2\right) + \sigma^{(t)} \delta_{p,q} \qquad (15)$$

where $t_p$ and $t_q$ represent the $p$th and $q$th covariates, respectively. Note that the rows of $S^{(c)} = (s^{(c)}_1, \ldots, s^{(c)}_M)$ are modeled in the same manner as the rows of $S^{(r)}$.

We have employed this construction for the space-time econometric data and


time-dependent stock-price data. For data with non-stationary covariance structure

(non-stationary temporal behavior), the factor scores may be drawn from a KSBP-

GP mixture, as discussed earlier in Section 3.3.2. However, for the data considered

in this paper, it was deemed not necessary to employ a KSBP-GP mixture for the

factor scores.

5 Conditional and Marginal Properties of the KSBP-GP Mixture

We next present analytical results on the correlation properties of the KSBP-GP

mixture prior. We are particularly interested in observing the effect of the KSBP

parameters (GP bandwidth parameters and the stick-breaking parameter γ) on the

covariance structure. For ease of understanding, the interpretations of the properties

are provided in the context of the space-time data considered in this paper. The proofs

of the following properties are provided in Appendix A.

5.1 Conditional spatial covariance

Let $\Theta_k = \{r^*_{kl}, \phi_{kl}, V_{kl}\}_{l=1}^{\infty}$ represent the parameter set of the KSBP and $\Omega_k = \{\tau^{(s)}_{kl}, \beta^{(s)}_{kl}, \sigma^{(s)}_{kl}\}_{l=1}^{\infty}$ represent the parameter set for the GPs corresponding to the $k$th factor loading. It can be shown that the conditional spatial covariance is

$$\mathrm{Cov}(d_{ik}, d_{i'k} \mid \Theta_k, \Omega_k) = \langle \psi_{ik}, \psi_{i'k} \rangle \qquad (16)$$

where $d_{ik}$ represents the $i$th element of $d_k$, $\psi_{ik} = [\sigma_{k1} w_{ik1}, \sigma_{k2} w_{ik2}, \ldots, \sigma_{k\infty} w_{ik\infty}]^T$, $\psi_{i'k} = [\sigma_{k1} w_{i'k1}, \sigma_{k2} w_{i'k2}, \ldots, \sigma_{k\infty} w_{i'k\infty}]^T$, $\sigma_{kl} = \sqrt{\Sigma_{k,l}(i, i')}$, and $w_{ikl} = V_{kl} K(r_i; r^*_{kl}, \phi_{kl}) \prod_{j=1}^{l-1}\left[1 - V_{kj} K(r_i; r^*_{kj}, \phi_{kj})\right]$.

Equation (16) shows the dependence of the conditional spatial covariance on the KSBP kernel parameter $\phi_{kl}$. Note that a smaller $\phi_{kl}$ implies little borrowing of spatial information, whereas a larger $\phi_{kl}$ implies greater sharing of spatial information. As $\phi_{kl}$ becomes smaller, the elements of $d_k$ are less correlated spatially; in the limit $\phi_{kl} \to 0$, $\mathrm{Cov}(d_{ik}, d_{i'k} \mid \Theta_k, \Omega_k) = 0$, and the model reduces to the elements of $d_k$ being drawn independently from a normal distribution. As $\phi_{kl}$ becomes larger, more long-range spatial correlation is encouraged; in the limit $\phi_{kl} \to \infty$, the kernel effect vanishes and the model reduces to a Dirichlet process (DP) mixture of GP


prior on $d_k$. Note that in our model the factor loadings $\{d_k\}_{k=1}^{K}$ are drawn from a mixture of Gaussians and hence are non-Gaussian; this is a powerful feature of the proposed model, since most spatio-temporal factor models assume that the factor loadings are Gaussian (Luttinen and Ilin, 2009; Lopes et al., 2008, 2011).

5.2 Marginal spatial covariance

The marginal spatial covariance is

$$\mathrm{Cov}(d_{ik}, d_{i'k}) = \sum_{l=1}^{\infty} \int_{\Theta_k, \Omega_k} \Sigma_{k,l}(i, i')\, p\big(z_k(i) = z_k(i') = l \mid \Theta_k\big)\, P(d\Theta_k)\, P(d\Omega_k)$$

To demonstrate the properties of the marginal spatial covariance, we analyze its

behavior under some simplifying assumptions. We assume that all the GPs share the same kernel parameters, i.e., $\Sigma_{k,l}(i,i') = \Sigma_k(i,i')$. We further assume a rectangular kernel given by $K(r, r^*; \phi) = 1$ for $\|r - r^*\|_2 \le \Delta$ and $K(r, r^*; \phi) = 0$ for $\|r - r^*\|_2 > \Delta$; we also assume that $r^*$ is drawn uniformly. The marginal spatial covariance then reduces to

$$\mathrm{Cov}(d_{ik}, d_{i'k}) = \frac{E\left[\Sigma_k(i,i')\right]}{\dfrac{2+\gamma}{\dfrac{2}{\pi}\left[\arcsin\!\left(\dfrac{\sqrt{\Delta^2 - \left\|\tfrac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta}\right) - \dfrac{\left\|\tfrac{r_i - r_{i'}}{2}\right\|_2 \sqrt{\Delta^2 - \left\|\tfrac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta^2}\right]} - 1} \qquad (17)$$

where $E$ denotes the expectation operator. Equation (17) casts insight on the effects of the stick-breaking parameter $\gamma$, the distance $\|r_i - r_{i'}\|_2$ between two spatial locations, and the kernel width $\Delta$ on the marginal spatial covariance. For example, when $\gamma$ increases, $\mathrm{Cov}(d_{ik}, d_{i'k})$ decreases, and vice versa. This result is intuitively pleasing: when $\gamma$ increases, the number of inferred clusters increases, which reduces the probability of $r_i$ and $r_{i'}$ sharing the same GP, which in turn reduces the correlation between them. We also observe that when the distance $\|r_i - r_{i'}\|_2$ between the two spatial locations increases, $\mathrm{Cov}(d_{ik}, d_{i'k})$ decreases, and when the kernel width $\Delta$ increases, $\mathrm{Cov}(d_{ik}, d_{i'k})$ increases. A limiting property is obtained as $\Delta \to \infty$: $\mathrm{Cov}(d_{ik}, d_{i'k}) = \frac{E[\Sigma_k(i,i')]}{1+\gamma}$.


5.3 Spatio-temporal covariance

For a data point $x_{ij}$ located at $r_i$ at time $t_j$ and a data point $x_{i'j'}$ located at $r_{i'}$ at time $t_{j'}$, the covariance is given by

$$\mathrm{Cov}(x_{ij}, x_{i'j'}) = \alpha^2 \sum_{k=1}^{K} \mathrm{Cov}(d_{ik}, d_{i'k})\, \mathrm{Cov}(s_{kj}, s_{kj'}) \qquad (18)$$

Equation (18) reveals that the spatio-temporal covariance is equal to the sum, over the latent dimensions, of the products of the spatial and temporal covariances. It is evident that when $d_{ik}$ and $d_{i'k}$ do not share the same GP along any dimension, then $\mathrm{Cov}(x_{ij}, x_{i'j'}) = 0$.

6 MCMC Inference

The conditional posterior distribution of all the model parameters for the joint

factor model (for data without covariates, such as the heterogeneous genomic data

considered in this paper) may be derived analytically. We use a Gibbs sampler to

draw samples from the posterior distribution of the model parameters. For the factor

analysis results on multiple genomic data presented in Sections 7.1.2 and 7.1.3, the

number of Gibbs burn-in samples is set to 3000 and the number of collection samples

is set to 1000. Broad gamma hyperpriors are chosen for the variance terms with

a0 = b0 = a2 = b2 = a3 = b3 = 10^−5. The results are relatively insensitive to these settings; alternatives such as a0 = b0 = a2 = b2 = a3 = b3 = 10^−3 or a0 = b0 = a2 = b2 = a3 = b3 = 10^−6 yielded very similar results. The shrinkage parameters on the factor loadings are set at a1 = 10^−3 and b1 = 10^−6 (for the gene-copy number analysis) and a1 = 1 and b1 = 10^−2 (for the gene-methylation analysis).
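For a Gaussian noise term with a broad gamma hyperprior on its precision, the conditional posterior used in such a Gibbs step is conjugate. The sketch below assumes a Gamma(shape, rate) parameterization of the prior (the paper does not restate its parameterization, so this is an assumption), with the standard update tau | residuals ~ Gamma(a0 + n/2, b0 + Σ residuals²/2):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise_precision(residuals, a0=1e-5, b0=1e-5, rng=rng):
    """Conjugate Gibbs update for a Gaussian noise precision tau with a
    Gamma(a0, b0) prior (shape/rate parameterization assumed here)."""
    residuals = np.asarray(residuals, dtype=float)
    shape = a0 + residuals.size / 2.0
    rate = b0 + residuals @ residuals / 2.0
    # numpy parameterizes the gamma by shape and *scale* = 1/rate
    return rng.gamma(shape, 1.0 / rate)

# With a broad prior and many residuals, the sampled precision
# concentrates near the true inverse noise variance:
res = rng.normal(scale=2.0, size=5000)   # true precision = 1/4
tau = sample_noise_precision(res)
print(tau)   # close to 0.25
```

With a0 = b0 = 10^−5 the prior contributes essentially nothing to the posterior shape and rate, which is consistent with the insensitivity to hyperprior settings reported above.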

For the joint KSBP-GP factor model, the conditional posteriors for all the model

parameters, except the GP parameters, are available in closed form. For the GP

parameters, we obtain point estimates via restricted range maximum likelihood

estimation (Casella and Berger, 2001). For the specific application considered here

(joint modeling of econometric data), we would like the shared latent space to learn

more macroscopic similarities (such as global economic trends) between the different


data and the unique latent space to learn more microscopic features unique to any

particular data. This is imposed by specifying different intervals for the bandwidth

parameter during the restricted range MLE based optimization to update the GP

parameters. For the kernel stick-breaking process we truncate the sum in (11) to J = 50 terms, with wJ(r) = 1 − Σ_{l=1}^{J−1} wl(r). For the results presented on motorcycle

data, unless otherwise noted, the number of Gibbs burn-in samples is set to 3000

and the number of collection samples is set to 1000. For the results presented on

econometric data, the number of Gibbs burn-in samples is set to 1500 and the

number of collection samples is set to 500. All important (unique) update equations

are provided in Appendix B.
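The truncated KSBP weights may be computed as in the following sketch. It assumes the standard kernel stick-breaking construction of Dunson and Park, wl(r) = Vl K(r, r*_l) Π_{j<l}(1 − Vj K(r, r*_j)), with a Gaussian kernel of width φ; the kernel form, function names, and all numeric values are illustrative assumptions, not the fitted model:

```python
import numpy as np

def ksbp_weights(r, anchors, V, phi, J=50):
    """Truncated kernel stick-breaking weights at location r.

    Assumes wl(r) = Vl K(r, r*_l) prod_{j<l} (1 - Vj K(r, r*_j)) with a
    Gaussian kernel K(r, r*) = exp(-||r - r*||^2 / (2 phi^2)).  The last
    weight absorbs the remaining mass: wJ(r) = 1 - sum_{l<J} wl(r).
    """
    anchors = np.asarray(anchors, dtype=float)
    V = np.asarray(V, dtype=float)
    K = np.exp(-np.sum((anchors[:J - 1] - r) ** 2, axis=-1) / (2.0 * phi**2))
    broken = V[:J - 1] * K
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - broken)[:-1]))
    w = np.empty(J)
    w[:J - 1] = broken * remaining
    w[J - 1] = 1.0 - w[:J - 1].sum()   # truncation term, as in the text
    return w

rng = np.random.default_rng(1)
J = 50
anchors = rng.uniform(0, 10, size=(J, 2))   # anchor points r*_l (synthetic)
V = rng.beta(1.0, 2.0, size=J)              # stick-breaking variables
w = ksbp_weights(np.array([3.0, 4.0]), anchors, V, phi=1.5, J=J)
print(w.sum())   # weights sum to one by construction
```

Because the kernel decays with distance from the anchor points, locations near a given r*_l receive large weight on cluster l, which is what induces the spatially local clustering used throughout Section 7.2.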

Since much correlation is encoded in the priors, the mixing of the MCMC sampler

was also carefully examined. The sampler was run extensively for different numbers of burn-in and collection samples. It was also run multiple times in parallel with

different initial values. The results of these experiments were found to be consistent

and repeatable across such runs.

7 Experimental Results

We perform a number of experiments to validate the joint factor model as well as its

various components. The proposed model (for data without covariates) is applied

to heterogeneous genomic data for ovarian cancer patients. Next, the KSBP-GP

mixture prior is isolated from the factor model, and validated separately on motor-

cycle data (measurements of the acceleration of the head of a motorcycle rider after

impact). Finally, the KSBP-GP factor model is demonstrated first on the analy-

sis of spatially inhomogeneous data of a single modality (multi-year unemployment

rates of metro cities across the US), and then on multi-modal space-time data (US and

Michigan unemployment rates and stock prices of companies in the S & P 500).

7.1 Joint analysis of heterogeneous genomic data

There are numerous publications on combining different types of DNA modifications

with gene expression. Perhaps the most natural of these are methods such as expres-

sion quantitative trait loci (eQTL) analysis (Kendziorski et al., 2006); more recent

eQTL formulations are discussed in (Scott-Boyer et al., 2012). CNAmet (Louhimo

and Hautaniemi, 2011) defines a similar approach to relate gene expression changes


with either copy number change or DNA methylation. Other approaches use well

established models for each of the individual data types, then combine the results

into a statistic that addresses the problem of interest. The approach of Jeong et al.

(2010) is an example of this for the identification of genes that are regulated by DNA

methylation. A shortcoming of all of these approaches is that they do not reduce

the dimension of the individual data sets through an accounting of their respective

correlation structures.

In (Lanckriet et al., 2004) the authors used kernel functions predefined for each

data type, and mapped to the same vector space, which allows joint analysis in the

common range of the kernels. Copy number and expression in cancer (CONEXIC)

(Akavia et al., 2010) has been proposed as a Bayesian scoring function that measures

how well a set of candidate gene regulators correlate with the expression of gene

modules (groups of genes that are correlated with each other). Another approach

(Lucas et al., 2010) utilizes a sparse factor model to model the correlation structure

of the gene expression data, but uses post-hoc hypothesis tests to draw connections

between gene expression and copy number data. These approaches do allow for

effective dimension reduction, but don’t use correlation structure in one data set to

inform estimations of correlation in the others.

The most direct approach to jointly modeling the correlation structure of heterogeneous genomic data is to require the factor matrix to be shared, as in Shen et al.

(2009). Their model does not contain a data-type-specific factor structure equiva-

lent to W (r) in our model, and is therefore somewhat less flexible. In addition, they

utilize standard normal distributions on the elements of the factor matrix, eliminat-

ing the possibility of discovering factors that are relevant for only a subset of the

subjects.

7.1.1 Data description

The data in this study include ovarian cancer gene expression, copy number varia-

tion (CNV) and methylation data collected from the Cancer Genome Atlas (TCGA)

project (http://cancergenome.nih.gov/). We aim to integrate gene expression/CNVs

and gene expression/methylation from 74 ovarian cancer patients. For computational purposes, we reduced the original large datasets to smaller subsets. Independent

gene-by-gene filtering (based on criteria such as overall mean and overall variance)

is typically employed to reduce data dimension as well as increase the number of


discoveries in high-throughput experiments (Bourgon et al., 2010; Gentleman et al.,

2005; Talloen et al., 2007). In our analysis, a filtering criterion was established for the gene expression data to eliminate probes with sample mean below 6 or standard deviation below 0.4, which reduced the gene expression dataset from 22277 to 5976 probes. Comparative genomic hybridization (CGH) data was filtered

to remove Agilent Human Genome CGH 244A probes containing missing values.

This set was further filtered by keeping only one in 50 probes, leaving 4443 probes.

Methylation data (Illumina Infinium human methylation 27K bead assay) was fil-

tered to retain only higher variance samples (resulting in 4722 probes) and was

inverse-probit transformed to lie on the real line.
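The preprocessing steps above can be sketched as follows. The thresholds (mean 6, standard deviation 0.4, one-in-50 thinning) are taken from the text; the function names and the use of scipy's `norm.ppf` as the (0,1)-to-real-line quantile transform are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def filter_expression(X):
    """Keep probes (rows) with sample mean >= 6 and std >= 0.4,
    i.e., eliminate probes with mean below 6 or std below 0.4."""
    keep = (X.mean(axis=1) >= 6.0) & (X.std(axis=1) >= 0.4)
    return X[keep]

def thin_cgh(X, step=50):
    """Keep one in every `step` CGH probes (rows with missing values
    are assumed to have been dropped beforehand)."""
    return X[::step]

def probit_transform(beta_values, eps=1e-6):
    """Map methylation beta values from (0, 1) onto the real line via the
    standard normal quantile function (the transform is clipped away from
    0 and 1 for numerical safety)."""
    return norm.ppf(np.clip(beta_values, eps, 1.0 - eps))

rng = np.random.default_rng(3)
expr = rng.normal(loc=6.0, scale=1.0, size=(1000, 74))   # synthetic probes x subjects
print(filter_expression(expr).shape)
print(probit_transform(np.array([0.025, 0.5, 0.975])))
```

After these steps each modality is a real-valued probe-by-subject matrix, which is the form the joint factor model expects.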

7.1.2 Analysis of gene expression and copy number variation data

Figure 1: The inferred feature selection matrices unique to data of modality r (B(r)) and common to both data modalities (B(c)). From left to right, the binary matrices are unique to gene expression, unique to CNVs, and shared between gene expression and CNVs, respectively. The y-axis indexes the factors and the x-axis represents the 74 subjects. Factors and samples selected by the model are assigned 1 (red) and 0 (blue) otherwise. Results are shown for the maximum-likelihood collection sample (for illustration purposes).

We applied the joint Bayesian factor model to gene expression and CGH in order

to identify factors that are representative of correlated changes in gene expression

and DNA copy number variations. We set the upper bound on the number of factors

as K = 60, and obtained 1 specific to gene expression, 4 unique to CNVs and 19

shared between both modalities (Figure 1). Figure 2 shows the correlation structure


Figure 2: Correlation structure between gene expression and CNVs of top-loaded genes from factor 41. The figure displays correlation coefficients between the two data types. Panels (a) and (b) show the correlation results for patients selected and dropped by the model, respectively. CNVs from chromosome 6 and genes from chromosome 17 exhibit a reversed correlation pattern.

of the probe sets (gene expression) and CGH clones (CNA) that are included in joint

factor number 41 (the factor numbering is arbitrary, and changes between collection

samples, with these results illustrative; in these and related results we depict the

maximum likelihood collection sample). As expected, correlation between the factor

genes for those patients who were included in this factor is higher than for those

not included. In results shown in Section 7.2.2, when analyzing the econometric

data, we demonstrate how the approximate posterior distribution may be utilized

(beyond just the maximum likelihood collection sample).

It is well known that some variations in cancer gene expression are caused by

gene dosage changes due to CNVs. In addition, because of the mechanism by which

CNV occurs, it tends to happen in contiguous regions. Of the 20 CNV factors

identified, one is a nearly perfect representation of batch effects in the data and the

remaining 19 display copy number amplification/deletion in specific chromosomal

regions. Most of these show similar gene expression changes in the same region. We

demonstrate this behavior in Figure 3, which shows that the largest factor loadings

from both CNV and gene expression for factor 18 are clustered around the same

region of chromosome 8.

We identified highly associated copy number variations in the chromosomal arm

8q12.3-8q24.13 (factor 18), which is a known region for frequent high-level ampli-


Figure 3: Factor analytic relationship between CNV and gene expression. The panels show the factor loadings from the 18th factor of the joint factor model fit to CNV (Panel A) and gene expression data (Panel B), respectively. These results are for the maximum-likelihood collection sample, for ease of interpretation, although a full approximate posterior distribution is inferred. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

cation associated with disease progression in human cancers (Frank et al., 2007;

Pils et al., 2005). The rediscovery of genes in this region also validates our ap-

proach. For example, E2F5 (8q21.2, Unique ID: 1875), an important gene in the

regulation of cell cycle, is known to be overexpressed in ovarian epithelial cancer

(Kothandaraman et al., 2010). Over-expressed genes, MTDH (8q22.1, Unique ID:

92140) and EBAG9 (8q23, Unique ID: 9166), have been recognized in a variety of

cancers including ovarian and breast cancers (Akahira et al., 2004; Rennstam et al.,

2003; Emdad et al., 2007). Another gene in this region whose expression level is

known to be important in tumor biology is WWP1 (8q21, Unique ID: 11059). This

recapitulation of some of the well known features of aneuploidy in cancer suggests

that our joint model is appropriately capturing correlation structure between gene

expression and CGH data.

As described above, many factors we obtained are associated with individual

chromosomal locations, as demonstrated in Figure 3. However, there is also a sub-

set of factors (1, 14, 32, 41, 45, 57) which are representative of multiple regions.

Figure 4 shows that the largest factor loadings in CNV/gene expression for factor

41 come from both chromosomes 6 and 17.

Figure 4: Dual peaks shown in the loadings of factor 41 of the joint factor model fit to CNV (Panel A) and gene expression (Panel B) data. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

This also explains the checkerboard regions of positive and negative correlation in Figure 2. The copy

number variations from the top ranked CGH probes in the two locations are highly

negatively correlated, with copy number gain in chromosome 6 and loss in the other.

There are a number of possible mechanistic explanations for this feature. For ex-

ample, it is possible that wholesale duplication of one region is lethal to the cells

without shutting down the apoptosis pathway. Such a shut down might be accom-

plished by deletion of other regions. Previous approaches to the joint analysis of

gene expression and CNV through the use of factor models, such as Lucas et al.

(2010), have failed to find these relationships.

The proposed joint factor model provides the flexibility of discovering factors that

are relevant only for a subset of the subjects. It is interesting to note that a similar

model that enforces inclusion of all subjects in the inferred factors performed poorly compared to the proposed model, discovering far fewer factors that captured correlated changes in gene expression and copy number variation.

7.1.3 Analysis of gene expression and DNA methylation data

In this analysis, 18 common factors were inferred between methylation and gene ex-

pression. Unlike CNVs, methylation does not typically occur in contiguous regions; it is therefore not surprising that no regional peaks were detected. Methylation acts


Figure 5: SPON1 gene identified in the loadings peak from factor 5 of the joint factor model fit to DNA methylation (Panel A) and gene expression (Panel B) data. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

as an epigenetic regulator and silences tumor suppressor genes by changing chromo-

somal structures. We detected a gene, SPON1 (11p15.2, Unique ID: 10418), which appears to be predominantly regulated by methylation of its CpG site (Figure 5). Elevated expression of this gene relative to normal tissue is a known hallmark of ovarian cancer (Pyle-Chenault et al., 2005); however, the mechanism of this overexpression was previously unknown. SPON1 encodes the VSGP/F-spondin protein, which promotes proliferation in vascular smooth muscle cells during ovarian folliculogenesis and has been identified as a potential diagnostic marker or therapeutic target for ovarian carcinoma (Pyle-Chenault et al., 2005; Miyamoto et al., 2001).

In contrast to the almost single gene precision of factor 5, factor 24 shows strong

correlation between methylation and gene expression in many different loci across

the entire genome (Figure 6). The list of CpG sites heavily loaded on this factor is displayed in Table 1. Pathway analysis on these candidate genes reveals that many are involved in DNA binding and regulation of transcription. The correlation of methylation levels at all of these sites, combined with their correlated gene expression levels, suggests that they are all targets of a single methylation program; however, the existence of coordinated methylation enzymes that target these locations is unconfirmed.


Figure 6: Loadings from factor 24 with strong correlations between methylation (Panel A) and gene expression (Panel B) at many different loci. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

We implemented the joint factor model for analysis of multiple genomic data in

non-optimized Matlab on a quad-core PC with a 2.2 GHz CPU and 4 GB RAM. The

average time per iteration of the Gibbs sampler for the results in Section 7.1.2 is 72

seconds and for the results in Section 7.1.3 is 55 seconds.

7.2 Joint analysis of space-time-varying data

7.2.1 Motorcycle Data

Among the primary contributions in this paper is the introduction of a new KSBP-

GP mixture prior for modeling non-stationary data, as well as its integration in the

factor model. To clearly illustrate the performance of the KSBP-GP mixture model,

we first isolate it from the factor model and apply it separately on the classic mo-

torcycle data (Silverman, 1985), which has been used frequently in recent literature

to demonstrate the success of non-stationary models (Rasmussen and Ghahramani,

2002; Meeds and Osindero, 2006). The motorcycle data, represented as xm ∈ R^(94×1),

constitutes measurements of the acceleration of the head of a motorcycle rider in

the first moments after impact (over the first 60 milliseconds). We model xm via a

KSBP-GP mixture, and therefore components of xm are clustered via KSBP, using


HOXA13 SOCS2 SLC38A4 SPAG6 EPHX3 BST2 PITX2 FERD3L

GPR133 FCRL3 F10 BCAN ALX4 CDIPT CPT1C BCAP31

SOX1 ZNF385D IGF2AS ADCY4 EOMES GATA4 CABP5 PAX7

FLRT1 LEP TRPA1 HOXD10 PLEKHB1 GPR142 STK19 EVX1

SLC4A11 ZDHHC11 ZNF750 NXN AJAP1 VSX2 TRH FOXI1

RAC3 RENBP MYO3A GATA4 GRIK3 CARD14 APCDD1L CA3

PCDHAC1 BCAP31 SNCAIP CYP4F22 FCN1 SSNA1 GBP4 CASQ1

ARHGAP4 KLHL6 CEACAM3 CEBPG ABCB4 LYZL4 TRHDE CDX2

SCML1 PTHLH KLF11 SLC22A18 DENND2D C2orf43 PI3 ESX1

CBLN4 MAGEB6 AIM2 ZDHHC8P HEPACAM2 A2BP1 TERC C3

Table 1: Genes showing a significantly differential methylation pattern in their CpG sites. The list is generated from the candidates displayed in Figure 6A.

time as a covariate; all components of xm within the same KSBP-defined cluster

are drawn from an associated GP.

As shown in Figure 7 (left), the data roughly shows three regions: a low flat

noise region, a curved region and a flat high noise region. Typically as shown in

Figure 8(b), the model infers three to four dominant clusters (GPs); by “dominant”

we mean that approximately 94% of the data points are in these clusters. As shown

in Figure 7 (right), the KSBP-GP identifies the three regions of the data (time

intervals 0-10 ms, 10-40 ms and 40-60 ms), as distinct clusters, drawn from GPs

with different kernel parameters. Figure 8(a) shows 100 samples drawn from the

posterior (predictive distribution) of the KSBP-GP function evaluated (using GP

regression) at intervals of 0.5 ms. The choice of 100 Gibbs samples is to facilitate

comparison to earlier literature (Rasmussen and Ghahramani, 2002; Meeds and

Osindero, 2006), which provide results based on a similar number of Gibbs samples.

The inferred levels of uncertainty are in close agreement with those reported in

(Rasmussen and Ghahramani, 2002; Meeds and Osindero, 2006), but arguably, the

KSBP-GP captures the varying levels of uncertainty over the time interval 30 ms to

60 ms slightly better than (Rasmussen and Ghahramani, 2002; Meeds and Osindero,

2006), where they are flatter.
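Drawing posterior samples of a GP function on a fine grid, as done here at 0.5 ms intervals, follows the standard GP regression equations. The sketch below is a generic version with a squared-exponential kernel and synthetic data; the kernel choice, hyperparameters, and data are illustrative assumptions, not the fitted KSBP-GP:

```python
import numpy as np

def se_kernel(a, b, bandwidth=5.0, scale=1.0):
    """Squared-exponential covariance between time grids a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return scale * np.exp(-d2 / (2.0 * bandwidth**2))

def gp_posterior_samples(t_train, y_train, t_grid, noise_var=0.1,
                         n_samples=100, rng=None):
    """Sample functions from the GP posterior predictive on t_grid."""
    rng = rng or np.random.default_rng(0)
    K = se_kernel(t_train, t_train) + noise_var * np.eye(t_train.size)
    Ks = se_kernel(t_grid, t_train)
    Kss = se_kernel(t_grid, t_grid)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(t_grid.size))  # jitter for stability
    return mean + (L @ rng.standard_normal((t_grid.size, n_samples))).T

rng = np.random.default_rng(4)
t = np.linspace(0, 60, 40)                  # synthetic observation times (ms)
y = np.sin(t / 8.0) + 0.1 * rng.standard_normal(40)
grid = np.arange(0, 60.5, 0.5)              # 0.5 ms prediction grid
samples = gp_posterior_samples(t, y, grid, rng=rng)
print(samples.shape)   # (100, 121): 100 posterior draws on the 121-point grid
```

In the KSBP-GP mixture, a step of this form is applied per cluster with that cluster's own inferred kernel parameters, which is what produces the region-dependent uncertainty bands.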

In Figure 8(c) we also show the frequency of the number of inferred clusters com-

puted over 1000 Gibbs collection samples. We observe that the KSBP-GP mixture

model primarily infers four to nine clusters to model the data. However as seen

from Figure 8(b), approximately 94% of the data points are primarily associated

with 3 to 4 dominant clusters. This is in close agreement with the approximately


Figure 7: Motorcycle data (left) and typical inferred clusters for the maximum-likelihood Gibbs collection sample (right). Different colors correspond to different KSBP clusters (clustering shown for the most-likely collection sample).

three non-stationary regions as seen in Figure 7 (left) and is significantly better

than (Rasmussen and Ghahramani, 2002), which reported that the posterior distri-

bution uses between 3 and 15 experts to fit the data, with a low probability of

using up to 31 experts. The proposed KSBP-GP model has additional advantages over the gating model of (Rasmussen and Ghahramani, 2002). That gating model is purely conditional (Meeds and Osindero (2006) extended it to a fully generative model and reported similar results). Further, the gating parameter in

(Rasmussen and Ghahramani, 2002) is not conjugate to its prior and has to be sam-

pled via Metropolis-Hastings. The proposed KSBP-GP model is a fully generative

model and the KSBP kernel parameters φ (equivalent to the gating parameter in

(Rasmussen and Ghahramani, 2002)) as well as the anchor points denoted by r∗j are

conjugate to their priors and may be efficiently sampled by Gibbs sampling. Details

regarding sampling of φ and r∗j are provided in Appendix B.

7.2.2 Unemployment data across United States

In this section we provide results for the proposed KSBP-GP factor model on unem-

ployment data across the United States (we first consider space-time data from a single


Figure 8: (a) 100 samples from the posterior for interpolation at 0.5 ms intervals, (b) frequency of the number of dominant clusters over 1000 Gibbs collection samples, (c) frequency of the number of inferred clusters over 1000 Gibbs collection samples.


data modality, and then consider multiple modalities below). The data contain unemployment rates of 187 metro cities across the United States, sampled monthly,

over the period 1991 to 2009. We first show typical clustering results obtained via

the KSBP-GP mixture model. We have set the truncation level for the KSBP to

J = 50 and typically about 15 clusters are inferred. In Figure 9(a)-(d), we show the

probabilities of the cities being associated with 4 dominant inferred clusters. Also,

in our model, we infer the bandwidth of the GP associated with each cluster. For

example, the inferred bandwidth associated with the GP shown in Figure 9(a) is

18.35 whereas the inferred bandwidth for the cluster shown in Figure 9(b) is 0.4028.

It is interesting to observe that the inferred bandwidth associated with the north-

eastern cities is much smaller compared to the midwestern cities. It is intuitively

pleasing to note that the model infers short-range correlation patterns among the more populated northeastern states, whereas it infers longer-range correlation patterns among the sparsely populated midwestern states.
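The link between the inferred bandwidth and the spatial range of correlation can be read directly off the GP kernel. Assuming a squared-exponential form (the paper's kernel is not restated here, so this form and the distance value are illustrative), the two inferred bandwidths above behave as follows:

```python
import numpy as np

def se_corr(dist, bandwidth):
    """Squared-exponential correlation at a given spatial distance."""
    return np.exp(-dist**2 / (2.0 * bandwidth**2))

# At a fixed hypothetical inter-city distance, the large bandwidth inferred
# for the midwestern cluster (~18.35) keeps cities strongly correlated,
# while the small bandwidth of the northeastern cluster (~0.4028) makes
# the correlation die off almost immediately.
d = 5.0
print(se_corr(d, 18.35))    # long-range: still highly correlated
print(se_corr(d, 0.4028))   # short-range: essentially zero
```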

In Table 2, we provide comparative interpolation results for our proposed model

and two other simpler models. In our experiment, we divide the unemployment

data across US (consisting of unemployment rates of 187 cities), into two parts;

87 cities (approximately 46% of the total number of cities) are drawn uniformly

at random once, and the unemployment rates of these cities are used for model

learning. The unemployment rates of the remaining 100 cities are assumed to be

unknown and are interpolated based on the learned model (using Gaussian process

regression (Rasmussen and Williams, 2005; Lopes et al., 2008) methods) and the

results are provided in Table 2. The results demonstrate that the KSBP-GP is a

better model for the spatially varying US unemployment data.
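The interpolation experiment (learn on 87 cities, predict the 100 held-out cities, score by MSE) follows standard GP regression. A generic sketch on synthetic coordinates and a synthetic smooth field (the data, kernel form, and hyperparameters are all hypothetical stand-ins for the fitted model):

```python
import numpy as np

def gp_interpolate(X_obs, y_obs, X_miss, bandwidth=2.0, noise_var=0.05):
    """Posterior mean of GP regression at unobserved locations, with a
    squared-exponential kernel (bandwidth and noise level hypothetical)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth**2))
    K = k(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    return k(X_miss, X_obs) @ np.linalg.solve(K, y_obs)

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(187, 2))            # synthetic city coordinates
f = np.sin(X[:, 0]) + 0.3 * X[:, 1]              # smooth synthetic field
obs = rng.choice(187, size=87, replace=False)    # 87 cities used for learning
miss = np.setdiff1d(np.arange(187), obs)         # 100 held-out cities
y_bar = f[obs].mean()                            # center before regression
y_hat = y_bar + gp_interpolate(X[obs], f[obs] - y_bar, X[miss])
mse = np.mean((y_hat - f[miss]) ** 2)
print(mse)   # far below the variance of the held-out field
```

In the paper's setting, this regression step is applied per factor loading dimension under the learned KSBP-GP model rather than directly to the raw rates.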

Table 2: Row 1: Average MSE of reconstruction of the unemployment rates (in units of percent) for the 87 cities across the US used for model learning. Row 2: Average MSE of interpolation of the unemployment rates for the 100 held-out cities. The results are posterior means computed over 500 Gibbs collection samples. B1 refers to a factor model where the factor loadings are drawn from a Dirichlet process (Sethuraman, 1994) mixture of GPs, and B2 refers to a factor model where the factor loadings are drawn from a single GP.

                               KSBP-GP    B1        B2
Reconstruction (87 cities)     0.0802     0.0700    0.5366
Interpolation (100 cities)     1.3288     1.4329    1.8046


Figure 9: (a)-(d) Four dominant inferred clusters obtained via the KSBP-GP mixture model. Darker shades indicate higher probability of a city being associated with a specific cluster. Note that the stick weights of the KSBP capture the probability of association of a city with a particular cluster.


7.2.3 Joint Modeling: Michigan/United States plus S & P 500

To demonstrate the performance of the proposed joint KSBP-GP factor model, in

this paper we consider the integration of three different data types: Unemployment

rates of 83 counties in Michigan (modality r = 1), unemployment rates of 187

metro cities across the United States (modality r = 2) and the stock prices of 500

companies in the S & P 500 (modality r = 3). For the Michigan data, the columns of D(1) are drawn from a GP as in (8) (we assume spatial homogeneity across Michigan), and for the US data the columns of D(2) are drawn from the KSBP-GP (we assume spatial heterogeneity across the US). For the stock price data, we have no spatial information, and the components of D(3) are drawn i.i.d. from a Gaussian as in (6).

Figure 10: Typical interpolation result for the GPFA and joint GPFA models with 79% missing data, for a single missing county (Delta County) in Michigan (results averaged over 500 collection samples). The curves show the true unemployment rates, the interpolation by the joint GPFA (Michigan + US + S & P 500), and the interpolation by GPFA using only Michigan data.

Sampling of unemployment rates on a fine spatial scale is a difficult task. Hence,

we perform the following experiment to demonstrate our proposed joint GPFA

model. We assume that we have coarsely sampled unemployment data across Michi-

gan (unemployment rates of 18 counties, i.e., approximately 21% of the total coun-

ties, drawn uniformly at random once) and across the US (unemployment rates of

40 cities, i.e., 21% of the total cities, drawn uniformly at random once), with data


Figure 11: Typical inferred latent feature unique to Michigan (an inferred row of S(r), where r = Michigan) and common to all data modalities (one inferred row of S(c)), for the maximum-likelihood Gibbs collection sample. A recession period is marked.

sampled monthly from these sites over the period 1991-2009. However, we have the

stock prices of all the 500 companies listed in the S & P 500, sampled monthly over

the same time period. Next, using GP regression (Rasmussen and Williams, 2005;

Lopes et al., 2008), we obtain the posterior distributions of the unemployment rates

at the remaining 79% counties of Michigan conditioned on (a) 21% observed data in

Michigan and (b) 21% observed data in Michigan and the US plus the stock prices

of all companies in the S & P 500. For these two scenarios, we compute the posterior

mean of the unemployment rates at the missing counties, using 500 Gibbs collection

samples. The MSE of interpolation of the unemployment rates (which are in units

of percent) for a typical example missing county in Michigan for the two cases are

1.2447 (using only undersampled Michigan data) and 0.4922 (using Michigan + US

+ S & P data). We also show the time series of the interpolation for the missing

county in Figure 10. The results clearly demonstrate that jointly analyzing the

Michigan, the United States and the S & P data provides significant improvement

in the MSE of interpolation for the missing counties in Michigan, relative to interpo-

lating based only on observed Michigan data. In Figure 11 we show typical inferred

factor scores (latent features) unique to Michigan and common to all data modali-


ties. It is interesting to observe that the latent features unique to Michigan capture

more microscopic and data specific behavior. For example, the latent feature in

Figure 11, specific to Michigan, captures the well known phenomenon of seasonal

variation in unemployment rates. On the other hand, latent features shared across

all data modalities, capture more macroscopic and global bahavior pertaining to the

general economy. For example, the shared latent feature in Figure 11 captures the

US economic cycle over the last two decades.
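The GP interpolation step described above can be sketched with the standard GP regression formulas (illustrative Python, not the authors' implementation; the synthetic seasonal series, kernel parameters, and noise level are invented for the example):

```python
import numpy as np

def sq_exp_kernel(t1, t2, tau=1.0, beta=0.1, sigma2=1e-4):
    """Squared-exponential kernel with a small nugget on the diagonal."""
    d2 = (t1[:, None] - t2[None, :]) ** 2
    K = tau * np.exp(-beta * d2)
    if t1.shape == t2.shape and np.allclose(t1, t2):
        K += sigma2 * np.eye(len(t1))
    return K

def gp_posterior(t_obs, y_obs, t_miss, noise=0.01):
    """Posterior mean/variance of a zero-mean GP at t_miss given (t_obs, y_obs)."""
    K_oo = sq_exp_kernel(t_obs, t_obs) + noise * np.eye(len(t_obs))
    K_mo = sq_exp_kernel(t_miss, t_obs)
    K_mm = sq_exp_kernel(t_miss, t_miss)
    mean = K_mo @ np.linalg.solve(K_oo, y_obs)
    cov = K_mm - K_mo @ np.linalg.solve(K_oo, K_mo.T)
    return mean, np.diag(cov)

# Synthetic seasonal "unemployment-like" series over 24 months.
rng = np.random.default_rng(0)
t = np.arange(0, 24, dtype=float)
y = np.sin(2 * np.pi * t / 12) + 0.05 * rng.standard_normal(24)
obs = np.arange(0, 24, 2)         # observe every other month ("undersampled")
miss = np.arange(1, 24, 2)        # interpolate the remaining months
mean, var = gp_posterior(t[obs], y[obs], t[miss])
mse = np.mean((mean - y[miss]) ** 2)
```

In the paper the same machinery conditions on spatially correlated counties and on the shared latent features; this sketch only shows the univariate conditioning step.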

We implemented the KSBP-GP factor model for the analysis of econometric data in non-optimized Matlab on a quad-core PC with a 2.2 GHz CPU and 4 GB RAM. The average time per iteration is 1 minute for the analysis in Section 7.2.2 and 1.5 minutes for the analysis in Section 7.2.3. Note that these results are for the most naive implementation of a GP, and the computational times may be significantly improved by adopting efficient approximations for GP regression, as provided in (Rasmussen and Williams, 2005; Candela et al., 2007).

8 Conclusions

A joint factor analysis method is introduced for modeling multiple disparate but statistically related datasets. The proposed approach was first demonstrated on the joint analysis of heterogeneous genomic data related to ovarian cancer. The proposed model uncovered key drivers of cancer, some of which have been previously reported in the literature, as well as some potentially new genomic causes of cancer. Next, the proposed approach was extended to a Gaussian process based factor analysis approach to integrate space-time varying data. The approach can readily handle spatial inhomogeneity and is also applicable to large, realistic datasets; this is achieved via the introduction of a novel KSBP-based mixture-of-GPs prior. Theoretical properties of the KSBP-GP factor model are also derived and discussed. The joint KSBP-GP factor analysis model is shown to be particularly effective at improving model learning for undersampled data with the aid of other available correlated data. Further, the joint model produced interpretable low-dimensional features (shared as well as data-specific).

In this paper we have focused on integrating multiple heterogeneous but statistically correlated datasets via a joint factor analysis approach in which the latent space is factorized into shared and data-specific components. Moreover, data-specific linear mappings from the latent space to the observation spaces were obtained via joint analysis of all data modalities. However, for certain applications, the assumption that the data lie in or close to a low-dimensional subspace is restrictive, and a better assumption is that the data lie on some low-dimensional manifold. In the future we wish to relax the linearity assumption of our joint factor model via a mixture of factor analyzers (MFA) approach.

References

Akahira, J.-I., Aoki, M., Suzuki, T., Moriya, T., Niikura, H., Ito, K., Inoue, S., Okamura, K., Sasano, H., and Yaegashi, N. (2004). "Expression of EBAG9/RCAS1 is associated with advanced disease in human epithelial ovarian cancer." Br J Cancer, 90, 11, 2197–2202.

Akaike, H. (1987). "Factor analysis and AIC." Psychometrika, 52, 317–332.

Akavia, U. D., Litvin, O., Kim, J., Sanchez-Garcia, F., Kotliar, D., Causton, H. C., Pochanard, P., Mozes, E., Garraway, L., and Pe'er, D. (2010). "An Integrated Approach to Uncover Drivers of Cancer." Cell, 143, 6, 1005–1017.

Archambeau, C. and Bach, F. (2008). "Sparse probabilistic projections." In Neural Information Processing Systems.

Bach, F. and Jordan, M. I. (2005). "A probabilistic interpretation of canonical correlation analysis." Tech. rep.

Borga, M. (1998). "Learning Multidimensional Signal Processing." Linköping Studies in Science and Technology, Dissertations No. 531, Linköping University, Sweden.

Bourgon, R., Gentleman, R., and Huber, W. (2010). "Independent filtering increases detection power for high-throughput experiments." Proceedings of the National Academy of Sciences, 107, 21, 9546–9551.

Candela, J. Q., Rasmussen, C. E., and Williams, C. K. I. (2007). "Approximation Methods for Gaussian Process Regression." Tech. rep., Microsoft Research.

Carvalho, C., Chang, J., Lucas, J., Nevins, J., Wang, Q., and West, M. (2008). "High-Dimensional Sparse Factor Modelling: Applications in Gene Expression Genomics." Journal of the American Statistical Association, 103, 484, 1438–1456.

Casella, G. and Berger, R. (2001). Statistical Inference. Duxbury Resource Center.

Chen, M., Zaas, A., Woods, C., Ginsburg, G., Lucas, J., Dunson, D., and Carin, L. (2011). "Predicting Viral Infection From High-Dimensional Biomarker Trajectories." Journal of the American Statistical Association, 1–21.

Duan, J., Guindani, M., and Gelfand, A. (2007). "Generalized spatial Dirichlet process models." Biometrika, 94, 809–825.

Dunson, D. and Park, J.-H. (2008). "Kernel stick-breaking processes." Biometrika, 95, 2, 307–323.

Emdad, L., Sarkar, D., Su, Z.-Z., Lee, S.-G., Kang, D.-C., Bruce, J., Volsky, D., and Fisher, P. (2007). "Astrocyte elevated gene-1: Recent insights into a novel gene involved in tumor progression, metastasis and neurodegeneration." Pharmacology & Therapeutics, 114, 2, 155–170.

Frank, B., Bermejo, J. L., Hemminki, K., Sutter, C., Wappenschmidt, B., Meindl, A., Kiechle-Bahat, M., Bugert, P., Schmutzler, R., Bartram, C. R., and Burwinkel, B. (2007). "Copy number variant in the candidate tumor suppressor gene MTUS1 and familial breast cancer risk." Carcinogenesis, 28, 7, 1442–1445.

Gelfand, A., Kottas, A., and MacEachern, S. (2005). "Bayesian nonparametric spatial modeling with Dirichlet process mixing." J. Am. Stat. Ass., 100, 1021–1035.

Gentleman, R., Carey, V., Huber, W., Irizarry, R., and Dudoit, S. (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Gramacy, R. and Lee, H. (2007). "Bayesian Treed Gaussian Process Models with an Application to Computer Modeling." Journal of the American Statistical Association.

Griffiths, T. and Ghahramani, Z. (2005). "Infinite latent feature models and the Indian buffet process." In Neural Information Processing Systems, 475–482.

Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004). "Canonical Correlation Analysis: An Overview with Application to Learning Methods." Neural Computation, 16, 12, 2639–2664.

Hotelling, H. (1936). "Relations Between Two Sets of Variates." Biometrika, 28, 3/4, 321–377.

Ishwaran, H. and Rao, J. S. (2005). "Spike and slab variable selection: Frequentist and Bayesian strategies." Annals of Statistics, 33, 730–773.

Jeong, J., Li, L., Liu, Y., Nephew, K., Huang, T., and Shen, C. (2010). "An empirical Bayes model for gene expression and methylation profiles in antiestrogen resistant breast cancer." BMC Medical Genomics, 3, 1, 55.

Kendziorski, C. M., Chen, M., Yuan, M., Lan, H., and Attie, A. D. (2006). "Statistical Methods for Expression Quantitative Trait Loci (eQTL) Mapping." Biometrics, 62, 1, 19–27.

Klami, A. and Kaski, S. (2008). "Probabilistic approach to detecting dependencies between data sets." Neurocomputing, 72, 1-3, 39–46.

Kothandaraman, N., Bajic, V., Brendan, P., Huak, C., Keow, P., Razvi, K., Salto-Tellez, M., and Choolani, M. (2010). "E2F5 status significantly improves malignancy diagnosis of epithelial ovarian cancer." BMC Cancer, 10, 1, 64.

Lanckriet, G. G., De Bie, T., Cristianini, N., Jordan, M., and Noble, W. S. (2004). "A statistical framework for genomic data fusion." Bioinformatics, 20, 16, 2626–2635.

Lopes, H. F., Gamerman, D., and Salazar, E. (2011). "Generalized spatial dynamic factor models." Comput. Stat. Data Anal., 55, 1319–1330.

Lopes, H. F., Salazar, E., and Gamerman, D. (2008). "Spatial Dynamic Factor Analysis." Bayesian Analysis, 3, 4, 759–792.

Louhimo, R. and Hautaniemi, S. (2011). "CNAmet: an R package for integrating copy number, methylation and expression data." Bioinformatics, 27, 6, 887–888.

Lucas, J. E., Kung, H. N., and Chi, J. T. A. (2010). "Latent Factor Analysis to Discover Pathway-Associated Putative Segmental Aneuploidies in Human Cancers." PLoS Computational Biology, 6, 9.

Luttinen, J. and Ilin, A. (2009). "Variational Gaussian-process factor analysis for modeling spatio-temporal data." In Neural Information Processing Systems, 1177–1185.

Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. (2008). "Supervised Dictionary Learning." In Neural Information Processing Systems, 1033–1040.

Meeds, E. and Osindero, S. (2006). "An alternative infinite mixture of Gaussian process experts." In Neural Information Processing Systems, 883–890.

Miyamoto, K., Morishita, Y., Yamazaki, M., Minamino, N., Kangawa, K., Matsuo, H., Mizutani, T., Yamada, K., and Minegishi, T. (2001). "Isolation and Characterization of Vascular Smooth Muscle Cell Growth Promoting Factor from Bovine Ovarian Follicular Fluid and Its cDNA Cloning from Bovine and Human Ovary." Archives of Biochemistry and Biophysics, 390, 1, 93–100.

Müller, P., Erkanli, A., and West, M. (1996). "Bayesian curve fitting using multivariate normal mixtures." Biometrika, 83, 1, 67–79.

Paisley, J. and Carin, L. (2009). "Nonparametric factor analysis with beta process priors." In Proceedings of the 26th International Conference on Machine Learning, 777–784.

Pils, D., Horak, P., Gleiss, A., Sax, C., Fabjani, G., Moebus, V., Zielinski, C., Reinthaller, A., Zeillinger, R., and Krainer, M. (2005). "Five genes from chromosomal band 8p22 are significantly down-regulated in ovarian carcinoma." Cancer, 104, 11, 2417–2429.

Pyle-Chenault, R., Stolk, J., Molesh, D., Boyle-Harlan, D., McNeill, P., Repasky, E., Jiang, Z., Fanger, G., and Xu, J. (2005). "VSGP/F-spondin: a new ovarian cancer marker." Tumor Biol.

Rai, P. and Daume, H. (2009). "Multi-Label Prediction via Sparse Infinite CCA." In Advances in Neural Information Processing Systems 22, eds. Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1518–1526.

Rasmussen, C. and Ghahramani, Z. (2002). "Infinite mixtures of Gaussian process experts." In Neural Information Processing Systems, 881–888.

Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.

Rennstam, K., Ahlstedt-Soini, M., Baldetorp, B., Bendahl, P.-O., Borg, A., Karhu, R., Tanner, M., Tirkkonen, M., and Isola, J. (2003). "Patterns of Chromosomal Imbalances Defines Subgroups of Breast Cancer with Distinct Clinical Features and Prognosis. A Study of 305 Tumors by Comparative Genomic Hybridization." Cancer Research, 63, 24, 8861–8868.

Schmidt, M. N. (2009). "Function Factorization using Warped Gaussian Processes." In Proceedings of the 26th International Conference on Machine Learning, 921–928. Montreal.

Schmidt, M. N. and Laurberg, H. (2008). "Non-negative matrix factorization with Gaussian process priors." Computational Intelligence and Neuroscience.

Schwarz, G. (1978). "Estimating the dimension of a model." The Annals of Statistics, 6, 461–464.

Scott-Boyer, M., Imholte, G., Tayeb, A., Labbe, A., Deschepper, C., and Gottardo, R. (2012). "An Integrated Hierarchical Bayesian Model for Multivariate eQTL Mapping." Stat. Appl. Genetics Molecular Biology, 11.

Sethuraman, J. (1994). "A constructive definition of Dirichlet priors." Statistica Sinica, 4, 639–650.

Shen, R., Olshen, A. B., and Ladanyi, M. (2009). "Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis." Bioinformatics, 25, 22, 2906–2912.

Shi, J. Q., Murray-Smith, R., and Titterington, M. (2003). "Bayesian Regression and Classification Using Mixtures of Gaussian Processes." International Journal of Adaptive Control and Signal Processing, 149–161.

Silverman, B. W. (1985). "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting." Journal of the Royal Statistical Society, Series B (Methodological), 47, 1, 1–52.

Talloen, W., Clevert, D.-A., Hochreiter, S., Amaratunga, D., Bijnens, L., Kass, S., and Gohlmann, H. (2007). "I/NI-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data." Bioinformatics, 23, 21, 2897–2902.

Thibaux, R. and Jordan, M. I. (2007). "Hierarchical beta processes and the Indian buffet process." In Proceedings of the 11th Conference on Artificial Intelligence and Statistics.

Tipping, M. E. (2001). "Sparse Bayesian learning and the relevance vector machine." J. Mach. Learn. Res., 1, 211–244.

Tresp, V. (2001). "Mixtures of Gaussian processes." In Neural Information Processing Systems, 654–660.

Wang, C. (2007). "Variational Bayesian Approach to Canonical Correlation Analysis." IEEE Transactions on Neural Networks, 18, 3, 905–910.


Appendix A.

In this appendix we provide the proofs of the correlation properties stated in Section 5.

Proof of conditional spatial covariance

Let $\Theta_k = \{r^*_{kl}, \phi_{kl}, V_{kl}\}_{l=1}^{\infty}$ represent the parameter set of the KSBP and $\Omega_k = \{\tau^{(s)}_{kl}, \beta^{(s)}_{kl}, \sigma^{(s)}_{kl}\}_{l=1}^{\infty}$ represent the parameter set for the GPs corresponding to the $k$th factor loading. The conditional spatial covariance is

$$
\begin{aligned}
\mathrm{Cov}(d_{ik}, d_{i'k} \mid \Theta_k, \Omega_k)
&= \sum_{l=1}^{\infty} \Sigma_{k,l}(i,i')\, p(z_k(i) = z_k(i') = l \mid \Theta_k) \\
&= \sum_{l=1}^{\infty} \Sigma_{k,l}(i,i')\, w_{ikl} w_{i'kl} \\
&= \langle \boldsymbol{\psi}_{ik}, \boldsymbol{\psi}_{i'k} \rangle
\end{aligned}
\tag{19}
$$

where $\boldsymbol{\psi}_{ik} = [\sigma_{k1} w_{ik1}, \sigma_{k2} w_{ik2}, \cdots]^T$, $\boldsymbol{\psi}_{i'k} = [\sigma_{k1} w_{i'k1}, \sigma_{k2} w_{i'k2}, \cdots]^T$, $\sigma_{kl} = \sqrt{\Sigma_{k,l}(i,i')}$, and $w_{ikl} = V_{kl} K(r_i, r^*_{kl}; \phi_{kl}) \prod_{j=1}^{l-1}\big[1 - V_{kj} K(r_i, r^*_{kj}; \phi_{kj})\big]$.

Proof of marginal spatial covariance

The marginal spatial covariance is

$$
\mathrm{Cov}(d_{ik}, d_{i'k}) = \sum_{l=1}^{\infty} \int_{\Theta,\Omega} \Sigma_{k,l}(i,i')\, p(z(i) = z(i') = l \mid \Theta)\, P(d\Theta)\, P(d\Omega)
\tag{20}
$$

As it is difficult to obtain a closed-form expression for (20), we make the following simplifying assumptions. If all the GPs share the same kernel parameters, i.e., $\Sigma_{k,l}(i,i') = \Sigma_k(i,i')$, then the marginal spatial covariance is

$$
\begin{aligned}
\mathrm{Cov}(d_{ik}, d_{i'k})
&= E[\Sigma_k(i,i')] \sum_{l=1}^{\infty} \int_{\Theta} p(z(i) = z(i') = l \mid \Theta)\, P(d\Theta) \\
&= E[\Sigma_k(i,i')] \sum_{l=1}^{\infty} E\Big\{ V_l^2 K(r_i, r^*_l; \phi^*_l) K(r_{i'}, r^*_l; \phi^*_l) \prod_{\ell=1}^{l-1} \big[1 - V_\ell K(r_i, r^*_\ell; \phi^*_\ell)\big]\big[1 - V_\ell K(r_{i'}, r^*_\ell; \phi^*_\ell)\big] \Big\} \\
&= E[\Sigma_k(i,i')] \sum_{l=1}^{\infty} E\big[V^2 K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*)\big] \Big( E\big[(1 - V K(r_i, r^*; \phi^*))(1 - V K(r_{i'}, r^*; \phi^*))\big] \Big)^{l-1} \\
&= E[\Sigma_k(i,i')]\, \frac{E\big[V^2 K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*)\big]}{1 - E\big[(1 - V K(r_i, r^*; \phi^*))(1 - V K(r_{i'}, r^*; \phi^*))\big]} \\
&= \frac{E[\Sigma_k(i,i')]}{\dfrac{E[V]}{E[V^2]} \cdot \dfrac{E[K(r_i, r^*; \phi^*)] + E[K(r_{i'}, r^*; \phi^*)]}{E[K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*)]} - 1}
\end{aligned}
\tag{21}
$$

where $E$ denotes the expectation operator. Since $\Theta = \{r^*_l, \phi^*_l, V_l\}_{l=1}^{\infty}$ are drawn i.i.d., we drop the subscript $l$ from the third line onwards of (21). For $V \sim \mathrm{Beta}(1,\gamma)$, we have

$$
\frac{E[V]}{E[V^2]} = \frac{2+\gamma}{2}
\tag{22}
$$

To obtain a simplified expression for $\frac{E[K(r_i, r^*; \phi^*)] + E[K(r_{i'}, r^*; \phi^*)]}{E[K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*)]}$, we make another simplifying assumption: instead of a Gaussian kernel we assume a rectangular kernel for the KSBP, given by $K(r, r^*; \phi^*) = 1$ for $\|r - r^*\|_2 \le \Delta$ and $K(r, r^*; \phi^*) = 0$ for $\|r - r^*\|_2 > \Delta$; we also assume that $r^*$ is drawn uniformly, with density function $P_{r^*}(r^*) = \frac{1}{S}$, where $S$ denotes the area of the entire support of $r^*$. Hence we have

$$
E[K(r_i, r^*; \phi^*)] = E[K(r_{i'}, r^*; \phi^*)] = \frac{1}{S} S_{\mathrm{circle}} = \frac{1}{S} \pi \Delta^2
\tag{23}
$$

$$
E[K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*)] = \frac{1}{S} S_{\cap}
\tag{24}
$$

where $S_{\cap}$ denotes the area of the intersection of two circles with centers at $r_i$ and $r_{i'}$ and with identical radius $\Delta$. Hence

$$
\begin{aligned}
S_{\cap} &= 2\,(S_{\mathrm{sector}} - S_{\mathrm{triangle}}) \\
&= 2\left[ \frac{2\arcsin\!\Big(\tfrac{\sqrt{\Delta^2 - \|\frac{r_i - r_{i'}}{2}\|_2^2}}{\Delta}\Big)}{2\pi}\, \pi \Delta^2 \;-\; \Big\|\frac{r_i - r_{i'}}{2}\Big\|_2 \sqrt{\Delta^2 - \Big\|\frac{r_i - r_{i'}}{2}\Big\|_2^2}\, \right]
\end{aligned}
\tag{25}
$$

Hence we have

$$
\frac{E[K(r_i, r^*; \phi^*)] + E[K(r_{i'}, r^*; \phi^*)]}{E[K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*)]}
= \frac{1}{\frac{1}{\pi}\left[ \arcsin\!\Big(\tfrac{\sqrt{\Delta^2 - \|\frac{r_i - r_{i'}}{2}\|_2^2}}{\Delta}\Big) - \frac{\|\frac{r_i - r_{i'}}{2}\|_2 \sqrt{\Delta^2 - \|\frac{r_i - r_{i'}}{2}\|_2^2}}{\Delta^2} \right]}
\tag{26}
$$

Substituting (22) and (26) in (21), we have

$$
\mathrm{Cov}(d_{ik}, d_{i'k})
= \frac{E[\Sigma_k(i,i')]}{\dfrac{(2+\gamma)\,\pi}{2\left[ \arcsin\!\Big(\tfrac{\sqrt{\Delta^2 - \|\frac{r_i - r_{i'}}{2}\|_2^2}}{\Delta}\Big) - \frac{\|\frac{r_i - r_{i'}}{2}\|_2 \sqrt{\Delta^2 - \|\frac{r_i - r_{i'}}{2}\|_2^2}}{\Delta^2} \right]} - 1}
\tag{27}
$$
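Equations (23)–(25) can be checked by Monte Carlo under the stated rectangular-kernel and uniform-$r^*$ assumptions (illustrative Python; the support size, radius $\Delta$, and the two locations are arbitrary):

```python
import numpy as np

Delta = 1.0
ri = np.array([4.0, 5.0])
rip = np.array([4.8, 5.0])    # centers 0.8 apart, so the circles overlap
side = 10.0                   # support is a [0, 10]^2 square
S = side * side               # area of the support of r*

rng = np.random.default_rng(2)
rstar = rng.uniform(0.0, side, size=(400000, 2))  # r* drawn uniformly over the support
K_i = (np.linalg.norm(rstar - ri, axis=1) <= Delta).astype(float)   # rectangular kernel at r_i
K_ip = (np.linalg.norm(rstar - rip, axis=1) <= Delta).astype(float)  # rectangular kernel at r_i'

# Closed-form lens area from (25), with h = ||(r_i - r_i')/2||_2
h = np.linalg.norm((ri - rip) / 2.0)
S_cap = 2.0 * (np.arcsin(np.sqrt(Delta**2 - h**2) / Delta) * Delta**2
               - h * np.sqrt(Delta**2 - h**2))

mc_EK = K_i.mean()            # should approach pi * Delta^2 / S   (23)
mc_EKK = (K_i * K_ip).mean()  # should approach S_cap / S          (24)
```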

Proof of spatio-temporal covariance

For data point $x_{ij}$ located at $r_i$ at time $t_j$ and data point $x_{i'j'}$ located at $r_{i'}$ at time $t_{j'}$, the covariance is given by

$$
\begin{aligned}
\mathrm{Cov}(x_{ij}, x_{i'j'})
&= \mathrm{Cov}\Big( \sum_{k=1}^{K} d_{ik} b_{kj} s_{kj},\ \sum_{k=1}^{K} d_{i'k} b_{kj'} s_{kj'} \Big) \\
&= \sum_{k=1}^{K} \mathrm{Cov}(d_{ik} b_{kj} s_{kj},\, d_{i'k} b_{kj'} s_{kj'}) + \sum_{k \ne k'} \mathrm{Cov}(d_{ik} b_{kj} s_{kj},\, d_{i'k'} b_{k'j'} s_{k'j'}) \\
&\overset{(a)}{=} \sum_{k=1}^{K} \mathrm{Cov}(d_{ik} b_{kj} s_{kj},\, d_{i'k} b_{kj'} s_{kj'}) \\
&\overset{(b)}{=} \sum_{k=1}^{K} E(b_{kj}) E(b_{kj'})\, \mathrm{Cov}(d_{ik} s_{kj},\, d_{i'k} s_{kj'}) \\
&\overset{(c)}{=} \alpha^2 \sum_{k=1}^{K} \mathrm{Cov}(d_{ik} s_{kj},\, d_{i'k} s_{kj'}) \\
&\overset{(d)}{=} \alpha^2 \sum_{k=1}^{K} \mathrm{Cov}(d_{ik}, d_{i'k})\, \mathrm{Cov}(s_{kj}, s_{kj'})
\end{aligned}
\tag{28}
$$

where the steps of the proof are justified as follows: (a) when $k \ne k'$, $d_{ik} b_{kj} s_{kj}$ and $d_{i'k'} b_{k'j'} s_{k'j'}$ are independent, and therefore $\mathrm{Cov}(d_{ik} b_{kj} s_{kj},\, d_{i'k'} b_{k'j'} s_{k'j'}) = 0$; (b) $b_{kj}$, $b_{kj'}$, $d_{ik} s_{kj}$ and $d_{i'k} s_{kj'}$ are independent; (c) $b_{kj}$, $b_{kj'}$ are drawn independently from a Bernoulli distribution with expectation $\alpha$; (d) $d_{ik}$, $d_{i'k}$ and $s_{kj}$, $s_{kj'}$ are drawn from independent zero-mean GPs, and therefore $\mathrm{Cov}(d_{ik} s_{kj},\, d_{i'k} s_{kj'}) = \mathrm{Cov}(d_{ik}, d_{i'k})\, \mathrm{Cov}(s_{kj}, s_{kj'})$.
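The factorization in (28) can be verified by simulation for a single factor ($K = 1$); the covariance matrices, sparsity level $\alpha$, and sample size below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400000
alpha = 0.7
C_d = np.array([[1.0, 0.6], [0.6, 1.0]])  # Cov of (d_i, d_i') under the spatial GP
C_s = np.array([[1.0, 0.4], [0.4, 1.0]])  # Cov of (s_j, s_j') under the temporal GP

d = rng.multivariate_normal([0, 0], C_d, size=n)  # zero-mean GP draws (spatial)
s = rng.multivariate_normal([0, 0], C_s, size=n)  # zero-mean GP draws (temporal)
b = rng.binomial(1, alpha, size=(n, 2))           # independent Bernoulli(alpha) usage

x = d[:, 0] * b[:, 0] * s[:, 0]    # x_{ij}   (single factor, K = 1)
xp = d[:, 1] * b[:, 1] * s[:, 1]   # x_{i'j'}

emp = np.mean(x * xp) - np.mean(x) * np.mean(xp)   # empirical Cov(x_{ij}, x_{i'j'})
theory = alpha**2 * C_d[0, 1] * C_s[0, 1]          # alpha^2 Cov(d) Cov(s), as in (28)
```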

Appendix B.

An MCMC algorithm for posterior inference of the joint KSBP-GP factor model proposed in the paper is provided below.

Sample $d^{(r)}_k$

$$
p(d^{(r)}_k \mid -) \propto \prod_{i=1}^{M} N\big( x^{(r)}_i;\ D^{(r)}(s^{(c)}_i b^{(c)}_i + s^{(r)}_i b^{(r)}_i),\ \gamma^{(r)\,-1}_{\varepsilon} I_{N_r} \big)\, N\big( d^{(r)}_k; 0, \Sigma_k \big)
\tag{29}
$$

In this and the notation below, $p(d^{(r)}_k \mid -)$ is the probability of $d^{(r)}_k$ conditioned on all other parameters being fixed to the last value in the sequence of Gibbs update equations.

It can be shown that $d^{(r)}_k$ is drawn from a normal distribution

$$
p(d^{(r)}_k \mid -) \sim N\big( \mu_{d^{(r)}_k}, \Sigma_{d^{(r)}_k} \big)
\tag{30}
$$

where

$$
\Sigma_{d^{(r)}_k} = \Big( \Sigma_k^{-1} + \gamma^{(r)}_{\varepsilon} \sum_{i=1}^{M} \big( s^{(c)}_{ki} b^{(c)}_{ki} + s^{(r)}_{ki} b^{(r)}_{ki} \big)^2 I_{N_r} \Big)^{-1}
\tag{31}
$$

$$
\mu_{d^{(r)}_k} = \gamma^{(r)}_{\varepsilon}\, \Sigma_{d^{(r)}_k} \sum_{i=1}^{M} \big( s^{(c)}_{ki} b^{(c)}_{ki} + s^{(r)}_{ki} b^{(r)}_{ki} \big)\, x^{-k,(r)}_i
\tag{32}
$$

where

$$
x^{-k,(r)}_i = x^{(r)}_i - D^{(r)}(s^{(c)}_i b^{(c)}_i + s^{(r)}_i b^{(r)}_i) + d^{(r)}_k \big( s^{(c)}_{ki} b^{(c)}_{ki} + s^{(r)}_{ki} b^{(r)}_{ki} \big)
\tag{33}
$$

Note, for modeling the S & P 500 data, the factor loadings are drawn i.i.d. from a Gaussian distribution and $\Sigma_k = \gamma_s^{-1} I_{N_r}$, and for modeling the Michigan data the factor loadings are drawn from a GP and $\Sigma_k(n,m) = \tau^{(s)}_k \exp\big( -\beta^{(s)}_k \|r_n - r_m\|_2^2 \big) + \sigma^{(s)}_k \delta_{n,m}$. Finally, for modeling the US data, the factor loadings are drawn from a mixture of GPs

$$
d^{(r)}_k \sim \prod_{l=1}^{J} N(0, \Sigma_{k,l})
\tag{34}
$$

Hence the elements of $d^{(r)}_k$ which belong to the $l$th cluster, denoted by $d^{(r)}_{k,l}$, are drawn from a GP with covariance $\Sigma_{k,l}$, and the update equations for the $l$th GP cluster are identical to (30), (31) and (32), with $d^{(r)}_k$ replaced by $d^{(r)}_{k,l}$ and $\Sigma_k$ replaced by $\Sigma_{k,l}$.
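The conjugate update (30)–(32) can be cross-checked against a brute-force construction that stacks the $M$ regressions into one design matrix (illustrative Python; the dimensions and the i.i.d. prior $\Sigma_k = I$ are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
Nr, M = 6, 20                  # Nr spatial locations, M samples
gamma_eps = 4.0                # noise precision gamma_eps^{(r)}
Sigma_k = np.eye(Nr)           # prior covariance of d_k (i.i.d. case)

c = rng.standard_normal(M)            # scalars s_ki^(c) b_ki^(c) + s_ki^(r) b_ki^(r)
X_res = rng.standard_normal((M, Nr))  # residuals x_i^{-k,(r)}, one per row

# Closed-form update (31)-(32)
Sigma_d = np.linalg.inv(np.linalg.inv(Sigma_k)
                        + gamma_eps * np.sum(c**2) * np.eye(Nr))
mu_d = gamma_eps * Sigma_d @ np.sum(c[:, None] * X_res, axis=0)

# Brute force: stack the M regressions x_i^{-k} ~ N(c_i d_k, gamma^{-1} I)
A = np.kron(c[:, None], np.eye(Nr))   # (M*Nr, Nr) design matrix with blocks c_i * I
y = X_res.reshape(-1)
P = np.linalg.inv(Sigma_k) + gamma_eps * A.T @ A   # posterior precision
mu_brute = gamma_eps * np.linalg.solve(P, A.T @ y)
```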

Sample $b^{(c)}_k$, $b^{(r)}_k$

$$
p(b^{(c)}_{ik} \mid -) \propto \prod_{r=1}^{R} N\big( x^{(r)}_i;\ D^{(r)}(s^{(c)}_i b^{(c)}_i + s^{(r)}_i b^{(r)}_i),\ \gamma^{(r)\,-1}_{\varepsilon} I_{N_r} \big)\, \mathrm{Bernoulli}(b^{(c)}_{ik}; \pi_k)
\tag{35}
$$

The posterior probability that $b^{(c)}_{ik} = 1$ is proportional to

$$
p_1 = \pi_k \prod_{r=1}^{R} \exp\Big( -\frac{\gamma^{(r)}_{\varepsilon}}{2} \big( s^{(c)\,2}_{ik}\, d^{(r)\,T}_k d^{(r)}_k - 2 s^{(c)}_{ik}\, d^{(r)\,T}_k x^{-k,(r)}_i \big) \Big)
\tag{36}
$$

The posterior probability that $b^{(c)}_{ik} = 0$ is proportional to

$$
p_0 = 1 - \pi_k
\tag{37}
$$

Hence, $b^{(c)}_{ik}$ may be drawn from a Bernoulli distribution

$$
b^{(c)}_{ik} \sim \mathrm{Bernoulli}\Big( \frac{p_1}{p_1 + p_0} \Big)
\tag{38}
$$

Similarly, $b^{(r)}_{ik}$ may be drawn from a Bernoulli distribution

$$
b^{(r)}_{ik} \sim \mathrm{Bernoulli}\Big( \frac{p'_1}{p'_1 + p'_0} \Big)
\tag{39}
$$

where

$$
p'_1 = \pi_k \exp\Big( -\frac{\gamma^{(r)}_{\varepsilon}}{2} \big( s^{(r)\,2}_{ik}\, d^{(r)\,T}_k d^{(r)}_k - 2 s^{(r)}_{ik}\, d^{(r)\,T}_k x^{-k,(r)}_i \big) \Big)
\tag{40}
$$

$$
p'_0 = 1 - \pi_k
\tag{41}
$$
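The Bernoulli update (36)–(38) is just a normalized likelihood ratio, which the following sketch verifies for a single dataset, $R = 1$ (illustrative Python; all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
Nr = 8
pi_k = 0.3                     # prior inclusion probability
gamma_eps = 2.0
d_k = rng.standard_normal(Nr)  # factor loading d_k^{(r)}
s_ik = 0.7                     # factor score s_ik^{(c)}
x_res = rng.standard_normal(Nr)  # residual x_i^{-k,(r)}

# p1 from (36) (with R = 1) and p0 from (37)
p1 = pi_k * np.exp(-0.5 * gamma_eps
                   * (s_ik**2 * d_k @ d_k - 2 * s_ik * d_k @ x_res))
p0 = 1.0 - pi_k
prob_formula = p1 / (p1 + p0)

# Direct check: compare the two Gaussian log likelihoods for b = 1
# (mean s*d) and b = 0 (mean 0); constant factors cancel in the ratio.
ll1 = -0.5 * gamma_eps * np.sum((x_res - s_ik * d_k) ** 2)
ll0 = -0.5 * gamma_eps * np.sum(x_res ** 2)
prob_direct = (pi_k * np.exp(ll1 - ll0)
               / (pi_k * np.exp(ll1 - ll0) + (1 - pi_k)))
```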

Sample $s^{(c)}_{(k)}$, $s^{(r)}_{(k)}$

In this and the notation below, $x^{(r)}_{(i)}$ represents the $i$th row of matrix $X^{(r)}$, written as a column vector.

$$
p(s^{(c)}_{(k)} \mid -) \propto \prod_{r=1}^{R} \prod_{i=1}^{N_r} N\big( x^{(r)}_{(i)};\ \big(S^{(c)T} \circ B^{(c)T} + S^{(r)T} \circ B^{(r)T}\big) d^{(r)}_{(i)},\ \gamma^{(r)\,-1}_{\varepsilon} I_M \big)\, N\big( s^{(c)}_{(k)}; 0, \Sigma'_k \big)
\tag{42}
$$

Note, $s^{(c)}_{(k)}$ represents the $k$th row of matrix $S^{(c)}$, written as a column vector.

It can be shown that $s^{(c)}_{(k)}$ is drawn from a normal distribution

$$
p(s^{(c)}_{(k)} \mid -) \sim N\big( \mu_{s^{(c)}_{(k)}}, \Sigma_{s^{(c)}_{(k)}} \big)
\tag{43}
$$

where

$$
\Sigma_{s^{(c)}_{(k)}} = \Big( \Sigma'^{-1}_k + \sum_{r=1}^{R} \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon} \big( d^{(r)}_{ki} \big)^2 \big( b^{(c)}_{(k)} b^{(c)\,T}_{(k)} \big) \circ I_M \Big)^{-1}
\tag{44}
$$

$$
\mu_{s^{(c)}_{(k)}} = \Sigma_{s^{(c)}_{(k)}} \sum_{r=1}^{R} \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon}\, d^{(r)}_{ki} \big( b^{(c)}_{(k)} \circ x^{-k,(r)}_{(i)} \big)
\tag{45}
$$

where

$$
x^{-k,(r)}_{(i)} = x^{(r)}_{(i)} - \big(S^{(c)T} \circ B^{(c)T}\big) d^{(r)}_{(i)} + \big( s^{(c)}_{(k)} \circ b^{(c)}_{(k)} \big) d^{(r)}_{ki} - \big(S^{(r)T} \circ B^{(r)T}\big) d^{(r)}_{(i)}
\tag{46}
$$

Similarly, it can be shown that $s^{(r)}_{(k)}$ is drawn from a normal distribution

$$
p(s^{(r)}_{(k)} \mid -) \sim N\big( \mu_{s^{(r)}_{(k)}}, \Sigma_{s^{(r)}_{(k)}} \big)
\tag{47}
$$

where

$$
\Sigma_{s^{(r)}_{(k)}} = \Big( \Sigma'^{-1}_k + \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon} \big( d^{(r)}_{ki} \big)^2 \big( b^{(r)}_{(k)} b^{(r)\,T}_{(k)} \big) \circ I_M \Big)^{-1}
\tag{48}
$$

$$
\mu_{s^{(r)}_{(k)}} = \Sigma_{s^{(r)}_{(k)}} \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon}\, d^{(r)}_{ki} \big( b^{(r)}_{(k)} \circ x^{-k,(r)}_{(i)} \big)
\tag{49}
$$

where

$$
x^{-k,(r)}_{(i)} = x^{(r)}_{(i)} - \big(S^{(r)T} \circ B^{(r)T}\big) d^{(r)}_{(i)} + \big( s^{(r)}_{(k)} \circ b^{(r)}_{(k)} \big) d^{(r)}_{ki} - \big(S^{(c)T} \circ B^{(c)T}\big) d^{(r)}_{(i)}
\tag{50}
$$

Sample $z_k$ (cluster labels for the $k$th factor loading)

$$
p(z_k(i) = l \mid -) \propto N\big(d^{(r)}_{ki}; \mu, \sigma^2\big)\, w_{ikl}(r_i)
\tag{51}
$$

where

$$
w_{ikl}(r_i) = V_{kl} K(r_i; r^*_{kl}, \phi_{kl}) \prod_{j=1}^{l-1} \big[ 1 - V_{kj} K(r_i; r^*_{kj}, \phi_{kj}) \big]
\tag{52}
$$

and (Rasmussen and Williams, 2005),

$$
\mu = \Sigma_{k,l}(r_i, r_{\backslash i})^T\, \Sigma_{k,l}(r_{\backslash i}, r_{\backslash i})^{-1}\, d^{(r)}_{k,l \backslash i}
\tag{53}
$$

$$
\sigma^2 = \Sigma_{k,l}(r_i, r_i) - \Sigma_{k,l}(r_i, r_{\backslash i})^T\, \Sigma_{k,l}(r_{\backslash i}, r_{\backslash i})^{-1}\, \Sigma_{k,l}(r_i, r_{\backslash i})
\tag{54}
$$

Here the elements of $d^{(r)}_k$ which belong to the $l$th cluster are denoted by $d^{(r)}_{k,l}$. The notation $d^{(r)}_{k,l \backslash i}$ denotes all elements of $d^{(r)}_{k,l}$ except the $i$th, and $r_{\backslash i}$ denotes the spatial locations corresponding to the elements in $d^{(r)}_{k,l \backslash i}$.
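Equations (53)–(54) are the standard Gaussian conditionals; the sketch below checks them against the equivalent precision-matrix expressions, $\mathrm{Var}(d_i \mid d_{\backslash i}) = 1/\Lambda_{ii}$ with $\Lambda = \Sigma^{-1}$ (illustrative Python; the kernel parameters and locations are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 7
r = rng.uniform(0, 5, size=(n, 2))                 # spatial locations
d2 = ((r[:, None, :] - r[None, :, :]) ** 2).sum(-1)
Sigma = 1.2 * np.exp(-0.5 * d2) + 0.1 * np.eye(n)  # GP covariance Sigma_{k,l}
d_vals = rng.multivariate_normal(np.zeros(n), Sigma)  # a draw of d_{k,l}^{(r)}

i = 3
mask = np.ones(n, dtype=bool)
mask[i] = False
k_i = Sigma[mask][:, i]               # Sigma_{k,l}(r_i, r_{\i})
K_rest = Sigma[np.ix_(mask, mask)]    # Sigma_{k,l}(r_{\i}, r_{\i})

# Conditional mean/variance as in (53)-(54)
mu = k_i @ np.linalg.solve(K_rest, d_vals[mask])
var = Sigma[i, i] - k_i @ np.linalg.solve(K_rest, k_i)

# Equivalent expressions via the precision matrix Lambda = Sigma^{-1}
Lam = np.linalg.inv(Sigma)
var_prec = 1.0 / Lam[i, i]
mu_prec = -(Lam[i, mask] @ d_vals[mask]) / Lam[i, i]
```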

Sample $V_k$

A data augmentation approach to update $\{V_{kl}\}_{l=1}^{J}$ has been proposed in Dunson and Park (2008). It requires the introduction of two auxiliary variables, $A_{ikl} \sim \mathrm{Bernoulli}(V_{kl})$ and $B_{ikl} \sim \mathrm{Bernoulli}\big( K(r_i; r^*_{kl}, \phi_{kl}) \big)$, independently for each $l$, with $z_k(i) = \min\{ l : A_{ikl} = B_{ikl} = 1 \}$. We can then alternate between sampling $(A_{ikl}, B_{ikl})$ from their conditional distribution given $z_k(i)$ and updating $V_{kl}$ by sampling from the conditional posterior distribution

$$
V_{kl} \sim \mathrm{Beta}\Big( 1 + \sum_{i: z_k(i) \ge l} A_{ikl},\ \gamma_k + \sum_{i: z_k(i) \ge l} (1 - A_{ikl}) \Big)
\tag{55}
$$
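The augmentation can be checked by simulation: drawing $A_{ikl}$, $B_{ikl}$ and setting $z_k(i) = \min\{l : A = B = 1\}$ should reproduce the stick-breaking probabilities $w_{ikl}$ of (52). Illustrative Python, with a finite truncation whose last stick is forced to one so that $z$ is always defined:

```python
import numpy as np

rng = np.random.default_rng(7)
L, n = 5, 200000
V = np.array([0.5, 0.6, 0.4, 0.7, 1.0])   # V_kl; last stick set to 1
Kv = np.array([0.8, 0.5, 0.9, 0.6, 1.0])  # kernel values K(r_i; r*_kl, phi_kl)

A = rng.random((n, L)) < V   # A_ikl ~ Bernoulli(V_kl)
B = rng.random((n, L)) < Kv  # B_ikl ~ Bernoulli(K(r_i; r*_kl, phi_kl))
hit = A & B
z = np.argmax(hit, axis=1)   # first l with A = B = 1 (last column always hits)

# Stick-breaking probabilities w_l = V_l K_l prod_{j<l} (1 - V_j K_j)
w = np.empty(L)
rem = 1.0
for l in range(L):
    w[l] = V[l] * Kv[l] * rem
    rem *= 1.0 - V[l] * Kv[l]

emp = np.bincount(z, minlength=L) / n   # empirical distribution of z
```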

Sample $\phi_k$

A discrete prior is placed on the KSBP kernel width parameters $\{\phi_{kj}\}_{j=1}^{J}$,

$$
p(\phi_{kj}) = \sum_h p_h\, \delta_{\phi_h}
\tag{56}
$$

where the $\phi_h$ are potential kernel widths. The posterior takes the form

$$
p(\phi_{kj} \mid -) = \sum_h p^{\mathrm{new}}_h\, \delta_{\phi_h}
\tag{57}
$$

where

$$
p^{\mathrm{new}}_h \propto p_h \prod_{i} \big[ V_{kj} K(r_i; r^*_{kj}, \phi_h) \big]^{I(z_k(i) = j)} \big( 1 - V_{kj} K(r_i; r^*_{kj}, \phi_h) \big)^{I(z_k(i) > j)}
\tag{58}
$$


Sample $r^*_k$

A discrete prior is placed on the KSBP basis locations $\{r^*_{kj}\}_{j=1}^{J}$,

$$
p(r^*_{kj}) = \sum_h e_h\, \delta_{r^*_h}
\tag{59}
$$

where the $r^*_h$ constitute a grid of potential locations. The posterior takes the form

$$
p(r^*_{kj} \mid -) = \sum_h e^{\mathrm{new}}_h\, \delta_{r^*_h}
\tag{60}
$$

where

$$
e^{\mathrm{new}}_h \propto e_h \prod_{i} \big[ V_{kj} K(r_i; r^*_h, \phi_{kj}) \big]^{I(z_k(i) = j)} \big( 1 - V_{kj} K(r_i; r^*_h, \phi_{kj}) \big)^{I(z_k(i) > j)}
\tag{61}
$$
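The discrete-grid updates (56)–(58) and (59)–(61) amount to reweighting the prior atoms by the likelihood and renormalizing. A minimal sketch for a kernel width (illustrative Python; the grid, stick parameters, and synthetic labels are arbitrary):

```python
import numpy as np

def kernel(r, rstar, phi):
    """Gaussian KSBP kernel K(r, r*; phi), taking values in (0, 1]."""
    return np.exp(-phi * np.sum((r - rstar) ** 2, axis=-1))

rng = np.random.default_rng(8)
n = 40
r = rng.uniform(0, 2, size=(n, 2))         # observation locations
rstar = np.array([1.0, 1.0])               # stick location r*_kj (held fixed here)
V = 0.8                                    # stick weight V_kj
phi_grid = np.array([0.1, 0.5, 1.0, 2.0])  # candidate kernel widths phi_h
prior = np.full(len(phi_grid), 0.25)       # prior atom weights p_h

# Synthetic labels: sites near r* belong to stick j, the rest continue past it.
z_eq_j = np.sum((r - rstar) ** 2, axis=1) < 0.3

log_post = np.log(prior)
for h, phi in enumerate(phi_grid):
    Kv = kernel(r, rstar, phi)
    log_post[h] += np.sum(np.log(V * Kv[z_eq_j]))         # I(z = j) terms of (58)
    log_post[h] += np.sum(np.log(1.0 - V * Kv[~z_eq_j]))  # I(z > j) terms of (58)
post = np.exp(log_post - log_post.max())
post /= post.sum()   # p_h^new, renormalized over the grid
```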

Sample $\pi_k$

$$
p(\pi_k \mid -) \propto \mathrm{Beta}(\pi_k; c\alpha, c(1-\alpha)) \prod_{i=1}^{M} \mathrm{Bernoulli}(b_{ki}; \pi_k)
\tag{62}
$$

It can be shown that $\pi_k$ may be drawn from a Beta distribution as

$$
p(\pi_k \mid -) \sim \mathrm{Beta}\Big( c\alpha + \sum_{i=1}^{M} b_{ki},\ c(1-\alpha) + M - \sum_{i=1}^{M} b_{ki} \Big)
\tag{63}
$$
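The conjugate update (63) can be checked on a grid by comparing the closed-form Beta posterior with a numerically normalized prior-times-likelihood (illustrative Python; the prior strength $c$, $\alpha$, and the indicator vector are arbitrary):

```python
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    """Beta(a, b) density via log-gamma (no external dependencies)."""
    logB = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - logB)

c, alph = 2.0, 0.25
b_ki = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])   # binary indicators b_ki
M, n1 = len(b_ki), int(b_ki.sum())

a_post = c * alph + n1              # Beta posterior parameters from (63)
b_post = c * (1 - alph) + M - n1

grid = np.linspace(0.01, 0.99, 981)
dx = grid[1] - grid[0]
unnorm = beta_pdf(grid, c * alph, c * (1 - alph)) * grid**n1 * (1 - grid)**(M - n1)
unnorm /= unnorm.sum() * dx         # numerically normalized prior x likelihood
closed = beta_pdf(grid, a_post, b_post)
```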

Sample $\gamma_k$

$$
p(\gamma_k \mid -) \propto \prod_{l=1}^{J} \mathrm{Beta}(V_{kl}; 1, \gamma_k)\, \mathrm{Gamma}(\gamma_k; a_2, b_2)
\tag{64}
$$

It can be shown that $\gamma_k$ may be drawn from a Gamma distribution as

$$
\gamma_k \sim \mathrm{Gamma}\Big( a_2 + J - 1,\ b_2 - \sum_{l=1}^{J} \ln(1 - V_{kl}) \Big)
\tag{65}
$$

Sample $\gamma^{(r)}_{\varepsilon}$

$$
p(\gamma^{(r)}_{\varepsilon} \mid -) \propto \prod_{i=1}^{M} N\big( x^{(r)}_i;\ D^{(r)}(s^{(c)}_i b^{(c)}_i + s^{(r)}_i b^{(r)}_i),\ \gamma^{(r)\,-1}_{\varepsilon} I_{N_r} \big)\, \mathrm{Gamma}(\gamma^{(r)}_{\varepsilon}; a_0, b_0)
\tag{66}
$$

It can be shown that $\gamma^{(r)}_{\varepsilon}$ may be drawn from a Gamma distribution as

$$
p(\gamma^{(r)}_{\varepsilon} \mid -) \sim \mathrm{Gamma}\Big( a_0 + \frac{1}{2} M N_r,\ b_0 + \frac{1}{2} \sum_{i=1}^{M} \big\| x^{(r)}_i - D^{(r)}(s^{(c)}_i b^{(c)}_i + s^{(r)}_i b^{(r)}_i) \big\|^2 \Big)
\tag{67}
$$
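The Gamma update (67) follows from collecting powers of $\gamma_{\varepsilon}$ in (66); the sketch below confirms that the unnormalized log posterior and the closed-form Gamma log density differ only by a constant (illustrative Python; the dimensions and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
M, Nr = 15, 4
a0, b0 = 1.0, 1.0
resid = rng.standard_normal((M, Nr))   # residuals x_i^{(r)} - D^{(r)}(s_i b_i + ...)
sse = np.sum(resid ** 2)

def log_post(g):
    """Unnormalized log of (66): Gaussian likelihood times Gamma(a0, b0) prior."""
    loglik = 0.5 * M * Nr * np.log(g) - 0.5 * g * sse
    logprior = (a0 - 1) * np.log(g) - b0 * g
    return loglik + logprior

a_post = a0 + 0.5 * M * Nr   # Gamma posterior parameters from (67)
b_post = b0 + 0.5 * sse

def log_gamma_unnorm(g):
    """Unnormalized log density of Gamma(a_post, b_post)."""
    return (a_post - 1) * np.log(g) - b_post * g

g1, g2 = 0.7, 1.9   # two arbitrary evaluation points
```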

Update of GP parameters $\Omega_{kl}$

For the GP parameters, the full conditional posteriors are difficult to obtain in closed form. We obtain point estimates for these parameters via maximum likelihood estimation. Let $\Omega_{kl} = \{\tau^{(s)}_{kl}, \beta^{(s)}_{kl}, \sigma^{(s)}_{kl}\}$ represent the parameter set for the $l$th (spatial) GP corresponding to the $k$th factor loading. The MLE $\hat{\Omega}_{kl}$ is obtained as (Rasmussen and Williams, 2005)

$$
\hat{\Omega}_{kl} = \arg\max_{\Omega_{kl}} \Big( -\frac{1}{2} \ln \det(\Sigma_{k,l}) - \frac{1}{2}\, d^{(r)\,T}_{k,l} \Sigma_{k,l}^{-1} d^{(r)}_{k,l} - \frac{S_l}{2} \ln(2\pi) \Big)
\tag{68}
$$

where $\det(\Sigma_{k,l})$ denotes the determinant of the matrix $\Sigma_{k,l}$ and $S_l$ denotes the number of elements in $d^{(r)}_{k,l}$. The parameters for the temporal GPs may be updated in a similar manner.
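In the same spirit as (68), a point estimate can be obtained by maximizing the GP log marginal likelihood over a candidate grid (a gradient-based optimizer would normally be used; the grid search below is an illustrative Python sketch with arbitrary synthetic data):

```python
import numpy as np

def log_marglik(d_vals, r, tau, beta, sigma2):
    """GP log marginal likelihood of d_vals at locations r, as in (68)."""
    d2 = ((r[:, None, :] - r[None, :, :]) ** 2).sum(-1)
    Sigma = tau * np.exp(-beta * d2) + sigma2 * np.eye(len(r))
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = d_vals @ np.linalg.solve(Sigma, d_vals)
    return -0.5 * logdet - 0.5 * quad - 0.5 * len(r) * np.log(2 * np.pi)

rng = np.random.default_rng(10)
n = 40
r = rng.uniform(0, 3, size=(n, 2))
d2 = ((r[:, None, :] - r[None, :, :]) ** 2).sum(-1)
Sigma_true = 1.0 * np.exp(-2.0 * d2) + 0.05 * np.eye(n)   # true beta = 2.0
d_vals = rng.multivariate_normal(np.zeros(n), Sigma_true)  # synthetic loading values

# Point estimate of beta by maximizing (68) over a small grid
beta_grid = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
lls = np.array([log_marglik(d_vals, r, 1.0, b, 0.05) for b in beta_grid])
beta_hat = beta_grid[np.argmax(lls)]
```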
