
Sparse Bayesian Multiview Learning for Simultaneous Association Discovery and Diagnosis of Alzheimer's Disease

Shandian Zhe, Department of Computer Science, Purdue University
Zenglin Xu, School of Computer Science & Engineering, Big Data Research Center, University of Electronic Science and Technology of China
Yuan Qi, Department of Computer Science, Purdue University
Peng Yu, Eli Lilly and Company
for the ADNI*

Abstract

In the analysis and diagnosis of many diseases, such as Alzheimer's disease (AD), two important and related tasks are usually required: i) selecting genetic and phenotypical markers for diagnosis, and ii) identifying associations between genetic and phenotypical features. While previous studies treat these two tasks separately, they are tightly coupled because of their shared underlying biological basis. To harness their potential benefits for each other, we propose a new sparse Bayesian approach to carry out the two tasks jointly. In our approach, we extract common latent features from different data sources via sparse projection matrices and then use the latent features to predict disease severity levels; in return, the disease status can guide the learning of the sparse projection matrices, which not only reveal interactions between data sources but also select groups of related biomarkers. To further boost the learning of the sparse projection matrices, we incorporate graph Laplacian priors encoding the valuable linkage disequilibrium (LD) information. To estimate the model efficiently, we develop a variational inference algorithm. Analysis of an imaging genetics dataset for AD study shows that our model discovers biologically meaningful associations between single nucleotide polymorphisms (SNPs) and magnetic resonance imaging (MRI) features, and achieves significantly higher accuracy in predicting ordinal AD stages than competitive methods.

Introduction

Alzheimer's disease (AD) is the most common neurodegenerative disorder (Khachaturian 1985). Given genetic variations, e.g., single nucleotide polymorphisms (SNPs), and phenotypical traits, e.g., magnetic resonance imaging (MRI) measurements, we want to develop noninvasive diagnosis methods for better preventative care; to understand the underlying AD pathology, we want to identify the disease-related features and discover their associations.

* Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Many approaches have been proposed to discover associations, such as canonical correlation analysis (CCA) and its extensions (Harold 1936; Bach and Jordan 2005a), or to select features (or variables), such as lasso (Tibshirani 1994), elastic net (Zou and Hastie 2005), and Bayesian automatic relevance determination (MacKay 1991; Neal 1996). Despite their wide success in many applications, these approaches suffer from several limitations: i) most association studies neglect the supervision from the disease status, although diseases such as AD are a direct result of genetic variations and are often highly correlated with clinical traits; ii) most feature selection approaches do not consider the disease severity order, although AD subjects have a natural severity order from normal to mild cognitive impairment (MCI) and then from MCI to AD; iii) most previous approaches are not designed to handle heterogeneous data sources, e.g., SNP values are discrete and ordinal while imaging features are continuous; iv) most previous methods ignore valuable prior knowledge such as linkage disequilibrium (LD) (Falconer and Mackay 1996), which measures the non-random association of alleles. To our knowledge, this structure has not been utilized for association discovery in medical studies.

To address these limitations, we propose a new Bayesian model that unifies multiview learning with sparse ordinal regression for joint association study and disease diagnosis. Specifically, genetic variations and phenotypical traits are generated from common latent features through separate sparse projection matrices and suitable link functions, and the latent features are used to predict the disease status. To encourage sparsity, we assign spike and slab priors (George and McCulloch 1997) over the projection matrices; we further employ an additional graph Laplacian prior encoding LD knowledge to boost the learning of the sparse projection matrix on the genetic variation view. The sparse projection matrices then not only reveal critical interactions between the data sources but also identify biomarkers relevant to the disease. Meanwhile, via their direct connection to the latent features, the disease status influences the estimation of the projection matrices and guides the discovery of disease-sensitive data source associations.


Figure 1: The graphical representation of our model, where X is the continuous view, Z is the ordinal view, y are the ordinal labels, and L is the graph Laplacian generated from the LD structure.

For efficient model estimation, we develop a variational inference approach, which iteratively minimizes the Kullback-Leibler divergence between a tractable approximation and the exact Bayesian posterior distribution. The results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset show that our model achieves the highest prediction accuracy among all the competing methods and finds biologically meaningful associations between SNPs and MRI features.

Methods

Notations and Assumptions

We assume there are two heterogeneous data sources: one contains continuous data, e.g., MRI features, and the other contains discrete ordinal data, e.g., SNPs. Note that we can easily generalize the model below to handle more views and other data types by adopting suitable link functions (e.g., a Poisson model for count data). Given data from n subjects, p continuous features and q discrete features, we denote the continuous data by a p × n matrix X = [x_1, ..., x_n], the discrete ordinal data by a q × n matrix Z = [z_1, ..., z_n], and the labels (i.e., the disease status) by an n × 1 vector y = [y_1, ..., y_n]^⊤. For the AD study, we let y_i = 0, 1 or 2 if the i-th subject is in the normal, MCI or AD condition, respectively.

Model

To link the two data sources X and Z together, we introduce common latent features U = [u_1, ..., u_n] and assume X and Z are generated from U by sparse projection. The common latent feature assumption is sensible because both SNPs and MRI features are biological measurements of the same subjects. Note that u_i is a k-dimensional latent feature vector for the i-th subject. In a Bayesian framework, we assign a Gaussian prior over U, $p(\mathbf{U}) = \prod_i \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \mathbf{I})$, and specify the rest of the model (see Figure 1) as follows.

Continuous data distribution. Given U, X is generated from

$$p(\mathbf{X} \mid \mathbf{U}, \mathbf{G}, \eta) = \prod_{i=1}^{n} \mathcal{N}(\mathbf{x}_i \mid \mathbf{G}\mathbf{u}_i, \eta^{-1}\mathbf{I})$$

where G = [g_1, g_2, ..., g_p]^⊤ is a p × k projection matrix, I is an identity matrix, and η is a scalar noise precision. We assign an uninformative Gamma prior over η, p(η | r_1, r_2) = Gamma(η | r_1, r_2), where r_1 = r_2 = 10^{-3}.
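For illustration, the continuous view can be sampled from this generative process with a few lines of numpy; the dimensions, seed and noise precision below are illustrative choices, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 30, 5     # subjects, continuous features, latent dimension
eta = 4.0                # noise precision; the noise covariance is (1/eta) * I

U = rng.standard_normal((k, n))   # latent features, each u_i ~ N(0, I)
G = rng.standard_normal((p, k))   # projection matrix (dense here; the spike and
                                  # slab prior introduced below makes it sparse)
X = G @ U + rng.standard_normal((p, n)) / np.sqrt(eta)   # x_i ~ N(G u_i, eta^{-1} I)
```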

Ordinal data distribution. For an ordinal variable z ∈ {0, 1, ..., R−1}, we introduce an auxiliary variable c and a segmentation of the real axis, −∞ = b_0 < b_1 < ... < b_R = ∞. We define z = r if and only if c falls in [b_r, b_{r+1}). Then, given a q × k projection matrix H = [h_1, h_2, ..., h_q]^⊤ and the auxiliary matrix C = [c_ij], the ordinal data Z are generated from

$$p(\mathbf{Z}, \mathbf{C} \mid \mathbf{U}, \mathbf{H}) = \prod_{i=1}^{q} \prod_{j=1}^{n} p(c_{ij} \mid \mathbf{h}_i, \mathbf{u}_j)\, p(z_{ij} \mid c_{ij})$$

where $p(z_{ij} \mid c_{ij}) = \sum_{r=0}^{R-1} \delta(z_{ij} = r)\,\delta(b_r \le c_{ij} < b_{r+1})$ and $p(c_{ij} \mid \mathbf{h}_i, \mathbf{u}_j) = \mathcal{N}(c_{ij} \mid \mathbf{h}_i^\top \mathbf{u}_j, 1)$. Here δ(a) = 1 if a is true and δ(a) = 0 otherwise. In the AD study, Z takes values in {0, 1, 2} and hence R = 3.
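The cut-point construction maps each auxiliary value to an ordinal level by a simple threshold search; a minimal sketch (the interior cut points here are illustrative):

```python
import numpy as np

# Cut points b_0 < b_1 < ... < b_R with b_0 = -inf, b_R = +inf; R = 3 here.
# The interior values are illustrative, not estimates from the paper.
b = np.array([-np.inf, -0.5, 0.5, np.inf])

def ordinal_from_latent(c):
    """Map auxiliary values c to ordinal levels: z = r iff b_r <= c < b_{r+1}."""
    return np.searchsorted(b, c, side="right") - 1

c = np.array([-1.2, 0.1, 2.3])
print(ordinal_from_latent(c))   # -> [0 1 2]
```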

Label distribution. The disease status labels y are ordinal variables too. To generate y, we use an ordinal regression model based on the latent representation U,

$$p(\mathbf{y}, \mathbf{f} \mid \mathbf{U}, \mathbf{w}) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{U}, \mathbf{w}),$$

where f are the latent continuous values corresponding to y, w is the weight vector for the latent features, $p(y_i \mid f_i) = \sum_{r=0}^{2} \delta(y_i = r)\,\delta(b_r \le f_i < b_{r+1})$ and $p(f_i \mid \mathbf{u}_i, \mathbf{w}) = \mathcal{N}(f_i \mid \mathbf{u}_i^\top \mathbf{w}, 1)$. Note that y is linked to X and Z via the latent features U and the projection matrices H and G. Due to the sparsity in H and G, only a few groups of variables in X and Z are selected to predict y.

LD structure as additional priors for SNP correlations. Linkage disequilibrium records the occurrence of non-random combinations of alleles or genetic markers, which is a natural indicator of SNP correlations. To utilize this valuable prior knowledge, we introduce a latent q × k matrix H̄, which is tightly linked to H. Each column h̄_j of H̄ is regularized by the graph Laplacian of the LD structure, i.e.,

$$p(\bar{\mathbf{H}} \mid \mathbf{L}) = \prod_j \mathcal{N}(\bar{\mathbf{h}}_j \mid \mathbf{0}, \mathbf{L}^{-1}) = \prod_j \mathcal{N}(\mathbf{0} \mid \bar{\mathbf{h}}_j, \mathbf{L}^{-1}) = p(\mathbf{0} \mid \bar{\mathbf{H}}, \mathbf{L}),$$

where L is the graph Laplacian matrix of the LD structure. As shown above, the prior p(H̄ | L) has the same form as p(0 | H̄, L), which can be viewed as a generative model; in other words, the observation 0 is sampled from H̄. This view enables us to combine the generative model for graph Laplacian regularization with the sparse projection model via a principled hybrid Bayesian framework (Lasserre, Bishop, and Minka 2006).

To link the two models together, we introduce a prior over H:

$$p(\mathbf{H} \mid \bar{\mathbf{H}}) = \prod_j \mathcal{N}(\mathbf{h}_j \mid \bar{\mathbf{h}}_j, \lambda\mathbf{I})$$

where the variance λ controls how similar H and H̄ are in our model. For simplicity, we set λ = 0 so that p(H | H̄) = δ(H − H̄), i.e., a point mass at H = H̄.
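The graph Laplacian itself is straightforward to construct from an LD weight matrix; a minimal sketch with an invented toy adjacency (the actual LD structure used in our experiments comes from the NCBI source cited later):

```python
import numpy as np

def graph_laplacian(A):
    """Graph Laplacian L = D - A of a symmetric adjacency/weight matrix A."""
    return np.diag(A.sum(axis=1)) - A

# Toy LD-style weights among 4 SNPs: pairs (0,1) and (2,3) are in strong LD.
A = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])
L = graph_laplacian(A)
# The prior N(h | 0, L^{-1}) has density proportional to exp(-0.5 * h @ L @ h),
# which penalizes columns h that differ strongly across linked SNPs.
```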

Sparse priors for projection matrices and weight vector. In order to discover critical interactions between data sources and enhance the model prediction, we use the spike and slab prior (George and McCulloch 1997) to sparsify the projection matrices G and H and the weight vector w.

Specifically, we use a p × k matrix S_g to represent the selection of elements in G: if s^g_ij = 1, g_ij is selected and follows a Gaussian prior distribution with variance σ_1^2; if s^g_ij = 0, g_ij is not selected and is forced to be almost zero (i.e., sampled from a Gaussian with a very small variance σ_2^2). Thereby, we have the following prior over G:

$$p(\mathbf{G} \mid \mathbf{S}_g, \boldsymbol{\Pi}_g) = \prod_{i=1}^{p} \prod_{j=1}^{k} p(g_{ij} \mid s^g_{ij})\, p(s^g_{ij} \mid \pi^g_{ij})$$

where π^g_ij in Π_g is the probability of selecting g_ij (i.e., s^g_ij = 1),

$$p(g_{ij} \mid s^g_{ij}) = s^g_{ij}\, \mathcal{N}(g_{ij} \mid 0, \sigma_1^2) + (1 - s^g_{ij})\, \mathcal{N}(g_{ij} \mid 0, \sigma_2^2),$$

and $p(s^g_{ij} \mid \pi^g_{ij}) = (\pi^g_{ij})^{s^g_{ij}} (1 - \pi^g_{ij})^{1 - s^g_{ij}}$. Note that σ_2^2 should be close to 0; we set σ_1^2 = 1 and σ_2^2 = 10^{-6} in the experiments. For a full Bayesian treatment, we assign an uninformative hyperprior over Π_g: $p(\boldsymbol{\Pi}_g \mid l_1, l_2) = \prod_{i=1}^{p} \prod_{j=1}^{k} \mathrm{Beta}(\pi^g_{ij} \mid l_1, l_2)$ where l_1 = l_2 = 1. Similarly, we assign the prior for H,

$$p(\mathbf{H} \mid \mathbf{S}_h, \boldsymbol{\Pi}_h) = \prod_{i=1}^{q} \prod_{j=1}^{k} p(h_{ij} \mid s^h_{ij})\, p(s^h_{ij} \mid \pi^h_{ij}),$$

where S_h are binary selection variables, π^h_ij is the selecting probability of h_ij, $p(h_{ij} \mid s^h_{ij}) = s^h_{ij}\, \mathcal{N}(h_{ij} \mid 0, \sigma_1^2) + (1 - s^h_{ij})\, \mathcal{N}(h_{ij} \mid 0, \sigma_2^2)$ and $p(s^h_{ij} \mid \pi^h_{ij}) = (\pi^h_{ij})^{s^h_{ij}} (1 - \pi^h_{ij})^{1 - s^h_{ij}}$. We assign uninformative Beta hyperpriors for Π_h: $p(\boldsymbol{\Pi}_h \mid d_1, d_2) = \prod_{i=1}^{q} \prod_{j=1}^{k} \mathrm{Beta}(\pi^h_{ij} \mid d_1, d_2)$ where d_1 = d_2 = 1. Finally, for the weight vector w, we have

$$p(\mathbf{w} \mid \mathbf{s}_w, \boldsymbol{\pi}_w) = \prod_{j=1}^{k} p(w_j \mid s^w_j)\, p(s^w_j \mid \pi^w_j)$$

where s_w are binary selection variables, π^w_j is the selecting probability of w_j, $p(w_j \mid s^w_j) = s^w_j\, \mathcal{N}(w_j \mid 0, \sigma_1^2) + (1 - s^w_j)\, \mathcal{N}(w_j \mid 0, \sigma_2^2)$ and $p(s^w_j \mid \pi^w_j) = (\pi^w_j)^{s^w_j} (1 - \pi^w_j)^{1 - s^w_j}$. We assign Beta hyperpriors for π_w: $p(\boldsymbol{\pi}_w) = \prod_{i=1}^{k} \mathrm{Beta}(\pi^w_i \mid e_1, e_2)$ where e_1 = e_2 = 1.
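To make the spike and slab construction concrete, the following sketch samples a projection matrix under this prior; the dimensions and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 30, 5
sigma1, sigma2 = 1.0, 1e-3     # slab and spike std; sigma2**2 = 1e-6 as in the text

pi_g = rng.beta(1.0, 1.0, size=(p, k))        # selection probabilities, Beta(1, 1)
S_g = rng.random((p, k)) < pi_g               # binary selection indicators s_ij^g
G = np.where(S_g,
             rng.normal(0.0, sigma1, (p, k)),  # slab: selected entries
             rng.normal(0.0, sigma2, (p, k)))  # spike: entries forced near zero
```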

Joint distribution. Based on the aforementioned components, the joint distribution of our model is

$$\begin{aligned} p(\mathbf{X}, \mathbf{Z}, \mathbf{y}, \mathbf{U}, \mathbf{G}, \mathbf{S}_g, \boldsymbol{\Pi}_g, \eta, \mathbf{C}, \mathbf{H}, \bar{\mathbf{H}}, \mathbf{S}_h, \boldsymbol{\Pi}_h, \mathbf{w}, \mathbf{s}_w, \boldsymbol{\pi}_w, \mathbf{f}) = {}& p(\mathbf{X} \mid \mathbf{U}, \mathbf{G}, \eta)\, p(\mathbf{G} \mid \mathbf{S}_g)\, p(\mathbf{S}_g \mid \boldsymbol{\Pi}_g)\, p(\boldsymbol{\Pi}_g \mid l_1, l_2)\, p(\eta \mid r_1, r_2) \\ & \cdot p(\mathbf{Z}, \mathbf{C} \mid \mathbf{U}, \mathbf{H})\, p(\mathbf{H} \mid \mathbf{S}_h)\, p(\mathbf{S}_h \mid \boldsymbol{\Pi}_h)\, p(\boldsymbol{\Pi}_h \mid d_1, d_2)\, p(\mathbf{H} \mid \bar{\mathbf{H}}) \\ & \cdot p(\mathbf{0} \mid \bar{\mathbf{H}}, \mathbf{L})\, p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{U}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{s}_w)\, p(\mathbf{s}_w \mid \boldsymbol{\pi}_w)\, p(\mathbf{U}). \end{aligned}$$

Estimation

Given the model, we present an efficient approach to estimate the latent features U, the projection matrices H and G, the selection indicators S_g and S_h, the selection probabilities Π_g and Π_h, the noise precision η, the auxiliary variables C for the ordinal data Z, the auxiliary variables f for the ordinal labels y, the weight vector w for prediction, and the corresponding selection indicators and probabilities s_w and π_w. In a Bayesian framework, this amounts to computing their posterior distributions.

However, computing the exact posteriors is infeasible because we cannot calculate the normalization constant of the posterior distribution. Therefore, we resort to a mean-field variational approach, which approximates the posterior distribution of U, H, G, S_g, S_h, Π_g, Π_h, η, w, C, f by a factorized distribution Q(·) = Q(U)Q(H)Q(G)Q(S_g)Q(S_h)Q(Π_g)Q(Π_h)Q(η)Q(w)Q(C)Q(f). Note that since we set p(H | H̄) = δ(H − H̄), we do not need a separate distribution Q(H̄). We minimize the Kullback-Leibler (KL) divergence between the approximate and the exact posteriors, KL(Q‖P), where P denotes the exact joint posterior distribution. We use coordinate descent: we update one approximate distribution, say Q(H), while fixing the others, and iteratively refine all the approximate distributions. The detailed updates are given in the following sections. For brevity, the calculation of the required moments is provided in the supplementary material.

Updating variational distributions for continuous data

For continuous data X, the approximate distributions of the projection matrix G, the noise precision η, the selection indicators S_g and the selection probabilities Π_g are

$$Q(\mathbf{G}) = \prod_{i=1}^{p} \mathcal{N}(\mathbf{g}_i; \boldsymbol{\lambda}_i, \boldsymbol{\Omega}_i), \qquad (1)$$

$$Q(\mathbf{S}_g) = \prod_{i=1}^{p} \prod_{j=1}^{k} \beta_{ij}^{\,s^g_{ij}} (1-\beta_{ij})^{1-s^g_{ij}}, \qquad (2)$$

$$Q(\boldsymbol{\Pi}_g) = \prod_{i=1}^{p} \prod_{j=1}^{k} \mathrm{Beta}(\pi^g_{ij} \mid l^{ij}_1, l^{ij}_2), \qquad (3)$$

$$Q(\eta) = \mathrm{Gamma}(\eta \mid \hat r_1, \hat r_2). \qquad (4)$$

The parameters for Q(G) are

$$\boldsymbol{\Omega}_i = \Big(\langle\eta\rangle \langle\mathbf{U}\mathbf{U}^\top\rangle + \tfrac{1}{\sigma_1^2}\,\mathrm{diag}(\langle\mathbf{s}^g_i\rangle) + \tfrac{1}{\sigma_2^2}\,\mathrm{diag}(\mathbf{1}-\langle\mathbf{s}^g_i\rangle)\Big)^{-1}, \qquad \boldsymbol{\lambda}_i = \boldsymbol{\Omega}_i \big(\langle\eta\rangle \langle\mathbf{U}\rangle \mathbf{x}^i\big),$$

where ⟨·⟩ denotes the expectation over the corresponding approximate distribution, and x^i and s^g_i are the transposes of the i-th rows of X and S_g, respectively. The parameters for Q(S_g) are

$$\beta_{ij} = 1\Big/\Big(1+\exp\big(\langle\log(1-\pi^g_{ij})\rangle - \langle\log \pi^g_{ij}\rangle + \tfrac12\log\tfrac{\sigma_1^2}{\sigma_2^2} + \tfrac12\langle g_{ij}^2\rangle\big(\tfrac{1}{\sigma_1^2}-\tfrac{1}{\sigma_2^2}\big)\big)\Big);$$

the parameters for Q(Π_g) are $l^{ij}_1 = \beta_{ij} + l_1$ and $l^{ij}_2 = 1 - \beta_{ij} + l_2$; and the parameters for Q(η) are

$$\hat r_1 = r_1 + \tfrac{np}{2}, \qquad \hat r_2 = r_2 + \tfrac12\mathrm{tr}(\mathbf{X}\mathbf{X}^\top) - \mathrm{tr}(\langle\mathbf{G}\rangle\langle\mathbf{U}\rangle\mathbf{X}^\top) + \tfrac12\mathrm{tr}(\langle\mathbf{U}\mathbf{U}^\top\rangle\langle\mathbf{G}^\top\mathbf{G}\rangle).$$
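As a concrete reading of these updates, the sketch below computes Ω_i and λ_i for every row of G with numpy; the function and argument names are our own, and the moment inputs come from the current estimates of the other factors.

```python
import numpy as np

def update_Q_G(E_eta, E_U, E_UUt, X, E_Sg, sigma1=1.0, sigma2=1e-3):
    """Per-row update of Q(g_i) = N(lambda_i, Omega_i).

    E_eta: scalar <eta>; E_U: (k, n) matrix <U>; E_UUt: (k, k) matrix <U U^T>;
    X: (p, n) data; E_Sg: (p, k) matrix of <s_ij^g> selection posteriors.
    """
    p, k = E_Sg.shape
    Lambda = np.zeros((p, k))
    Omega = np.zeros((p, k, k))
    for i in range(p):
        prec = (E_eta * E_UUt
                + np.diag(E_Sg[i] / sigma1**2 + (1.0 - E_Sg[i]) / sigma2**2))
        Omega[i] = np.linalg.inv(prec)
        Lambda[i] = Omega[i] @ (E_eta * E_U @ X[i])   # X[i] is the i-th row x^i
    return Lambda, Omega
```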

Updating variational distributions for ordinal data

For ordinal data Z, we update the approximate distributions of the projection matrix H, the auxiliary variables C, the selection indicators S_h and the selecting probabilities Π_h. To make the variational distributions tractable, we factorize Q(H) column-wise; this differs from Q(G), which is factorized row-wise. Specifically, we denote the i-th columns of H and S_h by h^i and s^i_h, the i-th row of U by u^{i⊤}, and calculate the variational distributions of C and H by

$$Q(\mathbf{C}) = \prod_{i=1}^{q} \prod_{j=1}^{n} Q(c_{ij}), \qquad (5)$$

$$Q(c_{ij}) \propto \delta(b_{z_{ij}} \le c_{ij} < b_{z_{ij}+1})\, \mathcal{N}(c_{ij} \mid \bar c_{ij}, 1), \qquad (6)$$

$$Q(\mathbf{H}) = \prod_{i=1}^{k} \mathcal{N}(\mathbf{h}^i; \boldsymbol{\gamma}_i, \boldsymbol{\Lambda}_i), \qquad (7)$$

where $\bar c_{ij} = (\langle\mathbf{H}\rangle\langle\mathbf{U}\rangle)_{ij}$,

$$\boldsymbol{\Lambda}_i = \Big(\langle\mathbf{u}^{i\top}\mathbf{u}^{i}\rangle\,\mathbf{I} + \mathbf{L} + \tfrac{1}{\sigma_1^2}\,\mathrm{diag}(\langle\mathbf{s}^i_h\rangle) + \tfrac{1}{\sigma_2^2}\,\mathrm{diag}(\langle\mathbf{1}-\mathbf{s}^i_h\rangle)\Big)^{-1},$$

and $\boldsymbol{\gamma}_i = \boldsymbol{\Lambda}_i\big(\langle\mathbf{C}\rangle - \sum_{j\ne i}\boldsymbol{\gamma}_j\langle\mathbf{u}^{j}\rangle^{\top}\big)\langle\mathbf{u}^{i}\rangle$. The variational distributions of S_h and Π_h are

$$Q(\mathbf{S}_h) = \prod_{i=1}^{q} \prod_{j=1}^{k} \alpha_{ij}^{\,s^h_{ij}} (1-\alpha_{ij})^{1-s^h_{ij}}, \qquad (8)$$

$$Q(\boldsymbol{\Pi}_h) = \prod_{i=1}^{q} \prod_{j=1}^{k} \mathrm{Beta}(\pi^h_{ij} \mid d^{ij}_1, d^{ij}_2), \qquad (9)$$

where $\alpha_{ij} = 1\big/\big(1+\exp\big(\langle\log(1-\pi^h_{ij})\rangle - \langle\log \pi^h_{ij}\rangle + \tfrac12\log\tfrac{\sigma_1^2}{\sigma_2^2} + \tfrac12\langle h_{ij}^2\rangle\big(\tfrac{1}{\sigma_1^2}-\tfrac{1}{\sigma_2^2}\big)\big)\big)$, and $d^{ij}_1 = \alpha_{ij} + d_1$, $d^{ij}_2 = 1-\alpha_{ij}+d_2$.

Note that in Equation (6), Q(c_ij) is a truncated Gaussian whose truncation interval is determined by the observed ordinal value z_ij.
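These updates require posterior moments of the truncated Gaussians; the full derivations are in the supplementary material, but the key quantity, the mean of a unit-variance Gaussian truncated to [a, b), follows the standard formula sketched below (the example cut points are illustrative).

```python
from scipy.stats import norm

def truncated_gaussian_mean(mu, a, b):
    """Mean of x ~ N(mu, 1) conditioned on a <= x < b (b may be +inf, a may be -inf)."""
    Z = norm.cdf(b - mu) - norm.cdf(a - mu)                  # normalizing mass
    return mu + (norm.pdf(a - mu) - norm.pdf(b - mu)) / Z

# E.g., <c_ij> when z_ij = 1 with cut points b_1 = -0.5, b_2 = 0.5 (illustrative):
print(truncated_gaussian_mean(0.3, -0.5, 0.5))
```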

Updating variational distributions for labels

For ordinal labels y, we calculate the approximate distributions of the auxiliary variables f, the weight vector w, the selection indicators s_w and the selecting probabilities π_w by

$$Q(\mathbf{f}) = \prod_{i=1}^{n} Q(f_i), \qquad (10)$$

$$Q(f_i) \propto \delta(b_{y_i} \le f_i < b_{y_i+1})\, \mathcal{N}(f_i \mid \bar f_i, \sigma^2_{f_i}), \qquad (11)$$

$$Q(\mathbf{w}) = \mathcal{N}(\mathbf{w}; \mathbf{m}, \boldsymbol{\Sigma}_w), \qquad (12)$$

$$Q(\mathbf{s}_w) = \prod_{i=1}^{k} \tau_i^{\,s^w_i} (1-\tau_i)^{1-s^w_i}, \qquad (13)$$

$$Q(\boldsymbol{\pi}_w) = \prod_{i=1}^{k} \mathrm{Beta}(\pi^w_i; e^i_1, e^i_2), \qquad (14)$$

where $\bar f_i = (\langle\mathbf{U}\rangle^\top \mathbf{m})_i$,

$$\boldsymbol{\Sigma}_w = \Big(\langle\mathbf{U}\mathbf{U}^\top\rangle + \tfrac{1}{\sigma_1^2}\,\mathrm{diag}(\langle\mathbf{s}_w\rangle) + \tfrac{1}{\sigma_2^2}\,\mathrm{diag}(\langle\mathbf{1}-\mathbf{s}_w\rangle)\Big)^{-1}, \qquad \mathbf{m} = \boldsymbol{\Sigma}_w \langle\mathbf{U}\rangle\langle\mathbf{f}\rangle,$$

$\tau_i = 1\big/\big(1+\exp\big(\langle\log(1-\pi^w_i)\rangle - \langle\log \pi^w_i\rangle + \tfrac12\log\tfrac{\sigma_1^2}{\sigma_2^2} + \tfrac12\langle w_i^2\rangle\big(\tfrac{1}{\sigma_1^2}-\tfrac{1}{\sigma_2^2}\big)\big)\big)$, $e^i_1 = \tau_i + e_1$ and $e^i_2 = 1-\tau_i+e_2$.

Note that Q(f_i) is also a truncated Gaussian, whose truncation region is determined by the ordinal label y_i. In this way, the supervised information from y is incorporated into the estimation of f, and from there into the estimation of the other quantities through the recursive updates.

Updating variational distributions for the latent features U

The approximate distribution for U is

$$Q(\mathbf{U}) = \prod_i \mathcal{N}(\mathbf{u}_i \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \qquad (15)$$

where $\boldsymbol{\Sigma}_i = \big(\langle\mathbf{w}\mathbf{w}^\top\rangle + \langle\eta\rangle\langle\mathbf{G}^\top\mathbf{G}\rangle + \langle\mathbf{H}^\top\mathbf{H}\rangle + \mathbf{I}\big)^{-1}$ and $\boldsymbol{\mu}_i = \boldsymbol{\Sigma}_i\big(\langle\mathbf{w}\rangle\langle f_i\rangle + \langle\eta\rangle\langle\mathbf{G}\rangle^\top \mathbf{x}_i + \langle\mathbf{H}\rangle^\top \langle\mathbf{c}_i\rangle\big)$, with c_i the i-th column of C.

Prediction

Let us denote the training data by D_train = {X_train, Z_train, y_train} and the test data by D_test = {X_test, Z_test}. We jointly carry out variational inference on D_train and D_test. After the latent features for D_test are obtained (i.e., Q(U_test)), we predict the labels by

$$\mathbf{f}_{\mathrm{test}} = \langle\mathbf{U}_{\mathrm{test}}\rangle^\top \mathbf{m}, \qquad (16)$$

$$y^i_{\mathrm{test}} = \sum_{r=0}^{R-1} r \cdot \delta\big(b_r \le f^i_{\mathrm{test}} < b_{r+1}\big), \qquad (17)$$

where $y^i_{\mathrm{test}}$ is the prediction for the i-th test sample.
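Prediction therefore reduces to a matrix-vector product followed by thresholding against the cut points; a minimal sketch (cut points and dimensions are illustrative):

```python
import numpy as np

b = np.array([-np.inf, -0.5, 0.5, np.inf])    # cut points, R = 3 (illustrative)

def predict_ordinal(E_U_test, m):
    """Eq. (16)-(17): f_test = <U_test>^T m, then threshold into ordinal levels."""
    f_test = E_U_test.T @ m                   # one score per test subject
    return np.searchsorted(b, f_test, side="right") - 1

E_U_test = np.random.default_rng(2).standard_normal((5, 10))   # k x n_test moments
m = np.zeros(5)
print(predict_ordinal(E_U_test, m))           # all scores are 0 -> level 1 (MCI)
```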

Experimental Results and Discussion

We conducted association analysis and AD diagnosis on a dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The ADNI study is a longitudinal multisite observational study of elderly individuals with normal cognition, mild cognitive impairment (MCI) or AD. We applied our model to study the associations between genotypes and brain atrophy measured by MRI and to predict the subject status (normal vs. MCI vs. AD). Note that the statuses are ordinal, since they represent increasing severity levels.

The dataset was downloaded from http://adni.loni.ucla.edu/. After removing missing values, it consists of 618 subjects (183 normal, 308 MCI and 134 AD); each subject has 924 SNPs (selected as the top SNPs separating normal subjects from AD in ADNI) and 328 MRI features (measuring brain atrophy in different brain regions based on cortical thickness, surface area or volume, computed with the FreeSurfer software). The LD structure was retrieved from www.ncbi.nlm.nih.gov/books/NBK44495/.

To evaluate diagnosis accuracy, we compared with the following ordinal or multinomial regression methods: (1) lasso for multinomial regression (Tibshirani 1994), (2) elastic net for multinomial regression (Zou and Hastie 2005), (3) sparse ordinal regression with the spike and slab prior, (4) CCA + lasso, for which we first ran CCA to obtain the projected data and then applied lasso for prediction, (5) CCA + elastic net, which is the same as CCA + lasso except that we applied elastic net for prediction, (6) Gaussian Process Ordinal Regression (GPOR) (Chu and Ghahramani 2005), which employs Gaussian processes to learn the latent functions for ordinal regression, and (7) Laplacian Support Vector Machine (LapSVM) (Melacci and Mikhail 2011), a semi-supervised SVM classification scheme.


Figure 2: The variational lower bound for the model marginal likelihood.

We used the Glmnet package (Friedman, Hastie, and Tibshirani 2010) for lasso and elastic net, the GPOR package (Chu and Ghahramani 2005), and the LapSVM package (Melacci and Mikhail 2011). For all the methods, we used 10-fold cross validation (i.e., in each fold we have 556 training and 62 test samples) to tune free parameters, e.g., the kernel form and parameters for GPOR and LapSVM. Note that all the alternative methods stack X and Z together into a single data matrix and ignore their heterogeneous nature.

To determine the dimension k of the latent features U in our method, we calculated the variational lower bound, an approximation to the model evidence, for k ∈ {10, 20, 40, 60}. The variational lower bound can be easily calculated from the quantities presented in the model estimation section. We chose the value with the largest approximate evidence, which led to k = 20 (see Figure 2). Our experiments confirmed that with k = 20 our model achieved the highest prediction accuracy, demonstrating the benefit of evidence maximization.

As shown in Figure 3, our method achieved the highest prediction accuracy: 10% higher than the second best method, Gaussian Process Ordinal Regression, and 22% higher than the worst method, CCA + lasso. A two-sample t-test shows that our model outperforms the alternative methods significantly (p < 0.05).

We also examined the strongest associations discovered by our model. First, the ranking of MRI features in terms of prediction power for the three disease populations (normal, MCI and AD) shows that most of the top-ranked features are based on the cortical thickness measurement, whereas features based on volume and surface area estimation are less predictive. In particular, thickness measurements of the middle temporal lobe, precuneus and fusiform were found to be the most predictive among brain regions. These findings

Figure 3: Prediction accuracy with standard errors.

are consistent with the memory-related function of these regions and with findings in the literature on their prediction power for AD (Whitwell et al. 2007; Risacher et al. 2009; Teipel et al. 2013). We also found that measurements of the same structure on the left and right sides have similar weights, indicating that the algorithm can automatically select correlated features in groups, since no asymmetrical relationship has been found for the brain regions involved in AD (Kusbeci et al. 2009).

Second, the analysis associating genotype with AD prediction generated interesting results. Similar to the MRI features, SNPs that are in the vicinity of each other are often selected together, indicating the group selection characteristic of the algorithm. For example, the top-ranked SNPs are associated with a few genes including CAPZB (F-actin-capping protein subunit beta), NCOA2 (nuclear receptor coactivator 2) and BCAR3 (breast cancer anti-estrogen resistance protein 3).

Finally, biclustering of the gene-MRI associations, as shown in Figure 4, reveals interesting patterns in the relationship between genetic variations and brain atrophy measured by structural MRI. For example, the top-ranked SNPs are associated with a few genes including BCAR3 (breast cancer anti-estrogen resistance protein 3) and NCOA2, which have been studied more extensively in cancer research (Stephens et al. 2012). This set of SNPs is negatively associated with the cingulate, which is part of the limbic system and is involved in emotion formation and processing, in contrast to structures such as the temporal lobe, which plays a more important role in the formation of long-term memory.

Figure 4: The estimated associations between MRI features and SNPs. In each sub-figure, the MRI features are listed on the right and the SNP names are given at the bottom. (Panels (a) and (b) show heatmaps over SNPs in CAPZB, BCAR3, MAP3K1, NCOA2 and TRAF3 against volume, surface-area and cortical-thickness MRI measures.)

Related Work

Our model is closely related to probabilistic factor analysis methods, which learn a latent representation whose projection leads to the observed data (Tipping and Bishop 1999; Bach and Jordan 2005b; Guan and Dy 2009; Yu et al. 2006; Archambeau and Bach 2009; Virtanen, Klami, and Kaski 2011). In addition to learning a latent representation, our model uses the spike and slab prior to learn sparse projection matrices in order to select features in different data sources and find their critical associations. The spike and slab prior avoids confounding the degree of sparsity with the degree of regularization and has been shown to outperform l1 regularization in an unsupervised setting (Mohamed, Heller, and Ghahramani 2012).


Our model is also connected with learning from multiple sources, or multiview learning (Hardoon et al. 2008); many such methods aim to learn a better classifier for multi-label classification based on the correlation structure among the training data and the labels (Yu et al. 2006; Virtanen, Klami, and Kaski 2011). In contrast, our work conducts two tasks, association discovery and ordinal label prediction, simultaneously so that they benefit each other. Finally, our work can be considered an extension of (Zhe et al. 2013), incorporating the LD structure knowledge and using sparse ordinal regression to model disease severity levels.

Conclusions

We presented a new Bayesian multiview learning model for joint association discovery and disease prediction in AD study. We expect that our model can be further applied to a wide range of applications in biomedical research, e.g., eQTL analysis supervised by additional labeling information.

Acknowledgement

This work was supported by NSF IIS-0916443, IIS-1054903, CCF-0939370 and a Project 985 grant from the University of Electronic Science and Technology of China (No. A1098531023601041).

References

Archambeau, C., and Bach, F. 2009. Sparse probabilistic projections. In Advances in Neural Information Processing Systems 21, 73-80.
Bach, F., and Jordan, M. 2005a. A probabilistic interpretation of canonical correlation analysis. Technical report, UC Berkeley.
Bach, F., and Jordan, M. 2005b. A probabilistic interpretation of canonical correlation analysis. Technical report, UC Berkeley.
Chu, W., and Ghahramani, Z. 2005. Gaussian processes for ordinal regression. Journal of Machine Learning Research 6:1019-1041.
Falconer, D., and Mackay, T. 1996. Introduction to Quantitative Genetics (4th ed.). Addison Wesley Longman.
Friedman, J.; Hastie, T.; and Tibshirani, R. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1):1-22.
George, E., and McCulloch, R. 1997. Approaches for Bayesian variable selection. Statistica Sinica 7(2):339-373.
Guan, Y., and Dy, J. 2009. Sparse probabilistic principal component analysis. Journal of Machine Learning Research - Proceedings Track 5:185-192.
Hardoon, D.; Leen, G.; Kaski, S.; and Shawe-Taylor, J., eds. 2008. NIPS Workshop on Learning from Multiple Sources.
Harold, H. 1936. Relations between two sets of variates. Biometrika 28:321-377.
Khachaturian, S. 1985. Diagnosis of Alzheimer's disease. Archives of Neurology 42(11):1097-1105.
Kusbeci, O. Y.; Bas, O.; Gocmen-Mas, N.; Karabekir, H. S.; Yucel, A.; Ertekin, T.; and Yazici, A. C. 2009. Evaluation of cerebellar asymmetry in Alzheimer's disease: a stereological study. Dementia and Geriatric Cognitive Disorders 28(1):1-5.
Lasserre, J.; Bishop, C. M.; and Minka, T. P. 2006. Principled hybrids of generative and discriminative models. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, 87-94.
MacKay, D. 1991. Bayesian interpolation. Neural Computation 4:415-447.
Melacci, S., and Mikhail, B. 2011. Laplacian support vector machines trained in the primal. Journal of Machine Learning Research 12:1149-1184.
Mohamed, S.; Heller, K.; and Ghahramani, Z. 2012. Bayesian and L1 approaches for sparse unsupervised learning. In ICML'12.
Neal, R. 1996. Bayesian Learning for Neural Networks.
Risacher, S.; Saykin, A.; West, J.; Firpi, H.; and McDonald, B. 2009. Baseline MRI predictors of conversion from MCI to probable AD in the ADNI cohort. Current Alzheimer Research 6:347-361.
Stephens, P.; Tarpey, P.; Davies, H.; Van Loo, P.; Greenman, C.; Wedge, D.; et al. 2012. The landscape of cancer genes and mutational processes in breast cancer. Nature 486(7403):400-404.
Teipel, S.; Grothe, M.; Lista, S.; Toschi, N.; Garaci, F.; and Hampel, H. 2013. Relevance of magnetic resonance imaging for early detection and diagnosis of Alzheimer disease. Medical Clinics of North America 97(3):399-424.
Tibshirani, R. 1994. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58:267-288.
Tipping, M., and Bishop, C. 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61:611-622.
Virtanen, S.; Klami, A.; and Kaski, S. 2011. Bayesian CCA via group sparsity. In ICML'11, 457-464.
Whitwell, J.; Przybelski, S.; Weigand, S.; et al. 2007. 3D maps from multiple MRI illustrate changing atrophy patterns as subjects progress from mild cognitive impairment to Alzheimer's disease. Brain 130(7):1777-1786.
Yu, S.; Yu, K.; Tresp, V.; Kriegel, H.; and Wu, M. 2006. Supervised probabilistic principal component analysis. In KDD'06, 464-473.
Zhe, S.; Xu, Z.; Qi, Y.; and Yu, P. 2013. Joint association discovery and diagnosis of Alzheimer's disease by supervised heterogeneous multiview learning. In Pacific Symposium on Biocomputing, volume 19, 300-311. World Scientific.
Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67:301-320.
