Integrating heterogeneous genetic data sets based on rigorous mathematical foundation

Oct 21, 2010, József Bukszár et al.

Center for Biomarker Research and Personalized Medicine

et al.: Amit N. Khachane, Karolina Aberg, Youfang Liu, Joseph L. McClay, Patrick F. Sullivan, Edwin J. C. G. van den Oord

Data integration

Existing data set (EDS) (e.g. linkage data)

Compound local True Discovery Rate (cℓTDR)

Novel data collection (NDC) (e.g. GWAS, sequencing)

Incorporation



Our goal is to find test units that have an effect in the novel data collection (NDC), also using information from existing data sets (EDS-s).

The cℓTDR of a genetic unit is the posterior probability that the genetic unit has an effect in the NDC based on the information in the NDC and the EDS-s.

Genetic unit: SNP in GWAS, gene in expression data, an entire chromosomal segment in next-generation sequencing of regions of interest.

The ℓTDR



The ℓTDR of a genetic unit is the posterior probability that the genetic unit has an effect in the NDC based on the information in the NDC only.

Genetic unit: SNP in GWAS, gene in expression data, an entire chromosomal segment in next-generation sequencing of regions of interest.

The compound ℓTDR (cℓTDR)

The compound ℓTDR (cℓTDR) of a test unit is defined as the posterior probability that the test unit is alternative in the NDC based on the information we have from the EDS-s and the NDC:

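$$c\ell\mathrm{TDR}(j) = \Pr\left(H_j^{\mathrm{NDC}} = 1 \mid T_j = t_j,\ S_{ij} = s_{ij},\ i = 1, \dots, k\right),$$

where

$t_j$ is the observed statistic of test unit j in the NDC,

$H_j^{\mathrm{NDC}} = 1$ is the event that test unit j is alternative in the NDC,

$s_{ij}$ is the rank of test unit j in the i-th EDS.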

A direct consequence of the definition:

the sum of the cℓTDR-s of the genetic units in a group of genetic units = the expected number of genetic units with effect in this group.

Using cℓTDR / ℓTDR

Cumulative cℓTDR at k = the sum of the largest k cℓTDR-s = the expected number of genetic units with effect among the k genetic units with largest cℓTDR-s.

Cumulative cℓTDR at 6000 = 1822.

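[Figure, two panels (all genetic units; 10,000 genetic units with largest cℓTDR-s): cumulative cℓTDR curve. Marked values: 1822 (cumulative cℓTDR at 6000) and 5703 (the total number of genetic units with effect).]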
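The cumulative cℓTDR follows directly from its definition as the sum of the k largest values. The sketch below is only an illustration: the function name `cumulative_tdr` and the five example values are hypothetical and do not come from the data shown on the slides.

```python
def cumulative_tdr(tdr_values, k):
    """Sum of the k largest (c)lTDR values.

    By the definition above, this is the expected number of genetic
    units with effect among the k units with the largest values.
    """
    if not 0 <= k <= len(tdr_values):
        raise ValueError("k must be between 0 and the number of units")
    ranked = sorted(tdr_values, reverse=True)
    return sum(ranked[:k])


# Hypothetical clTDR values for five genetic units (made up for illustration).
example_tdr = [0.9, 0.1, 0.6, 0.3, 0.05]
top2 = cumulative_tdr(example_tdr, 2)  # 0.9 + 0.6
```

Evaluating the function for k = 1, ..., n gives the points of the cumulative curve.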

Using cℓTDR / ℓTDR


In the ideal scenario we would have 5703 genetic units with cℓTDR 1, and the rest with cℓTDR 0 (blue curve).

The closer the cumulative cℓTDR/ℓTDR curve is to the blue curve, the more information we have.

How can we estimate cℓTDR / ℓTDR accurately?

Estimating cℓTDR for the NDC using (combined) prior probabilities from the EDS-s

The cℓTDR of a test unit j can be calculated as

$$c\ell\mathrm{TDR}(j) = \frac{\pi_j^{\text{combined prior}}\, f_1(t_j)}{\left(1 - \pi_j^{\text{combined prior}}\right) f_0(t_j) + \pi_j^{\text{combined prior}}\, f_1(t_j)},$$

where

$\pi_j^{\text{combined prior}}$ is the prior probability that test unit j is alternative in the NDC based on the combined information we have from all EDS-s,

$f_0$ and $f_1$ are the null and alternative density functions (pdf) in the NDC, resp.,

$t_j$ is the observed statistic of test unit j in the NDC.

We need to estimate/have

1. $\pi_j^{\text{combined prior}}$,

2. $f_0$ and $f_1$ (pdf-s in the NDC),

which we will plug into the above formula.
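The plug-in step can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: the formula used is cℓTDR(j) = prior·f1(t) / ((1 − prior)·f0(t) + prior·f1(t)), and the normal densities chosen for `f0` and `f1`, as well as all function names, are hypothetical placeholders for whatever estimates are actually available.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here only as a stand-in pdf."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def f0(t):
    # ASSUMPTION for illustration: null statistics follow N(0, 1).
    return normal_pdf(t, 0.0, 1.0)


def f1(t):
    # ASSUMPTION for illustration: alternative statistics follow N(3, 1).
    return normal_pdf(t, 3.0, 1.0)


def compound_ltdr(prior, t, f0, f1):
    """Posterior probability that a test unit is alternative in the NDC,
    given its combined prior and its observed NDC statistic t."""
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be a probability")
    alt = prior * f1(t)
    null = (1.0 - prior) * f0(t)
    return alt / (null + alt)


# The same NDC statistic gets a higher clTDR when the EDS-based prior is higher.
low = compound_ltdr(0.05, 2.0, f0, f1)
high = compound_ltdr(0.30, 2.0, f0, f1)
```

Using the same prior for every test unit reduces this to the ℓTDR, which relies on the NDC only.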

[Diagram: Existing data set (EDS) (e.g. linkage data); Transformation into test unit data; Test units; Incorporation]





First step: we transform every data set in such a way that they are all based on the same genetic unit.

This common genetic unit will be referred to as the test unit.

Example: If the NDC is a GWAS and the EDS-s are gene expression data, then we can transform the EDS-s into SNP-based data sets. The test unit will be the SNP.

Estimating the combined prior probabilities $\pi_j^{\text{combined prior}}$

1. We estimate prior probabilities of test units for each EDS.

2. We combine the sets of prior probabilities into a single set of prior probabilities.





[Diagram: three sets of prior probabilities (one per EDS); Incorporation; Combined prior probabilities]

Estimating prior probabilities for a single EDS

Theorem: For the (prior) probability $\pi_i^*$ that test unit i is alternative in the NDC we have that

$$\pi_i^* = \frac{m_1^*}{m} + co(i),$$

where

$$co(i) = \left(\rho(r_i) - \frac{m_1}{m}\right)\left(\frac{m_1^{\text{overlap}}}{m_1} - \frac{m_1^* - m_1^{\text{overlap}}}{m_0}\right)$$

is the contribution of test unit i from the EDS to the NDC, and

$r_i$ is the rank of test unit i in the EDS,

$\rho(r)$ is the probability that a test with rank r in the EDS is alternative in the EDS,

$m$ is the number of test units that are in the EDS and in the NDC,

$m_1^*$ is the number of test units in the EDS that are alternative in the NDC,

$m_1$ is the number of test units alternative in the EDS, and $m_0 = m - m_1$,

$m_1^{\text{overlap}}$ is the number of test units that are alternative in the EDS and in the NDC.

Statistic for estimating the contributions

We will use the test statistic

$$Q_{d,M} - \frac{M}{m}\, m(d) = \#\{j : |t_j| \ge d,\ r_j \le M\} - \frac{M}{m}\, \#\{j : |t_j| \ge d\},$$

where

$$Q(d, M) = Q_{d,M} = \#\{j : |t_j| \ge d,\ r_j \le M\} \quad \text{and} \quad m(d) = \#\{j : |t_j| \ge d\},$$

$t_j$ is the NDC test statistic of test unit j (the higher the better) and $r_j$ is the rank of test unit j in the EDS (the lower the better).

The above statistic can be calculated for any real number d≥0 and positive integer M.
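A direct way to evaluate this statistic is to count the two sets it compares. The Python sketch below assumes the statistic is #{j : |t_j| ≥ d, r_j ≤ M} − (M/m)·#{j : |t_j| ≥ d}, with m the number of test units; the function name and the four-unit example are hypothetical.

```python
def contribution_statistic(t, r, d, M):
    """Observed minus expected count of test units with |t_j| >= d
    among the M top-ranked units of the EDS.

    t: NDC test statistics, r: EDS ranks (1 = best), same order of units.
    """
    if len(t) != len(r):
        raise ValueError("t and r must describe the same test units")
    m = len(t)
    # Units that are extreme in the NDC and top-ranked in the EDS.
    q_dM = sum(1 for tj, rj in zip(t, r) if abs(tj) >= d and rj <= M)
    # Units that are extreme in the NDC, regardless of their EDS rank.
    m_d = sum(1 for tj in t if abs(tj) >= d)
    return q_dM - (M / m) * m_d


# Hypothetical data: the two units ranked best in the EDS also have
# the largest NDC statistics.
t_example = [3.1, 0.2, 2.5, 0.4]
r_example = [1, 3, 2, 4]
value = contribution_statistic(t_example, r_example, d=2.0, M=2)  # 2 - (2/4)*2 = 1.0
```

For M = m the two counts coincide and the statistic is exactly 0.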

The rationale of the statistic

$$Q_{d,M} - \frac{M}{m}\, m(d)$$

The statistic fluctuates around 0 if being an alternative in the EDS is independent of being an alternative in the NDC.

The idea is that the group of test units with |tj| ≥ d is likely to contain “many” test units with effect if d is large.

If test units that are alternative in the NDC are more likely to be alternative in the EDS than the test units that are null in the NDC, then the statistic will be positive for small M and large d.

Stanley schiz. as EDS and 16 GWAS as NDC

[Figure: statistic values plotted against M, for the Stanley data and for the reshuffled Stanley data. The Stanley p-values were reshuffled on the genes.]

Stanley schiz. as EDS

[Figure, two panels (full range of M; lower range of M): statistic values plotted against M.]

Stanley bipolar

[Figure, two panels (full range of M; lower range of M): statistic values plotted against M.]

Lewis

[Figure, two panels (full range of M; lower range of M): statistic values plotted against M.]

Theorem: For the expected value of our statistic we have that

$$E\left[Q_{d,M} - \frac{M}{m}\, m(d)\right] = \left(F_0(d) - F_1(d)\right) \sum_{j:\, r_j \le M} co(j),$$

where

$F_0$ and $F_1$ are the null and the alternative c.d.f., resp., in the NDC, i.e.

$$F_0(d) = \Pr\left(|T| \le d \mid T \text{ is a test statistic of a null test unit in the NDC}\right),$$

$$F_1(d) = \Pr\left(|T| \le d \mid T \text{ is a test statistic of an alternative test unit in the NDC}\right).$$

Note that F0 and F1 are well-defined, i.e. they always exist, even if we do not know them.

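Statistic for estimating the contributions

From the previous slide:

$$E\left[Q_{d,M} - \frac{M}{m}\, m(d)\right] = \left(F_0(d) - F_1(d)\right) \sum_{j:\, r_j \le M} co(j).$$

Notation:

$$CO(M) = \sum_{j:\, r_j \le M} co(j).$$

It follows from the previous theorem that

$$\widehat{CO}(M) = \frac{1}{|D|} \sum_{d \in D} \frac{Q_{d,M} - \frac{M}{m}\, m(d)}{F_0(d) - F_1(d)}$$

is an unbiased estimator of CO(M), where D is a set of positive real numbers.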

If there are no ties among ranks in the EDS, then we use

$$\widehat{co}(i) = \widehat{CO}(M) - \widehat{CO}(M-1)$$

to estimate the contributions co(i) for i=1,…,m, where i is the test unit whose rank is M in the EDS and $\widehat{CO}(0)$ is defined as 0.

The case when we have ties among ranks can be handled as well.
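The differencing step can be illustrated as follows. This is a rough sketch under several assumptions that are not taken from the slides: the estimate of CO(M) is taken to be the average over d in D of the statistic divided by F0(d) − F1(d); F0 and F1 are treated as known and are the c.d.f.-s of |T| for N(0, 1) and N(3, 1) statistics; ranks are 1, ..., m without ties; all names and data are hypothetical. The raw estimates are returned without any smoothing.

```python
import math


def phi(x):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def F0(d):
    # ASSUMPTION: null statistics are N(0, 1); this is the c.d.f. of |T|.
    return phi(d) - phi(-d)


def F1(d):
    # ASSUMPTION: alternative statistics are N(3, 1); c.d.f. of |T|.
    return phi(d - 3.0) - phi(-d - 3.0)


def estimate_contributions(t, r, D, F0, F1):
    """Raw estimates of co(i), obtained by differencing the estimates
    of CO(M) for M = 1, ..., m, with the estimate of CO(0) set to 0."""
    m = len(t)
    if sorted(r) != list(range(1, m + 1)):
        raise ValueError("ranks must be 1..m without ties")
    unit_at_rank = {rank: j for j, rank in enumerate(r)}
    co_hat = [0.0] * m
    previous = 0.0  # estimate of CO(0)
    for M in range(1, m + 1):
        total = 0.0
        for d in D:
            q_dM = sum(1 for tj, rj in zip(t, r) if abs(tj) >= d and rj <= M)
            m_d = sum(1 for tj in t if abs(tj) >= d)
            total += (q_dM - (M / m) * m_d) / (F0(d) - F1(d))
        co_M = total / len(D)
        co_hat[unit_at_rank[M]] = co_M - previous
        previous = co_M
    return co_hat


# Hypothetical data: units ranked best in the EDS have the largest NDC statistics.
t_example = [3.1, 0.2, 2.5, 0.4]
r_example = [1, 3, 2, 4]
co_example = estimate_contributions(t_example, r_example, [2.0], F0, F1)
```

In this toy example the two top-ranked units receive positive estimated contributions and the two bottom-ranked units negative ones; the estimates sum to zero because the statistic vanishes at M = m.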

The co(.) is lower for test units whose rank is larger in the EDS, i.e.

$$co(i) = CO(M) - CO(M-1),$$

where i is the test unit whose rank is M in the EDS, is a decreasing function of M, i.e. CO(M) is a concave function.

Consequently, the estimates should follow the same pattern, i.e.

$$\widehat{co}(i) = \widehat{CO}(M) - \widehat{CO}(M-1)$$

should be a decreasing function of M, i.e. $\widehat{CO}(M)$ should be concave.

Smoothing needs to be done.

Great deal of details …

Combining prior probabilities from multiple EDS-s

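The combined odds of test unit i are defined as

$$\mathrm{odds}_i^{\text{combined prior}} = \frac{\pi_i^{\text{combined prior}}}{1 - \pi_i^{\text{combined prior}}},$$

where $\pi_i^{\text{combined prior}}$ is the combined prior probability of test unit i.

Under mild assumptions, for the combined odds we have that

$$\mathrm{odds}_i^{\text{combined prior}} = \frac{m_1^{(\mathrm{NDC})}}{m_0^{(\mathrm{NDC})}} \prod_{j=1}^{k} \left(\frac{\pi^*_{ij}}{1 - \pi^*_{ij}} \cdot \frac{m_0^{(\mathrm{NDC})}}{m_1^{(\mathrm{NDC})}}\right),$$

where $\pi^*_{ij}$ is the prior probability that test unit i is alternative based on the j-th EDS, $m_1^{(\mathrm{NDC})}$ is the number of alternative test units in the NDC, and $m_0^{(\mathrm{NDC})} = m^{(\mathrm{NDC})} - m_1^{(\mathrm{NDC})}$.

We replace the terms with estimates in the above formula to obtain an estimate of the combined odds.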
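The combination step can be sketched in Python. The sketch assumes that the combined odds equal the baseline NDC odds m1/m0 multiplied, for each EDS, by the ratio of that EDS's prior odds to the baseline odds; this is one reading of the formula, and the function name `combined_prior` and all numbers are hypothetical.

```python
def combined_prior(priors, m1_ndc, m_ndc):
    """Combine per-EDS prior probabilities of one test unit into a
    single prior probability via the combined odds.

    priors: prior probabilities of this unit, one per EDS (each in (0, 1)).
    m1_ndc: (estimated) number of alternative test units in the NDC.
    m_ndc:  number of test units in the NDC.
    """
    if not 0 < m1_ndc < m_ndc:
        raise ValueError("need 0 < m1_ndc < m_ndc")
    m0_ndc = m_ndc - m1_ndc
    baseline_odds = m1_ndc / m0_ndc
    odds = baseline_odds
    for p in priors:
        if not 0.0 < p < 1.0:
            raise ValueError("each prior must be strictly between 0 and 1")
        # Each EDS scales the odds by how far its prior odds depart
        # from the baseline odds.
        odds *= (p / (1.0 - p)) / baseline_odds
    return odds / (1.0 + odds)


# Two hypothetical EDS-s that both favour the same test unit reinforce each other.
single = combined_prior([0.3], m1_ndc=100, m_ndc=1000)
double = combined_prior([0.3, 0.3], m1_ndc=100, m_ndc=1000)
```

With a single EDS the combined prior equals that EDS's prior, and an EDS whose prior equals the baseline m1/m leaves the result unchanged.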

Simulation results

black: cumulative cℓTDR curve

blue: # of test units alternative in the NDC in the top test units when cℓTDR-based selection was used

green: # of test units alternative in the NDC in the top test units when ℓTDR-based selection was used

red: cumulative ℓTDR curve

16 GWAS as NDC

We developed a parametric method for GWAS NDC.

For estimating the prior probabilities, we need the null and the alternative c.d.f. (F0 and F1), and the number of alternative test units in the NDC.

[Figure, two panels: original p-values (lambda = 1.169); p-values “corrected” by the estimated distributions (lambda = 1.00048).]

Existing data sets used

• Stanley schizophrenia expression data

• Lewis linkage data

• OMIM (without Schiz. DB data)

• Candidate genes Schiz. Database

• SLEP human mouse orthologs

• NR EQTL

Black: cumulative cℓTDR curve

Red: cumulative cℓTDR curve

Replication study on two GWAS

Blue: empirical distribution of the cℓTDR-selected p-values

Red: empirical distribution of the ℓTDR-selected p-values

Light blue: empirical distribution of randomly selected p-values
