Integrating heterogeneous genetic data sets based on rigorous mathematical foundation
Oct 21, 2010
József Bukszár et al.
Center for biomarker research and personalized medicine
et al.: Amit N. Khachane, Karolina Aberg, Youfang Liu, Joseph L. McClay, Patrick F. Sullivan, Edwin J. C. G. van den Oord
Data integration
[Diagram: several existing data sets (EDS, e.g. linkage data) are incorporated with the novel data collection (NDC, e.g. GWAS, sequencing) to obtain the compound local True Discovery Rate (cℓTDR).]
Our goal is to find test units that have an effect in the novel data collection (NDC), using information from existing data sets (EDS-s) as well.
The cℓTDR of a genetic unit is the posterior probability that the genetic unit has an effect in the NDC, based on the information in the NDC and the EDS-s.
Genetic unit: SNP in GWAS, gene in expression data, an entire chromosomal segment in next-generation sequencing of regions of interest.
The ℓTDR
[Diagram: novel data collection (NDC, e.g. GWAS, sequencing) and compound local True Discovery Rate (cℓTDR).]
The ℓTDR of a genetic unit is the posterior probability that the genetic unit has an effect in the NDC, based on the information in the NDC only.
The compound ℓTDR (cℓTDR)
The compound ℓTDR (cℓTDR) of a test unit is defined as the posterior probability that the test unit is alternative in the NDC based on the information we have from the EDS-s and the NDC:
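$$ \mathrm{c}\ell\mathrm{TDR}(j) = \Pr\left( H_j^{\mathrm{NDC}} = 1 \;\middle|\; T_j = t_j,\; S_{ij} = s_{ij},\; i = 1, \ldots, k \right), $$
where
$H_j^{\mathrm{NDC}} = 1$ is the event that test unit j is alternative in the NDC,
$t_j$ is the observed statistic of test unit j in the NDC,
$s_{ij}$ is the rank of test unit j in the i-th EDS.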
A direct consequence of the definition:
the sum of the cℓTDR-s of the genetic units in a group of genetic units = the expected number of genetic units with effect in this group.
Using cℓTDR / ℓTDR
Cumulative cℓTDR at k = the sum of the largest k cℓTDR-s = the expected number of genetic units with effect among the k genetic units with the largest cℓTDR-s.
Cumulative cℓTDR at 6000 = 1822.
[Figure: cumulative cℓTDR curve, shown for all genetic units and for the 10,000 genetic units with the largest cℓTDR-s; marked values: 1822, and 5703 (the total number of genetic units with effect).]
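The cumulative cℓTDR can be sketched in a few lines. This is a minimal illustration only: the function name `cumulative_ctdr` and the six example probabilities are hypothetical and do not come from the presentation.

```python
def cumulative_ctdr(ctdr_values, k):
    """Sum of the k largest cℓTDR values.

    By the definition of the cℓTDR as a posterior probability, this sum is
    the expected number of genetic units with effect among the top k units.
    """
    top_k = sorted(ctdr_values, reverse=True)[:k]
    return sum(top_k)


# Hypothetical posterior probabilities for six genetic units.
example_ctdr = [0.9, 0.1, 0.75, 0.05, 0.5, 0.2]

# Expected number of units with effect among the three top-ranked units.
expected_in_top3 = cumulative_ctdr(example_ctdr, 3)

# Expected number of units with effect among all units.
expected_overall = cumulative_ctdr(example_ctdr, len(example_ctdr))
```

Plotting `cumulative_ctdr(values, k)` against k would give a curve of the kind described on this slide.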
Using cℓTDR / ℓTDR
In the ideal scenario we would have 5703 genetic units with cℓTDR 1, and the rest with cℓTDR 0 (blue curve).
The closer the cumulative cℓTDR/ℓTDR curve is to the blue curve, the more information we have.
How can we estimate cℓTDR / ℓTDR accurately?
Estimating cℓTDR for the NDC using (combined) prior probabilities from the EDS-s
The cℓTDR of a test unit j can be calculated as
$$ \mathrm{c}\ell\mathrm{TDR}(j) = \frac{\pi_j^{\text{combined prior}}\, f_1(t_j)}{\left(1 - \pi_j^{\text{combined prior}}\right) f_0(t_j) + \pi_j^{\text{combined prior}}\, f_1(t_j)}, $$
where
π_j (combined prior) is the prior probability that test unit j is alternative in the NDC based on the combined information we have from all EDS-s,
f0 and f1 are the null and alternative density functions (pdf) in the NDC, resp.,
tj is the observed statistic of test unit j in the NDC.
We need to estimate/have
1. π_j (combined prior),
2. f0 and f1 (pdf-s in the NDC),
which we will plug into the above formula.
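The plug-in step can be sketched as follows. The normal densities chosen for f0 and f1 and the prior values are assumptions made only for this illustration; the presentation does not specify them at this point.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (used here only as an assumed example)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def ctdr(prior, t, f0, f1):
    """Posterior probability that a test unit is alternative in the NDC.

    prior : combined prior probability of the test unit
    t     : observed NDC test statistic of the test unit
    f0/f1 : null / alternative density functions in the NDC
    """
    numerator = prior * f1(t)
    denominator = (1.0 - prior) * f0(t) + numerator
    return numerator / denominator


# Assumed densities: null N(0, 1), alternative N(3, 1).
f0 = lambda t: normal_pdf(t, 0.0, 1.0)
f1 = lambda t: normal_pdf(t, 3.0, 1.0)

# Same observed statistic, two hypothetical combined priors.
low_prior = ctdr(0.01, 3.0, f0, f1)
high_prior = ctdr(0.20, 3.0, f0, f1)
```

With the same observed statistic, the test unit with the larger combined prior receives the larger cℓTDR, which is how the EDS information enters the result.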
Test units
[Diagram: each existing data set (EDS, e.g. linkage data) is transformed into test unit data and then incorporated with the novel data collection (NDC, e.g. GWAS, sequencing) to obtain the compound local True Discovery Rate (cℓTDR).]
First step: we transform every data set in such a way that they are all based on the same genetic unit.
This common genetic unit will be referred to as the test unit.
Example: If the NDC is a GWAS and the EDS-s are gene expression data, then we can transform the EDS-s into SNP-based data sets. The test unit will be the SNP.
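One possible way to carry out such a transformation is sketched below. The presentation does not say how gene-level values are mapped to SNPs, so the rule used here (a SNP inherits the smallest p-value among the genes it maps to) is purely an assumption, and all gene names, SNP identifiers and values are hypothetical.

```python
def gene_values_to_snp_values(gene_p_values, snp_to_genes):
    """Turn a gene-based data set into a SNP-based one.

    Assumed rule (not from the presentation): each SNP takes the smallest
    p-value among the genes it is mapped to. SNPs that map to no gene with
    a known value are left out of the result.
    """
    snp_p_values = {}
    for snp, genes in snp_to_genes.items():
        known = [gene_p_values[g] for g in genes if g in gene_p_values]
        if known:
            snp_p_values[snp] = min(known)
    return snp_p_values


# Hypothetical gene-level p-values from an expression EDS.
gene_p = {"GENE_A": 0.001, "GENE_B": 0.20}

# Hypothetical SNP-to-gene mapping (e.g. by physical proximity).
snp_map = {
    "snp_1": ["GENE_A"],
    "snp_2": ["GENE_A", "GENE_B"],
    "snp_3": ["GENE_C"],  # no expression value available for GENE_C
}

snp_p = gene_values_to_snp_values(gene_p, snp_map)
```

After this step the EDS and the NDC share the SNP as the common test unit, so the EDS can be ranked on the same units as the NDC.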
Estimating the combined prior probabilities π_j (combined prior)
1. We estimate prior probabilities of test units for each EDS.
2. We combine the sets of prior probabilities into a single set of prior probabilities.
[Diagram: each existing data set (EDS, e.g. linkage data) is transformed into test unit data and yields a set of prior probabilities; these are merged into combined prior probabilities, which are incorporated with the novel data collection (NDC, e.g. GWAS, sequencing) to obtain the compound local True Discovery Rate (cℓTDR).]
Theorem: For the (prior) probability $\pi^*_i$ that test unit i is alternative in the NDC we have that
$$ \pi^*_i = co(i) + \frac{m_1^*}{m}, $$
where
$$ co(i) = \frac{m\,\pi(r_i) - m_1}{m_0} \left( \frac{m_1^{\text{overlap}}}{m_1} - \frac{m_1^*}{m} \right) $$
is the contribution of test unit i from the EDS to the NDC, and
r_i is the rank of test unit i in the EDS,
π(r) is the probability that a test with rank r in the EDS is alternative in the EDS,
m is the number of test units that are in the EDS and in the NDC,
m_1 is the number of test units alternative in the EDS, and m_0 = m − m_1,
m_1^* is the number of test units in the EDS that are alternative in the NDC,
m_1^overlap is the number of test units that are alternative in the EDS and in the NDC.
Estimating prior probabilities for a single EDS
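We will use the test statistic
$$ Q_{d,M} - \frac{M}{m}\, m(d), $$
where
$$ Q_{d,M} = Q(d, M) = \#\{\, j : |t_j| \ge d,\; r_j \le M \,\}, $$
t_j is the NDC test statistic of test unit j and r_j is the rank of test unit j in the EDS, and
$$ m(d) = \#\{\, j : |t_j| \ge d \,\}. $$
For |t_j|, the higher the better; for r_j, the lower the better.
Statistic for estimating the contributions
Written out in full, the statistic is
$$ \#\{\, j : |t_j| \ge d,\; r_j \le M \,\} - \frac{M}{m}\, \#\{\, j : |t_j| \ge d \,\}. $$
The above statistic can be calculated for any real number d ≥ 0 and positive integer M.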
The rationale of the statistic
$$ Q_{d,M} - \frac{M}{m}\, m(d) $$
The statistic fluctuates around 0 if being an alternative in the EDS is independent of being an alternative in the NDC.
The idea is that the group of test units with |tj| ≥ d is likely to contain “many” test units with effect if d is large.
If test units that are alternative in the NDC are more likely to be alternative in the EDS than the test units that are null in the NDC, then the statistic will be positive for small M and large d.
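The behaviour described above can be checked on a toy example. The sketch below computes the statistic as the number of test units with |t_j| ≥ d and r_j ≤ M, minus M/m times the number of test units with |t_j| ≥ d; the test statistics and the two rankings are hypothetical values chosen for illustration.

```python
def q_statistic(t_values, ranks, d, M):
    """Overlap statistic between an EDS ranking and NDC test statistics.

    t_values : NDC test statistics, one per test unit
    ranks    : EDS ranks of the same test units (1 = best)
    d        : threshold on |t_j| (d >= 0)
    M        : threshold on the EDS rank (positive integer)
    """
    m = len(t_values)
    # m(d): number of test units whose |t_j| reaches the threshold d.
    m_d = sum(1 for t in t_values if abs(t) >= d)
    # Number of those units that are also among the top M ranks in the EDS.
    q_dm = sum(1 for t, r in zip(t_values, ranks) if abs(t) >= d and r <= M)
    # Subtract the count expected if rank and |t_j| were unrelated.
    return q_dm - (M / m) * m_d


# Hypothetical NDC statistics for eight test units; three are large.
t_values = [3.1, -2.8, 0.4, -0.2, 2.9, 0.1, -0.5, 0.3]

# EDS ranking that agrees with the NDC (large |t| units ranked 1-3).
concordant_ranks = [1, 2, 5, 6, 3, 7, 8, 4]
# EDS ranking that disagrees (large |t| units ranked last).
discordant_ranks = [6, 7, 1, 2, 8, 3, 4, 5]

stat_concordant = q_statistic(t_values, concordant_ranks, d=2.5, M=3)
stat_discordant = q_statistic(t_values, discordant_ranks, d=2.5, M=3)
```

In this toy case the statistic is positive when the EDS ranking agrees with the NDC and negative when it disagrees; with M equal to the number of test units it is exactly 0.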
Stanley schiz. as EDS and 16 GWAS as NDC
[Figure: statistic values plotted against M for the Stanley data and for the reshuffled Stanley data.]
The Stanley p-values were reshuffled on the genes.
Stanley schiz. as EDS
[Figure: statistic values plotted against M; panels: full range of M, lower range of M.]
Stanley bipolar
[Figure: statistic values plotted against M; panels: full range of M, lower range of M.]
Lewis
[Figure: statistic values plotted against M; panels: full range of M, lower range of M.]
Theorem: For the expected value of our statistic we have that
$$ E\left[ Q_{d,M} - \frac{M}{m}\, m(d) \right] = \bigl( F_0(d) - F_1(d) \bigr) \sum_{j:\, r_j \le M} co(j), $$
where
F0 and F1 are the null and the alternative c.d.f., resp., in the NDC, i.e.
$$ F_0(d) = \Pr\bigl( |T| < d \mid T \text{ is a test statistic of a null test unit in the NDC} \bigr), $$
$$ F_1(d) = \Pr\bigl( |T| < d \mid T \text{ is a test statistic of an alternative test unit in the NDC} \bigr). $$
Note that F0 and F1 are well-defined, i.e. they always exist, even if we do not know them.
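Statistic for estimating the contributions
Notation:
$$ CO(M) = \sum_{j:\, r_j \le M} co(j). $$
From the previous slide:
$$ E\left[ Q_{d,M} - \frac{M}{m}\, m(d) \right] = \bigl( F_0(d) - F_1(d) \bigr) \sum_{j:\, r_j \le M} co(j). $$
It follows from the previous theorem that
$$ \widehat{CO}(M) = \frac{1}{|D|} \sum_{d \in D} \frac{Q(d, M) - \frac{M}{m}\, m(d)}{F_0(d) - F_1(d)} $$
is an unbiased estimator of CO(M), where D is a set of positive real numbers.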
If there are no ties among ranks in the EDS, then we use
$$ \widehat{co}(i) = \widehat{CO}(M) - \widehat{CO}(M-1) $$
to estimate the contributions co(i) for i = 1, …, m, where i is the test unit whose rank is M in the EDS, and $\widehat{CO}(0)$ is defined to be 0.
The case when we have ties among ranks can be handled as well.
The co(·) is lower for test units whose rank is larger in the EDS, i.e.
$$ co(i) = CO(M) - CO(M-1), $$
where i is the test unit whose rank is M in the EDS, is a decreasing function of M, i.e. CO(M) is a concave function.
Consequently, the estimates should follow the same pattern, i.e. $\widehat{CO}(M) - \widehat{CO}(M-1)$ should be a decreasing function of M, i.e. $\widehat{CO}(M)$ should be concave.
Smoothing needs to be done.
Great deal of details …
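The estimation of the contributions without ties can be sketched as below. The sketch averages the statistic divided by F0(d) − F1(d) over a grid D and then differences the result in M. The normal models behind `F0` and `F1`, the grid and the data are assumptions made for illustration, and the smoothing step that enforces concavity is deliberately not shown because the presentation gives no details on it.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def F0(d):
    # Assumed null c.d.f. of |T| when T ~ N(0, 1).
    return std_normal_cdf(d) - std_normal_cdf(-d)


def F1(d, shift=3.0):
    # Assumed alternative c.d.f. of |T| when T ~ N(shift, 1).
    return std_normal_cdf(d - shift) - std_normal_cdf(-d - shift)


def estimate_CO(t_values, ranks, M, d_grid):
    """Average over d in D of the statistic divided by F0(d) - F1(d)."""
    m = len(t_values)
    total = 0.0
    for d in d_grid:
        m_d = sum(1 for t in t_values if abs(t) >= d)
        q_dm = sum(1 for t, r in zip(t_values, ranks) if abs(t) >= d and r <= M)
        total += (q_dm - (M / m) * m_d) / (F0(d) - F1(d))
    return total / len(d_grid)


def estimate_contributions(t_values, ranks, d_grid):
    """Raw (unsmoothed) contribution estimates, indexed by EDS rank 1..m."""
    m = len(t_values)
    # The estimate at M = 0 is defined to be 0.
    co_cum = [0.0] + [estimate_CO(t_values, ranks, M, d_grid) for M in range(1, m + 1)]
    return [co_cum[M] - co_cum[M - 1] for M in range(1, m + 1)]


# Hypothetical data: eight test units, no ties among the EDS ranks.
t_values = [3.1, -2.8, 0.4, -0.2, 2.9, 0.1, -0.5, 0.3]
ranks = [1, 2, 5, 6, 3, 7, 8, 4]

co_by_rank = estimate_contributions(t_values, ranks, d_grid=[2.0, 2.5])
```

On real data the raw differences are noisy and need not decrease in M, which is why a smoothing step is required before they are used as contributions.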
Combining prior probabilities from multiple EDS-s
The combined odds of test unit i is defined as
$$ \mathrm{odds}_i^{\text{combined prior}} = \frac{\pi_i^{\text{combined prior}}}{1 - \pi_i^{\text{combined prior}}}, $$
where π_i (combined prior) is the combined prior probability of test unit i.
Under mild assumptions, for the combined odds we have that
$$ \mathrm{odds}_i^{\text{combined prior}} = \frac{m_1^{(\mathrm{NDC})}}{m_0^{(\mathrm{NDC})}} \prod_{j=1}^{k} \left( \frac{m_0^{(\mathrm{NDC})}}{m_1^{(\mathrm{NDC})}} \cdot \frac{\pi^*_{ij}}{1 - \pi^*_{ij}} \right), $$
where π*_ij is the prior probability that test unit i is alternative based on the j-th EDS, m_1^(NDC) is the number of alternative test units in the NDC, and m_0^(NDC) = m^(NDC) − m_1^(NDC).
We replace the terms with estimates in the above formula to obtain an estimate of the combined odds.
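The combination of priors across EDS-s can be sketched as a product of odds ratios. The sketch assumes the form in which each EDS multiplies the baseline odds m_1^(NDC) / m_0^(NDC) by the ratio of its own prior odds to that baseline; the function name and all numbers are hypothetical.

```python
def combine_priors(priors_per_eds, m1_ndc, m_ndc):
    """Combined prior probability of one test unit from k EDS-specific priors.

    priors_per_eds : prior probabilities of the test unit, one per EDS
                     (each strictly between 0 and 1)
    m1_ndc         : (estimated) number of alternative test units in the NDC
    m_ndc          : total number of test units in the NDC
    """
    m0_ndc = m_ndc - m1_ndc
    baseline_odds = m1_ndc / m0_ndc
    combined_odds = baseline_odds
    for p in priors_per_eds:
        # Each EDS scales the baseline odds by how much it raises or lowers them.
        combined_odds *= (p / (1.0 - p)) / baseline_odds
    # Convert the combined odds back to a probability.
    return combined_odds / (1.0 + combined_odds)


# Hypothetical NDC with 1000 test units, 100 of them alternative.
uninformative = combine_priors([0.1, 0.1], m1_ndc=100, m_ndc=1000)
single_eds = combine_priors([0.3], m1_ndc=100, m_ndc=1000)
two_supporting = combine_priors([0.3, 0.3], m1_ndc=100, m_ndc=1000)
```

An EDS whose prior equals the baseline proportion leaves the combined prior unchanged, while two EDS-s that both support a test unit raise its combined prior above either individual prior.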
Simulation results
black: cumulative cℓTDR curve
blue: # of test units alternative in the NDC in the top test units when cℓTDR - based selection was used
green: # of test units alternative in the NDC in the top test units when ℓTDR - based selection was used
red: cumulative ℓTDR curve
16 GWAS as NDC
We developed a parametric method for GWAS NDC.
For estimating the prior probabilities, we need the null and the alternative c.d.f. (F0 and F1), and the number of alternative test units in the NDC.
[Figure: two panels, "Original p-values" and "p-values “corrected” by the estimated distributions"; lambda values shown: 1.169 and 1.00048.]
Existing data sets used
• Stanley schizophrenia expression data
• Lewis linkage data
• OMIM (without Schiz. DB data)
• Candidate genes Schiz. Database
• SLEP human mouse orthologs
• NR EQTL
Black: cumulative cℓTDR curve
Red: cumulative ℓTDR curve
Replication study on two GWAS
Blue: empirical distribution of the cℓTDR-selected p-values
Red: empirical distribution of the ℓTDR-selected p-values
Light blue: empirical distribution of randomly selected p-values