Predicting regulatory variants with composite statistic
MJ Li et al.
Presented by Yuchuan Wang
10/03/2016
1
Introduction
• Prediction and prioritization of human non-coding regulatory variants
• Existing tools use functional genomics data and evolutionary information to evaluate the functions of non-coding variants
• Different algorithms give inconsistent and even conflicting predictions
• Integrate prediction scores from eight tools that are widely used to predict non-coding regulatory variants
2
Methods
• Variant prediction scores collection and processing
Score Name | Source Link | Pre-calculated
CADD_CScore, CADD_PHRED | http://krishna.gs.washington.edu/download/CADD/v1.1/whole_genome_SNVs.tsv.gz | Y
DANN | https://cbcl.ics.uci.edu/public_data/DANN/data/DANN_whole_genome_SNVs.tsv.bgz | Y
FunSeq | http://funseq.gersteinlab.org/data | Y
FunSeq2 | http://archive.gersteinlab.org/funseq2/hg19_wg_score.tsv.gz | Y
GWAS3D | http://jjwanglab.org/gwas3d | N
GWAVA_Region, GWAVA_TSS, GWAVA_Unmatched | ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/annotated/gwava_db_csv.tgz | N
SuRFR | http://www.cgem.ed.ac.uk/resources/SuRFR/SuRFR_0.99.0.tar.gz | N
FATHMM-MKL | http://fathmm.biocompute.org.uk/database/fathmm-MKL_Current.tab.gz | Y
4
Methods
Construction of training/testing datasets
• A dataset of disease-causal or functional regulatory variants, built by combining four different resources
• 81 experimentally validated regulatory variants, manually curated from recent publications, served as an independent causal-variant dataset for evaluating existing algorithms and our model.
5
Methods
Composite model
• Calculate the pdf of scores from each of the eight tools
• Assume independence between tests (tools)
Given a set of scores $S = \{s_1, s_2, \ldots, s_n\}$, we can calculate the Bayes factor (BF):
$$ BF = \prod_{i=1}^{n} \frac{P(s_i \mid \mathrm{causal})}{P(s_i \mid \mathrm{neutral})} $$
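The Bayes factor could be sketched roughly as follows. The slides do not say how the per-tool pdfs are estimated, so the smoothed histogram used here is an assumption, as are the function names, the [0, 1] score range, and the toy training scores (they are not data from the paper).

```python
import math

def histogram_pdf(scores, lo=0.0, hi=1.0, bins=5):
    """Smoothed histogram density estimate over a fixed range [lo, hi].

    Assumption: a histogram stands in for whatever density estimator
    the authors actually used.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        counts[min(max(int((s - lo) / width), 0), bins - 1)] += 1
    total = len(scores) + bins  # add-one smoothing avoids zero densities
    density = [(c + 1) / (total * width) for c in counts]

    def pdf(s):
        idx = min(max(int((s - lo) / width), 0), bins - 1)
        return density[idx]

    return pdf

def bayes_factor(scores, causal_pdfs, neutral_pdfs):
    """BF = product over tools of P(s_i | causal) / P(s_i | neutral)."""
    # Summing logs is numerically safer than multiplying many small ratios.
    log_bf = sum(
        math.log(c(s)) - math.log(n(s))
        for s, c, n in zip(scores, causal_pdfs, neutral_pdfs)
    )
    return math.exp(log_bf)

# Toy training scores for one hypothetical tool (illustrative only).
causal_train = [0.65, 0.75, 0.85, 0.9, 0.95]
neutral_train = [0.05, 0.15, 0.25, 0.3, 0.5]
causal_pdf = histogram_pdf(causal_train)
neutral_pdf = histogram_pdf(neutral_train)

bf_high = bayes_factor([0.9], [causal_pdf], [neutral_pdf])  # score typical of causal
bf_low = bayes_factor([0.1], [causal_pdf], [neutral_pdf])   # score typical of neutral
```

With several tools, pass one score and one pair of density functions per tool; the independence assumption is what allows the per-tool ratios to be multiplied.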
7
Methods
Composite model
• The probability of the variant being causal is computed as the composite likelihood
$$ P(\mathrm{causal} \mid S) = \prod_{i=1}^{n} \frac{P(s_i \mid \mathrm{causal}) \times \pi}{P(s_i \mid \mathrm{causal}) \times \pi + P(s_i \mid \mathrm{neutral}) \times (1 - \pi)} $$
• Use a flat prior probability π = 0.5 for the causal probability of each variant
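As a minimal sketch of the two formulas, the snippet below takes already-evaluated density values for each tool and combines them. The function names and the three pairs of density values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def bayes_factor(p_causal, p_neutral):
    """Product over tools of P(s_i | causal) / P(s_i | neutral)."""
    return math.prod(c / n for c, n in zip(p_causal, p_neutral))

def composite_probability(p_causal, p_neutral, prior=0.5):
    """Product over tools of the per-tool posterior, with prior pi."""
    result = 1.0
    for c, n in zip(p_causal, p_neutral):
        result *= (c * prior) / (c * prior + n * (1.0 - prior))
    return result

# Hypothetical density values for three tools at one variant's scores.
p_causal = [0.8, 0.6, 0.9]
p_neutral = [0.2, 0.4, 0.3]

bf = bayes_factor(p_causal, p_neutral)             # 4 * 1.5 * 3
prob = composite_probability(p_causal, p_neutral)  # 0.8 * 0.6 * 0.75
```

Because each factor is at most 1, the composite value shrinks as more tools are added, so it is best read as a ranking statistic and not as a calibrated posterior.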
8
Results
Integrative resources for non-coding regulatory variant functional annotation and prediction
• Prediction scores for around 8.6 billion possible SNPs in the human genome
• 5247 genome-wide non-redundant variants with reliable causal evidence as the training set
• A control dataset (10 times the size of the positive set) that does not contain causal or disease-associated variants
• Independent QTL datasets collected
9
Results
Integrative resources for non-coding regulatory variant functional annotation and prediction
Mulin Jun Li et al. Bioinformatics 2016;32:2729-2736
10
Results
Existing methods show inconsistent prioritization of non-coding regulatory variants
• Spearman’s rank correlation (SRC) tests for each pair of algorithms
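A pairwise SRC comparison might look like the sketch below, which computes Spearman's coefficient as the Pearson correlation of average ranks. The tool names and score vectors are made up for illustration; in practice each list would hold one tool's scores over the same set of variants.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores from three tools on the same five variants.
scores = {
    "tool_a": [0.1, 0.4, 0.35, 0.8, 0.9],
    "tool_b": [1.0, 2.5, 2.0, 7.0, 9.5],  # same ordering as tool_a
    "tool_c": [5.0, 4.0, 3.0, 2.0, 1.0],  # mostly reversed ordering
}
src = {
    (a, b): spearman(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}
```

Rank correlation is a natural choice here because the tools report scores on different scales, so only the ordering of variants is comparable.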
11
SRC among eight tools for (A) refined causal dataset and (B) curated experimentally validated dataset.
Mulin Jun Li et al. Bioinformatics 2016;32:2729-2736
12
Results
Composite of multiple signals improves causal regulatory variant detection
• Constructed a composite likelihood statistic and estimated the probability of the investigated variant being causal
• Ten-fold cross-validation
• AUC 0.84 and MCC 0.41
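For readers unfamiliar with the two metrics, here is a small sketch of how AUC and MCC can be computed from labels and predicted probabilities. The labels, scores, and the 0.5 decision threshold are illustrative assumptions; the slides do not state which threshold was used for MCC.

```python
import math

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcc(labels, scores, threshold=0.5):
    """Matthews correlation coefficient at a fixed decision threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, pred) if l == 1 and p == 1)
    tn = sum(1 for l, p in zip(labels, pred) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, pred) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, pred) if l == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example: 3 causal (1) and 5 control (0) variants, not real data.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

auc_value = auc(labels, scores)
mcc_value = mcc(labels, scores)
```

In a ten-fold cross-validation, these functions would be applied to the held-out predictions of each fold, with the densities estimated on the other nine folds. MCC is informative here because the control set is much larger than the positive set.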
13
Regulatory variant prediction performance of different methods.
Mulin Jun Li et al. Bioinformatics 2016;32:2729-2736
14
Results
Composite of multiple signals improves causal regulatory variant detection
• Combining only a subset of the eight methods can achieve better predictive power
• CADD_Cscore, GWAVA_TSS, GWAS3D and SuRFR
Mulin Jun Li et al. Bioinformatics 2016;32:2729-2736
15
Results
Evaluation of composite model on eQTL, allelic imbalance and dsQTL datasets
• Three independent human QTL datasets to further validate the capacity of the full composite model
• Improvement in predicting eQTLs (AUC of 0.81) and allelic imbalanced loci (AUC of 0.92)
• Similar performance to FunSeq2 on the dsQTL dataset (both with AUC of 0.71)
16
Performance of regulatory QTL prediction by different methods.
Mulin Jun Li et al. Bioinformatics 2016;32:2729-2736
17
Results
Comparison with unsupervised integrative approach
18
Conclusion
• Existing methods show inconsistent prioritization of non-coding regulatory variants
• The composite strategy takes advantage of the complementary attributes of individual tools to achieve better performance
• Identifying a high-quality, confident training dataset of causal regulatory variants and a corresponding control set is challenging
• The low correlations among existing methods may be attributed to the different perspectives and logic of the algorithms
• A large and independent gold standard is needed to test the correlation of different tools and the stability of the reduced combination model
19