Post on 15-Mar-2020

Predicting regulatory variants with composite statistic

MJ Li et al.

Presented by Yuchuan Wang

10/03/2016

Introduction

• Prediction and prioritization of human non-coding regulatory variants

• Existing tools use functional genomics data and evolutionary information to evaluate the functions of non-coding variants

• Different algorithms give inconsistent and even conflicting predictions

• Integrate prediction scores from eight tools that are widely used to predict non-coding regulatory variants

Methods

• Collection and processing of variant prediction scores

Score name | Source link | Pre-calculated
CADD_CScore, CADD_PHRED | http://krishna.gs.washington.edu/download/CADD/v1.1/whole_genome_SNVs.tsv.gz | Y
DANN | https://cbcl.ics.uci.edu/public_data/DANN/data/DANN_whole_genome_SNVs.tsv.bgz | Y
FunSeq | http://funseq.gersteinlab.org/data | Y
FunSeq2 | http://archive.gersteinlab.org/funseq2/hg19_wg_score.tsv.gz | Y
GWAS3D | http://jjwanglab.org/gwas3d | N
GWAVA_Region, GWAVA_TSS, GWAVA_Unmatched | ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/annotated/gwava_db_csv.tgz | N
SuRFR | http://www.cgem.ed.ac.uk/resources/SuRFR/SuRFR_0.99.0.tar.gz | N
FATHMM-MKL | http://fathmm.biocompute.org.uk/database/fathmm-MKL_Current.tab.gz | Y

Methods: Construction of training/testing datasets

• A dataset of disease-causal or functional regulatory variants, built by combining four different resources

• 81 experimentally validated regulatory variants, manually curated from recent publications, served as an independent set of causal variants for evaluating existing algorithms and our model

Methods: Composite model

• Calculate the probability density function (pdf) of the scores from each of the eight tools

• Assume independence between tests (tools)

Given a set of scores $S = \{s_1, s_2, \ldots, s_n\}$, we can calculate the Bayes factor (BF):

$$BF = \prod_{i=1}^{n} \frac{P(s_i \mid \text{causal})}{P(s_i \mid \text{neutral})}$$
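The Bayes factor above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the per-tool pdfs were estimated, so a fixed-bandwidth Gaussian kernel density estimate stands in here, and the tool names (`tool_a`, `tool_b`), training scores, bandwidth and `floor` value are all hypothetical.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a density function estimated from `sample` with a Gaussian kernel.

    Assumption: the density estimator is not specified in the slides; a
    fixed-bandwidth KDE is used purely for illustration.
    """
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in sample)

    return pdf


def bayes_factor(scores, causal_pdfs, neutral_pdfs, floor=1e-12):
    """Product over tools of P(s_i | causal) / P(s_i | neutral).

    Tools are treated as independent. The product is accumulated in log
    space, and densities are floored (hypothetical guard) to avoid log(0).
    """
    log_bf = 0.0
    for tool, s in scores.items():
        p_causal = max(causal_pdfs[tool](s), floor)
        p_neutral = max(neutral_pdfs[tool](s), floor)
        log_bf += math.log(p_causal) - math.log(p_neutral)
    return math.exp(log_bf)


# Hypothetical training scores for two made-up tools.
causal_train = {"tool_a": [0.7, 0.8, 0.85, 0.9], "tool_b": [0.6, 0.7, 0.75, 0.9]}
neutral_train = {"tool_a": [0.1, 0.15, 0.2, 0.3], "tool_b": [0.2, 0.25, 0.3, 0.4]}
causal_pdfs = {t: gaussian_kde(v, bandwidth=0.1) for t, v in causal_train.items()}
neutral_pdfs = {t: gaussian_kde(v, bandwidth=0.1) for t, v in neutral_train.items()}

# A variant scoring high on both tools versus one scoring low on both.
bf_high = bayes_factor({"tool_a": 0.8, "tool_b": 0.7}, causal_pdfs, neutral_pdfs)
bf_low = bayes_factor({"tool_a": 0.2, "tool_b": 0.3}, causal_pdfs, neutral_pdfs)
```

In this toy setup the variant whose scores fall among the causal training scores gets a BF above 1, and the one whose scores fall among the neutral training scores gets a BF below 1.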

Methods: Composite model

• The probability of the variant being causal is computed as the composite likelihood

$$P(\text{causal} \mid S) = \prod_{i=1}^{n} \frac{P(s_i \mid \text{causal}) \times \pi}{P(s_i \mid \text{causal}) \times \pi + P(s_i \mid \text{neutral}) \times (1 - \pi)}$$

• Use a flat prior probability π = 0.5 for the causal probability of each variant
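As a minimal sketch of the composite likelihood, the function below multiplies the per-tool posterior terms. The function name, the input layout (a list of density pairs) and the numeric values are assumptions made for illustration, not the authors' code.

```python
def composite_probability(likelihoods, prior=0.5):
    """Composite probability that a variant is causal.

    `likelihoods` is a list of (P(s_i | causal), P(s_i | neutral)) pairs,
    one per tool; `prior` is the flat prior pi from the slide.
    """
    prob = 1.0
    for p_causal, p_neutral in likelihoods:
        numerator = p_causal * prior
        denominator = numerator + p_neutral * (1.0 - prior)
        if denominator == 0.0:
            raise ValueError("both densities are zero for one tool")
        prob *= numerator / denominator
    return prob


# Hypothetical densities for two tools.
example = composite_probability([(0.8, 0.2), (0.6, 0.4)])
```

With π = 0.5 each factor reduces to P(s_i | causal) / (P(s_i | causal) + P(s_i | neutral)), so the two hypothetical tools above contribute 0.8 and 0.6, and `example` is their product, 0.48.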

Results: Integrative resources for non-coding regulatory variant functional annotation and prediction

• Prediction scores for around 8.6 billion possible SNPs in the human genome

• 5247 genome-wide non-redundant variants with reliable causal evidence as the training set

• A control dataset (10 times the size of the positive data) that does not contain causal or disease-associated variants

• Independent QTL datasets collected

Mulin Jun Li et al. Bioinformatics 2016;32:2729-2736

Results: Existing methods show inconsistent prioritization of non-coding regulatory variants

• Spearman's rank correlation (SRC) tests for each pair of algorithms

Figure: SRC among eight tools for (A) the refined causal dataset and (B) the curated experimentally validated dataset.
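The pairwise comparison can be sketched as follows: Spearman's rank correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The tool names and scores below are made up for illustration.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores from three made-up tools on the same five variants.
tool_scores = {
    "tool_a": [0.1, 0.4, 0.5, 0.7, 0.9],
    "tool_b": [1.0, 3.0, 2.0, 5.0, 8.0],
    "tool_c": [9.0, 7.0, 6.0, 3.0, 1.0],
}
pairwise_src = {
    (a, b): spearman(tool_scores[a], tool_scores[b])
    for a, b in combinations(tool_scores, 2)
}
```

`pairwise_src` holds one coefficient per pair of tools, which is the quantity a correlation heat map of the kind shown in the figure would display.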

Results: Composite of multiple signals improves causal regulatory variant detection

• A composite likelihood statistic estimates the probability that the investigated variant is causal

• Ten-fold cross-validation

• AUC 0.84 and MCC 0.41

Figure: Regulatory variant prediction performance of different methods.
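For readers unfamiliar with the two metrics, here is a small sketch of AUC, MCC and a ten-fold split. It does not reproduce the reported numbers; the labels, scores, the 0.5 decision threshold and the fold assignment scheme are assumptions for illustration only.

```python
import math
import random


def auc(labels, scores):
    """Area under the ROC curve: fraction of (causal, neutral) pairs ranked
    correctly, with ties counted as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcc(labels, scores, threshold=0.5):
    """Matthews correlation coefficient at a (hypothetical) score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Hypothetical labels (1 = causal, 0 = neutral) and composite probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
toy_auc = auc(labels, scores)
toy_mcc = mcc(labels, scores)
folds = k_fold_indices(25)
```

In a cross-validation run, each fold would in turn be held out for scoring while the per-tool densities are estimated on the remaining nine folds.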

Results: Composite of multiple signals improves causal regulatory variant detection

• Combining only a subset of the eight methods can achieve better predictive power

• CADD_CScore, GWAVA_TSS, GWAS3D and SuRFR
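The slides do not describe how the four-tool subset was found. One simple way to explore the idea, offered only as an assumption, is an exhaustive search over tool subsets that scores each subset by the AUC of its composite probability. The tool names and densities below are made up, and selecting on the same data that is used for evaluation would overstate performance.

```python
from itertools import combinations


def composite_probability(pairs, prior=0.5):
    """Product of per-tool posterior terms (same form as the composite model)."""
    prob = 1.0
    for p_causal, p_neutral in pairs:
        prob *= (p_causal * prior) / (p_causal * prior + p_neutral * (1.0 - prior))
    return prob


def auc(labels, scores):
    """Fraction of (causal, neutral) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_subset(tools, variants, labels):
    """Try every non-empty subset of tools; return (subset, AUC) of the best.

    Hypothetical search strategy: with eight tools this is 255 subsets.
    Smaller subsets are tried first and win ties.
    """
    best = (None, -1.0)
    for size in range(1, len(tools) + 1):
        for subset in combinations(tools, size):
            scores = [composite_probability([v[t] for t in subset]) for v in variants]
            value = auc(labels, scores)
            if value > best[1]:
                best = (subset, value)
    return best


# Hypothetical (P(s|causal), P(s|neutral)) pairs: tool "a" is informative,
# tool "b" points the wrong way.
variants = [
    {"a": (0.9, 0.1), "b": (0.2, 0.8)},
    {"a": (0.8, 0.2), "b": (0.3, 0.7)},
    {"a": (0.2, 0.8), "b": (0.9, 0.1)},
    {"a": (0.1, 0.9), "b": (0.8, 0.2)},
]
labels = [1, 1, 0, 0]
subset, subset_auc = best_subset(["a", "b"], variants, labels)
```

In this toy case the single informative tool outperforms the full combination, which mirrors the slide's point that a reduced model can beat the full one.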

Results: Evaluation of composite model on eQTL, allelic imbalance and dsQTL datasets

• Three independent human QTL datasets to further validate the capacity of the full composite model

• Improvement in predicting eQTLs (AUC of 0.81) and allelic imbalanced loci (AUC of 0.92)

• Similar performance to FunSeq2 on the dsQTL dataset (both with an AUC of 0.71)

Figure: Performance of regulatory QTL prediction by different methods.

Results: Comparison with unsupervised integrative approach

Conclusion

• Existing methods show inconsistent prioritization of non-coding regulatory variants

• The composite strategy takes advantage of the complementary attributes of individual tools to achieve better performance

• Identifying a high-quality, high-confidence training dataset of causal regulatory variants, with corresponding controls, is challenging

• The correlations among existing methods may be attributed to the different perspectives and logic of the existing algorithms

• A large, independent gold standard is needed to test the correlation of different tools and the stability of the reduced combination model
