A W-test for main effect and epistasis testing in GWAS data

Maggie Haitian Wang, PhD

Centre for Clinical Research and Biostatistics (CCRB)

Faculty of Medicine, The Chinese University of Hong Kong (CUHK)

maggiew@cuhk.edu.hk

http://www2.ccrb.cuhk.edu.hk/statgene

GIW2016, Shanghai

• Genetic association studies aim to identify disease-associated biomarkers in order to uncover disease mechanisms, find potential drug targets, and support disease sub-typing.

Background

http://www.mediapharma.it/wp-content/uploads/2012/06/personalised-medicine.jpg

Disease mechanism Drug target identification Precision medicine

Technology Data types Methods

http://www.nature.com/polopoly_fs/7.14984.1389810620!/image/HiSeqX_Ten_Single_Instrument_630.jpg_gen/derivatives/landscape_630/HiSeqX_Ten_Single_Instrument_630.jpg

• Lasso
• t-test
• Chi-squared test
• Tree-based …

Genetic association study

• Next generation sequencing (NGS) data:
– More than 99% of the single nucleotide polymorphisms (SNPs) have minor allele frequency (MAF) < 1%
– Rare variant methods

• Genome-wide association studies (GWAS):
– Majority of the SNPs have MAF > 5%
– Common variant methods: Fisher’s exact test, Chi-squared, odds ratio, linear or logistic regressions

• The low frequency SNPs (1% < MAF < 5%) remain largely under-studied.
– Loss of function alleles are enriched in low frequency variants (MacArthur et al. 2015 Science).

Methods by data types

• Ultra-high data dimension:
– Burden of multiple testing: GWAS data: 500,000 SNPs; NGS: > 10 million SNPs
– Requirement on test efficiency
– Difficulty in considering interaction effects due to data size and sparsity

• Results validation
– Crucial to replicate GWAS results (Kraft, Zeggini and Ioannidis 2009)

Common challenges of genetic association studies

Kraft, Zeggini and Ioannidis (2009). Replication in genome-wide association studies. Statistical Science.

The W-test formulation

• Basic hypothesis: the statistical distributions of a set of disease-associated markers are different in the case group from that in the control group.

• Under a co-dominant model:
– the genotype X can be coded to take values (0, 1, 2)
– a pair of SNPs (X1, X2) has 3 × 3 = 9 genotype combinations, so it forms a 2 by 9 contingency table (case/control by cell)

[Figure: contingency table with rows Control and Case and columns for cells 1, …, k (k = 9)]

• The cell distribution of (X1, X2) in the case and control group:

– n1i : number of case subjects in the ith cell
– n0i : number of control subjects in the ith cell
– N1 : total number of cases
– N0 : total number of controls

$$\hat{p}_{1i} = \Pr(X = i \mid Y = 1) = \frac{n_{1i}}{N_1}, \qquad \hat{p}_{0i} = \Pr(X = i \mid Y = 0) = \frac{n_{0i}}{N_0}, \qquad i = 1, \ldots, k$$
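The cell counts and estimated cell probabilities defined above can be sketched in a few lines of Python. This is only an illustration: the function names (`cell_index`, `cell_table`, `cell_probabilities`) and the toy data are hypothetical and are not part of the wtest package.

```python
def cell_index(x1, x2):
    """Map a genotype pair (each coded 0, 1 or 2) to a cell index 0..8."""
    return 3 * x1 + x2


def cell_table(snp1, snp2, y, k=9):
    """Return (n1, n0): per-cell counts for cases (y == 1) and controls (y == 0)."""
    n1 = [0] * k
    n0 = [0] * k
    for a, b, label in zip(snp1, snp2, y):
        if label == 1:
            n1[cell_index(a, b)] += 1
        else:
            n0[cell_index(a, b)] += 1
    return n1, n0


def cell_probabilities(n1, n0):
    """Estimated cell probabilities: p1[i] = n1[i] / N1 and p0[i] = n0[i] / N0."""
    N1, N0 = sum(n1), sum(n0)
    return [c / N1 for c in n1], [c / N0 for c in n0]


# Toy data (made up for illustration): 4 cases followed by 4 controls.
snp1 = [0, 1, 2, 0, 1, 2, 0, 1]
snp2 = [0, 0, 1, 2, 1, 2, 0, 1]
y = [1, 1, 1, 1, 0, 0, 0, 0]

n1, n0 = cell_table(snp1, snp2, y)
p1, p0 = cell_probabilities(n1, n0)
```

For a single SNP (main effect) the same idea applies with k = 3 cells, one per genotype.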

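• First, combine the normalized log odds ratios of the cell probability distributions:

$$X^2 = \sum_{i=1}^{k} \left[ \log \frac{\hat{p}_{1i} / (1 - \hat{p}_{1i})}{\hat{p}_{0i} / (1 - \hat{p}_{0i})} \Big/ SE_i \right]^2$$

where,

$$SE_i = \sqrt{\frac{1}{n_{0i}} + \frac{1}{N_0 - n_{0i}} + \frac{1}{n_{1i}} + \frac{1}{N_1 - n_{1i}}}$$

• The squared terms in the summation are not independent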
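As a rough sketch, the sum of squared normalized log odds ratios can be computed directly from the cell counts. The function name `x2_statistic` is hypothetical, and the handling of empty cells (adding 0.5 to the four counts of the affected 2 × 2 table) is an assumption of this sketch; the slides do not say how zero counts are treated.

```python
import math


def x2_statistic(n1, n0, correction=0.5):
    """Sum over cells of (log odds ratio / SE)^2.

    n1[i], n0[i]: case and control counts in cell i.
    correction: added to all four counts of a cell's 2 x 2 table when any of
    them is zero (an assumption of this sketch, not taken from the slides).
    """
    N1, N0 = sum(n1), sum(n0)
    total = 0.0
    for a, b in zip(n1, n0):
        # 2 x 2 table for "cell i versus the rest": a, c are cases; b, d are controls.
        c, d = N1 - a, N0 - b
        if min(a, b, c, d) == 0:
            a, b, c, d = (v + correction for v in (a, b, c, d))
        # p1/(1 - p1) = a/c and p0/(1 - p0) = b/d, so the N1 and N0 factors cancel.
        log_or = math.log((a / c) / (b / d))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        total += (log_or / se) ** 2
    return total


# Toy 2-cell example: 30/20 cases versus 20/30 controls.
x2 = x2_statistic([30, 20], [20, 30])
```

When the case and control counts are identical in every cell, each log odds ratio is zero and the statistic is zero.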

[Figure: original cell division of the Control/Case table (cell counts such as n02); each cell i contributes its own log odds ratio: log OR_1, log OR_2, log OR_3, …, log OR_k]

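• The actual distribution of X² can be estimated by matching its first two moments to a random variable R:

• Let

$$R = c\,\chi^2_f$$

Writing $X^2 = \sum_i x_i$, where $x_i$ is the ith squared term,

$$\operatorname{Var}(X^2) = \sum_i \sum_j \operatorname{cov}(x_i, x_j) = \sum_i \operatorname{Var}(x_i) + 2 \sum\sum_{i<j} \operatorname{cov}(x_i, x_j)$$

$$\begin{cases} E(X^2) = cf \\ \operatorname{Var}(X^2) = 2c^2 f \end{cases}$$

Chuang and Shih (2012), Hou (2005)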

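• The c and f are:

$$c = \frac{\operatorname{Var}(X^2)}{2\,E(X^2)}, \qquad f = \frac{2\,[E(X^2)]^2}{\operatorname{Var}(X^2)}$$

where, with $x_i$ the ith squared term of $X^2$, $\operatorname{Var}(X^2) = \sum_i \operatorname{Var}(x_i) + 2 \sum\sum_{i<j} \operatorname{cov}(x_i, x_j)$.

• Let h=1/c, we have

$$W = h \sum_{i=1}^{k} \left[ \log \frac{\hat{p}_{1i} / (1 - \hat{p}_{1i})}{\hat{p}_{0i} / (1 - \hat{p}_{0i})} \Big/ SE_i \right]^2 \sim \chi^2_f$$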
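Because f is generally not an integer, the p-value of W needs a chi-squared tail probability with real-valued degrees of freedom. The sketch below uses only the Python standard library and the textbook series and continued-fraction expansions of the incomplete gamma function; the names `gamma_q`, `chi2_sf` and `w_test_pvalue` are hypothetical. The example values h = 8/9 and f = 8 are the empirical approximations quoted in the slides for k = 9, and the value X² = 20 is made up.

```python
import math


def gamma_q(a, x):
    """Regularized upper incomplete gamma function Q(a, x), for a > 0."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series for the lower function P(a, x); then Q = 1 - P.
        term = 1.0 / a
        total = term
        n = a
        for _ in range(10000):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_prefactor)


def chi2_sf(stat, df):
    """Upper tail probability of a chi-squared variable with df degrees of freedom."""
    return gamma_q(df / 2.0, stat / 2.0)


def w_test_pvalue(x2, h, f):
    """Scale X^2 by h and compare W = h * X^2 with a chi-squared(f) distribution."""
    w = h * x2
    return w, chi2_sf(w, f)


# Example with made-up X^2 = 20 and the empirical k = 9 values h = 8/9, f = 8.
w, p = w_test_pvalue(20.0, 8 / 9, 8)
```

Where SciPy is available, `scipy.stats.chi2.sf(w, f)` computes the same tail probability and would normally be preferred over a hand-written routine.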

• In real data the h and f are estimated using bootstrapped samples

• Cov is estimated by large sample theories.

• h and f converge when
– B > 200
– bootstrap NB = min(1000, N)
– PB = min(1000, P)

• Empirically: h ≈ (k − 1)/k, f ≈ k − 1

Distribution of W-test

C_v = coefficient of variation: measures convergence of the estimated h and f
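One possible reading of the bootstrap step is sketched below: draw B resamples of at most 1000 subjects, shuffle the phenotype labels so that each resample reflects the null, compute X² each time, and plug the sample mean and variance into the moment-matching formulas h = 2E/Var and f = 2E²/Var. This is an assumption-laden illustration, not the wtest implementation: the slides estimate the covariances by large-sample theory and also subsample PB pairs, both of which are omitted here, and all function names and the simulated data are hypothetical.

```python
import math
import random
import statistics


def x2_statistic(n1, n0, correction=0.5):
    """Sum over cells of (log odds ratio / SE)^2; 0.5 is added when a count is zero."""
    N1, N0 = sum(n1), sum(n0)
    total = 0.0
    for a, b in zip(n1, n0):
        c, d = N1 - a, N0 - b
        if min(a, b, c, d) == 0:
            a, b, c, d = (v + correction for v in (a, b, c, d))
        log_or = math.log((a / c) / (b / d))
        total += log_or ** 2 / (1 / a + 1 / b + 1 / c + 1 / d)
    return total


def x2_for_pair(snp1, snp2, y, k=9):
    """Build the case/control cell counts of a SNP pair and return X^2."""
    n1, n0 = [0] * k, [0] * k
    for a, b, label in zip(snp1, snp2, y):
        (n1 if label == 1 else n0)[3 * a + b] += 1
    return x2_statistic(n1, n0)


def estimate_h_f(snp1, snp2, y, B=200, n_b=1000, seed=0):
    """Estimate h and f from B null resamples of size min(n_b, N)."""
    rng = random.Random(seed)
    n = len(y)
    size = min(n_b, n)
    stats = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(size)]
        labels = [y[i] for i in idx]
        rng.shuffle(labels)  # break any genotype-phenotype association
        stats.append(x2_for_pair([snp1[i] for i in idx],
                                 [snp2[i] for i in idx], labels))
    mean = statistics.fmean(stats)
    var = statistics.variance(stats)
    return 2 * mean / var, 2 * mean ** 2 / var  # h = 1/c, f


# Simulated null data: 1000 subjects, two SNPs with allele frequency 0.4.
rng = random.Random(1)
genotype = lambda: (rng.random() < 0.4) + (rng.random() < 0.4)
snp1 = [genotype() for _ in range(1000)]
snp2 = [genotype() for _ in range(1000)]
y = [rng.randrange(2) for _ in range(1000)]
h_hat, f_hat = estimate_h_f(snp1, snp2, y)
```

The estimates can then be compared with the empirical values quoted above, h ≈ (k − 1)/k and f ≈ k − 1.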

• The W-test follows a Chi-squared distribution, in which the degrees of freedom are estimated using smaller bootstrapped samples
– Its probability distribution is data-adaptive
– No need for permutations to calculate p-values, which is important for genome data

• Model free
– Odds ratio based, suitable for case-control data sets

• Flexible
– Handles SNP-SNP interactions
– Handles main effect

• When k=2, it reduces to a classical odds ratio test for 2x2 table.

Properties
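The k = 2 reduction can be checked with the moment-matching relations E(X²) = cf and Var(X²) = 2c²f. The short derivation below is a sketch under the usual large-sample assumption that the normalized log odds ratio z is approximately standard normal under the null; it is not taken from the slides.

```latex
% With k = 2 cells, cell 2 is the complement of cell 1:
%   n_{12} = N_1 - n_{11},  n_{02} = N_0 - n_{01}
\log OR_2 = -\log OR_1, \qquad SE_2 = SE_1
% so both squared terms equal z^2, with z = \log OR_1 / SE_1:
X^2 = 2 z^2, \qquad E(X^2) = 2, \qquad \operatorname{Var}(X^2) = 4 \operatorname{Var}(z^2) = 8
% Moment matching:
c = \frac{\operatorname{Var}(X^2)}{2\,E(X^2)} = 2, \qquad
h = \frac{1}{c} = \frac{1}{2}, \qquad
f = \frac{2\,[E(X^2)]^2}{\operatorname{Var}(X^2)} = 1
% Hence
W = h X^2 = z^2 \sim \chi^2_1
```

This agrees with the empirical approximations h ≈ (k − 1)/k = 1/2 and f ≈ k − 1 = 1, and z² is the squared classical log odds ratio statistic for a 2 × 2 table.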

• Important genetic architectures that will influence testing power:
– MAF > 5% (common)
– 1% < MAF < 5% (low frequency)
– Linkage Disequilibrium (LD) < 20% (low)
– 20% < LD < 80% (mid)
– LD > 80% (high)

Simulation studies design

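• Phenotype determined by:

– A linear model:

$$\operatorname{logit} P(Y) = \begin{cases} \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2, & p = 0.3 \\ \beta_4 + \beta_5 X_3 + \beta_6 X_4 + \beta_7 X_3 X_4, & p = 0.3 \\ \beta, & p = 0.4 \end{cases}$$

– A non-linear model, without any main effect: Y is a mod 2 function of one SNP pair with p = 0.3, a mod 2 function of a second SNP pair with p = 0.3, and takes a value in {0, 1} with p = 0.4.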

• Power: 1000 simulations
• Type I error: 1 million simulations
• Number of candidate SNPs: 50
• Number of pairs: 1,225
• Causal pairs: 2
• Bonferroni corrected significance level for 5% alpha: 4.1×10^-5

Power and type I error
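The number of pairs and the Bonferroni threshold quoted in the simulation settings follow from simple arithmetic, shown here as a small check (variable names are illustrative only):

```python
from math import comb

n_snps = 50
n_pairs = comb(n_snps, 2)    # number of distinct SNP pairs among 50 SNPs
alpha = 0.05
threshold = alpha / n_pairs  # Bonferroni corrected significance level

print(n_pairs)               # 1225
print(f"{threshold:.1e}")    # 4.1e-05
```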

Power for linear model

MAF             Methods       Low LD   Moderate LD   High LD
MAF > 5%        Logistic      68.5%    76.9%         83.3%
MAF > 5%        Chi-squared   60.0%    67.2%         74.5%
MAF > 5%        W             71.1%    81.0%         86.7%
1% < MAF < 5%   Logistic      47.1%    62.5%         71.1%
1% < MAF < 5%   Chi-squared   42.2%    65.2%         74.0%
1% < MAF < 5%   W             49.8%    79.5%         83.8%

Power for non-linear model

MAF             Methods       Low LD   Moderate LD   High LD
MAF > 5%        Logistic      5.9%     1.7%          0.6%
MAF > 5%        Chi-squared   72.6%    69.4%         62.8%
MAF > 5%        W             88.0%    86.6%         79.4%
1% < MAF < 5%   Logistic      61.7%    31.8%         43.7%
1% < MAF < 5%   Chi-squared   67.4%    43.9%         49.1%
1% < MAF < 5%   W             95.6%    83.3%         83.9%

Type I error - nominal

MAF             Methods       Low LD   Moderate LD   High LD
MAF > 5%        Logistic      3.92%    5.88%         3.92%
MAF > 5%        Chi-squared   2.82%    1.72%         3.06%
MAF > 5%        W             5.39%    6.00%         5.51%
1% < MAF < 5%   Logistic      4.53%    5.27%         5.64%
1% < MAF < 5%   Chi-squared   0.37%    0.25%         0.25%
1% < MAF < 5%   W             4.04%    5.15%         6.74%

• On a laptop computer with a 2.4 GHz CPU and 8 GB memory, the time elapsed for computing interaction effects exhaustively for 1000 subjects and 50 SNPs is:

Computing Speed

[Bar chart: Time (s) for W-test, Chi-square and Logistic; recovered bar labels: 7.4, 7.7]

W-test is robust when sample size reduces

Low frequency, mid LD environment; non-linear model

• Dataset 1. Wellcome Trust Case Control Consortium (WTCCC) bipolar data set (Burton, Clayton et al. 2007).
- 2,000 cases and 3,000 controls
- 414,682 SNPs after QC

• Dataset 2. Genetic Association Information Network (GAIN) bipolar project in dbGaP database (McInnis, Dick et al. 2003)
- 1,079 cases and 1,089 controls
- 729,304 SNPs after QC

Real data application

Q-Q Plot of W-test on real GWAS

No inflation of spurious association

(a) WTCCC data

(b) GAIN data

• Main effect markers are selected at genome-wide significant P-values

Main effect - Manhattan plots

• 51 genome-wide significant SNPs in WTCCC
• 76.4% of the significant markers identified are low frequency variants.
• PARK2 (rs2849605, 6q5.2) has been identified.
• Neuron function genes: HTR3B (rs17116117, 11q23.1) and CNTNAP5 (rs1919835, 2q14.3).
– HTR3B encodes a neurotransmitter receptor subunit that causes fast, depolarizing responses in neurons after activation (Davies et al 1999).
– CNTNAP5 has been identified in many previous independent genetic and pedigree data sets on bipolar disorder (Djurovic, Gustafsson et al. 2010), schizophrenia (Levinson, Shi et al. 2012), and autism (Pagnamenta, Bacchelli et al. 2010)

Significant main effects - WTCCC

MAF=4.2%

MAF=1.1%

• RTN4R (SNP_A-8429018, 22q11.21)
– encodes a nogo receptor
– mediates axonal growth inhibition and may play a role in regulating axonal regeneration and plasticity in the central nervous system
– Studies reported that the deletion of the gene will cause abnormality in brain white matter (Perlstein, Chohan et al. 2014);
– human and mouse genetic studies suggested the gene to be a candidate marker for schizophrenia (Hsu, Woodroffe et al. 2007).
• Despite numerous evidence of the gene’s role in neurological disorders from biomedical experiments and genetic studies, the gene has not been previously discovered from the GAIN dataset (McInnis, Dick et al. 2003).

Significant main effects

MAF=12.2%

Replicated significant epistasis effect

[Figure: epistasis network. Legend: genes replicated in WTCCC and GAIN datasets; only identified in GAIN data; only identified in WTCCC data; main effect significant; main effect not significant; significant interaction; weak interactions. Labelled genes include PTPRT, TMEM132D, CNTNAP2 and MACROD2.]

• A majority of these replicated genes have no significant marginal (main) effect, so they are undiscoverable through main effect screening

Replicated significant epistasis effect

SNP          Gene       Position   MAF*    P-value of pair*
rs6741692    DPP10      2q14       0.303   5.8E-38
rs2407594    CSMD1      8p23       0.029   9.8E-36
rs1864952    SLIT3      5q35       0.046   1.9E-35
rs2849605    PARK2      6q5.2      0.021   3.3E-29
rs3867492    TMEM132D   12q24.33   0.030   1.0E-27
rs11222695   HNT        11q25      0.012   2.7E-25
rs1494451    CNTNAP2    7q35       0.025   1.3E-21
rs2785061    ACCN1      17q12      0.028   9.8E-19
rs17135053   A2BP1      16p13.3    0.025   3.9E-18
rs17170832   ELMO1      7p14.1     0.017   3.9E-18
rs9559408    MYO16      13q33.3    0.035   4.8E-17

• DPP10: facilitates neuronal excitability and its aberrant distribution is associated with Alzheimer’s disease as revealed by immunohistochemistry (Chen et al 2010 Biomed Res Int.)

• TMEM132D: a transmembrane protein expressed in white matter in the spinal cord and optic nerve. (Nomoto 2003 J Biochem)

• PTPRT : a receptor-type protein tyrosine phosphatase for signal transduction and neurite extension, which promotes synapse formation and is reported to be highly expressed in the central nervous system (Lin 2009 Embo J)

Replicated epistasis genes

• wtest has been submitted to CRAN and is also available on our website: www2.ccrb.cuhk.edu.hk/statgene

R-package: wtest

- MH Wang, R Sun, J Guo, H Weng, J Lee, I Hu, P Sham and BCY Zee (2016). A fast and powerful W-test for pairwise epistasis testing. Nucleic Acids Research.
- R Sun, B Chang, BCY Zee, MH Wang. wtest: an R package for testing main and interaction effect in genotype data with binary traits.

• PhD Student: Rui Sun
• Programming Support: Junfeng Guo
• wtest software: www2.ccrb.cuhk.edu.hk/wtest/download.html
• Our group: www2.ccrb.cuhk.edu.hk/statgene

• Grants that supported this work:
– Hong Kong RGC-GRF Grant [476013]
– NSFC [81473035, 31401124]
– CUHK Direct Grant [2014.01]

Acknowledgement

Thank you!

Maggie H. Wang: maggiew@cuhk.edu.hk
