

Gene-Environment Interactions, Pathways, and Genome-Wide Association Studies in Asthma: What are the Analysis Challenges?

Examples from the Children’s Health Study

Duncan Thomas

University of Southern California

Los Angeles, USA

Conceptual Model for Oxidative Stress Pathway for Effects of Air Pollution

Oxidant Exposure

Oxidative Stress

Health Effects

Molecular & enzymatic antioxidants

Dose

Physical Activity

ROS metabolism

Xenobiotic metabolism

Oxidative production & detoxification

Inflammation

Gilliland et al. EHP 1999;107:403-7

Statistical Challenges

• Exposure assessment and modeling

• GxE and GxG interactions

• Pathways
– Hierarchical modeling strategy
– Mechanistic models

• GWAS

• Collaborations

Multilevel Mixed Model

• Between times within subject

• Between subjects within community

• Between communities

Berhane et al, Statist Sci 2004; 19: 414-440

Multi-stage Model

Y = LF, t = age, Z = pollution

1: $Y_{cij} = a_{ci} + b_{ci} t_{cij} + b_1 Z_{ck} + ds(t_{cij}) + e_{cij}$
– $b_{ci}$ = subject-specific 8-yr LF growth

2: $b_{ci} = B_c + b_2 Z_{ci} + e_{ci}$
– Regression on subject-specific variables

3: $B_c = b_0 + b_3 Z_c + e_c$
– Regression on ambient pollution level

• Fit as single mixed model
• Can include confounders at each level

[Figure: FEV1 (milliliters, axis ticks at 1000 and 10000) versus age (8 to 20 years)]

Berhane et al, Statist Sci 2004; 19: 414-440

Community FEV1 growth vs. NO2

[Figure: annual FEV1 growth (%, axis 11.0 to 12.5) versus NO2 (ppb, axis 0 to 50), one point per community: AL, AT, LA, LB, LE, LM, LN, ML, RV, SD, SM, UP. R = -0.60, p = 0.025]

Gauderman et al, AJRCCM 2000: 162:1383-90

Spatial Variability of Measured Pollution and Traffic Density

• Regionally
• Within Communities

[Figure: Modeled Exposure. Residence exposure versus school exposure, both axes running from 100 to 100000. Traffic density estimates by school; median and IQR of estimates for homes]

Atmospheric Dispersion Models

Road

Wind

Residence, x

Vehicle, y - q f

Benson, CALINE4, CA Dept of Transport 1989: #205

Effects of Local Variation in Air Pollution

Prevalent Asthma, Long-Term Residents

[Figure: effect estimates (axis 0.5 to 2.5) by distance from freeway (>300 m, 150-300 m, 75-150 m, <75 m) and by percentile of modeled traffic pollutants (0-25%, 25-50%, 50-75%, 75-90%, >90%)]

McConnell et al, EHP 2006;114:766-72

Measurements of Local Variability

• Selected 234 homes and 34 schools from 10 communities

• Homes chosen based on stratified sample, above/below median distance from freeways

• Two-week NO2 measurements using Palmes tubes in two seasons each (winter & summer)

• NO, NO2, O3 measurements now available on about 1000 homes

• PM measurements currently being made on ~300 homes

Gauderman et al., Epidemiology 2005;16: 737-43

Sampling Strategies

• Case-control: choose S to be set of asthma cases and their town-matched controls
• Surrogate diversity: choose S that maximizes the variance of traffic density
• Spatial diversity: choose S that maximizes the geographic spread of measurements
– Maximize total distance from all other points
– Maximize minimum distance from nearest point
– Maximize the informativeness of sample for predicting non-sample points
• Hybrid: First measure cases and controls; then add additional subjects that would be most informative for refining E(X | Z, P, W)

Thomas, Lifetime Data Analysis 2007; 13: 565-81
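The "maximize minimum distance from nearest point" criterion can be approximated greedily. The sketch below is an illustration only, not the procedure of the cited paper: the function name `greedy_maximin`, the toy coordinates, and the choice of a greedy search are all assumptions.

```python
import numpy as np

def greedy_maximin(coords, k, start=0):
    """Greedily pick k points; each new point is the one farthest from its nearest chosen point."""
    coords = np.asarray(coords, dtype=float)
    chosen = [start]
    # Distance from every candidate to its nearest already-chosen point
    nearest = np.linalg.norm(coords - coords[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(coords - coords[nxt], axis=1))
    return chosen

# Hypothetical home locations (arbitrary units); homes 0 and 1 are near-duplicates
homes = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sample = greedy_maximin(homes, 3)
```

A greedy search does not guarantee the globally optimal subset, but it avoids spending measurements on homes that sit next to one already sampled.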

Main Effects of Air Pollution: Intra-Community Variation in Measured NO2

[Figure: measured NO2 (ppb, axis 0 to 60) by community (AL, AT, LB, LE, LN, ML, SM, RV, SD, UP). Series: Asthmatic, Nonasthmatic]

Gauderman et al., Epidemiology 2005;16: 737-43

Bayesian Spatial Measurement Error Model

[Diagram nodes: W = traffic, land use; Z = local exposure measurements; X = true exposure; Y = health outcome; L = locations; P = regional background; subsample S | Y, L, W]

Molitor et al, AJE 2006;164:69-76 (nonspatial)
Molitor et al, EHP 2007:1147-53 (spatial)

Spatial Regression Model

• Exposure model
$E(X_i) = W_i \alpha$, W = land use covariates, dispersion model predictions
$\mathrm{cov}(X_i, X_j) = \sigma^2 I_{ij} + \tau^2 \exp(-\rho D_{ij})$
MESA Air model: $x(s,t) = X_0(s) + \sum_k X_k(s) T_k(t)$

• Measurement model
$E(Z_i) = X_i$

• Disease model
$g[E(Y_i)] = \beta X_i$

• Multivariate exposure model (“co-kriging”)
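To make the exposure covariance concrete, the sketch below builds the matrix with an independent variance component on the diagonal plus an exponentially decaying spatial component. It reads the slide's s2, t2 and r as $\sigma^2$, $\tau^2$ and $\rho$; the function name `spatial_cov`, the coordinates, and the parameter values are hypothetical.

```python
import numpy as np

def spatial_cov(coords, sigma2, tau2, rho):
    """cov(X_i, X_j) = sigma2 * I_ij + tau2 * exp(-rho * D_ij), with D the Euclidean distance."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return sigma2 * np.eye(len(coords)) + tau2 * np.exp(-rho * D)

# Three hypothetical residence locations (arbitrary distance units)
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
C = spatial_cov(coords, sigma2=0.5, tau2=2.0, rho=1.0)
```

The diagonal equals $\sigma^2 + \tau^2$, and off-diagonal entries shrink toward zero as homes get farther apart.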

Spatial Measurement Error Model

[Figure: NO2 effects on lung function (log B, axis -0.5 to 0.05), estimates with 95% credible limits, for five models (Base, Dist, Dist.Buffer, Addt150m, Caline), each fitted with and without the spatial component]

Molitor et al, EHP 2007:1147-53


Multigenic Models

• Focused Interaction Testing Framework (FITF) uses likelihood ratios to test for main effects and interactions conditional on lower-order ones

• Dimension reduction by screening for G–G associations among pooled case-control sample before testing for interactions

• False Discovery Rate used to assess significance

• Better power than exploratory methods like MDR, except for interactions with no marginal effects

Millstein et al, AJHG 2006; 78:15-27
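The slide says a false discovery rate is used to assess significance but does not name the procedure. As an illustration only, the sketch below applies the Benjamini-Hochberg step-up rule, which is an assumption here and not necessarily what FITF uses; the p-values are made up.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q (step-up rule)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values from a set of main-effect and interaction tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
rejected = benjamini_hochberg(pvals, q=0.05)
```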

Multigenic Models: NQO1, MPO & CAT

Millstein et al, AJHG 2006; 78: 15-27

Effects White & Hispanic Nonwhites
NQO1 0.49 (0.32 – 0.72) 0.42 (0.21 – 0.77)
MPO 0.75 (0.49 – 1.13) 1.60 (0.93 – 2.75)
CAT 0.88 (0.56 – 1.40) 0.71 (0.21 – 1.86)
NQO1 x MPO 1.48 (0.88 – 2.49) 1.29 (0.62 – 2.57)
NQO1 x CAT 1.39 (0.77 – 2.50) 0.76 (0.01 – 3.90)
MPO x CAT 0.51 (0.25 – 0.99) 0.28 (0.04 – 1.45)
NQO1 x MPO x CAT 1.14 (0.51 – 2.51) 2.12 (0.26 – 14.1)

Unadjusted p .00026 .00008

Significance threshold .00052 .05

Integrating Toxicology and Epidemiology

• Suppose we conduct a semi-ecologic epidemiology study to observe (Yci , Xc, Gci) for individuals i in community c

• AND we characterize the biological activity Bcs of samples s of the mixture Xc in toxicologic assays on cells with genotypes Gs

• Aim is to link the parameters of the two models, so toxicology can inform the epidemiologic analysis

[Diagram nodes: $Y_{ci}$ = health outcome; $B_{cs}$ = biological activity; $X_c$ = ambient pollution; $G_{ci}$ = individual genotype; $G_s$ = cell line genotype. Parameters shown: g, b]

Putting It All Together

• Use modeled local concentrations as input to microenvironmental model for personal exposure
• Integrate over time for lifetime exposure
• Estimate uncertainties and incorporate into exposure-response analysis
• Integrate exposures, genes & biomarkers through a pathway-based biological model
• Chamber studies using particle concentrator
• Incorporate toxicological assessment of biological activity of town-specific particle composition

[Diagram of the integrated exposure-disease model. Node symbols: x(s,t), Zt, zs, Wst, Zi, Xi, Li, Yi, Gi, Bi, Pil, pil, Vil, vil, sil, nil. Node labels: spatio-temporal exposure field; central site continuous time monitors; home & school measurements; exposure predictors (e.g., traffic, weather); personal exposure measurements; long-term average personal exposure; latent disease process (e.g., inflammation); clinical outcome (e.g., asthma); genes (& other risk factors); biomarkers (e.g., eNO); GIS location histories; accelerometer activity histories; usual physical activity (Q’aire); usual times (Q’aire); usual locations; true long-term time-activity]

Modeling Entire Pathways

• Hierarchical modeling approach (Conti et al, Hum Hered 2003;56:83-93)
– Conventional logistic regression modeling of main effects and interactions
– Second level model with priors for interactions
– Bayes model averaging to allow for uncertainty about which terms to include

• PBPK modeling approach (Cortessis & Thomas, IARC Sci Publ 2004;57:127-150)
– Explicit modeling of postulated pathway(s)
– Involving latent variables for intermediate metabolites and individual rate parameters

General Concept for a “Systems Biology” Perspective in Molecular Epidemiology

[Diagram, built up over three slides: E = exposures and G = genes lead to Y = disease, first through X = main effect and interaction covariates, then through unobserved intermediate events (X1, X2, X3, …, Xn-1, Xn). Added in the last build: B = “-omics” biomarker measurements (B2, B3, …); L = “topology” of the network; Z = external biological knowledge (“ontologies”)]

Hierarchical Models

• Incorporates external knowledge about pathways as “prior covariates” for coefficients of a data model

• Level I: Epidemiologic data model:
– $\mathrm{logit}\,\Pr(Y_i = 1 \mid X_i) = \beta_0 + \sum_p \beta_p X_{ip}$
– X = (G, E, GxE, GxG, GxGxE, …)

• Level II: Pathway model:
– $\beta_p \sim N(\sum_v \pi_v Z_{pv}, \sigma^2)$
– $Z_{pv}$ = prior covariates
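One simple way to see what the second level does is a semi-Bayes calculation with the prior variance $\sigma^2$ fixed in advance. The sketch below is an assumption-laden illustration, not the fully Bayesian fit described in the talk: first-stage estimates are pulled toward the pathway-level prior mean $\sum_v \pi_v Z_{pv}$. The function name and all numbers are hypothetical.

```python
import numpy as np

def semi_bayes_shrink(b_hat, var_hat, Z, sigma2):
    """Shrink first-stage estimates toward the second-stage regression on prior covariates Z."""
    b_hat = np.asarray(b_hat, dtype=float)
    var_hat = np.asarray(var_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    w = 1.0 / (var_hat + sigma2)           # second-stage weights
    ZtW = Z.T * w
    pi_hat = np.linalg.solve(ZtW @ Z, ZtW @ b_hat)
    prior_mean = Z @ pi_hat
    shrink = sigma2 / (sigma2 + var_hat)   # weight kept on the data
    return prior_mean + shrink * (b_hat - prior_mean), pi_hat

# Four hypothetical log odds ratios; Z indicates membership in two pathways
b_hat = [0.8, 0.4, -0.1, 0.1]
var_hat = [0.04, 0.04, 0.04, 0.04]
Z = [[1, 0], [1, 0], [0, 1], [0, 1]]
b_post, pi_hat = semi_bayes_shrink(b_hat, var_hat, Z, sigma2=0.04)
```

With equal first-stage and prior variances, each estimate moves halfway toward its pathway mean.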

Prior Covariates

• Define potential “exchangeability classes”, not absolute values of differences
• Examples:
– Pathway indicators: Hung et al., CEBP 2004;13:1013-21
– In vitro functional assays: WECARE study (Concannon)
– In silico predictions (SIFT, PolyPhen, etc.): Zhu et al. Cancer Res 2004;64:2251-7
– Outputs from mechanistic models (e.g., PBPK): Parl et al., Fund Molec Epi 2008, in press
– Formal ontologies: Conti, NCI Monogr (2007)

Hierarchical Models for GxG

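• Multivariate prior for $\beta_{GxG}$:

• $\beta_p \sim N(\sum_v \pi_v Z_{pv}, \sigma^2)$

• $\boldsymbol{\beta} \sim \mathrm{MVN}[\mathbf{Z}\boldsymbol{\pi}, \sigma^2 (\mathbf{I} - \rho \mathbf{A})^{-1}]$

where A is an “adjacency” matrix describing the a priori similarity of pairs of genes derived from an ontology database or other sources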
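The covariance $\sigma^2 (I - \rho A)^{-1}$ is only valid when $I - \rho A$ is symmetric positive definite, which limits the admissible range of $\rho$. A minimal sketch with a hypothetical three-gene adjacency matrix (the function name and values are assumptions):

```python
import numpy as np

def car_covariance(A, sigma2, rho):
    """Return sigma2 * (I - rho * A)^-1 after checking that it is a valid covariance."""
    A = np.asarray(A, dtype=float)
    precision = (np.eye(len(A)) - rho * A) / sigma2
    if not np.allclose(A, A.T) or np.min(np.linalg.eigvalsh(precision)) <= 0:
        raise ValueError("I - rho*A must be symmetric positive definite")
    return np.linalg.inv(precision)

# Hypothetical adjacency: genes 0 and 1 are similar a priori, gene 2 is unrelated
A = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
cov = car_covariance(A, sigma2=1.0, rho=0.5)
```

Coefficients for adjacent genes become positively correlated a priori, while the unrelated gene stays independent.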


Colorectal Polyps Model

[Diagram node symbols: X1, X2; Z1 to Z7; G1 to G6, G8; E3, E5, E7; Y = polyps]

• Exposures: well-done red meat, smoking
• Heterocyclic amines (HCA) pathway: MeIQx, N-OH-MeIQx, N-Acetyl-OH-MeIQx
• Polycyclic aromatic hydrocarbons (PAH) pathway: BaP, BaP 7,8-Epx, BaP 7,8-Diol, BaP 7,8-Diol 9,10-Epx
• Genes shown: Cyp1A2, NAT1, NAT2, Cyp1A1, EPHX1 (mEH), GSTM3, UDP-GST

Complex Pathways

Example: Folate

• Linked differential equations models for biochemical reactions

• Genotype-specific enzyme activity rates
• Methionine intake and intracellular folate
• Boxes are metabolite concentrations, enzymes

Ulrich et al., CEBP 2008;17:1822-31
Reed et al., J Nutr 2006;136:2653-61
Ulrich et al., Nat Rev Cancer 2003;3:912-20

Mechanistic Models

• Combines differential equations models for pathway with stochastic distributions of individual metabolic rates, population parameters, and disease risks

• Fitted using MCMC methods

• Allow inference on:
– contribution of each exposure to each pathway
– contribution of each pathway to disease
– contribution of each gene to relevant pathway
– measures of individual heterogeneity
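As a toy version of "differential equations plus stochastic individual rates", the sketch below integrates a single hypothetical reaction, dM/dt = k_act · E − k_detox · M, and draws each person's activation rate from a lognormal distribution around a genotype-specific value. The one-compartment structure, the rate values, and every name are assumptions for illustration; the models cited in the talk are far richer and are fitted by MCMC.

```python
import numpy as np

def steady_state_metabolite(exposure, k_act, k_detox, t_end=50.0, dt=0.01):
    """Euler integration of dM/dt = k_act * exposure - k_detox * M, starting from M(0) = 0."""
    m = 0.0
    for _ in range(int(t_end / dt)):
        m += dt * (k_act * exposure - k_detox * m)
    return m

# Hypothetical genotype-specific activation rates (0, 1, or 2 variant alleles)
mean_k_act = {0: 0.5, 1: 0.75, 2: 1.0}
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=5)
# Individual heterogeneity: multiplicative lognormal factor with median 1
k_act_i = np.array([mean_k_act[int(g)] for g in genotypes]) * rng.lognormal(0.0, 0.2, size=5)
levels = [steady_state_metabolite(2.0, k, 1.0) for k in k_act_i]
```

At steady state the metabolite level is k_act · E / k_detox, so in this toy model genotype and individual variation in the rate translate directly into variation in internal dose.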

Stochastic Boolean Networks

Uncertainty in Pathway Structure

• Techniques like logic regression (Kooperberg & Ruczinski, Gen Epi 2005;28:157-70) and Bayesian network analysis (Friedman, Science 2004; 303: 799-805) can be used to infer network structure

• MCMC proceeds by adding, deleting nodes, changing node types, etc., to sample distribution of possible topologies

• Summarize strength of evidence for each connection and marginal risk of disease, averaging over topologies

Network of Metabolic

Pathways for Colorectal Cancer:

Top: Folate metabolism (with DNA methylation and

DNA damage / repair subpathways)

Middle: Bile acid metabolism

Bottom: PAH & HCA metabolism

Simulation of model uncertainty

[Diagram: network for colorectal cancer.
Exposure inputs (E1 to E8): Alcohol (Z0), Folate (Z2), Vit B12 (Z5), Fat (Z9), Fibre (Z10), Calcium (Z17), Smoking (Z18), WDRM (Z19).
Gene inputs: ADH3 (Z1), MTHFR (Z3), MTTR (Z4), TS (Z6), XRCC3 (Z7), XRCC1 (Z8), EPHX1 (Z11), SCL10A2 (Z12), FOK1 (Z13), VDR (Z14), PXR (Z15), CYP3A4 (Z16), CYP1A2 (Z20), NAT1 (Z21), NAT2 (Z22), CYP1A1 (Z23), GSTM (Z24).
Intermediate nodes Z25 to Z56, with labels including 5,10-MTHF, 5-MTHF, homocysteine, SAM:SAH, uracil misincorporation, SSBs, DSBs, mutation, CA, CDCA, LCA, LCA reabsorption, non-reabsorbed LCA, non Ca-soap, LCA/VDR, detoxified LCA, non-detoxified LCA, HCAs, PAHs, N-OH-MeIQx, N-Acetyl-MeIQx, B[a]P 7,8-epoxide, B[a]P 9,10-diol, B[a]P 9,10-epoxide.
Outcome: Y = colorectal cancer (Z57).
Logical node types (Z’s): Dominant (OR), Additive, Recessive (AND), Inhibitory (XOR)]

“Ridiculome?”

Fitted Model (thickness of arrows indicates posterior probabilities)

[Diagram: the colorectal cancer network with exposure inputs (Z0, Z2, Z5, Z9, Z10, Z17 to Z19) and gene inputs (Z1, Z3, Z4, Z6 to Z8, Z11 to Z16, Z20 to Z24), fitted intermediate nodes Z25 to Z55, and outcome Y = colorectal cancer (Z56)]


A Cautionary Comment

So, the modeling of the interplay of many genes — which is the aim of complex systems biology — is not without danger.

Any model can be wrong (almost by definition), but particularly complex…models have much flexibility to hide their lack of biological relevance.

Jansen RG. Studying complex biological systems through multifactorial perturbation. Nat Rev Genet 2003; 4: 145-151

http://www.mickey-mouse.com/clipartm109.htm


Some GWAS Issues

• Two-stage designs

• Incorporating priors

• Approaches to scanning for GxE

• Unifying pathway-based and agnostic approaches

• Post-GWAS

Some Methodological Issues in GWAS: The ENDGAME Consortium

• Multistage study designs
• Choice of platform for first stage
• Multiple comparisons
• Prioritizing SNPs for second stage
• Haplotype analyses using tag SNPs: unifying association and sharing
• GxE and GxG interactions
• Control of population stratification

Thomas et al, AJHG 2005:77:337-45

Multistage Design

• Stage I: full scan of 500,000 SNPs on sample of size $N_1$

• Stage II: genotype only SNPs “significant” at level $\alpha_1$ from stage I on a new sample of size $N_2$

• Final analysis combines both samples at significance level $\alpha_2$, chosen to ensure an overall Type I error rate $\alpha$
– Significance assessed conditionally on hit in stage I

• Optimize choice of $N_1$ and $\alpha_1$ to minimize cost subject to constraint on $\alpha$ and power

Satagopan et al., Genet Epidemiol 2003;25:149-57

Optimal Designs

Per-genotype cost ratio = 17.5 for stages II / I, genomewide $\alpha$ = .05, $1 - \beta$ = 0.9, 500,000 SNPs in stage I

• No additional SNPs at stage II:
– Genotype 30% of sample in stage I
– $\alpha_1$ = .0038 (i.e., 1900 SNPs in stage II)
– $\alpha_2$ = $1.7 \times 10^{-7}$
– 87% of cost goes to stage I

• Test 5 flanking markers per hit in stage II:
– Genotype 49% of sample in stage I
– $\alpha_1$ = .0005 (250 loci & 1500 SNPs in stage II)
– $\alpha_2$ = $0.5 \times 10^{-7}$
– 95% of cost goes to stage I

Wang et al., Genet Epidemiol 2006:30:356-68
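The cost shares quoted for the two designs can be checked with back-of-envelope arithmetic. The sketch below assumes that cost is proportional to (subjects genotyped) × (SNPs genotyped) × (per-genotype cost), and that everyone not genotyped in stage I is genotyped in stage II; these assumptions and the function name are illustrative, not taken from the cited paper.

```python
def stage1_cost_share(frac_stage1, n_snps_stage1, n_snps_stage2, cost_ratio):
    """Fraction of total genotyping cost spent in stage I (stage I per-genotype cost = 1)."""
    cost1 = frac_stage1 * n_snps_stage1
    cost2 = (1.0 - frac_stage1) * n_snps_stage2 * cost_ratio
    return cost1 / (cost1 + cost2)

M = 500_000                          # SNPs scanned in stage I
snps_a = round(0.0038 * M)           # expected stage II SNPs under the null, design A
loci_b = round(0.0005 * M)           # expected stage II loci, design B
snps_b = loci_b * (1 + 5)            # each hit plus 5 flanking markers

share_a = stage1_cost_share(0.30, M, snps_a, cost_ratio=17.5)
share_b = stage1_cost_share(0.49, M, snps_b, cost_ratio=17.5)
```

Under these assumptions the shares come out close to the 87% and 95% quoted on the slide.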


Hierarchical Approach to Prioritizing SNPs

• Standard multistage designs assume the $\alpha_1$ most significant SNPs from the first stage will be tested in later stage(s)

• Can we do better?
• False discovery rate weighted by prior knowledge (Roeder et al, AJHG 2006;78:243-52)
• Bayesian FDR (Whittemore, J Appl Statist 2007;34:1-9)
• Empirical Bayes ranking, using an exchangeable mixture prior with a large mass at RR = 1
• Adding prior knowledge to hierarchical Bayes (Lewinger et al, GE 2007;31:871-82)

Hierarchical Approach to Prioritizing SNPs

• Three level model:
– I: model for distribution of observed chi statistics $\chi$ in relation to true noncentrality parameter $\lambda$
– II: mixture model for $\lambda$ as either null with probability $1 - \pi$ or non-null with probability $\pi$, mean $\mu$ and variance $\sigma^2$
– III: logistic model for $\pi$ and linear model for $\mu$ as regressions on prior covariates Z

• Ranking of SNPs by:
– posterior probability of being non-null
– posterior mean of $\lambda$ given non-null

Lewinger et al, GE 2007;31:871-82
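The two ranking quantities have closed forms under a simplified normal version of the model. The sketch below assumes a signed test statistic z ~ N($\lambda$, 1), with $\lambda$ = 0 with probability 1 − $\pi$ and $\lambda$ ~ N($\mu$, $\sigma^2$) otherwise. This normal-normal setup is an assumption made for illustration (the slide refers to chi statistics), and $\pi$, $\mu$, $\sigma^2$ are treated as known rather than regressed on prior covariates.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def rank_stats(z, pi, mu, sigma2):
    """Posterior probability of being non-null, and posterior mean of lambda given non-null."""
    f_null = normal_pdf(z, 0.0, 1.0)
    f_alt = normal_pdf(z, mu, 1.0 + sigma2)   # marginal density of z under the non-null component
    post_prob = pi * f_alt / (pi * f_alt + (1.0 - pi) * f_null)
    post_mean = (sigma2 * z + mu) / (1.0 + sigma2)
    return post_prob, post_mean

# Hypothetical SNPs: a strong and a weak signal with the same prior
p_strong, m_strong = rank_stats(4.0, pi=0.01, mu=2.0, sigma2=1.0)
p_weak, m_weak = rank_stats(1.0, pi=0.01, mu=2.0, sigma2=1.0)
```

In the full model, prior covariates would shift $\pi$ and $\mu$ SNP by SNP, so two SNPs with the same statistic could be ranked differently.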

[Figure: power (axis 0 to 0.2) versus Type I error (0.01 to 0.10), in two panels: b1 = 0.693 and b1 = 1.1. Curves: E m and P m, each with covariates in means model only, in probability model only, and in both]

Lewinger et al, Gen Epi 2007; 31:871-82


Sample Sizes Needed for GxE

• Required # case-control pairs, $\alpha$ = 0.05 / $\alpha$ = $1 \times 10^{-7}$ (assuming we are testing the causal locus)

Intxn effect   Exposure      Variant G prevalence
OR_GxE         prevalence    0.05              0.40
2.0            0.1           6,238 / 12,110    1,364 / 2,748
2.0            0.5           2,515 / 4,946     547 / 1,325
5.0            0.1           1,001 / 1,293     245 / 386
5.0            0.5           459 / 657         113 / 320

Minimum Detectable Effect Sizes

p(G)    OR_G main effect    OR_GxE interaction
                            p(E) = 0.1    p(E) = 0.4
0.05    2.05                8.6           4.3
0.10    1.77                5.4           3.2
0.20    1.68                4.3           2.7

$\alpha$ = $1 \times 10^{-7}$, $1 - \beta$ = 0.80, N = 1000 cases, 2000 controls

Case-Only Design for GxE

Exposure     Cases                      Controls
Genotype:    Non-carrier   Carrier      Non-carrier   Carrier
Unexposed    a             b            A             B
Exposed      c             d            C             D

OR_GxE estimators:
– Case-control: (ad/bc) / (AD/BC)
– Case-only: ad/bc, assuming no G-E association in controls (Umbach et al, Stat Med 1994;13:153-62)

• Smaller variance (more power) than case-control test
• Can’t first test this assumption in controls and then decide whether to do case-only or case-control (Albert et al, AJE 2001;154:687-93)
• But can combine case-only and case-control estimators (Mukherjee et al, GE 2008;32:615-26; Li & Conti, AJE in press)
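Both estimators and their large-sample variances on the log scale follow directly from the cell counts. The sketch below uses the usual sum-of-reciprocals (Woolf) variance for a log odds ratio; the counts are invented so that G and E are independent in controls.

```python
import math

def gxe_estimates(a, b, c, d, A, B, C, D):
    """Case-only and case-control estimates of OR_GxE, with variances of their logs."""
    case_only = (a * d) / (b * c)
    control_or = (A * D) / (B * C)          # G-E odds ratio among controls
    case_control = case_only / control_or
    var_case_only = 1 / a + 1 / b + 1 / c + 1 / d
    var_case_control = var_case_only + 1 / A + 1 / B + 1 / C + 1 / D
    return case_only, case_control, var_case_only, var_case_control

# Hypothetical counts, laid out as in the table above
co, cc, var_co, var_cc = gxe_estimates(a=400, b=100, c=200, d=100,
                                       A=400, B=100, C=200, D=50)
se_ratio = math.sqrt(var_cc / var_co)  # how much wider the case-control interval is
```

The case-control variance always exceeds the case-only variance because it adds the four control-cell terms, which is the source of the power advantage noted on the slide.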

Case-control vs. Case-only Design

N for 80% power ($\alpha$ = .05): case-control / case-only

Intxn effect   Exposure      Variant G prevalence
OR_GxE         prevalence    0.05             0.40
2.0            0.1           6,238 / 2,498    1,364 / 567
2.0            0.5           2,515 / 1,020    547 / 273
5.0            0.1           1,001 / 267      245 / 80
5.0            0.5           459 / 136        113 / 66

Two-Stage Approach to GxE

• Step 1: Screen genome-wide to find SNPs most likely to be involved in a GxE interaction by testing for G-E association in combined case and control sample

• Step 2: Only test these ‘likely’ SNPs using the standard 1-df case-control interaction test

Murcray et al. Am J Epidemiol 2009;169:219-26

GWAS Test for GxE Interaction: Power for 2-step vs. 1-step method

[Figure: power (0 to 1) versus interaction effect size Rge (1 to 5), for the 1-step analysis and the 2-step analysis]

Murcray et al., AJE 2009:169:219-26


Using Hierarchical Models to Incorporate Pathways into GWAS

• Two approaches to unification
– Use GWAS to “discover” pathways
– Use pathways to inform GWAS

• Approach 1: Bayesian network analysis, gene set enrichment analysis, or other exploratory methods

– Subramanian et al, PNAS 2005; 102: 15545-50

• Approach 2: Treat pathway indicators as prior covariates

– Wang et al, Am J Hum Genet 2007;81:1278-83

Post-GWAS: Resequencing Designs

Marker | Disease | Causal allele | Pr(D=1|M,Y)
M | Y | D=0 | D=1

Positive marker association

Positive LD and positive causal association (=0.036, RR_YD = 2, RR_YM = 1.22)
M   Controls   0.796   0.004   0.005
    Cases      0.758   0.008   0.010
M   Controls   0.154   0.046   0.230
    Cases      0.147   0.088   0.374

Negative LD and negative causal association (=.010, RR_YD = 0, RR_YM = 1.067)
M   Controls   0.750   0.050   0.063
    Cases      0.789   0.000   0.000
M   Controls   0.200   0.000   0.000
    Cases      0.211   0.000   0.000

Negative marker association

Negative LD and positive causal association (=0.010, RR_YD = 3, RR_YM = 0.889)
M   Controls   0.750   0.050   0.063
    Cases      0.682   0.136   0.136
M   Controls   0.200   0.000   0.000
    Cases      0.186   0.000   0.000

Positive LD and negative causal association (=0.036, RR_YD = 0.5, RR_YM = 0.887)
M   Controls   0.796   0.004   0.005
    Cases      0.816   0.002   0.003
M   Controls   0.200   0.046   0.230
    Cases      0.211   0.024   0.130

Analysis strategy: Combines full sequencing information on a stratified subset with SNP data on main study

Thomas et al, GE 2007;27:401-4

Thomas et al, Statist Sci 2009, in press


Statistical Issues in Collaborations

• Combining population-based, family-based, and pedigree studies

• Meta-analysis or mega-analysis?

• Data harmonization
– Phenotypes
– Genotypes
– Exposures and other risk factors

• Allowing for and understanding heterogeneity
– Fixed vs random effects models
– Meta-regression
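The fixed versus random effects contrast can be illustrated with inverse-variance pooling and the DerSimonian-Laird moment estimator of between-study variance. The slide does not name an estimator, so DerSimonian-Laird is an assumption here; the study estimates are invented.

```python
import numpy as np

def meta_analysis(estimates, variances):
    """Fixed-effect and DerSimonian-Laird random-effects pooled estimates."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - fixed) ** 2)                 # Cochran's Q heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # between-study variance
    w_star = 1.0 / (v + tau2)
    random = np.sum(w_star * y) / np.sum(w_star)
    return fixed, random, tau2, q

# Hypothetical log odds ratios from two studies that disagree strongly
fixed, random, tau2, q = meta_analysis([0.0, 1.0], [0.01, 0.01])
```

When the studies agree, the estimated between-study variance is zero and the two pooled estimates coincide; when they disagree, the random-effects weights flatten and the pooled standard error grows.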

Conclusions

• Costs have now become feasible: many such studies now being undertaken

• Results of first publications very promising

• Efficient design and analysis strategies are essential

• Rich area for statistical research

• “Agnostic” genomewide scans and pathway-driven multigenic modeling are complementary

Acknowledgments

Epidemiology: John Peters, Frank Gilliland, Rob McConnell, Nino Kuenzli, Stephanie London

Biostatistics: Jim Gauderman, Kiros Berhane, Mike Jerrett, Bryan Langholz, David Conti, Dan Stram, Bill Navidi

Field Work & Exposure Assessment: Ed Avol, Fred Lurman, Field team (many!)

Funding:
California Air Resources Board
Helene Margolis
National Institute of Environmental Health Sciences
National Heart, Lung & Blood Institute
Health Effects Institute

Data Management & Analysis: Ed Rappaport, Hita Vora, Josh Millstein, Yu-Fen Li, Talat Islam, John & Jassy Molitor

Genetics: Louis Dubeau

Respiratory Medicine: Bill Linn
