Gene-Environment Interactions, Pathways, and Genome-Wide
Association Studies in Asthma:
What are the Analysis Challenges?
Examples from the Children’s Health Study
Duncan Thomas
University of Southern California
Los Angeles, USA
Conceptual Model for Oxidative Stress Pathway for Effects of Air Pollution
[Diagram nodes: oxidant exposure; dose; physical activity; molecular & enzymatic antioxidants; ROS metabolism; xenobiotic metabolism; oxidative production & detoxification; oxidative stress; inflammation; health effects]
Gilliland et al. EHP 1999;107:403-7
Statistical Challenges
• Exposure assessment and modeling
• GxE and GxG interactions
• Pathways
– Hierarchical modeling strategy
– Mechanistic models
• GWAS
• Collaborations
Multilevel Mixed Model
• Between times within subject
• Between subjects within community
• Between communities
Berhane et al, Statist Sci 2004; 19: 414-440
Multi-stage Model
Y = LF (lung function), t = age, Z = pollution
1: $Y_{cij} = a_{ci} + b_{ci} t_{cij} + b_1 Z_{ck} + d\,s(t_{cij}) + e_{cij}$
– $b_{ci}$ = subject-specific 8-yr LF growth
2: $b_{ci} = B_c + b_2 Z_{ci} + e_{ci}$
– Regression on subject-specific variables
3: $B_c = b_0 + b_3 Z_c + e_c$
– Regression on ambient pollution level
• Fit as single mixed model
• Can include confounders at each level
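The three levels can be illustrated with a stage-by-stage least-squares fit on simulated data. This is only a sketch: it is not the single mixed model the slide recommends, it omits the time-varying pollution and spline terms of level 1, and the function name `fit_stagewise` and all simulated values are assumptions made for illustration.

```python
import numpy as np

def fit_stagewise(t, y, subject, community, z_community):
    """Stage-by-stage approximation to the multilevel growth model.

    t, y        : age and lung function, one entry per observation
    subject     : subject index of each observation
    community   : community index of each subject
    z_community : ambient pollution level of each community
    Returns (b0, b3), the intercept and slope of the community-level regression.
    """
    n_subj = community.shape[0]
    slopes = np.empty(n_subj)
    for i in range(n_subj):
        rows = subject == i
        # Level 1: subject-specific growth slope b_ci (polyfit returns slope first)
        slopes[i] = np.polyfit(t[rows], y[rows], 1)[0]
    n_comm = z_community.shape[0]
    # Level 2 (no subject-level covariates here): community mean slope B_c
    B = np.array([slopes[community == c].mean() for c in range(n_comm)])
    # Level 3: regress B_c on ambient pollution Z_c
    b3, b0 = np.polyfit(z_community, B, 1)
    return b0, b3

# Simulated example (all values hypothetical)
rng = np.random.default_rng(0)
n_comm, n_per, n_obs = 12, 30, 8
z_c = np.linspace(5.0, 40.0, n_comm)
community = np.repeat(np.arange(n_comm), n_per)
true_b0, true_b3 = 12.0, -0.02
B_c = true_b0 + true_b3 * z_c + rng.normal(0, 0.05, n_comm)
b_ci = B_c[community] + rng.normal(0, 0.3, community.size)
subject = np.repeat(np.arange(community.size), n_obs)
t = np.tile(np.arange(n_obs, dtype=float), community.size)
y = 100.0 + b_ci[subject] * t + rng.normal(0, 1.0, subject.size)

b0_hat, b3_hat = fit_stagewise(t, y, subject, community, z_c)
print(b0_hat, b3_hat)
```

Fitting all levels jointly, as the slide suggests, would additionally propagate the uncertainty of each stage into the next; the stagewise version ignores it.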
[Figure: FEV1 (milliliters, 1000 to 10000) vs. age (years, 8 to 20)]
Berhane et al, Statist Sci 2004; 19: 414-440
Community FEV1 growth vs. NO2
Gauderman et al, AJRCCM 2000: 162:1383-90
[Figure: annual FEV1 growth (%, 11.0 to 12.5) vs. NO2 (ppb, 0 to 50) for communities LB, SD, ML, AL, LA, AT, LN, LE, RV, UP, LM, SM; R = -0.60, p = 0.025]
Spatial Variability of Measured Pollution and Traffic Density
Regionally
Within Communities
[Figure: modeled exposure, residence exposure vs. school exposure, both axes from 100 to 100,000]
Traffic Density Estimates by School
Median and IQR of Estimates for Homes
Atmospheric Dispersion Models
[Diagram: road, wind direction, residence at x, vehicle at y]
Benson, CALINE4, CA Dept of Transport 1989: #205
Effects of Local Variation in Air Pollution
Prevalent Asthma, Long-Term Residents
McConnell et al, EHP 2006:114:766-72
[Figure: prevalent asthma among long-term residents (vertical axis 0.5 to 2.5) by distance from freeway (>300 m, 150-300 m, 75-150 m, <75 m) and by modeled traffic pollutants (0-25%, 25-50%, 50-75%, 75-90%, >90%)]
Measurements of Local Variability
• Selected 234 homes and 34 schools from 10 communities
• Homes chosen based on stratified sample, above/below median distance from freeways
• Two-week NO2 measurements using Palmes tubes in two seasons each (winter & summer)
• NO, NO2, O3 measurements now available on about 1000 homes
• PM measurements currently being made on ~300 homes
Gauderman et al., Epidemiology 2005;16: 737-43
Sampling Strategies
• Case-control: choose S to be set of asthma cases and their town-matched controls
• Surrogate diversity: choose S that maximizes the variance of traffic density
• Spatial diversity: choose S that maximizes the geographic spread of measurements
– Maximize total distance from all other points
– Maximize minimum distance from nearest point
– Maximize the informativeness of sample for predicting non-sample points
• Hybrid: First measure cases and controls; then add additional subjects that would be most informative for refining E(X | Z, P, W)
Thomas, Lifetime Data Analysis 2007; 13: 565-81
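The "maximize minimum distance from nearest point" criterion can be approximated with a greedy farthest-point rule. The sketch below is one possible heuristic, not the procedure of the cited paper; the function name `greedy_maximin` and the toy coordinates are assumptions.

```python
import numpy as np

def greedy_maximin(coords, k, start=0):
    """Greedy farthest-point selection: repeatedly add the candidate whose
    distance to its nearest already-selected point is largest."""
    coords = np.asarray(coords, dtype=float)
    selected = [start]
    # Distance from every candidate to its nearest selected point
    nearest = np.linalg.norm(coords - coords[start], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(nearest))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(coords - coords[nxt], axis=1)
        nearest = np.minimum(nearest, dist_to_new)
    return selected

# Hypothetical home locations (arbitrary units)
homes = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0],
                  [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
chosen = greedy_maximin(homes, 3)
print(chosen)
```

The greedy rule does not guarantee the globally optimal subset, but it is cheap and tends to spread the sample across the study area.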
Main Effects of Air Pollution: Intra-Community Variation in Measured NO2
[Figure: measured NO2 (ppb, 0 to 60) by community (AL, AT, LB, LE, LN, ML, SM, RV, SD, UP), shown for asthmatic and nonasthmatic children]
Gauderman et al., Epidemiology 2005;16: 737-43
Bayesian Spatial Measurement Error Model
[Diagram nodes: W = traffic, land use; Z = local exposure measurements; Y = health outcome; X = true exposure; L = locations; P = regional background; subsample S | Y, L, W]
Molitor et al, AJE 2006;164:69-76 (nonspatial)
Molitor et al, EHP 2007:1147-53 (spatial)
Spatial Regression Model
• Exposure model
$E(X_i) = W_i \alpha$, where W = land use covariates, dispersion model predictions
$\mathrm{cov}(X_i, X_j) = \sigma^2 I_{ij} + \tau^2 \exp(-\rho D_{ij})$
MESA Air model:
$x(s,t) = X_0(s) + \sum_k X_k(s) T_k(t)$
• Measurement model
$E(Z_i) = X_i$
• Disease model
$g[E(Y_i)] = \beta X_i$
• Multivariate exposure model (“co-kriging”)
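As an illustration of the exposure covariance above (an independent variance component plus an exponentially decaying spatial component), the following sketch builds the covariance matrix from point coordinates. The function name `spatial_covariance` and the parameter values are assumptions, and Euclidean distance is assumed for D.

```python
import numpy as np

def spatial_covariance(coords, sigma2, tau2, rho):
    """cov(X_i, X_j) = sigma2 * I_ij + tau2 * exp(-rho * D_ij),
    with D_ij the Euclidean distance between locations i and j."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return sigma2 * np.eye(len(coords)) + tau2 * np.exp(-rho * D)

# Three hypothetical locations
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
C = spatial_covariance(coords, sigma2=1.0, tau2=4.0, rho=0.5)
print(np.round(C, 3))
```

The diagonal equals sigma2 + tau2, and off-diagonal entries shrink toward zero as the distance between homes grows.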
Spatial Measurement Error Model
[Figure: NO2 effects on lung function (log B, -0.5 to 0.05), estimates and 95% credible limits, with and without spatial terms, for models Base, Dist, Dist.Buffer, Addt150m, Caline]
Molitor et al, EHP 2007:1147-53
Statistical Challenges
• Exposure assessment and modeling
• GxE and GxG interactions
• Pathways
– Hierarchical modeling strategy
– Mechanistic models
• GWAS
• Collaborations
Multigenic Models
• Focused Interaction Testing Framework (FITF) uses likelihood ratios to test for main effects and interactions conditional on lower-order ones
• Dimension reduction by screening for G–G associations among pooled case-control sample before testing for interactions
• False Discovery Rate used to assess significance
• Better power than exploratory methods like MDR, except for interactions with no marginal effects
Millstein et al, AJHG 2006; 78:15-27
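The slide does not say which false discovery rate procedure is used. As a generic illustration, the sketch below applies the Benjamini-Hochberg step-up rule to a set of p-values; the choice of procedure, the function name `benjamini_hochberg`, and the p-values are assumptions.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean array marking the tests rejected at FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up thresholds q * rank / m for the sorted p-values
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject everything up to the largest rank that meets its threshold
        k = np.nonzero(passed)[0].max()
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from a set of interaction tests
pvals = [0.0002, 0.009, 0.013, 0.04, 0.2, 0.6, 0.9]
rejected = benjamini_hochberg(pvals, q=0.05)
print(rejected)
```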
Multigenic Models: NQO1, MPO & CAT
Millstein et al, AJHG 2006; 78: 15-27

Effects                   White & Hispanic      Nonwhites
NQO1                      0.49 (0.32 – 0.72)    0.42 (0.21 – 0.77)
MPO                       0.75 (0.49 – 1.13)    1.60 (0.93 – 2.75)
CAT                       0.88 (0.56 – 1.40)    0.71 (0.21 – 1.86)
NQO1 x MPO                1.48 (0.88 – 2.49)    1.29 (0.62 – 2.57)
NQO1 x CAT                1.39 (0.77 – 2.50)    0.76 (0.01 – 3.90)
MPO x CAT                 0.51 (0.25 – 0.99)    0.28 (0.04 – 1.45)
NQO1 x MPO x CAT          1.14 (0.51 – 2.51)    2.12 (0.26 – 14.1)
Unadjusted p              .00026                .00008
Significance threshold    .00052                .05
Integrating Toxicology and Epidemiology
• Suppose we conduct a semi-ecologic epidemiology study to observe (Yci , Xc, Gci) for individuals i in community c
• AND we characterize the biological activity Bcs of samples s of the mixture Xc in toxicologic assays on cells with genotypes Gs
• Aim is to link the parameters of the two models, so toxicology can inform the epidemiologic analysis
[Diagram: Yci = health outcome; Bcs = biological activity; Xc = ambient pollution; Gci = individual genotype; Gs = cell line genotype; arrows labeled with parameters γ and β]
Putting It All Together
• Use modeled local concentrations as input to microenvironmental model for personal exposure
• Integrate over time for lifetime exposure
• Estimate uncertainties and incorporate into exposure-response analysis
• Integrate exposures, genes & biomarkers through a pathway-based biological model
• Chamber studies using particle concentrator
• Incorporate toxicological assessment of biological activity of town-specific particle composition
[Diagram linking exposure, time-activity, genes, biomarkers, and disease. Node symbols: x(s,t), Zt, zs, Wst, Zi, Xi, Li, Yi, Gi, Bi, Pil, pil, Vil, vil, sil, nil. Node labels: spatio-temporal exposure field; central site continuous time monitors; home & school measurements; exposure predictors (e.g., traffic, weather); personal exposure measurements; long-term average personal exposure; latent disease process (e.g., inflammation); clinical outcome (e.g., asthma); genes (& other risk factors); biomarkers (e.g., eNO); GIS location histories; accelerometer activity histories; usual physical activity (Q’aire); usual times (Q’aire); usual locations; true long-term time-activity]
Modeling Entire Pathways
• Hierarchical modeling approach (Conti et al, Hum Hered 2003;56:83-93)
– Conventional logistic regression modeling of main effects and interactions
– Second level model with priors for interactions
– Bayes model averaging to allow for uncertainty about which terms to include
• PBPK modeling approach (Cortessis & Thomas, IARC Sci Publ 2004;57:127-150)
– Explicit modeling of postulated pathway(s)
– Involving latent variables for intermediate metabolites and individual rate parameters
General Concept for a “Systems Biology” Perspective in Molecular Epidemiology
[Diagram, built up in three steps: (1) exposures E and genes G enter as main effect and interaction covariates X for disease Y; (2) E and G act on Y through unobserved intermediate events; (3) the unobserved intermediate events X1, X2, X3, …, Xn-1, Xn are linked to “-omics” biomarker measurements B2, B3, …, to the “topology” of the network L, and to external biological knowledge Z (“ontologies”)]
Hierarchical Models
• Incorporates external knowledge about pathways as “prior covariates” for coefficients of a data model
• Level I: Epidemiologic data model:
– $\mathrm{logit}\,\Pr(Y_i = 1 \mid X_i) = \beta_0 + \sum_p \beta_p X_{ip}$
– X = (G, E, GxE, GxG, GxGxE, …)
• Level II: Pathway model:
– $\beta_p \sim N(\sum_v \pi_v Z_{pv}, \sigma^2)$
– $Z_{pv}$ = prior covariates
Prior Covariates
• Define potential “exchangeability classes”, not absolute values of differences
• Examples:
– Pathway indicators
   – Hung et al., CEBP 2004;13:1013-21
– In vitro functional assays
   – WECARE study (Concannon)
– In silico predictions (SIFT, PolyPhen, etc.)
   – Zhu et al. Cancer Res 2004;64:2251-7
– Outputs from mechanistic models (e.g., PBPK)
   – Parl et al., Fund Molec Epi 2008, in press
– Formal ontologies
   – Conti, NCI Monogr (2007)
Hierarchical Models for GxG
• Multivariate prior for $\beta_{GxG}$:
• $\beta_p \sim N(\sum_v \pi_v Z_{pv}, \sigma^2)$
• $\beta \sim \mathrm{MVN}[\Pi Z_v, \sigma^2 (I - \rho A)^{-1}]$
where A is an “adjacency” matrix describing the a priori similarity of pairs of genes derived from an ontology database or other sources
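To see how the adjacency matrix induces prior correlation between interaction coefficients, the sketch below computes the prior covariance σ²(I − ρA)⁻¹ for a toy adjacency matrix. The function name `gxg_prior_covariance`, the symmetric 0/1 coding of A, and the parameter values are assumptions.

```python
import numpy as np

def gxg_prior_covariance(A, sigma2, rho):
    """Prior covariance sigma2 * (I - rho * A)^(-1) for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    precision_shape = np.eye(A.shape[0]) - rho * A
    # The prior is proper only if (I - rho * A) is positive definite
    if np.linalg.eigvalsh(precision_shape).min() <= 0:
        raise ValueError("rho is too large for this adjacency matrix")
    return sigma2 * np.linalg.inv(precision_shape)

# Toy example: coefficients 0 and 1 are "adjacent", coefficient 2 is isolated
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
V = gxg_prior_covariance(A, sigma2=1.0, rho=0.5)
print(np.round(V, 3))
```

Adjacent coefficients receive positive prior covariance (here 2/3), so evidence for one borrows strength for the other, while the isolated coefficient stays independent with variance σ².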
Statistical Challenges
• Exposure assessment and modeling
• GxE and GxG interactions
• Pathways
– Hierarchical modeling strategy
– Mechanistic models
• GWAS
• Collaborations
Modeling Entire Pathways
• Hierarchical modeling approach (Conti et al, Hum Hered 2003;56:83-93)
– Conventional logistic regression modeling of main effects and interactions
– Second level model with priors for interactions
– Bayes model averaging to allow for uncertainty about which terms to include
• PBPK modeling approach (Cortessis & Thomas, IARC Sci Publ 2004;57:127-150)
– Explicit modeling of postulated pathway(s)
– Involving latent variables for intermediate metabolites and individual rate parameters
Colorectal Polyps Model
[Diagram: heterocyclic amines (HCA) pathway with exposure well-done red meat and metabolites MeIQx, N-OH-MeIQx, N-Acetyl-OH-MeIQx; polycyclic aromatic hydrocarbons (PAH) pathway with exposure smoking and metabolites BaP, BaP 7,8-Epx, BaP 7,8-Diol, BaP 7,8-Diol 9,10-Epx; outcome Y = polyps. Enzymes shown: Cyp1A2, NAT1, NAT2, Cyp1A1, EPHX1 (mEH), GSTM3, UDP-GST. Model nodes: exposures E3, E5, E7; genes G1 to G6 and G8; intermediates X1, X2, Z1 to Z7]
Complex Pathways
Example: Folate
• Linked differential equations models for biochemical reactions
• Genotype-specific enzyme activity rates
• Methionine intake and intracellular folate
• Boxes are metabolite concentrations, enzymes
Ulrich et al., CEBP 2008:17:1822-31
Reed et al., J Nutr 2006;136:2653-61
Ulrich et al., Nat Rev Cancer 2003;3:912-20
Mechanistic Models
• Combines differential equations models for pathway with stochastic distributions of individual metabolic rates, population parameters, and disease risks
• Fitted using MCMC methods
• Allow inference on:
– contribution of each exposure to each pathway
– contribution of each pathway to disease
– contribution of each gene to relevant pathway
– measures of individual heterogeneity
Stochastic Boolean Networks
Uncertainty in Pathway Structure
• Techniques like logic regression (Kooperberg & Ruczinski, Gen Epi 2005;28:157-70) and Bayesian network analysis (Friedman, Science 2004; 303: 799-805) can be used to infer network structure
• MCMC proceeds by adding, deleting nodes, changing node types, etc., to sample distribution of possible topologies
• Summarize strength of evidence for each connection and marginal risk of disease, averaging over topologies
Network of Metabolic Pathways for Colorectal Cancer:
Top: Folate metabolism (with DNA methylation and DNA damage / repair subpathways)
Middle: Bile acid metabolism
Bottom: PAH & HCA metabolism
Simulation of model uncertainty
[Diagram: simulated network. Input nodes Z0 to Z24: exposures alcohol (Z0), folate (Z2), vitamin B12 (Z5), fat (Z9), fibre (Z10), calcium (Z17), smoking (Z18), WDRM (Z19); genes ADH3 (Z1), MTHFR (Z3), MTTR (Z4), TS (Z6), XRCC3 (Z7), XRCC1 (Z8), EPHX1 (Z11), SCL10A2 (Z12), FOK1 (Z13), VDR (Z14), PXR (Z15), CYP3A4 (Z16), CYP1A2 (Z20), NAT1 (Z21), NAT2 (Z22), CYP1A1 (Z23), GSTM (Z24). Intermediate nodes Z25 to Z56, labeled with 5,10-MTHF, 5-MTHF, homocysteine, SAM:SAH, uracil misincorporation, SSBs, DSBs, CA/CDCA, LCA reabsorption, non-reabsorbed LCA, detoxified LCA, non-detoxified LCA, LCA/VDR, non Ca-soap LCA, HCAs, PAHs, N-OH-MeIQx, N-Acetyl-MeIQx, B[a]P 7,8-epoxide, B[a]P 9,10-diol, B[a]P 9,10-epoxide, mutation. Outcome Y = colorectal cancer (Z57)]
Logical node types (Z’s): Dominant (OR), Additive, Recessive (AND), Inhibitory (XOR)
“Ridiculome?”
Fitted Model (thickness of arrows indicates posterior probabilities)
[Diagram: the fitted network, with the same exposure and gene inputs Z0 to Z24, intermediate nodes Z25 to Z55, and outcome Y = colorectal cancer (Z56)]
A Cautionary Comment
So, the modeling of the interplay of many genes — which is the aim of complex systems biology — is not without danger.
Any model can be wrong (almost by definition), but particularly complex…models have much flexibility to hide their lack of biological relevance.
Jansen RG. Studying complex biological systems through multifactorial perturbation. Nat Rev Genet 2003; 4: 145-151
http://www.mickey-mouse.com/clipartm109.htm
Statistical Challenges
• Exposure assessment and modeling
• GxE and GxG interactions
• Pathways
– Hierarchical modeling strategy
– Mechanistic models
• GWAS
• Collaborations
Some GWAS Issues
• Two-stage designs
• Incorporating priors
• Approaches to scanning for GxE
• Unifying pathway-based and agnostic approaches
• Post-GWAS
Some Methodological Issues in GWAS: The ENDGAME Consortium
• Multistage study designs
• Choice of platform for first stage
• Multiple comparisons
• Prioritizing SNPs for second stage
• Haplotype analyses using tag SNPs: unifying association and sharing
• GxE and GxG interactions
• Control of population stratification
Thomas et al, AJHG 2005:77:337-45
Multistage Design
• Stage I: full scan of 500,000 SNPs on sample of size $N_1$
• Stage II: genotype only SNPs “significant” at level $\alpha_1$ from stage I on a new sample of size $N_2$
• Final analysis combines both samples at significance level $\alpha_2$, chosen to ensure an overall Type I error rate $\alpha$
– Significance assessed conditionally on hit in stage I
• Optimize choice of $N_1$ and $\alpha_1$ to minimize cost subject to constraint on $\alpha$ and power
Satagopan et al., Genet Epidemiol 2003;25:149-57
Optimal Designs
Per-genotype cost ratio = 17.5 for stages II / I, genomewide $\alpha$ = .05, $1 - \beta$ = 0.9, 500,000 SNPs in stage I
• No additional SNPs at stage II:
– Genotype 30% of sample in stage I
– $\alpha_1$ = .0038 (i.e., 1900 SNPs in stage II)
– $\alpha_2$ = 1.7x10^-7
– 87% of cost goes to stage I
• Test 5 flanking markers per hit in stage II:
– Genotype 49% of sample in stage I
– $\alpha_1$ = .0005 (250 loci & 1500 SNPs in stage II)
– $\alpha_2$ = 0.5x10^-7
– 95% of cost goes to stage I
Wang et al., Genet Epidemiol 2006:30:356-68
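The stage II SNP counts quoted for the two designs follow from the expected number of null SNPs passing the stage I threshold. A small arithmetic sketch (the function name `stage2_load` is an assumption, and true associations are ignored because they are few relative to 500,000):

```python
def stage2_load(n_snps, alpha1, flanking=0):
    """Expected loci passing stage I under the null, and SNPs typed in stage II
    when each hit is followed up with `flanking` additional markers."""
    loci = round(n_snps * alpha1)
    return loci, loci * (1 + flanking)

no_flanking = stage2_load(500_000, 0.0038)
with_flanking = stage2_load(500_000, 0.0005, flanking=5)
print(no_flanking)    # (1900, 1900)
print(with_flanking)  # (250, 1500)
```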
Some Methodological Issues in GWAS: The ENDGAME Consortium
• Multistage study designs
• Choice of platform for first stage
• Multiple comparisons
• Prioritizing SNPs for second stage
• Haplotype analyses using tag SNPs: unifying association and sharing
• GxE and GxG interactions
• Control of population stratification
Thomas et al, AJHG 2005:77:337-45
Hierarchical Approach to Prioritizing SNPs
• Standard multistage designs assume the $\alpha_1$ most significant SNPs from the first stage will be tested in later stage(s)
• Can we do better?
• False discovery rate weighted by prior knowledge (Roeder et al, AJHG 2006:78:243-52)
• Bayesian FDR (Whittemore, J Appl Statist, 2007:34:1-9)
• Empirical Bayes ranking, using an exchangeable mixture prior with a large mass at RR = 1
• Adding prior knowledge to hierarchical Bayes (Lewinger et al, GE 2007;31:871-82)
Hierarchical Approach to Prioritizing SNPs
• Three level model:
– I: model for distribution of observed chi statistics $\chi$ in relation to true noncentrality parameter $\lambda$
– II: mixture model for $\lambda$ as either null with probability $1 - \pi$ or non-null with probability $\pi$, mean $\mu$ and variance $\sigma^2$
– III: logistic model for $\pi$ and linear model for $\mu$ as regressions on prior covariates Z
• Ranking of SNPs by:
– posterior probability of being non-null
– posterior mean of $\lambda$ given non-null
Lewinger et al, GE 2007;31:871-82
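The two ranking quantities can be sketched for a simplified version of the mixture model. The assumptions here are not taken from the cited paper: the observed statistic is treated as normal with unit variance around the noncentrality parameter, the non-null parameter is normal with mean mu and variance sigma2, and pi, mu, and sigma2 are treated as known rather than regressed on prior covariates. The function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior_non_null(stat, pi, mu, sigma2):
    """Posterior probability that a SNP is non-null, and the posterior mean
    of its noncentrality parameter given that it is non-null."""
    f0 = normal_pdf(stat, 0.0, 1.0)           # marginal density under the null
    f1 = normal_pdf(stat, mu, 1.0 + sigma2)   # marginal density under the non-null
    post_prob = pi * f1 / (pi * f1 + (1.0 - pi) * f0)
    shrink = sigma2 / (1.0 + sigma2)          # weight given to the observed statistic
    post_mean = mu + shrink * (stat - mu)
    return post_prob, post_mean

weak = posterior_non_null(1.0, pi=0.01, mu=3.0, sigma2=1.0)
strong = posterior_non_null(4.0, pi=0.01, mu=3.0, sigma2=1.0)
favored = posterior_non_null(1.0, pi=0.10, mu=3.0, sigma2=1.0)
print(weak, strong, favored)
```

In the full model, prior covariates Z would raise or lower pi and mu SNP by SNP, which is how external knowledge changes the ranking; the `favored` call mimics that by raising pi for the same statistic.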
[Figure: power (0 to 0.2) vs. Type I error (0.01 to 0.10) in two panels, b1 = 0.693 and b1 = 1.1. Curves: E m and P m, each with covariates in means model only, covariates in probability model only, and covariates in both]
Lewinger et al, Gen Epi 2007; 31:871-82
Some Methodological Issues in GWAS: The ENDGAME Consortium
• Multistage study designs
• Choice of platform for first stage
• Multiple comparisons
• Prioritizing SNPs for second stage
• Haplotype analyses using tag SNPs: unifying association and sharing
• GxE and GxG interactions
• Control of population stratification
Thomas et al, AJHG 2005:77:337-45
Sample Sizes Needed for GxE
• Required # case-control pairs, $\alpha$ = 0.05 / $\alpha$ = 1x10^-7 (assuming we are testing the causal locus)

Intxn effect   Exposure      Variant G prevalence
OR_GxE         prevalence    0.05              0.40
2.0            0.1           6,238 / 12,110    1,364 / 2,748
               0.5           2,515 / 4,946     547 / 1,325
5.0            0.1           1,001 / 1,293     245 / 386
               0.5           459 / 657         113 / 320
Minimum Detectable Effect Sizes

p(G)    OR_G main effect    OR_GxE interaction
                            p(E) = 0.1    p(E) = 0.4
0.05    2.05                8.6           4.3
0.10    1.77                5.4           3.2
0.20    1.68                4.3           2.7

$\alpha$ = 1x10^-7, $1 - \beta$ = 0.80, N = 1000 cases, 2000 controls
Case-Only Design for GxE

             Cases                      Controls
Exposure     Non-carrier   Carrier      Non-carrier   Carrier
Unexposed    a             b            A             B
Exposed      c             d            C             D
(columns give genotype)

OR_GxE estimators:
– Case-control: (ad/bc) / (AD/BC)
– Case-only: ad/bc, assuming no G-E association in controls (Umbach et al, Stat Med 1994;13:153-62)
• Smaller variance (more power) than case-control test
• Can’t test this assumption in controls, then decide whether to do case-only or case-control (Albert et al, AJE 2001:154:687-93)
• But can combine case-only and case-control estimators (Mukherjee et al, GE 2008;32:615-26; Li & Conti, AJE in press)
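The two estimators and the reason for the variance difference can be checked on a pair of 2x2 tables. The counts below are invented for illustration, the function names are assumptions, and the variances are the usual large-sample (Woolf) variances of a log odds ratio.

```python
def odds_ratio(n00, n01, n10, n11):
    """Cross-product ratio (n00 * n11) / (n01 * n10) of a 2x2 table."""
    return (n00 * n11) / (n01 * n10)

def gxe_estimators(a, b, c, d, A, B, C, D):
    """Case-only and case-control estimates of OR_GxE with log-scale variances."""
    case_only = odds_ratio(a, b, c, d)
    control_or = odds_ratio(A, B, C, D)   # G-E association among controls
    case_control = case_only / control_or
    var_case_only = 1 / a + 1 / b + 1 / c + 1 / d
    # The case-control estimator also carries the sampling error of the control table
    var_case_control = var_case_only + 1 / A + 1 / B + 1 / C + 1 / D
    return case_only, case_control, var_case_only, var_case_control

# Hypothetical counts: G and E are exactly independent among the controls
co, cc, v_co, v_cc = gxe_estimators(a=400, b=100, c=300, d=200,
                                    A=480, B=120, C=320, D=80)
print(co, cc, v_co, v_cc)
```

When G and E are independent in controls, both estimators target the same quantity, but the case-only variance omits the four control-cell terms, which is the source of its extra power.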
Case-control vs. Case-only Design
N for 80% power ($\alpha$ = .05): case-control / case-only

Intxn effect   Exposure      Variant G prevalence
OR_GxE         prevalence    0.05             0.40
2.0            0.1           6,238 / 2,498    1,364 / 567
               0.5           2,515 / 1,020    547 / 273
5.0            0.1           1,001 / 267      245 / 80
               0.5           459 / 136        113 / 66
Two-Stage Approach to GxE
• Step 1: Screen genome-wide to find SNPs most likely to be involved in a GxE interaction by testing for G-E association in combined case and control sample
• Step 2: Only test these ‘likely’ SNPs using the standard 1-df case-control interaction test
Murcray et al. Am J Epidemiol 2009;169:219-26
GWAS Test for GxE Interaction: Power for 2-step vs. 1-step method
[Figure: power (0 to 1) vs. interaction effect size Rge (1 to 5) for the 1-step analysis and the 2-step analysis]
Murcray et al., AJE 2009:169:219-26
Using Hierarchical Models to Incorporate Pathways into GWAS
• Two approaches to unification
– Use GWAS to “discover” pathways
– Use pathways to inform GWAS
• Approach 1: Bayesian network analysis, gene set enrichment analysis, or other exploratory methods
– Subramanian et al, PNAS 2005; 102: 15545-50
• Approach 2: Treat pathway indicators as prior covariates
– Wang et al, Am J Hum Genet 2007;81:1278-83
Post-GWAS: Resequencing Designs
Table headings as extracted: Marker (M), Disease, Causal allele (Y), Pr(D=1|M,Y), D=0, D=1

Positive marker association

Positive LD and positive causal association (=0.036, RRYD=2, RRYM=1.22)
M   Controls   0.796   0.004   0.005
    Cases      0.758   0.008   0.010
M   Controls   0.154   0.046   0.230
    Cases      0.147   0.088   0.374

Negative LD and negative causal association (=.010, RRYD=0, RRYM=1.067)
M   Controls   0.750   0.050   0.063
    Cases      0.789   0.000   0.000
M   Controls   0.200   0.000   0.000
    Cases      0.211   0.000   0.000

Negative marker association

Negative LD and positive causal association (=0.010, RRYD=3, RRYM=0.889)
M   Controls   0.750   0.050   0.063
    Cases      0.682   0.136   0.136
M   Controls   0.200   0.000   0.000
    Cases      0.186   0.000   0.000

Positive LD and negative causal association (=0.036, RRYD=0.5, RRYM=0.887)
M   Controls   0.796   0.004   0.005
    Cases      0.816   0.002   0.003
M   Controls   0.200   0.046   0.230
    Cases      0.211   0.024   0.130

Analysis strategy
Combines full sequencing information on a stratified subset with SNP data on main study
Thomas et al, GE 2007;27:401-4
Thomas et al, Statist Sci 2009, in press
Statistical Challenges
• Exposure assessment and modeling
• GxE and GxG interactions
• Pathways
– Hierarchical modeling strategy
– Mechanistic models
• GWAS
• Collaborations
Statistical Issues in Collaborations
• Combining population-based, family-based, and pedigree studies
• Meta-analysis or mega-analysis?
• Data harmonization
– Phenotypes
– Genotypes
– Exposures and other risk factors
• Allowing for and understanding heterogeneity
– Fixed vs random effects models
– Meta-regression
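The fixed vs. random effects choice can be illustrated with an inverse-variance pooled estimate and a moment-based (DerSimonian-Laird) estimate of between-study variance. The slide does not name a particular estimator, so the choice here, the function name `meta_analysis`, and the study values are assumptions.

```python
import numpy as np

def meta_analysis(estimates, variances):
    """Return (fixed-effect estimate, random-effects estimate, tau2),
    where tau2 is the DerSimonian-Laird between-study variance."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    fixed = (w * y).sum() / w.sum()
    # Cochran's Q measures heterogeneity around the fixed-effect estimate
    Q = (w * (y - fixed) ** 2).sum()
    c = w.sum() - (w ** 2).sum() / w.sum()
    tau2 = max(0.0, (Q - (y.size - 1)) / c)
    w_re = 1.0 / (v + tau2)
    random = (w_re * y).sum() / w_re.sum()
    return fixed, random, tau2

# Hypothetical log odds ratios from three studies
homogeneous = meta_analysis([0.5, 0.5, 0.5], [0.01, 0.02, 0.04])
heterogeneous = meta_analysis([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
print(homogeneous, heterogeneous)
```

With no heterogeneity the two estimates coincide and tau2 is zero; with heterogeneous studies the random-effects weights become more equal and the pooled estimate has a wider interval.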
Conclusions
• Costs have now become feasible: many such studies now being undertaken
• Results of first publications very promising
• Efficient design and analysis strategies are essential
• Rich area for statistical research
• “Agnostic” genomewide scans and pathway-driven multigenic modeling are complementary
Acknowledgments
Epidemiology: John Peters, Frank Gilliland, Rob McConnell, Nino Kuenzli, Stephanie London
Biostatistics: Jim Gauderman, Kiros Berhane, Mike Jerrett, Bryan Langholz, David Conti, Dan Stram, Bill Navidi
Field Work & Exposure Assessment: Ed Avol, Fred Lurman, Field team (many!)
Data Management & Analysis: Ed Rappaport, Hita Vora, Josh Millstein, Yu-Fen Li, Talat Islam, John & Jassy Molitor
Genetics: Louis Dubeau
Respiratory Medicine: Bill Linn
Funding:
California Air Resources Board
Helene Margolis
National Institute of Environmental Health Sciences
National Heart, Lung & Blood Institute
Health Effects Institute