day2 145pm crawford
TRANSCRIPT
Association Analysis
University of Louisville
Center for Genetics and Molecular Medicine
January 11, 2008
Dana Crawford, PhD
Vanderbilt University
Center for Human Genetics Research
Association Analysis Outline
• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function
Study Design
Does your trait or phenotype have a genetic component?
• Segregation analysis
• Recurrence risks
• Heritability
• Other sources of evidence for a genetic component
Classic Segregation Analysis
• Determines if a major gene is involved
• Compares data to Mendelian models, such as
– Autosomal dominant
– Autosomal recessive
– X-linked
• Results can be used as parameters for linkage analysis (e.g. parametric LOD)
• Subject to ascertainment bias
Note: More complex methods needed for complex traits
Recurrence Risks
The chance that a disease present in the family will recur in that family
“Lightning striking twice”
If recurrence risk is greater in the family compared with unrelated individuals,
the disease has a “genetic” component
Suggests familial aggregation
Recurrence Risks
Measured using the risk ratio (λ)
Sibling risk ratio = λs
λs = sibling recurrence risk / population prevalence
Cystic fibrosis λs = (0.25/0.0004) = 625
Huntington disease λs = (0.50/0.0001) = 5000
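The ratio above is a single division, so a minimal Python sketch can check the arithmetic. The function name is an illustrative assumption; the inputs are the slide's own sibling risks and prevalences.

```python
def sibling_risk_ratio(sibling_recurrence_risk: float, population_prevalence: float) -> float:
    """Return lambda_s = sibling recurrence risk / population prevalence."""
    if population_prevalence <= 0:
        raise ValueError("population prevalence must be positive")
    return sibling_recurrence_risk / population_prevalence


# Inputs taken from the slide (recurrence risk, population prevalence).
examples = {
    "Cystic fibrosis": (0.25, 0.0004),
    "Huntington disease": (0.50, 0.0001),
}
for disease, (risk, prevalence) in examples.items():
    print(f"{disease}: lambda_s = {sibling_risk_ratio(risk, prevalence):.0f}")
```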
Recurrence Risks: Complex traits
λ here is for first degree relative
Merikangas and Risch (2003) Science 302:599-601.
Heritability
Think “twin studies”
The proportion of phenotypic variation in a population attributable to genetic variation
Quantitative traits
Heritability measured as $h^2$
(Can also be family studies)
Heritability and Quantitative Traits
Determined by genes and environment
Example: Height
[Figure: height in boys and girls, by group (Mexican Americans, Blacks, Whites); NHANES 1971-1974 versus NHANES 1999-2002]
Freedman et al (2006) Obesity 14:301-308
Heritability and Quantitative Traits
Trait variation = genetic + environment
$\sigma_T^2 = \sigma_G^2 + \sigma_E^2$
Genetic variation = additive + dominant
$\sigma_G^2 = \sigma_a^2 + \sigma_d^2$
Environmental variation = familial/household + random/individual
$\sigma_E^2 = \sigma_f^2 + \sigma_e^2$
Broad sense heritability: $h_B^2 = \sigma_G^2 / \sigma_T^2$
Narrow sense heritability: $h_N^2 = \sigma_a^2 / \sigma_T^2$
Heritability and Twins Studies
$h^2 = 2(r_{MZ} - r_{DZ})$,
where r is the correlation coefficient of the trait within twin pairs
Monozygotic twins share the same genetic material (~100%)
Dizygotic twins share on average half of their genetic material (~50%)
Heritability and Twins Studies
Trait             r(MZ)   r(DZ)   Reference
Cholesterol       0.76    0.39    Fenger et al
SBP               0.60    0.32    Evans et al
BMI               0.67    0.32    Schousboe et al
Perceived pitch   0.67    0.44    Drayna et al
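As a quick check of the twin formula, the sketch below applies h2 = 2(rMZ - rDZ) to the correlations listed in the table. The function and variable names are illustrative assumptions; the resulting values are simple arithmetic on the slide's numbers, not estimates reported by the cited studies.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Twin-based heritability estimate: h2 = 2 * (r_MZ - r_DZ)."""
    return 2.0 * (r_mz - r_dz)


# (r_MZ, r_DZ) pairs copied from the slide's table.
twin_correlations = {
    "Cholesterol": (0.76, 0.39),
    "SBP": (0.60, 0.32),
    "BMI": (0.67, 0.32),
    "Perceived pitch": (0.67, 0.44),
}
for trait, (r_mz, r_dz) in twin_correlations.items():
    print(f"{trait}: h2 = {falconer_heritability(r_mz, r_dz):.2f}")
```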
Heritability: Is everything genetic?
Trait           r(MZ)   r(DZ)   Reference
Vote choice     0.81    0.69    Hatemi et al
Religiousness   0.62    0.42    Koenig et al
Other Evidence For A Genetic Component
Monogenic disorders
Example: Phenotype of interest is sensitivity to warfarin dosing, but there are no heritability estimates
Solution: Rare, familial disorder of warfarin resistance
Other Evidence For A Genetic Component
Case Reports
Example: Phenotype of interest is susceptibility to Neisseria meningitidis (prevalence: 1/100,000)
Solution: Case report of recurrent N. meningitidis in patient
Other Evidence For A Genetic Component
• Animal models
• Biochemistry or biological pathways
• Expression data
• Previous genetic association studies
Other good arguments…
Study Design
How well can you diagnose the disease or measure the trait?
• Narrow definitions better than all-inclusive definitions
– There are many paths that lead to the same phenotype
• Avoid misclassification and measurement error
– Direct measurement versus recall/survey data or indirect proxies
• Be aware of age of onset
– Can your control become a case over time?
Arguably most important step in study design
Target Phenotypes
Disease or Quantitative trait?
Carlson et al. (2004) Nature 429:446-452
[Figure: diagram relating MI, CRP, LDL-C, IL6, LDLR, acute illness, and diet]
Note: SNPs associated with quantitative traits may not be associated with clinical endpoint
Study Design
How many cases and controls will you need to detect an association?
Statistical Power
• Null hypothesis: all alleles are equal risk
• Given that a risk allele exists, how likely is a study to reject the null?
• Study sample size ideally determined before you begin to recruit and genotype
• Statistical significance
– Significance = p(false positive)
– Traditional threshold 5%
• Statistical power
– Power = 1 - p(false negative)
– Traditional threshold 80%
• Traditional thresholds balance confidence in results against reasonable sample size
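To make the trade-off between thresholds and sample size concrete, here is a minimal sketch of a sample-size calculation for a two-sided test comparing allele frequencies in cases and controls. It uses the standard normal-approximation formula for two proportions and assumes equal numbers of cases and controls with alleles treated as independent. This is not the method implemented in Quanto or the Genetic Power Calculator; the function names and the example inputs are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, allelic_odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by the control frequency and allelic OR."""
    odds = allelic_odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def cases_needed(control_freq: float, allelic_odds_ratio: float,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of cases (with an equal number of controls)."""
    if allelic_odds_ratio == 1.0:
        raise ValueError("no effect to detect when the odds ratio is 1")
    p1 = case_allele_freq(control_freq, allelic_odds_ratio)
    p2 = control_freq
    p_bar = (p1 + p2) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    alleles_per_group = (
        z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    ) ** 2 / (p1 - p2) ** 2
    return ceil(alleles_per_group / 2.0)  # each person contributes two alleles


# Hypothetical inputs: risk-allele frequency 20% in controls, allelic OR 1.5.
print(cases_needed(0.20, 1.5))              # traditional 5% threshold, 1 SNP tested
print(cases_needed(0.20, 1.5, alpha=5e-8))  # stricter genome-wide threshold
```

Tightening the significance threshold or shrinking the effect size both raise the required sample size, which is why the thresholds are ideally fixed before recruitment.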
Study Design
What are the thresholds/variables in a general power calculation?
Note: Significance threshold for 1 SNP tested
Study Design
Power Calculation Resources
• Quanto (hydra.usc.edu/gxe/)
– Supports quantitative, discrete traits (unrelated and family based)
• Genetic Power Calculator (pngu.mgh.harvard.edu/~purcell/gpc/)
– Supports discrete traits, variance components, quantitative traits for linkage and association studies
(List of other software: linkage.rockefeller.edu/soft/)
Study Design
How can you maximize power for your study?
• Large sample size
– Better estimate of variability or risk
– Chance of misclassification / measurement error
• Large genetic effect size
– SNP risk allele with large odds ratio or explains a lot of trait variance
– This is unknown at beginning of study
• Risk SNP is common
– This is unknown at beginning of study
– Calculate power for a range of common MAFs (5-45%)
• Genotype the risk SNP directly
– Risk SNP is unknown at beginning of study
– Remember tagSNPs are imperfect proxies
– Adjust sample size by $1/r^2$
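The 1/r2 adjustment can be sketched in a few lines: a sample size computed for the risk SNP itself is inflated when only a correlated tagSNP is genotyped. The function name and the example numbers are illustrative assumptions.

```python
from math import ceil


def tag_adjusted_sample_size(n_direct: int, r_squared: float) -> int:
    """Inflate a sample size computed for the risk SNP by 1/r^2 for a tagSNP proxy."""
    if not 0.0 < r_squared <= 1.0:
        raise ValueError("r^2 must be in (0, 1]")
    return ceil(n_direct / r_squared)


# Hypothetical: 1000 cases needed if the risk SNP were genotyped directly.
for r2 in (1.0, 0.8, 0.5):
    print(f"r^2 = {r2}: {tag_adjusted_sample_size(1000, r2)} cases")
```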
Study Design
Power calculation example:
Cases: Adverse reaction (wheezing) to flu vaccination
Controls: Vaccinated children with no adverse reactions
[Figure: sample size (cases, 0 to 160) versus genotype relative risk (2 to 6, additive model), one curve per MAF (0.05, 0.1, 0.15, 0.2, 0.25). Calculated using Quanto 1.1.1]
Study Design
Power calculation example: Immunogenicity to influenza A (H5N1) vaccine
[Figure: sample size (0 to 900) versus $R^2$ (0.01 to 0.49), additive model. Calculated using Quanto 1.1.1]
Study Design
Why are you considering an association study instead of linkage?
• Linkage analysis is powerful for disorders with
– Discernable pattern of inheritance
– Rare alleles w/ large genetic effect sizes
– High penetrance
• Not powerful for disorders that
– have complex pattern of inheritance
– are common
– have many risk alleles with small effect sizes
– have low penetrance
Common variant/common disease hypothesis
• Common genetic variants confer susceptibility
• Risk-conferring alleles ancient; common across most populations
• Risk-conferring allele has small effect
• Multiple risk alleles expected for common disease; also environment
Study Design
Should you design a candidate gene or whole genome study?
• Candidate gene association study
– Interrogate specific genes or regions
– Based on previous knowledge or biological plausibility
– Hypothesis testing
• Whole genome association study
– Interrogate the “entire” genome
– No previous knowledge required
– Hypothesis generation
Candidate gene association studies
• Choose gene based on previous knowledge
– Gene function
– Biological pathway
– Previous linkage or association study
• Choose DNA variations for genotyping
– Direct association approach
– Indirect association approach
Direct Candidate Gene Association Study
Genotype “functional” SNPs
Collins et al (1997) Science 278:1580-1581
Example: Nonsynonymous SNPs
Direct Candidate Gene Association Study
Botstein and Risch (2003) Nat Genet 33 Suppl:228-37.
Problem: We don’t know what is functional and what is not functional
Direct Candidate Gene Association Study
What would we miss?
Functional synonymous SNPs in MDR1 alter P-glycoprotein activity
Komar (2007) Science 315:466-467
Direct Candidate Gene Association Study
What would we miss?
• 99% of the human genome is non-coding
• Non-coding SNPs or DNA variations in
– Introns
– Intergenic regulatory regions
Indirect Candidate Gene Association Study
• Genotype a fraction of all SNPs regardless of “function”
• Rely on SNP-SNP correlations (linkage disequilibrium) to capture information for SNPs not genotyped
Kruglyak (2005) Nat Genet 37:1299-1300
Indirect Candidate Gene Association Study
Linkage disequilibrium (LD)
Measured by $r^2$
$r^2 = \frac{[f(A_1B_1) - f(A_1)f(B_1)]^2}{f(A_1)f(A_2)f(B_1)f(B_2)}$
$r^2 = 0$: SNPs are independent
$r^2 = 1$: SNPs are perfectly correlated AND have the same minor allele frequency
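A minimal sketch of the r2 calculation above, taking the A1B1 haplotype frequency and the two allele frequencies as inputs. The function name and the example frequencies are illustrative assumptions.

```python
def ld_r_squared(f_a1b1: float, f_a1: float, f_b1: float) -> float:
    """r^2 = [f(A1B1) - f(A1)f(B1)]^2 / [f(A1)f(A2)f(B1)f(B2)]."""
    f_a2 = 1.0 - f_a1
    f_b2 = 1.0 - f_b1
    denominator = f_a1 * f_a2 * f_b1 * f_b2
    if denominator == 0.0:
        raise ValueError("both SNPs must be polymorphic")
    d = f_a1b1 - f_a1 * f_b1  # disequilibrium coefficient D
    return d * d / denominator


# Independent SNPs: the haplotype frequency equals the product of allele frequencies.
print(ld_r_squared(0.25, 0.5, 0.5))
# Perfectly correlated SNPs with the same allele frequency.
print(ld_r_squared(0.30, 0.3, 0.3))
```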
Indirect Candidate Gene Association Study
Using LD to pick “tagSNPs”
CRP, European-descent: 10 SNPs >5% MAF
CRP, European-descent: 4 tagSNPs ($r^2 > 0.80$)
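One common way to pick tagSNPs from pairwise r2 values is a greedy pass: repeatedly choose the SNP that captures the most not-yet-tagged SNPs at the chosen threshold. The sketch below illustrates that idea only; it is not necessarily the procedure used for the CRP example, and the SNP names and r2 values are hypothetical.

```python
def greedy_tag_snps(snps, pairwise_r2, threshold=0.80):
    """Greedy tagSNP selection.

    snps: list of SNP names.
    pairwise_r2: dict mapping (snp_a, snp_b) tuples to r^2; missing pairs count as 0.
    Candidates are restricted to SNPs that are not yet tagged (a simplification).
    """
    def r2(a, b):
        if a == b:
            return 1.0
        return pairwise_r2.get((a, b), pairwise_r2.get((b, a), 0.0))

    untagged = set(snps)
    tags = []
    while untagged:
        # Pick the SNP capturing the most untagged SNPs; sorting makes ties deterministic.
        best = max(sorted(untagged),
                   key=lambda s: sum(r2(s, t) >= threshold for t in untagged))
        tags.append(best)
        untagged -= {t for t in untagged if r2(best, t) >= threshold}
    return tags


# Hypothetical r^2 values for four SNPs.
example_r2 = {("snp1", "snp2"): 0.95, ("snp2", "snp3"): 0.85, ("snp1", "snp3"): 0.60}
print(greedy_tag_snps(["snp1", "snp2", "snp3", "snp4"], example_r2))
```

Because r2 depends on allele and haplotype frequencies, the same procedure run on another population's data can return a different tag set.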
Indirect Candidate Gene Association Study
“tagSNPs” are population specific
CRP, European-descent: 4 tagSNPs
CRP, African-descent: 10 tagSNPs
Indirect Candidate Gene Association Study
• “tagSNPs” are population specific
• Merge sets for “cosmopolitan” set
http://gvs.gs.washington.edu/GVS/
Indirect Candidate Gene Association Study
Multiple testing
• Testing many SNPs for association with disease status
• No consensus on correcting p-value
– Bonferroni
– False Discovery Rate
• Need to replicate findings in independent study
Indirect Candidate Gene Association Study: Pros and Cons
• Can interrogate all common SNPs in gene
• SNPs must be known and genotypes available to calculate LD and pick tagSNPs
• Multiple testing within a gene
• Limited to previous knowledge
Whole Genome Association Study
• Can now genotype 100K – 1 million SNPs
• Coverage depends on platform and chip
– tagSNPs capturing HapMap common SNPs
– Genic SNPs overrepresented
– Conserved non-coding SNPs represented
– Evenly spaced across genome
Illumina Infinium assay Affymetrix GeneChips
Whole Genome Association Study
• Same study design and challenges as candidate gene
– Mostly case-control (retrospective)
– Multiple testing
• Data storage and higher-order interaction testing issues
• Hypothesis generation tool (replication)
Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)
Case/Control Study Designs: Pros and Cons
For either candidate gene or whole genome
Study          Pros                Cons
Case/Control   Easier to collect   Subject to bias
               Less expensive      No risk estimates
Prospective    Risk estimates      Harder to collect
                                   More expensive
                                   Subject to bias
For rare outcomes, case/control design may be only option
Case/Control Study Designs: Pros and Cons
Types of bias
• Bias in selection of cases
– Those that are currently living
– Miss fatal or short episodes of disease
– Might miss mild diseases
– Referral/admission bias
• Non-response bias
• Exposure suspicion bias
• Family information bias
• Recall bias
Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)
Often ignored in genetic association studies
Analysis Methods
Genotype QC
• Test for departures from Hardy-Weinberg Equilibrium
• Test for gender inconsistencies
• Eliminate very rare SNPs (no power)
• Eliminate SNPs with low genotyping efficiency
• Eliminate samples with low genotyping efficiency
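As one illustration of the first QC step, the sketch below runs a 1-degree-of-freedom chi-square test for departure from Hardy-Weinberg Equilibrium on genotype counts for a single SNP. This is a textbook version; exact tests are often preferred for rare alleles. The function name and the counts are illustrative assumptions.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Return (chi-square, p-value) for departure from HWE, 1 degree of freedom."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Hypothetical genotype counts: one SNP in HWE, one with no heterozygotes at all.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```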
Analysis Methods
What statistical methods do you use to analyze your data?
• SNP by SNP (borrowed from epidemiology)
– Chi-square and Fisher’s exact (2x2 table, 2x3 table)
– Logistic and linear regression
– Covariates
• Haplotypes
– Haplo.stats and regression
• Interactions
– Traditional regression
– MDR (Ritchie et al)
Analysis Methods
               Case   Control
Minor allele   A      B
Major allele   C      D
Odds ratio (OR) = ratio of odds of minor allele in Cases (A/C) and Controls (B/D)
OR = (A*D)/(B*C)
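A minimal sketch of the odds ratio from the 2x2 table above, with a Woolf (log-scale) confidence interval added. The interval method is an assumption for illustration and is not specified in the slides; the function name and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, confidence: float = 0.95):
    """OR = (A*D)/(B*C) with a Woolf confidence interval.

    a, b: minor allele counts in cases and controls.
    c, d: major allele counts in cases and controls.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: A=20, B=10, C=80, D=90.
print(odds_ratio_with_ci(20, 10, 80, 90))
```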
The Case/Control Study
For genotypes, set homozygous for major allele (A) as “referent” genotype, and calculate 2 odds ratios:
Heterozygous versus referent:
      Case   Control
Aa    A      B
AA    C      D
Homozygous minor versus referent:
      Case   Control
aa    A      B
AA    C      D
Analysis Methods
Analysis Methods
Case/control: Interpretation of Odds Ratio
1.0 – Referent
>1.0 – Greater odds of disease compared with controls
<1.0 – Lesser odds of disease compared with controls
Confidence Intervals: probably contain true OR
OR does not measure risk*
Prospective cohort
• Disease free at beginning of study
• Followed over time for disease (“incident”)
• Follow “exposed” and “unexposed” groups
• Gold-standard study design
Analysis Methods
Analysis Methods
Prospective cohort
             Case   Control   Total
Exposed      A      B         (A+B)
Unexposed    C      D         (C+D)
Risk Ratio (RR) = incidence of disease in exposed, A/(A+B), divided by incidence of disease in unexposed, C/(C+D)
Prospective Study: Interpretation of Risk Ratio
1.0 – Referent
>1.0 – Risk for disease increases
<1.0 – Risk for disease decreases
Confidence Intervals: probably contain true RR
*For rare diseases, OR ~ RR
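The rare-disease approximation can be seen numerically. The sketch below computes both measures from the same cohort-style table (A, B = exposed cases and non-cases; C, D = unexposed cases and non-cases), once for a rare outcome and once for a common one. The function names and counts are hypothetical.

```python
def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """RR = [A/(A+B)] / [C/(C+D)]."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (A*D)/(B*C)."""
    return (a * d) / (b * c)


# Rare disease: 0.2% of exposed and 0.1% of unexposed become cases.
rare = (20, 9980, 10, 9990)
# Common disease: 40% of exposed and 20% of unexposed become cases.
common = (400, 600, 200, 800)
for label, table in (("rare", rare), ("common", common)):
    print(f"{label}: RR = {risk_ratio(*table):.3f}, OR = {odds_ratio(*table):.3f}")
```

With these counts the two measures nearly coincide for the rare outcome and diverge for the common one, with the OR further from 1.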
Analysis Methods
Case/control: Matching
Age Gender Race
Warning: Can “over match” and miss describing an interesting factor
Bad Example:
Cases: Adults with heart disease
Controls: Newborns without heart disease
Analysis Methods
Case/control: Stratifying
Age Gender Race
Warning: Need sufficient sample size to stratify or split the data into males and females
Ex. Cases with heart disease
Age-matched controls without heart disease
(Exposure: smoking status)
Stratify for Gender Specific Risks
Analysis Methods
Problems in Case/Control genetic association studies –
• “Confounding” by race or ancestry
• AKA population stratification
• Solutions:
– Match
– Stratify
– Adjust (using genetic markers)
– “Trios”
Cardon and Palmer (2003) Lancet 361:598-604
Analysis Methods
• Given
– Height as “target” or “dependent” variable
– Sex as “explanatory” or “independent” variable
• Fit regression model
$\mathrm{height} = \beta \cdot \mathrm{sex} + \varepsilon$
Analysis Methods
Regression
Analysis Methods
• Given
– Quantitative “target” or “dependent” variable y
– Quantitative or binary “explanatory” or “independent” variables xi
• Fit regression model
$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$
Regression
• Works best for normal y and x
• Can include covariates
• Fit regression model
$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$
• Estimate errors on $\beta$’s
• Use t-statistic to evaluate significance of $\beta$’s
• Use F-statistic to evaluate model overall
• Use $R^2$ to evaluate variance explained by model
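For a single explanatory variable, the quantities listed above can be computed by hand. The sketch below fits y on one x by ordinary least squares and reports the slope, its standard error, the t-statistic, and R2. It adds an intercept term, which the slide's model does not show explicitly, and the height and sex values are hypothetical.

```python
from math import sqrt


def simple_linear_regression(x, y):
    """Ordinary least squares for one predictor, with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    sse = sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    se_beta = sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return {
        "beta": beta,
        "intercept": intercept,
        "se_beta": se_beta,
        "t": beta / se_beta if se_beta > 0 else float("inf"),
        "r_squared": 1.0 - sse / sst,
    }


# Hypothetical data: sex coded 0/1, height in cm.
sex = [0, 0, 1, 1]
height = [160.0, 164.0, 174.0, 178.0]
print(simple_linear_regression(sex, height))
```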
Analysis Methods
Regression
Analysis Methods
Coding Genotypes
Genotype   Dominant   Additive   Recessive
GG         0          0          0
AG         1          1          0
AA         1          2          1
Genotype can be re-coded in any number of ways for regression analysis
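The three codings can be expressed as lookup tables. In this sketch A is taken to be the allele being counted, which matches the GG/AG/AA rows of the slide; the function and dictionary names are illustrative assumptions.

```python
# Numeric codes per genetic model, with A as the counted allele.
GENOTYPE_CODES = {
    "dominant": {"GG": 0, "AG": 1, "AA": 1},
    "additive": {"GG": 0, "AG": 1, "AA": 2},
    "recessive": {"GG": 0, "AG": 0, "AA": 1},
}


def code_genotypes(genotypes, model="additive"):
    """Convert genotype strings to numeric codes for regression."""
    codes = GENOTYPE_CODES[model]
    # Sort the two alleles so that "GA" and "AG" are treated identically.
    return [codes["".join(sorted(g))] for g in genotypes]


samples = ["GG", "GA", "AG", "AA"]
for model in ("dominant", "additive", "recessive"):
    print(model, code_genotypes(samples, model))
```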
Example of gene-environment interaction and traditional regression
Analysis Methods
Statistical Packages for Genetic Association Studies
• Candidate gene association study
– SAS/Genetics
– STATA
– SPSS
– R
– PLINK
• Whole genome association study
– R
– PLINK
Analysis Methods
Whole genome in PLINK (pngu.mgh.harvard.edu/~purcell/plink/)
• Can adjust for population stratification
• Can add covariates
[Figure: genome-wide association results, including a panel with the MHC removed; annotations P<1x10^-100 and P<2x10^-11; genome-wide significance at P = 5x10^-8]
Plenge et al 2007 NEJM
SNPs versus Haplotypes
• There is no right answer: explore both
• The only thing that matters is the correlation between the assayed variable and the causal variable
• Sometimes the best assayed variable is a SNP, sometimes a haplotype
SNPs versus Haplotypes
Statistical Packages for Genetic Association Studies with haplotypes
• Haplo.stats (haplotype regression)
– Lake et al, Hum Hered. 2003;55(1):56-65.
• PHASE (case/control haplotype)
– Stephens et al, Am J Hum Genet. 2005 Mar;76(3):449-62.
• Haploview (case/control SNP analysis)
– Barrett et al, Bioinformatics. 2005 Jan 15;21(2):263-5.
• SNPHAP (haplotype regression?)
– Sham et al, Behav Genet. 2004 Mar;34(2):207-14.
Analysis Methods
Multiple testing
• Bonferroni correction
– Too conservative b/c each SNP tested may not be independent (LD)
– How many independent tests did you do?
– See Conneely and Boehnke AJHG (in press)
• False Discovery Rate
– Also has arbitrary threshold
• Best bet is replication
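The two corrections named above can be sketched side by side. The false discovery rate procedure shown is Benjamini-Hochberg, which is one common choice and an assumption here since the slides do not name a specific method; the p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests with p <= alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_significant(p_values, q=0.05):
    """Flag tests passing the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            max_rank = rank  # largest rank whose p-value is under its threshold
    significant = [False] * m
    for index in order[:max_rank]:
        significant[index] = True
    return significant


# Hypothetical p-values for five SNPs.
p_values = [0.001, 0.012, 0.025, 0.045, 0.5]
print(bonferroni_significant(p_values))
print(benjamini_hochberg_significant(p_values))
```

For scale, a Bonferroni threshold of 0.05 spread over one million tests is 5x10^-8. Neither correction replaces replication in an independent study.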
Statistical Replication
[Figure: change in ln(CRP) per copy relative to H2 (0 to 0.6) for haplotypes H2, H5, H6, H7, H8, in Black, Mexican-American, and White participants]
Carlson et al. AJHG 2005;77:64-77
Results Consistent with CARDIA
CRP SNPs and CRP levels in NHANES III
Crawford et al Circulation 2006; 114:2458-2465
• Statistical replication is not always possible
• Association may imply mechanism
• Test for mechanism at the bench
– Is predicted effect in the right direction?
– Dissect haplotype effects to define functional SNPs
Functional Replication
Functional Replication
CRP Evolutionary Conservation
• TATA box: 1697
• Transcript start: 1741
• CRP Promoter region (bp 1444-1650) >75% conserved in mouse
Functional Replication
Low CRP Levels Associated with H1-4
• USF1 (Upstream Stimulating Factor)
– Polymorphism at 1440 alters USF1 binding site
1420 1430 1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5-6  gcagctacCACGTGcacccagatggcCACTTGtt
High CRP Levels Associated with H6
• USF1 (Upstream Stimulating Factor)
– Polymorphism at 1421 alters another USF1 binding site
1420 1430 1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5    gcagctacCACGTGcacccagatggcCACTTGtt
H6    gcagctacCACATGcacccagatggcCACTTGtt
Functional Replication
CRP Promoter Luciferase Assay
[Figure: fold change over H1-3 (0.0 to 4.0) for constructs H1-3, H4, H5, H6, H7-8, empty, SV40p]
Carlson et al, AJHG v77 p64
Functional Replication
Association Analysis Outline
• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function