day2 145pm crawford

Association Analysis. University of Louisville, Center for Genetics and Molecular Medicine. January 11, 2008. Dana Crawford, PhD, Vanderbilt University, Center for Human Genetics Research.


Page 1: Day2 145pm Crawford

Association Analysis

University of Louisville
Center for Genetics and Molecular Medicine

January 11, 2008

Dana Crawford, PhD
Vanderbilt University
Center for Human Genetics Research

Page 2: Day2 145pm Crawford

Association Analysis Outline

• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function

Page 3: Day2 145pm Crawford

Study Design

Does your trait or phenotype have a genetic component?

• Segregation analysis

• Recurrence risks

• Heritability

• Other sources of evidence for a genetic component

Page 4: Day2 145pm Crawford

Classic Segregation Analysis

• Determines if a major gene is involved

• Compares data to Mendelian models, such as
  – Autosomal dominant
  – Autosomal recessive
  – X-linked

• Results can be used as parameters for linkage analysis (e.g. parametric LOD)

• Subject to ascertainment bias

Note: More complex methods needed for complex traits

Page 5: Day2 145pm Crawford

Recurrence Risks

The chance that a disease present in the family will recur in that family

“Lightning striking twice”

If the recurrence risk is greater within the family than among unrelated individuals, the disease has a “genetic” component

Suggests familial aggregation

Page 6: Day2 145pm Crawford

Recurrence Risks

Measured using the risk ratio (λ)

Sibling risk ratio = λs

λs = sibling recurrence risk / population prevalence

Cystic fibrosis λs = (0.25/0.0004) = 625 (roughly 500 if the prevalence is taken as 0.0005)

Huntington disease λs = (0.50/0.0001) = 5000
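The ratio above is simple enough to check by hand, but a minimal sketch makes the arithmetic explicit. The function name is hypothetical; the inputs are the recurrence risks and prevalences quoted on this slide. Note that the cystic fibrosis inputs 0.25 and 0.0004 divide to 625.

```python
def sibling_risk_ratio(sibling_recurrence_risk: float, population_prevalence: float) -> float:
    """Return lambda_s = sibling recurrence risk / population prevalence."""
    if population_prevalence <= 0:
        raise ValueError("population prevalence must be positive")
    return sibling_recurrence_risk / population_prevalence


# Inputs taken from the slide; the helper name is illustrative only.
cystic_fibrosis = sibling_risk_ratio(0.25, 0.0004)
huntington = sibling_risk_ratio(0.50, 0.0001)

print(f"Cystic fibrosis lambda_s    = {cystic_fibrosis:.0f}")
print(f"Huntington disease lambda_s = {huntington:.0f}")
```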

Page 7: Day2 145pm Crawford

Recurrence Risks: Complex traits

λ here is for first degree relative

Merikangas and Risch (2003) Science 302:599-601.

Page 8: Day2 145pm Crawford

Heritability

Think “twin studies”

The proportion of phenotypic variation in a population attributable to genetic variation

Quantitative traits

Heritability measured as $h^2$

(Can also be family studies)

Page 9: Day2 145pm Crawford

Heritability and Quantitative Traits

Determined by genes and environment

Example: Height

[Figure: panels for Boys and Girls; groups: Mexican Americans, Blacks, Whites; NHANES 1971-1974 versus NHANES 1999-2002]

Freedman et al (2006) Obesity 14:301-308

Page 10: Day2 145pm Crawford

Heritability and Quantitative Traits

Trait variation = genetic + environment

$\sigma_T^2 = \sigma_G^2 + \sigma_E^2$

Genetic variation = additive + dominant

$\sigma_G^2 = \sigma_a^2 + \sigma_d^2$

Environmental variation = familial/household + random/individual

$\sigma_E^2 = \sigma_f^2 + \sigma_e^2$

Broad Sense heritability

$h_B^2 = \sigma_G^2 / \sigma_T^2$

Narrow Sense heritability

$h_N^2 = \sigma_a^2 / \sigma_T^2$

Page 11: Day2 145pm Crawford

Heritability and Twins Studies

$h^2 = 2(r_{MZ} - r_{DZ})$,

where $r$ is the trait correlation coefficient within twin pairs

Monozygotic twins share the same genetic material (~100%)

Dizygotic twins share half of their genetic material on average (~50%)

Page 12: Day2 145pm Crawford

Heritability and Twins Studies

Trait             r(MZ)   r(DZ)   Reference
Cholesterol       0.76    0.39    Fenger et al
SBP               0.60    0.32    Evans et al
BMI               0.67    0.32    Schousboe et al
Perceived pitch   0.67    0.44    Drayna et al
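As a rough illustration, the twin formula from the previous slide, h2 = 2(rMZ - rDZ), can be applied to the correlations in this table. The function and variable names are hypothetical; only the correlations come from the slide, and the resulting estimates are not reported in the original.

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Twin-based heritability estimate: h^2 = 2 * (r_MZ - r_DZ)."""
    return 2.0 * (r_mz - r_dz)


# (r_MZ, r_DZ) pairs copied from the table above.
twin_correlations = {
    "Cholesterol": (0.76, 0.39),
    "SBP": (0.60, 0.32),
    "BMI": (0.67, 0.32),
    "Perceived pitch": (0.67, 0.44),
}

h2_estimates = {
    trait: falconer_h2(r_mz, r_dz)
    for trait, (r_mz, r_dz) in twin_correlations.items()
}

for trait, h2 in h2_estimates.items():
    print(f"{trait:<16} h2 = {h2:.2f}")
```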

Page 13: Day2 145pm Crawford

Heritability: Is everything genetic?

Trait           r(MZ)   r(DZ)   Reference
Vote choice     0.81    0.69    Hatemi et al
Religiousness   0.62    0.42    Koenig et al

Page 14: Day2 145pm Crawford

Other Evidence For A Genetic Component

Monogenic disorders

Example: Phenotype of interest is sensitivity to warfarin dosing, but there are no heritability estimates

Solution: Rare, familial disorder of warfarin resistance

Page 15: Day2 145pm Crawford

Other Evidence For A Genetic Component

Case Reports

Example: Phenotype of interest is susceptibility to Neisseria meningitidis (prevalence: 1/100,000)

Solution: Case report of recurrent N. meningitidis in a patient

Page 16: Day2 145pm Crawford

Other Evidence For A Genetic Component

• Animal models

• Biochemistry or biological pathways

• Expression data

• Previous genetic association studies

Other good arguments…

Page 17: Day2 145pm Crawford

Study Design
How well can you diagnose the disease or measure the trait?

• Narrow definitions are better than all-inclusive definitions
  There are many paths that lead to the same phenotype

• Avoid misclassification and measurement error
  Direct measurement versus recall/survey data or indirect proxies

• Be aware of age of onset
  Can your control become a case over time?

Arguably the most important step in study design

Page 18: Day2 145pm Crawford

Target Phenotypes
Disease or Quantitative trait?

[Figure: diagram with nodes MI, CRP, LDL-C, IL6, LDLR, Acute Illness, Diet]

Carlson et al. (2004) Nature 429:446-452

Note: SNPs associated with quantitative traits may not be associated with clinical endpoint

Page 19: Day2 145pm Crawford

Study Design

How many cases and controls will you need to detect an association?

Statistical Power

• Null hypothesis: all alleles carry equal risk

• Given that a risk allele exists, how likely is the study to reject the null?

• Study sample size is ideally determined before you begin to recruit and genotype

Page 20: Day2 145pm Crawford

Study Design
What are the thresholds/variables in a general power calculation?

• Statistical significance
  – Significance = p(false positive)
  – Traditional threshold 5%

• Statistical power
  – Power = 1 - p(false negative)
  – Traditional threshold 80%

• Traditional thresholds balance confidence in results against reasonable sample size

Note: Significance threshold for 1 SNP tested
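To show how the two thresholds enter a sample-size calculation, here is a minimal sketch using the textbook normal approximation for comparing two proportions. This is not the method used by Quanto or the Genetic Power Calculator. The function name is hypothetical, and the inputs are assumed to be the proportion of cases and of controls carrying the risk allele.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(p_case: float, p_control: float,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Subjects needed per group for a two-sided test of two proportions.

    Normal-approximation formula; assumes equal numbers of cases and controls.
    """
    if p_case == p_control:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # power threshold
    p_bar = (p_case + p_control) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_case * (1 - p_case) + p_control * (1 - p_control))
    ) ** 2
    return ceil(numerator / (p_case - p_control) ** 2)


# Hypothetical carrier proportions, chosen only for illustration.
n_per_group = cases_needed(p_case=0.30, p_control=0.20)
print(f"Cases (and controls) needed: {n_per_group}")
```

Raising the power or tightening the significance threshold (for example, when many SNPs are tested) increases the required sample size, which is the trade-off the bullets above describe.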

Page 21: Day2 145pm Crawford

Study Design

Power Calculation Resources

• Quanto (hydra.usc.edu/gxe/)
  Supports quantitative, discrete traits (unrelated and family based)

• Genetic Power Calculator (pngu.mgh.harvard.edu/~purcell/gpc/)
  Supports discrete traits, variance components, quantitative traits for linkage and association studies

(List of other software: linkage.rockefeller.edu/soft/)

Page 22: Day2 145pm Crawford

Study Design
How can you maximize power for your study?

• Large sample size
  Better estimate of variability or risk
  Chance of misclassification / measurement error

• Large genetic effect size
  SNP risk allele with large odds ratio or explains a lot of trait variance
  This is unknown at beginning of study

• Risk SNP is common
  This is unknown at beginning of study
  Calculate power for a range of common MAFs (5-45%)

• Genotype the risk SNP directly
  Risk SNP is unknown at beginning of study
  Remember tagSNPs are imperfect proxies
  Adjust sample size by $1/r^2$
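The 1/r2 adjustment in the last bullet can be sketched as follows. The function name and the example numbers are hypothetical; the rule itself (inflate the sample size needed for the directly genotyped risk SNP by 1/r2 when a tagSNP is used instead) is the one stated on the slide.

```python
from math import ceil


def tag_adjusted_sample_size(n_direct: int, r2: float) -> int:
    """Inflate a sample size by 1/r^2 when a tagSNP stands in for the risk SNP."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    # Small tolerance guards against floating-point noise before rounding up.
    return ceil(n_direct / r2 - 1e-9)


# Illustrative values only: 1000 cases suffice if the risk SNP is genotyped directly.
for r2 in (1.0, 0.8, 0.5):
    print(f"r2 = {r2:.1f} -> {tag_adjusted_sample_size(1000, r2)} cases")
```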

Page 23: Day2 145pm Crawford

Study Design

Power calculation example:
Cases: Adverse reaction (wheezing) to flu vaccination
Controls: Vaccinated children with no adverse reactions

[Figure: sample size (cases, axis 0-160) versus genotype relative risk (2-6, additive model), one curve per MAF (0.05, 0.1, 0.15, 0.2, 0.25)]

Calculated using Quanto 1.1.1

Page 24: Day2 145pm Crawford

Study Design

Power calculation example:
Immunogenicity to influenza A (H5N1) vaccine

[Figure: sample size (axis 0-900) versus R2 (0.01-0.49), additive model]

Calculated using Quanto 1.1.1

Page 25: Day2 145pm Crawford

Study Design
Why are you considering an association study instead of linkage?

• Linkage analysis is powerful for disorders with
  – Discernible pattern of inheritance
  – Rare alleles w/ large genetic effect sizes
  – High penetrance

• Not powerful for disorders that
  – have complex pattern of inheritance
  – are common
  – have many risk alleles with small effect sizes
  – have low penetrance

Page 26: Day2 145pm Crawford

Study Design

Common variant/common disease hypothesis

• Common genetic variants confer susceptibility

• Risk-conferring alleles ancient; common across most populations

• Risk-conferring allele has small effect

• Multiple risk alleles expected for common disease; also environment

Page 27: Day2 145pm Crawford

Study Design

Should you design a candidate gene or whole genome study?

• Candidate gene association study
  – Interrogate specific genes or regions
  – Based on previous knowledge or biological plausibility
  – Hypothesis testing

• Whole genome association study
  – Interrogate the “entire” genome
  – No previous knowledge required
  – Hypothesis generation

Page 28: Day2 145pm Crawford

Candidate gene association studies

• Choose gene based on previous knowledge
  – Gene function
  – Biological pathway
  – Previous linkage or association study

• Choose DNA variations for genotyping
  – Direct association approach
  – Indirect association approach

Page 29: Day2 145pm Crawford

Direct Candidate Gene Association Study

Genotype “functional” SNPs

Collins et al (1997) Science 278:1580-1581

Example: Nonsynonymous SNPs

Page 30: Day2 145pm Crawford

Direct Candidate Gene Association Study

Botstein and Risch (2003) Nat Genet 33 Suppl:228-37.

Problem: We don’t know what is functional and what is not functional

Page 31: Day2 145pm Crawford

Direct Candidate Gene Association Study

What would we miss?

Functional synonymous SNPs in MDR1 alter P-glycoprotein activity

Komar (2007) Science 315:466-467

Page 32: Day2 145pm Crawford

Direct Candidate Gene Association Study

What would we miss?

• 99% of the human genome is non-coding

• Non-coding SNPs or DNA variations in
  – Introns
  – Intergenic regulatory regions

Page 33: Day2 145pm Crawford

Indirect Candidate Gene Association Study

• Genotype a fraction of all SNPs regardless of “function”

• Rely on SNP-SNP correlations (linkage disequilibrium) to capture information for SNPs not genotyped

Kruglyak (2005) Nat Genet 37:1299-1300

Page 34: Day2 145pm Crawford

Indirect Candidate Gene Association Study

Linkage disequilibrium (LD)

Measured by $r^2$

$r^2 = \frac{[f(A_1B_1) - f(A_1)f(B_1)]^2}{f(A_1)f(A_2)f(B_1)f(B_2)}$

$r^2 = 0$: SNPs are independent
$r^2 = 1$: SNPs are perfectly correlated AND have the same minor allele frequency
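A minimal sketch of the r2 formula above, assuming two biallelic SNPs so that f(A2) = 1 - f(A1) and f(B2) = 1 - f(B1). The function name and the example frequencies are hypothetical. The third example illustrates why r2 = 1 also requires matching allele frequencies.

```python
def ld_r2(f_a1b1: float, f_a1: float, f_b1: float) -> float:
    """r^2 between two biallelic SNPs from haplotype and allele frequencies."""
    d = f_a1b1 - f_a1 * f_b1  # disequilibrium coefficient D
    denominator = f_a1 * (1 - f_a1) * f_b1 * (1 - f_b1)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denominator


# Illustrative frequencies only.
perfect = ld_r2(f_a1b1=0.30, f_a1=0.30, f_b1=0.30)      # A1 always travels with B1
independent = ld_r2(f_a1b1=0.09, f_a1=0.30, f_b1=0.30)  # f(A1B1) = f(A1) * f(B1)
unequal_maf = ld_r2(f_a1b1=0.30, f_a1=0.30, f_b1=0.50)  # A1 only on B1, but frequencies differ

print(f"perfect LD:       r2 = {perfect:.2f}")
print(f"independent SNPs: r2 = {independent:.2f}")
print(f"unequal MAF:      r2 = {unequal_maf:.2f}")
```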

Page 35: Day2 145pm Crawford

Indirect Candidate Gene Association Study

Using LD to pick “tagSNPs”

CRP, European-descent: 10 SNPs >5% MAF

CRP, European-descent: 4 tagSNPs at $r^2 > 0.80$

Page 36: Day2 145pm Crawford

Indirect Candidate Gene Association Study

“tagSNPs” are population specific

CRP, European-descent: 4 tagSNPs

CRP, African-descent: 10 tagSNPs

Page 37: Day2 145pm Crawford

Indirect Candidate Gene Association Study

• “tagSNPs” are population specific

• Merge sets for “cosmopolitan” set

http://gvs.gs.washington.edu/GVS/

Page 38: Day2 145pm Crawford

Indirect Candidate Gene Association Study

Multiple testing

• Testing many SNPs for association with disease status

• No consensus on correcting p-value
  – Bonferroni
  – False Discovery Rate

• Need to replicate findings in independent study

Page 39: Day2 145pm Crawford

Indirect Candidate Gene Association Study: Pros and Cons

• Can interrogate all common SNPs in gene

• SNPs must be known and genotypes available to calculate LD and pick tagSNPs

• Multiple testing within a gene

• Limited to previous knowledge

Page 40: Day2 145pm Crawford

Whole Genome Association Study

• Can now genotype 100K – 1 million SNPs

• Coverage depends on platform and chip
  – tagSNPs capturing HapMap common SNPs
  – Genic SNPs overrepresented
  – Conserved non-coding SNPs represented
  – Evenly spaced across genome

Platforms: Illumina Infinium assay, Affymetrix GeneChips

Page 41: Day2 145pm Crawford

Whole Genome Association Study

• Same study design and challenges as candidate gene
  – Mostly case-control (retrospective)
  – Multiple testing

• Data storage and higher-order interaction testing issues

• Hypothesis generation tool (replication)

Page 42: Day2 145pm Crawford

Case/Control Study Designs
For either candidate gene or whole genome

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)

Page 43: Day2 145pm Crawford

Case/Control Study Designs: Pros and Cons

Study          Pros                Cons
Case/Control   Easier to collect   Subject to bias
               Less expensive      No risk estimates
Prospective    Risk estimates      Harder to collect
                                   More expensive
                                   Subject to bias

For rare outcomes, case/control design may be the only option

Page 44: Day2 145pm Crawford

Case/Control Study Designs: Pros and Cons

Types of bias

• Bias in selection of cases
  – Those that are currently living
  – Miss fatal or short episodes of disease
  – Might miss mild diseases
  – Referral/admission bias
• Non-response bias
• Exposure suspicion bias
• Family information bias
• Recall bias

Often ignored in genetic association studies

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)

Page 45: Day2 145pm Crawford

Analysis Methods

Genotype QC

• Test for departures from Hardy-Weinberg Equilibrium

• Test for gender inconsistencies

• Eliminate very rare SNPs (no power)

• Eliminate SNPs with low genotyping efficiency

• Eliminate samples with low genotyping efficiency
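The first QC step, testing for departure from Hardy-Weinberg Equilibrium, can be sketched as a 1-degree-of-freedom chi-square goodness-of-fit test on genotype counts. This is one common approach, not necessarily the one used in the talk (exact tests are often preferred for rare alleles). The function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> tuple[float, float]:
    """Chi-square test (1 df) of observed genotype counts against HWE expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Illustrative counts: the first SNP fits HWE, the second has no heterozygotes.
in_hwe = hwe_chi_square(25, 50, 25)
out_of_hwe = hwe_chi_square(50, 0, 50)
print(f"in HWE:     chi2 = {in_hwe[0]:.2f}, p = {in_hwe[1]:.3g}")
print(f"out of HWE: chi2 = {out_of_hwe[0]:.2f}, p = {out_of_hwe[1]:.3g}")
```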

Page 46: Day2 145pm Crawford

Analysis Methods

What statistical methods do you use to analyze your data?

• SNP by SNP (borrowed from epidemiology)
  – Chi-square and Fisher’s exact (2x2 table, 2x3 table)
  – Logistic and linear regression (covariates)

• Haplotypes
  – Haplo.stats and regression

• Interactions
  – Traditional regression
  – MDR (Ritchie et al)

Page 47: Day2 145pm Crawford

Analysis Methods

The Case/Control Study

               Case   Control
Minor allele   A      B
Major allele   C      D

Odds ratio (OR) = ratio of the odds of the minor allele in cases (A/C) to the odds in controls (B/D)

$OR = \frac{A/C}{B/D} = \frac{A \cdot D}{B \cdot C}$
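A minimal sketch of the allelic odds ratio from the 2x2 table above, with a 95% confidence interval from the standard log-OR approximation (Woolf's method). The confidence-interval method is an assumption added here, not something the slide specifies. The function name and the allele counts are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a: int, b: int, c: int, d: int,
                       z: float = 1.96) -> tuple[float, float, float]:
    """OR = (A*D)/(B*C) with an approximate 95% CI on the log scale.

    a, b = minor-allele counts in cases and controls;
    c, d = major-allele counts in cases and controls. All must be > 0.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative allele counts only.
or_estimate, ci_lower, ci_upper = odds_ratio_with_ci(a=60, b=40, c=140, d=160)
print(f"OR = {or_estimate:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```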

Page 48: Day2 145pm Crawford

Analysis Methods

For genotypes, set homozygous for major allele (A) as “referent” genotype, and calculate 2 odds ratios:

      Case   Control
Aa    A      B
AA    C      D

      Case   Control
aa    A      B
AA    C      D

Page 49: Day2 145pm Crawford

Analysis Methods

Case/control: Interpretation of Odds Ratio

1.0 – Referent
>1.0 – Greater odds of disease compared with controls
<1.0 – Lesser odds of disease compared with controls

Confidence Intervals: probably contain true OR

OR does not measure risk*

Page 50: Day2 145pm Crawford

Prospective cohort

• Disease free at beginning of study

• Followed over time for disease (“incident”)

• Follow “exposed” and “unexposed” groups

• Gold-standard study design

Analysis Methods

Page 51: Day2 145pm Crawford

Analysis Methods

Prospective cohort

             Case   Control   Total
Exposed      A      B         (A+B)
Unexposed    C      D         (C+D)

Risk Ratio (RR) = incidence of disease in the exposed, A/(A+B), divided by incidence of disease in the unexposed, C/(C+D)

$RR = \frac{A/(A+B)}{C/(C+D)}$

Page 52: Day2 145pm Crawford

Analysis Methods

Prospective Study: Interpretation of Risk Ratio

1.0 – Referent
>1.0 – Risk for disease increases
<1.0 – Risk for disease decreases

Confidence Intervals: probably contain true RR

*For rare diseases, OR ~ RR
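The footnote that OR approximates RR for rare diseases can be checked numerically. The sketch below uses the exposed/unexposed table layout from the prospective cohort slide; the function names and all counts are hypothetical, chosen so that the risk ratio is 2 in both cohorts.

```python
def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """RR = [A/(A+B)] / [C/(C+D)] for exposed (A, B) and unexposed (C, D) rows."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (A*D) / (B*C) for the same table."""
    return (a * d) / (b * c)


# Illustrative cohorts: (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).
rare_disease = (20, 9980, 10, 9990)    # 0.2% versus 0.1% incidence
common_disease = (400, 600, 200, 800)  # 40% versus 20% incidence

rr_rare, or_rare = risk_ratio(*rare_disease), odds_ratio(*rare_disease)
rr_common, or_common = risk_ratio(*common_disease), odds_ratio(*common_disease)

print(f"rare disease:   RR = {rr_rare:.3f}, OR = {or_rare:.3f}")
print(f"common disease: RR = {rr_common:.3f}, OR = {or_common:.3f}")
```

When the outcome is common, the odds ratio drifts away from the risk ratio, which is why the previous slide cautions that the OR does not measure risk.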

Page 53: Day2 145pm Crawford

Case/control: Matching

Age Gender Race

Warning: Can “over match” and miss describing an interesting factor

Bad Example:
Cases: Adults with heart disease
Controls: Newborns without heart disease

Analysis Methods

Page 54: Day2 145pm Crawford

Case/control: Stratifying

Age Gender Race

Warning: Need sufficient sample size to stratify or split the data into males and females

Example:
Cases with heart disease
Age-matched controls without heart disease
(Exposure: smoking status)

Stratify for Gender Specific Risks

Analysis Methods

Page 55: Day2 145pm Crawford

Problems in Case/Control genetic association studies –

• “Confounding” by race or ancestry

• AKA population stratification

• Solutions:
  – Match
  – Stratify
  – Adjust (using genetic markers)
  – “Trios”

Cardon and Palmer (2003) Lancet 361:598-604

Analysis Methods

Page 56: Day2 145pm Crawford

Analysis Methods

Regression

• Given
  – Height as “target” or “dependent” variable
  – Sex as “explanatory” or “independent” variable

• Fit regression model

$\text{height} = \beta \cdot \text{sex} + \varepsilon$

Page 57: Day2 145pm Crawford

Analysis Methods

• Given

– Quantitative “target” or “dependent” variable y

– Quantitative or binary “explanatory” or “independent” variables xi

• Fit regression model

$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

Regression

Page 58: Day2 145pm Crawford

Analysis Methods

Regression

• Works best for normal y and x
• Can include covariates
• Fit regression model

$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

• Estimate errors on the β’s
• Use t-statistic to evaluate significance of the β’s
• Use F-statistic to evaluate model overall
• Use $R^2$ to evaluate variance explained by model
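A minimal sketch of the single-predictor case (height regressed on sex), computing the coefficient, its standard error and t-statistic, and R2 by ordinary least squares. The function name and the height data are hypothetical, and the sketch fits an intercept, which is an assumption beyond the model as written on the slide. In practice a statistics package would be used instead.

```python
from math import sqrt


def simple_linear_regression(x: list[float], y: list[float]) -> dict[str, float]:
    """Ordinary least squares for y = intercept + beta * x + error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residual_ss = sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    se_beta = sqrt(residual_ss / (n - 2) / sxx)  # standard error of beta
    return {
        "beta": beta,
        "intercept": intercept,
        "se_beta": se_beta,
        "t": beta / se_beta,               # t-statistic for beta, n - 2 df
        "r2": 1 - residual_ss / total_ss,  # variance explained by the model
    }


# Made-up heights in cm; sex coded 0 = female, 1 = male.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
height = [160, 162, 165, 163, 174, 176, 178, 172]
fit = simple_linear_regression(sex, height)
print({name: round(value, 3) for name, value in fit.items()})
```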

Page 59: Day2 145pm Crawford

Analysis Methods

Coding Genotypes

Genotype   Dominant   Additive   Recessive
AA         1          2          1
AG         1          1          0
GG         0          0          0

Genotype can be re-coded in any number of ways for regression analysis
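The coding scheme on this slide (GG = 0/0/0, AG = 1/1/0, AA = 1/2/1 for dominant/additive/recessive, with A as the coded allele) can be sketched as a lookup table. The function and variable names are hypothetical.

```python
# Numeric codes per genetic model, matching the table on the slide.
GENOTYPE_CODES = {
    "dominant": {"GG": 0, "AG": 1, "AA": 1},
    "additive": {"GG": 0, "AG": 1, "AA": 2},
    "recessive": {"GG": 0, "AG": 0, "AA": 1},
}


def code_genotypes(genotypes: list[str], model: str) -> list[int]:
    """Convert genotype strings to numeric codes for regression."""
    codes = GENOTYPE_CODES[model]
    # Sort the two alleles so that "GA" and "AG" are treated identically.
    return [codes["".join(sorted(g.upper()))] for g in genotypes]


sample = ["GG", "AG", "GA", "AA"]
for model in GENOTYPE_CODES:
    print(f"{model:<9} {code_genotypes(sample, model)}")
```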

Page 60: Day2 145pm Crawford
Page 61: Day2 145pm Crawford
Page 62: Day2 145pm Crawford

Example of gene-environment interaction and traditional regression

Page 63: Day2 145pm Crawford

Analysis Methods

Statistical Packages for Genetic Association Studies

• Candidate gene association study
  – SAS/Genetics
  – STATA
  – SPSS
  – R
  – PLINK

• Whole genome association study
  – R
  – PLINK

Page 64: Day2 145pm Crawford

Analysis Methods

Whole genome in PLINK (pngu.mgh.harvard.edu/~purcell/plink/)

Can adjust for population stratification
Can add covariates

[Figure: genome-wide association plot, MHC removed; labels: P<1x10^-100, P<2x10^-11, P<5x10^-8; genome-wide significance at P=5x10^-8]

Plenge et al 2007 NEJM

Page 65: Day2 145pm Crawford

SNPs versus Haplotypes

• There is no right answer: explore both

• The only thing that matters is the correlation between the assayed variable and the causal variable

• Sometimes the best assayed variable is a SNP, sometimes a haplotype

Page 66: Day2 145pm Crawford

SNPs versus Haplotypes

Statistical Packages for Genetic Association Studies with haplotypes

• Haplo.stats (haplotype regression)
  Lake et al, Hum Hered. 2003;55(1):56-65.

• PHASE (case/control haplotype)
  Stephens et al, Am J Hum Genet. 2005 Mar;76(3):449-62

• Haploview (case/control SNP analysis)
  Barrett et al, Bioinformatics. 2005 Jan 15;21(2):263-5.

• SNPHAP (haplotype regression?)
  Sham et al Behav Genet. 2004 Mar;34(2):207-14.

Page 67: Day2 145pm Crawford

Analysis Methods

Multiple testing

• Bonferroni correction
  Too conservative b/c each SNP tested may not be independent (LD)
  How many independent tests did you do?
  See Conneely and Boehnke AJHG (in press)

• False Discovery Rate
  Also has arbitrary threshold

• Best bet is replication
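The two corrections named above can be sketched as follows. The false discovery rate procedure shown is Benjamini-Hochberg, which is one common choice and an assumption here, since the slide does not name a specific method. Function names and p-values are hypothetical. Dividing 0.05 by one million tests gives 5x10-8, the genome-wide significance level quoted on the PLINK slide.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[int]:
    """Return indices of tests declared significant at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    return sorted(order[:cutoff_rank])


# Made-up p-values for 10 SNPs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

genome_wide = bonferroni_threshold(0.05, 1_000_000)
per_snp = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= per_snp]
fdr_hits = benjamini_hochberg(p_values, q=0.05)

print(f"Genome-wide threshold for 1 million tests: {genome_wide:.1e}")
print(f"Bonferroni hits: {bonferroni_hits}")
print(f"FDR hits:        {fdr_hits}")
```

In this toy example the false discovery rate procedure retains one more SNP than Bonferroni, which illustrates why Bonferroni is described as conservative.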

Page 68: Day2 145pm Crawford

Statistical Replication

CRP SNPs and CRP levels in NHANES III

[Figure: change in ln(CRP) per copy relative to H2 (axis 0-0.6) for haplotypes H2, H5, H6, H7, H8 in Black, Mexican-American, and White participants]

Results Consistent with CARDIA

Carlson et al. AJHG 2005;77:64-77

Crawford et al Circulation 2006; 114:2458-2465

Page 69: Day2 145pm Crawford

Functional Replication

• Statistical replication is not always possible

• Association may imply mechanism

• Test for mechanism at the bench
  – Is predicted effect in the right direction?
  – Dissect haplotype effects to define functional SNPs

Page 70: Day2 145pm Crawford

Functional Replication

CRP Evolutionary Conservation

• TATA box: 1697
• Transcript start: 1741
• CRP Promoter region (bp 1444-1650) >75% conserved in mouse

Page 71: Day2 145pm Crawford

Functional Replication

Low CRP Levels Associated with H1-4

• USF1 (Upstream Stimulating Factor)
  – Polymorphism at 1440 alters USF1 binding site

Positions: 1420 1430 1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5-6  gcagctacCACGTGcacccagatggcCACTTGtt

Page 72: Day2 145pm Crawford

Functional Replication

High CRP Levels Associated with H6

• USF1 (Upstream Stimulating Factor)
  – Polymorphism at 1421 alters another USF1 binding site

Positions: 1420 1430 1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5    gcagctacCACGTGcacccagatggcCACTTGtt
H6    gcagctacCACATGcacccagatggcCACTTGtt

Page 73: Day2 145pm Crawford

Functional Replication

CRP Promoter Luciferase Assay

[Figure: fold change over H1-3 (axis 0.0-4.0) for constructs H1-3, H4, H5, H6, H7-8, empty, SV40p]

Carlson et al, AJHG v77 p64

Page 74: Day2 145pm Crawford

Association Analysis Outline

• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function