day2 145pm crawford

Association Analysis

University of Louisville
Center for Genetics and Molecular Medicine

January 11, 2008

Dana Crawford, PhD
Vanderbilt University
Center for Human Genetics Research

Association Analysis Outline

• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function

Study Design

Does your trait or phenotype have a genetic component?

• Segregation analysis

• Recurrence risks

• Heritability

• Other sources of evidence for a genetic component

Classic Segregation Analysis

• Determines if a major gene is involved

• Compares data to Mendelian models, such as
  – Autosomal dominant
  – Autosomal recessive
  – X-linked

• Results can be used as parameters for linkage analysis (e.g. parametric LOD)

• Subject to ascertainment bias

Note: More complex methods needed for complex traits

Recurrence Risks

The chance that a disease present in the family will recur in that family

“Lightning striking twice”

If recurrence risk is greater in the family compared with unrelated individuals, the disease has a “genetic” component

Suggests familial aggregation

Recurrence Risks

Measured using the risk ratio (λ)

Sibling risk ratio = λs

λs = (sibling recurrence risk) / (population prevalence)

Cystic fibrosis λs = (0.25/0.0004) = 625

Huntington disease λs = (0.50/0.0001) = 5000
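The risk ratio is a single division, so it is easy to check by hand or in code. The sketch below uses the Huntington disease figures quoted above; the function name `sibling_risk_ratio` is only illustrative.

```python
def sibling_risk_ratio(recurrence_risk, population_prevalence):
    """Lambda-s: risk to a sibling of an affected person divided by population prevalence."""
    if population_prevalence <= 0:
        raise ValueError("population prevalence must be positive")
    return recurrence_risk / population_prevalence


# Huntington disease: sibling recurrence risk 0.50, population prevalence 0.0001
print(round(sibling_risk_ratio(0.50, 0.0001)))
```

A λs near 1 means siblings are at about the same risk as the general population.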

Recurrence Risks: Complex traits

λ here is for first degree relative

Merikangas and Risch (2003) Science 302:599-601.

Heritability

Think “twin studies”

The proportion of phenotypic variation in a population attributable to genetic variation

Quantitative traits

Heritability measured as $h^2$

(Can also be family studies)

Heritability and Quantitative Traits

Determined by genes and environment

[Figure: panels for Boys and Girls, each showing Mexican Americans, Blacks, and Whites]

Example: Height

NHANES 1971-1974 versus NHANES 1999-2002

Freedman et al (2006) Obesity 14:301-308

Heritability and Quantitative Traits

Trait variation = genetic + environment
$\sigma_T^2 = \sigma_G^2 + \sigma_E^2$

Genetic variation = additive + dominant
$\sigma_G^2 = \sigma_a^2 + \sigma_d^2$

Environmental variation = familial/household + random/individual
$\sigma_E^2 = \sigma_f^2 + \sigma_e^2$

Broad sense heritability
$h_B^2 = \sigma_G^2 / \sigma_T^2$

Narrow sense heritability
$h_N^2 = \sigma_a^2 / \sigma_T^2$

Heritability and Twins Studies

$h^2 = 2(r_{MZ} - r_{DZ})$,

where r is the correlation coefficient

Monozygotic = same genetic material = r ~ 100%

Dizygotic = half genetic material = r ~ 50%

Heritability and Twins Studies

Trait r(MZ) r(DZ) Reference

Cholesterol 0.76 0.39 Fenger et al

SBP 0.60 0.32 Evans et al

BMI 0.67 0.32 Schousboe et al

Perceived pitch 0.67 0.44 Drayna et al
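Applying the twin formula from the previous slide, 2(rMZ – rDZ), to the correlations in this table gives a rough heritability estimate per trait. For cholesterol this is 2(0.76 – 0.39) = 0.74. The sketch below repeats the arithmetic for each row; the function and variable names are illustrative only.

```python
def twin_heritability(r_mz, r_dz):
    """Heritability estimate from twin correlations: 2 * (r_MZ - r_DZ)."""
    return 2 * (r_mz - r_dz)


# (r_MZ, r_DZ) pairs copied from the table above
twin_correlations = {
    "Cholesterol": (0.76, 0.39),
    "SBP": (0.60, 0.32),
    "BMI": (0.67, 0.32),
    "Perceived pitch": (0.67, 0.44),
}

for trait, (r_mz, r_dz) in twin_correlations.items():
    print(f"{trait}: h2 = {twin_heritability(r_mz, r_dz):.2f}")
```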

Heritability: Is everything genetic?

Trait r(MZ) r(DZ) Reference

Vote choice 0.81 0.69 Hatemi et al

Religiousness 0.62 0.42 Koenig et al

Other Evidence For A Genetic Component

Monogenic disorders

Example: Phenotype of interest is sensitivity to warfarin dosing, but there are no heritability estimates

Solution: Rare, familial disorder of warfarin resistance


Case Reports

Example: Phenotype of interest is susceptibility to Neisseria meningitidis (prevalence: 1/100,000)

Solution: Case report of recurrent N. meningitidis in patient


• Animal models

• Biochemistry or biological pathways

• Expression data

• Previous genetic association studies

Other good arguments…

Study Design
How well can you diagnose the disease or measure the trait?

• Narrow definitions better than all-inclusive definitions
  – There are many paths that lead to the same phenotype

• Avoid misclassification and measurement error
  – Direct measurement versus recall/survey data or indirect proxies

• Be aware of age of onset
  – Can your control become a case over time?

Arguably most important step in study design

Target Phenotypes
Disease or Quantitative trait?

Carlson et al. (2004) Nature 429:446-452

[Figure: diagram with nodes MI, CRP, LDL-C, IL6, LDLR, Acute Illness, Diet]

Note: SNPs associated with quantitative traits may not be associated with clinical endpoint

Study Design

How many cases and controls will you need to detect an association?

Statistical Power

• Null hypothesis: all alleles are equal risk

• Given that a risk allele exists, how likely is a study to reject the null?

• Study sample size ideally determined before you begin to recruit and genotype

• Statistical significance
  – Significance = p(false positive)
  – Traditional threshold 5%

• Statistical power
  – Power = 1 - p(false negative)
  – Traditional threshold 80%

• Traditional thresholds balance confidence in results against reasonable sample size
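To make the link between significance, power, and sample size concrete, here is a minimal sketch of a power calculation for a case/control comparison of allele frequencies. It assumes a two-sided normal-approximation test of two proportions with two alleles counted per person. This is a textbook approximation, not the method used by any particular power calculator, and the function name and parameters are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(control_maf, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate power to detect an allele-frequency difference (normal approximation)."""
    # Allele frequency in cases implied by the per-allele odds ratio
    case_odds = odds_ratio * control_maf / (1 - control_maf)
    case_maf = case_odds / (1 + case_odds)
    # Each person contributes two alleles
    alleles_cases, alleles_controls = 2 * n_cases, 2 * n_controls
    se = sqrt(case_maf * (1 - case_maf) / alleles_cases
              + control_maf * (1 - control_maf) / alleles_controls)
    z_effect = abs(case_maf - control_maf) / se
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection tail
    return NormalDist().cdf(z_effect - z_alpha) + NormalDist().cdf(-z_effect - z_alpha)


for n in (200, 500, 1000, 2000):
    print(n, round(allelic_power(0.20, 1.3, n, n), 2))
```

With an odds ratio of 1 the function returns alpha, which is the false positive rate under the null.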

Study Design
What are the thresholds/variables in a general power calculation?

Note: Significance threshold for 1 SNP tested

Study Design

Power Calculation Resources

• Quanto (hydra.usc.edu/gxe/)
  Supports quantitative, discrete traits (unrelated and family based)

• Genetic Power Calculator (pngu.mgh.harvard.edu/~purcell/gpc/)
  Supports discrete traits, variance components, quantitative traits for linkage and association studies

(List of other software: linkage.rockefeller.edu/soft/)

Study Design
How can you maximize power for your study?

• Large sample size
  – Better estimate of variability or risk
  – Chance of misclassification / measurement error

• Large genetic effect size
  – SNP risk allele with large odds ratio or explains a lot of trait variance
  – This is unknown at beginning of study

• Risk SNP is common
  – This is unknown at beginning of study
  – Calculate power for a range of common MAFs (5-45%)

• Genotype the risk SNP directly
  – Risk SNP is unknown at beginning of study
  – Remember tagSNPs are imperfect proxies
  – Adjust sample size by 1/r2
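The 1/r2 adjustment in the last bullet can be sketched as follows: if N samples would be needed when genotyping the risk SNP directly, a tagSNP correlated with it at r2 needs roughly N / r2 samples for the same power. The function name is illustrative.

```python
import math


def tagsnp_adjusted_sample_size(n_direct, r2):
    """Inflate the directly-genotyped sample size by 1/r2 for an imperfect tagSNP."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return math.ceil(n_direct / r2)


# A tagSNP with r2 = 0.5 to the risk SNP doubles the required sample size
print(tagsnp_adjusted_sample_size(1000, 0.5))
```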

Study Design

Power calculation example:
Cases: Adverse reaction (wheezing) to flu vaccination
Controls: Vaccinated children with no adverse reactions

[Figure: sample size (cases, 0 to 160) versus genotype relative risk (2 to 6), additive model, one curve per MAF (0.05, 0.1, 0.15, 0.2, 0.25). Calculated using Quanto 1.1.1]

Study Design

Power calculation example: Immunogenicity to influenza A (H5N1) vaccine

[Figure: sample size (0 to 900) versus R2 (0.01 to 0.49), additive model. Calculated using Quanto 1.1.1]

Study Design
Why are you considering an association study instead of linkage?

• Linkage analysis is powerful for disorders with
  – Discernible pattern of inheritance
  – Rare alleles w/ large genetic effect sizes
  – High penetrance

• Not powerful for disorders that
  – have complex pattern of inheritance
  – are common
  – have many risk alleles with small effect sizes
  – have low penetrance

Common variant/common disease hypothesis

• Common genetic variants confer susceptibility

• Risk-conferring alleles ancient; common across most populations

• Risk-conferring allele has small effect

• Multiple risk alleles expected for common disease; also environment

Study Design

Study Design

Should you design a candidate gene or whole genome study?

• Candidate gene association study
  – Interrogate specific genes or regions
  – Based on previous knowledge or biological plausibility
  – Hypothesis testing

• Whole genome association study
  – Interrogate the “entire” genome
  – No previous knowledge required
  – Hypothesis generation

Candidate gene association studies

• Choose gene based on previous knowledge
  – Gene function
  – Biological pathway
  – Previous linkage or association study

• Choose DNA variations for genotyping
  – Direct association approach
  – Indirect association approach

Direct Candidate Gene Association Study

Genotype “functional” SNPs

Collins et al (1997) Science 278:1580-1581

Example: Nonsynonymous SNPs


Botstein and Risch (2003) Nat Genet 33 Suppl:228-37.

Problem: We don’t know what is functional and what is not functional


What would we miss?

Functional synonymous SNPs in MDR1 alter P-glycoprotein activity

Komar (2007) Science 315:466-467


What would we miss?

• 99% of the human genome is non-coding

• Non-coding SNPs or DNA variations in
  – Introns
  – Intergenic regulatory regions

Indirect Candidate Gene Association Study

• Genotype a fraction of all SNPs regardless of “function”

• Rely on SNP-SNP correlations (linkage disequilibrium) to capture information for SNPs not genotyped

Kruglyak (2005) Nat Genet 37:1299-1300


Linkage disequilibrium (LD)

Measured by $r^2$

$r^2 = \frac{[f(A_1B_1) - f(A_1)f(B_1)]^2}{f(A_1)f(A_2)f(B_1)f(B_2)}$

$r^2 = 0$: SNPs are independent
$r^2 = 1$: SNPs are perfectly correlated AND have the same minor allele frequency
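The r2 formula can be evaluated directly from one haplotype frequency and two allele frequencies. The sketch below assumes biallelic SNPs, so f(A2) = 1 - f(A1) and f(B2) = 1 - f(B1); the function name is illustrative.

```python
def ld_r2(f_a1b1, f_a1, f_b1):
    """r2 between two biallelic SNPs from the A1B1 haplotype and allele frequencies."""
    d = f_a1b1 - f_a1 * f_b1  # disequilibrium coefficient D
    denom = f_a1 * (1 - f_a1) * f_b1 * (1 - f_b1)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Independent SNPs: haplotype frequency equals the product of allele frequencies
print(ld_r2(0.25, 0.5, 0.5))
# Perfectly correlated SNPs with equal allele frequencies
print(ld_r2(0.5, 0.5, 0.5))
```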


Using LD to pick “tagSNPs”

CRP, European-descent: 10 SNPs >5% MAF

CRP, European-descent: 4 tagSNPs at r2 > 0.80
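One common way to pick tagSNPs is a greedy binning on pairwise r2: repeatedly choose the SNP that captures the most not-yet-tagged SNPs at the chosen threshold. The sketch below is an illustration of that idea under stated assumptions, not necessarily the procedure used to produce the slide. The SNP names and r2 values are made up.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy tagSNP selection. r2 maps snp -> {other_snp: r2}. Returns {tag: captured set}."""
    untagged = set(r2)
    bins = {}
    while untagged:
        def captured(snp):
            # A SNP captures itself plus every untagged SNP at or above the threshold
            return {s for s in untagged if s == snp or r2[snp].get(s, 0.0) >= threshold}

        # Sorting first makes tie-breaking deterministic
        best = max(sorted(untagged), key=lambda s: len(captured(s)))
        bins[best] = captured(best)
        untagged -= bins[best]
    return bins


# Hypothetical pairwise r2 values (symmetric); unlisted pairs are treated as 0
pairs = {("s1", "s2"): 0.95, ("s1", "s3"): 0.90, ("s2", "s3"): 0.85,
         ("s4", "s5"): 0.82, ("s1", "s4"): 0.10}
r2 = {s: {} for s in ("s1", "s2", "s3", "s4", "s5")}
for (a, b), value in pairs.items():
    r2[a][b] = value
    r2[b][a] = value

for tag, members in greedy_tag_snps(r2).items():
    print(tag, sorted(members))
```

Because r2 depends on population allele and haplotype frequencies, the same input SNPs can yield different tag sets in different populations.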


“tagSNPs” are population specific

CRP, European-descent: 4 tagSNPs

CRP, African-descent: 10 tagSNPs


• “tagSNPs” are population specific

• Merge sets for “cosmopolitan” set

http://gvs.gs.washington.edu/GVS/


Multiple testing

• Testing many SNPs for association with disease status

• No consensus on correcting p-value
  – Bonferroni
  – False Discovery Rate

• Need to replicate findings in independent study

Indirect Candidate Gene Association Study: Pros and Cons

• Can interrogate all common SNPs in gene

• SNPs must be known and genotypes available to calculate LD and pick tagSNPs

• Multiple testing within a gene

• Limited to previous knowledge

Whole Genome Association Study

• Can now genotype 100K – 1 million SNPs

• Coverage depends on platform and chip
  – tagSNPs capturing HapMap common SNPs
  – Genic SNPs overrepresented
  – Conserved non-coding SNPs represented
  – Evenly spaced across genome

Illumina Infinium assay Affymetrix GeneChips

Whole Genome Association Study

• Same study design and challenges as candidate gene

  – Mostly case-control (retrospective)
  – Multiple testing

• Data storage and higher-order interaction testing issues

• Hypothesis generation tool (replication)

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)

Case/Control Study Designs: Pros and Cons
For either candidate gene or whole genome

Study          Pros                Cons
Case/Control   Easier to collect   Subject to bias
               Less expensive      No risk estimates
Prospective    Risk estimates      Harder to collect
                                   More expensive
                                   Subject to bias

For rare outcomes, case/control design may be only option

Case/Control Study Designs: Pros and Cons

Types of bias

• Bias in selection of cases
  – Those that are currently living
  – Miss fatal or short episodes of disease
  – Might miss mild diseases
  – Referral/admission bias
• Non-response bias
• Exposure suspicion bias
• Family information bias
• Recall bias

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)

Often ignored in genetic association studies

Analysis Methods

Genotype QC

• Test for departures of Hardy-Weinberg Equilibrium

• Test for gender inconsistencies

• Eliminate very rare SNPs (no power)

• Eliminate SNPs with low genotyping efficiency

• Eliminate samples with low genotyping efficiency
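As an illustration of the first QC step, a Hardy-Weinberg check can be sketched as a 1-degree-of-freedom chi-square test comparing observed genotype counts with those expected from the estimated allele frequency. This is the simple asymptotic version, assumed here for illustration; exact tests are often preferred when counts are small. Function and argument names are illustrative.

```python
import math


def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square test (1 df) for departure from Hardy-Weinberg equilibrium."""
    n = n_major_hom + n_het + n_minor_hom
    p = (2 * n_major_hom + n_het) / (2 * n)  # major allele frequency
    q = 1 - p
    observed = (n_major_hom, n_het, n_minor_hom)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))   # counts exactly at HWE proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all
```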

Analysis Methods

What statistical methods do you use to analyze your data?

• SNP by SNP (borrowed from epidemiology)
  – Chi-square and Fisher’s exact
      2x2 table
      2x3 table
  – Logistic and linear regression
      Covariates

• Haplotypes
  – Haplo.stats and regression

• Interactions
  – Traditional regression
  – MDR (Ritchie et al)

Analysis Methods

               Case   Control
Minor allele   A      B
Major allele   C      D

Odds ratio (OR) = ratio of odds of minor allele in Cases (A/C) and Controls (B/D)

OR = (A*D)/(B*C)
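The odds ratio from the 2x2 table, with an approximate 95% confidence interval, can be sketched as below. The interval uses the standard log-scale (Woolf) approximation, which is an assumption added here and not taken from the slides. The counts are invented for illustration.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """OR = (a*d)/(b*c) with a Woolf (log-scale) confidence interval.

    a, b: minor allele counts in cases and controls
    c, d: major allele counts in cases and controls
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: cases 20 minor / 80 major, controls 10 minor / 90 major
print(odds_ratio_with_ci(20, 10, 80, 90))
```

An interval that includes 1.0 means the data are compatible with no association at that confidence level.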

The Case/Control Study

For genotypes, set homozygous for major allele (A) as “referent” genotype, and calculate 2 odds ratios:

      Case   Control
Aa    A      B
AA    C      D

      Case   Control
aa    A      B
AA    C      D

Analysis Methods

Analysis Methods

Case/control: Interpretation of Odds Ratio

1.0 – Referent
> 1.0 – Greater odds of disease compared with controls
< 1.0 – Lesser odds of disease compared with controls

Confidence Intervals: probably contain true OR

OR does not measure risk*

Prospective cohort

• Disease free at beginning of study

• Followed over time for disease (“incident”)

• Follow “exposed” and “unexposed” groups

• Gold-standard study design

Analysis Methods

Analysis Methods

Prospective cohort

            Case   Control   Total
Exposed     A      B         (A+B)
Unexposed   C      D         (C+D)

Risk Ratio (RR) = incidence of disease in Exposed, A/(A+B), divided by incidence of disease in Unexposed, C/(C+D)

RR = [A/(A+B)] / [C/(C+D)]

Prospective Study: Interpretation of Risk Ratio

1.0 – Referent
> 1.0 – Risk for disease increases
< 1.0 – Risk for disease decreases

Confidence Intervals: probably contain true RR

*For rare diseases, OR ~ RR
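The footnote can be checked numerically. When disease is rare, A is small next to B and C is small next to D, so A/(A+B) is close to A/B and the risk ratio approaches the odds ratio. The sketch below compares the two on invented cohort counts.

```python
def cohort_risk_ratio(a, b, c, d):
    """RR = [a/(a+b)] / [c/(c+d)] for exposed (a, b) and unexposed (c, d) rows."""
    return (a / (a + b)) / (c / (c + d))


def cohort_odds_ratio(a, b, c, d):
    """OR = (a*d)/(b*c) for the same table."""
    return (a * d) / (b * c)


# Hypothetical rare disease: 10 of 10,000 exposed and 5 of 10,000 unexposed affected
print(cohort_risk_ratio(10, 9990, 5, 9995), cohort_odds_ratio(10, 9990, 5, 9995))
# Hypothetical common disease: 500 of 1,000 exposed and 250 of 1,000 unexposed affected
print(cohort_risk_ratio(500, 500, 250, 750), cohort_odds_ratio(500, 500, 250, 750))
```

In the common-disease example the odds ratio (3.0) clearly overstates the risk ratio (2.0).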

Analysis Methods

Case/control: Matching

Age Gender Race

Warning: Can “over match” and miss describing an interesting factor

Bad Example:
Cases: Adults with heart disease
Controls: Newborns without heart disease

Analysis Methods

Case/control: Stratifying

Age Gender Race

Warning: Need sufficient sample size to stratify or split the data into males and females

Ex. Cases with heart disease
Age-matched controls without heart disease
(Exposure: smoking status)

Stratify for Gender Specific Risks

Analysis Methods

Problems in Case/Control genetic association studies –

• “Confounding” by race or ancestry

• AKA population stratification

• Solutions:
  – Match
  – Stratify
  – Adjust (using genetic markers)
  – “Trios”

Cardon and Palmer (2003) Lancet 361:598-604

Analysis Methods

• Given

– Height as “target” or “dependent” variable

– Sex as “explanatory” or “independent” variable

• Fit regression model

$\text{height} = \beta \cdot \text{sex} + \varepsilon$

Analysis Methods

Regression

Analysis Methods

• Given

– Quantitative “target” or “dependent” variable y

– Quantitative or binary “explanatory” or “independent” variables xi

• Fit regression model

$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

Regression

• Works best for normal y and x
• Can include covariates
• Fit regression model

$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

• Estimate errors on β’s
• Use t-statistic to evaluate significance of β’s
• Use F-statistic to evaluate model overall
• Use R2 to evaluate variance explained by model
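For a single explanatory variable, the quantities listed above can be computed by hand. The sketch below fits height on sex by ordinary least squares and reports the slope, its standard error, the t-statistic, and R2. It adds an intercept term, which is an assumption not shown in the slide formula, and the data are invented. With one predictor, the overall F-statistic equals the square of the t-statistic.

```python
import math


def simple_ols(x, y):
    """Ordinary least squares for y = alpha + beta * x + error (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    se_beta = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return {"alpha": alpha, "beta": beta, "se_beta": se_beta,
            "t": beta / se_beta, "r2": 1 - sse / sst}


# Hypothetical data: sex coded 0/1, height in cm
sex = [0, 0, 0, 0, 1, 1, 1, 1]
height = [160, 162, 165, 163, 175, 178, 172, 177]
fit = simple_ols(sex, height)
print({name: round(value, 3) for name, value in fit.items()})
```

With a 0/1 predictor, the fitted slope is the difference between the two group means.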

Analysis Methods

Regression

Analysis Methods

Coding Genotypes

Genotype   Dominant   Additive   Recessive
GG         0          0          0
AG         1          1          0
AA         1          2          1

Genotype can be re-coded in any number of ways for regression analysis
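The three codings in the table translate directly into lookup tables that turn genotype calls into numeric predictors for regression. The sketch below uses the same G/A alleles as the table; the names `GENOTYPE_CODING` and `code_genotypes` are illustrative.

```python
# Numeric codes per genetic model, matching the coding table
GENOTYPE_CODING = {
    "dominant":  {"GG": 0, "AG": 1, "AA": 1},
    "additive":  {"GG": 0, "AG": 1, "AA": 2},
    "recessive": {"GG": 0, "AG": 0, "AA": 1},
}


def code_genotypes(genotypes, model="additive"):
    """Convert genotype strings such as 'AG' or 'GA' into numeric codes."""
    table = GENOTYPE_CODING[model]
    # Sort the two alleles so that 'GA' and 'AG' are treated identically
    return [table["".join(sorted(g))] for g in genotypes]


calls = ["GG", "AG", "GA", "AA"]
for model in GENOTYPE_CODING:
    print(model, code_genotypes(calls, model))
```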

Example of gene-environment interaction and traditional regression

Analysis Methods

Statistical Packages for Genetic Association Studies

• Candidate gene association study
  – SAS/Genetics
  – STATA
  – SPSS
  – R
  – PLINK

• Whole genome association study
  – R
  – PLINK

Analysis Methods

Whole genome in PLINK (pngu.mgh.harvard.edu/~purcell/plink/)

Can adjust for population stratification
Can add covariates

[Figure: whole-genome association results, MHC removed; annotated P<1x10^-100, P<2x10^-11, P<5x10^-8; genome-wide significance at P=5x10^-8]

Plenge et al 2007 NEJM

SNPs versus Haplotypes

• There is no right answer: explore both

• The only thing that matters is the correlation between the assayed variable and the causal variable

• Sometimes the best assayed variable is a SNP, sometimes a haplotype

SNPs versus Haplotypes

Statistical Packages for Genetic Association Studies with haplotypes

• Haplo.stats (haplotype regression)
  Lake et al, Hum Hered. 2003;55(1):56-65.

• PHASE (case/control haplotype)
  Stephens et al, Am J Hum Genet. 2005 Mar;76(3):449-62

• Haploview (case/control SNP analysis)
  Barrett et al, Bioinformatics. 2005 Jan 15;21(2):263-5.

• SNPHAP (haplotype regression?)
  Sham et al Behav Genet. 2004 Mar;34(2):207-14.

Analysis Methods

Multiple testing

• Bonferroni correction
  – Too conservative b/c each SNP tested may not be independent (LD)
  – How many independent tests did you do?
  – See Conneely and Boehnke AJHG (in press)

• False Discovery Rate
  – Also has arbitrary threshold

• Best bet is replication
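To compare the two corrections named above, the sketch below applies Bonferroni and a false discovery rate procedure to the same list of p-values. The FDR version shown is the Benjamini-Hochberg step-up procedure, chosen here as the most common variant; the slides do not say which one is meant. The p-values are invented.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject each test whose p-value is at most alpha divided by the number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank / m * q:
            largest_rank = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for index in order[:largest_rank]:
        reject[index] = True
    return reject


# Hypothetical p-values for 8 SNPs
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(sum(bonferroni_reject(pvalues)), "SNPs pass Bonferroni")
print(sum(benjamini_hochberg_reject(pvalues)), "SNPs pass Benjamini-Hochberg")
```

Both procedures still rest on a chosen threshold (alpha or q), which is why replication in an independent sample remains the stronger check.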

Statistical Replication

[Figure: change in ln(CRP) per copy relative to H2 (axis 0 to 0.6) for haplotypes H5, H6, H7, H8, by group (Black, Mexican-American, White)]

Carlson et al. AJHG 2005;77:64-77

Results Consistent with CARDIA

CRP SNPs and CRP levels in NHANES III

Crawford et al Circulation 2006; 114:2458-2465

• Statistical replication is not always possible

• Association may imply mechanism

• Test for mechanism at the bench
  – Is predicted effect in the right direction?
  – Dissect haplotype effects to define functional SNPs

Functional Replication


CRP Evolutionary Conservation

• TATA box: 1697
• Transcript start: 1741
• CRP Promoter region (bp 1444-1650) >75% conserved in mouse


Low CRP Levels Associated with H1-4

• USF1 (Upstream Stimulating Factor)
  – Polymorphism at 1440 alters USF1 binding site

1420 1430 1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5-6  gcagctacCACGTGcacccagatggcCACTTGtt

High CRP Levels Associated with H6

• USF1 (Upstream Stimulating Factor)
  – Polymorphism at 1421 alters another USF1 binding site

1420 1430 1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5    gcagctacCACGTGcacccagatggcCACTTGtt
H6    gcagctacCACATGcacccagatggcCACTTGtt
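The variable sites in this alignment can be located mechanically by comparing the haplotype sequences column by column. The sketch below reports 0-based offsets into the aligned strings, not genomic coordinates. The two variable columns it finds are 19 bases apart, which is consistent with the positions 1421 and 1440 named on these slides.

```python
# Promoter haplotype sequences copied from the alignment
haplotypes = {
    "H1-4": "gcagctacCACGTGcacccagatggcCACTCGtt",
    "H7-8": "gcagctacCACGTGcacccagatggcCACTAGtt",
    "H5":   "gcagctacCACGTGcacccagatggcCACTTGtt",
    "H6":   "gcagctacCACATGcacccagatggcCACTTGtt",
}


def variable_sites(sequences):
    """Return 0-based column offsets where the aligned sequences are not all identical."""
    sequences = list(sequences)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


for offset in variable_sites(haplotypes.values()):
    alleles = {name: seq[offset] for name, seq in haplotypes.items()}
    print(offset, alleles)
```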


CRP Promoter Luciferase Assay

[Figure: bar chart of fold change over H1-3 (axis 0.0 to 4.0) for constructs H1-3, H4, H5, H6, H7-8, empty, SV40p]

Carlson et al, AJHG v77 p64


Association Analysis Outline

• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function
