statistical genetics workshop tasha fingerlin
TRANSCRIPT
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
1/34
1
Design and Analysis of Genome-WideAssociation Studies
Tasha E. FingerlinDepartments of Epidemiology and Biostatistics & Informatics
Colorado School of Public Health
University of Colorado Denver
Workshop on Statistical Genetics and Genomics
February 12, 2009
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
2/34
2
Today
Goal of a genetic association study
Rationale for genome-wide association studies
Design and analysis considerations for GWAs
Application to two clinically similar granulomatous lung diseases
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
3/34
3
Complex Traits - Multifactorial Inheritance
Examples
Some cancers - Schizophrenia
Type 1 diabetes - Cleft lip/palate
Type 2 diabetes - Hypertension
Alzheimer disease - Rheumatoid arthritis
Inflammatory bowel disease - Asthma
Genetic
Variants
Non-genetic
factors
Trait
Trait
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
4/34
4
Genetic Association Studies
Short-term Goal: Identify genetic variants that explain differences inphenotype among individuals in a study population
Qualitative: disease status, presence/absence of congenital defect
Quantitative: blood glucose levels, % body fat
If association found, then further study can follow to
Understand mechanism of action and disease etiology in individuals
Characterize relevance and/or impact in more general population
Long-term goal: to inform process of identifying and delivering better
prevention and treatment strategies
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
5/34
5
DNA Variation
>99.9 % of the sequence is identical between any two chromosomes.- Compare maternal and paternal chromosome 1 in single person
- Compare Y chromosomes between two unrelated males
Even though most of the sequence is identical between two chromosomes,
since the genome sequence is so long (~3 billion base pairs), there are stillmany variations.
Some DNA variations are responsible for biological changes, others have no
known function.
Alleles are the alternative forms of a DNA segment at a givengenetic
location.
Genetic polymorphism: DNA segment with 2 common alleles.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
6/34
6
SNPs DNA sequence variations that occur when a single nucleotide
is altered
Alleles at this SNP are G and T
SNPs are the most common form of variation in the human genome
SNPs catalogued in several databases
Single Nucleotide Polymorphisms: SNPs
A T G A C A G G C
A T G A C A T G C
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
7/34
7
Genotype: pair of alleles (one paternal, one maternal) at a locus
Genotype for this individual is GT
Haplotype: sequence of alleles along a single chromosome
Genotypes for this individual (vertical) : CA and TT
Haplotypes (horizontal): CT and AT
Genotypes and Haplotypes
A T G A C A G G C
A T G A C A T G C
Maternal
Paternal
A T G C C A T G C
A T G A C A T G C
Maternal
Paternal
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
8/34
8
Scope of a Genetic Association Study
Candidate gene
Known functional variants Variants with unknown function in exons, introns, regulatory regions
Linkage candidate region Functional variants, or those with unknown function in candidate genes
More general coverage of region using many markers
Genome-wide
Test for association with hundreds of thousands (millions) of SNPs spread
across the entire genome.
Many design strategies possible for distributing markers
* Sabeti PC et al. (2002).Nature 419: 832-837
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
9/34
9
Genome-Wide Association Studies
Rationale:
Linkage analysis using families takes unbiased look at whole genome,
but is underpowered for the size of genetic effects we expect to see
for many complex genetic traits.
Candidate gene association studies have greater power to identify
smaller genetic effects, but rely on a priori knowledge about disease
etiology.
Genome-wide association studies combine the genomic coverage of
linkage analysis with the power of association to have much better
chance of finding complex trait susceptibility variants.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
10/34
10
Why are They Possible Now?
Genotyping Technology:
Now have ability to type hundreds of thousands (or millions) of SNPs
in one reaction on a SNP chip. The cost can be as low as $200-
$300 per person.
Two primary platforms: Affymetrix and Illumina.
Design and analysis:
Availability of SNP databases, HapMap, and other resources toidentify the SNPs and design SNP chips.
Faster computers to carry out the millions of calculations make
implementation possible.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
11/34
11
Design and Analysis Strategies: Moving Target
A genetic factor is like any other potential risk factor and the samestudy design and analysis principles hold in addition to those
specific to GWAs.
Standard case-control (matched or unmatched), cohort-based
quantitative trait and longitudinal designs are common.
In what follows, I will talk about current ideas and methods, with a
focus on assumptions and quality control.
Focus today is on case-control design, but many of the principles
apply to other designs.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
12/34
12
SNP Chips: Number and Placement of SNPs
A typical SNP chip has at least 317,000 SNPs distributed across the
genome. Newest: ~1 million.
The newest chips can also measure (directly or indirectly) some types of
copy number variation.
We do not directly measure genotypes at all genetic polymorphisms, but
rely on association between the polymorphisms we do assay and those
which we do not assay.
SNP-SNP association, or linkage disequilibrium, is fundamental to our
ability to sample the whole genome with relatively few SNPs.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
13/34
13
Linkage Disequilibrium (LD)
Linkage disequilibrium: the non-random association of alleles at
linked loci.
A measure of the tendency of some alleles to be inherited together on
haplotypes descended from ancestral chromosomes.
If these where the only two haplotypes in the population, then alleles
G and A ( C and T) are in perfect linkage disequilibrium.
If we genotype the first SNP, we know what the alleles are at the
second SNP.
A T G A C A A G C
A T C A C A T G C
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
14/34
14
In general, LD between two SNPs decreases with physical distance
Extent of LD varies greatly depending on region of genome
If LD strong, need fewer SNPs to capture variation in a region
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
15/34
15
www.hapmap.org
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
16/34
16
HapMap
Multi-country effort to identify, catalog common human geneticvariants.
Developed to better understand and catalogue LD patterns across thegenome in several populations.
Genotyped ~4 million SNPs on samples of African, east Asian,European ancestry.
All genotype data in a publicly available data base.
Can download the genotype data
Able to examine LD patterns across genome
Can estimate approximate coverage of a given SNP chip
Can represent 80-90% of common SNPs with
~300,000 tag SNPs for European or Asian samples
~500,000 tag SNPs for African samples
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
17/34
17
Case and control samples may be population-based
Cases and controls may be chosen to increase magnitude of contrast
Case sample may be selected to be enriched for predisposing variant(s)
- Family history- Early age of onset
- Increased severity of disease
Control sample may be selected to be very healthy or super controls
- E.g. for type 2 diabetes, may select individuals whohave normal response to glucose at age 70
- Control selection just as important (and tricky) as forany case-control study.
Case and Control Selection
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
18/34
18
Testing for Genetic Association with Disease
Question of interest: Are the alleles or genotypes at a genetic marker
associated with disease status?
Use usual statistical machinery get estimates of measures of
association and to test for association for each of the SNPs.
One typical approach: Test for association between having 0, 1 or 2
copies of rare allele at a SNP using Cochran-Armitage test for trend.
Pearson, T. A. et al. JAMA 2008;299:1335-1344.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
19/34
19
Interpreting the Statistical Results
Testing for association at each of hundreds of thousands of markers dictates
that traditional statistical significance thresholds (e.g. =.05) notappropriate.
That aside (more in a few minutes), if you identify a SNP that is
significantly associated with disease, there are three possibilities:
There is a causal relationship between SNP and disease
The marker is in linkage disequilibrium with a causal locus
False positive
Many potential sources of systematic errors that might lead to false positive
results.
Genotyping quality control issues particularly important.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
20/34
20
Confounding by Ancestry
(a.k.a. Population Stratification)
Control selection critical as always
Confounding by ancestry: Distortion of the relationship between the genetic
risk factor and the outcome of interest due to ancestry that is related to both
the frequency of the putative genetic risk factor and whether or not subject is
a case or a control.
Genetic Risk Factor Case/Control Status
Ancestry
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
21/34
21
Population Stratification
Distribution of genotypes differs between cases and controls
Might conclude that allele A (or genotype AA) related to
disease
Cases Controls
TT
AT
AA
Genotype
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
22/34
22
Population Stratification
If cases and controls not well-matched ancestrally Unequal distribution of non-disease-related alleles between cases and
controls
Any allele more common in population with increased risk of disease
may appear to be associated with disease
Cases Controls Genotype
Pop 1 Pop 1
Pop 2 Pop 2
TT
AT
AA
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
23/34
23
Population Stratification
Unequal distribution of alleles
may result from
Sample made up of morethan one distinct population
Sample made up of
individuals with differinglevels of admixture
Parra et al. AJHG 63:1839, 1998
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
24/34
24
Using the GWA Data to Avoid Population Stratification
Several options exist to allow controlling for ancestry using markers
across the genome
All based on idea that stratification should exist across the genome,
and that we can use the information on the genome-wide markers to
- estimate ancestry groups, remove extreme outliers,control for other variation
- estimate inflation of test statistic and adjust all test statistics
In each case, assumes constant effect of ancestry, which may or maynot be appropriate
Bottom line is that with genome of data, can do a very good job of
understanding potential for and minimizing impact of population
substructure.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
25/34
25
Potential Solutions to Multiple Testing Issue
Bonferroni correction
Assume all tests performed are independent
Estimate number of independent polymorphisms in genome
Threshold often considered appropriate: 5x10-8.
Other less conservative allocation of experiment-wide over the genome
Perhaps spend more on linkage regions or for SNPs in coding regionsof gene
Permutation
Implementation for case-control study: permute case and control
status, perform all tests record the most significant p-value among
those tests and then re-permute case-control status and test again.
Repeat many times.
P-value for most significant test is the proportion of permutations
that had a best p-value as small or smaller than the one you
observe with the observed data (the data with the right case andcontrol labels .
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
26/34
26
Q-Q Plot
If points deviate (significantly?)
from line of equality indicate that
the two distributions are different.
Some will take point at which the
observed p-values differ from the
expected as the point to declare
statistical significance.
Important points:
Can have deviation from line that
is indicative of violated
assumptions (e.g. existence ofpopulation stratification)
In tails of distribution, have less
information, and so might require
large divergence from expected
Figure from: G. Abecasis
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
27/34
27
Using Multiple Samples
Rationale: Given the very large number of tests performed, use multiple
samples as a way to reduce the expected number of false positive results at
that end of the study.
Split-sample
Approach: Rather than testing entire sample on entire genome, test for
association with some proportion of your samples and then test someproportion of those markers in the rest of your samples*.
Independent samples
Approach: Rather than split your own sample, use another independentsample.
In either case, can dramatically reduce number of false positive results
while maintaining power.
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
28/34
28
Granulomatous Lung Diseases
Chronic Beryllium Disease (CBD)
Exposure to beryllium results in formation of granulomas in lung
among some individuals
Sarcoidosis
Unknown exposure(s) result in granuloma formation and
inflammation in lung, but other organs often involved
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
29/34
29
Hypothesis
Sarcoidosis and CBD share genetic factors important in theirsimilar granulomatous inflammatory pathways
CBD
Sarcoidosis
Disease Severity
Disease Risk
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
30/34
30
GWA : Preliminary data
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
31/34
31
Top Region for CBDp=10-11
R i Sh d b b th CBD d S id i
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
32/34
32
Region Shared by both CBD and Sarcoidosis on
8p23.2 p=10-2 10-4
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
33/34
33
Other Important Topics of Present and Future
Imputation
Careful consideration of non-genetic factors
Investigation of interactions: gene-environment and gene-gene
Sequencing data
-
7/31/2019 Statistical Genetics Workshop Tasha Fingerlin
34/34
34
Acknowledgements
Wake Forest University University of Michigan
Carl D. Langefeld, PhD Michael Boehnke, PhD
Goncalo R. Abecasis, PhD
National Jewish HealthLisa Maier, MPH, MD
Lori Silveira, MS