statistical genetics workshop tasha fingerlin

7/31/2019 Statistical Genetics Workshop Tasha Fingerlin

1/34

Design and Analysis of Genome-Wide Association Studies

Tasha E. Fingerlin

Departments of Epidemiology and Biostatistics & Informatics

Colorado School of Public Health

University of Colorado Denver

Workshop on Statistical Genetics and Genomics

February 12, 2009


2/34

Today

Goal of a genetic association study

Rationale for genome-wide association studies

Design and analysis considerations for GWAs

Application to two clinically similar granulomatous lung diseases


3/34

Complex Traits - Multifactorial Inheritance

Examples

- Some cancers

- Type 1 diabetes

- Type 2 diabetes

- Alzheimer disease

- Inflammatory bowel disease

- Schizophrenia

- Cleft lip/palate

- Hypertension

- Rheumatoid arthritis

- Asthma

[Diagram: genetic variants and non-genetic factors both contribute to the trait]


4/34

Genetic Association Studies

Short-term goal: Identify genetic variants that explain differences in phenotype among individuals in a study population

Qualitative: disease status, presence/absence of congenital defect

Quantitative: blood glucose levels, % body fat

If association found, then further study can follow to

Understand mechanism of action and disease etiology in individuals

Characterize relevance and/or impact in more general population

Long-term goal: Inform the process of identifying and delivering better prevention and treatment strategies


5/34

DNA Variation

>99.9% of the sequence is identical between any two chromosomes.

- Compare maternal and paternal chromosome 1 in a single person

- Compare Y chromosomes between two unrelated males

Even though most of the sequence is identical between two chromosomes, the genome sequence is so long (~3 billion base pairs) that there are still many variations.

Some DNA variations are responsible for biological changes; others have no known function.

Alleles are the alternative forms of a DNA segment at a given genetic location.

Genetic polymorphism: DNA segment with at least 2 common alleles.


6/34

Single Nucleotide Polymorphisms: SNPs

SNPs are DNA sequence variations that occur when a single nucleotide is altered

A T G A C A G G C

A T G A C A T G C

Alleles at this SNP are G and T

SNPs are the most common form of variation in the human genome

SNPs are catalogued in several databases


7/34

Genotypes and Haplotypes

Genotype: pair of alleles (one paternal, one maternal) at a locus

Maternal: A T G A C A G G C

Paternal: A T G A C A T G C

Genotype for this individual is GT

Haplotype: sequence of alleles along a single chromosome

Maternal: A T G C C A T G C

Paternal: A T G A C A T G C

Genotypes for this individual (vertical): CA and TT

Haplotypes (horizontal): CT and AT
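The vertical/horizontal reading of genotypes and haplotypes can be illustrated with a small sketch. It uses the second pair of sequences from this slide; the function names and the choice of 0-based positions 3 and 6 as the two sites of interest are assumptions made for illustration, not part of the original talk.

```python
# Sequences copied from the slide; positions are 0-based.
maternal = "ATGCCATGC"
paternal = "ATGACATGC"

def genotypes(maternal, paternal, sites):
    """Genotype at each site: the pair of alleles read 'vertically'."""
    return [maternal[i] + paternal[i] for i in sites]

def haplotypes(maternal, paternal, sites):
    """Haplotype per chromosome: the alleles read 'horizontally' along it."""
    return ("".join(maternal[i] for i in sites),
            "".join(paternal[i] for i in sites))

snp_sites = [3, 6]  # assumed positions of the two sites discussed on the slide
print(genotypes(maternal, paternal, snp_sites))   # ['CA', 'TT']
print(haplotypes(maternal, paternal, snp_sites))  # ('CT', 'AT')
```

Note that genotype data alone give only the vertical pairs; recovering the horizontal haplotypes requires knowing (or inferring) which allele sits on which chromosome.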


8/34

Scope of a Genetic Association Study

Candidate gene

- Known functional variants

- Variants with unknown function in exons, introns, regulatory regions

Linkage candidate region

- Functional variants, or those with unknown function in candidate genes

- More general coverage of region using many markers

Genome-wide

- Test for association with hundreds of thousands (millions) of SNPs spread across the entire genome.

- Many design strategies possible for distributing markers

* Sabeti PC et al. (2002). Nature 419: 832-837


9/34

Genome-Wide Association Studies

Rationale:

Linkage analysis using families takes an unbiased look at the whole genome, but is underpowered for the size of genetic effects we expect to see for many complex genetic traits.

Candidate gene association studies have greater power to identify smaller genetic effects, but rely on a priori knowledge about disease etiology.

Genome-wide association studies combine the genomic coverage of linkage analysis with the power of association to have a much better chance of finding complex trait susceptibility variants.


10/34

Why are They Possible Now?

Genotyping Technology:

Now have the ability to type hundreds of thousands (or millions) of SNPs in one reaction on a SNP chip. The cost can be as low as $200-$300 per person.

Two primary platforms: Affymetrix and Illumina.

Design and analysis:

Availability of SNP databases, HapMap, and other resources to identify the SNPs and design SNP chips.

Faster computers to carry out the millions of calculations make implementation possible.


11/34

Design and Analysis Strategies: Moving Target

A genetic factor is like any other potential risk factor, and the same study design and analysis principles hold in addition to those specific to GWAs.

Standard case-control (matched or unmatched), cohort-based quantitative trait, and longitudinal designs are common.

In what follows, I will talk about current ideas and methods, with a focus on assumptions and quality control.

Focus today is on case-control design, but many of the principles apply to other designs.


12/34

SNP Chips: Number and Placement of SNPs

A typical SNP chip has at least 317,000 SNPs distributed across the genome. Newest: ~1 million.

The newest chips can also measure (directly or indirectly) some types of copy number variation.

We do not directly measure genotypes at all genetic polymorphisms, but rely on association between the polymorphisms we do assay and those which we do not assay.

SNP-SNP association, or linkage disequilibrium, is fundamental to our ability to sample the whole genome with relatively few SNPs.


13/34

Linkage Disequilibrium (LD)

Linkage disequilibrium: the non-random association of alleles at linked loci.

A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes.

A T G A C A A G C

A T C A C A T G C

If these were the only two haplotypes in the population, then alleles G and A (C and T) are in perfect linkage disequilibrium.

If we genotype the first SNP, we know what the alleles are at the second SNP.
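To make "perfect linkage disequilibrium" concrete, here is a minimal sketch of the standard pairwise LD measures D, D' and r² computed from a list of two-SNP haplotypes. The function name `ld_stats` and the haplotype counts in the example are assumptions for illustration; the slide itself only states that G-A and C-T are the two haplotypes present.

```python
def ld_stats(haplotypes):
    """Return (D, D', r^2) for a list of two-SNP haplotypes such as 'GA'."""
    n = len(haplotypes)
    a1 = haplotypes[0][0]  # reference allele at SNP 1 (arbitrary choice)
    b1 = haplotypes[0][1]  # reference allele at SNP 2 (arbitrary choice)
    p_a = sum(h[0] == a1 for h in haplotypes) / n
    p_b = sum(h[1] == b1 for h in haplotypes) / n
    p_ab = sum(h == a1 + b1 for h in haplotypes) / n
    d = p_ab - p_a * p_b  # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2

# Hypothetical sample: only the two haplotypes from the slide are present.
sample = ["GA"] * 6 + ["CT"] * 4
d, d_prime, r2 = ld_stats(sample)
print(round(d_prime, 6), round(r2, 6))  # both equal 1 up to rounding: perfect LD
```

When r² is 1, genotyping either SNP determines the other, which is why one SNP can stand in for its neighbors on a chip.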


14/34

In general, LD between two SNPs decreases with physical distance

Extent of LD varies greatly depending on region of genome

If LD strong, need fewer SNPs to capture variation in a region


15/34

www.hapmap.org


16/34

HapMap

Multi-country effort to identify and catalog common human genetic variants.

Developed to better understand and catalog LD patterns across the genome in several populations.

Genotyped ~4 million SNPs on samples of African, East Asian, and European ancestry.

All genotype data are in a publicly available database.

- Can download the genotype data

- Able to examine LD patterns across genome

- Can estimate approximate coverage of a given SNP chip

Can represent 80-90% of common SNPs with

- ~300,000 tag SNPs for European or Asian samples

- ~500,000 tag SNPs for African samples


17/34

Case and Control Selection

Case and control samples may be population-based

Cases and controls may be chosen to increase magnitude of contrast

Case sample may be selected to be enriched for predisposing variant(s)

- Family history

- Early age of onset

- Increased severity of disease

Control sample may be selected to be very healthy ("super controls")

- E.g. for type 2 diabetes, may select individuals who have normal response to glucose at age 70

- Control selection just as important (and tricky) as for any case-control study.


18/34

Testing for Genetic Association with Disease

Question of interest: Are the alleles or genotypes at a genetic marker associated with disease status?

Use the usual statistical machinery to get estimates of measures of association and to test for association for each of the SNPs.

One typical approach: Test for association between disease status and having 0, 1 or 2 copies of the rare allele at a SNP using the Cochran-Armitage test for trend.

Pearson, T. A. et al. JAMA 2008;299:1335-1344.
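The Cochran-Armitage trend test mentioned above can be sketched for a single SNP as follows. This is a minimal illustration using additive weights 0, 1, 2 for the number of copies of the rare allele; the function name `trend_test` and the example counts are hypothetical, and a real analysis would normally use an established genetics package.

```python
import math

def trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases / controls: counts of individuals with 0, 1, 2 copies of the allele.
    Returns (chi-square statistic with 1 df, p-value).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n, n_cases = sum(totals), sum(cases)
    n_controls = n - n_cases
    sum_rt = sum(t * r for t, r in zip(weights, cases))
    sum_nt = sum(t * c for t, c in zip(weights, totals))
    sum_ntt = sum(t * t * c for t, c in zip(weights, totals))
    denom = n_cases * n_controls * (n * sum_ntt - sum_nt ** 2)
    if denom == 0:
        raise ValueError("degenerate table: no variation in genotype or status")
    chi2 = n * (n * sum_rt - n_cases * sum_nt) ** 2 / denom
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts for genotypes with 0, 1, 2 copies of the rare allele.
chi2, p = trend_test(cases=[20, 30, 50], controls=[50, 30, 20])
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In a GWA setting this calculation is repeated for every SNP on the chip, which is what creates the multiple-testing problem discussed later.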


19/34

Interpreting the Statistical Results

Testing for association at each of hundreds of thousands of markers dictates that traditional statistical significance thresholds (e.g. α = 0.05) are not appropriate.

That aside (more in a few minutes), if you identify a SNP that is significantly associated with disease, there are three possibilities:

- There is a causal relationship between SNP and disease

- The marker is in linkage disequilibrium with a causal locus

- False positive

Many potential sources of systematic error might lead to false positive results.

Genotyping quality control issues are particularly important.


20/34

Confounding by Ancestry

(a.k.a. Population Stratification)

Control selection critical, as always

Confounding by ancestry: Distortion of the relationship between the genetic risk factor and the outcome of interest due to ancestry that is related to both the frequency of the putative genetic risk factor and whether or not a subject is a case or a control.

[Diagram: Ancestry is related to both Genetic Risk Factor and Case/Control Status]


21/34

Population Stratification

Distribution of genotypes differs between cases and controls

Might conclude that allele A (or genotype AA) is related to disease

[Figure: genotype distribution (TT, AT, AA) in cases vs. controls]


22/34


If cases and controls are not well-matched ancestrally:

- Unequal distribution of non-disease-related alleles between cases and controls

- Any allele more common in the population with increased risk of disease may appear to be associated with disease

[Figure: genotype distribution (TT, AT, AA) in cases vs. controls, split by population (Pop 1, Pop 2)]
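A small numerical sketch may help show how poor ancestral matching creates a spurious association. All counts below are hypothetical and chosen so that the allele is unrelated to disease within each population, yet appears strongly associated once the two populations are pooled.

```python
def odds_ratio(case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers):
    """Odds ratio for carrying the allele, cases vs. controls."""
    return (case_carriers * ctrl_noncarriers) / (case_noncarriers * ctrl_carriers)

# Hypothetical counts:
# (case carriers, case non-carriers, control carriers, control non-carriers)
pop1 = (64, 16, 16, 4)   # allele common (80% carriers), mostly cases sampled
pop2 = (4, 16, 16, 64)   # allele rare (20% carriers), mostly controls sampled
pooled = tuple(a + b for a, b in zip(pop1, pop2))

print(odds_ratio(*pop1))    # 1.0: no association within population 1
print(odds_ratio(*pop2))    # 1.0: no association within population 2
print(odds_ratio(*pooled))  # 4.515625: spurious association after pooling
```

The pooled odds ratio reflects only the difference in allele frequency and case proportion between the two populations, not any effect of the allele on disease.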


23/34


Unequal distribution of alleles may result from

- Sample made up of more than one distinct population

- Sample made up of individuals with differing levels of admixture

Parra et al. AJHG 63:1839, 1998


24/34

Using the GWA Data to Avoid Population Stratification

Several options exist to allow controlling for ancestry using markers across the genome

All based on the idea that stratification should exist across the genome, and that we can use the information on the genome-wide markers to

- estimate ancestry groups, remove extreme outliers, control for other variation

- estimate inflation of test statistic and adjust all test statistics

Each approach assumes a constant effect of ancestry, which may or may not be appropriate

Bottom line is that with a genome of data, we can do a very good job of understanding the potential for, and minimizing the impact of, population substructure.
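The "estimate inflation of test statistic and adjust all test statistics" option is commonly implemented as genomic control. The sketch below assumes 1-df chi-square statistics (such as trend-test statistics) and estimates the inflation factor λ as the median observed statistic divided by the median of the null distribution; the function name and the toy statistics are illustrative assumptions, not the speaker's own code.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for a list of 1-df chi-square stats."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # common convention: do not inflate when lambda < 1
    return lam, [x / lam for x in chi2_stats]

# Toy example: the median statistic is twice what the null predicts.
lam, adjusted = genomic_control([0.1, 0.9098728, 5.0])
print(round(lam, 3), [round(x, 3) for x in adjusted])
```

Dividing every statistic by the same λ is exactly the "constant effect of ancestry" assumption noted on the slide.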


25/34

Potential Solutions to Multiple Testing Issue

Bonferroni correction

- Assume all tests performed are independent

- Estimate number of independent polymorphisms in genome

- Threshold often considered appropriate: 5 × 10⁻⁸

Other, less conservative allocation of the experiment-wide α over the genome

- Perhaps spend more on linkage regions or on SNPs in coding regions of genes

Permutation

- Implementation for case-control study: permute case and control status, perform all tests, record the most significant p-value among those tests, and then re-permute case-control status and test again. Repeat many times.

- P-value for the most significant test is the proportion of permutations that had a best p-value as small as or smaller than the one you observe with the observed data (the data with the correct case and control labels).
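Both ideas on this slide can be sketched in a few lines. The Bonferroni helper simply divides α by the number of tests (0.05 over an assumed one million independent tests gives 5 × 10⁻⁸). The permutation routine follows the procedure described above, using the largest 1-df trend statistic as a stand-in for the smallest p-value (equivalent when every test has the same degrees of freedom). Function names, the toy data, and the seed are assumptions for illustration.

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def trend_chi2(genotypes, status):
    """1-df trend statistic for one SNP; genotypes are 0/1/2, status is 0/1."""
    n = len(status)
    n_cases = sum(status)
    sum_rt = sum(g for g, s in zip(genotypes, status) if s == 1)
    sum_nt = sum(genotypes)
    sum_ntt = sum(g * g for g in genotypes)
    denom = n_cases * (n - n_cases) * (n * sum_ntt - sum_nt ** 2)
    if denom == 0:
        return 0.0  # monomorphic SNP or single-class sample: no evidence
    return n * (n * sum_rt - n_cases * sum_nt) ** 2 / denom

def max_stat_permutation_p(snps, status, n_perm=1000, seed=0):
    """Permutation p-value for the most significant SNP across all SNPs tested."""
    rng = random.Random(seed)
    observed = max(trend_chi2(g, status) for g in snps)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute case/control status
        best = max(trend_chi2(g, labels) for g in snps)
        if best >= observed:
            hits += 1
    return hits / n_perm

# Toy data: SNP 0 separates cases from controls, SNP 1 is unrelated noise.
status = [1] * 10 + [0] * 10
snps = [[2] * 10 + [0] * 10, [0, 1, 2, 0, 1] * 4]
print(bonferroni_threshold(0.05, 1_000_000))
print(max_stat_permutation_p(snps, status, n_perm=200, seed=1))
```

Because the same permuted labels are applied to every SNP, the correlation between SNPs is preserved, which is why permutation is less conservative than assuming all tests are independent.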


26/34

Q-Q Plot

If points deviate (significantly?) from the line of equality, this indicates that the two distributions are different.

Some will take the point at which the observed p-values differ from the expected as the point to declare statistical significance.

Important points:

- Can have deviation from the line that is indicative of violated assumptions (e.g. existence of population stratification)

- In the tails of the distribution, we have less information, and so might require a large divergence from expected

Figure from: G. Abecasis
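The coordinates of a Q-Q plot of p-values can be computed as in the sketch below. It assumes the common convention of plotting -log10 p-values and of taking i / (n + 1) as the expected i-th smallest p-value under the null; the function name is hypothetical and the plotting step itself is left out.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10 p-value pairs, smallest p-value last."""
    n = len(p_values)
    observed = sorted(p_values, reverse=True)
    # Expected order statistics of n uniform(0, 1) draws: i / (n + 1).
    expected = [(n - rank) / (n + 1) for rank in range(n)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]

# Toy example: these p-values sit exactly on the line of equality.
for expected, observed in qq_points([0.5, 0.25, 0.75]):
    print(round(expected, 3), round(observed, 3))
```

Points rising above the line across most of the range suggest systematic inflation (for example, population stratification), while departure only in the extreme tail is what a few true associations would produce.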


27/34

Using Multiple Samples

Rationale: Given the very large number of tests performed, use multiple samples as a way to reduce the expected number of false positive results at the end of the study.

Split-sample

Approach: Rather than testing the entire sample on the entire genome, test for association with some proportion of your samples and then test some proportion of those markers in the rest of your samples*.

Independent samples

Approach: Rather than split your own sample, use another independent sample.

In either case, can dramatically reduce the number of false positive results while maintaining power.
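The reduction in false positives can be seen with simple arithmetic. The sketch below assumes, purely for illustration, that every marker is null, that the two stages use independent samples, and that the marker count and per-stage thresholds are as shown; none of these numbers come from the talk.

```python
def expected_false_positives(n_markers, alpha_stage1, alpha_stage2):
    """Expected false positives after each stage, assuming all markers are null
    and the two stages are tested in independent samples."""
    carried_forward = n_markers * alpha_stage1
    return carried_forward, carried_forward * alpha_stage2

# Hypothetical design: 500,000 SNPs, stage 1 at 0.01, stage 2 at 0.001.
stage1, final = expected_false_positives(500_000, 0.01, 0.001)
print(round(stage1), round(final))  # about 5000 markers carried forward, about 5 at the end
```

A single-stage test at 0.01 would leave roughly 5,000 false positives; requiring the second, independent test cuts that to a handful.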


28/34

Granulomatous Lung Diseases

Chronic Beryllium Disease (CBD)

Exposure to beryllium results in formation of granulomas in the lung among some individuals

Sarcoidosis

Unknown exposure(s) result in granuloma formation and inflammation in the lung, but other organs are often involved


29/34

Hypothesis

Sarcoidosis and CBD share genetic factors important in their similar granulomatous inflammatory pathways

[Diagram labels: CBD, Sarcoidosis; Disease Severity, Disease Risk]


30/34

GWA: Preliminary Data


31/34

Top Region for CBD: p = 10⁻¹¹


32/34

Region Shared by Both CBD and Sarcoidosis on 8p23.2

p = 10⁻², 10⁻⁴


33/34

Other Important Topics of Present and Future

Imputation

Careful consideration of non-genetic factors

Investigation of interactions: gene-environment and gene-gene

Sequencing data


34/34

Acknowledgements

Wake Forest University

- Carl D. Langefeld, PhD

University of Michigan

- Michael Boehnke, PhD

- Goncalo R. Abecasis, PhD

National Jewish Health

- Lisa Maier, MPH, MD

- Lori Silveira, MS
