practical considerations in statistical genetics ashley beecham june 19, 2015

Practical Considerations in Statistical Genetics

Ashley Beecham

June 19, 2015

Considerations

Study Design
Quality control: pre-analysis
► Samples
► Genetic markers
Quality control: post-analysis
► Q-Q plots
Quality control: meta-analysis
Multiple Testing

Study Design

Is your phenotype genetic (i.e., heritable)?
Is it a binary trait? Or quantitative?
Are there age differences? Gender differences?
Are there important environmental factors to consider?

Sample Quality Control

Genotyping efficiency
Gender discrepancies
Relatedness
Population stratification (case-control studies)
Mendelian errors (families)

Sample Quality Control (Gender Checks)

[Figure: gender-check plot; labeled regions: Sample Mix-up or Mislabel, Possible Sample Contamination]

Sample Quality Control (Relatedness)

Calculate the mean Identity by State (IBS) sharing between pairs of samples, then plot the standardized mean and variance using Graphical Relationship Representation (GRR; Abecasis et al., Bioinformatics 2001)

[Figure panels: Unrelated Case-Control; Trios]
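The pairwise IBS summary described above can be sketched as follows. This is a minimal illustration, not the GRR software: it assumes genotypes are coded as 0/1/2 copies of one allele, with None for a missing call, and the function name and toy samples are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance

def ibs_summary(g1, g2):
    """Mean and variance of IBS sharing (0, 1 or 2 alleles) across markers."""
    # Alleles shared IBS at one marker: 2 minus the absolute genotype difference.
    # Markers with a missing call (None) in either sample are skipped.
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)
              if a is not None and b is not None]
    return mean(shared), pvariance(shared)

# Toy data: two identical samples (e.g. a duplicate) and one dissimilar sample.
samples = {
    "dup_a": [0, 1, 2, 1, 0, 2],
    "dup_b": [0, 1, 2, 1, 0, 2],
    "other": [2, 1, 0, 1, 2, 0],
}

pairs = {
    (a, b): ibs_summary(samples[a], samples[b])
    for a, b in combinations(samples, 2)
}
```

Duplicates or identical twins sit at a mean of 2 with zero variance. Other relationship types fall into their own regions of the mean-versus-variance plot, which is what makes unexpected relatedness visible.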

Sample Quality Control (Population Stratification)

Allele frequency and prevalence differences between groups
Genetic drift
Differential selection
Little migration between subpopulations

Sample Quality Control (Population Stratification)

EIGENSTRAT (Price et al., Nature Genetics 2006): Principal Components Analysis (PCA) method

► Applies principal components analysis to genotype data to infer population substructure

Principal components can be used as covariates in a regression model to correct for bias caused by substructure
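A rough sketch of the idea, not the EIGENSTRAT software itself: each SNP is centered and scaled by its allele frequency, and the leading principal component of the sample-by-sample similarity matrix is found by power iteration. The function names and toy genotypes are hypothetical; in practice a linear algebra library or a dedicated tool would be used.

```python
import math

def normalize_genotypes(genotypes):
    """Center and scale each SNP of a samples x SNPs matrix coded 0/1/2."""
    n = len(genotypes)
    columns = []
    for col in zip(*genotypes):
        p = sum(col) / (2 * n)            # allele frequency of the counted allele
        if p in (0.0, 1.0):
            continue                      # monomorphic SNP carries no information
        scale = math.sqrt(p * (1 - p))
        columns.append([(g - 2 * p) / scale for g in col])
    return [list(row) for row in zip(*columns)]

def first_principal_component(genotypes, iterations=200):
    """Unit-length sample coordinates on the leading principal component."""
    x = normalize_genotypes(genotypes)
    n = len(x)
    # Sample-by-sample similarity matrix (X times X transposed).
    k = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):           # power iteration
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

# Toy data: samples 0-2 and samples 3-5 come from two different subpopulations.
genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 2],
    [0, 0, 2, 2, 1],
    [0, 0, 2, 1, 1],
    [0, 1, 2, 2, 2],
]
pc1 = first_principal_component(genotypes)
```

In this toy example the sign of each sample's coordinate on the first component separates the two groups. Coordinates like these are what would be entered as covariates in the regression model.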

Quality Control of Genetic Markers

Genotyping efficiency
Hardy-Weinberg equilibrium
Differential missingness

Marker Quality Control: Hardy-Weinberg Equilibrium

There are two alleles at a given locus, A and a

p = freq(A) and q = freq(a)

p + q = 1

(p + q)(p + q) = p^2 + pq + qp + q^2 = p^2 + 2pq + q^2

AA homozygotes: f(AA) = p^2
Aa heterozygotes: f(Aa) = 2pq
aa homozygotes: f(aa) = q^2


Under a dominant model: frequency of affecteds = p^2 + 2pq

Under a recessive model: frequency of affecteds = q^2

Frequency of carriers = 2pq
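As a worked example of the formulas above, the sketch below computes the expected genotype frequencies for an assumed allele frequency p = 0.9. The function name and the value of p are illustrative only.

```python
def hwe_frequencies(p):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

freqs = hwe_frequencies(0.9)          # p = freq(A) = 0.9, so q = freq(a) = 0.1

# Dominant model as written above: AA and Aa individuals are affected.
dominant_affected = freqs["AA"] + freqs["Aa"]   # p^2 + 2pq, about 0.99

# Recessive model as written above: only aa individuals are affected.
recessive_affected = freqs["aa"]                # q^2, about 0.01
carriers = freqs["Aa"]                          # 2pq, about 0.18
```

With a rare recessive allele, carriers (about 18% here) far outnumber affected individuals (about 1%).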


Simple χ2 test
Deviation may reflect laboratory error
Or it may be telling you something: controls in HWE, cases not
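A minimal sketch of the χ2 test for Hardy-Weinberg equilibrium, assuming the three genotype counts at a single biallelic marker are available. The function name is hypothetical. The p-value uses the 1-degree-of-freedom chi-square tail, which equals erfc(sqrt(x/2)).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic and 1 df p-value for deviation from HWE."""
    n = n_AA + n_Aa + n_aa                      # must be > 0
    p = (2 * n_AA + n_Aa) / (2 * n)             # estimated freq(A)
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))    # upper tail, chi-square 1 df
    return chi2, p_value

in_hwe = hwe_chi_square(25, 50, 25)     # counts match expectation exactly
no_hets = hwe_chi_square(50, 0, 50)     # no heterozygotes at all
```

The second call mimics a marker with a complete absence of heterozygotes, the kind of pattern that points to a genotype calling problem. Running the test separately in cases and controls makes it possible to spot markers where controls are in HWE but cases are not.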


Post Analysis Quality Control: Q-Q plots

What is a Q-Q plot? “Q” stands for quantile

Used to assess the number and magnitude of observed associations between SNPs and the trait of interest, compared to the association statistics expected under the null hypothesis of no association

Deviations from the “identity” line
► May reflect true association
► Sharp deviations are likely due to error
► Also possible due to sample relatedness or population structure

Genomic Inflation Factor (GIF) can be computed to assess deviations
► Ratio of the median observed association statistic to the expected median
► A value of 1 would mean no deviation
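The GIF described above can be sketched as follows, assuming the association tests are 1-degree-of-freedom chi-square tests and that only p-values (strictly between 0 and 1, or equal to 1) are available. The function name and the toy p-values are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()

# Median of a 1 df chi-square distribution (about 0.455): the square of the
# 75th percentile of a standard normal.
EXPECTED_MEDIAN = STD_NORMAL.inv_cdf(0.75) ** 2

def genomic_inflation_factor(p_values):
    """Ratio of the median observed chi-square statistic to the expected median."""
    # Convert each two-sided p-value back to a 1 df chi-square statistic.
    chi2_stats = [STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2_stats) / EXPECTED_MEDIAN

# Evenly spaced p-values mimic the null; squaring them makes them too small.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
inflated_p = [p ** 2 for p in null_p]

gif_null = genomic_inflation_factor(null_p)
gif_inflated = genomic_inflation_factor(inflated_p)
```

The null set gives a GIF of essentially 1, while the set with systematically small p-values gives a value well above 1.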

Post Analysis Quality Control: Q-Q plots

Meta-Analyses

There can be biases in our data not only within sites but across sites!
► Genotyping effects
► Genotype calling effects

Batch Effects: A Tale of the ImmunoChip

ImmunoChip

Fine-Mapping
Replication

207,728

AS (Ankylosing Spondylitis)
CeD (Coeliac Disease)
CD (Crohn’s Disease)
IgA (IgA Deficiency)
MS (Multiple Sclerosis)
PBC (Primary Biliary Cirrhosis)
PS (Psoriasis)
RA (Rheumatoid Arthritis)
SLE (Systemic Lupus Erythematosus)
T1D (Type 1 Diabetes)
UC (Ulcerative Colitis)
AITD (Autoimmune Thyroid Disease)
WTCCC2 (PD, Bipolar, Reading, etc.)

A Focus on Multiple Sclerosis

Stratum     Cases   Controls
AUSNZ         247        944
Belgium       302       1703
Denmark       741        835
Finland       221        486
France        386        354
Germany      2582       5545
Italy         957       1255
Norway        894        674
Sweden       2153       2331
UK           4324       4422
US           1691       5542
TOTAL      14,498     24,091

Genotyping and Genotype Calling

Genotyping was done at 5 sites:
► John P. Hussman Institute for Human Genomics, University of Miami
► Wellcome Trust Sanger Institute
► Local sites in France, Germany, and the United States

All genotype calling was done at the Wellcome Trust Sanger Institute in 3 batches
► Initially used Illuminus and GenoSNP
► Final genotype calls made with Opticall

[Figure: genotype cluster plots of Norm Intensity (A) (x axis) vs. Norm Intensity (B) (y axis) for markers exm-IND10-102817747 and exm-IND22-16602868]

Initial Marker Quality Control

Using Illuminus and GenoSNP, autosomal markers were divided into categories of ‘good’, ‘middle’, and ‘bad’ based on the following criteria:
Good: call rate in both was ≥95% and concordance was ≥99%
► Concordant calls were kept
Bad: call rate was <95% in both Illuminus and GenoSNP
► All such markers were dropped
Middle: marker did not meet Good or Bad criteria
► More detailed analysis was done using 1000 Genomes data
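The three marker categories above amount to a simple decision rule, sketched here. The function name is hypothetical, and call rates and concordance are assumed to be expressed as proportions between 0 and 1.

```python
def classify_marker(call_rate_illuminus, call_rate_genosnp, concordance):
    """Assign a marker to 'good', 'bad' or 'middle' using the stated thresholds."""
    both_high = call_rate_illuminus >= 0.95 and call_rate_genosnp >= 0.95
    both_low = call_rate_illuminus < 0.95 and call_rate_genosnp < 0.95
    if both_high and concordance >= 0.99:
        return "good"      # keep the concordant calls
    if both_low:
        return "bad"       # drop the marker
    return "middle"        # needs more detailed follow-up

examples = [
    classify_marker(0.99, 0.98, 0.995),   # high call rates, concordant
    classify_marker(0.90, 0.80, 0.999),   # low call rate in both callers
    classify_marker(0.99, 0.90, 0.999),   # callers disagree on call rate
    classify_marker(0.99, 0.99, 0.970),   # high call rates, poor concordance
]
```

Note that a marker with high call rates in both callers but concordance below 99% lands in the middle category rather than being dropped outright.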

Initial Test for Population Substructure

Population substructure and problems related to ‘calling batches’ were discovered.

PCA was performed on a test set of Swedish samples.

[Figure: PCA plot; samples labeled by genotyping center: Miami, Sanger]

Investigating the Problem

Scatter plots of the first principal component’s loadings (y axis) vs.:
► –log10(p-values) from a logistic regression model using the genotyping center as phenotype
► –log10(p-values) from a test of SNP missingness between the 2 genotyping centers
► –log10(p-values) for deviation from Hardy-Weinberg equilibrium

We performed the following comparisons to identify the source of the problem:
(A) Define the genotyping center as the phenotype and regress it on each variant.
(B) Test for differential genotype missingness between the 2 centers.
(C) Test for deviation from Hardy-Weinberg equilibrium.

[Figure panels: (A) Genotyping center as phenotype; (B) SNP missingness between centers; (C) HWE]
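Comparison (B) could be run per SNP along these lines. The source does not say which test was used, so the chi-square test on a 2x2 table of missing versus called genotypes by center is an assumption, and the function name and counts are hypothetical.

```python
import math

def differential_missingness(missing_a, total_a, missing_b, total_b):
    """Chi-square test (1 df) comparing missing-call rates between two centers."""
    observed = [
        [missing_a, total_a - missing_a],
        [missing_b, total_b - missing_b],
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [missing_a + missing_b, n - missing_a - missing_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail, chi-square 1 df
    return chi2, p_value

same_rate = differential_missingness(10, 1000, 20, 2000)    # 1% missing in both
diff_rate = differential_missingness(100, 1000, 10, 1000)   # 10% vs 1% missing
```

A SNP like the second example would fall far below a p-value threshold such as 10^-3 and be flagged for removal.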


In the next step, we identified all SNPs with a p-value < 10^-3 in each test, removed them, and then recalculated the principal components.

From the above, it is clear that the genotyping center itself is not the culprit; rather, the problem seems to be associated with differences in HWE, which are a proxy for discordant calls between centers.


Example: rs13306196

For this SNP, the Illuminus call was used for both centers. In Miami, a G allele was assigned and in Sanger an A allele was assigned. This means that the cluster assignment was likely reversed between sites.

Data               A1   A2   Genotype counts (A1A1/A1A2/A2A2)
All                G    A    1969/0/6866
Miami Illuminus    0    G    0/0/1969
Sanger Illuminus   0    A    0/0/6866

[Figure panels: GenoSNP; Illuminus]

Illuminus fails to call the same allele even for some mono-allelic markers


The dichotomy of the first principal component is explained by calling discordances of the Illuminus caller. A bug probably exists in the Illuminus calling algorithm that makes calls difficult when fewer than 3 clusters exist.

Solution: Re-QC using GenoSNP or Opticall (new)

Solution to the Problem

[Figure panels: Clean GenoSNP/Illuminus; Opticall]

Solution to the Problem

Using Opticall, the first principal component no longer splits the data into 2 separate clusters

In later analyses, Opticall was determined to have less variation than GenoSNP in genotype frequencies between genotype calling batches

Final Assessment of Analysis: GIF

[Figure: marker QC flow chart. Counts shown: 207,728; 192,402; 161,311; 24,388; 20,381; 10,710; 28,406; 108,517. Labels shown: production, (Autosomal), Failed QC, Monomorphic, MAF > 5%, MAF 0.5-5%, MAF < 0.5%]

Multiple Testing

In genetics, there have always been two opposing camps:
► Liberals: They don’t worry about it at all. They report nominal P values and aren’t afraid to be wrong.
► Conservatives: They worry about it all the time. They report only fully “corrected” P values.

Common methods:
► Bonferroni
► False Discovery Rate
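The two methods can be sketched as below. The False Discovery Rate is shown with the Benjamini-Hochberg step-up procedure, which is one common choice and an assumption here, since the slide does not name a specific procedure. Function names and the example p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every test ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni(p_values)
fdr = benjamini_hochberg(p_values)
```

On these ten p-values Bonferroni rejects only the smallest one, while the FDR procedure also rejects the second, which illustrates why Bonferroni is the more conservative choice.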
