Practical Considerations in Statistical Genetics
Ashley Beecham
June 19, 2015
Considerations
Study Design
Quality control: pre-analysis
► Samples
► Genetic markers
Quality control: post-analysis
► Q-Q plots
Quality control: meta-analysis
Multiple Testing
Study Design
Is your phenotype genetic (i.e. heritable)?
Is it a binary trait? Or quantitative?
Are there age differences? Gender differences?
Are there important environmental factors to consider?
Sample Quality Control
Genotyping efficiency
Gender discrepancies
Relatedness
Population stratification (case-control studies)
Mendelian errors (families)
Sample Quality Control (Gender Checks)
[Figure: gender check plot; annotations: Sample Mix-up or Mislabel, Possible Sample Contamination]
Sample Quality Control (Relatedness)
Calculate the identity by state (IBS) mean between pairs and plot the standardized mean and variance using Graphical Representation of Relationships (GRR; Abecasis et al., Bioinformatics 2001)
[Figure panels: Unrelated Case-Control; Trios]
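The pairwise IBS summary described on this slide can be sketched in a few lines. This is a minimal illustration, not the GRR implementation: it assumes genotypes are coded as allele counts (0, 1, 2) with `None` for missing calls, the function names are hypothetical, and it reports the raw mean and variance without the standardization or plotting step.

```python
from itertools import combinations
from statistics import mean, pvariance


def ibs_counts(g1, g2):
    """Alleles shared identical by state (0, 1 or 2) at each marker called in both samples."""
    return [2 - abs(a - b) for a, b in zip(g1, g2) if a is not None and b is not None]


def pairwise_ibs_summary(genotypes):
    """Map each sample pair to (mean IBS, variance of IBS) across shared markers."""
    summary = {}
    for (id1, g1), (id2, g2) in combinations(genotypes.items(), 2):
        shared = ibs_counts(g1, g2)
        if not shared:
            continue  # no marker genotyped in both samples
        summary[(id1, id2)] = (mean(shared), pvariance(shared))
    return summary


# Toy data: A and B are identical at every marker, C differs from both.
samples = {
    "A": [0, 1, 2, 1],
    "B": [0, 1, 2, 1],
    "C": [2, 1, 0, 1],
}
for pair, (ibs_mean, ibs_var) in pairwise_ibs_summary(samples).items():
    print(pair, ibs_mean, ibs_var)
```

Pairs with a mean near 2 and a variance near 0 behave like duplicates or identical twins; unrelated pairs and first-degree relatives fall in different regions of the mean-variance plane.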
Sample Quality Control (Population Stratification)
Allele frequency and prevalence differences between groups
Genetic drift
Differential selection
Little migration between subpopulations
Sample Quality Control (Population Stratification)
EIGENSTRAT (Price et al., Nature Genetics 2006): Principal Components Analysis (PCA) method
► Applies principal components analysis to genotype data to infer population substructure
► Principal components can be used as covariates in a regression model to correct for bias caused by substructure
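The idea of extracting a leading axis of variation from genotype data can be sketched as follows. This is a simplified illustration under stated assumptions, not the EIGENSTRAT implementation: genotypes are coded 0/1/2 with no missing data, markers are only mean-centered (no variance scaling), and the top eigenvector of the sample-by-sample covariance matrix is found by power iteration to keep the example dependency-free. The function name is hypothetical.

```python
import math


def first_principal_component(genotypes, n_iter=200):
    """Return one score per sample along the leading axis of genetic variation."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each marker (column) on its mean allele count.
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample covariance matrix.
    cov = [[sum(a * b for a, b in zip(centered[i], centered[k])) / m
            for k in range(n)] for i in range(n)]
    # Power iteration from a fixed, non-constant start vector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(cov[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variation in the data
        v = [x / norm for x in w]
    return v


# Toy data: the first three samples and the last three have opposite allele patterns.
toy = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
]
print(first_principal_component(toy))
```

On the toy data the two groups receive scores of opposite sign, which is the kind of separation that would then be entered as a covariate in the association model.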
Quality Control of Genetic Markers
Genotyping efficiency
Hardy-Weinberg equilibrium
Differential missingness
Marker Quality Control: Hardy Weinberg Equilibrium
There are two alleles at a given locus, A and a
$p = \text{freq}(A)$ and $q = \text{freq}(a)$
$p + q = 1$
$(p + q)(p + q) = p^2 + pq + qp + q^2 = p^2 + 2pq + q^2$
$p^2$: AA homozygotes
$2pq$: Aa heterozygotes
$q^2$: aa homozygotes
Marker Quality Control: Hardy Weinberg Equilibrium
$p^2 = f(AA)$
$2pq = f(Aa)$
$q^2 = f(aa)$
Marker Quality Control: Hardy Weinberg Equilibrium
Under a dominant model (disease allele A): frequency of affecteds = $p^2 + 2pq$
Under a recessive model (disease allele a): frequency of affecteds = $q^2$
Frequency of carriers = $2pq$
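The expected genotype frequencies and the model-specific quantities above can be computed directly from the allele frequency. The sketch below assumes full penetrance and Hardy-Weinberg proportions; the function names are hypothetical and `p` is always freq(A).

```python
def hwe_genotype_frequencies(p):
    """Expected frequencies of AA, Aa and aa given p = freq(A)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def affected_dominant(p):
    """Dominant model with disease allele A: AA and Aa are affected."""
    f = hwe_genotype_frequencies(p)
    return f["AA"] + f["Aa"]


def affected_recessive(p):
    """Recessive model with disease allele a: only aa is affected."""
    return hwe_genotype_frequencies(p)["aa"]


def carrier_frequency(p):
    """Heterozygous carriers, 2pq."""
    return hwe_genotype_frequencies(p)["Aa"]


print(hwe_genotype_frequencies(0.3))
print(affected_dominant(0.01))    # rare dominant allele A, p = 0.01
print(affected_recessive(0.99))   # rare recessive allele a, q = 0.01
print(carrier_frequency(0.99))
```

For a rare recessive allele with q = 0.01, carriers (about 2%) are far more common than affected individuals (0.01%), which is why most copies of a rare recessive allele sit in unaffected heterozygotes.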
Marker Quality Control: Hardy Weinberg Equilibrium
Simple $\chi^2$ test
Laboratory error
May be telling you something
► Controls in HWE, cases not
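A minimal sketch of the simple χ² test for HWE, assuming a biallelic marker and observed genotype counts. The function name is hypothetical; the p-value uses the one-degree-of-freedom chi-square survival function, which equals `erfc(sqrt(x / 2))`. Exact tests are often preferred for rare alleles and are not shown here.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (chi-square statistic, p-value) for deviation from HWE."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_AA + n_Aa) / (2 * n)  # estimated freq(A)
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 0.0, 1.0  # monomorphic marker: nothing to test
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square with 1 df
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))  # counts exactly at HWE proportions
print(hwe_chi_square(40, 20, 40))  # strong heterozygote deficit
```

Running the function separately on cases and controls is one way to see the pattern named on the slide, where controls are in HWE and cases are not.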
Post Analysis Quality Control: Q-Q plots
What is a Q-Q Plot? “Q” stands for quantile
Used to assess the number and magnitude of observed associations between SNPs and the trait of interest, compared to the association statistics expected under the null hypothesis of no association
Deviations from the “identity” line
► True association
► Sharp deviations are likely due to error
► Also possible due to sample relatedness or population structure
Genomic Inflation Factor (GIF) can be computed to assess deviations
► Ratio of the median observed association statistic to the expected median
► A value of 1 would mean no deviation
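The GIF definition above can be sketched with the standard library alone. The sketch assumes each p-value comes from a one-degree-of-freedom test, so it is converted back to a chi-square statistic through the normal quantile function; the function names are hypothetical, and p-values must lie in (0, 1].

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.455).
EXPECTED_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Chi-square (1 df) statistic corresponding to a two-sided p-value in (0, 1]."""
    z = _STD_NORMAL.inv_cdf(p / 2)
    return z * z


def genomic_inflation_factor(p_values):
    """Median observed statistic divided by the median expected under the null."""
    return median([p_to_chi2(p) for p in p_values]) / EXPECTED_MEDIAN


# Uniformly spaced p-values mimic the null hypothesis: GIF is close to 1.
n = 1001
null_p = [(i - 0.5) / n for i in range(1, n + 1)]
print(genomic_inflation_factor(null_p))
```

Values well above 1 point to inflation across the bulk of the distribution, which fits relatedness or population structure better than a handful of true associations.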
Meta-Analyses
There can be biases in our data not only within sites but across sites!
► Genotyping effects
► Genotype calling effects
Batch Effects: A Tale of the ImmunoChip
ImmunoChip
Fine-Mapping / Replication
207,728
AS (Ankylosing Spondylitis)
CeD (Coeliac Disease)
CD (Crohn’s Disease)
IgA (IgA Deficiency)
MS (Multiple Sclerosis)
PBC (Primary Biliary Cirrhosis)
PS (Psoriasis)
RA (Rheumatoid Arthritis)
SLE (Systemic Lupus Erythematosus)
T1D (Type 1 Diabetes)
UC (Ulcerative Colitis)
AITD (Autoimmune Thyroid Disease)
WTCCC2 (PD, Bipolar, Reading etc.)
A Focus on Multiple Sclerosis
Stratum     Cases   Controls
AUSNZ         247        944
Belgium       302      1,703
Denmark       741        835
Finland       221        486
France        386        354
Germany     2,582      5,545
Italy         957      1,255
Norway        894        674
Sweden      2,153      2,331
UK          4,324      4,422
US          1,691      5,542
TOTAL      14,498     24,091
Genotyping and Genotype Calling
Genotyping was done at 5 sites:
► John P. Hussman Institute for Human Genomics, University of Miami
► Wellcome Trust Sanger Institute
► Local sites in France, Germany, and the United States
All genotype calling was done at the Wellcome Trust Sanger Institute in 3 batches
► Initially used Illuminus and GenoSNP
► Final genotype calls made with Opticall
[Figure: genotype intensity cluster plots for markers exm-IND10-102817747 and exm-IND22-16602868; x axis: Norm Intensity (A), y axis: Norm Intensity (B)]
Initial Marker Quality Control
Using Illuminus and GenoSNP, autosomal markers were divided into categories of ‘good’, ‘middle’, and ‘bad’ based on the following criteria:
Good: call rate in both was ≥95% and concordance was ≥99%
► Concordant calls were kept
Bad: call rate was <95% in both Illuminus and GenoSNP
► Drop all markers
Middle: marker did not meet Good or Bad criteria
► More detailed analysis was done using 1000 Genomes data
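The three-way split can be written as a small decision function. This is a sketch of the criteria as stated on the slide, not the pipeline actually used; the function and parameter names are hypothetical, and rates are expressed as fractions between 0 and 1.

```python
def classify_marker(call_rate_illuminus, call_rate_genosnp, concordance):
    """Assign a marker to 'good', 'bad' or 'middle' from the two callers' results."""
    both_high = call_rate_illuminus >= 0.95 and call_rate_genosnp >= 0.95
    both_low = call_rate_illuminus < 0.95 and call_rate_genosnp < 0.95
    if both_high and concordance >= 0.99:
        return "good"    # keep the concordant calls
    if both_low:
        return "bad"     # drop the marker
    return "middle"      # needs more detailed follow-up


print(classify_marker(0.99, 0.98, 0.995))  # good
print(classify_marker(0.90, 0.80, 0.999))  # bad
print(classify_marker(0.99, 0.90, 0.999))  # middle: only one caller passes
print(classify_marker(0.99, 0.98, 0.97))   # middle: callers disagree too often
```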
Initial Test for Population Substructure
Population substructure problems related to ‘calling batches’ were discovered.
Using a test set of Swedish samples, PCA was performed.
[Figure labels: Miami; Sanger]
Investigating the Problem
We performed the following comparisons to identify the source of the problem:
(A) Define the genotyping center as the phenotype and regress it on each variant.
(B) Test for differences in genotype missingness between the 2 centers.
(C) Test for deviation from Hardy-Weinberg equilibrium.
[Figure panels, each a scatter plot of the first principal component’s loadings (y axis) vs -log10(p-values):
(A) Genotypic center as phenotype: p-values from a logistic regression model using the genotypic center as phenotype
(B) SNP missingness between centers: p-values from a test of SNP missingness between the 2 genotypic centers
(C) HWE: p-values for deviation from Hardy-Weinberg equilibrium]
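Comparison (B) can be illustrated with a 2×2 chi-square test on missing versus called genotypes at one SNP in each center. The slides do not say which test was used, so the choice of a Pearson chi-square here is an assumption, and the function name is hypothetical.

```python
import math


def missingness_chi_square(missing_1, called_1, missing_2, called_2):
    """Pearson chi-square (1 df) comparing missingness at one SNP between two centers."""
    a, b, c, d = missing_1, called_1, missing_2, called_2
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 0.0, 1.0  # an empty row or column: no comparison possible
    chi2 = n * (a * d - b * c) ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square with 1 df
    return chi2, p_value


print(missingness_chi_square(10, 90, 10, 90))  # same missing rate in both centers
print(missingness_chi_square(50, 50, 10, 90))  # much higher missingness in center 1
```

A SNP whose missingness differs sharply between centers is a candidate for a center-specific genotyping or calling artifact.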
Investigating the Problem
In the next step, we identified all the SNPs with a p-value < $10^{-3}$ in each respective test. We removed them and then recalculated the principal components.
From the above, it is clear that the difference between genotyping centers is not itself the culprit; rather, the problem seems to be associated with differences in HWE, which are a proxy for discordant calls between centers.
Investigating the Problem
Example: rs13306196
For this SNP, the Illuminus call was used for both centers. In Miami, a G allele was assigned and in Sanger an A allele was assigned. This means that the cluster assignment was likely reversed between sites.
Data               A1   A2   Genotype counts (A1A1/A1A2/A2A2)
All                G    A    1969/0/6866
Miami Illuminus    0    G    0/0/1969
Sanger Illuminus   0    A    0/0/6866
[Figure labels: GenoSNP; Illuminus]
Illuminus fails to call the same allele even for some mono-allelic markers
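As a worked check of why a marker like rs13306196 stands out in the HWE test, the pooled counts from the table (1969/0/6866) can be put through the standard one-degree-of-freedom χ² calculation. This is an illustrative calculation, not one shown on the slides, and the intermediate values are rounded.

```latex
\begin{aligned}
n &= 1969 + 0 + 6866 = 8835 \\
\hat{p}_G &= \frac{2 \cdot 1969 + 0}{2 \cdot 8835} \approx 0.2229, \qquad \hat{q}_A = 1 - \hat{p}_G \approx 0.7771 \\
E[GG] &= \hat{p}_G^2\, n \approx 438.8, \quad
E[GA] = 2\hat{p}_G\hat{q}_A\, n \approx 3060.4, \quad
E[AA] = \hat{q}_A^2\, n \approx 5335.8 \\
\chi^2 &= \frac{(1969 - 438.8)^2}{438.8} + \frac{(0 - 3060.4)^2}{3060.4} + \frac{(6866 - 5335.8)^2}{5335.8} \approx 8835
\end{aligned}
```

About 3,060 heterozygotes are expected and none are observed, so the pooled data deviate from HWE about as strongly as they can, even though each center's calls are monomorphic and pass on their own.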
Investigating the Problem
The dichotomy of the first principal component is explained by calling discordances of the Illuminus caller. A bug probably exists in the Illuminus calling algorithm that makes calls difficult when fewer than 3 clusters exist.
Solution: Re-QC using GenoSNP or Opticall (new)
Solution to the Problem
[Figure panels: Clean GenoSNP/Illuminus; Opticall]
Solution to the Problem
Using Opticall, the first principal component no longer splits the data into 2 separate clusters.
In later analyses, Opticall was determined to have less variation than GenoSNP in genotype frequencies between genotype calling batches.
Final Assessment of Analysis: GIF
[Figure: marker flowchart. Counts: 207,728; 192,402; 161,311. Exclusions: Failed QC, Monomorphic (20,381; 10,710). MAF bins: MAF > 5%, MAF 0.5-5%, MAF < 0.5% (24,388; 28,406; 108,517). Other labels: production; (Autosomal)]
Multiple Testing
In genetics, there have always been two opposing camps:
Liberals: They don’t worry about it at all. They report nominal P values and aren’t afraid to be wrong.
Conservatives: They worry about it all the time. They report only fully “corrected” P values.
Common methods:
► Bonferroni
► False Discovery Rate
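The two common methods can be compared side by side on the same p-values. The slides name only "False Discovery Rate"; the sketch below assumes the Benjamini-Hochberg step-up procedure, one widely used way to control it. Function names and the example p-values are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is the
    largest rank whose p-value is at most (rank / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(pvals)))          # number of tests passing Bonferroni
print(sum(benjamini_hochberg(pvals)))  # number of tests passing Benjamini-Hochberg
```

Bonferroni controls the chance of any false positive and is the stricter of the two; FDR control tolerates a set proportion of false positives among the reported findings, so it typically rejects at least as many tests.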