Lecture 8: Association Mapping
nitro.biosci.arizona.edu/workshops/twipb2013/mod1/mod1-8b-gore.pdf
TRANSCRIPT
Lecture 8:
Association Mapping
Michael Gore lecture notes
Tucson Winter Institute
version 18 Jan 2013
Molecular Diversity: Genotype
Single-Nucleotide Polymorphism (SNP)
Line 1  …TGAACCTAAGTATGTCCG…
Line 2  …TGAACCTAAGTATGTCCG…
Line 3  …TGAACCTAAGTATGTCCG…
Line 4  …TGAACCTAGGTATGTCCG…
Line 5  …TGAACCTAGGTATGTCCG…
Line 6  …TGAACCTAGGTATGTCCG…
A/G SNP allele
1 to 1.4% Nucleotide Diversity (π) in maize
Maize has higher nucleotide diversity
than any other major crop species
Devos 2005 Curr. Opin. Plant Bio.
2-5 times higher than that of grasses
14 times higher than that of humans
Functional Diversity: Phenotype
Spectrum of Provitamin A (carotenoids) Seed Content
Photo from T. Rocheford
Portion of seed and trichome diversity exhibited by Gossypium
Photo from Cotton Incorporated
Heritability – Amount of phenotypic
variation attributable to genetic factors
High Heritability vs. Low Heritability
[Figure: relative GENE and ENV contributions to phenotypic variation in each case]
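As a minimal sketch of this definition, broad-sense heritability can be written as the genetic share of total phenotypic variance, H^2 = V_G / (V_G + V_E). The function name and the variance values below are illustrative assumptions, not numbers from the lecture, and interaction terms are ignored.

```python
def broad_sense_heritability(var_genetic, var_environment):
    """Return H^2 = V_G / (V_G + V_E), ignoring interaction and covariance terms."""
    total = var_genetic + var_environment
    if total <= 0:
        raise ValueError("total phenotypic variance must be positive")
    return var_genetic / total


# Hypothetical variance components (arbitrary units) for two traits.
high = broad_sense_heritability(var_genetic=8.0, var_environment=2.0)
low = broad_sense_heritability(var_genetic=2.0, var_environment=8.0)
print(f"high-heritability trait: H^2 = {high:.2f}")
print(f"low-heritability trait:  H^2 = {low:.2f}")
```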
Genetic Architecture of Polygenic Traits
P = μ + G + E + GG + GE + e
[Figure: Genotype and Environment feeding into Phenotype, with question marks on the unknown connections]
Number, location, and effect size of QTL?
How do we connect genotype
to phenotype?
Chia, Song et al. 2012 Nature Genetics
Kernel Color variation in Hi27 x A272 population
Photo from T. Rocheford
IBD blocks in a region of maize chromosome 10
Linkage Analysis: Family
Forward: Phenotype to Genotype
[Figure: P1 x P2 cross, F1, F2; 1 generation of recombination; QTL interval]
10 Mb interval in maize could contain 200 or more genes
Hundreds of markers needed to capture recent
recombination at the expense of lower resolution
Genome-Wide Association Studies (GWAS): Natural Populations
Reverse: Genotype to Phenotype
Many generations of recombination
High-resolution, but thousands to
millions of markers needed
Linkage Disequilibrium (LD)
LD is the non-random correlation of
alleles at two loci
D, D′ (normalized), and r² are commonly used summary statistics to estimate pairwise LD
Likelihood-based LD estimators are
extensively used for evolutionary and
population genetic studies
D and D′
D describes the difference between coupling and repulsion gamete frequencies
Hedrick, P. W. (1987) Genetics 117, 331–341.
D captures information about allelic association and allele frequencies
D′ is preferred because it is normalized, so that |D′| ranges between 0 and 1
D and D′ may be highly erratic with rare alleles and small sample sizes
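A minimal sketch of how D and |D′| can be computed from the four haplotype counts of two biallelic loci. The function name and argument layout are assumptions made for illustration; the example counts come from the 2 x 2 tables shown later in these notes (columns are the Locus 1 alleles, rows are the Locus 2 alleles).

```python
def d_and_abs_d_prime(n11, n12, n21, n22):
    """Return (D, |D'|) for haplotype counts n_ij, where i is the allele at
    locus 1 and j is the allele at locus 2."""
    n = n11 + n12 + n21 + n22
    p1 = (n11 + n12) / n  # frequency of allele 1 at locus 1
    q1 = (n11 + n21) / n  # frequency of allele 1 at locus 2
    # Coupling haplotypes (11, 22) minus repulsion haplotypes (12, 21).
    d = (n11 * n22 - n12 * n21) / (n * n)
    # Largest |D| that the allele frequencies allow, given the sign of D.
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    abs_d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, abs_d_prime


# Counts read from the slide tables: row 1 = "6 0" / "6 3", row 2 = "0 6" / "0 3".
complete = d_and_abs_d_prime(n11=6, n12=0, n21=0, n22=6)
partial = d_and_abs_d_prime(n11=6, n12=0, n21=3, n22=3)
equilibrium = d_and_abs_d_prime(n11=3, n12=3, n21=3, n22=3)
print("complete disequilibrium:", complete)
print("partial disequilibrium: ", partial)
print("complete equilibrium:   ", equilibrium)
```

Note that the partial case still gives |D′| = 1 because one of the four haplotypes is absent, which is why |D′| alone says little about how well one locus predicts the other.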
r²
r² (0 to 1) is the squared value of Pearson’s correlation coefficient
Hill, W.G., and Robertson, A. (1968). Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231.
r² summarizes both recombinational and mutational history, while D and D′ measure only recombination
r² is preferred in association studies because it is more indicative of how markers might correlate with QTL
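To make the "squared Pearson correlation" reading concrete, the sketch below codes each sampled haplotype as 0 or 1 at two loci and squares the correlation between the two indicator vectors. The function name is an assumption; the 12 haplotypes reproduce the counts of the partial-disequilibrium table in these notes (6, 3, 0, 3).

```python
def r_squared(hap_locus1, hap_locus2):
    """Squared Pearson correlation between 0/1 allele indicators at two loci,
    with one entry per sampled haplotype."""
    n = len(hap_locus1)
    if n == 0 or n != len(hap_locus2):
        raise ValueError("need two non-empty vectors of equal length")
    mean1 = sum(hap_locus1) / n
    mean2 = sum(hap_locus2) / n
    cov = sum((a - mean1) * (b - mean2)
              for a, b in zip(hap_locus1, hap_locus2)) / n
    var1 = sum((a - mean1) ** 2 for a in hap_locus1) / n
    var2 = sum((b - mean2) ** 2 for b in hap_locus2) / n
    if var1 == 0 or var2 == 0:
        raise ValueError("both loci must be polymorphic")
    return cov ** 2 / (var1 * var2)


# Allele "1" is coded 0 and allele "2" is coded 1 at each locus.
# 6 haplotypes carry (1, 1), 3 carry (2, 1), none carry (1, 2), 3 carry (2, 2).
locus1 = [0] * 6 + [1] * 3 + [1] * 3
locus2 = [0] * 6 + [0] * 3 + [1] * 3
r2 = r_squared(locus1, locus2)
print(f"r^2 = {r2:.3f}")
```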
Linkage Disequilibrium
Modified from Rafalski 2002 COPB 5:94–100; Gaut and Long 2003 Plant Cell 15:1502-1506

Complete Disequilibrium
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
6 0
0 6
|D′| = 1, r² = 1
* Complete LD between sites
* Same mutational history
* Low mapping resolution

Complete Equilibrium
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
3 3
3 3
|D′| = 0, r² = 0
* Pattern implies recombination regardless of mutational history
* High mapping resolution
Linkage Disequilibrium
Modified from Rafalski 2002 COPB 5:94–100; Gaut and Long 2003 Plant Cell 15:1502-1506

Partial Disequilibrium
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
6 3
0 3
|D′| = 1, r² = 0.333
* Site 2 may be a relatively new mutation without recombination
* Moderate mapping resolution

Complete Equilibrium
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
4 4
2 2
|D′| = 0, r² = 0
* Pattern implies recombination
r² in Association Mapping
Locus 1: SNP marker
Locus 2: causative SNP, NOT genotyped; explains 10% of total phenotypic variance
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
5 3
0 2
|D′| = 1, r² = 0.25
The SNP marker will explain 25% of the total QTL variation, but only 2.5% of the total phenotypic variation. A large sample size is needed.
r² > 0.80 is recommended for association studies
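The arithmetic on this slide can be sketched as follows. The first function simply multiplies the QTL's share of phenotypic variance by r²; the second uses the commonly quoted approximation that sample size must grow by about 1/r² to keep the same power when a marker is tested instead of the causal site. Both function names are assumptions, and the 1/r² rule is an approximation rather than something stated on the slide.

```python
def marker_variance_explained(qtl_variance_fraction, r2):
    """Fraction of phenotypic variance captured by a marker in LD (r2) with the QTL."""
    return qtl_variance_fraction * r2


def sample_size_inflation(r2):
    """Approximate factor by which sample size must grow when testing the marker
    rather than the causal site (rule of thumb: 1 / r2)."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return 1.0 / r2


qtl_fraction = 0.10  # causative SNP explains 10% of phenotypic variance (slide)
r2_marker = 0.25     # LD between marker and causative SNP (slide)

explained = marker_variance_explained(qtl_fraction, r2_marker)
inflation = sample_size_inflation(r2_marker)
print(f"marker explains {explained:.1%} of phenotypic variance")
print(f"approximate sample size inflation: {inflation:.1f}x")
print(f"at the recommended r2 = 0.80: {sample_size_inflation(0.80):.2f}x")
```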
Visualize extent of LD between pairs of loci: LD vs. physical distance
Remington et al. 2001 PNAS 98:11479-11484
[Figure: LD vs. physical distance; panel label: d3]
Visualize extent of LD between pairs of loci: Matrix heatmap
Flint-Garcia et al. 2003 Annual Review of Plant Biology 54:357-374
[Figure: pairwise LD matrix for sh1 showing r² and Fisher exact test results]
Visualize extent of LD between pairs of loci: HAPLOVIEW software
r² = 0: white
0 < r² < 1: shades of grey
r² = 1: black
Visualize extent of LD between all marker pairs involving strongest hit
Salvi et al. 2007 PNAS 104:11376-11381
[Figure: association results for Days to Pollen Shed and Number of Nodes; r² LD scores for all marker pairs involving the MITE (Miniature Inverted-Repeat Transposable Element)]
What forces shape LD?
Natural and artificial selection
Recombination rate
Genetic drift
Mutation rate
Population structure
Population expansion/bottleneck
Admixture
Mating system
Slatkin 2008 Nature Reviews Genetics 9:477-485
Buckler and Gore 2007 Nature Genetics 39:1056-1057
Non-coding sites
Synonymous sites
LD decay in Major Crops
Number of Markers Needed for GWAS
Arabidopsis – ~200,000 SNPs
Grape – ~2,000,000 SNPs
Diverse maize – ~20,000,000 SNPs
The number of markers needed for GWAS depends on genome size, LD decay in the germplasm, nucleotide diversity, and QTL effect sizes
ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796
Changes in instrument capacity over the past decade, and the
timing of major sequencing projects
Population Structure
Modified from Escalante et al. 2004 TIP 20:388-395
Population differentiation results from changes in allele frequencies caused by genetic drift, selection, local adaptation, etc.
Homozygous diploid populations P1 and P2:
P1: p=1, q=0; P2: p=0, q=1; FST=1
P1: p=0.5, q=0.5; P2: p=0.5, q=0.5; FST=0 (no population differentiation)
P1: p=0.9, q=0.1; P2: p=0.25, q=0.75; FST=0.43
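The slides do not show how FST was calculated. One standard definition, FST = (HT - HS) / HT with expected heterozygosity 2pq and two equally weighted populations, reproduces all three values, so the sketch below assumes that definition. The function name is illustrative.

```python
def fst_two_populations(p1, p2):
    """FST = (HT - HS) / HT for one biallelic locus and two equally weighted
    populations, using expected heterozygosity H = 2pq."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                        # HT: pooled population
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # HS: mean within populations
    if h_total == 0:
        raise ValueError("locus is monomorphic across both populations")
    return (h_total - h_within) / h_total


# Allele frequencies (p in P1, p in P2) from the three slides.
for p1, p2 in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.25)]:
    print(f"p1={p1}, p2={p2}: FST = {fst_two_populations(p1, p2):.2f}")
```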
Maize Population Structure
Fitch-Margoliash tree for 260 maize inbred lines using the log-transformed proportion of shared alleles distance from 94 SSR markers
Liu et al. 2003 Genetics 165:2117-2128
Population Structure in Crops
Garris et al. 2005 Genetics 169:1631-1638
Flint-Garcia et al. 2005 Plant Journal 44:1054-1064
Is GWAS possible for Indica and Japonica with an FST of 0.43?
Correlation between Population
Structure and Traits
Traits may be the cause of population structure
There will be less statistical power to detect associations for these types of traits
Linkage populations are needed to break up population structure
Flint-Garcia et al. 2005 Plant Journal 44:1054-1064
Population structure can produce associations
[Figure: samples from the Andes and the U.S. genotyped G or T at one SNP; panels: Plant Height (axis 80 to 200) and Kernel Hue (axis 0 to 10) by genotype; P-values shown: P=0.04 and P<<0.001]
These non-functional associations can be accounted for by estimating the population structure using random markers.
Slide from Ed Buckler
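A small constructed example may help show how structure alone creates an association. The data below are hypothetical and were chosen so that the marker has no relationship with height inside either population, yet allele frequency and mean height both differ between populations. Centering within known groups stands in here for fitting population membership as a covariate; real analyses estimate structure from random markers rather than from known labels.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def center_within_groups(values, groups):
    """Subtract each group's mean, removing between-population differences."""
    means = {}
    for g in set(groups):
        members = [v for v, gg in zip(values, groups) if gg == g]
        means[g] = sum(members) / len(members)
    return [v - means[g] for v, g in zip(values, groups)]


# Hypothetical data: 0 = G allele, 1 = T allele; height in arbitrary units.
population = ["Andes"] * 4 + ["U.S."] * 4
genotype = [0, 0, 0, 1, 1, 1, 1, 0]
height = [100, 110, 120, 110, 150, 160, 170, 160]

naive = pearson(genotype, height)
adjusted = pearson(center_within_groups(genotype, population),
                   center_within_groups(height, population))
print(f"pooled correlation:             {naive:.3f}")
print(f"after removing population mean: {adjusted:.3f}")
```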
Mixed Linear Model
y = Xβ + Sα + Qv + Zu + e
y is a vector of phenotypic observations;
β is a vector of fixed effects other than SNP or
population group effects;
α is a vector of SNP effects (QTN);
v is a vector of population effects;
u is a vector of polygene background effects;
e is a vector of residual effects;
Q is a matrix from STRUCTURE relating y to v; and
X, S and Z are incidence matrices of 1s and 0s relating
y to β, α and u, respectively.
Yu, Pressoir, et al. 2005. Nature Genetics 38:203-208
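The sketch below illustrates how the fixed effects in this model could be estimated by generalized least squares once the variance components are known. It is not the authors' implementation. It assumes one observation per line (Z is the identity), Var(u) = 2K·var_g, Var(e) = I·var_e, and variance components supplied by the user (in practice they are estimated, for example by REML). All names and data are hypothetical, and the toy phenotype is noise-free so that the estimates can be checked by eye.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.
    a is n x n and b is n x m (lists of lists); returns x as n x m."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gls_fixed_effects(design, y, kinship, var_g, var_e):
    """Estimate fixed effects for columns of `design` ([X | S | Q]) with
    V = 2 * var_g * K + var_e * I."""
    n, p = len(y), len(design[0])
    v = [[2.0 * var_g * kinship[i][j] + (var_e if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Solve V @ W = [design | y] once, giving V^-1 X and V^-1 y together.
    w = solve(v, [list(design[i]) + [y[i]] for i in range(n)])
    vinv_x = [row[:p] for row in w]
    vinv_y = [[row[p]] for row in w]
    xt = [list(col) for col in zip(*design)]
    coef = solve(matmul(xt, vinv_x), matmul(xt, vinv_y))
    return [c[0] for c in coef]


# Hypothetical data for 6 lines: intercept, SNP (0/1), subpopulation membership.
snp = [0, 1, 0, 1, 1, 0]
q = [1, 1, 1, 0, 0, 0]
design = [[1.0, float(s), float(m)] for s, m in zip(snp, q)]
y = [10.0 + 2.0 * s + 3.0 * m for s, m in zip(snp, q)]  # noise-free toy phenotype
kinship = [[1.0 if i == j else (0.5 if q[i] == q[j] else 0.0)
            for j in range(6)] for i in range(6)]

estimates = gls_fixed_effects(design, y, kinship, var_g=1.0, var_e=0.5)
print("intercept, SNP effect, population effect:",
      [round(e, 6) for e in estimates])
```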
Structured Association (Q)
A set of random markers is used to
estimate population structure
Estimates are incorporated into a
statistical analysis to control for genetic
structure
Kinship Coefficient (K)
A kinship coefficient (F) is the probability that two homologous genes are identical by descent
Kinship from genetic markers is an estimate of relative kinship that is based on probabilities of identical by state
Even with pedigrees, marker-based kinship has higher accuracy
Loiselle et al. 1995. Am. J. Bot. 82: 1420–1425
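As a rough illustration of an identity-by-state measure, the sketch below scores each pair of inbred lines by the proportion of markers at which they carry the same allele. This is a plain IBS proportion, not the estimator of Loiselle et al. cited above; the function names and marker calls are hypothetical.

```python
def ibs_similarity(line_a, line_b):
    """Proportion of markers at which two inbred lines carry the same allele."""
    if not line_a or len(line_a) != len(line_b):
        raise ValueError("lines must have the same non-zero number of markers")
    matches = sum(1 for a, b in zip(line_a, line_b) if a == b)
    return matches / len(line_a)


def ibs_matrix(genotypes):
    """Pairwise IBS similarity for a dict of line name -> string of allele calls."""
    names = sorted(genotypes)
    return {a: {b: ibs_similarity(genotypes[a], genotypes[b]) for b in names}
            for a in names}


# Hypothetical calls at six SNPs for three inbred (homozygous) lines.
genotypes = {
    "Line1": "AAGTCA",
    "Line2": "AAGTCG",
    "Line3": "GTGACG",
}
matrix = ibs_matrix(genotypes)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```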
Q (pop structure) + K (relatedness)
Yu, Pressoir, et al. 2005. Nature Genetics 38:203-208
[Figure: Model Comparison]
Statistical Power in GWAS
Myles et al. 2009 Plant Cell 21:2194-2202
Power analysis with 1000 individuals
[Figure: Site Frequency Spectrum of Random SNPs]
Statistical power of detection in GWAS for SNPs explaining 0.1–0.5% variation
[Figure annotation: type I error rate of 5 x 10^-7]
Visscher 2008 Nature Genetics 40:489 - 490
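A back-of-the-envelope version of this power calculation is sketched below. It assumes a single-SNP test with 1 degree of freedom, a normal approximation, and a non-centrality parameter of N·q²/(1 − q²), where q² is the fraction of phenotypic variance explained. These are simplifying assumptions, not the method used in the cited papers, and the function name is illustrative.

```python
from statistics import NormalDist


def gwas_power(n, variance_explained, alpha=5e-7):
    """Approximate power of a two-sided 1-df test for a SNP explaining the given
    fraction of phenotypic variance in a sample of n individuals."""
    if not 0 < variance_explained < 1:
        raise ValueError("variance_explained must be in (0, 1)")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value at the type I error rate
    ncp = n * variance_explained / (1 - variance_explained)
    shift = ncp ** 0.5                          # mean of the test statistic under H1
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


# Sample size and effect sizes quoted on the slide, plus a larger hypothetical N.
for n in (1000, 10000):
    for q2 in (0.001, 0.005):
        print(f"N={n:>5}, variance explained={q2:.1%}: "
              f"power = {gwas_power(n, q2):.4f}")
```

Under these assumptions, power at N = 1000 is very low for SNPs explaining 0.1–0.5% of the variance, and it rises steeply as sample size grows.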
GWAS in 373 indica rice lines with nearly 1 million SNPs: low structure
Huang et al. 2010. Nature Genetics 42:961-967
[Figure label: qsw5]
GWAS in 373 indica rice lines with nearly 1 million SNPs: high structure
Huang et al. 2010. Nature Genetics 42:961-967
Need crosses
Linkage Mapping vs. Association Mapping
Yu and Buckler 2006 Current Opinion in Biotechnology 17:155-160
[Figure: mapping approaches compared by Resolution (bp; ticks 1, 1 x 10^4, 1 x 10^7), Research time (year; ticks 1, 5), and Allele number (ticks 2, 10, 40). Approaches shown: Association mapping, Positional cloning, Recombinant inbred lines, Pedigree, Intermated recombinant inbreds, F2 / BC, Near-isogenic lines]
Linkage Mapping vs. Association Mapping
Linkage mapping:
• Low resolution
• Small reference population & allele numbers
• Balanced allele frequency
• Known population structure
Association mapping:
• High resolution
• Large reference population & allele numbers
• Rare alleles
• Cryptic population structure
Integration of Linkage Analysis
and GWAS for Trait Dissection
• 25 diverse lines were chosen to maximize diversity based on SSRs
• Crossed to B73 for a reference design
• Nested Association Mapping = NAM
• Project joint efforts:
Buckler (Cornell; USDA-ARS)
Holland (NCSU; USDA-ARS),
Kresovich (Cornell), and
McMullen (U of MO; USDA-ARS) groups
25 families for a total of 5,000 RILs
Genotyping B73-rare SNPs to track recent recombination
[Figure: parents P1, P2, ..., P25 each crossed to B73 to produce Pop1, Pop2, ..., Pop25; 5,000 RIL Linkage Map; linkage resolution]
Genotyping-by-sequencing of parents and overlay genotypes onto recombination blocks
[Figure: P1, P2, ..., P25 and B73 genotypes overlaid on Pop1, Pop2, ..., Pop25 in the 5,000 RIL Linkage Map; NAM resolution]
Resequencing of 103 maize lines
Chia, Song et al. 2012 Nature Genetics
HapMap2: 55 million SNPs
~20 million SNPs for NAM
NAM unites power of QTL mapping and
high resolution of association mapping
Gene-level mapping resolution when
using ancient recombination
[Figure: LOD SCORE (0 to 40) vs. MAP POSITION in cMs (75 to 90); series: Initial Scan, i66 controlled; 1 pop vs. 25 pops; annotation: Recent recombination]