Lecture 8: Association Mapping Michael Gore lecture notes Tucson Winter Institute version 18 Jan 2013

Global Migration of Maize

www.cgiar.org

Geographic Adaptation

Molecular Diversity: Genotype

Single-Nucleotide Polymorphism (SNP)

Line 1  …TGAACCTAAGTATGTCCG…
Line 2  …TGAACCTAAGTATGTCCG…
Line 3  …TGAACCTAAGTATGTCCG…
Line 4  …TGAACCTAGGTATGTCCG…
Line 5  …TGAACCTAGGTATGTCCG…
Line 6  …TGAACCTAGGTATGTCCG…

A/G SNP allele

1 to 1.4% Nucleotide Diversity (π) in maize
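As a rough illustration of how a SNP and nucleotide diversity are read off an alignment, the sketch below scans the six aligned fragments from the SNP slide for polymorphic columns and computes π as the average number of pairwise differences per site. The function names are hypothetical, and the value it gives for this short fragment, which was chosen because it contains a SNP, is not an estimate of genome-wide diversity in maize.

```python
from itertools import combinations

# The six aligned fragments from the SNP slide (ellipses dropped).
SEQS = [
    "TGAACCTAAGTATGTCCG",
    "TGAACCTAAGTATGTCCG",
    "TGAACCTAAGTATGTCCG",
    "TGAACCTAGGTATGTCCG",
    "TGAACCTAGGTATGTCCG",
    "TGAACCTAGGTATGTCCG",
]


def find_snps(seqs):
    """Return (0-based position, sorted alleles) for every polymorphic column."""
    snps = []
    for pos, column in enumerate(zip(*seqs)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            snps.append((pos, alleles))
    return snps


def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site (pi)."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs) / length


print(find_snps(SEQS))
print(round(nucleotide_diversity(SEQS), 4))
```

Here 9 of the 15 possible pairs of lines differ at exactly one of 18 sites, which is where the per-site average comes from.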

Maize has higher nucleotide diversity

than any other major crop species

Devos 2005 Curr. Opin. Plant Bio.

2-5 times higher than that of grasses

14 times higher than that of humans

Functional Diversity: Phenotype

Spectrum of Provitamin A (carotenoids) Seed Content

Photo from T. Rocheford

Portion of seed and trichome diversity exhibited by Gossypium

Photo from Cotton Incorporated

Mendelian Traits: Single gene

Oligogenic Traits: Several genes

Seed color

Polygenic Traits: Numerous genes

(Figure axis: Frequency of Phenotype)

Grain Yield

Heritability – Proportion of phenotypic variation attributable to genetic factors

(Figure: GENE and ENV shares of phenotypic variation under High Heritability and Low Heritability)
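A minimal sketch of the definition above, assuming phenotypic variance splits cleanly into a genetic part and an environmental part (no interaction terms). The function name and the variance values are hypothetical and only illustrate the high and low heritability cases.

```python
def broad_sense_heritability(var_genetic, var_environment):
    """Share of phenotypic variance due to genetic factors: Vg / (Vg + Ve)."""
    total = var_genetic + var_environment
    if total <= 0:
        raise ValueError("total phenotypic variance must be positive")
    return var_genetic / total


# Hypothetical variance components, chosen only for illustration.
high = broad_sense_heritability(var_genetic=8.0, var_environment=2.0)
low = broad_sense_heritability(var_genetic=2.0, var_environment=8.0)
print(high, low)
```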

Genetic Architecture of Polygenic Traits

P = μ + G + E + GG + GE + e

(Figure: Genotype and Environment linked to Phenotype, with each connection marked "?")

Number, location, and effect size of QTL?

How do we connect genotype

to phenotype?

Chia, Song et al. 2012 Nature Genetics

Kernel Color variation in Hi27 x A272 population

Photo from T. Rocheford

IBD blocks in a region of maize chromosome 10

Phenotype

Genotype

Reverse Genetics: Association

Forward Genetics: Linkage

Linkage Analysis: Family

10 Mb interval in maize could contain 200 or more genes

(Figure: P1 and P2 crossed to give F1, then F2; 1 generation of recombination; QTL interval)

Forward: Phenotype to Genotype

Hundreds of markers needed to capture recent recombination at the expense of lower resolution

Many generations

of recombination

Genome-Wide Association Studies

(GWAS): Natural Populations

Reverse: Genotype to Phenotype

High resolution, but thousands to millions of markers needed

Myles et al. 2009 Plant Cell 21:2194-2202

Linkage Disequilibrium (LD)

LD is the non-random association of alleles at two loci

D, D′ (normalized), and r² are commonly used summary statistics to estimate pairwise LD

Likelihood-based LD estimators are extensively used for evolutionary and population genetic studies

D and D′

D describes the difference between the products of coupling and repulsion gamete frequencies

Hedrick, P. W. (1987) Genetics 117, 331–341.

D captures information about allelic association and allele frequencies

D′ is preferred because it is normalized, so that |D′| ranges between 0 and 1

D and D′ may be highly erratic with rare alleles and small sample sizes

r²

r² (0 to 1) is the squared value of Pearson’s correlation coefficient

Hill, W.G., and Robertson, A. (1968). Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231.

r² summarizes both recombinational and mutational history, while D and D′ measure only recombination

r² is preferred in association studies because it is more indicative of how markers might correlate with QTL

Linkage Disequilibrium

Modified from Rafalski 2002 COPB 5:94–100; Gaut and Long 2003 Plant Cell 15:1502-1506

Complete Disequilibrium

Haplotype counts (Locus 1 by Locus 2):
6 0
0 6

|D′| = 1
r² = 1

* Complete LD between sites
* Same mutational history
* Low mapping resolution

Complete Equilibrium

Haplotype counts (Locus 1 by Locus 2):
3 3
3 3

|D′| = 0
r² = 0

* Pattern implies recombination regardless of mutational history
* High mapping resolution

Linkage Disequilibrium

Partial Disequilibrium

Haplotype counts (Locus 1 by Locus 2):
6 3
0 3

|D′| = 1
r² = 0.333

* Site 2 may be a relatively new mutation without recombination
* Moderate mapping resolution

Complete Equilibrium

Haplotype counts (Locus 1 by Locus 2):
4 4
2 2

|D′| = 0
r² = 0

* Pattern implies recombination
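The |D′| and r² values quoted for the 2 x 2 haplotype tables above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the standard definitions (D = p_AB − p_A·p_B, D′ = D / D_max, r² = D² / [p_A(1 − p_A)·p_B(1 − p_B)]). The function name and the allele labels A/a and B/b are hypothetical, and which axis of each table is Locus 1 is an assumption that does not affect the results.

```python
from fractions import Fraction


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = Fraction(n_AB, n)
    p_A = Fraction(n_AB + n_Ab, n)  # frequency of allele A at the first locus
    p_B = Fraction(n_AB + n_aB, n)  # frequency of allele B at the second locus
    if p_A in (0, 1) or p_B in (0, 1):
        raise ValueError("both loci must be polymorphic")
    D = p_AB - p_A * p_B
    # D_max depends on the sign of D.
    if D > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    D_prime = D / d_max
    r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return float(D), float(D_prime), float(r2)


# The four tables from the slides, read row by row.
for counts in [(6, 0, 0, 6), (3, 3, 3, 3), (6, 3, 0, 3), (4, 4, 2, 2)]:
    D, D_prime, r2 = ld_stats(*counts)
    print(counts, "|D'| =", abs(D_prime), "r2 =", round(r2, 3))
```

Exact fractions are used internally so that the equilibrium tables give D = 0 exactly rather than a small floating-point residue.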

r² in Association Mapping

SNP Marker (genotyped)

Causative SNP (NOT genotyped): explains 10% of total phenotypic variance

Haplotype counts (Locus 1 by Locus 2):
5 3
0 2

|D′| = 1
r² = 0.25

SNP Marker will explain 25% of the total QTL variation, but only 2.5% of the total phenotypic variation. Need large sample size.

r² > 0.80 is recommended for association studies
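The arithmetic behind the 2.5% figure, and why a low r² calls for more samples, can be sketched as follows. The first function is just the product stated on the slide. The second uses the common rule of thumb that a marker in LD with the causal site needs roughly 1/r² times as many individuals for the same power. That rule and the starting sample size of 1,000 are assumptions added here, not values from the slide.

```python
import math


def marker_variance_explained(r2, qtl_variance):
    """Phenotypic variance captured by a marker in LD (r2) with the causal site."""
    return r2 * qtl_variance


def sample_size_for_equal_power(n_causal, r2):
    """Approximate sample size needed at the marker to match power at the causal site."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return math.ceil(n_causal / r2)


# Slide values: r2 = 0.25, causal SNP explains 10% of phenotypic variance.
print(marker_variance_explained(0.25, 0.10))
# Hypothetical baseline of 1,000 individuals when genotyping the causal SNP directly.
print(sample_size_for_equal_power(1000, 0.25))
```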

Visualize extent of LD between pairs of loci: LD vs. physical distance

Remington et al. 2001 PNAS 98:11479-11484

(Figure label: d3)

Visualize extent of LD between pairs of loci: Matrix heatmap

Flint-Garcia et al. 2003 Annual Review of Plant Biology 54:357-374

(Figure labels: r²; Fisher Exact Test; sh1)

Visualize extent of LD between pairs of loci: HAPLOVIEW software

r² = 0 white

0 < r² < 1 shades of grey

r² = 1 black

Days to Pollen Shed

Number of Nodes

Salvi et al. 2007 PNAS 104:11376-11381

Miniature Inverted-Repeat Transposable Element (MITE)

r² LD scores for all marker pairs involving the MITE

Visualize extent of LD between all marker pairs involving strongest hit

What forces shape LD?

Natural and artificial selection

Recombination rate

Genetic drift

Mutation rate

Population structure

Population expansion/bottleneck

Admixture

Mating system

Slatkin 2008 Nature Reviews Genetics 9:477-485

Buckler and Gore 2007 Nature Genetics 39:1056-1057

Non-coding sites

Synonymous sites

LD decay in Major Crops

Number of Markers Needed for GWAS

Arabidopsis – ~200,000 SNPs

Grape – ~2,000,000 SNPs

Diverse maize – ~20,000,000 SNPs

The number of markers needed for GWAS depends on genome size, LD decay in the germplasm, nucleotide diversity, and QTL effect sizes
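One crude way to see how genome size and LD decay drive marker number is to ask how many evenly spaced markers it takes to place one in every LD block. The sketch below is a back-of-the-envelope lower bound under that assumption only. The function name and both inputs are hypothetical, and the calculation ignores nucleotide diversity and QTL effect size, so it is not expected to reproduce the SNP counts listed above.

```python
def markers_needed(genome_size_bp, ld_decay_bp):
    """Rough lower bound: one marker per stretch over which LD persists."""
    if genome_size_bp <= 0 or ld_decay_bp <= 0:
        raise ValueError("both lengths must be positive")
    # Integer ceiling division keeps the result exact for large genomes.
    return -(-genome_size_bp // ld_decay_bp)


# Hypothetical 1 Gb genome with slow (100 kb) versus rapid (1 kb) LD decay.
print(markers_needed(1_000_000_000, 100_000))
print(markers_needed(1_000_000_000, 1_000))
```

For the same genome, a hundredfold faster LD decay raises the bound a hundredfold, which is the trade-off between mapping resolution and genotyping effort.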

ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

Changes in instrument capacity over the past decade, and the

timing of major sequencing projects

Population Structure

Modified from Escalante et al. 2004 TIP 20:388-395

Homozygous diploid populations:
P1: p=1, q=0
P2: p=0, q=1

FST=1

Population differentiation results from changes in allele frequencies caused by genetic drift, selection, local adaptation, etc.

P1: p=0.5, q=0.5
P2: p=0.5, q=0.5

FST=0

No population differentiation

P1: p=0.9, q=0.1
P2: p=0.25, q=0.75

FST=0.43
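The three FST values in these examples follow from the allele frequencies alone. A minimal sketch, assuming a single biallelic locus, two populations of equal size, and FST = (HT − HS) / HT, with HT the expected heterozygosity of the pooled population and HS the mean within-population expected heterozygosity. The function name is hypothetical.

```python
def fst_two_pops(p1, p2):
    """FST for one biallelic locus in two equally sized populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # HT
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # HS
    if h_total == 0:
        raise ValueError("locus is fixed for the same allele in both populations")
    return (h_total - h_within) / h_total


# p in P1 and P2 for the three slide examples.
for p1, p2 in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.25)]:
    print(p1, p2, round(fst_two_pops(p1, p2), 2))
```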

Fitch-Margoliash tree for 260 maize inbred lines using the log-transformed proportion of

shared alleles distance from 94 SSR markers

Liu et al. 2003 Genetics 165:2117-2128

Maize Population Structure

Population Structure in Crops

Garris et al. 2005 Genetics 169:1631-1638

Flint-Garcia et al. 2005 Plant Journal 44:1054-1064

Is GWAS possible for Indica and Japonica with an FST of 0.43?

Correlation between Population Structure and Traits

Traits may be the cause of population structure

There will be less statistical power to detect associations for these types of traits

Linkage populations are needed to break up population structure

Flint-Garcia et al. 2005 Plant Journal 44:1054-1064

Population structure can produce associations

(Figure: SNP alleles G and T in Andes and U.S. lines; panels show Plant Height (axis 80 to 200) and Kernel Hue (axis 0 to 10) by allele; reported P=0.04 and P<<0.001)

These non-functional associations can be accounted for by estimating the population structure using random markers.

Slide from Ed Buckler
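A toy illustration of how structure alone can create an association. In the sketch below the SNP has no effect on plant height inside either population, yet pooling the two populations gives a clear allele difference because allele frequency and mean height both differ between populations. All counts and heights are hypothetical and are not data from the slide.

```python
from statistics import mean

# Hypothetical toy data: (population, allele, plant_height).
# Height depends only on the population, never on the allele.
LINES = (
    [("Andes", "G", 180.0)] * 8 + [("Andes", "T", 180.0)] * 2
    + [("US", "G", 120.0)] * 2 + [("US", "T", 120.0)] * 8
)


def allele_effect(records):
    """Mean height of G carriers minus mean height of T carriers."""
    g = [height for _, allele, height in records if allele == "G"]
    t = [height for _, allele, height in records if allele == "T"]
    return mean(g) - mean(t)


# Ignoring structure: the SNP looks associated with height.
pooled = allele_effect(LINES)

# Conditioning on population: the apparent effect vanishes.
within = {
    pop: allele_effect([r for r in LINES if r[0] == pop])
    for pop in ("Andes", "US")
}

print(pooled, within)
```

Comparing alleles within each population is the simplest form of the correction; estimating structure from random markers plays the same role when population membership is not known in advance.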

Mixed Linear Model

y = Xβ + Sα + Qv + Zu + e

y is a vector of phenotypic observations;
β is a vector of fixed effects other than SNP or population group effects;
α is a vector of SNP effects (QTN);
v is a vector of population effects;
u is a vector of polygene background effects;
e is a vector of residual effects;
Q is a matrix from STRUCTURE relating y to v; and
X, S and Z are incidence matrices of 1s and 0s relating y to β, α and u, respectively.

Yu, Pressoir, et al. 2005. Nature Genetics 38:203-208
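For reference, the variance assumptions usually attached to this kind of model can be written out explicitly. The block below treats β, α and v as fixed effects and u and e as random. The kinship matrix K, the variance components σ²_g and σ²_e, and the identity residual structure (one observation per line, homogeneous residual variance) are assumptions stated here rather than taken from the slide.

```latex
\begin{aligned}
y &= X\beta + S\alpha + Qv + Zu + e \\
u &\sim N\!\left(0,\; 2K\sigma^2_g\right), \qquad e \sim N\!\left(0,\; I\sigma^2_e\right) \\
\operatorname{Var}(y) &= 2\,ZKZ^{\top}\sigma^2_g + I\sigma^2_e
\end{aligned}
```

Under these assumptions Q absorbs broad population-level differences as a fixed effect, while K shapes the covariance among lines through the random term u.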

Structured Association (Q)

A set of random markers is used to estimate population structure

Estimates are incorporated into a statistical analysis to control for genetic structure

Kinship Coefficient (K)

A kinship coefficient (F) is the probability that two homologous genes are identical by descent

Kinship from genetic markers is an estimate of relative kinship that is based on probabilities of identity by state

Even with pedigrees, marker-based kinship has higher accuracy

Loiselle et al. 1995. Am. J. Bot. 82: 1420–1425
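To make the idea of marker-based similarity concrete, the sketch below computes the simplest identity-by-state measure for inbred lines: the proportion of markers at which two lines carry the same allele. This is a deliberately plain stand-in and not the Loiselle et al. estimator cited above, which additionally adjusts for allele frequencies. The function names, line names and genotype strings are all hypothetical.

```python
def ibs_similarity(geno_a, geno_b):
    """Proportion of markers at which two inbred lines carry the same allele."""
    if len(geno_a) != len(geno_b) or not geno_a:
        raise ValueError("genotypes must be non-empty and of equal length")
    shared = sum(a == b for a, b in zip(geno_a, geno_b))
    return shared / len(geno_a)


def ibs_matrix(genotypes):
    """Pairwise IBS similarity for a dict of {line name: genotype string}."""
    names = list(genotypes)
    return {
        (i, j): ibs_similarity(genotypes[i], genotypes[j])
        for i in names
        for j in names
    }


# Hypothetical inbred lines scored at five SNPs (one allele per marker).
GENOTYPES = {"line1": "AGTCA", "line2": "AGTCG", "line3": "GATCG"}

K = ibs_matrix(GENOTYPES)
print(K[("line1", "line2")], K[("line1", "line3")], K[("line2", "line3")])
```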

Q (pop structure) + K (relatedness)

Yu, Pressoir, et al. 2005. Nature Genetics 38:203-208

Model Comparison

Myles et al. 2009 Plant Cell 21:2194-2202

Power analysis with 1000 individuals

Statistical Power in GWAS

Site Frequency Spectrum

of Random SNPs

Statistical power of detection in GWAS

for SNPs explaining 0.1–0.5% variation

type I error rate of 5 x 10^-7

Visscher 2008 Nature Genetics 40:489 - 490
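The low power for small-effect SNPs can be approximated with a one-degree-of-freedom test. The sketch below assumes a single SNP tested in n unrelated individuals, a normally distributed test statistic, and a non-centrality parameter of n·q²/(1 − q²), where q² is the fraction of phenotypic variance explained by the SNP. These are simplifying assumptions and the function name is hypothetical. It does not reproduce the exact curves in the cited figure.

```python
from statistics import NormalDist


def gwas_power(n, var_explained, alpha=5e-7):
    """Approximate power of a two-sided 1-df test for one SNP."""
    if not 0 <= var_explained < 1:
        raise ValueError("var_explained must be in [0, 1)")
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # critical value at the type I error rate
    delta = (n * var_explained / (1 - var_explained)) ** 0.5  # sqrt of NCP
    # Probability that |Z + delta| exceeds the critical value.
    return z.cdf(delta - crit) + z.cdf(-delta - crit)


# Power at the slide's type I error rate for a SNP explaining 0.5% of variance.
for n in (1000, 5000, 10000):
    print(n, round(gwas_power(n, 0.005), 3))
```

With 1,000 individuals the approximate power for a SNP explaining 0.5% of the variance is well below 1%, and it only becomes substantial once the sample reaches several thousand.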

Huang et al. 2010. Nature Genetics 42:961-967

GWAS in 373 indica rice lines with nearly 1 million SNPs: low structure

(Figure label: qsw5)

GWAS in 373 indica rice lines with nearly 1 million SNPs: high structure

Need crosses

Linkage Mapping vs. Association Mapping

(Figure: mapping approaches compared by Resolution (bp), Research time (year), and Allele number. Approaches shown: Association mapping; Positional cloning; Recombinant inbred lines; Pedigree; Intermated recombinant inbreds; F2 / BC; Near-isogenic lines)

Yu and Buckler 2006 Current Opinion in Biotechnology 17:155-160

Linkage mapping:
• Low resolution
• Small reference population & allele numbers
• Balanced allele frequency
• Known population structure

Association mapping:
• High resolution
• Large reference population & allele numbers
• Rare alleles
• Cryptic population structure

Integration of Linkage Analysis

and GWAS for Trait Dissection

• 25 diverse lines were chosen to maximize diversity based on SSRs

• Crossed to B73 for a reference design

• Nested Association Mapping = NAM

• Project joint efforts:

Buckler (Cornell; USDA-ARS)

Holland (NCSU; USDA-ARS),

Kresovich (Cornell), and

McMullen (U of MO; USDA-ARS) groups

25 families for a total of 5,000 RILs

Genotyping B73-rare SNPs to track recent recombination

(Figure: parents P1, P2 through P25 each crossed to B73, giving families Pop1, Pop2 through Pop25)

5,000 RIL Linkage Map

Linkage resolution

Genotyping-by-sequencing of parents and overlay genotypes onto recombination blocks

NAM resolution

GBS of the NAM parents

Gore, Chia et al. 2009 Science

HapMap1: 3 million SNPs

Resequencing of 103 maize lines

Chia, Song et al. 2012 Nature Genetics

HapMap2: 55 million SNPs

~20 million SNPs for NAM

Yu et al. 2008 Genetics

QTL Detection Power: NAM Population

Size, Heritability, and Number of QTL

NAM unites power of QTL mapping and

high resolution of association mapping

Gene-level mapping resolution when

using ancient recombination

(Figure: LOD SCORE (0 to 40) vs. MAP POSITION in cMs (75 to 90); labels: Initial Scan; i66 controlled; 1 pop; 25 pops; Recent recombination)

Tian et al. 2011 Nature Genetics 43:159–162

GWAS of Leaf Traits in NAM

Software for Genome-Wide Association and Genomic Selection

GAPIT - Genome Association and Prediction Integrated Tool

http://www.maizegenetics.net/gapit

TASSEL - Trait Analysis by aSSociation, Evolution and Linkage

http://www.maizegenetics.net/bioinformatics